[autoparallel] handled illegal sharding strategy #1728

Merged

Conversation

@FrankLeeeee (Contributor) commented Oct 18, 2022

What is the problem?

There is currently no systematic way to filter out illegal sharding strategies generated by the StrategyGenerator. Illegal sharding strategies can be produced when:

  1. a tensor cannot be sharded at all, e.g. a tensor of shape [1, 4] cannot be sharded along dimension 0
  2. the logical sharding spec cannot be applied to the physical shape, e.g. given logical shape [4, 4], physical shape [1, 2, 2, 4], and logical sharding spec [S, S], we cannot have a physical sharding spec [S, R, R, S]

These sharding strategies are allowed by default in the current implementation and will therefore lead to wrong results.
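For intuition, here is a minimal sketch of the kind of divisibility check that makes the first case illegal. The names (`check_sharding`, `dim_partition`, `mesh_shape`) are hypothetical and simplified, not the actual ColossalAI API:

```python
from typing import Dict, List


class ShardingSpecException(Exception):
    """Raised when a sharding spec cannot be applied to a tensor shape."""


def check_sharding(shape: List[int], dim_partition: Dict[int, List[int]],
                   mesh_shape: List[int]) -> None:
    """Reject specs that split a tensor dimension into more pieces than it can hold.

    dim_partition maps a tensor dimension to the device-mesh axes it is sharded over,
    e.g. {0: [0]} means "shard dim 0 across mesh axis 0".
    """
    for tensor_dim, mesh_axes in dim_partition.items():
        num_shards = 1
        for axis in mesh_axes:
            num_shards *= mesh_shape[axis]
        if shape[tensor_dim] % num_shards != 0:
            raise ShardingSpecException(
                f"dim {tensor_dim} of size {shape[tensor_dim]} cannot be split "
                f"into {num_shards} shards")


# A tensor of shape [1, 4] cannot be sharded along dimension 0 on a 2x2 mesh.
try:
    check_sharding([1, 4], {0: [0]}, mesh_shape=[2, 2])
except ShardingSpecException as e:
    print(e)  # dim 0 of size 1 cannot be split into 2 shards
```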

What does this PR do?

This PR designs and implements a systematic, hierarchical way of handling illegal sharding strategies. Illegal strategies are captured in three layers:

  1. ShardingSpec: when a ShardingSpec is instantiated, it automatically checks whether the spec is valid. If not, it throws a ShardingSpecException.
  2. StrategyGenerator: in a StrategyGenerator, each strategy is implemented by one method. This method must be decorated with ignore_sharding_exception, which catches the ShardingSpecException and returns None upon exception. These None values are removed automatically later.
  3. NodeHandler: during logical-to-physical sharding spec conversion in the post_process method, the developer needs to catch the ShardingSpecException manually. The NodeHandler will also check the validity of each sharding strategy in the register_strategy method (not implemented in this PR because it would cause many tests to fail; I will put up a separate PR to deal with this).

In this way, we can ensure that every node only keeps valid sharding strategies.
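A minimal sketch of the decorator pattern described in layer 2, assuming the exception and decorator names from this PR but with simplified signatures and a hypothetical `split_input_and_weight` strategy method:

```python
import functools
from typing import Callable, Optional


class ShardingSpecException(Exception):
    """Raised when an illegal sharding spec is constructed."""


def ignore_sharding_exception(func: Callable) -> Callable:
    """Turn a strategy method that hits an illegal sharding spec into a None result."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> Optional[object]:
        try:
            return func(*args, **kwargs)
        except ShardingSpecException:
            # The strategy is illegal for this tensor/mesh combination; drop it.
            return None

    return wrapper


class LinearStrategyGenerator:
    @ignore_sharding_exception
    def split_input_and_weight(self):
        # Building the ShardingSpec here may raise ShardingSpecException,
        # in which case this strategy becomes None and is filtered out later.
        raise ShardingSpecException("dim 0 of size 1 cannot be sharded")


assert LinearStrategyGenerator().split_input_and_weight() is None
```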

A summary of the code change

  1. ShardingSpecException is defined in sharding_spec.py so that every exception we throw has clearer semantics.

  2. The API of the StrategyGenerator is refactored by introducing an additional collate_strategies method. This method is introduced so that we do not have to manually remove illegal sharding strategies and update costs in every generator; that work is now taken over by the generate method, removing redundant code. Therefore, in every child StrategyGenerator, the former generate method becomes collate_strategies, and generate now exists only in the parent class (see the sketch after this list).

(Screenshot: Screen Shot 2022-10-18 at 15 14 00)

  3. Every strategy method in the child StrategyGenerator implementations is decorated with ignore_sharding_exception. Two changes are included for this decorator:

    • The decorator is renamed from exception_handler to ignore_sharding_exception because it no longer catches general exceptions, only the sharding exception.
    • The decorator now returns None explicitly if an exception occurs.
  4. The validate method is now called inside the __init__ method of the StrategyGenerator class. validate was defined previously but never called anywhere, so it is now invoked to do first-hand checking.

  5. As an example, I added stricter test code in test_linear_node_handler.py to ensure all generated sharding strategies are valid.
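The sketch below illustrates the refactored division of labor described in items 2 and 4: the parent class's generate method filters out the None values produced by ignore_sharding_exception and updates costs, while __init__ invokes validate. The class and method bodies are simplified placeholders, not the actual ColossalAI implementation:

```python
from typing import List, Optional


class StrategyGenerator:
    """Simplified base class: generate() owns filtering and cost updates."""

    def __init__(self, operation_data, device_mesh):
        self.op_data = operation_data
        self.device_mesh = device_mesh
        # validate() was previously defined but never invoked; it now runs here.
        self.validate()

    def validate(self) -> None:
        """First-hand checks on the operation data; subclasses may override."""

    def collate_strategies(self) -> List[Optional[object]]:
        """Child classes produce candidate strategies; illegal ones come back as None."""
        raise NotImplementedError

    def update_communication_cost(self, strategy):
        return strategy

    def update_compute_cost(self, strategy):
        return strategy

    def update_memory_cost(self, strategy):
        return strategy

    def generate(self) -> List[object]:
        # The None entries returned by ignore_sharding_exception are dropped here,
        # so child classes no longer repeat this filtering and cost bookkeeping.
        strategies = [s for s in self.collate_strategies() if s is not None]
        for strategy in strategies:
            self.update_communication_cost(strategy)
            self.update_compute_cost(strategy)
            self.update_memory_cost(strategy)
        return strategies
```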

@FrankLeeeee force-pushed the hotfix/complete-node-handler-testing branch from 841ebc6 to 48b52e9 on October 19, 2022 03:44
@FrankLeeeee merged commit eee8490 into hpcaitech:main on Oct 19, 2022
@FrankLeeeee deleted the hotfix/complete-node-handler-testing branch on January 26, 2023 07:44