
[Bugfix] fix initialization #4

Merged: szhengac merged 32 commits into awslabs:main from consolidation on Jan 21, 2023
Conversation

@szhengac (Contributor):

Description

  1. Remove redundant dist group creation.
  2. Remove the group destroy call, which causes an NCCL error.
  3. Add a user-defined init function (see the sketch after this list).
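
For illustration, here is a minimal sketch of what a user-defined init function could look like; the `init_weights` helper and the `apply`-style hookup are hypothetical, not this repo's actual interface.

```python
import torch.nn as nn

def init_weights(module: nn.Module):
    # Hypothetical user-defined init function: re-initialize linear layers
    # with a truncated normal and zero out their biases.
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# A user would hand a function like this to the framework's init hook;
# on a plain nn.Module the equivalent is:
#     model.apply(init_weights)
```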

Checklist

  • PR's title starts with a category (e.g. [Bugfix], [Model], [Tutorial], etc.)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

@comaniac (Contributor) left a comment:

Let @chhzh123 double check.
Also, please rebase onto the latest main branch so that CI works.

@chhzh123 (Contributor):

Is the long initialization time mainly because of the dist group creation and destruction?

@szhengac (Contributor, Author):

> Is the long initialization time mainly because of the dist group creation and destruction?

Mainly because of group creation. Group destruction causes an NCCL error in the GPT example.
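
For context, `torch.distributed.new_group` is a collective call whose cost grows with world size, so creating the same group more than once multiplies startup time. A minimal sketch of the create-once-and-cache pattern implied here, assuming plain `torch.distributed` (not the PR's actual code):

```python
import torch.distributed as dist

_GROUP_CACHE = {}

def get_group(ranks):
    # Create each distinct process group exactly once and reuse it;
    # redundant new_group() calls are the startup cost being removed.
    key = tuple(sorted(ranks))
    if key not in _GROUP_CACHE:
        _GROUP_CACHE[key] = dist.new_group(ranks=list(key))
    return _GROUP_CACHE[key]
```

Leaving the cached groups alive (rather than calling `dist.destroy_process_group` on them) also avoids tearing down NCCL communicators that are still referenced, which matches the error described above.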

@szhengac force-pushed the consolidation branch 3 times, most recently from a6e8142 to 885f763 on January 19, 2023.
@comaniac (Contributor) left a comment:

Mostly LGTM. I'll approve when the comments are addressed.
@chhzh123 PTAL.

@chhzh123 (Contributor):

LGTM. I'll approve when the interface is finalized.

@comaniac (Contributor) left a comment:

LGTM. Just nits.

@@ -51,5 +53,50 @@ def forward(self, x):
assert ("stage2.linear.weight", 2) in tie_weights[0]


def test_analyze_tie_ranks():
@comaniac (Contributor):

I think you could just reuse the above test, as you also call analyze_tie_weights in this test.

@szhengac (Contributor, Author):

Not really; the above test uses 3 stages, but we can only use 2 stages when there are only 2 GPUs.

@comaniac (Contributor):

Huh, I just realized that this test may not pass CI because we don't have DeepSpeed installed in the Docker image... Can you fake a topology? That would also address the 2-GPU issue.
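
One way to fake it is a small stand-in object that answers only the queries the analysis needs. A sketch under that assumption; `FakeTopology` and its methods are hypothetical (the names loosely mirror DeepSpeed's pipeline topology), and a real fake would have to match whatever `analyze_tie_ranks` actually calls:

```python
class FakeTopology:
    # Hypothetical stand-in for a DeepSpeed pipeline topology, assuming a
    # pipe-major rank layout: rank = pipe * num_dp + data.
    def __init__(self, num_pp, num_dp):
        self.num_pp = num_pp
        self.num_dp = num_dp

    def get_dim(self, axis):
        return {"pipe": self.num_pp, "data": self.num_dp}[axis]

    def get_axis_comm_lists(self, axis):
        if axis == "pipe":
            # Groups of ranks that span the pipeline stages.
            return [[p * self.num_dp + d for p in range(self.num_pp)]
                    for d in range(self.num_dp)]
        # Groups of ranks that share a pipeline stage.
        return [[p * self.num_dp + d for d in range(self.num_dp)]
                for p in range(self.num_pp)]

# e.g. a 3-stage pipeline with 2-way data parallelism over 6 fake ranks:
# topo = FakeTopology(num_pp=3, num_dp=2)
```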

Comment on lines +960 to +961
for _, _ in tie_ranks:
pass
@comaniac (Contributor):

What's this for?

@szhengac (Contributor, Author):

Otherwise lint does not pass.

@comaniac (Contributor):

Interesting. What's the lint error?

@szhengac (Contributor, Author):

`tie_ranks` is not used. This loop is a placeholder for you to add additional functionality later.
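
As an aside, if the loop exists only to keep the linter quiet, the intent can be stated more directly. A sketch of two common alternatives (fragments, assuming a pylint-style unused-variable check; the `analyze_tie_ranks(...)` call shown is the hypothetical producer of `tie_ranks`):

```python
# Option 1: explicitly discard the value until it is consumed.
_ = tie_ranks

# Option 2: keep the name but suppress the specific warning at its source.
tie_ranks = analyze_tie_ranks(tie_weights, topology)  # pylint: disable=unused-variable
```

Either form documents that the value is intentionally unused, which a bare `for _, _ in tie_ranks: pass` loop does not.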

@szhengac merged commit e5735e6 into awslabs:main on Jan 21, 2023.
@szhengac deleted the consolidation branch on January 21, 2023.
chhzh123 pushed a commit to chhzh123/slapo that referenced this pull request on Jan 22, 2023.