[Bugfix] fix initialization #4
Let @chhzh123 double check.
Also, please rebase onto the latest main branch so that CI can run.
Is the long initialization time mainly because of the dist group creation and destruction?
Mainly because of group creation. Group destruction causes an NCCL error in the GPT example.
a6e8142 to 885f763
Mostly LGTM. I'll approve when the comments are addressed.
@chhzh123 PTAL.
LGTM. I'll approve when the interface is finalized.
3bde9aa to 11e669a
LGTM. Just nits.
@@ -51,5 +53,50 @@ def forward(self, x):
    assert ("stage2.linear.weight", 2) in tie_weights[0]


def test_analyze_tie_ranks():
I think you could just reuse the above test, as you also call analyze_tie_weights
in this test.
Not really. The above test uses 3 stages, but we can only use 2 stages when there are only 2 GPUs.
Huh, I just realized that this test may not pass CI because we don't have DeepSpeed installed in the Docker image. Can you fake a topology? That would also address the 2-GPU issue.
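As a rough illustration of what faking a topology could look like, here is a minimal sketch that maps ranks to (pipeline stage, data-parallel index) coordinates in plain Python, so a test can reason about 3 stages without DeepSpeed or any GPUs. The class `FakeTopology` and its methods `world_size` and `stage_ranks` are hypothetical names made up for this example; they are not DeepSpeed's API, and the interface the real test needs may differ.

```python
from itertools import product
from typing import Dict, List, Tuple


class FakeTopology:
    """Hypothetical stand-in for a pipeline/data-parallel topology.

    Ranks are laid out row-major over (pipe, data). No process groups
    or devices are created, so this runs on a CPU-only CI machine.
    """

    def __init__(self, num_pp: int, num_dp: int) -> None:
        self.num_pp = num_pp
        self.num_dp = num_dp
        # rank -> (pipeline stage, data-parallel index)
        self._coords: Dict[int, Tuple[int, int]] = dict(
            enumerate(product(range(num_pp), range(num_dp)))
        )

    def world_size(self) -> int:
        return self.num_pp * self.num_dp

    def stage_ranks(self, stage: int) -> List[int]:
        """Return all ranks that belong to the given pipeline stage."""
        return [rank for rank, (pp, _) in self._coords.items() if pp == stage]


# Three pipeline stages, two data-parallel replicas, zero GPUs required.
topo = FakeTopology(num_pp=3, num_dp=2)
stage_ranks = [topo.stage_ranks(stage) for stage in range(topo.num_pp)]
```

Because the layout is computed rather than discovered from the runtime, the same test could keep the 3-stage setup regardless of how many GPUs the CI machine has.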
for _, _ in tie_ranks:
    pass
What's this for?
Otherwise lint does not pass.
Interesting. What's the lint error?
tie_ranks is not used. This tie_ranks is a placeholder for you to add additional functionality.
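For reference, one common alternative to an empty loop for keeping a placeholder variable "used" is to read it once explicitly. The sketch below assumes the linter counts a plain read of the name as a use, which depends on the linter and its configuration. `analyze_tie_ranks_stub` and `run_placeholder_check` are hypothetical names that stand in for the real call and test body, and the returned data is dummy.

```python
def analyze_tie_ranks_stub():
    """Hypothetical stand-in for the real analysis call; returns dummy data."""
    return [[(0, 1)]]


def run_placeholder_check():
    tie_ranks = analyze_tie_ranks_stub()
    # Reading the name once marks it as intentionally kept for later use.
    # Whether this silences the warning depends on the linter in use.
    _ = tie_ranks
    return None


result = run_placeholder_check()
```

A short comment next to the placeholder explaining why it exists would also answer the reviewer's question directly in the code.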
Co-authored-by: Cody Yu <[email protected]>