Fixing MuP #1061
base: main
@@ -26,19 +26,36 @@
   "mup-rp-embedding-mult": 1.0,
```
## Install package

```
cd mup
pip install -e .
```
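Optionally, a quick sanity check that the package is importable (the exact command is just a suggestion, not part of the documented workflow):

```
python -c "import mup; print('mup import OK')"
```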
## Generate base shapes

1. Set `use-mup` to true (a config sketch follows this list)
2. Set `save-base-shapes` to true
3. Run once. gpt-neox will instantiate a base model and a delta model, then save one file per rank named `<base-shapes-file>.<rank>` and exit immediately.
4. Set `save-base-shapes` to false
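As a minimal sketch, the relevant config keys for this run might look like the excerpt below. The key names mirror the flags referenced in the steps above, but treat the exact spelling and the file name as assumptions to check against your own config:

```
# hypothetical excerpt of a gpt-neox config for the base-shapes run;
# key names and values are illustrative
"use-mup": true,
"save-base-shapes": true,
"base-shapes-file": "base-shapes",
```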
## Testing the implementation
The simplest test is to use the coordinate check:
1. Keep `use-mup` true
2. Set `coord-check` to true (a config sketch follows this list)
3. Run once. gpt-neox will output jpg images similar to those below and exit immediately
4. Set `coord-check` to false
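A similarly hedged sketch of the coord-check run (same caveats on key names as above):

```
# hypothetical excerpt of a gpt-neox config for the coordinate check
"use-mup": true,
"coord-check": true,
```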
What you get are statistics of the pre-activations for models that differ only in width. If the implementation is correct, these curves should be approximately horizontal.
![Healthy coordinate check](mup/figures/coord_check_up.0.jpg)
<font size="1"> *Healthy coordinate check*</font>
![Something's wrong](mup/figures/coord_check_sp.0.jpg)
<font size="1"> *Something's wrong*</font>
A second kind of test is to pick any configuration and learning rate (one that doesn't make training diverge) and run a few experiments that fix everything except the width. Since with mup wider is always better, the results should look like the figure below; a sketch of such a sweep follows the caption.
![Healthy training](mup/figures/width_check.png)
<font size="1"> *Healthy training*</font>
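A hedged sketch of such a sweep: the launcher call follows gpt-neox's usual `deepy.py` pattern, but the config file names are made up for illustration.

```
# hypothetical width sweep: configs identical except for hidden-size
for width in 256 512 1024; do
  python ./deepy.py train.py configs/mup_width_${width}.yml
done
```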
## Tune mup hyperparameters and LR
@@ -47,3 +64,10 @@ The values under `mup hp search` were added and correspond to appendix F.4 from
## Transfer

With the best LR and the best mup HPs set, revert the value of `hidden-size` in the scaled-up config and run again.
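For instance, a hedged sketch of the reverted scaled-up config (the width value is purely illustrative):

```
# hypothetical excerpt: restore the target width, keeping the tuned
# mup HPs and LR found on the small proxy model
"hidden-size": 4096,
"use-mup": true,
```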
## Usage under distributed setting

The code is set up so that each rank takes care of its own piece of the model and dumps its own shape file, to be picked up later for training. The easiest way to do the right thing is to generate the base shapes with the same number of devices and the same tensor/pipe parallelism that will be used later on; a config sketch follows the list below. Also consider the following:
- Data parallelism: nothing changes for mup; you can copy-paste a base shape N times, one for each data-parallel rank
- Pipe parallelism: again nothing changes, but different ranks need to deal with different layers, so see the advice above
- **Tensor parallelism: has a huge effect on mup.** Column parallel layers get chopped on the input dimension, changing the actual width of the parameter. Think carefully about what you are doing if you deviate from the advice above
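A hedged sketch of the parallelism-related keys to keep identical between the base-shapes run and training; the key names follow gpt-neox's config conventions, but the values are illustrative:

```
# hypothetical excerpt: use the same values when saving base shapes
# and when training, so per-rank parameter shapes match
"model-parallel-size": 2,   # tensor parallelism
"pipe-parallel-size": 2,    # pipeline parallelism
```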