Tensor parallelism for Mixture of Experts #63

siddharth9820 · 2022-07-06T18:47:21Z

Note: This is in conjunction with the companion PR on the DeepSpeed repo (deepspeedai/DeepSpeed#2074).

This PR adds tensor parallelism for non-experts. Combined with ZeRO-2, this lets us scale to roughly 2x larger base models than ZeRO-2 alone. When tensor parallelism is enabled only for non-experts, each gate receives duplicate tokens. These duplicates must be dropped before they reach the experts; otherwise we run into convergence issues. In megatron/mpu/mappings.py, we provide two new autograd functions, _DropTokens and _AllGatherFromModelParallelRegion, that drop these tokens and gather them again.
In the current implementation, we drop tokens right before the AlltoAll and gather them right after it. These calls are made in the DeepSpeed codebase.
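
To make the drop/gather idea concrete, here is a minimal single-process sketch. It is not the code from this PR: the real functions are autograd functions built on distributed collectives, whereas this version simulates the tensor-parallel ranks with plain Python lists. The helper names `drop_tokens` and `all_gather_tokens`, and the choice of a contiguous equal-sized slice per rank, are assumptions made for illustration.

```python
# Single-process simulation of "drop duplicates, then gather them back".
# All names here are hypothetical; the PR's real implementation lives in
# autograd functions that call distributed collectives.

def drop_tokens(tokens, tp_rank, tp_size):
    """Keep only this rank's contiguous slice of the duplicated token list."""
    if len(tokens) % tp_size != 0:
        raise ValueError("token count must be divisible by tp_size")
    chunk = len(tokens) // tp_size
    return tokens[tp_rank * chunk:(tp_rank + 1) * chunk]


def all_gather_tokens(per_rank_chunks):
    """Concatenate every rank's chunk in rank order, as an all-gather would."""
    gathered = []
    for chunk in per_rank_chunks:
        gathered.extend(chunk)
    return gathered


tp_size = 2
# With tensor parallelism on the non-expert layers only, every
# tensor-parallel rank holds the same tokens at the gate.
tokens = ["t0", "t1", "t2", "t3"]

# Drop right before the AlltoAll: each rank keeps a distinct slice.
chunks = [drop_tokens(tokens, rank, tp_size) for rank in range(tp_size)]

# (The AlltoAll and the experts would run here on the de-duplicated slices.)

# Gather right after the AlltoAll: every rank sees the full sequence again.
restored = all_gather_tokens(chunks)
```

In this toy setup each token reaches the expert stage exactly once, and the gather step restores the full, identically ordered sequence that the tensor-parallel non-expert layers expect.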

Update: This PR now supports tensor parallelism for experts as well. This can be enabled by passing the --enable-expert-tensor-parallelism argument.
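
As a rough illustration of how such a switch might be wired up, the sketch below registers the flag with `argparse`. Only the flag name comes from this PR; the `store_true` action, the implied default of off, and the help text are assumptions, not the repository's actual argument definition.

```python
import argparse

# Hypothetical parser; only the flag name is taken from the PR description.
parser = argparse.ArgumentParser(description="hypothetical MoE training arguments")
parser.add_argument(
    "--enable-expert-tensor-parallelism",
    action="store_true",  # assumption: a boolean switch that defaults to off
    help="also apply tensor parallelism to the expert layers",
)

# argparse exposes the flag with dashes converted to underscores.
args_default = parser.parse_args([])
args_enabled = parser.parse_args(["--enable-expert-tensor-parallelism"])
```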

siddharth9820 · 2022-07-06T19:24:36Z

Comparing loss curves with no tensor parallelism

megatron/mpu/mappings.py

megatron/model/transformer.py

megatron/mpu/layers.py

siddharth9820 · 2022-07-20T01:56:11Z

Here are the loss curves when tensor parallelism is used for both the experts and non-experts.

Note: these were buggy and were fixed subsequently

megatron/mpu/layers.py

megatron/model/transformer.py

megatron/mpu/layers.py


siddharth9820 added 3 commits June 29, 2022 12:26

add tensor parallelism

86dba34

TP for non-experts, drop token before a2a

aa08fc1

dropping tokens after gating

44a1203

siddharth9820 requested review from RezaYazdaniAminabadi, ShadenSmith, arashb, awan-10, cli99, conglongli, duli2012, eltonzheng, jeffra, minjiaz, mrwyattii, samyam, tjruwase, xiaoxiawu-microsoft and yaozhewei as code owners July 6, 2022 18:47

remove commented code

92051b3

siddharth9820 mentioned this pull request Jul 6, 2022

Tensor parallelism for Mixture of Experts deepspeedai/DeepSpeed#2074

Merged

siddharth9820 added 3 commits July 7, 2022 00:04

remove spurious changes

441173d

remove commented code

7ed0611

remove spurious code

2f29eb5

remove blank line

1aa8ce0

awan-10 reviewed Jul 11, 2022

megatron/mpu/mappings.py (outdated, resolved)

siddharth9820 added 4 commits July 13, 2022 01:14

change to schedules

e3b0105

modified checks in mpu/layers

3f0cbe1

better named flag

e79106a

migrate code to deepspeed

30e2234

siddharth9820 added 6 commits July 13, 2022 22:21

remove blank lines

ef3f77c

remove blank lines

96b1cb6

remove blank lines

961bf8d

restore mappings.py

4089188

restore mappings.py

e1a345c

remove unnecessary code

18c0d84

awan-10 reviewed Jul 15, 2022

megatron/model/transformer.py (outdated, resolved)

awan-10 reviewed Jul 15, 2022

megatron/mpu/layers.py (outdated, resolved)

siddharth9820 mentioned this pull request Jul 19, 2022

Fixing the MoE training when using model-parallelism #22

Closed

restructure code and introduce tensor parallelism for experts

1619965

siddharth9820 changed the title from "Tensor parallelism for Non-Experts" to "Tensor parallelism for Mixture of Experts" Jul 20, 2022

siddharth9820 commented Jul 20, 2022

megatron/mpu/layers.py (outdated, resolved)

siddharth9820 added 4 commits July 21, 2022 22:11

correct ep_size

066632b

set ep size correctly

c5e0f40

correctly set ep_size

8398bb0

remove client side code that sets ep_size

3aa05d3

conglongli approved these changes Jul 26, 2022

megatron/model/transformer.py (outdated, resolved)

correct ep_size

e20fc2e

siddharth9820 requested a review from awan-10 July 26, 2022 18:57

awan-10 reviewed Jul 26, 2022

megatron/mpu/layers.py (outdated, resolved)

awan-10 approved these changes Jul 26, 2022


siddharth9820 added 3 commits July 27, 2022 01:25

small fix

92e8839

small fix

ad593a0

Merge branch 'main' of github.com:microsoft/Megatron-DeepSpeed into moe-tensor-parallelism

3e361a4

siddharth9820 merged commit 222d899 into main Aug 1, 2022

siddharth9820 deleted the moe-tensor-parallelism branch August 1, 2022 00:42

hyoo pushed a commit to hyoo/Megatron-DeepSpeed that referenced this pull request Apr 21, 2023

Update README.md (deepspeedai#63)

e3463fd

conglongli mentioned this pull request Jul 20, 2023

why does MoE not support TP and PP #151

Open

saforem2 added a commit to saforem2/Megatron-DeepSpeed that referenced this pull request Nov 15, 2024

Merge pull request deepspeedai#63 from argonne-lcf/hzheng-data-fix

40db8c2

[merge]: into `microsoft-main` $\leftarrow$ from `hzheng-data-fix`
