
Conversation

@NouamaneTazi
Member

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1
Member

cc @Cyrilvallez for TP!

@ArthurZucker ArthurZucker added the Tensor Parallel and Core: Modeling (Internals of the library; Models.) labels May 1, 2025
Collaborator

@ArthurZucker ArthurZucker left a comment


🚀

@NouamaneTazi NouamaneTazi requested review from ArthurZucker and S1ro1 May 2, 2025 17:52
@NouamaneTazi NouamaneTazi marked this pull request as ready for review May 2, 2025 17:52
@S1ro1 S1ro1 mentioned this pull request May 16, 2025
@NouamaneTazi
Member Author

Thanks @S1ro1 for pointing me to this PR. I am trying to understand it. A couple of first thoughts: I see ReplicateParallel being used here; we initially introduced it in #36132 but later replaced it with the implicit_replication torch API. Any reason to bring that back here? 🤔

not really, feel free to do the fix 🙏🏼
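
For context, a minimal sketch of the implicit_replication approach referenced above, assuming a recent PyTorch with DTensor support; the exact import path has moved between torch releases, so treat this as illustrative rather than the PR's actual code:

```python
# Illustrative sketch only, assuming torch>=2.4 with DTensor support.
# `implicit_replication` lives under torch.distributed.tensor.experimental in
# recent releases (earlier versions exposed it under a private path).
from torch.distributed.tensor.experimental import implicit_replication


def add_bias(sharded_activation, plain_bias):
    # Inside this context, plain torch.Tensors mixed into DTensor ops are
    # treated as replicated, so no explicit ReplicateParallel-style plan is
    # needed just to combine them with sharded tensors.
    with implicit_replication():
        return sharded_activation + plain_bias
```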

@ArthurZucker
Collaborator

I just need to test this with TP

@ArthurZucker ArthurZucker enabled auto-merge (squash) May 20, 2025 14:11
@ArthurZucker ArthurZucker disabled auto-merge May 20, 2025 14:22
@ArthurZucker ArthurZucker merged commit 1c2f36b into main May 20, 2025
21 checks passed
@ArthurZucker ArthurZucker deleted the nouamane/nanotron branch May 20, 2025 14:22
@manueldeprada
Contributor

This breaks tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_causal_lm_can_accept_kwargs
@NouamaneTazi can you have a look?

@manueldeprada
Contributor

It's worse; it breaks all of these:

FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_causal_lm_can_accept_kwargs - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_cpu_offload - AttributeError: 'MambaModel' object has no attribute 'hf_device_map'
FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_disk_offload_bin - AssertionError: ValueError not raised
FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_disk_offload_safetensors - AttributeError: 'MambaModel' object has no attribute 'hf_device_map'
FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_model_parallel_beam_search - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_model_parallelism - AttributeError: 'MambaModel' object has no attribute 'hf_device_map'

Maybe we should revert the change and re-merge after testing on mamba, zamba, jamba, mamba2, falcon_mamba...?

@NouamaneTazi @ArthurZucker
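
For context, a rough sketch of the condition the offload tests above check; the checkpoint id is a placeholder and the snippet assumes a CUDA machine, so it is illustrative rather than the actual test code:

```python
# Hypothetical repro sketch, not the real test. Assumes a CUDA machine; the
# checkpoint id below is a placeholder, not necessarily an existing repo.
from transformers import MambaModel

model = MambaModel.from_pretrained(
    "hf-internal-testing/tiny-random-MambaModel",  # placeholder tiny checkpoint
    device_map="auto",  # lets accelerate place modules and record the mapping
)

# The offload/parallelism tests expect `hf_device_map` to be set whenever a
# device_map is used; the failures above report the attribute missing.
print(getattr(model, "hf_device_map", None))
```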

LysandreJik added a commit that referenced this pull request May 20, 2025
@LysandreJik LysandreJik restored the nouamane/nanotron branch May 20, 2025 20:20
LysandreJik added a commit that referenced this pull request May 20, 2025
* Revert "Protect ParallelInterface"

This reverts commit cb513e3.

* Revert "parallelism goes brrr (#37877)"

This reverts commit 1c2f36b.

* Empty commit
LysandreJik added a commit that referenced this pull request May 20, 2025
* Revert "Protect ParallelInterface"

This reverts commit cb513e3.

* Revert "parallelism goes brrr (#37877)"

This reverts commit 1c2f36b.

* Empty commit
faaany pushed a commit to faaany/transformers that referenced this pull request May 21, 2025
* accept custom device_mesh

* fix device_map

* assert that num_heads % tp_size == 0

* todo.

* ReplicateParallel

* handle tied weights

* handle dtensor in save_pretrained with safe_serialization

* tp test works

* doesnt work

* fix shard_and_distribute_module's rank should be local_rank

* tp=4 is correct

* dp+tp is broken

* todo allreduce with dtensors on another dim is annoying

* workaround to sync dp grads when using dtensors

* loading a checkpoint works

* wandb and compare losses with different tp/dp

* cleaning

* cleaning

* .

* .

* logs

* CP2 DP2 no mask works after commenting attn_mask and is_causal from scaled_dot_product_attention

* DP=2 TP=2 now works even with tied embeddings

* model.parameters() and model.module.parameters() are empty..

* reformat sanity_check_tensor_sync

* set atol=1e-4 for CP to pass

* try populate _parameters from named_modules

* refactors
TP2 DP2 works
CP2 DP2 works

* is_causal=True and pack sequences, no attn mask, and preshuffle dataset

* fix packing

* CP=4 doesn't work

* fix labels and position_ids for CP

* DP CP works with transformers 🥳🥳🥳

* refactor

* add example cp

* fixup

* revert sdpa changes

* example cleared

* add CP, DP to the mesh init

* nit

* clean

* use `ALL_PARALLEL_STYLES`

* style

* FSDP works

* log on 1 rank

* .

* fix?

* FSDP1 also has .parameters() bug

* reported gradnorm when using FSDP1 is wrong, but loss is correct so it's okay

* .

* style and fixup

* move stuff around

* fix tests

* style

* let's make it a check

* warning should be an info

---------

Co-authored-by: Arthur Zucker <[email protected]>
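
For reference, a minimal sketch of the kind of DP/CP/TP device-mesh initialization the commit notes above describe, using PyTorch's init_device_mesh; the dimension sizes are assumed for illustration and this is not the PR's actual mesh-construction code:

```python
# Illustrative sketch only, assuming 8 GPUs split as dp=2, cp=2, tp=2.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")

dp, cp, tp = 2, 2, 2  # assumed world_size == dp * cp * tp == 8
mesh = init_device_mesh(
    "cuda",
    (dp, cp, tp),
    mesh_dim_names=("dp", "cp", "tp"),
)

# Sub-meshes can then be handed to the data-parallel, context-parallel and
# tensor-parallel components respectively.
dp_mesh = mesh["dp"]
cp_mesh = mesh["cp"]
tp_mesh = mesh["tp"]
```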
faaany pushed a commit to faaany/transformers that referenced this pull request May 21, 2025
* Revert "Protect ParallelInterface"

This reverts commit cb513e3.

* Revert "parallelism goes brrr (huggingface#37877)"

This reverts commit 1c2f36b.

* Empty commit
@ArthurZucker
Collaborator

We reverted it, mega sorry everyone. Our tests should have triggered the whole CI, but they did not.

@ydshieh
Collaborator

ydshieh commented May 21, 2025

@ArthurZucker In this particular case, the failing tests are either only triggered on machines with a GPU, or they only fail on such machines while passing on CPU machines.

OK, so the test fetcher has to be checked too.

xvyv99 pushed a commit to xvyv99/transformers that referenced this pull request May 21, 2025

Labels

Core: Modeling (Internals of the library; Models.), Tensor Parallel


10 participants