
Conversation

@NouamaneTazi
Member

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1
Member

cc @Cyrilvallez for TP!

@ArthurZucker ArthurZucker added the Tensor Parallel and Core: Modeling (Internals of the library; Models.) labels May 1, 2025
Collaborator

@ArthurZucker ArthurZucker left a comment


🚀

@NouamaneTazi NouamaneTazi requested review from ArthurZucker and S1ro1 May 2, 2025 17:52
@NouamaneTazi NouamaneTazi marked this pull request as ready for review May 2, 2025 17:52
@S1ro1 S1ro1 mentioned this pull request May 16, 2025
@NouamaneTazi
Member Author

Thanks @S1ro1 for pointing me to this PR. I am trying to understand it. A couple of first thoughts: I see ReplicateParallel being used here; we initially introduced it in #36132 but later replaced it with the implicit_replication torch API. Any reason to bring that back here? 🤔

not really, feel free to do the fix 🙏🏼
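
For context, a minimal sketch of the implicit_replication approach referenced above, assuming a recent PyTorch with DTensor support; the exact import path has moved between torch releases, so treat this as illustrative rather than the PR's actual code:

```python
# Illustrative sketch only, assuming torch>=2.4 with DTensor support.
# `implicit_replication` lives under torch.distributed.tensor.experimental in
# recent releases (earlier versions exposed it under a private path).
from torch.distributed.tensor.experimental import implicit_replication


def add_bias(sharded_activation, plain_bias):
    # Inside this context, plain torch.Tensors mixed into DTensor ops are
    # treated as replicated, so no explicit ReplicateParallel-style plan is
    # needed just to combine them with sharded tensors.
    with implicit_replication():
        return sharded_activation + plain_bias
```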

@ArthurZucker
Collaborator

I just need to test this with TP

@ArthurZucker ArthurZucker enabled auto-merge (squash) May 20, 2025 14:11
@ArthurZucker ArthurZucker disabled auto-merge May 20, 2025 14:22
@ArthurZucker ArthurZucker merged commit 1c2f36b into main May 20, 2025
21 checks passed
@ArthurZucker ArthurZucker deleted the nouamane/nanotron branch May 20, 2025 14:22
@manueldeprada
Contributor

This breaks tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_causal_lm_can_accept_kwargs
@NouamaneTazi can you have a look?

@manueldeprada
Contributor

It's worse; it breaks all of these:

FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_causal_lm_can_accept_kwargs - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_cpu_offload - AttributeError: 'MambaModel' object has no attribute 'hf_device_map'
FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_disk_offload_bin - AssertionError: ValueError not raised
FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_disk_offload_safetensors - AttributeError: 'MambaModel' object has no attribute 'hf_device_map'
FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_model_parallel_beam_search - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
FAILED tests/models/mamba/test_modeling_mamba.py::MambaModelTest::test_model_parallelism - AttributeError: 'MambaModel' object has no attribute 'hf_device_map'

Maybe we should revert the change and re-merge after testing on mamba, zamba, jamba, mamba2, falcon_mamba...?

@NouamaneTazi @ArthurZucker
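
For context, a rough sketch of the condition the offload tests above check; the checkpoint id is a placeholder and the snippet assumes a CUDA machine, so it is illustrative rather than the actual test code:

```python
# Hypothetical repro sketch, not the real test. Assumes a CUDA machine; the
# checkpoint id below is a placeholder, not necessarily an existing repo.
from transformers import MambaModel

model = MambaModel.from_pretrained(
    "hf-internal-testing/tiny-random-MambaModel",  # placeholder tiny checkpoint
    device_map="auto",  # lets accelerate place modules and record the mapping
)

# The offload/parallelism tests expect `hf_device_map` to be set whenever a
# device_map is used; the failures above report the attribute missing.
print(getattr(model, "hf_device_map", None))
```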

LysandreJik added a commit that referenced this pull request May 20, 2025
@LysandreJik LysandreJik restored the nouamane/nanotron branch May 20, 2025 20:20
LysandreJik added a commit that referenced this pull request May 20, 2025
* Revert "Protect ParallelInterface"

This reverts commit cb513e3.

* Revert "parallelism goes brrr (#37877)"

This reverts commit 1c2f36b.

* Empty commit
LysandreJik added a commit that referenced this pull request May 20, 2025
* Revert "Protect ParallelInterface"

This reverts commit cb513e3.

* Revert "parallelism goes brrr (#37877)"

This reverts commit 1c2f36b.

* Empty commit
faaany pushed a commit to faaany/transformers that referenced this pull request May 21, 2025
* accept custom device_mesh

* fix device_map

* assert that num_heads % tp_size == 0

* todo.

* ReplicateParallel

* handle tied weights

* handle dtensor in save_pretrained with safe_serialization

* tp test works

* doesnt work

* fix shard_and_distribute_module's rank should be local_rank

* tp=4 is correct

* dp+tp is broken

* todo allreduce with dtensors on another dim is annoying

* workaround to sync dp grads when using dtensors

* loading a checkpoint works

* wandb and compare losses with different tp/dp

* cleaning

* cleaning

* .

* .

* logs

* CP2 DP2 no mask works after commenting attn_mask and is_causal from scaled_dot_product_attention

* DP=2 TP=2 now works even with tied embeddings

* model.parameters() and model.module.parameters() are empty..

* reformat sanity_check_tensor_sync

* set atol=1e-4 for CP to pass

* try populate _parameters from named_modules

* refactors
TP2 DP2 works
CP2 DP2 works

* is_causal=True and pack sequences, no attn mask, and preshuffle dataset

* fix packing

* CP=4 doesn't work

* fix labels and position_ids for CP

* DP CP works with transformers 🥳🥳🥳

* refactor

* add example cp

* fixup

* revert sdpa changes

* example cleared

* add CP, DP to the mesh init

* nit

* clean

* use `ALL_PARALLEL_STYLES`

* style

* FSDP works

* log on 1 rank

* .

* fix?

* FSDP1 also has .parameters() bug

* reported gradnorm when using FSDP1 is wrong, but loss is correct so it's okay

* .

* style and fixup

* move stuff around

* fix tests

* style

* let's make it a check

* warning should be an info

---------

Co-authored-by: Arthur Zucker <[email protected]>
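
For reference, a minimal sketch of the kind of DP/CP/TP device-mesh initialization the commit notes above describe, using PyTorch's init_device_mesh; the dimension sizes are assumed for illustration and this is not the PR's actual mesh-construction code:

```python
# Illustrative sketch only, assuming 8 GPUs split as dp=2, cp=2, tp=2.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")

dp, cp, tp = 2, 2, 2  # assumed world_size == dp * cp * tp == 8
mesh = init_device_mesh(
    "cuda",
    (dp, cp, tp),
    mesh_dim_names=("dp", "cp", "tp"),
)

# Sub-meshes can then be handed to the data-parallel, context-parallel and
# tensor-parallel components respectively.
dp_mesh = mesh["dp"]
cp_mesh = mesh["cp"]
tp_mesh = mesh["tp"]
```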
faaany pushed a commit to faaany/transformers that referenced this pull request May 21, 2025
* Revert "Protect ParallelInterface"

This reverts commit cb513e3.

* Revert "parallelism goes brrr (huggingface#37877)"

This reverts commit 1c2f36b.

* Empty commit
@ArthurZucker
Collaborator

We reverted it, mega sorry everyone. Our tests should have triggered the whole CI, but they did not.

@ydshieh
Collaborator

ydshieh commented May 21, 2025

@ArthurZucker In this particular case, the failing tests are either only triggered on machines with a GPU, or they only fail on such machines while passing on CPU machines.

OK, so the test fetcher has to be checked too.

xvyv99 pushed a commit to xvyv99/transformers that referenced this pull request May 21, 2025

Labels

Core: Modeling (Internals of the library; Models.), Tensor Parallel


10 participants