[WIP][Need help and discussion]: basic llama tensor parallel #32597
Closed
SeungyounShin wants to merge 0 commits into huggingface:main from
Conversation
SeungyounShin (Author) commented on Aug 11, 2024:
PyTorch tensor parallel can't recognize keyword args.
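If I understand the failure correctly, the tensor-parallel input pre-hooks only rewrite positional tensor arguments, while transformers invokes its submodules with keyword arguments (hidden_states=..., position_ids=..., etc.), so those inputs never get converted to DTensors. A minimal standalone illustration of the two call styles (a sketch, not the actual transformers code path):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, parallelize_module

# Assumes launch via `torchrun --nproc-per-node 2 this_script.py`.
mesh = init_device_mesh("cuda", (2,))
proj = parallelize_module(
    torch.nn.Linear(128, 256, bias=False).cuda(), mesh, ColwiseParallel()
)

x = torch.randn(4, 128, device="cuda")
out = proj(x)          # fine: the TP pre-hook sees the positional input and handles it
# out = proj(input=x)  # the pattern that reportedly breaks: keyword args bypass the
#                      # pre-hook, so a plain tensor meets the sharded DTensor weight
```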
Collaborator:
I think you need to scale the Rotary Embeddings by the tp_mesh.size(). Since you have a TP size of 2, you're seeing the scale off by that factor: self_attn.rotary_emb has no parallelize plan, so it isn't accounting for the changes.
Collaborator:
If you look at torchtitan, they do some reshaping for broadcast for RoPE: https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama/model.py#L112
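For reference, the helper behind that link looks roughly like the following (paraphrased from the linked torchtitan code; see the URL above for the authoritative version). It reshapes the precomputed rotary frequencies so they broadcast over the batch and (locally sharded) head dimensions of q/k:

```python
import torch

def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # freqs_cis: (seqlen, head_dim // 2) complex rotary frequencies
    # x:         (batch, seqlen, n_local_heads, head_dim // 2) complex view of q or k
    ndim = x.ndim
    assert ndim > 1
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    # Keep the seqlen and last dims, insert size-1 dims everywhere else so the
    # frequencies broadcast across batch and heads, whatever the local head count is.
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)
```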
What does this PR do?
This PR addresses an issue encountered when running the following command:
The current implementation results in the following error:
Problem Description and Discussion Points
Sequence Length:
It appears that the Tensor Parallel approach requires the sequence length to be evenly divisible (presumably by the tensor-parallel degree), which the existing implementation does not currently handle (though this doesn't apply in inference mode); a padding workaround is sketched below.
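If the divisibility requirement is indeed by the tensor-parallel degree, one possible workaround (a sketch, not what this PR implements) is to right-pad each training batch up to the next multiple before the forward pass, assuming a valid pad token and -100 as the ignored label index:

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(input_ids, attention_mask, labels, multiple: int, pad_token_id: int):
    """Right-pad a batch so its sequence length is divisible by `multiple` (e.g. the TP degree)."""
    pad = (-input_ids.shape[1]) % multiple
    if pad == 0:
        return input_ids, attention_mask, labels
    input_ids = F.pad(input_ids, (0, pad), value=pad_token_id)
    attention_mask = F.pad(attention_mask, (0, pad), value=0)
    labels = F.pad(labels, (0, pad), value=-100)  # -100 positions are ignored by the loss
    return input_ids, attention_mask, labels
```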
Potential Solution - Accelerate:
Given the benefits of Tensor Parallel in training, especially when compared with distributed data-parallel methods like DeepSpeed and FSDP, I'm considering submitting a PR to the accelerate library. However, it's important to note that the current structure of the transformers models may need to be adjusted to fully realize these benefits.
Root Cause of the Error:
The error seems to stem from the fact that positional embeddings are applied immediately after the token embeddings. This is incompatible with Tensor Parallel and causes the misalignment seen in the error.
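For context, a basic per-layer plan in the torchtitan style might look roughly like the sketch below (submodule names are those of transformers' LlamaDecoderLayer; this is not necessarily the exact plan used in this PR). Nothing in such a plan covers the rotary/positional embedding handling, which is consistent with the root-cause observation above:

```python
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

tp_mesh = init_device_mesh("cuda", (2,))  # assumes a 2-process torchrun launch

# Column-shard the q/k/v and MLP up projections, row-shard the output projections.
layer_plan = {
    "self_attn.q_proj": ColwiseParallel(),
    "self_attn.k_proj": ColwiseParallel(),
    "self_attn.v_proj": ColwiseParallel(),
    "self_attn.o_proj": RowwiseParallel(),
    "mlp.gate_proj": ColwiseParallel(),
    "mlp.up_proj": ColwiseParallel(),
    "mlp.down_proj": RowwiseParallel(),
}

# for layer in model.model.layers:          # `model` being a LlamaForCausalLM
#     parallelize_module(layer, tp_mesh, layer_plan)
#
# Caveat: with the projections sharded, each rank only holds num_heads / tp_size heads,
# so the attention module's internal head counts and reshapes also have to account for
# the local shard.
```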
Request for Assistance:
Addressing this issue might require significant changes to the codebase. As such, I would greatly appreciate any feedback, guidance, or assistance in this matter.
As a recent graduate, I've observed that many teams now use 2-4 nodes with 8 GPUs each. In these setups, data-parallel (DP) methods like DeepSpeed and FSDP often suffer from high ring-communication latency (and many are limited by the memory constraints of a single device). I believe Tensor Parallel within a node, coupled with DP across nodes, could become a dominant approach in the near future. I'm eager to discuss and collaborate on making this approach compatible with the Transformers library or a similar framework.
Before submitting
Thank you in advance for your time and consideration. I look forward to any suggestions or feedback.
cc. @amyeroberts