
[Bug] RoPE positional embedding doesn't work when sequence_parallel is enabled for GPT models #6153

Closed
barry-jin opened this issue Mar 8, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@barry-jin

Describe the bug

When the rope position_embedding_type and sequence_parallel are both enabled for NeMo GPT pretraining, training fails with a runtime error:

  File "/workspace/NeMo/nemo/collections/nlp/modules/common/megatron/rotary_pos_embedding.py", line 59, in apply_rotary_pos_emb
    t = (t * freqs.cos()) + (_rotate_half(t) * freqs.sin())
RuntimeError: The size of tensor a (2048) must match the size of tensor b (1024) at non-singleton dimension 0
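
This failure mode can be illustrated with a small standalone sketch (not NeMo code). The shapes below are assumptions chosen to match the 2048-vs-1024 mismatch in the traceback: the rotary frequencies cover only a 1024-position shard while the activations still cover the full 2048-position sequence.

import torch

def _rotate_half(x):
    # Standard RoPE helper: split the last dimension in half and rotate.
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(t, freqs):
    # Both tensors are laid out [seq, batch, heads, dim]; dim 0 must broadcast.
    return (t * freqs.cos()) + (_rotate_half(t) * freqs.sin())

t = torch.randn(2048, 1, 8, 64)      # activations for the full sequence
freqs = torch.randn(1024, 1, 1, 64)  # frequencies for only a local shard
apply_rotary_pos_emb(t, freqs)       # RuntimeError: 2048 vs 1024 at dimension 0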

Steps/Code to reproduce bug

Follow the tutorial to train a GPT model with model.position_embedding_type=rope and model.sequence_parallel=True.
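
For reference, a pretraining command along these lines reproduces the setup; the script path and the parallelism overrides are assumptions (sequence parallelism only takes effect with tensor parallelism > 1), and only the two overrides named above come from this report:

python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    trainer.devices=2 \
    model.tensor_model_parallel_size=2 \
    model.position_embedding_type=rope \
    model.sequence_parallel=True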

Expected behavior

Training should run without a runtime error when both options are enabled.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud (specify cloud provider: AWS, Azure, GCP, Colab)]
  • Method of NeMo install: [pip install or from source]. Please specify exact commands you used to install.
  • If method of install is [Docker], provide docker pull & docker run commands used

Environment details

If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

Add any other context about the problem here.
Example: GPU model

@okuchaiev (Member)

Can you post your model config?

@MaximumEntropy (Contributor)

Thanks! This was just fixed in #6178.
