DeeperSpeed cannot support BFloat16 and PipelineParallelism #1307

Open
jahatef opened this issue Oct 15, 2024 · 1 comment
Labels
bug Something isn't working

Comments

jahatef (Collaborator) commented Oct 15, 2024

Describe the bug
When using an rwkv config (to avoid running into the issue from #1305), I get the following error:

```
Traceback (most recent call last):
  File "/home/hatef.4/neox/gpt-neox/train.py", line 35, in <module>
    main()
  File "/home/hatef.4/neox/gpt-neox/train.py", line 31, in main
    pretrain(neox_args=neox_args)
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 296, in pretrain
    iteration = train(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1465, in train
    loss_dict, skipped_iter = train_step(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1277, in train_step
    reduced_loss = train_step_pipe(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1374, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 362, in train_batch
    self._exec_schedule(sched)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 1345, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 277, in _exec_reduce_grads
    self.allreduce_gradients(bucket_size=MEMORY_OPT_ALLREDUCE_SIZE)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/engine.py", line 1898, in allreduce_gradients
    assert not (self.bfloat16_enabled() and self.pipeline_parallelism), \
AssertionError: allreduce_gradients() is not valid when bfloat+pipeline_parallelism is enabled
```
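
For context, the failure is a hard guard in the engine rather than a crash in model code. A minimal sketch of the check that fires, paraphrased from the `deepspeed/runtime/engine.py` frame in the traceback above (abbreviated; the actual source does more after this check):

```python
# Paraphrased from the deepspeed/runtime/engine.py frame in the traceback;
# abbreviated sketch, the real method performs the gradient reduction
# after this check.
def allreduce_gradients(self, bucket_size=MEMORY_OPT_ALLREDUCE_SIZE):
    # DeeperSpeed refuses outright when bf16 and pipeline parallelism are
    # combined, so any bf16 pipeline run dies here during _exec_reduce_grads.
    assert not (self.bfloat16_enabled() and self.pipeline_parallelism), \
        "allreduce_gradients() is not valid when bfloat+pipeline_parallelism is enabled"
    ...
```

So in this version the combination is rejected by design rather than failing incidentally.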

To Reproduce
Steps to reproduce the behavior:

  1. Install the latest DeeperSpeed.
  2. Run with rwkv/170M.yml (the relevant settings are sketched below).
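
The trigger is the combination of bf16 precision with a pipeline-parallel topology in the config. A hypothetical minimal excerpt (key names follow gpt-neox config conventions; the actual values in rwkv/170M.yml may differ):

```yaml
{
  # Any pipeline-parallel topology routes training through the
  # PipelineEngine, whose reduce-grads step calls allreduce_gradients().
  "pipe_parallel_size": 1,

  # bf16 precision makes bfloat16_enabled() true, tripping the assert above.
  "precision": "bfloat16",
}
```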

Proposed solution
Merging DeeperSpeed with upstream DeepSpeed would fix this (upstream's handling is sketched below), but #1306 will need to be fixed first.
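
For reference, upstream DeepSpeed branches on bf16 in the pipeline engine's reduce-grads step instead of asserting, which is why a merge would resolve this. A rough sketch from memory of upstream `deepspeed/runtime/pipe/engine.py` (names and details may have drifted; this is not DeeperSpeed's current code):

```python
# Rough sketch of upstream DeepSpeed's _exec_reduce_grads (from memory;
# the exact upstream code may differ). Upstream routes bf16 gradients
# through a dedicated reduction path rather than asserting.
def _exec_reduce_grads(self):
    self._force_grad_boundary = True
    if self.pipeline_enable_backward_allreduce:
        if self.bfloat16_enabled():
            self._bf16_reduce_grads()
        else:
            self.allreduce_gradients(bucket_size=MEMORY_OPT_ALLREDUCE_SIZE)
    self._force_grad_boundary = False
```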

jahatef added the bug label on Oct 15, 2024
cafeii commented Nov 18, 2024

same bug
