DeeperSpeed cannot support BFloat16 and PipelineParallelism #1307

Open
jahatef opened this issue Oct 15, 2024 · 1 comment
Labels
bug Something isn't working

Comments

jahatef (Collaborator) commented Oct 15, 2024

Describe the bug
When using an rwkv config (to avoid running into the issue from #1305), I get the following error:

```
Traceback (most recent call last):
  File "/home/hatef.4/neox/gpt-neox/train.py", line 35, in <module>
    main()
  File "/home/hatef.4/neox/gpt-neox/train.py", line 31, in main
    pretrain(neox_args=neox_args)
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 296, in pretrain
    iteration = train(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1465, in train
    loss_dict, skipped_iter = train_step(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1277, in train_step
    reduced_loss = train_step_pipe(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1374, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 362, in train_batch
    self._exec_schedule(sched)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 1345, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 277, in _exec_reduce_grads
    self.allreduce_gradients(bucket_size=MEMORY_OPT_ALLREDUCE_SIZE)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/engine.py", line 1898, in allreduce_gradients
    assert not (self.bfloat16_enabled() and self.pipeline_parallelism), \
AssertionError: allreduce_gradients() is not valid when bfloat+pipeline_parallelism is enabled
```
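
For context, the failure is a hard guard in the engine rather than a crash in model code. A minimal sketch of the check that fires, paraphrased from the `deepspeed/runtime/engine.py` frame in the traceback above (abbreviated; the actual source does more after this check):

```python
# Paraphrased from the deepspeed/runtime/engine.py frame in the traceback;
# abbreviated sketch, the real method performs the gradient reduction
# after this check.
def allreduce_gradients(self, bucket_size=MEMORY_OPT_ALLREDUCE_SIZE):
    # DeeperSpeed refuses outright when bf16 and pipeline parallelism are
    # combined, so any bf16 pipeline run dies here during _exec_reduce_grads.
    assert not (self.bfloat16_enabled() and self.pipeline_parallelism), \
        "allreduce_gradients() is not valid when bfloat+pipeline_parallelism is enabled"
    ...
```

So in this version the combination is rejected by design rather than failing incidentally.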

To Reproduce
Steps to reproduce the behavior:

  1. Install the latest DeeperSpeed.
  2. Run with rwkv/170M.yml (the relevant settings are sketched below).
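
The trigger is the combination of bf16 precision with a pipeline-parallel topology in the config. A hypothetical minimal excerpt (key names follow gpt-neox config conventions; the actual values in rwkv/170M.yml may differ):

```yaml
{
  # Any pipeline-parallel topology routes training through the
  # PipelineEngine, whose reduce-grads step calls allreduce_gradients().
  "pipe_parallel_size": 1,

  # bf16 precision makes bfloat16_enabled() true, tripping the assert above.
  "precision": "bfloat16",
}
```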

Proposed solution
Merging DeeperSpeed with upstream DeepSpeed would fix this (upstream's handling is sketched below), but #1306 will need to be fixed first.
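
For reference, upstream DeepSpeed branches on bf16 in the pipeline engine's reduce-grads step instead of asserting, which is why a merge would resolve this. A rough sketch from memory of upstream `deepspeed/runtime/pipe/engine.py` (names and details may have drifted; this is not DeeperSpeed's current code):

```python
# Rough sketch of upstream DeepSpeed's _exec_reduce_grads (from memory;
# the exact upstream code may differ). Upstream routes bf16 gradients
# through a dedicated reduction path rather than asserting.
def _exec_reduce_grads(self):
    self._force_grad_boundary = True
    if self.pipeline_enable_backward_allreduce:
        if self.bfloat16_enabled():
            self._bf16_reduce_grads()
        else:
            self.allreduce_gradients(bucket_size=MEMORY_OPT_ALLREDUCE_SIZE)
    self._force_grad_boundary = False
```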

jahatef added the bug label on Oct 15, 2024
cafeii commented Nov 18, 2024

same bug
