Describe the bug
When using an RWKV config (to avoid running into the issue from #1305), I get the following error:
```
Traceback (most recent call last):
  File "/home/hatef.4/neox/gpt-neox/train.py", line 35, in <module>
    main()
  File "/home/hatef.4/neox/gpt-neox/train.py", line 31, in main
    pretrain(neox_args=neox_args)
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 296, in pretrain
    iteration = train(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1465, in train
    loss_dict, skipped_iter = train_step(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1277, in train_step
    reduced_loss = train_step_pipe(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1374, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 362, in train_batch
    self._exec_schedule(sched)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 1345, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 277, in _exec_reduce_grads
    self.allreduce_gradients(bucket_size=MEMORY_OPT_ALLREDUCE_SIZE)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/engine.py", line 1898, in allreduce_gradients
    assert not (self.bfloat16_enabled() and self.pipeline_parallelism), \
AssertionError: allreduce_gradients() is not valid when bfloat+pipeline_parallelism is enabled
```
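As the traceback shows, the pipeline path's `_exec_reduce_grads` calls `allreduce_gradients()`, which asserts whenever `bfloat16_enabled()` and pipeline parallelism are both true, so the rwkv config presumably enables both. A minimal sketch of the conflicting combination, assuming the usual gpt-neox yml keys (not the actual contents of rwkv/170M.yml):

```yaml
# Hypothetical excerpt illustrating the clash -- not the real rwkv/170M.yml
{
  # any value >= 1 routes training through DeepSpeed's PipelineEngine
  "pipe-parallel-size": 1,

  # makes bfloat16_enabled() true inside the DeepSpeed engine
  "precision": "bfloat16",
}
```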
To Reproduce
Steps to reproduce the behavior:
1. Install the latest DeeperSpeed
2. Run rwkv/170M.yml
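Until the merge below lands, one temporary way around the assert may be to drop one side of the bf16 + pipeline combination. This is an untested sketch assuming the standard gpt-neox keys; whether either option is viable for RWKV also depends on #1305:

```yaml
# Hypothetical override -- not the actual rwkv/170M.yml
{
  # Option A: keep pipeline parallelism, but train in fp16 instead of bf16
  "precision": "fp16",

  # Option B: keep bf16, but bypass the pipeline engine entirely
  # "pipe-parallel-size": 0,
}
```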
Proposed solution
Merging DeeperSpeed with upstream DeepSpeed would resolve this, but #1306 will need to be fixed first.