I don't know why. It only occurs in the multi-card case; a single card does not have this issue. |
|
The script is adapted from https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/reward_modeling.py. The changes only make it work on Habana. |
|
The reward modeling compute_loss is a little different from the usual one, see https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/reward_modeling.py#L268-L271. I'm not sure whether this is the cause of the issue. |
|
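For readers unfamiliar with that loss, here is a minimal sketch, assuming the linked lines implement the usual pairwise ranking loss, `-logsigmoid(rewards_j - rewards_k).mean()`, where the model scores the preferred response (`j`) and the rejected response (`k`) in two separate forward passes before a single backward pass. The function names below are hypothetical and the code uses plain Python floats instead of tensors, so it only illustrates the arithmetic, not the training script itself.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_reward_loss(rewards_j: list[float], rewards_k: list[float]) -> float:
    """Mean of -log(sigmoid(r_j - r_k)) over all (preferred, rejected) pairs.

    rewards_j: scores of the preferred responses (hypothetical name).
    rewards_k: scores of the rejected responses (hypothetical name).
    """
    if len(rewards_j) != len(rewards_k) or not rewards_j:
        raise ValueError("need two non-empty lists of equal length")
    losses = [-log_sigmoid(rj - rk) for rj, rk in zip(rewards_j, rewards_k)]
    return sum(losses) / len(losses)


# The loss shrinks as the preferred response is scored further above the rejected one.
tied = pairwise_reward_loss([0.0], [0.0])
separated = pairwise_reward_loss([2.0], [0.0])
```

If this is indeed what the script does, each training step runs the model twice before calling backward, which is one way it differs from a typical single-forward fine-tuning step.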
@regisss, no, I've never seen this behavior before. |
|
@regisss, will you help merge the PR? I am enabling RLHF (PPO) on Gaudi2. Basic functionality is now working for reward modeling and reinforcement learning, and performance looks promising. Later I would like to clean up the code and upload the PPO and DPO examples to optimum-habana. |
|
@sywangyi, please file a Jira ticket with Habana, including a simple test case that reproduces the problem. We need to investigate the root cause before we merge any workaround. |
|
@mandy-li I have filed a Jira ticket in the Habana Jira system. |
|
I still see this issue in SW release 1.13. Per @mandy-li, we could merge it as a workaround (WA) and remove it once the problem is fixed in Synapse. I tested on my side and did not see any performance regression in fine-tuning or inference. |
|
@mandy-li could you comment on this? |
Yes, I didn't see any perf degradation. |
|
Sounds good! And then I'll merge it! |
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
I am enabling RLHF on Habana. When fine-tuning the reward model on 8 Gaudi2 cards using DDP, an error occurs in the backward pass.
Code:
https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/finetuning/ppo_pipeline/reward_modeling.py
Command:
python ../instruction/gaudi_spawn.py --world_size 8 --use_mpi reward_modeling.py --model_name_or_path meta-llama/Llama-2-7b-hf --log_level info --num_train_epochs 3 --use_habana --output_dir output --ddp_find_unused_parameters True --logging_steps 10 --use_lazy_mode --evaluation_strategy="steps"
Error:
Traceback (most recent call last):
  File "/root/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/finetuning/ppo_pipeline/reward_modeling.py", line 475, in <module>
    trainer.train()
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 504, in train
    return inner_training_loop(
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 837, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 1361, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1989, in backward
    loss.backward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 498, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpex/kernels/RotaryPosEmbeddingHelper.py", line 157, in backward
    cos, sin, position_ids = ctx.saved_tensors
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [HPUBFloat16Type [1, 1, 512, 128]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
0%| | 0/354 [00:07<?, ?it/s]
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[3851,1],5]
Exit code: 1
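For context on the RuntimeError above, here is a minimal sketch of how this class of error arises in plain PyTorch on CPU. It is not a reproduction of the Habana issue and does not involve the rotary embedding kernel; it only shows that autograd raises this message when a tensor saved for the backward pass is modified in place before backward runs. The function name is hypothetical.

```python
import torch


def trigger_inplace_error() -> str:
    """Return the autograd error raised when a saved tensor is modified in place."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()   # exp() saves its output y for use in the backward pass
    y.add_(1)     # in-place update bumps y's version counter
    try:
        # backward finds y at a newer version than the one it saved
        y.sum().backward()
    except RuntimeError as err:
        return str(err)
    return ""


message = trigger_inplace_error()
```

In this toy case the out-of-place form `y = y + 1` avoids the error. Wrapping both the forward and backward passes in `torch.autograd.detect_anomaly()` additionally reports the forward operation whose gradient computation failed, which is what the hint in the error message suggests.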