Add seed in DPO to reproduce the training result. Disable HPU graph … #646

Merged
regisss merged 2 commits into main from add_seed on Jan 22, 2024
@sywangyi
Collaborator

…for training if gradient_checkpointing is used

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
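
The title describes disabling HPU graphs for training when gradient checkpointing is used. A minimal sketch of such a guard in plain Python follows. The function name `resolve_hpu_graphs_for_training` and the `settings` dict are hypothetical and are not the code merged in this PR; the two keys are assumed to mirror the training arguments `gradient_checkpointing` and `use_hpu_graphs_for_training`.

```python
import warnings


def resolve_hpu_graphs_for_training(gradient_checkpointing: bool,
                                    use_hpu_graphs_for_training: bool) -> bool:
    """Return the effective HPU-graph setting for training (hypothetical helper)."""
    if gradient_checkpointing and use_hpu_graphs_for_training:
        # The two options are reported as incompatible in this thread,
        # so the sketch turns HPU graphs off and tells the user why.
        warnings.warn(
            "gradient_checkpointing is enabled; disabling HPU graphs for training."
        )
        return False
    return use_hpu_graphs_for_training


# Hypothetical settings; the key names are assumptions for illustration.
settings = {"gradient_checkpointing": True, "use_hpu_graphs_for_training": True}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    effective = resolve_hpu_graphs_for_training(**settings)
print(effective)
```

Returning the resolved value, rather than mutating the arguments in place, keeps the guard easy to test on a machine without Gaudi hardware.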

…for training if gradient_checkpointing is used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sywangyi sywangyi requested a review from regisss as a code owner January 18, 2024 09:22
@sywangyi
Collaborator Author

@libinta @regisss please help review the PR.
Combining gradient_checkpointing with HPU graphs for training crashes with a traceback like this:
Traceback (most recent call last):
  File "/intel-extension-for-transformers/optimum-habana/examples/trl/dpo.py", line 223, in <module>
    dpo_trainer.train()
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 496, in train
    return inner_training_loop(
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 857, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 1379, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 981, in compute_loss
    loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 926, in get_batch_loss_metrics
    ) = self.concatenated_forward(model, batch)
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/trl/trainer/dpo_trainer.py", line 405, in concatenated_forward
    all_logits = model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1521, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1530, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 936, in forward
    return self.cache_insert(input_id, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 911, in cache_insert
    graph_model.init_hpu_graph(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 796, in init_hpu_graph
    self.hpu_graph = make_graphed_callables(self, tensor_args, allow_unused_input=self.allow_unused_input,
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 266, in make_graphed_callables
    grad_inputs = torch.autograd.grad(outputs=tuple(o for o in static_outputs if o.requires_grad),
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 345, in grad
    return handle_torch_function(
  File "/usr/local/lib/python3.10/dist-packages/torch/overrides.py", line 1577, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 53, in __torch_function__
    return super().__torch_function__(func, types, new_args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 394, in grad
    result = Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
0%| | 0/1000 [00:07<?, ?it/s]
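
The final RuntimeError comes from torch.autograd.grad, which raises it when one of the tensors passed as inputs did not contribute to the outputs. A minimal sketch in plain PyTorch on CPU reproduces the same message. It does not involve HPU graphs or gradient checkpointing, and the tensor names are illustrative only. Why the captured graph ends up with an unused input in the DPO case is not stated in this thread.

```python
import torch

x = torch.ones(2, requires_grad=True)
unused = torch.ones(2, requires_grad=True)  # never participates in the output

# Asking for a gradient w.r.t. a tensor outside the graph raises the error.
out = (x * 2).sum()
try:
    torch.autograd.grad(out, [x, unused])
except RuntimeError as err:
    message = str(err)
    print(message)

# With allow_unused=True the call succeeds and the unused input gets None.
out = (x * 2).sum()  # rebuild the graph for a clean second call
grads = torch.autograd.grad(out, [x, unused], allow_unused=True)
print(grads)
```

In the second call, the gradient for `x` is a tensor of twos and the entry for `unused` is `None`.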

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@libinta libinta added the synapse1.14 and run-test (Run CI for PRs from external contributors) labels on Jan 20, 2024
@regisss regisss merged commit 5cf7c95 into main Jan 22, 2024
@regisss regisss deleted the add_seed branch January 22, 2024 22:27
jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024
Labels

run-test Run CI for PRs from external contributors

4 participants