add seed in DPO for reproduce the training result. disable hpu graph … by sywangyi · Pull Request #646 · huggingface/optimum-habana

sywangyi · 2024-01-18T09:22:20Z

…for training if gradient_checkpointing is used

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

…for training if gradient_checkpointing is used Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
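
The diff itself is not reproduced in this conversation, so the following is only a minimal sketch of the behaviour the title describes: fix a seed so runs are repeatable, and turn HPU graphs off for training when gradient checkpointing is enabled. `TrainingFlags`, `resolve_flags` and `seeded_draws` are hypothetical names made up for illustration, the field names are illustrative, and the standard-library RNG stands in for whatever seeding the actual change performs.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainingFlags:
    """Hypothetical stand-in for the training arguments; not the real class."""
    seed: int = 42
    gradient_checkpointing: bool = False
    use_hpu_graphs_for_training: bool = True


def resolve_flags(flags: TrainingFlags) -> TrainingFlags:
    """Turn HPU graphs off for training when gradient checkpointing is on."""
    if flags.gradient_checkpointing and flags.use_hpu_graphs_for_training:
        return replace(flags, use_hpu_graphs_for_training=False)
    return flags


def seeded_draws(seed: int, n: int = 3) -> list:
    """Seed the stdlib RNG, then draw n numbers; same seed gives same draws."""
    random.seed(seed)
    return [random.random() for _ in range(n)]
```

Calling `resolve_flags(TrainingFlags(gradient_checkpointing=True))` returns a copy with `use_hpu_graphs_for_training` set to `False`, and two calls to `seeded_draws` with the same seed return identical lists, which is the property the seed is meant to provide.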

sywangyi · 2024-01-18T09:23:29Z

@libinta @regisss please help review the PR.
Using gradient_checkpointing together with HPU graphs for training crashes with an error like the following:
Traceback (most recent call last):
  File "/intel-extension-for-transformers/optimum-habana/examples/trl/dpo.py", line 223, in <module>
    dpo_trainer.train()
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 496, in train
    return inner_training_loop(
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 857, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 1379, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 981, in compute_loss
    loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 926, in get_batch_loss_metrics
    ) = self.concatenated_forward(model, batch)
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/trl/trainer/dpo_trainer.py", line 405, in concatenated_forward
    all_logits = model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1521, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1530, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 936, in forward
    return self.cache_insert(input_id, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 911, in cache_insert
    graph_model.init_hpu_graph(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 796, in init_hpu_graph
    self.hpu_graph = make_graphed_callables(self, tensor_args, allow_unused_input=self.allow_unused_input,
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 266, in make_graphed_callables
    grad_inputs = torch.autograd.grad(outputs=tuple(o for o in static_outputs if o.requires_grad),
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 345, in grad
    return handle_torch_function(
  File "/usr/local/lib/python3.10/dist-packages/torch/overrides.py", line 1577, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 53, in __torch_function__
    return super().__torch_function__(func, types, new_args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 394, in grad
    result = Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
  0%|          | 0/1000 [00:07<?, ?it/s]
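
The final RuntimeError in the traceback comes from `torch.autograd.grad`, which `make_graphed_callables` calls while capturing the graph. As a minimal sketch of what that message means, here is a plain PyTorch reproduction on CPU. It does not involve HPU graphs or gradient checkpointing and is not a reproduction of the failure above; `make_loss` is a hypothetical helper written only for this illustration.

```python
import torch


def make_loss():
    """Build a loss that depends on `used` but never touches `unused`."""
    used = torch.ones(3, requires_grad=True)
    unused = torch.ones(3, requires_grad=True)
    loss = (used * 2).sum()
    return loss, used, unused


# Default behaviour: asking for a gradient w.r.t. an unused tensor raises.
loss, used, unused = make_loss()
try:
    torch.autograd.grad(loss, [used, unused])
    error_message = None
except RuntimeError as err:
    error_message = str(err)

# With allow_unused=True the unused tensor simply gets a None gradient.
loss, used, unused = make_loss()
grads = torch.autograd.grad(loss, [used, unused], allow_unused=True)
```

Here `error_message` holds the same "One of the differentiated Tensors appears to not have been used in the graph" text, and `grads[1]` is `None` in the second call. In the traceback above the input list is assembled inside `make_graphed_callables` rather than by user code, so this PR avoids the failing path by not using HPU graphs for training in that configuration.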

HuggingFaceDocBuilderDev · 2024-01-18T09:27:18Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

add seed in DPO for reproduce the training result. disable hpu graph …

65d7a68

…for training if gradient_checkpointing is used Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

sywangyi requested a review from regisss as a code owner January 18, 2024 09:22

libinta approved these changes Jan 20, 2024

libinta added the synapse1.14 and run-test (Run CI for PRs from external contributors) labels Jan 20, 2024

Refinements

fd7b0e1

regisss approved these changes Jan 22, 2024

regisss merged commit 5cf7c95 into main Jan 22, 2024

regisss deleted the add_seed branch January 22, 2024 22:27

jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024

Add seed in DPO to reproduce training results (huggingface#646)

94d5257

regisss merged 2 commits into main from add_seed
