
fix dpo graph compile error in evaluation #630

Merged
regisss merged 1 commit into main from graph_error_dpo on Jan 9, 2024



@sywangyi (Collaborator) commented Jan 9, 2024

fix graph compilation issue in evaluation

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sywangyi sywangyi requested a review from regisss as a code owner January 9, 2024 13:10
@sywangyi (Collaborator, Author) commented Jan 9, 2024

@libinta

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss (Collaborator) left a comment

LGTM

What is the error you got? A wrong dtype in the graph?

@sywangyi (Collaborator, Author) commented Jan 9, 2024

> LGTM
>
> What is the error you got? A wrong dtype in the graph?

  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3066, in evaluate
    output = eval_loop(
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1119, in evaluation_loop
    initial_output = super().evaluation_loop(
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 1578, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1051, in prediction_step
    loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="eval")
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 948, in get_batch_loss_metrics
    ) = self.concatenated_forward(self.ref_model, batch)
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/trl/trainer/dpo_trainer.py", line 406, in concatenated_forward
    all_logits = model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1521, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1530, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 676, in forward
    return wrapped_hpugraph_forward(cache, stream, orig_fwd, args, kwargs, disable_tensor_cache, asynchronous, dry_run, max_graphs, hash_with_views)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 586, in wrapped_hpugraph_forward
    cached.graph.replay(cached.asynchronous)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 47, in replay
    _hpu_C.replay(self.hpu_graph, asynchronous)
RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generice failure].

Graph compile failure in evaluation

@regisss (Collaborator) commented Jan 9, 2024

Thanks, the error message is not very explicit 😁

@regisss merged commit 4e67153 into main on Jan 9, 2024
@regisss deleted the graph_error_dpo branch on January 9, 2024 at 14:18
jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024
gplutop7 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Oct 15, 2025
…gingface#2223) (huggingface#630)

Co-authored-by: Piotr Bielak <pbielak@users.noreply.github.com>

Labels

None yet

Projects

None yet


3 participants