@yeonsily yeonsily commented Apr 1, 2025

This is a follow-up to PR #786 that reworks the change as a custom op.

The RobertaEmbedding class has been removed from the model file and reimplemented as a CustomOp class in a new file.
forward_cuda() is the original forward function, and forward_hpu() contains our HPU-specific change.
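
For readers unfamiliar with the pattern, here is a minimal sketch of how an op with separate forward_cuda() and forward_hpu() methods can be dispatched by platform. It is an illustration only: `DispatchingOp`, `ToyEmbedding`, and the `platform` argument are hypothetical names, not vLLM's actual CustomOp API or the code in this PR, and the arithmetic is a placeholder rather than a real embedding.

```python
class DispatchingOp:
    """Hypothetical stand-in for a CustomOp-style base class (not vLLM's API)."""

    def __init__(self, platform: str) -> None:
        # Resolve the backend-specific method once, at construction time.
        backends = {"cuda": self.forward_cuda, "hpu": self.forward_hpu}
        if platform not in backends:
            raise ValueError(f"unsupported platform: {platform!r}")
        self._forward = backends[platform]

    def __call__(self, *args, **kwargs):
        # Callers invoke the op directly; the chosen backend runs.
        return self._forward(*args, **kwargs)

    def forward_cuda(self, *args, **kwargs):
        raise NotImplementedError

    def forward_hpu(self, *args, **kwargs):
        # Default: fall back to the original implementation.
        return self.forward_cuda(*args, **kwargs)


class ToyEmbedding(DispatchingOp):
    """Toy op; the doubling is a placeholder, not RoBERTa's embedding."""

    def forward_cuda(self, token_ids):
        # "Original" path.
        return ("cuda", [2 * t for t in token_ids])

    def forward_hpu(self, token_ids):
        # Device-specific path; same result here, different code path.
        return ("hpu", [2 * t for t in token_ids])


print(ToyEmbedding("hpu")([1, 2, 3]))
print(ToyEmbedding("cuda")([1, 2, 3]))
```

The point of the split is that the model file only references the op, while each platform's behavior lives in its own method and can be changed without touching the other.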

yeonsily commented Apr 3, 2025

@kzawora-intel Since @michalkuligowski is currently out, could you please review this PR? Thanks.

yeonsily commented Apr 8, 2025

@michalkuligowski @kzawora-intel Could you please advise what changes you would like to see in this PR? Our customer is waiting for RoBERTa embedding enablement.

@michalkuligowski

/run-gaudi-tests

yeonsily commented Apr 10, 2025

In the logs of the two failed tests, I see that the tests actually passed, but the process could not exit normally and printed this message:

"Received notify event: Due to an error on node g3-srv179-c03w-idc a jira ticket https://jira.habana-labs.com/browse/SW-225420 was opened, your resource vllm-fork-996-79cqyb8h7e-tfjob might be effected"

I don't think these are real failures. The same PR for the v1.21.0-next branch, #1049, passed all CI.

michalkuligowski pushed a commit that referenced this pull request Apr 16, 2025
Same PR as #996.
Just for v1.21.0_next branch.
@michalkuligowski

/run-gaudi-tests

@yeonsily yeonsily force-pushed the dev/enable_roberta_embedding2 branch from 2b3ca17 to 81cd1ba April 22, 2025 21:23
@michalkuligowski

/run-gaudi-tests

@michalkuligowski

/skip-gaudi-tests due to test passing:
2025-04-24T00:14:59Z tensorflow The!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

2025-04-24T00:15:06Z tensorflow === PASSED MODEL: Meta-Llama-3.2-11B-Vision-Instruct-mss.yaml ===

INFO Received notify event: your resource vllm-fork-996-q2mlrdm3tk-tfjob will reach its max duration in 30 minutes and will be deleted

WARNING received notify kill event: your resource vllm-fork-996-q2mlrdm3tk-tfjob has reached it's max duration 1h0m0s, it's going to be destroyed

WARNING workload removed from cluster

SUCCESS successfully removed failed workload vllm-fork-996-q2mlrdm3tk-tfjob
Finished: 2025-04-24T01:10:22Z
workload removed from cluster
Logs are available at https://logs-browser.k8s-infra.habana-labs.com/files/vllm-fork-996-q2mlrdm3tk-tfjob
Error: Process completed with exit code 1.

@michalkuligowski michalkuligowski merged commit f191153 into habana_main Apr 24, 2025
39 of 40 checks passed
@michalkuligowski michalkuligowski deleted the dev/enable_roberta_embedding2 branch April 24, 2025 12:15