Run Llama2 with torch.compile on Gaudi2 #605
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -406,6 +406,7 @@ def generate( | |||||
| negative_prompt_ids: Optional[torch.Tensor] = None, | ||||||
| negative_prompt_attention_mask: Optional[torch.Tensor] = None, | ||||||
| lazy_mode: Optional[bool] = False, | ||||||
| torch_compile: Optional[bool] = False, | ||||||
|
Collaborator
For normal training, evaluation, and prediction, models are wrapped in the `accelerator.prepare_model()` call, so adding new code to `generate()` may not be aligned with that. @regisss, any idea how direct `model.generate()` calls are handled in Transformers for compile mode? I tried to search there but did not find anything.
Collaborator
In the trainer, the link with Accelerate is made here:
And then in Accelerate it happens here:
It was introduced in #465.
Collaborator
Outside of the trainer, Transformers recommends simply calling `torch.compile()` on the model: https://huggingface.co/docs/transformers/v4.36.1/en/perf_torch_compile
Contributor
Author
As suggested, I will create `get_torch_compiled_model()` in text-generation/utils.py. It will be called inside `setup_model()` in the same file.
||||||
| hpu_graphs: Optional[bool] = False, | ||||||
| profiling_warmup_steps: Optional[int] = 0, | ||||||
| profiling_steps: Optional[int] = 0, | ||||||
|
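A minimal sketch of the helper proposed in the review thread on the `torch_compile` argument, offered as one possible shape rather than the code that was eventually merged. The name `get_torch_compiled_model` and the backend string `aot_hpu_inference_backend` come from this PR; the function body, the `compile_fn` parameter, and `setup_model_sketch` (which does not mirror the real `setup_model()` signature) are assumptions made for illustration.

```python
from typing import Any, Callable, Optional


def get_torch_compiled_model(
    model: Any,
    backend: str = "aot_hpu_inference_backend",
    compile_fn: Optional[Callable[..., Any]] = None,
) -> Any:
    """Return a compiled version of `model` (hypothetical body)."""
    if compile_fn is None:
        # Imported lazily so the sketch can be exercised without PyTorch.
        import torch

        compile_fn = torch.compile
    return compile_fn(model, backend=backend)


def setup_model_sketch(
    model: Any,
    use_torch_compile: bool = False,
    compile_fn: Optional[Callable[..., Any]] = None,
) -> Any:
    """Hypothetical stand-in for the model setup step in the example script."""
    if use_torch_compile:
        # Compile once at setup time, so generate() needs no extra argument.
        return get_torch_compiled_model(model, compile_fn=compile_fn)
    return model
```

With this shape, every decoding strategy sees the already compiled model, and `generate()` does not have to know about compilation at all.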
|
@@ -474,6 +475,8 @@ def generate( | |||||
| Attention_mask for `negative_prompt_ids`. | ||||||
| lazy_mode (`bool`, *optional*, defaults to `False`): | ||||||
| Whether the run is executed in lazy mode or not (i.e. eager mode). | ||||||
| torch_compile (`bool`, *optional*, defaults to `False`): | ||||||
| Whether the run is executed with torch.compile model or not. | ||||||
| hpu_graphs (`bool`, *optional*, defaults to `False`): | ||||||
| Whether to use HPU graphs for inference. | ||||||
| profiling_warmup_steps (`int`, *optional*, defaults to 0): | ||||||
|
|
@@ -513,6 +516,10 @@ def generate( | |||||
| raise ValueError( | ||||||
| "`hpu_graphs` is True but `lazy_mode` is False. HPU graphs require `lazy_mode` to be set to True." | ||||||
| ) | ||||||
| if torch_compile and (lazy_mode or hpu_graphs): | ||||||
| raise ValueError( | ||||||
| "`torch_compile` is True. This requires both `lazy_mode` and `hpu_graphs` to be set to False." | ||||||
| ) | ||||||
|
|
||||||
| # priority: `generation_config` argument > `model.generation_config` (the default generation config) | ||||||
| if generation_config is None: | ||||||
|
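The two checks in this hunk can be read as one small rule set: HPU graphs need lazy mode, and `torch_compile` excludes both. The sketch below pulls them into a standalone function so the allowed combinations are easy to test. The function name `check_execution_mode` is hypothetical; the conditions and error messages are copied from the diff.

```python
def check_execution_mode(
    lazy_mode: bool = False,
    hpu_graphs: bool = False,
    torch_compile: bool = False,
) -> None:
    """Raise ValueError for flag combinations rejected in this PR's generate()."""
    if hpu_graphs and not lazy_mode:
        raise ValueError(
            "`hpu_graphs` is True but `lazy_mode` is False. "
            "HPU graphs require `lazy_mode` to be set to True."
        )
    if torch_compile and (lazy_mode or hpu_graphs):
        raise ValueError(
            "`torch_compile` is True. This requires both `lazy_mode` "
            "and `hpu_graphs` to be set to False."
        )


# Allowed: plain eager mode, lazy mode with HPU graphs, or torch.compile alone.
check_execution_mode()
check_execution_mode(lazy_mode=True, hpu_graphs=True)
check_execution_mode(torch_compile=True)
```

Under these rules, `torch_compile=True` is only accepted when the run is in eager mode without HPU graphs.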
|
@@ -838,6 +845,7 @@ def generate( | |||||
| synced_gpus=synced_gpus, | ||||||
| streamer=streamer, | ||||||
| lazy_mode=lazy_mode, | ||||||
| torch_compile=torch_compile, | ||||||
| ignore_eos=generation_config.ignore_eos, | ||||||
| profiling_warmup_steps=profiling_warmup_steps, | ||||||
| profiling_steps=profiling_steps, | ||||||
|
|
@@ -1214,6 +1222,7 @@ def greedy_search( | |||||
| synced_gpus: bool = False, | ||||||
| streamer: Optional["BaseStreamer"] = None, | ||||||
| lazy_mode: Optional[bool] = False, | ||||||
| torch_compile: Optional[bool] = False, | ||||||
| ignore_eos: Optional[bool] = False, | ||||||
| profiling_warmup_steps: Optional[int] = 0, | ||||||
| profiling_steps: Optional[int] = 0, | ||||||
|
|
@@ -1265,6 +1274,8 @@ def greedy_search( | |||||
| through `streamer.put(token_ids)` and the streamer is responsible for any further processing. | ||||||
| lazy_mode (`bool`, *optional*, defaults to `False`): | ||||||
| Whether the run is executed in lazy mode or not (i.e. eager mode). | ||||||
| torch_compile (`bool`, *optional*, defaults to `False`): | ||||||
| Whether the run is executed with torch.compile model or not. | ||||||
| ignore_eos (`bool`, *optional*, defaults to `False`): | ||||||
| Whether to ignore finished sequences (faster in lazy mode and with HPU graphs) or not (eager mode). | ||||||
| profiling_warmup_steps (`int`, *optional*, defaults to 0): | ||||||
|
|
@@ -1403,14 +1414,26 @@ def greedy_search( | |||||
|
|
||||||
| hpu_graphs_kwargs = self._get_hpu_graphs_kwargs(model_kwargs) | ||||||
|
|
||||||
| # forward pass to get next token | ||||||
| outputs = self( | ||||||
| **model_inputs, | ||||||
| return_dict=True, | ||||||
| output_attentions=output_attentions, | ||||||
| output_hidden_states=output_hidden_states, | ||||||
| **hpu_graphs_kwargs, | ||||||
| ) | ||||||
| if torch_compile: | ||||||
|
Collaborator
Wrapping the model only in `greedy_search` does not look right. It should probably be done in `generate()` so that it also works for other modes, such as `beam_search`.
Collaborator
Not even sure we should do it in
Collaborator
@regisss, thanks for your comments. We will check whether we can go with adding `get_torch_compiled_model` in text-generation/utils.py.
Contributor
Author
OK. I will create `get_torch_compiled_model` in text-generation/utils.py.
||||||
| # apply torch.compile | ||||||
| compiled_model = torch.compile(self, backend="aot_hpu_inference_backend") | ||||||
| # forward pass to get next token | ||||||
| outputs = compiled_model( | ||||||
| **model_inputs, | ||||||
| return_dict=True, | ||||||
| output_attentions=output_attentions, | ||||||
| output_hidden_states=output_hidden_states, | ||||||
| **hpu_graphs_kwargs, | ||||||
| ) | ||||||
| else: | ||||||
| # forward pass to get next token | ||||||
| outputs = self( | ||||||
| **model_inputs, | ||||||
| return_dict=True, | ||||||
| output_attentions=output_attentions, | ||||||
| output_hidden_states=output_hidden_states, | ||||||
| **hpu_graphs_kwargs, | ||||||
| ) | ||||||
|
|
||||||
| if synced_gpus and this_peer_finished: | ||||||
| continue # don't waste resources running the code we don't need | ||||||
|
|
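In the hunk above, `torch.compile(...)` is called inside the decoding loop, so the wrapper is rebuilt on every step, and the forward call is duplicated across the `if`/`else` branches. A hedged alternative is to pick the callable once before the loop and reuse it. The sketch below illustrates only that control flow: `run_decoding_loop`, `num_steps`, and `compile_fn` are hypothetical names, and the "model" is any callable rather than the real `self(**model_inputs, ...)` call.

```python
from typing import Any, Callable, List, Optional


def run_decoding_loop(
    model: Callable[[int], Any],
    num_steps: int,
    torch_compile: bool = False,
    compile_fn: Optional[Callable[..., Any]] = None,
) -> List[Any]:
    """Run `num_steps` forward calls, compiling at most once beforehand."""
    forward = model
    if torch_compile:
        if compile_fn is None:
            # Imported lazily so the sketch can be exercised without PyTorch.
            import torch

            compile_fn = torch.compile
        # Compile once, outside the loop (backend name taken from the diff).
        forward = compile_fn(model, backend="aot_hpu_inference_backend")

    outputs = []
    for step in range(num_steps):
        # Single call site: no duplicated if/else around the forward pass.
        outputs.append(forward(step))
    return outputs
```

Whether repeated `torch.compile` calls actually trigger recompilation depends on the backend's caching, so the per-step cost in the original hunk is an assumption here. Hoisting the call removes the question either way.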
||||||
to be aligned with Transformers and GaudiTrainingArguments
OK, I will change it.