fix beamsearch crash and incorrect output in decode-only model and en…#627
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
c982ec6 to
105e0e2
Compare
|
because the static shape in self.generation_config is not set (see https://github.com/huggingface/optimum-habana/pull/627/files#diff-284d6c109e6788a4405e302113203f6b7e98be68afc7dd3d85b89c6329b9275eL540-R541) if generation_config is passed explicitly with static_shapes=True, the issue is hidden in optimum-habana text generation example. I use transformer pipeline to run in gaudi without generation_config pass in generate(), so find the bug |
6a05a32 to
54f780a
Compare
|
provide a simple test before the fix, the output is after the fix, the output is |
|
provide a simple test for summerization using encoder-decoder model. from transformers import pipeline
import habana_frameworks.torch
import torch
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi
adapt_transformers_to_gaudi()
generation_kwargs = dict(do_sample=False, num_beams=4, use_cache=True, max_new_tokens=32)
generator = pipeline(
"summarization",
model="facebook/bart-large-cnn",
torch_dtype=torch.bfloat16,
device="hpu",
**generation_kwargs,
)
print(generator(["(CNN)The Palestinian Authority officially became", "DeepSpeed is a machine learning"], batch_size=2))
without the fix. it will coredump |
|
@sywangyi It seems the PR "breaks" regular generation. Running on main returns but on your branch I get |
54f780a to
a7721d5
Compare
upload a commit to fix it. |
|
I see perf drop 1.082(main) vs 0.972 (on your branch) python3 /root/optimum-habana/examples/summarization/run_summarization.py --model_name_or_path t5-3b --do_predict --predict_with_generate --dataset_name cnn_dailymail --dataset_config 3.0.0 --use_habana --per_device_eval_batch_size 2 --gaudi_config_name Habana/t5 --generation_num_beams 4 --ignore_pad_token_for_loss False --pad_to_max_length --use_hpu_graphs_for_inference --use_lazy_mode --max_predict_samples 200 --bf16 --bf16_full_eval --output_dir /tmp/tmpccx5bpdn |
Hi, the difference is caused by the change in
|
Understood. |
Also, when ever user defines their own stop criteria, perf degradation is expected based on whether they support static_shapes (or) not. However, in default case, we should make it static_shapes for better perf. I will also brainstorm on ways to avoid slicing. |
|
@bhargaveede and @regisss the is max length criteria that we use to determine if the generation has ended, you could see it use input_ids.shape[-1] and compare with max_length. if you want to use static shape. we need to pass token idx in. override the max length criteria in transformer. user defined criteria also need the token idx, because input_ids.shape contain pad length which is not reliable. what's your point? |
Cant we using StaticMaxLengthCriteria? |
|
now the the StoppingCriteria contained more than one criterias. for example. for text generation. it contains [<transformers.generation.stopping_criteria.MaxLengthCriteria object at 0x7fc31fc862f0>, <optimum.habana.transformers.generation.utils.StaticMaxLengthCriteria object at 0x7fc31fc86320>]. once one of the criteria meet exit requirment. the generaion will end. and if user defined criteria like end in some specific "ids", correct logic is to get the latest generation ids and comparing. see if meet the ending requirement. |
Got it. I think both StaticMaxLengthCriteria and MaxLengthCritieria shouldn't be together. If it's there then we should avoid it. |
|
@sywangyi @bhargaveede Ideally, we could remove the
Would that solve this issue and work for you? |
|
agree, we should overide MaxLengthCriteria, pass token_idx in and remove StaticMaxLengthCriteria, which makes things simple. |
a7721d5 to
7ff1fb2
Compare
|
Changes look fine to me. I will check locally if t5 and bart are fine with the patch. |
… encode-decode model Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
7ff1fb2 to
1902958
Compare
|
Hmm still seeing some throughput regressions, for example with BLOOMZ-7b: I get 37.98 tokens/s instead of 41.52 on Gaudi1, and 106.42 instead of 130.1 on Gaudi2. |
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
yes, I reproduce it and find it's related with token_idx, because it's a hpu tensor? upload a PR to fix it. |
|
could you try by your side, and see if it's working now @regisss ? |
All regression tests passed, except the torch.compile one that seems to be broken: Here is the command to reproduce it (torch.compile is available on Gaudi2 only): It comes from this line in Transformers: https://github.com/huggingface/transformers/blob/345b9b1a6a308a1fa6559251eb33ead2211240ac/src/transformers/generation/utils.py#L1139 |
|
Hi, @regisss , I use you command, and see different error in gaudi 2 Also there's code like in this PR |
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
Co-authored-by: Piotr Bielak <pbielak@users.noreply.github.com> Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>


…code-decode model
found with "do_sample=False, num_beams=4" in generation. test model include
decode only:"gpt2" "tiiuae/falcon-7b-instruct" "distilgpt2"
encode decoder: "facebook/bart-large-cnn" "sshleifer/distilbart-cnn-12-6" "philschmid/bart-large-cnn-samsum"