The cross attention in the gpt_attention_plugin seems not to be working #500
Comments
Hi @robosina, we've noticed a very similar issue to the one you described above. Our test cases were on t5-small with a task prompt prefix such as [...]. Noticeably, this doesn't happen in [...]. Do you mind sharing which model you tested on?
@robosina we just fixed a similar problem locally to the one you posted here. To minimize the turnaround time, would you mind manually applying the change to this line:
```cpp
const int cyclic_kv_cache_length = isCrossAttention() ? max_encoder_context_len : reinterpret_cast<const int*>(inputs[getHostMaxKvCacheLengthIdx()])[0];
```
The issue was that we set the [...]. Once you have verified the fix, please let us know and close the issue.
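For readers following along, here is a minimal before/after sketch of what the patch changes. The patched line is quoted above; the "before" form is an assumption about the unpatched code, and the variable and helper names are taken directly from that line:

```cpp
// Assumed unpatched behavior: the cyclic KV cache length was always read from the
// decoder-side host tensor, even for cross attention.
// const int cyclic_kv_cache_length =
//     reinterpret_cast<const int*>(inputs[getHostMaxKvCacheLengthIdx()])[0];

// Patched: in cross attention the KV cache holds the encoder outputs, so its length
// must be the encoder context length, not the decoder's (cyclic) KV cache length.
const int cyclic_kv_cache_length = isCrossAttention()
    ? max_encoder_context_len
    : reinterpret_cast<const int*>(inputs[getHostMaxKvCacheLengthIdx()])[0];
```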
@symphonylyh, Thanks, it was a good catch. Yes, this was the issue, and it's fixed now after patching.
:)) It's fine, we all have such minor forgetfulness.
FYI, I've just merged @symphonylyh's fix into our main internal development branch. It'll be published to our main/dev branch on GitHub in a few hours.
@jdemouth-nvidia @symphonylyh Thanks for the great support.
Hi @symphonylyh, I just wanted to provide an update. After testing, I found that when using beam search, token generation stops after the 44th token. Until that point, the logits I receive are almost identical to those in PyTorch. However, at the 44th step, the second beam generates strange logits, which I believe are causing the halt in token generation. This issue does not occur with greedy search, where generation works fine. Could you please take a look at this? I am trying to see if I can provide something reproducible, but before doing so, I wanted to know if it's possible to briefly check the relevant code or maybe get a hint.
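One way to localize a divergence like this, for anyone debugging a similar case, is to dump the per-step, per-beam logits from both runtimes and compare them step by step. A hypothetical C++ helper (not part of TensorRT-LLM; the [numBeams, vocabSize] row-major buffer layout is an assumption) could look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical debugging helper: report the maximum absolute difference between the
// logits produced by two runtimes at a given generation step, separately per beam.
// Assumes both buffers hold [numBeams, vocabSize] floats in row-major order.
void compareStepLogits(const std::vector<float>& trtLogits,
                       const std::vector<float>& torchLogits,
                       int step, int numBeams, int vocabSize)
{
    for (int beam = 0; beam < numBeams; ++beam)
    {
        float maxAbsDiff = 0.0f;
        for (int v = 0; v < vocabSize; ++v)
        {
            const int idx = beam * vocabSize + v;
            maxAbsDiff = std::max(maxAbsDiff, std::fabs(trtLogits[idx] - torchLogits[idx]));
        }
        std::printf("step %d, beam %d: max |TRT-LLM - PyTorch| = %g\n", step, beam, maxAbsDiff);
    }
}
```

The first step at which the second beam's max difference jumps should point at the offending generation step.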
@robosina, your case probably falls under one of the known issues here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#reminders, i.e., when you have batched, ragged input (with input padding removed) + beam search, the generation has some problems and we're still investigating that. In the meantime, you can use padded input and disable --remove_input_padding as a workaround.
@symphonylyh, thank you. However, I am not testing it in batch mode since my batch size is one for now. Additionally, I don't have ragged input, and I am not using the --remove_input_padding option either. Therefore, it might be a new issue, right? It doesn't happen with all inputs, but in about 40% of cases it behaves strangely. Thank you again.
@robosina I see, then it's a different issue. So it happens with batch_size = 1 + beam search. I haven't seen such an issue before. It would be appreciated if you could provide a minimal reproducible example input and the model name. Thanks
@symphonylyh Is it possible to share this privately via email or through a similar method? If so, could you please provide me with your email? Alternatively, my email is [email protected]. Thanks
@robosina it’s not possible for us to accept proprietary models without having a proper NDA in place. If you could open source your code with a permissive license, it’d be a lot easier. We don’t need you to provide us with the full code, just a way to reproduce the problem. Thanks.
Hi @jdemouth-nvidia, Thank you. |
@robosina @aramfaghfouri did you have an NVIDIA contact when you signed the NDA? Is it possible for you to get connected to that contact and ask them to route the request to the TRT-LLM team? That should be the right way to proceed without disclosing your organization and model information on GitHub. By the way, I recently found a fix that may or may not be relevant to your model (does your model have the same rescale_before_lm_head as T5?): #474 (comment)
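For readers unfamiliar with rescale_before_lm_head: in Hugging Face's T5 (when word embeddings are tied), the decoder hidden states are scaled by d_model^-0.5 just before the LM head, and a converted model that omits this scaling produces logits that are off by a constant factor. A rough illustrative sketch of the operation (names are illustrative, not TensorRT-LLM API):

```cpp
#include <cmath>
#include <vector>

// Illustrative only: T5-style rescaling of decoder hidden states by d_model^-0.5
// immediately before the LM head projection (applied when embeddings are tied).
void rescaleBeforeLmHead(std::vector<float>& hiddenStates, int dModel)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(dModel));
    for (float& h : hiddenStates)
    {
        h *= scale;
    }
}
```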
Regarding the issue described at #466, I am experiencing a discrepancy after the first token generation (the first token is generated correctly). However, after that, the logits of the next token change slightly due to cross attention. I would like to know if there is any possible way to determine what is indirectly affecting the cross attention output. I need to mention the following scenario:
I suspect there might be an issue with kv caching.
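One way to test that suspicion is to checksum the cross-attention KV cache after the context phase and again at each generation step: for cross attention, the cache is filled once from the encoder output and should not change afterwards. A hypothetical debugging sketch (not existing TensorRT-LLM code; the host-side copy of the cache is an assumption about how you instrument your run):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical debugging aid: a simple FNV-1a checksum over a host copy of the
// cross-attention KV cache. If the checksum changes between generation steps,
// something is overwriting a cache that should stay constant after the context phase.
uint64_t fnv1a(const uint8_t* data, size_t numBytes)
{
    uint64_t hash = 1469598103934665603ull;
    for (size_t i = 0; i < numBytes; ++i)
    {
        hash ^= data[i];
        hash *= 1099511628211ull;
    }
    return hash;
}

// Example usage at each step (crossKvHostCopy and crossKvNumBytes come from your
// own instrumentation):
// std::printf("step %d cross-KV checksum: %llx\n", step,
//             (unsigned long long) fnv1a(crossKvHostCopy, crossKvNumBytes));
```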