
The cross attention in the gpt_attention_plugin seems not to be working #500


Closed
robosina opened this issue Nov 29, 2023 · 13 comments

@robosina

robosina commented Nov 29, 2023

Regarding the issue described in #466, I am seeing a discrepancy after the first generated token (the first token is generated correctly). After that, however, the logits for the following tokens change slightly because of the cross attention. Is there any way to determine what is indirectly affecting the cross attention output? To be concrete:

  • If I have 3 input tokens, the logits for the 4th token are correct, but after that, the logits are not correct.
  • Now, suppose I pass those 4 tokens; the logits for the 5th token are exactly correct, but the 6th is not.
  • and so on.

I suspect there might be an issue with kv caching.
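
To show concretely how I am checking this, here is a rough sketch (the model name, input text, and the dumped TensorRT-LLM logits file are placeholders, not my actual setup): I regenerate the reference logits step by step in PyTorch and compare them with the per-step logits dumped from the TensorRT-LLM run.

```python
# Rough sketch of the per-step logits check (illustrative only).
# "t5-small" and "trtllm_step_logits.npy" are placeholders.
import numpy as np
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

enc = tok("translate English to German: The house is wonderful.", return_tensors="pt")
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])

# Hypothetical dump: one row of vocab-size logits per decoding step.
trt_logits = np.load("trtllm_step_logits.npy")

with torch.no_grad():
    for step in range(trt_logits.shape[0]):
        out = model(input_ids=enc.input_ids,
                    attention_mask=enc.attention_mask,
                    decoder_input_ids=decoder_ids)
        ref = out.logits[0, -1]  # reference logits for the next token
        diff = (ref - torch.from_numpy(trt_logits[step]).float()).abs().max().item()
        print(f"step {step}: max |ref - trt| = {diff:.5f}")
        # extend the decoder prefix with the greedy reference token
        decoder_ids = torch.cat([decoder_ids, ref.argmax().view(1, 1)], dim=1)
```

With this check, only the first step matches closely; every later step shows the small deviation described above.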

@symphonylyh symphonylyh self-assigned this Nov 30, 2023
@symphonylyh
Collaborator

Hi @robosina, we've noticed a very similar issue to the one you described above. Our test cases were on t5-small with task prompt prefixes such as translate English to German: and summarize:; for those cases the issue didn't manifest, probably because the model handles the prompt stably. But when we test other models such as BART, or use non-prompt text, we observe the step N+1 cross attention output deviation too.

Notably, this doesn't happen in release/0.5.0. So there must be a subtle bug, and we're locating it among all the changes/features since release/0.5.0. Thanks, and we'll update you once we fix it.

Do you mind sharing which model you tested on?

@symphonylyh
Collaborator

@robosina we just fixed a similar problem to the one you posted here locally. To minimize the turnaround time, would you mind manually applying the change to this line:

Before:
const int cyclic_kv_cache_length = reinterpret_cast<const int*>(inputs[getHostMaxKvCacheLengthIdx()])[0];

After:
const int cyclic_kv_cache_length = isCrossAttention() ? max_encoder_context_len : reinterpret_cast<const int*>(inputs[getHostMaxKvCacheLengthIdx()])[0];

The issue was that we set max_kv_cache_length correctly but didn't do the same for cyclic_kv_cache_length :(
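
To illustrate the idea in isolation, here is a toy Python sketch (illustrative only, not the plugin code): for cross attention, the length that bounds the KV cache has to come from the encoder side, not from the decoder's cyclic KV cache setting.

```python
# Toy sketch of the idea behind the fix (not the plugin code): cross attention
# always attends over the fixed-length encoder output, so its KV cache bound is
# the encoder context length, while self attention uses the decoder-side value.
def effective_kv_cache_length(is_cross_attention: bool,
                              max_encoder_context_len: int,
                              host_max_kv_cache_length: int) -> int:
    if is_cross_attention:
        return max_encoder_context_len
    return host_max_kv_cache_length
```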

Once you've verified it, please let us know and close the issue.

@robosina
Author

robosina commented Dec 1, 2023

@symphonylyh, thanks, that was a good catch. Yes, this was the issue, and it's fixed now after applying the patch.

> The issue was that we set max_kv_cache_length correctly but didn't do the same for cyclic_kv_cache_length :(

:)) It's fine, we all make these small slips.

@robosina robosina closed this as completed Dec 1, 2023
@jdemouth-nvidia
Collaborator

FYI, I've just merged @symphonylyh 's fix into our main internal development branch. It'll be published to our main/dev branch on GitHub in a few hours.

@robosina
Author

robosina commented Dec 1, 2023

@jdemouth-nvidia @symphonylyh Thanks for the great support.

@robosina
Author

robosina commented Dec 2, 2023

Hi @symphonylyh, I just wanted to provide an update. After further testing, I found that when using beam search, token generation stops after the 44th token. Up to that point, the logits I receive are almost identical to those from PyTorch, but at the 44th step the second beam produces strange logits, which I believe is what halts the generation. This issue does not occur with greedy search, where generation works fine. Could you please take a look at this? I am trying to put together something reproducible; before doing so, I wanted to ask whether you could briefly check the relevant core code or give me a hint.
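
For what it's worth, here is a rough sketch of how I build the PyTorch beam-search reference that I compare against (the model name and input text are placeholders, not my actual setup):

```python
# Rough sketch (placeholders for model and input): dump per-step beam-search
# scores from the PyTorch reference to see where the second beam diverges.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()
enc = tok("summarize: studies have shown that owning a dog is good for you.",
          return_tensors="pt")

with torch.no_grad():
    out = model.generate(**enc, num_beams=2, num_return_sequences=2,
                         max_new_tokens=64, output_scores=True,
                         return_dict_in_generate=True)

# out.scores has one (batch_size * num_beams, vocab_size) tensor per step;
# around step 43 (the 44th token) is where my TensorRT-LLM run misbehaves.
for step, scores in enumerate(out.scores):
    print(step, scores.topk(3, dim=-1).indices.tolist())
```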

@symphonylyh
Collaborator

@robosina, probably your case falls under one of the known issues listed here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#reminders. That is, when you have batched, ragged input (i.e., with --remove_input_padding) plus beam search, generation has some problems and we're still investigating that. In the meantime, you can pad the inputs and disable --remove_input_padding as a workaround.
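
As a rough sketch of the input side of that workaround (the tokenizer and texts below are placeholders), padding the batch on the tokenizer side keeps the input rectangular, which is what the non-remove_input_padding path expects:

```python
# Rough sketch (placeholder tokenizer and texts): keep padding in the batch so
# the runtime receives rectangular input_ids plus an attention mask, and build
# the engine without --remove_input_padding.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
texts = ["translate English to German: Hello world.",
         "summarize: TensorRT-LLM accelerates LLM inference on NVIDIA GPUs."]

batch = tok(texts, padding=True, return_tensors="pt")  # pad to the longest sequence
print(batch["input_ids"].shape, batch["attention_mask"].shape)
```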

@robosina
Author

robosina commented Dec 5, 2023

@symphonylyh, thank you. However, I am not testing it in batch mode since my batch size is one for now. Additionally, I don't have ragged input, and I am not using the --remove_input_padding option either. Therefore, it might be a new issue, right? It doesn't happen with all inputs, but in about 40% of the cases, it behaves strangely. Thank you again.

@symphonylyh
Collaborator

symphonylyh commented Dec 5, 2023

@robosina I see, then it's a different issue. So it happens with batch_size = 1 plus beam search. I haven't seen such an issue before. It would be appreciated if you could provide a minimal reproducible example input and the model name. Thanks

@robosina
Author

robosina commented Dec 5, 2023

@symphonylyh Is it possible to share this privately via email or a similar channel? If so, could you please provide your email? Alternatively, mine is [email protected]. Thanks

@jdemouth-nvidia
Collaborator

@robosina it’s not possible for us to accept proprietary models without having a proper NDA in place. If you could open source your code with a permissive license, it’d be a lot easier. We don’t need you to provide us with the full code, just a way to reproduce the problem. Thanks.

@aramfaghfouri

Hi @jdemouth-nvidia,
We have an NDA with Nvidia. Please let us know how we can proceed.

Thank you.

@symphonylyh
Collaborator

symphonylyh commented Dec 13, 2023

@robosina @aramfaghfouri did you have an NVIDIA contact when you signed the NDA? Could you get in touch with that contact and ask them to route the request to the TRT-LLM team? That should be the right way to proceed without disclosing your organization and model information on GitHub.

btw, I recently found a fix that may or may not be relevant to your model (does your model have the same rescale_before_lm_head as T5?): #474 (comment)
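
For reference, by rescale_before_lm_head I mean the scaling T5 applies in its HuggingFace implementation; a sketch of that behavior (so you can check whether your model has the equivalent):

```python
# Sketch of the rescale_before_lm_head behavior as in HuggingFace T5: with tied
# word embeddings, decoder hidden states are scaled by d_model ** -0.5 right
# before the lm_head projection.
import torch

def lm_head_logits(decoder_hidden: torch.Tensor,
                   lm_head_weight: torch.Tensor,
                   d_model: int,
                   tie_word_embeddings: bool = True) -> torch.Tensor:
    if tie_word_embeddings:
        decoder_hidden = decoder_hidden * (d_model ** -0.5)
    return torch.matmul(decoder_hidden, lm_head_weight.t())
```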
