The cross attention in the gpt_attention_plugin seems not to be working #500
Comments
Hi @robosina, we've noticed a very similar issue to the one you described above. Our test cases were on t5-small with a task prompt prefix such as [...]. Noticeably, this doesn't happen in [...]. Do you mind sharing which model you tested on?
@robosina we just fixed a similar problem locally to the one you posted here. To minimize the turnaround time, would you mind manually applying the change to this line:
```cpp
const int cyclic_kv_cache_length = isCrossAttention() ? max_encoder_context_len : reinterpret_cast<const int*>(inputs[getHostMaxKvCacheLengthIdx()])[0];
```
The issue was that we set the [...]. Once you have verified the fix, please let us know and close the issue.
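For readers following along, here is a minimal before/after sketch of what the patch changes. The patched line is quoted above; the "before" form is an assumption about the unpatched code, and the variable and helper names are taken directly from that line:

```cpp
// Assumed unpatched behavior: the cyclic KV cache length was always read from the
// decoder-side host tensor, even for cross attention.
// const int cyclic_kv_cache_length =
//     reinterpret_cast<const int*>(inputs[getHostMaxKvCacheLengthIdx()])[0];

// Patched: in cross attention the KV cache holds the encoder outputs, so its length
// must be the encoder context length, not the decoder's (cyclic) KV cache length.
const int cyclic_kv_cache_length = isCrossAttention()
    ? max_encoder_context_len
    : reinterpret_cast<const int*>(inputs[getHostMaxKvCacheLengthIdx()])[0];
```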
@symphonylyh, Thanks, it was a good catch. Yes, this was the issue, and it's fixed now after patching.
:)) It's fine, we all have such minor forgetfulness.
FYI, I've just merged @symphonylyh's fix into our main internal development branch. It'll be published to our main/dev branch on GitHub in a few hours.
@jdemouth-nvidia @symphonylyh Thanks for the great support.
Hi @symphonylyh, I just wanted to provide an update. After testing, I found that when using beam search, token generation stops after the 44th token. Until that point, the logits I receive are almost identical to those in PyTorch. However, at the 44th step, the second beam generates strange logits, which I believe are causing the halt in token generation. This issue does not occur with greedy search, where generation works fine. Could you please take a look at this? I am trying to see if I can provide something reproducible, but before doing so, I wanted to know if it's possible to briefly check the relevant code or maybe get a hint.
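One way to localize a divergence like this, for anyone debugging a similar case, is to dump the per-step, per-beam logits from both runtimes and compare them step by step. A hypothetical C++ helper (not part of TensorRT-LLM; the [numBeams, vocabSize] row-major buffer layout is an assumption) could look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical debugging helper: report the maximum absolute difference between the
// logits produced by two runtimes at a given generation step, separately per beam.
// Assumes both buffers hold [numBeams, vocabSize] floats in row-major order.
void compareStepLogits(const std::vector<float>& trtLogits,
                       const std::vector<float>& torchLogits,
                       int step, int numBeams, int vocabSize)
{
    for (int beam = 0; beam < numBeams; ++beam)
    {
        float maxAbsDiff = 0.0f;
        for (int v = 0; v < vocabSize; ++v)
        {
            const int idx = beam * vocabSize + v;
            maxAbsDiff = std::max(maxAbsDiff, std::fabs(trtLogits[idx] - torchLogits[idx]));
        }
        std::printf("step %d, beam %d: max |TRT-LLM - PyTorch| = %g\n", step, beam, maxAbsDiff);
    }
}
```

The first step at which the second beam's max difference jumps should point at the offending generation step.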
@robosina, your case probably falls under one of the known issues here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#reminders, i.e., when you have batched, ragged input (with input padding removed) + beam search, the generation has some problems and we're still investigating that. In the meantime, you can use padded input and disable --remove_input_padding as a workaround.
@symphonylyh, thank you. However, I am not testing it in batch mode since my batch size is one for now. Additionally, I don't have ragged input, and I am not using the --remove_input_padding option either. Therefore, it might be a new issue, right? It doesn't happen with all inputs, but in about 40% of cases it behaves strangely. Thank you again.
@robosina I see, then it's a different issue. So it happens with batch_size = 1 + beam search. I haven't seen such an issue before. It would be appreciated if you could provide a minimal reproducible example input and the model name. Thanks
@symphonylyh Is it possible to share this privately via email or through a similar method? If so, could you please provide me with your email? Alternatively, my email is [email protected]. Thanks
@robosina it’s not possible for us to accept proprietary models without having a proper NDA in place. If you could open source your code with a permissive license, it’d be a lot easier. We don’t need you to provide us with the full code, just a way to reproduce the problem. Thanks.
Hi @jdemouth-nvidia, Thank you. |
@robosina @aramfaghfouri did you have an NVIDIA contact when you signed the NDA? Is it possible for you to get connected to that contact and ask them to route the request to the TRT-LLM team? That should be the right way to proceed without disclosing your organization and model information on GitHub. By the way, I recently found a fix that may or may not be relevant to your model (does your model have the same rescale_before_lm_head as T5?): #474 (comment)
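For readers unfamiliar with rescale_before_lm_head: in Hugging Face's T5 (when word embeddings are tied), the decoder hidden states are scaled by d_model^-0.5 just before the LM head, and a converted model that omits this scaling produces logits that are off by a constant factor. A rough illustrative sketch of the operation (names are illustrative, not TensorRT-LLM API):

```cpp
#include <cmath>
#include <vector>

// Illustrative only: T5-style rescaling of decoder hidden states by d_model^-0.5
// immediately before the LM head projection (applied when embeddings are tied).
void rescaleBeforeLmHead(std::vector<float>& hiddenStates, int dModel)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(dModel));
    for (float& h : hiddenStates)
    {
        h *= scale;
    }
}
```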
Regarding the issue described at #466, I am experiencing a discrepancy after the first token generation (the first token is generated correctly). However, after that, the logits of the next token change slightly due to cross attention. I would like to know if there is any possible way to determine what is indirectly affecting the cross attention output. I need to mention the following scenario:
I suspect there might be an issue with kv caching.
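One way to test that suspicion is to checksum the cross-attention KV cache after the context phase and again at each generation step: for cross attention, the cache is filled once from the encoder output and should not change afterwards. A hypothetical debugging sketch (not existing TensorRT-LLM code; the host-side copy of the cache is an assumption about how you instrument your run):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical debugging aid: a simple FNV-1a checksum over a host copy of the
// cross-attention KV cache. If the checksum changes between generation steps,
// something is overwriting a cache that should stay constant after the context phase.
uint64_t fnv1a(const uint8_t* data, size_t numBytes)
{
    uint64_t hash = 1469598103934665603ull;
    for (size_t i = 0; i < numBytes; ++i)
    {
        hash ^= data[i];
        hash *= 1099511628211ull;
    }
    return hash;
}

// Example usage at each step (crossKvHostCopy and crossKvNumBytes come from your
// own instrumentation):
// std::printf("step %d cross-KV checksum: %llx\n", step,
//             (unsigned long long) fnv1a(crossKvHostCopy, crossKvNumBytes));
```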