T5 model: the inference results are wrong when batch size > 1 #1847
Comments
It seems that you're using a very outdated version (0.7.0). Could you please try the latest main branch code?
@nv-guomingz Yes, due to historical reasons, we developed on version 0.7.0. We have not seen anyone report this issue in the issues section, and it is possible that this problem still exists in the newer versions. Therefore, we hope to address this problem in version 0.7.0. Have you ever encountered a similar issue?
For my part, I can't recall such an issue for T5 on the 0.7.0 version. Would you please provide step-by-step instructions for reproducing it? I still suggest you try the latest release wheel instead of 0.7.0 to see whether the issue still exists. If it does, we'll file a bug for internal tracking and investigation.
@nv-guomingz Just now we used TensorRT-LLM version 0.9.0 and converted T5-large using the official example (example/enc_dec/). First, we followed the official example to convert the weights to float16.
Could you try the latest version, TRT-LLM 0.11+?
Hi @0xd8b @1096125073, can you please provide your TRT-LLM version, runtime type (Python or pybind of C++), model name, TP/PP setup, beam search settings, and reproducible input examples (English preferred)? Because on our end we weren't seeing any issue with BS > 1.
Sorry, I did not provide detailed information earlier. Here is the related information:
--world_size=1
The BERT plugin has a parameter relative_attention_bias: Tensor = None. We passed a pre-constructed relative_attention_bias with the shape [num_heads, max_seq_len, max_seq_len], which resulted in the attention calculation outputting all zeros on even layers. This issue does not occur in the T5 example in example/enc_dec because that example uses the implicit mode.
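For reference, a minimal sketch of how such an explicit table can be precomputed from the Hugging Face T5 weights (this is an illustration, not the exact code used here; t5-large and max_seq_len=512 are placeholder choices):

```python
import torch
from transformers import T5ForConditionalGeneration

# Illustrative values; adjust to your model and sequence length.
model = T5ForConditionalGeneration.from_pretrained("t5-large")
max_seq_len = 512

# In HF T5, only the first self-attention layer of each stack owns the
# relative-attention-bias embedding; the other layers reuse it.
self_attn = model.encoder.block[0].layer[0].SelfAttention

with torch.no_grad():
    # compute_bias returns shape (1, num_heads, query_len, key_len)
    bias = self_attn.compute_bias(max_seq_len, max_seq_len).squeeze(0)

print(bias.shape)  # torch.Size([16, 512, 512]) for t5-large
```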
This makes more sense. Have you tried running the T5 example without the implicit mode? We have tested this before in earlier versions. If you can reproduce it in T5 explicit mode, please let us know.
Yes. But when I tried on four A100 cards, the result was correct (TP=4, PP=1).
@1096125073 your case is a different issue. Actually it is EXPECTED. For the pybind of the C++ runtime, we don't support PP yet; only TP is supported, because we haven't seen many use cases of enc-dec models using PP for deployment. If this is really needed for your case, would you mind opening a new issue and raising this feature request?
Hi, thanks a lot, I will open a new issue later.
I've found the problem; it's due to the data type of the operation data. In the encoder and decoder, there is a parameter called encoder_input_lengths. The documentation of the function did not specify the exact data type required for this parameter, so we did not pay attention to it. We constructed this variable as encoder_input_lengths = torch.sum(attention_mask, dim=-1), but the default data type for torch.sum is torch.int64, while TensorRT requires torch.int32. This data type issue is not a problem when the batch_size is 1, but it causes the phenomenon I described earlier when the batch_size is greater than 1.
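A minimal sketch of the fix described above (the attention_mask values are placeholders for illustration):

```python
import torch

# attention_mask: [batch_size, seq_len], 1 for real tokens, 0 for padding.
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]])

# torch.sum promotes to int64 by default, but the engine expects int32 here;
# casting explicitly avoids the silent mismatch that only shows up for batch_size > 1.
encoder_input_lengths = torch.sum(attention_mask, dim=-1).to(torch.int32)

print(encoder_input_lengths)        # tensor([3, 2], dtype=torch.int32)
print(encoder_input_lengths.dtype)  # torch.int32
```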
Hi, I tried run.py in py_session and the result was correct, but why is the Triton backend incorrect? Are there any parameters that can be adjusted?
Hi @1096125073, it's because PP is only supported in the Python runtime at this point. There are several runtime choices for enc-dec: (1) the pure Python runtime (py_session), (2) the pybind of the C++ runtime, and (3) the Triton backend, which uses the C++ runtime.
Current status is: (1) supports TP + PP; (2) and (3) are the same and only support TP, due to the reason mentioned above (not many PP use cases for enc-dec, so it's on the roadmap but not at high priority). Does this help? By the way, do you mean ...? Again, feel free to open a new issue regarding the enc-dec C++ runtime for PP, and explain your need for PP there so we can prioritize accordingly.
System Info
A100, TensorRT-LLM 0.7.0
Who can help?
@byshiue @sy
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Converted the T5-large model according to the official example, using GPT and BERT plugins with float16 precision. Inference works correctly when batch size is 1.
When batch size > 1, e.g., batch size = 4, we observed that the self-attention results are correct for odd batch indices, but the output of self-attention is all zeros for even batch indices.
We debugged the decoder separately (using the T5 encoder output from HF as the input for the decoder) and found that self-attention in the decoder works correctly. However, in the cross-attention, the results are correct for odd batch indices, but the output is all zeros for even batch indices.
The above phenomenon only occurs when using the BERT and GPT plugins; it does not occur in plain TensorRT mode.
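For completeness, a hypothetical helper along these lines can be used to check which batch entries of an attention output are all zeros (it is not part of the example scripts):

```python
import torch

def zero_batch_indices(attn_out: torch.Tensor) -> list[int]:
    """Return the batch indices whose attention output is entirely zeros.

    attn_out is expected to have shape [batch_size, seq_len, hidden_size].
    """
    return [b for b in range(attn_out.shape[0])
            if torch.count_nonzero(attn_out[b]).item() == 0]

# Example: batch entries 0 and 2 zeroed out, matching the even-index pattern above.
out = torch.randn(4, 8, 16)
out[0].zero_()
out[2].zero_()
print(zero_batch_indices(out))  # [0, 2]
```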
Expected behavior
When the batch size is greater than 1, the T5 model built with the BERT and GPT plugins should produce the same correct attention outputs as it does with batch size 1, with no batch entries zeroed out.
Actual behavior
When the batch size is greater than 1, using the BERT and GPT plugins in the T5 model shows significant abnormalities: the attention output for certain batch indices is entirely zeros.
Additional notes
When the batch size is greater than 1, inference in the T5 family models behaves abnormally.