
T5 model: the inference results are wrong when batch size > 1 #1847

Closed
0xd8b opened this issue Jun 26, 2024 · 17 comments
Labels: bug (Something isn't working), Investigating

@0xd8b commented Jun 26, 2024

System Info

A100, TensorRT-LLM 0.7.0

Who can help?

@byshiue @sy

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Converted the T5-large model according to the official example, using GPT and BERT plugins with float16 precision. Inference works correctly when batch size is 1.

  2. When batch size > 1, e.g., batch size = 4, we observed that the self-attention results are correct for odd batch indices, but the output of self-attention is all zeros for even batch indices.

  3. We debugged the decoder separately (using the T5 encoder output from HF as the input for the decoder) and found that self-attention in the decoder works correctly. However, in the cross-attention, the results are correct for odd batch indices, but the output is all zeros for even batch indices.

The above phenomenon only occurs when using the BERT and GPT plugins; it does not occur in plain TensorRT mode.
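
For reference, a minimal sketch of the kind of per-batch check described above, assuming the attention output has been dumped to a tensor of shape [batch_size, seq_len, hidden_size] (the tensor name and placeholder data are illustrative, not part of the original report):

```python
import torch

# Debug-dumped attention output, shape [batch_size, seq_len, hidden_size];
# filled with placeholder data here just to make the snippet runnable.
attn_out = torch.zeros(4, 545, 1024, dtype=torch.float16)

# Report which batch indices came back entirely zero.
for b in range(attn_out.shape[0]):
    all_zero = bool(torch.all(attn_out[b] == 0))
    print(f"batch index {b}: {'all zeros' if all_zero else 'non-zero output'}")
```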

Expected behavior

When the batch size is greater than 1, inference with the BERT and GPT plugins in the T5 model should produce correct results for every batch index, consistent with the batch size = 1 outputs.

Actual behavior

When the batch size is greater than 1, using the BERT and GPT plugins in the T5 model shows significant abnormalities, where the attention output at certain batch indices is entirely zeros.

Additional notes

When the batch size is greater than 1, inference in the T5 family models behaves abnormally.

0xd8b added the bug label Jun 26, 2024
@nv-guomingz (Collaborator)

It seems that you're using a very outdated version (0.7.0); could you please try the latest main branch code?

@0xd8b (Author) commented Jun 26, 2024

@nv-guomingz Yes, due to historical reasons, we developed on version 0.7.0. We have not seen anyone report this issue in the issues section, and it is possible that this problem still exists in the newer versions. Therefore, we hope to address this problem in version 0.7.0. Have you ever encountered a similar issue?

@nv-guomingz (Collaborator) commented Jun 26, 2024

I can't recall such an issue with T5 on version 0.7.0.

Would you please provide step-by-step instructions for reproducing the issue?

I still suggest you try the latest release wheel instead of 0.7.0 to see whether the issue still exists.

If it does, we'll file a bug for internal tracking and investigation.

@0xd8b (Author) commented Jun 26, 2024

@nv-guomingz Just now, we used TensorRT-LLM version 0.9.0 and converted T5-large using the official example (example/enc_dec/).

  1. First, we followed the official example to convert the weights to float16.
  2. Then, we used build.py to build the engine with batch_size=4, using the GPT plugin and the BERT plugin, with float16 precision, keeping everything else consistent with the official example.
  3. Finally, we used run.py for inference. The inference results are abnormal at even batch indices and correct at odd batch indices. We have identified that the self-attention output in the BERT plugin is abnormal. In the GPT plugin, the self-attention output is normal, but the cross-attention output is abnormal, with the values at even batch indices being all zeros. This might be a bug in the plugins.

@hijkzzz (Collaborator) commented Jun 27, 2024

Could you try the latest version, TensorRT-LLM 0.11+?
See the tutorial: https://nvidia.github.io/TensorRT-LLM/installation/linux.html

@1096125073

I have the same issue when using the gpt_attention plugin.
6x A10, TP=2 PP=3.
[screenshot of outputs]

@symphonylyh (Collaborator) commented Jun 28, 2024

Hi @0xd8b @1096125073, can you please provide your TRT-LLM version, runtime type (Python or pybind of C++), model name, TP/PP setup, beam search, and reproducible input examples (English preferred)? On our end we weren't seeing any issue with BS > 1.
Example:
0.10.0, pybind of C++, google/t5-large, TP=1 PP=1, no beam search, ["xxx", "yyy", "zzz"]
And if you can reproduce your issue with TP=1 PP=1, please provide an example under this config -- it's easier to debug.

@0xd8b (Author) commented Jun 28, 2024

Sorry, I did not provide detailed information earlier. Here is the related information:

  1. First, we fine-tuned the t5-large network without changing the decoder's architecture.

  2. We used float32 precision during training and float16 precision during engine conversion.

  3. The engine conversion configurations are as follows:

    tensorrt_llm versions: 0.7.0 and 0.9.0

--world_size=1
--tp_size=1
--pp_size=1
--gpus_per_node=8
--parallel_build=False
--weight_from_pytorch_ckpt=False
--engine_name="t5-small"
--debug_mode=False
--timing_cache="model.cache"
--model_type="t5"
--dtype="float16"
--logits_dtype="float16"
--log_level="info"
--max_batch_size=4
--max_encoder_input_len=1500
--max_decoder_input_len=1
--max_output_len=200
--max_beam_width=1
--use_bert_attention_plugin="float16"
--use_gpt_attention_plugin="float16"
--use_gemm_plugin="float16"
--use_layernorm_plugin=False
--use_rmsnorm_plugin=False
--use_lookup_plugin=False
--enable_qk_half_accum=False
--builder_opt=None
--remove_input_padding=False
--random_seed=None
--use_parallel_embedding=False
--embedding_sharding_dim=0
--use_custom_all_reduce=False
--strongly_typed=True
--gather_all_token_logits=False

  4. We constructed an encoder_output with the shape [1, 545, 1024], using float16, and repeated it four times along the batch dimension, resulting in an encoder_output with the shape [4, 545, 1024] as input for the decoder. The final decoder output predictions for batch=0 and batch=2 were the same, and batch=1 and batch=3 were the same. As mentioned above, the attention layer output is all zeros at even batch indices. Additionally, we modified the C++ code mentioned here in versions 0.7.0 and 0.9.0: Flan t5 xxl result large difference #1343.
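
For clarity, a minimal sketch of the decoder-input construction described in item 4, assuming the [1, 545, 1024] encoder output comes from the HF T5 encoder (variable names are illustrative):

```python
import torch

# Single-sample encoder output from the HF T5 encoder: [1, 545, 1024], float16.
encoder_output = torch.randn(1, 545, 1024, dtype=torch.float16)

# Repeat along the batch dimension to get [4, 545, 1024] as decoder input.
# With identical inputs, all four decoder outputs should match; in our case
# only batch 0/2 and batch 1/3 matched each other.
batched_encoder_output = encoder_output.repeat(4, 1, 1)
```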

@0xd8b (Author) commented Jun 30, 2024

The BERT plugin has a parameter relative_attention_bias: Tensor = None. The relative attention bias can have the shape [num_heads, max_seq_len, max_seq_len], or, for implicit mode, it is the relative attention embedding table with shape [num_heads, num_buckets].

We passed a pre-constructed relative_attention_bias with the shape [num_heads, max_seq_len, max_seq_len], which resulted in the attention calculation outputting all zeros at even batch indices. This issue does not occur in the T5 example in example/enc because it uses the implicit mode.
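
To illustrate the difference between the two modes, here is a minimal sketch with made-up values for num_heads, num_buckets, and max_seq_len (this is not the plugin's internal code, just the two tensor layouts described above):

```python
import torch

num_heads, num_buckets, max_seq_len = 16, 32, 545

# Implicit mode (used by the official T5 example): pass the relative attention
# embedding table of shape [num_heads, num_buckets] and let the plugin derive
# the bias itself.
relative_attention_table = torch.zeros(num_heads, num_buckets, dtype=torch.float16)

# Explicit mode (what we did): pass a pre-computed bias of shape
# [num_heads, max_seq_len, max_seq_len]. This is the path that produced the
# all-zero attention outputs in our runs.
relative_attention_bias = torch.zeros(num_heads, max_seq_len, max_seq_len, dtype=torch.float16)
```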

@symphonylyh (Collaborator)

This makes more sense. Have you tried running the T5 example without the implicit mode? We tested this before in earlier versions. If you can reproduce it with T5 in explicit mode, please let us know.

@1096125073

Yes:
0.9.0, pybind of C++, private model similar to LLaMA, using the gpt_attention plugin, TP=2 PP=3, no beam search.
For example, with the input "how are you?" and batch size 4, the outputs are:
[screenshot of outputs]

But when I tried on four A100 cards (TP=4 PP=1), the result was correct.
This is so weird.

@symphonylyh (Collaborator)

@1096125073 your case is a different issue. Actually, it is EXPECTED. The pybind of C++ runtime doesn't support PP yet, only TP, because we haven't seen many use cases of enc-dec models deployed with PP.

If this is really needed for your case, would you mind opening a new issue and raising it as a feature request?

@1096125073

Hi, thanks a lot, I will open a new issue later.
Actually, I first noticed the inconsistent output with the Triton backend, before using run.py.

@0xd8b (Author) commented Jul 1, 2024

I've found the problem; it's due to the data type of the input data. In the encoder and decoder there is a parameter called encoder_input_lengths. The function documentation did not specify the exact data type required for this parameter, so we did not pay attention to it. We constructed this variable using encoder_input_lengths = torch.sum(attention_mask, dim=-1), but the default output data type of torch.sum is torch.int64, whereas TensorRT requires torch.int32. This data type mismatch is not a problem when batch_size is 1, but it causes the phenomenon I described earlier when batch_size is greater than 1.
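
A minimal sketch of the fix described above; the explicit cast to torch.int32 is the key change (attention_mask here is just a placeholder):

```python
import torch

# Placeholder attention mask: [batch_size, max_encoder_input_len], 1 = real token, 0 = padding.
attention_mask = torch.ones(4, 1500, dtype=torch.int32)

# torch.sum on integer tensors returns int64 by default, which silently breaks
# inference for batch_size > 1; TensorRT expects int32 for encoder_input_lengths.
encoder_input_lengths = torch.sum(attention_mask, dim=-1).to(torch.int32)
```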

@1096125073

Hi, I tried run.py with a py_session and the result was correct, but why is the Triton backend incorrect? Are there any parameters that can be adjusted?

@symphonylyh (Collaborator)

@0xd8b good to hear that you have resolved the issue! Yes, this int64/int32 thing is indeed tricky. That's why we put a caveat here in the past...

Next time it will be more straightforward, now that we know such an interleaving problem is caused by int64/int32.

Closing the issue for now.

@symphonylyh (Collaborator) commented Jul 1, 2024

Hi @1096125073, it's because PP is only supported in the Python runtime at this point. There are several runtime choices for enc-dec:

  1. Python runtime, using examples/enc_dec/run.py or examples/run.py --use_py_session
  2. pybind of C++ runtime, using examples/run.py
  3. Triton backend, which calls the same underlying APIs as (2)

The current status is: (1) supports TP + PP, while (2) and (3) are the same and only support TP, for the reason mentioned above (not many PP use cases for enc-dec, so it's on the roadmap but not at high priority). Does this help?

By the way, do you mean examples/run.py --use_py_session can work for enc-dec? If I remember correctly, enc-dec's Python runtime can only be run via examples/enc_dec/run.py.

Again, feel free to open a new issue regarding enc-dec C++ runtime PP support, and explain your need for PP there so we can prioritize accordingly.
