
Whisper model conversion fails on Nvidia T4 #709

Open
harryfyodor opened this issue Feb 7, 2025 · 0 comments

Describe the problem

On an Nvidia T4, trtllm-build fails when building the Whisper encoder.

  • Expected behavior: the encoder engine builds successfully.
  • Actual behavior: trtllm-build aborts with an error.

Steps to reproduce

Step 1

docker build . -f Dockerfile.server -t soar97/triton-whisper:24.09

Step 2

your_mount_dir=/mnt:/mnt
docker run -it --name "whisper-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-whisper:24.09
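
Optionally, before building, it is worth confirming that the container actually sees the T4. A quick sanity check (the compute_cap query field assumes a reasonably recent NVIDIA driver):

# Inside the container: the T4 should report compute capability 7.5 (Turing).
nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv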

Step 3

wget --directory-prefix=assets https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt

INFERENCE_PRECISION=float16
MAX_BEAM_WIDTH=4
MAX_BATCH_SIZE=8
checkpoint_dir=tllm_checkpoint
output_dir=whisper_large_v3

# Convert the large-v3 openai model into trtllm compatible checkpoint.
python3 convert_checkpoint.py \
    --output_dir $checkpoint_dir

# Build the large-v3 trtllm engines
trtllm-build --checkpoint_dir ${checkpoint_dir}/encoder \
              --output_dir ${output_dir}/encoder \
              --moe_plugin disable \
              --enable_xqa disable \
              --max_batch_size ${MAX_BATCH_SIZE} \
              --gemm_plugin disable \
              --bert_attention_plugin ${INFERENCE_PRECISION} \
              --max_input_len 3000 --max_seq_len=3000
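
For context on the sizes used here: Whisper computes 100 mel frames per second of audio, so a 30-second window is 3000 input frames, and the encoder's conv front-end downsamples by 2 to 1500 positions, which matches the later warning about max_position_embeddings 1500 in the log. A trivial check of that arithmetic:

# 30 s of audio at 100 mel frames/s, halved by the conv stem:
echo $((30 * 100))       # 3000 input mel frames
echo $((30 * 100 / 2))   # 1500 encoder positions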

Environment

  • GPU: Nvidia T4
  • Operating System: Ubuntu 20.04
  • Python Version: Python 3.10.12
  • Library Versions: TensorRT-LLM 0.15.0.dev2024101500
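
The installed version can be double-checked from inside the container; this just prints the same version string that the build logs below also report:

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"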

Error Log

[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024101500
0.15.0.dev2024101500
/workspace/convert_checkpoint.py:378: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  model = torch.load(model_path, map_location='cpu')
Loaded model from assets/large-v3.pt
Converting encoder checkpoints...
Converting decoder checkpoints...
Total time of converting checkpoints: 00:00:12
[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024101500
[02/07/2025-04:25:38] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set gpt_attention_plugin to None.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set gemm_plugin to None.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set nccl_plugin to auto.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set lookup_plugin to None.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set lora_plugin to None.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set moe_plugin to None.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set context_fmha to True.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set remove_input_padding to True.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set reduce_fusion to False.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set enable_xqa to True.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set tokens_per_block to 64.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set multiple_profiles to False.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set paged_state to True.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set streamingllm to False.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set use_fused_mlp to True.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[02/07/2025-04:25:38] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_position_embedding = True
[02/07/2025-04:25:38] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 128
[02/07/2025-04:25:38] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 100
[02/07/2025-04:25:38] [TRT-LLM] [I] Compute capability: (7, 5)
[02/07/2025-04:25:38] [TRT-LLM] [I] SM count: 40
[02/07/2025-04:25:38] [TRT-LLM] [I] SM clock: 1590 MHz
[02/07/2025-04:25:38] [TRT-LLM] [I] int4 TFLOPS: 260
[02/07/2025-04:25:38] [TRT-LLM] [I] int8 TFLOPS: 130
[02/07/2025-04:25:38] [TRT-LLM] [I] fp8 TFLOPS: 0
[02/07/2025-04:25:38] [TRT-LLM] [I] float16 TFLOPS: 65
[02/07/2025-04:25:38] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[02/07/2025-04:25:38] [TRT-LLM] [I] float32 TFLOPS: 8
[02/07/2025-04:25:38] [TRT-LLM] [I] Total Memory: 15 GiB
[02/07/2025-04:25:38] [TRT-LLM] [I] Memory clock: 5001 MHz
[02/07/2025-04:25:38] [TRT-LLM] [I] Memory bus width: 256
[02/07/2025-04:25:38] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
[02/07/2025-04:25:38] [TRT-LLM] [I] PCIe speed: 8000 Mbps
[02/07/2025-04:25:38] [TRT-LLM] [I] PCIe link width: 16
[02/07/2025-04:25:38] [TRT-LLM] [I] PCIe bandwidth: 16 GB/s
[02/07/2025-04:25:38] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[02/07/2025-04:25:38] [TRT-LLM] [I] Set dtype to float16.
[02/07/2025-04:25:38] [TRT-LLM] [I] Set paged_kv_cache to True.
[02/07/2025-04:25:38] [TRT-LLM] [W] Overriding paged_state to False
[02/07/2025-04:25:38] [TRT-LLM] [I] Set paged_state to False.
[02/07/2025-04:25:38] [TRT-LLM] [W] max_seq_len 3000 is larger than max_position_embeddings 1500 * rotary scaling 1, the model accuracy might be affected
[02/07/2025-04:25:38] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[02/07/2025-04:25:38] [TRT-LLM] [W] max_num_tokens (3000) shouldn't be greater than max_seq_len * max_batch_size (3000), specifying to max_seq_len * max_batch_size (3000).
[02/07/2025-04:25:38] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[02/07/2025-04:25:39] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 155, GPU 11180 (MiB)
[02/07/2025-04:25:52] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +954, GPU +190, now: CPU 1265, GPU 11370 (MiB)
[02/07/2025-04:25:53] [TRT-LLM] [I] Set nccl_plugin to None.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Unsupported architecture (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp:83)
1       0x7fc217dd9cd7 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2       0x7fc217f73ade tensorrt_llm::kernels::FusedMHARunnerV2::FusedMHARunnerV2(tensorrt_llm::kernels::MHARunnerFixedParams) + 542
3       0x7fc1642fe8bf tensorrt_llm::plugins::BertAttentionPlugin::initialize() + 479
4       0x7fc1642fe581 tensorrt_llm::plugins::BertAttentionPlugin::clone() const + 177
5       0x7fc3402c0577 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xae3577) [0x7fc3402c0577]
6       0x7fc34022d48e /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa5048e) [0x7fc34022d48e]
7       0x7fc34ecfbd5a /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0xfbd5a) [0x7fc34ecfbd5a]
8       0x7fc34ec4b35e /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x4b35e) [0x7fc34ec4b35e]
9       0x564308027b2e /usr/bin/python3(+0x15cb2e) [0x564308027b2e]
10      0x56430801e2db _PyObject_MakeTpCall + 603
11      0x56430803655b /usr/bin/python3(+0x16b55b) [0x56430803655b]
12      0x56430801634a _PyEval_EvalFrameDefault + 24890
13      0x56430802842c _PyFunction_Vectorcall + 124
14      0x564308011b93 _PyEval_EvalFrameDefault + 6531
15      0x564308036281 /usr/bin/python3(+0x16b281) [0x564308036281]
16      0x564308036f22 PyObject_Call + 290
17      0x564308012a6e _PyEval_EvalFrameDefault + 10334
18      0x56430802842c _PyFunction_Vectorcall + 124
19      0x56430801d51d _PyObject_FastCallDictTstate + 365
20      0x5643080332bc _PyObject_Call_Prepend + 92
21      0x56430814d6d0 /usr/bin/python3(+0x2826d0) [0x56430814d6d0]
22      0x56430801e2db _PyObject_MakeTpCall + 603
23      0x5643080174fa _PyEval_EvalFrameDefault + 29418
24      0x564308036281 /usr/bin/python3(+0x16b281) [0x564308036281]
25      0x564308036f22 PyObject_Call + 290
26      0x564308012a6e _PyEval_EvalFrameDefault + 10334
27      0x56430802842c _PyFunction_Vectorcall + 124
28      0x56430801d51d _PyObject_FastCallDictTstate + 365
29      0x5643080332bc _PyObject_Call_Prepend + 92
30      0x56430814d6d0 /usr/bin/python3(+0x2826d0) [0x56430814d6d0]
31      0x56430801e2db _PyObject_MakeTpCall + 603
32      0x5643080174fa _PyEval_EvalFrameDefault + 29418
33      0x564308036281 /usr/bin/python3(+0x16b281) [0x564308036281]
34      0x564308036f22 PyObject_Call + 290
35      0x564308012a6e _PyEval_EvalFrameDefault + 10334
36      0x56430802842c _PyFunction_Vectorcall + 124
37      0x56430801d51d _PyObject_FastCallDictTstate + 365
38      0x5643080332bc _PyObject_Call_Prepend + 92
39      0x56430814d6d0 /usr/bin/python3(+0x2826d0) [0x56430814d6d0]
40      0x564308036ebb PyObject_Call + 187
41      0x564308012a6e _PyEval_EvalFrameDefault + 10334
42      0x56430802842c _PyFunction_Vectorcall + 124
43      0x5643080108cc _PyEval_EvalFrameDefault + 1724
44      0x56430802842c _PyFunction_Vectorcall + 124
45      0x564308036f22 PyObject_Call + 290
46      0x564308012a6e _PyEval_EvalFrameDefault + 10334
47      0x56430802842c _PyFunction_Vectorcall + 124
48      0x564308036f22 PyObject_Call + 290
49      0x564308012a6e _PyEval_EvalFrameDefault + 10334
50      0x56430802842c _PyFunction_Vectorcall + 124
51      0x564308036f22 PyObject_Call + 290
52      0x564308012a6e _PyEval_EvalFrameDefault + 10334
53      0x56430802842c _PyFunction_Vectorcall + 124
54      0x5643080108cc _PyEval_EvalFrameDefault + 1724
55      0x56430800d016 /usr/bin/python3(+0x142016) [0x56430800d016]
56      0x5643081028b6 PyEval_EvalCode + 134
57      0x56430812d918 /usr/bin/python3(+0x262918) [0x56430812d918]
58      0x5643081271db /usr/bin/python3(+0x25c1db) [0x5643081271db]
59      0x56430812d665 /usr/bin/python3(+0x262665) [0x56430812d665]
60      0x56430812cb48 _PyRun_SimpleFileObject + 424
61      0x56430812c793 _PyRun_AnyFileObject + 67
62      0x56430811f2ce Py_RunMain + 702
63      0x5643080f570d Py_BytesMain + 45
64      0x7fc378b3bd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc378b3bd90]
65      0x7fc378b3be40 __libc_start_main + 128
66      0x5643080f5605 _start + 37
[iZwz90e7co3xxhxzlomma9Z:00172] *** Process received signal ***
[iZwz90e7co3xxhxzlomma9Z:00172] Signal: Aborted (6)
[iZwz90e7co3xxhxzlomma9Z:00172] Signal code:  (-6)
[iZwz90e7co3xxhxzlomma9Z:00172] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc378b54520]
[iZwz90e7co3xxhxzlomma9Z:00172] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fc378ba89fc]
[iZwz90e7co3xxhxzlomma9Z:00172] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fc378b54476]
[iZwz90e7co3xxhxzlomma9Z:00172] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fc378b3a7f3]
[iZwz90e7co3xxhxzlomma9Z:00172] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7fc33cf9bb9e]
[iZwz90e7co3xxhxzlomma9Z:00172] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7fc33cfa720c]
[iZwz90e7co3xxhxzlomma9Z:00172] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7fc33cfa61e9]
[iZwz90e7co3xxhxzlomma9Z:00172] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7fc33cfa6959]
[iZwz90e7co3xxhxzlomma9Z:00172] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7fc378725884]
[iZwz90e7co3xxhxzlomma9Z:00172] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7fc3787262dd]
[iZwz90e7co3xxhxzlomma9Z:00172] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x768af9)[0x7fc217ddcaf9]
[iZwz90e7co3xxhxzlomma9Z:00172] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins19BertAttentionPlugin10initializeEv+0x1df)[0x7fc1642fe8bf]
[iZwz90e7co3xxhxzlomma9Z:00172] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZNK12tensorrt_llm7plugins19BertAttentionPlugin5cloneEv+0xb1)[0x7fc1642fe581]
[iZwz90e7co3xxhxzlomma9Z:00172] [13] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xae3577)[0x7fc3402c0577]
[iZwz90e7co3xxhxzlomma9Z:00172] [14] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa5048e)[0x7fc34022d48e]
[iZwz90e7co3xxhxzlomma9Z:00172] [15] /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0xfbd5a)[0x7fc34ecfbd5a]
[iZwz90e7co3xxhxzlomma9Z:00172] [16] /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x4b35e)[0x7fc34ec4b35e]
[iZwz90e7co3xxhxzlomma9Z:00172] [17] /usr/bin/python3(+0x15cb2e)[0x564308027b2e]
[iZwz90e7co3xxhxzlomma9Z:00172] [18] /usr/bin/python3(_PyObject_MakeTpCall+0x25b)[0x56430801e2db]
[iZwz90e7co3xxhxzlomma9Z:00172] [19] /usr/bin/python3(+0x16b55b)[0x56430803655b]
[iZwz90e7co3xxhxzlomma9Z:00172] [20] /usr/bin/python3(_PyEval_EvalFrameDefault+0x613a)[0x56430801634a]
[iZwz90e7co3xxhxzlomma9Z:00172] [21] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x56430802842c]
[iZwz90e7co3xxhxzlomma9Z:00172] [22] /usr/bin/python3(_PyEval_EvalFrameDefault+0x1983)[0x564308011b93]
[iZwz90e7co3xxhxzlomma9Z:00172] [23] /usr/bin/python3(+0x16b281)[0x564308036281]
[iZwz90e7co3xxhxzlomma9Z:00172] [24] /usr/bin/python3(PyObject_Call+0x122)[0x564308036f22]
[iZwz90e7co3xxhxzlomma9Z:00172] [25] /usr/bin/python3(_PyEval_EvalFrameDefault+0x285e)[0x564308012a6e]
[iZwz90e7co3xxhxzlomma9Z:00172] [26] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x56430802842c]
[iZwz90e7co3xxhxzlomma9Z:00172] [27] /usr/bin/python3(_PyObject_FastCallDictTstate+0x16d)[0x56430801d51d]
[iZwz90e7co3xxhxzlomma9Z:00172] [28] /usr/bin/python3(_PyObject_Call_Prepend+0x5c)[0x5643080332bc]
[iZwz90e7co3xxhxzlomma9Z:00172] [29] /usr/bin/python3(+0x2826d0)[0x56430814d6d0]
[iZwz90e7co3xxhxzlomma9Z:00172] *** End of error message ***
Aborted (core dumped)
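
For what it's worth, the assertion above fires inside BertAttentionPlugin::initialize() when FusedMHARunnerV2 is constructed, and the check in fmhaRunner.cpp rejects the device architecture; the log shows compute capability (7, 5), i.e. Turing, which this build's fused-MHA kernels apparently do not support. A hedged workaround sketch, untested on this setup (the disable value follows the convention the other plugin flags in Step 3 already use), is to bypass the plugin entirely:

# Sketch only: same build as Step 3, but with BertAttentionPlugin disabled
# so the fused-MHA runner is never constructed.
trtllm-build --checkpoint_dir ${checkpoint_dir}/encoder \
              --output_dir ${output_dir}/encoder \
              --moe_plugin disable \
              --enable_xqa disable \
              --max_batch_size ${MAX_BATCH_SIZE} \
              --gemm_plugin disable \
              --bert_attention_plugin disable \
              --max_input_len 3000 --max_seq_len=3000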

Additional Information

  • I suspected a problem with trtllm 0.15.0 and installed the latest version; it still core dumps, just without such a detailed error message.
  • I suspected FMHA, so I added --context_fmha disable. With that, the tiny model builds successfully but the large model does not, failing with the error below:
[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024101500
0.15.0.dev2024101500
/workspace/convert_checkpoint.py:378: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  model = torch.load(model_path, map_location='cpu')
Loaded model from assets/large-v3.pt
Converting encoder checkpoints...
Converting decoder checkpoints...
Total time of converting checkpoints: 00:00:27
[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024101500
[02/07/2025-04:29:06] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set gpt_attention_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set gemm_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set nccl_plugin to auto.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set lookup_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set lora_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set moe_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set context_fmha to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set remove_input_padding to True.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set reduce_fusion to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set enable_xqa to True.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set tokens_per_block to 64.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set multiple_profiles to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set paged_state to True.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set streamingllm to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set use_fused_mlp to True.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[02/07/2025-04:29:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_position_embedding = True
[02/07/2025-04:29:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 128
[02/07/2025-04:29:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 100
[02/07/2025-04:29:06] [TRT-LLM] [I] Compute capability: (7, 5)
[02/07/2025-04:29:06] [TRT-LLM] [I] SM count: 40
[02/07/2025-04:29:06] [TRT-LLM] [I] SM clock: 1590 MHz
[02/07/2025-04:29:06] [TRT-LLM] [I] int4 TFLOPS: 260
[02/07/2025-04:29:06] [TRT-LLM] [I] int8 TFLOPS: 130
[02/07/2025-04:29:06] [TRT-LLM] [I] fp8 TFLOPS: 0
[02/07/2025-04:29:06] [TRT-LLM] [I] float16 TFLOPS: 65
[02/07/2025-04:29:06] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[02/07/2025-04:29:06] [TRT-LLM] [I] float32 TFLOPS: 8
[02/07/2025-04:29:06] [TRT-LLM] [I] Total Memory: 15 GiB
[02/07/2025-04:29:06] [TRT-LLM] [I] Memory clock: 5001 MHz
[02/07/2025-04:29:06] [TRT-LLM] [I] Memory bus width: 256
[02/07/2025-04:29:06] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
[02/07/2025-04:29:06] [TRT-LLM] [I] PCIe speed: 8000 Mbps
[02/07/2025-04:29:06] [TRT-LLM] [I] PCIe link width: 16
[02/07/2025-04:29:06] [TRT-LLM] [I] PCIe bandwidth: 16 GB/s
[02/07/2025-04:29:07] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[02/07/2025-04:29:07] [TRT-LLM] [I] Set dtype to float16.
[02/07/2025-04:29:07] [TRT-LLM] [I] Set paged_kv_cache to True.
[02/07/2025-04:29:07] [TRT-LLM] [W] Overriding paged_state to False
[02/07/2025-04:29:07] [TRT-LLM] [I] Set paged_state to False.
[02/07/2025-04:29:07] [TRT-LLM] [W] max_seq_len 3000 is larger than max_position_embeddings 1500 * rotary scaling 1, the model accuracy might be affected
[02/07/2025-04:29:07] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[02/07/2025-04:29:07] [TRT-LLM] [W] max_num_tokens (3000) shouldn't be greater than max_seq_len * max_batch_size (3000), specifying to max_seq_len * max_batch_size (3000).
[02/07/2025-04:29:07] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 155, GPU 11180 (MiB)
[02/07/2025-04:29:09] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +954, GPU +190, now: CPU 1265, GPU 11370 (MiB)
[02/07/2025-04:29:09] [TRT-LLM] [I] Set nccl_plugin to None.
[02/07/2025-04:29:10] [TRT-LLM] [I] Total time of constructing network from module object 3.3218817710876465 seconds
[02/07/2025-04:29:10] [TRT-LLM] [I] Total optimization profiles added: 1
[02/07/2025-04:29:10] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[02/07/2025-04:29:10] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[02/07/2025-04:29:10] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[02/07/2025-04:29:10] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[02/07/2025-04:29:10] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[02/07/2025-04:29:11] [TRT] [I] Compiler backend is used during engine build.
[02/07/2025-04:29:38] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[02/07/2025-04:29:38] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[02/07/2025-04:29:38] [TRT] [E] IBuilder::buildSerializedNetwork: Error Code 4: Internal Error (Internal error: plugin node WhisperEncoder/encoder_layers/2/attention/bert_attention_L4149/PLUGIN_V2_BertAttention_0 requires 41028384896 bytes of scratch space, but only 15655829504 is available. Try increasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
)
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 580, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 422, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 389, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 382, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 1200, in build
    engine = None if build_config.dry_run else builder.build_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 204, in decorated
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 426, in build_engine
    assert engine is not None, 'Engine building failed, please check the error log.'
AssertionError: Engine building failed, please check the error log.
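
The failure mode with --context_fmha disable is different: unfused BERT attention materializes the full attention matrix, so the plugin requests 41,028,384,896 bytes (about 38 GiB) of scratch while only about 14.6 GiB is available. Since the T4 has roughly 15 GiB in total, raising the workspace limit as the error suggests cannot help; the scratch presumably scales with the batch dimension, so shrinking the batch is the obvious lever. A sketch under that assumption (about 38 GiB / 8 is roughly 4.8 GiB, which would fit):

# Sketch only: retry the non-fused build with a batch of 1 to shrink
# the attention scratch below the T4's ~15 GiB of memory.
trtllm-build --checkpoint_dir ${checkpoint_dir}/encoder \
              --output_dir ${output_dir}/encoder \
              --moe_plugin disable \
              --enable_xqa disable \
              --max_batch_size 1 \
              --gemm_plugin disable \
              --bert_attention_plugin ${INFERENCE_PRECISION} \
              --context_fmha disable \
              --max_input_len 3000 --max_seq_len=3000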