[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024101500
/workspace/convert_checkpoint.py:378: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
model = torch.load(model_path, map_location='cpu')
Loaded model from assets/large-v3.pt
Converting encoder checkpoints...
Converting decoder checkpoints...
Total time of converting checkpoints: 00:00:27
[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024101500
[02/07/2025-04:29:06] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set gpt_attention_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set gemm_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set nccl_plugin to auto.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set lookup_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set lora_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set moe_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set context_fmha to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set remove_input_padding to True.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set reduce_fusion to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set enable_xqa to True.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set tokens_per_block to 64.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set multiple_profiles to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set paged_state to True.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set streamingllm to False.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set use_fused_mlp to True.
[02/07/2025-04:29:06] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[02/07/2025-04:29:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_position_embedding = True
[02/07/2025-04:29:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 128
[02/07/2025-04:29:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 100
[02/07/2025-04:29:06] [TRT-LLM] [I] Compute capability: (7, 5)
[02/07/2025-04:29:06] [TRT-LLM] [I] SM count: 40
[02/07/2025-04:29:06] [TRT-LLM] [I] SM clock: 1590 MHz
[02/07/2025-04:29:06] [TRT-LLM] [I] int4 TFLOPS: 260
[02/07/2025-04:29:06] [TRT-LLM] [I] int8 TFLOPS: 130
[02/07/2025-04:29:06] [TRT-LLM] [I] fp8 TFLOPS: 0
[02/07/2025-04:29:06] [TRT-LLM] [I] float16 TFLOPS: 65
[02/07/2025-04:29:06] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[02/07/2025-04:29:06] [TRT-LLM] [I] float32 TFLOPS: 8
[02/07/2025-04:29:06] [TRT-LLM] [I] Total Memory: 15 GiB
[02/07/2025-04:29:06] [TRT-LLM] [I] Memory clock: 5001 MHz
[02/07/2025-04:29:06] [TRT-LLM] [I] Memory bus width: 256
[02/07/2025-04:29:06] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
[02/07/2025-04:29:06] [TRT-LLM] [I] PCIe speed: 8000 Mbps
[02/07/2025-04:29:06] [TRT-LLM] [I] PCIe link width: 16
[02/07/2025-04:29:06] [TRT-LLM] [I] PCIe bandwidth: 16 GB/s
[02/07/2025-04:29:07] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[02/07/2025-04:29:07] [TRT-LLM] [I] Set dtype to float16.
[02/07/2025-04:29:07] [TRT-LLM] [I] Set paged_kv_cache to True.
[02/07/2025-04:29:07] [TRT-LLM] [W] Overriding paged_state to False
[02/07/2025-04:29:07] [TRT-LLM] [I] Set paged_state to False.
[02/07/2025-04:29:07] [TRT-LLM] [W] max_seq_len 3000 is larger than max_position_embeddings 1500 * rotary scaling 1, the model accuracy might be affected
[02/07/2025-04:29:07] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[02/07/2025-04:29:07] [TRT-LLM] [W] max_num_tokens (3000) shouldn't be greater than max_seq_len * max_batch_size (3000), specifying to max_seq_len * max_batch_size (3000).
[02/07/2025-04:29:07] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 155, GPU 11180 (MiB)
[02/07/2025-04:29:09] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +954, GPU +190, now: CPU 1265, GPU 11370 (MiB)
[02/07/2025-04:29:09] [TRT-LLM] [I] Set nccl_plugin to None.
[02/07/2025-04:29:10] [TRT-LLM] [I] Total time of constructing network from module object 3.3218817710876465 seconds
[02/07/2025-04:29:10] [TRT-LLM] [I] Total optimization profiles added: 1
[02/07/2025-04:29:10] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[02/07/2025-04:29:10] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[02/07/2025-04:29:10] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[02/07/2025-04:29:10] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[02/07/2025-04:29:10] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[02/07/2025-04:29:11] [TRT] [I] Compiler backend is used during engine build.
[02/07/2025-04:29:38] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[02/07/2025-04:29:38] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[02/07/2025-04:29:38] [TRT] [E] IBuilder::buildSerializedNetwork: Error Code 4: Internal Error (Internal error: plugin node WhisperEncoder/encoder_layers/2/attention/bert_attention_L4149/PLUGIN_V2_BertAttention_0 requires 41028384896 bytes of scratch space, but only 15655829504 is available. Try increasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
)
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 580, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 422, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 389, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 382, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 1200, in build
    engine = None if build_config.dry_run else builder.build_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 204, in decorated
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 426, in build_engine
    assert engine is not None, 'Engine building failed, please check the error log.'
AssertionError: Engine building failed, please check the error log.
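The failing node is the `PLUGIN_V2_BertAttention` plugin, and the log above shows the build ran with `context_fmha` set to False. Without fused context FMHA, unfused attention has to materialize full seq_len × seq_len score matrices, so scratch memory grows quadratically with sequence length. The rough sketch below illustrates that scaling; the head count (20 for the Whisper large-v3 encoder) and `max_seq_len` 3000 come from the model and the log, but the batch size and the factor-of-2 buffer estimate are assumptions — the plugin's real allocation scheme is internal to TensorRT-LLM. Note also that the requested ~41 GB exceeds the T4's 15 GiB of total memory reported in the log, so raising the workspace limit with `setMemoryPoolLimit` alone cannot satisfy it.

```python
def unfused_attention_scratch_bytes(batch, heads, seq_len, dtype_bytes=2):
    """Rough scratch estimate for unfused BERT-style attention.

    Without fused context FMHA, the full seq_len x seq_len score
    matrix is materialized per head; the factor of 2 is a crude
    allowance for a second softmax/output buffer. Illustrative only.
    """
    return 2 * batch * heads * seq_len * seq_len * dtype_bytes

# Whisper large-v3 encoder: 20 heads; max_seq_len 3000 from the log.
# batch=8 is an assumed example value, not taken from the issue.
gib = unfused_attention_scratch_bytes(batch=8, heads=20, seq_len=3000) / 2**30
print(f"~{gib:.1f} GiB of attention scratch at batch 8")
```

Under this estimate, scratch usage scales linearly with batch size and quadratically with sequence length, which is why lowering `max_batch_size`/`max_seq_len` (or enabling fused context FMHA where the hardware supports it) is the usual lever for this class of error.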
Describe the problem
On an NVIDIA T4, `trtllm-build` fails for the Whisper encoder.
Steps to reproduce
Step 1: Convert the Whisper large-v3 checkpoint with `convert_checkpoint.py` (log above).
Step 2: Build the encoder engine with `trtllm-build`.
Step 3: The build fails with the internal error shown in the log.
Environment
GPU: NVIDIA T4 (compute capability 7.5, 15 GiB)
TensorRT-LLM version: 0.15.0.dev2024101500
Error Log
See the build output above.
Additional Information
Built with `--context_fmha disable`. The tiny model builds successfully, but the large model does not, failing with the error above.