Succeeded in Python runtime, but failed in C++ runtime #2294

Closed · 2 of 4 tasks
yjjuan opened this issue Oct 7, 2024 · 4 comments
Labels: bug (Something isn't working), triaged (Issue has been triaged by maintainers)

Comments

yjjuan commented Oct 7, 2024

System Info

  • CPU architecture: x86_64
  • CPU/Host memory size (if known): 40 GB
  • GPU name: RTX 3060 (6 GB)
  • TensorRT-LLM branch or tag / commit: TensorRT-LLM version 0.14.0.dev2024092400
  • Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used: CUDA 12.4, TensorRT 10.3.0
  • Container used (if running TensorRT-LLM in a container): nvcr.io/nvidia/tensorrt:24.08-py3
  • NVIDIA driver version: 550.107.02
  • OS: Ubuntu 22.04

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • Tried to run inference on a TinyLlama engine with the C++ runtime via the Executor API; the run aborts while loading the engine (a minimal sketch of the Executor API usage follows the log):
root@dbedfb3d0654:/workspace/TensorRT-LLM/examples/cpp/executor/build# ./executorExampleBasic ../../../llama/tinyllama-engine/
[TensorRT-LLM][INFO] ckpt0
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 22
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2103 MiB
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: sizeof(*this) <= buffer_size (/workspace/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplCommon.h:118)
1       0x7fd1368a3c66 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fd136a5c227 tensorrt_llm::kernels::jit::CubinObjRegistryTemplate<tensorrt_llm::kernels::XQAKernelFullHashKey, tensorrt_llm::kernels::XQAKernelFullHasher>::CubinObjRegistryTemplate(void const*, unsigned long) + 1047
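
For context, executorExampleBasic essentially boils down to the following Executor API usage. This is a minimal sketch assuming the public tensorrt_llm::executor API in this release; the engine path and input token IDs are placeholders:

```cpp
#include <iostream>

#include "tensorrt_llm/executor/executor.h"

namespace tle = tensorrt_llm::executor;

int main(int argc, char* argv[])
{
    // Default executor config with beam width 1, matching the engine.
    tle::ExecutorConfig executorConfig(1);

    // Loading the engine is the step that aborts with the XQA
    // cubin-registry assertion in the C++ runtime.
    tle::Executor executor(argv[1], tle::ModelType::kDECODER_ONLY, executorConfig);

    // Placeholder token IDs; a real client would run a tokenizer first.
    tle::VecTokens inputTokens{1, 2, 3, 4};
    tle::Request request(inputTokens, 10 /* max new tokens */);

    auto requestId = executor.enqueueRequest(std::move(request));

    // Block until the request finishes, then print the first beam's tokens.
    auto responses = executor.awaitResponses(requestId);
    for (auto const& response : responses)
    {
        if (response.hasError())
        {
            std::cerr << response.getErrorMsg() << std::endl;
            continue;
        }
        for (auto token : response.getResult().outputTokenIds.at(0))
        {
            std::cout << token << " ";
        }
        std::cout << std::endl;
    }
    return 0;
}
```

Since the assertion fires before any request is enqueued, the failure is in engine load/deserialization rather than in request handling.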

Expected behavior

  • In contrast, inference succeeds in the Python runtime with the same engine file:
root@dbedfb3d0654:/workspace/TensorRT-LLM/examples# python3 run.py --engine_dir ./llama/tinyllama-engine/  --max_output_len 100 --tokenizer_dir ./llama/TinyLlama/TinyLlama_v1_1/ --input_text "How do I count to nine in French?"
[TensorRT-LLM] TensorRT-LLM version: 0.14.0.dev2024092400
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 22
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2103 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 360.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2098 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 346.15 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.41 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 5.79 GiB, available: 1.43 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 960
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.29 GiB for max tokens in paged KV cache (61440).
[10/04/2024-05:05:27] [TRT-LLM] [I] Load engine takes: 1.5954880714416504 sec
Input [Text 0]: "<s> How do I count to nine in French?"
Output [Text 0 Beam 0]: "How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How"
[TensorRT-LLM][INFO] Refreshed the MPI local session

Actual behavior

  • Something is wrong in the C++ runtime, since the Python runtime works with the same engine; the C++ run aborts with:
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: sizeof(*this) <= buffer_size (/workspace/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplCommon.h:118)

Additional notes

  • Perhaps I am making an error in how I use the C++ runtime.
yjjuan added the bug (Something isn't working) label Oct 7, 2024
Superjomn added the triaged (Issue has been triaged by maintainers) label Oct 16, 2024
MartinMarciniszyn (Collaborator)

@yjjuan, please provide a reproducer.

yjjuan (Author) commented Oct 19, 2024

  • I built my TRT engine with the command: trtllm-build --checkpoint_dir TinyLlama_v1_1 --gemm_plugin float16 --output_dir tinyllama-engine/
  • Next, I used your C++ runtime example to run TinyLlama: ./executorExampleBasic ../../../llama/tinyllama-engine/

MartinMarciniszyn (Collaborator)

@DanBlanaru , could you please try to reproduce this?

DanBlanaru (Collaborator)

Thanks for your patience @yjjuan; a fix will be released with tomorrow's push to main.
