Succeeded in Python runtime, but failed in C++ runtime #2294

Closed · 2 of 4 tasks
yjjuan opened this issue Oct 7, 2024 · 4 comments
Labels: bug (Something isn't working), triaged (Issue has been triaged by maintainers)

Comments

yjjuan commented Oct 7, 2024

System Info

  • CPU architecture: x86_64
  • CPU/Host memory size (if known): 40 GB
  • GPU name: RTX 3060 (6 GB)
  • TensorRT-LLM branch or tag / commit: TensorRT-LLM version 0.14.0.dev2024092400
  • Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used: CUDA 12.4, TensorRT 10.3.0
  • Container used (if running TensorRT-LLM in a container): nvcr.io/nvidia/tensorrt:24.08-py3
  • NVIDIA driver version: 550.107.02
  • OS: Ubuntu 22.04

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • Tried to run inference on a TinyLlama engine with the C++ runtime via the Executor API; the run aborts while loading the engine (a minimal sketch of the Executor API usage follows the log):
root@dbedfb3d0654:/workspace/TensorRT-LLM/examples/cpp/executor/build# ./executorExampleBasic ../../../llama/tinyllama-engine/
[TensorRT-LLM][INFO] ckpt0
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 22
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2103 MiB
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: sizeof(*this) <= buffer_size (/workspace/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplCommon.h:118)
1       0x7fd1368a3c66 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fd136a5c227 tensorrt_llm::kernels::jit::CubinObjRegistryTemplate<tensorrt_llm::kernels::XQAKernelFullHashKey, tensorrt_llm::kernels::XQAKernelFullHasher>::CubinObjRegistryTemplate(void const*, unsigned long) + 1047
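
For context, executorExampleBasic essentially boils down to the following Executor API usage. This is a minimal sketch assuming the public tensorrt_llm::executor API in this release; the engine path and input token IDs are placeholders:

```cpp
#include <iostream>

#include "tensorrt_llm/executor/executor.h"

namespace tle = tensorrt_llm::executor;

int main(int argc, char* argv[])
{
    // Default executor config with beam width 1, matching the engine.
    tle::ExecutorConfig executorConfig(1);

    // Loading the engine is the step that aborts with the XQA
    // cubin-registry assertion in the C++ runtime.
    tle::Executor executor(argv[1], tle::ModelType::kDECODER_ONLY, executorConfig);

    // Placeholder token IDs; a real client would run a tokenizer first.
    tle::VecTokens inputTokens{1, 2, 3, 4};
    tle::Request request(inputTokens, 10 /* max new tokens */);

    auto requestId = executor.enqueueRequest(std::move(request));

    // Block until the request finishes, then print the first beam's tokens.
    auto responses = executor.awaitResponses(requestId);
    for (auto const& response : responses)
    {
        if (response.hasError())
        {
            std::cerr << response.getErrorMsg() << std::endl;
            continue;
        }
        for (auto token : response.getResult().outputTokenIds.at(0))
        {
            std::cout << token << " ";
        }
        std::cout << std::endl;
    }
    return 0;
}
```

Since the assertion fires before any request is enqueued, the failure is in engine load/deserialization rather than in request handling.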

Expected behavior

  • In contrast, inference succeeds in the Python runtime with the same engine file:
root@dbedfb3d0654:/workspace/TensorRT-LLM/examples# python3 run.py --engine_dir ./llama/tinyllama-engine/  --max_output_len 100 --tokenizer_dir ./llama/TinyLlama/TinyLlama_v1_1/ --input_text "How do I count to nine in French?"
[TensorRT-LLM] TensorRT-LLM version: 0.14.0.dev2024092400
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 22
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2103 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 360.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2098 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 346.15 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.41 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 5.79 GiB, available: 1.43 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 960
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.29 GiB for max tokens in paged KV cache (61440).
[10/04/2024-05:05:27] [TRT-LLM] [I] Load engine takes: 1.5954880714416504 sec
Input [Text 0]: "<s> How do I count to nine in French?"
Output [Text 0 Beam 0]: "How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How"
[TensorRT-LLM][INFO] Refreshed the MPI local session

Actual behavior

  • Something is wrong in the C++ runtime, since the Python runtime works with the same engine; the C++ run aborts with:
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: sizeof(*this) <= buffer_size (/workspace/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplCommon.h:118)

Additional notes

  • Perhaps I am making an error in how I use the C++ runtime.
yjjuan added the bug (Something isn't working) label Oct 7, 2024
Superjomn added the triaged (Issue has been triaged by maintainers) label Oct 16, 2024
MartinMarciniszyn (Collaborator)

@yjjuan, please provide a reproducer.

yjjuan (Author) commented Oct 19, 2024

  • I built my TRT engine with the command: trtllm-build --checkpoint_dir TinyLlama_v1_1 --gemm_plugin float16 --output_dir tinyllama-engine/
  • Next, I used your C++ runtime example to run TinyLlama: ./executorExampleBasic ../../../llama/tinyllama-engine/

MartinMarciniszyn (Collaborator)

@DanBlanaru , could you please try to reproduce this?

DanBlanaru (Collaborator)

Thanks for your patience @yjjuan; a fix will be released with tomorrow's push to main.
