
Conversation

@cr7258 (Contributor) commented May 1, 2025

What this PR does / why we need it

Add TensorRT-LLM as a backend. Here are the startup logs for the TensorRT-LLM model runner:

kubectl logs qwen2-0--5b-0
Defaulted container "model-runner" out of: model-runner, model-loader (init)
2025-05-01 14:44:15,167 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.18.0
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2090: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Loading Model: [1/2]	Loading HF model to memory
198it [00:01, 163.51it/s]
Time: 1.475s
Loading Model: [2/2]	Building TRT-LLM engine
Time: 37.698s
Loading model done.
Total latency: 39.172s
2025-05-01 14:45:04,175 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.18.0
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2090: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[TensorRT-LLM][INFO] Engine version 0.18.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][WARNING] Fix optionalParams : KV cache reuse disabled because model was not built with paged context FMHA support
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 24
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 964 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 324.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 950 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.66 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 5.76 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.18 GiB, available: 11.89 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 14609
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 10.70 GiB for max tokens in paged KV cache (934976).
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO:     240.243.170.78:61890 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:61902 - "GET /health HTTP/1.1" 200 OK
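
As a quick sanity check, the GPU memory reported inside the pod should roughly match the allocations in the logs above (engine ~964 MiB, runtime buffers 2.66 GB, decoder 5.76 GB, paged KV cache 10.70 GiB). A minimal sketch, assuming nvidia-smi is available in the model-runner container (not verified for this image):

# Hypothetical check: compare the GPU memory reported by nvidia-smi with the allocations logged above.
kubectl exec qwen2-0--5b-0 -c model-runner -- \
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv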

Send an inference request.

kubectl port-forward qwen2-0--5b-0 8080:8080

curl -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    -d '{
        "model": "Qwen/Qwen2-0.5B-Instruct",
        "messages":[{"role": "user", "content": "Who are you?"}]
    }'

# response
{
  "id": "chatcmpl-ecb2f4252cc04f7d9a6842de079487a3",
  "object": "chat.completion",
  "created": 1746111073,
  "model": "models--Qwen--Qwen2-0.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I am an artificial intelligence designed to assist with a variety of tasks, including answering",
        "tool_calls": [
          
        ]
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "total_tokens": 39,
    "completion_tokens": 16
  }
}
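
The completion above stops at 16 tokens with "finish_reason": "length", i.e. it hit the server's default completion-length limit. A hedged follow-up request that raises the limit, assuming the endpoint honours the standard OpenAI max_tokens field (not verified against this exact image):

# Same request, but allow up to 256 completion tokens so the answer is not truncated.
curl -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    -d '{
        "model": "Qwen/Qwen2-0.5B-Instruct",
        "messages": [{"role": "user", "content": "Who are you?"}],
        "max_tokens": 256
    }'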

Which issue(s) this PR fixes

Fixes #205

Special notes for your reviewer

In this PR, I didn't add a preStop hook for graceful termination of the TensorRT-LLM backend, for the following reason:

Currently, the latest Triton Inference Server image that supports TensorRT-LLM is nvcr.io/nvidia/tritonserver:25.03-trtllm-python-py3, which ships TensorRT-LLM 0.18.0. However, TensorRT-LLM only gains a metrics endpoint in version 0.19.0, which is still a release candidate. Once the image is updated with metrics support, we can add the preStop hook.
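
For reference, a preStop drain could look roughly like the sketch below once a metrics-capable image is available. This is only a sketch under assumptions: it presumes a Prometheus-style /metrics endpoint on port 8080 and a gauge for in-flight requests; the metric name num_requests_running is an illustrative placeholder, not taken from TensorRT-LLM.

#!/bin/sh
# Hypothetical preStop drain loop for a future image with TensorRT-LLM >= 0.19.0.
# Assumes a Prometheus-style /metrics endpoint on port 8080; the gauge name
# num_requests_running is a placeholder, not a confirmed TensorRT-LLM metric.
while true; do
  inflight=$(curl -s http://localhost:8080/metrics | awk '/^num_requests_running/ {print $2; exit}')
  # Stop waiting once the gauge reads zero or the endpoint is unreachable (server already down).
  [ -z "$inflight" ] && break
  [ "${inflight%.*}" -eq 0 ] && break
  sleep 1
done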

Does this PR introduce a user-facing change?

Add TensorRT-LLM as a backend.

@InftyAI-Agent added the needs-triage, needs-priority, and do-not-merge/needs-kind labels on May 1, 2025
@InftyAI-Agent requested a review from kerthcet on May 1, 2025
@cr7258 (Contributor, Author) commented May 1, 2025

/kind feature

@InftyAI-Agent added the feature label and removed the do-not-merge/needs-kind label on May 1, 2025
@kerthcet (Member) left a comment

Sorry for the late reply, this is great! Thanks @cr7258
/lgtm
/approve

@kerthcet (Member) commented May 6, 2025

/lgtm
/approve

@InftyAI-Agent added the lgtm and approved labels on May 6, 2025
@InftyAI-Agent merged commit fe74a6d into InftyAI:main on May 6, 2025
18 checks passed
