feat: add TensorRT-LLM as backend #392
Merged
/kind feature
kerthcet reviewed on May 6, 2025
Sorry for the late reply, this is great! Thanks @cr7258
/lgtm
/approve
/lgtm
Labels
approved
Indicates a PR has been approved by an approver from all required OWNERS files.
feature
Categorizes issue or PR as related to a new feature.
lgtm
Looks good to me, indicates that a PR is ready to be merged.
needs-priority
Indicates a PR lacks a label and requires one.
needs-triage
Indicates an issue or PR lacks a label and requires one.
What this PR does / why we need it
Add TensorRT-LLM as a backend. Here are the output logs for TensorRT-LLM.
Send an inference request.
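For reference, the same request can also be issued from Python. This is only a minimal sketch using the standard library, not part of the PR: it assumes the port-forward to localhost:8080 used in this description is active, and the helper names `build_chat_request`, `send`, and `summarize_usage` are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint: matches the kubectl port-forward to localhost:8080.
URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(model: str, prompt: str, url: str = URL) -> urllib.request.Request:
    """Build (but do not send) a chat completion POST request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def summarize_usage(response: dict) -> dict:
    """Extract the reply text and two sanity checks from a chat completion response."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "content": choice["message"]["content"],
        # finish_reason "length" means generation stopped at the token limit.
        "truncated": choice["finish_reason"] == "length",
        "tokens_consistent": (
            usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
        ),
    }
```

A call would look like `summarize_usage(send(build_chat_request("Qwen/Qwen2-0.5B-Instruct", "Who are you?")))`. For the response recorded in this PR, `truncated` would be true because `finish_reason` is `"length"`, which is why the reply stops mid-sentence after 16 completion tokens, and the token counts are consistent (23 + 16 = 39).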
    kubectl port-forward qwen2-0--5b-0 8080:8080

    curl -X POST http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Accept: application/json" \
      -d '{
        "model": "Qwen/Qwen2-0.5B-Instruct",
        "messages": [{"role": "user", "content": "Who are you?"}]
      }'

    # response
    {
      "id": "chatcmpl-ecb2f4252cc04f7d9a6842de079487a3",
      "object": "chat.completion",
      "created": 1746111073,
      "model": "models--Qwen--Qwen2-0.5B-Instruct",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "I am an artificial intelligence designed to assist with a variety of tasks, including answering",
            "tool_calls": []
          },
          "logprobs": null,
          "finish_reason": "length",
          "stop_reason": null
        }
      ],
      "usage": {
        "prompt_tokens": 23,
        "total_tokens": 39,
        "completion_tokens": 16
      }
    }

Which issue(s) this PR fixes
Fixes #205
Special notes for your reviewer
In this PR, I didn't add a preStop hook for TensorRT-LLM for graceful termination. The reason is as follows:
Currently, the latest Triton Inference Server image that supports TensorRT-LLM is nvcr.io/nvidia/tritonserver:25.03-trtllm-python-py3, which ships TensorRT-LLM 0.18.0. However, TensorRT-LLM only supports the metrics endpoint starting with version 0.19.0, which is still a release candidate. Once a new image with metrics support is published, we can add the preStop hook.
Does this PR introduce a user-facing change?