2 changes: 1 addition & 1 deletion README.md
@@ -39,7 +39,7 @@ Easy, advanced inference platform for large language models on Kubernetes
## Key Features

- **Ease of Use**: Users can quickly deploy an LLM service with minimal configuration.
- **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).
- **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).
- **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
- **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.
52 changes: 52 additions & 0 deletions chart/templates/backends/tensorrt-llm.yaml
@@ -0,0 +1,52 @@
{{- if .Values.backendRuntime.enabled -}}
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: tensorrt-llm
spec:
  command:
    - trtllm-serve
  image: {{ .Values.backendRuntime.tensorrt_llm.image.repository }}
  version: {{ .Values.backendRuntime.tensorrt_llm.image.tag }}
  # Do not edit the preset argument name unless you know what you're doing.
  # Feel free to add more arguments to fit your requirements.
  recommendedConfigs:
    - name: default
      args:
        - "{{`{{ .ModelPath }}`}}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
      resources:
        requests:
          cpu: 4
          memory: 16Gi
        limits:
          cpu: 4
          memory: 16Gi
  startupProbe:
    periodSeconds: 10
    failureThreshold: 30
    httpGet:
      path: /health
      port: 8080
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
{{- end }}
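The backtick wrapper around the `.ModelPath` argument is Helm's escape for emitting the inner `{{ .ModelPath }}` literally, so the placeholder survives chart rendering and is only resolved by llmaz when it launches the backend for a concrete model. As a rough sketch (using the default image values from `chart/values.global.yaml` below, probes and resources omitted), the template above renders to something like:

```yaml
# Illustrative rendering of the template above, truncated for brevity.
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  name: tensorrt-llm
spec:
  command:
    - trtllm-serve
  image: nvcr.io/nvidia/tritonserver
  version: 25.03-trtllm-python-py3
  recommendedConfigs:
    - name: default
      args:
        - "{{ .ModelPath }}"  # still a placeholder here; filled in per model by llmaz
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
```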
4 changes: 4 additions & 0 deletions chart/values.global.yaml
@@ -14,6 +14,10 @@ backendRuntime:
    image:
      repository: lmsysorg/sglang
      tag: v0.4.5-cu121
  tensorrt_llm:
    image:
      repository: nvcr.io/nvidia/tritonserver
      tag: 25.03-trtllm-python-py3
  tgi:
    image:
      repository: ghcr.io/huggingface/text-generation-inference
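To run a different TensorRT-LLM serving image, the `backendRuntime.tensorrt_llm.image` defaults above can be overridden at install time with a values file passed via `helm install -f` (or `--set`); a minimal override sketch, where the file name is hypothetical and the tag is a placeholder to adapt:

```yaml
# my-values.yaml (hypothetical override file)
backendRuntime:
  tensorrt_llm:
    image:
      repository: nvcr.io/nvidia/tritonserver
      tag: 25.03-trtllm-python-py3  # replace with the image tag you want to serve with
```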
13 changes: 9 additions & 4 deletions docs/examples/README.md
@@ -9,11 +9,12 @@ We provide a set of examples to help you serve large language models, by default
- [Deploy models from ObjectStore](#deploy-models-from-objectstore)
- [Deploy models via SGLang](#deploy-models-via-sglang)
- [Deploy models via llama.cpp](#deploy-models-via-llamacpp)
- [Deploy models via text-generation-inference](#deploy-models-via-tgi)
- [Deploy models via ollama](#ollama)
- [Deploy models via TensorRT-LLM](#deploy-models-via-tensorrt-llm)
- [Deploy models via text-generation-inference](#deploy-models-via-text-generation-inference)
- [Deploy models via ollama](#deploy-models-via-ollama)
- [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)
- [Deploy multi-host inference](#multi-host-inference)
- [Deploy host models](#deploy-host-models)
- [Multi-Host Inference](#multi-host-inference)
- [Deploy Host Models](#deploy-host-models)
- [Envoy AI Gateway](#envoy-ai-gateway)

### Deploy models from Huggingface
@@ -46,6 +47,10 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference

[llama.cpp](https://github.com/ggerganov/llama.cpp) can serve models on a wide variety of hardware, such as CPUs; see the [example](./llamacpp/) here.

### Deploy models via TensorRT-LLM

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs; see the [example](./tensorrt-llm/) here.
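The core of the linked example is a `Playground` whose `backendRuntimeConfig` selects the `tensorrt-llm` BackendRuntime by name; a condensed sketch of the manifest added under `docs/examples/tensorrt-llm/` in this change:

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b       # an OpenModel declared separately in the full example
  backendRuntimeConfig:
    backendName: tensorrt-llm    # picks the TensorRT-LLM BackendRuntime
```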

### Deploy models via text-generation-inference

[text-generation-inference](https://github.com/huggingface/text-generation-inference) is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint; see the [example](./tgi/) here.
25 changes: 25 additions & 0 deletions docs/examples/tensorrt-llm/playground.yaml
@@ -0,0 +1,25 @@
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
  inferenceConfig:
    flavors:
      - name: a10 # GPU type
        limits:
          nvidia.com/gpu: 1
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
  backendRuntimeConfig:
    backendName: tensorrt-llm
4 changes: 4 additions & 0 deletions site/content/en/docs/integrations/support-backends.md
@@ -13,6 +13,10 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt

[SGLang](https://github.com/sgl-project/sglang) is yet another fast serving framework for large language models and vision language models.

## TensorRT-LLM

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

## Text-Generation-Inference

[text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint.
51 changes: 51 additions & 0 deletions test/config/backends/tensorrt-llm.yaml
@@ -0,0 +1,51 @@
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: tensorrt-llm
spec:
  command:
    - trtllm-serve
  image: nvcr.io/nvidia/tritonserver
  version: 25.03-trtllm-python-py3
  # Do not edit the preset argument name unless you know what you're doing.
  # Feel free to add more arguments to fit your requirements.
  recommendedConfigs:
    - name: default
      args:
        - "{{`{{ .ModelPath }}`}}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
      # Additional shared memory requested for the inference runtime.
      sharedMemorySize: 2Gi
      resources:
        requests:
          cpu: 4
          memory: 16Gi
        limits:
          cpu: 4
          memory: 16Gi
  startupProbe:
    periodSeconds: 10
    failureThreshold: 30
    httpGet:
      path: /health
      port: 8080
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080