feat(llm-katan): Add Kubernetes deployment support #710
Conversation
Hey @noalimoy, I'll try to catch you during the week to talk about this one.
Hi @Xunzhuo,
Could you share more details on what exactly you're expecting for these sections? I want to be sure I'm implementing the intended scope. Thanks!
@noalimoy ideally we should replace all of the base-model.yaml with qwen0.6B: https://github.com/vllm-project/semantic-router/blob/main/deploy/kubernetes/ai-gateway/aigw-resources/base-model.yaml
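Not the actual `base-model.yaml`, but as a rough sketch of the direction (the image reference, CLI flags, and resource names below are illustrative assumptions, not the file's real contents), the idea would be something like:

```yaml
# Illustrative sketch only -- not the contents of base-model.yaml;
# image tag and CLI flags are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: base-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: base-model
  template:
    metadata:
      labels:
        app: base-model
    spec:
      containers:
        - name: llm-katan
          image: llm-katan:latest  # hypothetical image reference
          args: ["--model", "Qwen/Qwen3-0.6B", "--port", "8000"]
          ports:
            - containerPort: 8000
```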
@yossiovadia PTAL, thanks
- Add comprehensive Kustomize manifests (base + overlays for gpt35/claude)
- Implement initContainer for efficient model caching using PVC
- Fix config.py to read YLLM_SERVED_MODEL_NAME from environment variables
- Add deployment documentation with examples for Kind cluster / Minikube

This enables running multiple llm-katan instances in Kubernetes, each serving a different model alias while sharing the same underlying model. The overlays (gpt35, claude) demonstrate multi-instance deployments where each instance exposes a different served model name (e.g., gpt-3.5-turbo, claude-3-haiku-20240307) via the API. The served model name now works via environment variables, enabling Kubernetes deployments to expose different model names via the API.

Signed-off-by: Noa Limoy <nlimoy@nlimoy-thinkpadp1gen7.raanaii.csb>
Signed-off-by: noalimoy <nlimoy@redhat.com>
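For context, the env-var plumbing is what makes the aliasing work; a per-instance overlay patch could look roughly like this (a sketch; the file, resource, and container names are assumptions, not the PR's actual manifests):

```yaml
# Sketch of a per-instance override (hypothetical file: deployment-patch.yaml);
# resource/container names are assumptions, not the PR's actual manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-katan
spec:
  template:
    spec:
      containers:
        - name: llm-katan
          env:
            - name: YLLM_SERVED_MODEL_NAME
              value: gpt-3.5-turbo  # alias reported by /v1/models
```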
Hi @yossiovadia and @Xunzhuo, I'm summarizing the in-depth discussion @yossiovadia and I had yesterday regarding issue #278.

Decision: Deployment Path Location
After thorough review, we determined that the correct location for …

Rationale: While LLM Katan is designed for testing purposes and could logically fit under …

Additional Context: …
I see some failures; I think they're not related to this PR but environmental. Let's see if those get resolved first.

Summary
This PR adds comprehensive Kubernetes deployment support for llm-katan, enabling multi-instance deployments with model aliasing capabilities.
Kubernetes Manifests (Kustomize-based)
- Base manifests with a dedicated namespace (`llm-katan-system`)

Multi-Instance Support (Overlays)
- gpt35 overlay exposes the `gpt-3.5-turbo` alias (overlay wiring sketched below)
- claude overlay exposes the `claude-3-haiku-20240307` alias
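As a rough sketch of how an overlay might wire this together (directory layout and file names here are assumptions; see the actual manifests in the PR):

```yaml
# Sketch only -- hypothetical layout, e.g. overlays/gpt35/kustomization.yaml;
# the PR's actual file names and paths may differ.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: llm-katan-system
resources:
  - ../../base
patches:
  - path: deployment-patch.yaml  # sets YLLM_SERVED_MODEL_NAME to gpt-3.5-turbo
```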
Model Caching Optimization

- An initContainer (`model-downloader`) pre-downloads models to the PVC (sketched below)
- Uses `python:3.11-slim` + `hf download` for a lightweight (~45MB) init image
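A minimal sketch of that init pattern as a Deployment spec fragment (the volume and PVC names are assumptions, not the PR's actual manifests):

```yaml
# Sketch of the model-caching pattern (Deployment spec fragment);
# volume and claim names are assumptions.
spec:
  template:
    spec:
      initContainers:
        - name: model-downloader
          image: python:3.11-slim
          command: ["sh", "-c"]
          args:
            - >-
              pip install -q "huggingface_hub[cli]" &&
              hf download Qwen/Qwen3-0.6B --local-dir /models/Qwen3-0.6B
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: llm-katan-models  # hypothetical PVC name
```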
Bug Fix (config.py)

- Added `YLLM_SERVED_MODEL_NAME` environment variable support

Documentation
- Deployment guide with examples for Kind / Minikube (`deploy/docs/README.md`)

Test Results
Deployment Validation (Kind Cluster)
Resources Created: namespace (`llm-katan-system`), a Deployment and Service per overlay (`llm-katan-gpt35`, `llm-katan-claude`), and a shared PVC for the model cache.
API Validation:
```bash
# GPT35 instance
$ curl http://llm-katan-gpt35:8000/v1/models
{"data":[{"id":"gpt-3.5-turbo",...}]}

# Claude instance
$ curl http://llm-katan-claude:8000/v1/models
{"data":[{"id":"claude-3-haiku-20240307",...}]}
```
Motivation
This implementation addresses the need to run multiple llm-katan instances in Kubernetes, each exposing a different served model name while sharing the same underlying model.
The Kustomize structure enables a common base configuration with small per-instance overlays (e.g., gpt35, claude).
Related issue: #278