
feat(llm-katan): Add Kubernetes deployment support#710

Merged
rootfs merged 2 commits into vllm-project:main from noalimoy:feat/add_kubernetes_support_for_llm_katan/278
Dec 11, 2025

Conversation

@noalimoy
Collaborator

Summary

This PR adds comprehensive Kubernetes deployment support for llm-katan, enabling multi-instance deployments with model aliasing capabilities.

Kubernetes Manifests (Kustomize-based)

  • Base deployment with security contexts and health probes
  • PersistentVolumeClaim (5Gi) for efficient model caching
  • Service (ClusterIP) exposing port 8000
  • Namespace isolation (llm-katan-system)
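
The base Deployment described above might look roughly like the following sketch. The image reference, model path, and probe endpoints are illustrative assumptions, not the exact manifest from this PR:

```yaml
# Illustrative base deployment.yaml -- image ref, probe paths, and
# volume names here are assumptions for the purposes of this sketch.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-katan
  namespace: llm-katan-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-katan
  template:
    metadata:
      labels:
        app: llm-katan
    spec:
      securityContext:
        runAsNonRoot: true        # base security context
      containers:
        - name: llm-katan
          image: ghcr.io/vllm-project/llm-katan:latest  # hypothetical image ref
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health       # assumed health endpoint
              port: 8000
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
          volumeMounts:
            - name: models
              mountPath: /models  # backed by the 5Gi PVC
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: llm-katan-models
```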

Multi-Instance Support (Overlays)

  • gpt35 overlay: Serves gpt-3.5-turbo alias
  • claude overlay: Serves claude-3-haiku-20240307 alias
  • Isolated PVCs per instance (prevents ReadWriteOnce conflicts)
  • Common labels component for consistent resource labeling
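
An overlay of this shape typically only renames resources and injects the served model alias; a hypothetical `overlays/gpt35/kustomization.yaml` (suffix, paths, and patch details are assumptions) could look like:

```yaml
# Hypothetical gpt35 overlay -- nameSuffix and patch structure are
# assumptions based on the resource names listed in this PR.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
components:
  - ../../components/common   # shared labels component
nameSuffix: -gpt35            # yields llm-katan-gpt35, llm-katan-models-gpt35, ...
patches:
  - target:
      kind: Deployment
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/env
        value:
          - name: YLLM_SERVED_MODEL_NAME
            value: gpt-3.5-turbo
```

An instance would then be deployed with `kubectl apply -k overlays/gpt35`, and a new alias only needs a new overlay directory, with no changes to the base.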

Model Caching Optimization

  • InitContainer (model-downloader) pre-downloads models to PVC
  • Smart caching: Skips download if model exists
  • Uses python:3.11-slim with hf download for a lightweight (~45MB) init image
  • Main container starts instantly with cached model
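
The model-downloader pattern described above can be sketched as follows; the model name and cache path are illustrative assumptions, not the exact values from this PR:

```yaml
# Sketch of the model-downloader initContainer with skip-if-cached logic.
initContainers:
  - name: model-downloader
    image: python:3.11-slim
    command:
      - sh
      - -c
      - |
        # Smart caching: skip the download when the model already sits on the PVC
        if [ -d /models/Qwen/Qwen3-0.6B ]; then
          echo "Model already cached, skipping download"
        else
          pip install --no-cache-dir huggingface_hub
          hf download Qwen/Qwen3-0.6B --local-dir /models/Qwen/Qwen3-0.6B
        fi
    volumeMounts:
      - name: models
        mountPath: /models   # same PVC the main container mounts
```

Because the PVC persists across pod restarts, only the first pod on a node pays the download cost; subsequent starts hit the cached directory and the main container comes up immediately.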

Bug Fix (config.py)

  • Added YLLM_SERVED_MODEL_NAME environment variable support
  • Previously only worked via CLI arguments
  • Now enables Kubernetes env-based configuration
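
A minimal sketch of the kind of change described above, assuming an explicitly passed value (e.g. from a CLI flag) should still take precedence over the environment; the names `ServerConfig` and `_default_served_model_name` are hypothetical and the actual `config.py` may differ:

```python
import os
from dataclasses import dataclass, field


def _default_served_model_name() -> str:
    # Fall back to the YLLM_SERVED_MODEL_NAME env var so Kubernetes
    # deployments can set the alias without CLI arguments.
    return os.environ.get("YLLM_SERVED_MODEL_NAME", "")


@dataclass
class ServerConfig:
    model_name: str = "Qwen/Qwen3-0.6B"
    served_model_name: str = field(default_factory=_default_served_model_name)

    def __post_init__(self) -> None:
        # If neither a CLI value nor the env var was provided,
        # serve under the real model name.
        if not self.served_model_name:
            self.served_model_name = self.model_name
```

With this shape, each Kubernetes overlay only needs to set `YLLM_SERVED_MODEL_NAME` in the container's `env` block to expose a different alias from the same base manifest.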

Documentation

  • Comprehensive deployment guide (deploy/docs/README.md)
  • Architecture explanation (Pod structure, storage, networking)
  • Kind cluster setup examples
  • Troubleshooting section with common issues

Test Results

Deployment Validation (Kind Cluster)

Resources Created:

  • Namespace: llm-katan-system
  • Deployments: llm-katan-gpt35, llm-katan-claude (both 1/1 Running)
  • Services: llm-katan-gpt35, llm-katan-claude (ClusterIP, port 8000)
  • PVCs: llm-katan-models-gpt35, llm-katan-models-claude (both 5Gi Bound)

API Validation:

Motivation

This implementation addresses the need for:

  • Cloud-native deployments: Production-ready Kubernetes manifests
  • Multi-instance testing: Run multiple model aliases simultaneously
  • Efficient resource usage: Model caching prevents redundant downloads
  • Testing flexibility: Easy overlay creation for new model aliases

The Kustomize structure enables:

  • Consistent base configuration
  • Environment-specific customization via overlays
  • Easy addition of new model aliases without base changes

Related issue: #278

@netlify

netlify bot commented Nov 20, 2025

Deploy Preview for vllm-semantic-router ready!

🔨 Latest commit: b2fc5c8
🔍 Latest deploy log: https://app.netlify.com/projects/vllm-semantic-router/deploys/693aefb9e9dce90008086faa
😎 Deploy Preview: https://deploy-preview-710--vllm-semantic-router.netlify.app

@github-actions
Contributor

github-actions bot commented Nov 20, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 deploy

Owners: @rootfs, @Xunzhuo
Files changed:

  • deploy/kubernetes/llm-katan/README.md
  • deploy/kubernetes/llm-katan/base/deployment.yaml
  • deploy/kubernetes/llm-katan/base/kustomization.yaml
  • deploy/kubernetes/llm-katan/base/namespace.yaml
  • deploy/kubernetes/llm-katan/base/pvc.yaml
  • deploy/kubernetes/llm-katan/base/service.yaml
  • deploy/kubernetes/llm-katan/components/common/kustomization.yaml
  • deploy/kubernetes/llm-katan/overlays/claude/kustomization.yaml
  • deploy/kubernetes/llm-katan/overlays/gpt35/kustomization.yaml
  • deploy/kubernetes/llm-katan/verify-deployment.sh

📁 e2e-tests

Owners: @yossiovadia
Files changed:

  • e2e-tests/llm-katan/README.md
  • e2e-tests/llm-katan/llm_katan/config.py
  • e2e-tests/llm-katan/pyproject.toml

📁 tools

Owners: @yuluo-yx, @rootfs, @Xunzhuo
Files changed:

  • tools/make/kube.mk


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

@noalimoy noalimoy force-pushed the feat/add_kubernetes_support_for_llm_katan/278 branch from 04e7542 to 62e5d54 on November 23, 2025 21:56
@yossiovadia
Collaborator

Hey @noalimoy, I'll try to catch you during the week to talk about this one.

@noalimoy
Collaborator Author

Hi @Xunzhuo
I'm working on this issue and need some clarification about the documentation requirements mentioned in it:

  • Examples for common use cases (CI/CD testing, multi-model setups)
  • Integration patterns with the main Semantic Router

Could you share more details on what exactly you're expecting for these sections? I want to be sure I'm implementing the intended scope.

Specifically:
Should this include adding a test-llm-katan make target to tools/make/kube.mk that will run in the CI workflow, or something else?

Thanks!

@Xunzhuo
Member

Xunzhuo commented Nov 25, 2025

@noalimoy noalimoy force-pushed the feat/add_kubernetes_support_for_llm_katan/278 branch from 1d38d8e to 82e4a5b on December 2, 2025 11:47
@noalimoy noalimoy force-pushed the feat/add_kubernetes_support_for_llm_katan/278 branch from 82e4a5b to 64b700c on December 3, 2025 12:02
@rootfs rootfs requested a review from yossiovadia December 3, 2025 15:26
@noalimoy noalimoy force-pushed the feat/add_kubernetes_support_for_llm_katan/278 branch from 64b700c to 7ba97ca on December 4, 2025 11:15
@noalimoy noalimoy marked this pull request as ready for review December 6, 2025 22:13
@rootfs
Collaborator

rootfs commented Dec 6, 2025

@yossiovadia PTAL, thanks

- Add comprehensive Kustomize manifests (base + overlays for gpt35/claude)
- Implement initContainer for efficient model caching using PVC
- Fix config.py to read YLLM_SERVED_MODEL_NAME from environment variables
- Add deployment documentation with examples for Kind cluster / Minikube

This enables running multiple llm-katan instances in Kubernetes, each
serving different model aliases while sharing the same underlying model.
The overlays (gpt35, claude) demonstrate multi-instance deployments where
each instance exposes a different served model name (e.g., gpt-3.5-turbo,
claude-3-haiku-20240307) via the API.

The served model name now works via environment variables, enabling
Kubernetes deployments to expose different model names via the API.

Signed-off-by: Noa Limoy <nlimoy@nlimoy-thinkpadp1gen7.raanaii.csb>
Signed-off-by: noalimoy <nlimoy@redhat.com>
@noalimoy noalimoy force-pushed the feat/add_kubernetes_support_for_llm_katan/278 branch from 9705981 to eef9b68 on December 11, 2025 13:13
@noalimoy
Collaborator Author

noalimoy commented Dec 11, 2025

Hi @yossiovadia and @Xunzhuo

I'm summarizing the in-depth discussion @Yossi and I had yesterday regarding Issue #278.

Decision: Deployment Path Location

After thorough review, we determined that the correct location for llm-katan's Kubernetes deployment code is deploy/kubernetes/llm-katan/.

Rationale:

While LLM Katan is designed for testing purposes and could logically fit under e2e-tests/, we decided on deploy/kubernetes/ for the following reasons:

  1. Project Alignment: This path aligns with the existing project structure where Kubernetes cluster deployment configurations are located under deploy/kubernetes/ folder

  2. Future Integration: This decision connects with Yossi's vision to add Kubernetes support for a full Semantic Router deployment with LLM Katan integration, and ensures proper integration with the current implementation of Issue #278 (Add Kubernetes Support for LLM Katan)

  3. Consistency: Currently, full deployment options exist via Helm, Docker Compose, and OpenShift, but a pure Kubernetes manifest-based deployment with LLM Katan integration is missing. This PR lays the foundation for that future work.

Additional Context:

  • This creates a clear separation: deploy/kubernetes/llm-katan/ provides Kubernetes manifests for standalone llm-katan deployment, while future work can add integrated deployments
  • I've thoroughly tested the deployment with multi-instance configurations (GPT-3.5 and Claude overlays) and verified all endpoints work correctly
  • A follow-up issue will be created to add a full Semantic Router Kubernetes deployment with llm-katan integration, building on this foundation

@yossiovadia
Collaborator

I see some failures; I think those are unrelated and environmental. Let's see if those get resolved first.

@rootfs rootfs merged commit c23d5b4 into vllm-project:main Dec 11, 2025
39 of 40 checks passed