A minimal, reproducible distributed‑LLM serving stack that focuses on one thing: low‑latency generation at scale using Ray Serve and vLLM.
- Ray Serve deployment graphs for request routing (a minimal sketch follows this list)
- scripts/build_trt_engine.py: a stub that builds engines with TensorRT‑LLM (sketched after the repository layout below)
- Simplified Dockerfile and requirements.txt
- Prometheus + Grafana example configs for observability
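
The sketch below shows one way such a deployment graph could be structured, assuming serve.py wraps vLLM's AsyncLLMEngine behind a FastAPI /generate route; class names, fields, and defaults here are illustrative rather than the repo's actual code.

```python
# Hypothetical sketch of a Ray Serve deployment wrapping a vLLM engine.
import uuid

from fastapi import FastAPI
from ray import serve
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()
MODEL_ID = "meta-llama/Llama-3-8b-instruct"  # assumption: same model as the quickstart


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self) -> None:
        # One vLLM engine per replica; Ray Serve routes requests across replicas.
        self.engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=MODEL_ID))

    @app.post("/generate")
    async def generate(self, body: dict) -> dict:
        params = SamplingParams(max_tokens=body.get("max_tokens", 128))
        request_id = str(uuid.uuid4())
        final = None
        # vLLM streams partial outputs; keep the last one as the full completion.
        async for output in self.engine.generate(body["prompt"], params, request_id):
            final = output
        return {"text": final.outputs[0].text}


deployment = VLLMDeployment.bind()
# serve.run(deployment)  # started by `python -m inference_platform.serve`
```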
```bash
pip install -r requirements.txt
python -m inference_platform.serve --model meta-llama/Llama-3-8b-instruct
curl -X POST localhost:8000/generate -d '{"prompt":"Hello"}'
```

See docs/README.md for multi‑GPU and Kubernetes guides.
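
The same request from Python, assuming the quickstart server is running locally on the default port 8000:

```python
# Python equivalent of the curl call above.
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Hello"},
    timeout=60,
)
resp.raise_for_status()
print(resp.text)
```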
```
inference_platform/
  serve.py            # Ray Serve deployment graph wrapping vLLM LLMServer
  engine_builder.py   # Convert HF checkpoint ➜ TensorRT‑LLM engine
k8s/
  rayserve-deployment.yaml
grafana/
  dashboards.json
scripts/
  build_trt_engine.py
Dockerfile
requirements.txt
```
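
A rough sketch of the kind of wrapper scripts/build_trt_engine.py could be, assuming it shells out to TensorRT‑LLM's trtllm-build CLI; the exact flags, paths, and the preceding per-model checkpoint-conversion step depend on your model and TensorRT‑LLM version.

```python
# Hypothetical engine-build wrapper around the trtllm-build CLI.
import argparse
import subprocess


def build_engine(checkpoint_dir: str, output_dir: str, max_batch_size: int = 8) -> None:
    # trtllm-build consumes a TensorRT-LLM checkpoint (not a raw HF checkpoint),
    # so a per-model convert_checkpoint step is assumed to have run beforehand.
    subprocess.run(
        [
            "trtllm-build",
            "--checkpoint_dir", checkpoint_dir,
            "--output_dir", output_dir,
            "--max_batch_size", str(max_batch_size),
        ],
        check=True,
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--checkpoint_dir", required=True)
    parser.add_argument("--output_dir", required=True)
    args = parser.parse_args()
    build_engine(args.checkpoint_dir, args.output_dir)
```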