Migrate docs from Sphinx to MkDocs #18145
Changes from 100 commits
New file (`@@ -0,0 +1,51 @@`):

```yaml
nav:
  - Home:
    - vLLM: README.md
    - Getting Started:
      - getting_started/quickstart.md
      - getting_started/installation
      - Examples:
        - LMCache: getting_started/examples/lmcache
        - getting_started/examples/offline_inference
        - getting_started/examples/online_serving
        - getting_started/examples/other
    - Roadmap: https://roadmap.vllm.ai
    - Releases: https://github.com/vllm-project/vllm/releases
  - User Guide:
    - Inference and Serving:
      - serving/offline_inference.md
      - serving/openai_compatible_server.md
      - serving/*
      - serving/integrations
    - Training: training
    - Deployment:
      - deployment/*
      - deployment/frameworks
      - deployment/integrations
    - Performance: performance
    - Models:
      - models/supported_models.md
      - models/generative_models.md
      - models/pooling_models.md
      - models/extensions
    - Features:
      - features/compatibility_matrix.md
      - features/*
      - features/quantization
    - Other:
      - getting_started/*
  - Developer Guide:
    - contributing/overview.md
    - glob: contributing/*
      flatten_single_child_sections: true
    - contributing/model
  - Design Documents:
    - V0: design
    - V1: design/v1
  - API Reference:
    - api/README.md
    - glob: api/vllm/*
      preserve_directory_names: true
  - Community:
    - community/*
    - vLLM Blog: https://blog.vllm.ai
```
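A nav file like the one above is typically consumed by a navigation plugin registered in `mkdocs.yml`. As a minimal sketch of how that wiring usually looks (the plugin name, theme, and option placement are assumptions for illustration, not taken from this PR):

```yaml
# mkdocs.yml (sketch, assuming the mkdocs-awesome-nav plugin)
site_name: vLLM
theme:
  name: material
plugins:
  - search
  - awesome-nav  # resolves the globs and per-entry options shown above
```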
This file was deleted.

**Contributor:** Can we preserve some instructions on how to contribute to and test documentation changes after this migration to MkDocs?

**Author (Member):** Yeah, information about building the docs has moved to https://docs.vllm.ai/en/latest/contributing/overview.html#building-the-docs
Changed file (`@@ -1,43 +1,50 @@`) — the old build instructions are replaced by the new docs home page:

Removed:

````markdown
# vLLM documents

## Build the docs

- Make sure in `docs` directory

```bash
cd docs
```

- Install the dependencies:

```bash
pip install -r ../requirements/docs.txt
```

- Clean the previous build (optional but recommended):

```bash
make clean
```

- Generate the HTML documentation:

```bash
make html
```

## Open the docs with your browser

- Serve the documentation locally:

```bash
python -m http.server -d build/html/
```

This will start a local server at http://localhost:8000. You can now open your browser and view the documentation.

If port 8000 is already in use, you can specify a different port, for example:

```bash
python -m http.server 3000 -d build/html/
```
````

Added:

````markdown
# Welcome to vLLM

<figure markdown="span">
  { align="center" alt="vLLM" class="no-scaled-link" width="60%" }
</figure>

<p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone</strong>
</p>

<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill

vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia Accelerators
- Prefix caching support
- Multi-LoRA support

For more information, check out the following:

- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- [vLLM Meetups][meetups]
````
New file (`@@ -0,0 +1,107 @@`):

````markdown
# Summary

[](){ #configuration }

## Configuration

API documentation for vLLM's configuration classes.

- [vllm.config.ModelConfig][]
- [vllm.config.CacheConfig][]
- [vllm.config.TokenizerPoolConfig][]
- [vllm.config.LoadConfig][]
- [vllm.config.ParallelConfig][]
- [vllm.config.SchedulerConfig][]
- [vllm.config.DeviceConfig][]
- [vllm.config.SpeculativeConfig][]
- [vllm.config.LoRAConfig][]
- [vllm.config.PromptAdapterConfig][]
- [vllm.config.MultiModalConfig][]
- [vllm.config.PoolerConfig][]
- [vllm.config.DecodingConfig][]
- [vllm.config.ObservabilityConfig][]
- [vllm.config.KVTransferConfig][]
- [vllm.config.CompilationConfig][]
- [vllm.config.VllmConfig][]

[](){ #offline-inference-api }

## Offline Inference

LLM Class.

- [vllm.LLM][]

LLM Inputs.

- [vllm.inputs.PromptType][]
- [vllm.inputs.TextPrompt][]
- [vllm.inputs.TokensPrompt][]

## vLLM Engines

Engine classes for offline and online inference.

- [vllm.LLMEngine][]
- [vllm.AsyncLLMEngine][]

## Inference Parameters

Inference parameters for vLLM APIs.

[](){ #sampling-params }
[](){ #pooling-params }

- [vllm.SamplingParams][]
- [vllm.PoolingParams][]

[](){ #multi-modality }

## Multi-Modality

vLLM provides experimental support for multi-modal models through the [vllm.multimodal][] package.

Multi-modal inputs can be passed alongside text and token prompts to [supported models][supported-mm-models]
via the `multi_modal_data` field in [vllm.inputs.PromptType][].

Looking to add your own multi-modal model? Please follow the instructions listed [here][supports-multimodal].

- [vllm.multimodal.MULTIMODAL_REGISTRY][]

### Inputs

User-facing inputs.

- [vllm.multimodal.inputs.MultiModalDataDict][]

Internal data structures.

- [vllm.multimodal.inputs.PlaceholderRange][]
- [vllm.multimodal.inputs.NestedTensors][]
- [vllm.multimodal.inputs.MultiModalFieldElem][]
- [vllm.multimodal.inputs.MultiModalFieldConfig][]
- [vllm.multimodal.inputs.MultiModalKwargsItem][]
- [vllm.multimodal.inputs.MultiModalKwargs][]
- [vllm.multimodal.inputs.MultiModalInputs][]

### Data Parsing

- [vllm.multimodal.parse][]

### Data Processing

- [vllm.multimodal.processing][]

### Memory Profiling

- [vllm.multimodal.profiling][]

### Registry

- [vllm.multimodal.registry][]

## Model Development

- [vllm.model_executor.models.interfaces_base][]
- [vllm.model_executor.models.interfaces][]
- [vllm.model_executor.models.adapters][]
````
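The bracketed entries in this summary use mkdocstrings-style identifier cross-references, and the empty-link `[](){ #... }` lines define named anchors via the `attr_list` extension. A minimal sketch of the two patterns (the page content here is illustrative, not from this PR):

```markdown
[](){ #my-anchor }          <!-- named anchor; link to it elsewhere with [text][my-anchor] -->

- [vllm.SamplingParams][]   <!-- autorefs-style link that resolves to that object's API page -->
```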
New file (`@@ -0,0 +1,2 @@`):

```yaml
search:
  boost: 0.5
```
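This snippet down-weights a directory of pages in search results. Assuming the Material for MkDocs search plugin (a reasonable guess given the rest of the PR, not confirmed by it), the same knob can also be set per page through front matter:

```yaml
---
search:
  boost: 0.5   # values below 1 demote a page in search ranking; above 1 promote it
---
```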
New file (`@@ -0,0 +1,23 @@`):

````markdown
---
title: Adding a New Model
---
[](){ #new-model }

This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.

Contents:

- [Basic](basic.md)
- [Registration](registration.md)
- [Tests](tests.md)
- [Multimodal](multimodal.md)

!!! note
    The complexity of adding a new model depends heavily on the model's architecture.
    The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
    However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.

!!! tip
    If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
    or ask on our [developer slack](https://slack.vllm.ai).
    We will be happy to help you out!
````
**Reviewer:** Does this change replace the instructions for pulling and building the docs with an overview of what vLLM is? The previous file had instructions for building the docs locally, while the new file looks like the vLLM home page. Is that intentional?

**Author:** All the docs have been moved one level up, from `docs/source/*` to `docs/*`, so this README has become the home page for the docs. So yes, it was intentional. I have added a section in the contributing docs on how to build the docs.

**Author:** The reason to do this (as well as changing any `index.md` files to `README.md` files) is that the docs are nicer to browse.
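The path changes described above are mechanical, so they can be sketched as a small mapping helper (hypothetical code for illustration only; `mkdocs_page_name` is not part of the PR):

```python
from pathlib import PurePosixPath

def mkdocs_page_name(path: str) -> str:
    """Map a pre-migration doc path to its post-migration location:
    docs/source/* moves up one level to docs/*, and index.md becomes README.md."""
    p = PurePosixPath(path)
    if p.name == "index.md":
        p = p.with_name("README.md")  # README.md also renders in GitHub's file browser
    parts = list(p.parts)
    if parts[:2] == ["docs", "source"]:
        parts = ["docs"] + parts[2:]  # drop the Sphinx-era source/ level
    return str(PurePosixPath(*parts))

print(mkdocs_page_name("docs/source/models/index.md"))  # docs/models/README.md
```

The same GitHub-browsability motivation explains the `index.md` → `README.md` rename: GitHub shows `README.md` automatically when you open a directory.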