Commit 1cb194a

[Doc] Reorganize user guide (#18661)
Signed-off-by: DarkLight1337 <[email protected]>
1 parent 2cd4d58 commit 1cb194a

27 files changed

+211
-216
lines changed

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,4 +3,4 @@ FILL IN THE PR DESCRIPTION HERE
33
FIX #xxxx (*link existing issues this PR will resolve*)
44

55
<!--- pyml disable-next-line no-emphasis-as-heading -->
6-
**BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing/overview.html>** (anything written below this line will be removed by GitHub Actions)
6+
**BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing>** (anything written below this line will be removed by GitHub Actions)

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,3 @@
11
# Contributing to vLLM
22

3-
You may find information about contributing to vLLM on [docs.vllm.ai](https://docs.vllm.ai/en/latest/contributing/overview.html).
3+
You may find information about contributing to vLLM on [docs.vllm.ai](https://docs.vllm.ai/en/latest/contributing).

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -100,7 +100,7 @@ Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
100100
## Contributing
101101

102102
We welcome and value any contributions and collaborations.
103-
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/stable/contributing/overview.html) for how to get involved.
103+
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/stable/contributing) for how to get involved.
104104

105105
## Sponsors
106106

docs/.nav.yml

Lines changed: 18 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -5,29 +5,35 @@ nav:
55
- getting_started/quickstart.md
66
- getting_started/installation
77
- Examples:
8-
- LMCache: getting_started/examples/lmcache
9-
- getting_started/examples/offline_inference
10-
- getting_started/examples/online_serving
11-
- getting_started/examples/other
8+
- Offline Inference: getting_started/examples/offline_inference
9+
- Online Serving: getting_started/examples/online_serving
10+
- Others:
11+
- LMCache: getting_started/examples/lmcache
12+
- getting_started/examples/other/*
1213
- Quick Links:
13-
- User Guide: serving/offline_inference.md
14-
- Developer Guide: contributing/overview.md
14+
- User Guide: usage/README.md
15+
- Developer Guide: contributing/README.md
1516
- API Reference: api/README.md
1617
- Timeline:
1718
- Roadmap: https://roadmap.vllm.ai
1819
- Releases: https://github.com/vllm-project/vllm/releases
1920
- User Guide:
21+
- usage/README.md
22+
- General:
23+
- usage/*
2024
- Inference and Serving:
2125
- serving/offline_inference.md
2226
- serving/openai_compatible_server.md
2327
- serving/*
2428
- serving/integrations
25-
- Training: training
2629
- Deployment:
2730
- deployment/*
2831
- deployment/frameworks
2932
- deployment/integrations
30-
- Performance: performance
33+
- Training: training
34+
- Configuration:
35+
- Summary: configuration/README.md
36+
- configuration/*
3137
- Models:
3238
- models/supported_models.md
3339
- models/generative_models.md
@@ -37,12 +43,11 @@ nav:
3743
- features/compatibility_matrix.md
3844
- features/*
3945
- features/quantization
40-
- Other:
41-
- getting_started/*
4246
- Developer Guide:
43-
- contributing/overview.md
44-
- glob: contributing/*
45-
flatten_single_child_sections: true
47+
- contributing/README.md
48+
- General:
49+
- glob: contributing/*
50+
flatten_single_child_sections: true
4651
- Model Implementation: contributing/model
4752
- Design Documents:
4853
- V0: design

docs/configuration/README.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
# Configuration Options
2+
3+
This section lists the most common options for running the vLLM engine.
4+
For a full list, refer to the [configuration][configuration] page.
Lines changed: 144 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,144 @@
1+
# Conserving Memory
2+
3+
Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
4+
5+
## Tensor Parallelism (TP)
6+
7+
Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.
8+
9+
The following code splits the model across 2 GPUs.
10+
11+
```python
12+
from vllm import LLM
13+
14+
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
15+
tensor_parallel_size=2)
16+
```
17+
18+
!!! warning
19+
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
20+
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
21+
22+
To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
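
A minimal sketch of the recommendation above, assuming a machine where GPUs 0 and 1 are the ones vLLM should see (the device IDs are illustrative). The variable has to be set before anything initializes CUDA, so it comes ahead of the vLLM import. That import is left commented out here so the snippet runs on a machine without a GPU.

```python
import os

# Hypothetical choice: expose only GPUs 0 and 1 to this process.
# This must happen before CUDA is initialized, i.e. before the engine is built.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# from vllm import LLM
# llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} device(s) visible: {visible}")
```

Setting the variable in the launching shell instead has the same effect and avoids any ordering concerns inside the script.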
23+
24+
!!! note
25+
With tensor parallelism enabled, each process reads the whole model and splits it into chunks, so disk reading time grows in proportion to the tensor parallel size.
26+
27+
You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
28+
29+
## Quantization
30+
31+
Quantized models take less memory at the cost of lower precision.
32+
33+
Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
34+
and used directly without extra configuration.
35+
36+
Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.
37+
38+
## Context length and batch size
39+
40+
You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
41+
and the maximum batch size (`max_num_seqs` option).
42+
43+
```python
44+
from vllm import LLM
45+
46+
llm = LLM(model="adept/fuyu-8b",
47+
max_model_len=2048,
48+
max_num_seqs=2)
49+
```
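
To see why these two options matter, here is a back-of-the-envelope estimate of the KV cache needed when every sequence in a full batch reaches the maximum context length. The formula is the standard one for transformer attention (one key and one value tensor per layer). The model dimensions are illustrative assumptions rather than values read from a checkpoint, and the helper `kv_cache_bytes` is hypothetical, not part of vLLM. vLLM's actual allocation is governed by its own memory profiling, which this sketch does not model.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache needed to hold `num_tokens` tokens."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens


# Same limits as the example above.
max_model_len = 2048
max_num_seqs = 2
worst_case_tokens = max_model_len * max_num_seqs

# Assumed dimensions for an 8B-class model with 16-bit KV entries.
size = kv_cache_bytes(worst_case_tokens, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{size / 2**20:.0f} MiB")  # prints "512 MiB"
```

Halving either `max_model_len` or `max_num_seqs` halves this worst-case figure, which is why both are useful levers when memory is tight.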
50+
51+
## Reduce CUDA Graphs
52+
53+
By default, we optimize model inference using CUDA graphs, which take up extra GPU memory.
54+
55+
!!! warning
56+
CUDA graph capture takes up more memory in V1 than in V0.
57+
58+
You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
59+
60+
```python
61+
from vllm import LLM
62+
from vllm.config import CompilationConfig, CompilationLevel
63+
64+
llm = LLM(
65+
model="meta-llama/Llama-3.1-8B-Instruct",
66+
compilation_config=CompilationConfig(
67+
level=CompilationLevel.PIECEWISE,
68+
# By default, it goes up to max_num_seqs
69+
cudagraph_capture_sizes=[1, 2, 4, 8, 16],
70+
),
71+
)
72+
```
73+
74+
You can disable graph capturing completely via the `enforce_eager` flag:
75+
76+
```python
77+
from vllm import LLM
78+
79+
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
80+
enforce_eager=True)
81+
```
82+
83+
## Adjust cache size
84+
85+
If you run out of CPU RAM, try the following options:
86+
87+
- (Multi-modal models only) you can set the size of the multi-modal input cache using the `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
88+
- (CPU backend only) you can set the size of the KV cache using the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
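
A minimal sketch of setting both variables from Python, assuming they are read when the engine starts and should therefore be set before vLLM is imported. The values 2 and 8 are illustrative, not recommendations.

```python
import os

# Environment variable values are strings; both sizes are in GiB.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"  # multi-modal input cache
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"   # KV cache (CPU backend only)

# from vllm import LLM  # import only after the environment is configured

for name in ("VLLM_MM_INPUT_CACHE_GIB", "VLLM_CPU_KVCACHE_SPACE"):
    print(f"{name}={os.environ[name]}")
```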
89+
90+
## Multi-modal input limits
91+
92+
You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:
93+
94+
```python
95+
from vllm import LLM
96+
97+
# Accept up to 3 images and 1 video per prompt
98+
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
99+
limit_mm_per_prompt={"image": 3, "video": 1})
100+
```
101+
102+
You can go a step further and disable an unused modality completely by setting its limit to zero.
103+
For example, if your application only accepts image input, there is no need to allocate any memory for videos.
104+
105+
```python
106+
from vllm import LLM
107+
108+
# Accept any number of images but no videos
109+
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
110+
limit_mm_per_prompt={"video": 0})
111+
```
112+
113+
You can even run a multi-modal model for text-only inference:
114+
115+
```python
116+
from vllm import LLM
117+
118+
# Don't accept images. Just text.
119+
llm = LLM(model="google/gemma-3-27b-it",
120+
limit_mm_per_prompt={"image": 0})
121+
```
122+
123+
## Multi-modal processor arguments
124+
125+
For certain models, you can adjust the multi-modal processor arguments to
126+
reduce the size of the processed multi-modal inputs, which in turn saves memory.
127+
128+
Here are some examples:
129+
130+
```python
131+
from vllm import LLM
132+
133+
# Available for Qwen2-VL series models
134+
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
135+
mm_processor_kwargs={
136+
"max_pixels": 768 * 768, # Default is 1280 * 28 * 28
137+
})
138+
139+
# Available for InternVL series models
140+
llm = LLM(model="OpenGVLab/InternVL2-2B",
141+
mm_processor_kwargs={
142+
"max_dynamic_patch": 4, # Default is 12
143+
})
144+
```
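
For a rough sense of what the `max_pixels` change above buys, the arithmetic below compares the two pixel budgets. It assumes that each 28 x 28 pixel area corresponds to one visual token, which is an inference from the form of the default (`1280 * 28 * 28`) and not something this page states. Treat the token counts as estimates.

```python
PIXELS_PER_TOKEN = 28 * 28  # assumption: one visual token per 28x28 pixel area

default_max_pixels = 1280 * 28 * 28
reduced_max_pixels = 768 * 768

default_tokens = default_max_pixels // PIXELS_PER_TOKEN
reduced_tokens = reduced_max_pixels // PIXELS_PER_TOKEN

print(default_tokens, reduced_tokens)  # prints "1280 752"
print(f"reduction: {1 - reduced_max_pixels / default_max_pixels:.0%}")
```

Under that assumption, the smaller budget caps each image at roughly 750 tokens instead of 1280.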
File renamed without changes.
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
# Model Resolution
2+
3+
vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
4+
and finding the corresponding implementation that is registered to vLLM.
5+
Nevertheless, our model resolution may fail for the following reasons:
6+
7+
- The `config.json` of the model repository lacks the `architectures` field.
8+
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
9+
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
10+
11+
To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
12+
For example:
13+
14+
```python
15+
from vllm import LLM
16+
17+
model = LLM(
18+
model="cerebras/Cerebras-GPT-1.3B",
19+
hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2
20+
)
21+
```
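
The first failure case above (a `config.json` without an `architectures` field) can be detected before loading the model. The sketch below uses a hypothetical helper, `architecture_override`, that is not part of vLLM. It parses the text of a `config.json` and returns a dictionary suitable for `hf_overrides` only when the field is missing or empty.

```python
import json


def architecture_override(config_text, fallback):
    """Return an hf_overrides-style dict if `architectures` is missing, else {}."""
    config = json.loads(config_text)
    if config.get("architectures"):
        return {}  # the repository already names its architecture
    return {"architectures": [fallback]}


# Illustrative config contents, not copied from a real repository.
missing = '{"model_type": "gpt2"}'
present = '{"model_type": "gpt2", "architectures": ["GPT2LMHeadModel"]}'

print(architecture_override(missing, "GPT2LMHeadModel"))
print(architecture_override(present, "GPT2LMHeadModel"))
```

The caller still has to know which architecture name is correct. The helper only decides whether an override is needed.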
22+
23+
Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.

docs/performance/optimization.md renamed to docs/configuration/optimization.md

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,4 @@
1-
---
2-
title: Optimization and Tuning
3-
---
4-
[](){ #optimization-and-tuning }
1+
# Optimization and Tuning
52

63
This guide covers optimization strategies and performance tuning for vLLM V1.
74

File renamed without changes.
