Closed

39 commits
07ca70a
[Core][Easy] Use envs.__getattr__ for all Unify to environment variab…
Jialin Oct 15, 2025
9354660
[Bugfix]fix Qwen3 xml tool parser (#26345)
Zhikaiiii Oct 15, 2025
bfad142
[BUGFIX][NIXL] quick fix for 'assert self.connector_worker is not Non…
xuechendi Oct 15, 2025
e66d787
Disable FlashInfer sampler by default (#26859)
mgoin Oct 15, 2025
96b9aa5
[Frontend][torch.compile] CompilationConfig Overhaul (#20283): name c…
morrison-turnansky Oct 15, 2025
a2986b3
[Bugfix] Fixes prefix-repetition benchmark script (#26828)
kouroshHakha Oct 15, 2025
85a65e7
[Model] Add DeepSeek-V3.1 reasoning parser (split from PR #24972) (#2…
taohui Oct 15, 2025
c43ca82
[Docs] Move build.inc into arm.inc (#26862)
windsonsea Oct 15, 2025
e471d7c
[CI/Build][Bugfix] fix qutlass cmake error when set QUTLASS_SRC_DIR (…
izhuhaoran Oct 15, 2025
a27b288
[Feature] default --extra-body param to disable thinking in vllm benc…
lengrongfu Oct 15, 2025
7cfa420
[BugFix] Patch inductor partitioning logic (#26735)
angelayi Oct 15, 2025
8c851f6
[Bugfix] Fix qwen3-omni audio truncation issue (#26815)
Isotr0py Oct 15, 2025
f0862ea
[Graph Partition] pass tests for decorator (#26831)
BoyuanFeng Oct 15, 2025
8865da1
[Bugfix][Multi Modal] Fix incorrect Molmo token processing (#26873)
sangho-vision Oct 15, 2025
302ef40
[DSA][MLA] Tiny refactor on DeepSeek to make it reusable for differen…
MengqingCao Oct 15, 2025
b8a4572
[Misc] Use helper function to generate dummy messages in OpenAI MM te…
DarkLight1337 Oct 15, 2025
efdef57
[bugfix] Lazy import cv2 (#26869)
angelayi Oct 15, 2025
f5ed68e
[Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather (#26456)
zyongye Oct 15, 2025
f3c378f
[CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI (#…
zhewenl Oct 15, 2025
71557a5
[CI] Fix mypy for `vllm/executor` (#26845)
yewentao256 Oct 15, 2025
6256697
[Doc] ruff format remaining Python examples (#26795)
DarkLight1337 Oct 15, 2025
650b51f
[doc] add Context Parallel Deployment doc (#26877)
youkaichao Oct 15, 2025
5210dc3
[Misc] Update TritonLanguagePlaceholder to have attributes that are u…
madongfly Oct 15, 2025
5c3bae1
[Fix] Remove divisibility requirement between num_kv_heads and tp_siz…
ant-yy Oct 15, 2025
7f83b4e
[Easy] Get rid of unnecessary paraenthesis in kv_cache_manager (#26842)
Jialin Oct 15, 2025
db1764e
[Platform] allow platform to init dp group (#22243)
wangxiyuan Oct 15, 2025
d4d1a60
[Lora]Load tuned multi-lora kernel configs from json files (#26319)
li2haipeng Oct 15, 2025
f54f851
[Model][2/N] Improve all pooling task | Support multi-vector retrieva…
noooop Oct 15, 2025
4cf0141
added parser for moe detection with test
morrison-turnansky Oct 14, 2025
c264394
Set up -O infrastrucutre
adabeyta Oct 14, 2025
4c8f770
name change and removed editing backedn in _apply_optimization_level …
morrison-turnansky Oct 14, 2025
3b1c862
updated defaults for each pass config
morrison-turnansky Oct 15, 2025
49e52fd
set cuda graph mode defaults
morrison-turnansky Oct 15, 2025
2ee0cb8
added skelaton for non model specifc settings, and test to veriy that…
morrison-turnansky Oct 15, 2025
f33d830
made is_model_moe inaccessible from user
morrison-turnansky Oct 15, 2025
3b6d03f
added parsing function to determine if model is quantized
morrison-turnansky Oct 15, 2025
23985af
added model specific optimizations
morrison-turnansky Oct 15, 2025
0719123
updated default config design
morrison-turnansky Oct 15, 2025
74d5a6a
added vllm config default test
morrison-turnansky Oct 16, 2025
12 changes: 12 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
model_name: "HandH1998/QQQ-Llama-3-8b-g128"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.419
- name: "exact_match,flexible-extract"
value: 0.416
limit: 1000
num_fewshot: 5
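Configs like the one above pin baseline metric values that CI compares against measured results. A minimal sketch of such a comparison (illustrative only; the helper name and tolerance are ours, and the real check lives in `test_lm_eval_correctness.py`, whose tolerance logic is not shown in this diff):

```python
# Illustrative only: how a measured score might be checked against the
# baseline "value" pinned in a config such as the one above.
def close_to_baseline(measured: float, baseline: float, rtol: float = 0.05) -> bool:
    # Accept any measurement within a relative tolerance of the baseline.
    return abs(measured - baseline) <= rtol * baseline

# gsm8k strict-match baseline from the config above
assert close_to_baseline(0.425, 0.419)
assert not close_to_baseline(0.30, 0.419)
```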
@@ -0,0 +1,11 @@
# For vllm-vlm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -b 32 -l 100 -t 8
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
backend: "vllm-vlm"
tasks:
- name: "chartqa"
metrics:
- name: "relaxed_accuracy,none"
value: 0.90
limit: 100
num_fewshot: 0
@@ -0,0 +1,11 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -b 32 -l 250 -t 8 -f 5
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
backend: "vllm-vlm"
tasks:
- name: "mmlu_pro"
metrics:
- name: "exact_match,custom-extract"
value: 0.80
limit: 250 # will run on 250 * 14 subjects = 3500 samples
num_fewshot: 5
@@ -1,4 +1,5 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -b auto -l 1319 -f 5 -t 1
# For vllm script, with -t option (tensor parallel size)
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -l 1319 -t 1
model_name: "RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic"
tasks:
- name: "gsm8k"
12 changes: 12 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2.5-VL-7B-Instruct.yaml
@@ -0,0 +1,12 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m Qwen/Qwen2.5-VL-7B-Instruct -l 2500 -t 1

model_name: "Qwen/Qwen2.5-VL-7B-Instruct"
backend: "vllm-vlm"
tasks:
- name: "chartqa"
metrics:
- name: "relaxed_accuracy,none"
value: 0.855
limit: 2500
num_fewshot: 0
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-large-h100.txt
@@ -0,0 +1 @@
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml
@@ -0,0 +1 @@
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8-MM.yaml
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-mm-small.txt
@@ -0,0 +1 @@
Qwen2.5-VL-7B-Instruct.yaml
@@ -0,0 +1,44 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on chartqa for vllm.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.9

usage() {
echo
echo "Runs lm eval harness on ChartQA using multimodal vllm."
echo "This pathway is intended to be used to create baselines for "
echo "our correctness tests in vllm's CI."
echo
echo "usage: ${0} <options>"
echo
echo " -m - huggingface stub or local directory of the model"
echo " -l - limit number of samples to run"
echo " -t - tensor parallel size to run at"
echo
}

while getopts "m:l:t:" OPT; do
case ${OPT} in
m )
MODEL="$OPTARG"
;;
l )
LIMIT="$OPTARG"
;;
t )
TP_SIZE="$OPTARG"
;;
\? )
usage
exit 1
;;
esac
done

lm_eval --model vllm-vlm \
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE" \
--tasks chartqa \
--batch_size auto \
--apply_chat_template \
--limit "$LIMIT"
Empty file modified .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
100644 → 100755
Empty file.
50 changes: 50 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh
@@ -0,0 +1,50 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on MMLUPRO for vllm.
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

usage() {
echo
echo "Runs lm eval harness on MMLU Pro using vllm."
echo "This pathway is intended to be used to create baselines for "
echo "our automated nm-test-accuracy workflow."
echo
echo "usage: ${0} <options>"
echo
echo " -m - huggingface stub or local directory of the model"
echo " -b - batch size to run the evaluation at"
echo " -l - limit number of samples to run"
echo " -f - number of fewshot samples to use"
echo " -t - tensor parallel size to run at"
echo
}

while getopts "m:b:l:f:t:" OPT; do
case ${OPT} in
m )
MODEL="$OPTARG"
;;
b )
BATCH_SIZE="$OPTARG"
;;
l )
LIMIT="$OPTARG"
;;
f )
FEWSHOT="$OPTARG"
;;
t )
TP_SIZE="$OPTARG"
;;
\? )
usage
exit 1
;;
esac
done

lm_eval --model vllm \
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,add_bos_token=true,trust_remote_code=true,max_model_len=4096" \
--tasks mmlu_pro --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "${BATCH_SIZE:-auto}"
12 changes: 9 additions & 3 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -19,21 +19,27 @@
def launch_lm_eval(eval_config, tp_size):
trust_remote_code = eval_config.get("trust_remote_code", False)
max_model_len = eval_config.get("max_model_len", 4096)
batch_size = eval_config.get("batch_size", "auto")
backend = eval_config.get("backend", "vllm")
model_args = (
f"pretrained={eval_config['model_name']},"
f"tensor_parallel_size={tp_size},"
f"enforce_eager=true,"
f"add_bos_token=true,"
f"trust_remote_code={trust_remote_code},"
f"max_model_len={max_model_len}"
f"max_model_len={max_model_len},"
)
results = lm_eval.simple_evaluate(
model="vllm",
model=backend,
model_args=model_args,
tasks=[task["name"] for task in eval_config["tasks"]],
num_fewshot=eval_config["num_fewshot"],
limit=eval_config["limit"],
batch_size="auto",
# TODO(yeq): using chat template w/ fewshot_as_multiturn is supposed to help
# text models. however, this is regressing measured strict-match for
# existing text models in CI, so only apply it for mm.
apply_chat_template=backend == "vllm-vlm",
batch_size=batch_size,
)
return results

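The hunk above makes `backend` and `batch_size` optional config keys with defaults. A small sketch of that default-resolution behavior (`build_eval_kwargs` is our name for illustration, not part of the harness):

```python
# Sketch of how launch_lm_eval above resolves the new optional config
# keys; build_eval_kwargs is illustrative, not part of the test harness.
def build_eval_kwargs(eval_config: dict, tp_size: int) -> dict:
    backend = eval_config.get("backend", "vllm")        # text backend by default
    batch_size = eval_config.get("batch_size", "auto")  # previously hard-coded
    model_args = (
        f"pretrained={eval_config['model_name']},"
        f"tensor_parallel_size={tp_size},"
        f"enforce_eager=true,"
        f"add_bos_token=true,"
        f"trust_remote_code={eval_config.get('trust_remote_code', False)},"
        f"max_model_len={eval_config.get('max_model_len', 4096)},"
    )
    return {
        "model": backend,
        "model_args": model_args,
        # chat template only for multimodal runs (see the TODO in the hunk)
        "apply_chat_template": backend == "vllm-vlm",
        "batch_size": batch_size,
    }

kwargs = build_eval_kwargs(
    {"model_name": "Qwen/Qwen2.5-VL-7B-Instruct", "backend": "vllm-vlm"}, tp_size=1
)
```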
10 changes: 10 additions & 0 deletions .buildkite/test-pipeline.yaml
@@ -734,6 +734,16 @@ steps:
- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work

- label: Multi-Modal Accuracy Eval (Small Models) # 50min
timeout_in_minutes: 70
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
source_file_dependencies:
- vllm/multimodal/
- vllm/inputs/
- vllm/v1/core/
commands:
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt --tp-size=1

- label: Multi-Modal Models Test (Extended) 1
mirror_hardwares: [amdexperimental]
optional: true
4 changes: 2 additions & 2 deletions cmake/external_projects/qutlass.cmake
@@ -22,10 +22,10 @@ else()
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
)
FetchContent_Populate(qutlass)
set(qutlass_SOURCE_DIR "${qutlass_SOURCE_DIR}")
endif()

FetchContent_Populate(qutlass)

if(NOT qutlass_SOURCE_DIR)
message(FATAL_ERROR "[QUTLASS] source directory could not be resolved.")
endif()
4 changes: 2 additions & 2 deletions docs/configuration/conserving_memory.md
@@ -58,12 +58,12 @@ You can adjust `compilation_config` to achieve a better balance between inferenc

```python
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel
from vllm.config import CompilationConfig, CompilationMode

llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
compilation_config=CompilationConfig(
level=CompilationLevel.PIECEWISE,
mode=CompilationMode.VLLM_COMPILE,
# By default, it goes up to max_num_seqs
cudagraph_capture_sizes=[1, 2, 4, 8, 16],
),
4 changes: 2 additions & 2 deletions docs/design/cuda_graphs.md
@@ -167,7 +167,7 @@ class AttentionCGSupport(enum.Enum):
"""NO CUDA Graphs support"""
```

Suppose we have hybrid attention backends (e.g., in mamba mixer models). In that case, we seek the minimum capability of all backends to determine the final capability of the model, and we might resolve the incompatible CUDA Graphs mode by downgrading the mode to the best fit one. For example, downgrading `FULL` mode to `FULL_AND_PIECEWISE` mode if the minimum capability is `UNIFORM_BATCH`, or `PIECEWISE` mode if the minimum capability is `NEVER` for -O3 compilation level. For the complete fallback policy, please see the code of [initialize_cudagraph_capture][vllm.v1.worker.gpu_model_runner.GPUModelRunner.initialize_cudagraph_capture].
Suppose we have hybrid attention backends (e.g., in mamba mixer models). In that case, we seek the minimum capability of all backends to determine the final capability of the model, and we might resolve the incompatible CUDA Graphs mode by downgrading the mode to the best fit one. For example, downgrading `FULL` mode to `FULL_AND_PIECEWISE` mode if the minimum capability is `UNIFORM_BATCH`, or `PIECEWISE` mode if the minimum capability is `NEVER` for -O3 compilation mode. For the complete fallback policy, please see the code of [initialize_cudagraph_capture][vllm.v1.worker.gpu_model_runner.GPUModelRunner.initialize_cudagraph_capture].
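The downgrade policy described above can be sketched as follows (illustrative only: member names and ordering are simplified, and the real logic lives in `GPUModelRunner.initialize_cudagraph_capture`):

```python
from enum import IntEnum

# Simplified stand-in for vLLM's AttentionCGSupport; real members differ.
class CGSupport(IntEnum):
    NEVER = 0
    UNIFORM_BATCH = 1
    ALWAYS = 2

def resolve_mode(requested: str, backend_caps: list[CGSupport]) -> str:
    # The weakest backend decides the model-wide capability for hybrid models.
    min_cap = min(backend_caps)
    if requested == "FULL":
        if min_cap == CGSupport.UNIFORM_BATCH:
            return "FULL_AND_PIECEWISE"
        if min_cap == CGSupport.NEVER:
            return "PIECEWISE"
    return requested

# A hybrid model with one UNIFORM_BATCH backend downgrades FULL.
assert resolve_mode("FULL", [CGSupport.ALWAYS, CGSupport.UNIFORM_BATCH]) == "FULL_AND_PIECEWISE"
```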

The following table lists backends that support full CUDA Graphs at the time of writing.

@@ -202,7 +202,7 @@ os.environ.setdefault("VLLM_LOGGING_LEVEL", "DEBUG")
import vllm
from vllm.config import CUDAGraphMode

compilation_config = {"level": 3, "cudagraph_mode": "FULL_AND_PIECEWISE"}
compilation_config = {"mode": 3, "cudagraph_mode": "FULL_AND_PIECEWISE"}
model = vllm.LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
dtype="auto",
10 changes: 6 additions & 4 deletions docs/features/quantization/auto_awq.md
@@ -22,13 +22,15 @@ After installing AutoAWQ, you are ready to quantize a model. Please refer to the
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
model_path = "mistralai/Mistral-7B-Instruct-v0.2"
quant_path = "mistral-instruct-v0.2-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(
model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
model_path,
low_cpu_mem_usage=True,
use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

4 changes: 2 additions & 2 deletions docs/features/quantization/bitblas.md
@@ -34,7 +34,7 @@ llm = LLM(
model=model_id,
dtype=torch.bfloat16,
trust_remote_code=True,
quantization="bitblas"
quantization="bitblas",
)
```

@@ -53,6 +53,6 @@ llm = LLM(
dtype=torch.float16,
trust_remote_code=True,
quantization="bitblas",
max_model_len=1024
max_model_len=1024,
)
```
4 changes: 2 additions & 2 deletions docs/features/quantization/bnb.md
@@ -27,7 +27,7 @@ model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(
model=model_id,
dtype=torch.bfloat16,
trust_remote_code=True
trust_remote_code=True,
)
```

@@ -43,7 +43,7 @@ llm = LLM(
model=model_id,
dtype=torch.bfloat16,
trust_remote_code=True,
quantization="bitsandbytes"
quantization="bitsandbytes",
)
```

9 changes: 7 additions & 2 deletions docs/features/quantization/fp8.md
@@ -41,7 +41,9 @@ from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, device_map="auto", torch_dtype="auto",
MODEL_ID,
device_map="auto",
torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```
@@ -63,7 +65,10 @@ Since simple RTN does not require data for weight quantization and the activatio

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
targets="Linear",
scheme="FP8_DYNAMIC",
ignore=["lm_head"],
)

# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)
12 changes: 7 additions & 5 deletions docs/features/quantization/gguf.md
@@ -47,15 +47,15 @@ You can also use the GGUF model directly through the LLM entrypoint:
conversation = [
{
"role": "system",
"content": "You are a helpful assistant"
"content": "You are a helpful assistant",
},
{
"role": "user",
"content": "Hello"
"content": "Hello",
},
{
"role": "assistant",
"content": "Hello! How can I assist you today?"
"content": "Hello! How can I assist you today?",
},
{
"role": "user",
@@ -67,8 +67,10 @@ You can also use the GGUF model directly through the LLM entrypoint:
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
llm = LLM(
model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)
2 changes: 1 addition & 1 deletion docs/features/quantization/gptqmodel.md
@@ -40,7 +40,7 @@ Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
split="train",
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)
6 changes: 4 additions & 2 deletions docs/features/quantization/int4.md
@@ -39,7 +39,9 @@ from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, device_map="auto", torch_dtype="auto",
MODEL_ID,
device_map="auto",
torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```
@@ -166,7 +168,7 @@ The following is an example of an expanded quantization recipe you can tune to y
},
ignore=["lm_head"],
update_size=NUM_CALIBRATION_SAMPLES,
dampening_frac=0.01
dampening_frac=0.01,
)
```
