Merged
136 commits
e971780
[Bugfix][ROCm] Fix Unsupported attention metadata type for speculativ…
vllmellm Jan 6, 2026
7101e08
[Models]: Use `MMEncoderAttention` for MoonViT (#31738)
Isotr0py Jan 6, 2026
ee2e69d
[Bugfix][CI/Build] Fix failing pooling models test due to Triton kern…
Isotr0py Jan 6, 2026
97ca4c3
[Chore] Remove more V0 dead code from `sequence.py` (#31783)
DarkLight1337 Jan 6, 2026
799b572
[cpu][bench] Add CPU paged attention benchmarks (#31720)
fadara01 Jan 6, 2026
db31832
[Misc] Use `deprecated` for `seed_everything` (#31780)
DarkLight1337 Jan 6, 2026
43d384b
[CI] Increase the MTEB_EMBED_TOL threshold to 5e-4. (#31797)
noooop Jan 6, 2026
6ebb66c
[Doc] Fix format of multimodal_inputs.md (#31800)
BlankRH Jan 6, 2026
14df02b
[Chore] Cleanup `mem_utils.py` (#31793)
DarkLight1337 Jan 6, 2026
e0327c9
[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_c…
LucasWilkinson Jan 6, 2026
bf0f3a4
[Bugfix] Fix torch.compile error for DP + MoE on CPU Backend (#31650)
kzwrime Jan 6, 2026
6444824
[Misc] Implement `TokenizerLike.convert_tokens_to_ids` (#31796)
DarkLight1337 Jan 6, 2026
2c1a4f2
[Bugfix]: avoid overriding audio/text kwargs (Qwen3-Omni) (#31790)
Jzz1943 Jan 6, 2026
0202971
[Frontend] Support GLM-4.5 / GLM-4.7 with enable_thinking: false (#31…
chaunceyjiang Jan 6, 2026
96860af
[Model] rename use_pad_token to use_sep_token (#31784)
noooop Jan 6, 2026
cbd4690
[LoRA]Disable linear LoRA kernel PDL (#31777)
jeejeelee Jan 6, 2026
02809af
[Bugfix]: Fix cross attention backend selection for Turing GPU (#31806)
Isotr0py Jan 6, 2026
d3e477c
[MoE Refactor] Add Temporary Integration Tests - H100/B200 (#31759)
robertgshaw2-redhat Jan 6, 2026
af8fd73
[MoE Refactor][14/N] Clean Up FI Quant Config Smuggling (#31593)
robertgshaw2-redhat Jan 6, 2026
28c9477
[NemotronH] Use ReplicatedLinear for fc1_latent_proj (#31807)
roikoren755 Jan 6, 2026
2f4bdee
[Quantization][MoE] remove unused ep logic from moe marlin (#31571)
jinzhen-lin Jan 6, 2026
4c73be1
[Attention][2/n] Remove usage of deprecated `seq_lens_cpu` and `num_c…
LucasWilkinson Jan 6, 2026
22dffca
[PERF] Speed-up of GDN attention decode part (Qwen3-Next) (#31722)
vadiklyutiy Jan 6, 2026
6f5e653
[Log] add log about gpu worker init snapshot and requested memory (#2…
andyxning Jan 6, 2026
142c4d1
make 500: InternalServerError more informative (#20610)
guicho271828 Jan 6, 2026
4e67a8f
[Bugfix] Fix GLM-4 MoE router logits dtype for data parallel chunking…
ReinforcedKnowledge Jan 6, 2026
f7008ce
[Perf] Async Scheduling + Speculative Decoding + Structured Outputs (…
benchislett Jan 6, 2026
c071636
[ROCm][CI] Fix tests/compile unit tests (#28895)
charlifu Jan 6, 2026
8becf14
[Quantization][Refactor] Move CPU GPTQ kernel into MP linear (#31801)
bigPYJ1151 Jan 6, 2026
ada6f91
Fix RecursionError in MediaWithBytes unpickling (#31191)
nrghosh Jan 6, 2026
dba9537
Report error log after vllm bench serve (#31808)
elvircrn Jan 6, 2026
d498997
[Spec Decode][UX] Add acceptance stats to `vllm bench serve` report (…
MatthewBonanni Jan 6, 2026
2a42ae7
[ROCm][CI] Fix ModernBERT token classification test numerical accurac…
AndreasKaratzas Jan 6, 2026
e5d427e
[ROCm][CI] Pinning timm lib version to fix ImportError in Multi-Modal…
AndreasKaratzas Jan 6, 2026
309a8f6
[Bugfix] Handle mistral tokenizer in get_hf_processor (#31817)
DarkLight1337 Jan 6, 2026
9a1d20a
[CI] Add warmup run in test_fusion_attn (#31183)
angelayi Jan 7, 2026
364a8bc
[ROCm][CI] Fix plugin tests (2 GPUs) failures on ROCm and removing `V…
AndreasKaratzas Jan 7, 2026
6f35154
[Frontend] Implement robust video frame recovery for corrupted videos…
vSeamar Jan 7, 2026
873480d
[Misc][BE] Type coverage for vllm/compilation [1/3] (#31554)
Lucaskabela Jan 7, 2026
5b833be
[1/2][lmcache connector] clean up lmcache multi-process adapter (#31…
ApostaC Jan 7, 2026
a051525
[Model] Enable LoRA support for PaliGemma (#31656)
A1c0r-Z Jan 7, 2026
1b8af95
[Doc] Update release docs (#31799)
DarkLight1337 Jan 7, 2026
f09c5fe
Change warning in get_current_vllm_config to report caller's line num…
tlrmchlsmth Jan 7, 2026
0a2c2dc
fixed mypy warnings for files vllm/v1/attention with TEMPORARY workar…
MrIceCreamMan Jan 7, 2026
aafd4d2
[Chore] Try remove `init_cached_hf_modules` (#31786)
DarkLight1337 Jan 7, 2026
6409004
[ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA bac…
vllmellm Jan 7, 2026
c7a79d4
[Attention][3/n] Remove usage of deprecated `seq_lens_cpu` and `num_c…
LucasWilkinson Jan 7, 2026
55caa60
refactor: find_loaded_library (#31866)
tom-zju Jan 7, 2026
efeaac9
[Bugfix] Fix race condition in async-scheduling for vlm model (#31841)
tianshu-Michael-yu Jan 7, 2026
4829148
[BugFix] LoRA: Support loading base_layer of experts (#31104)
HollowMan6 Jan 7, 2026
4614c5a
[Bugfix][Hardware][AMD] Consolidate FP8 min/max values helper functio…
c0de128 Jan 7, 2026
0dd5dee
[Bugfix][Kernel] fix bias adding in triton kernel implemented fused m…
xuebwang-amd Jan 7, 2026
e759637
[Refactor][TPU] Remove torch_xla path and use tpu-inference (#30808)
weiyu0824 Jan 7, 2026
59fe6f2
[XPU]fallback to TRITON_ATTN on xpu when use float32 dtype (#31762)
1643661061leo Jan 7, 2026
1f33e38
[Model] Cleanup: Remove redundant manual definition of `make_empty_in…
maang-h Jan 7, 2026
0790f07
[Misc] Improve error messages for unsupported types and parameters (#…
BlankRH Jan 7, 2026
d111bc5
[Bugfix][MTP] Fix GLM4 MoE fp8 loading with MTP on (#31757)
andyl98 Jan 7, 2026
41cfa50
[ROCm][AITER] fix wrong argument passed to AITER `flash_attn_varlen_…
vllmellm Jan 7, 2026
9741387
[Refactor] GLM-ASR Modeling (#31779)
JaredforReal Jan 7, 2026
b665bbc
[Chore] Migrate V0 attention utils (#31891)
DarkLight1337 Jan 7, 2026
1ab055e
[OpenAI] Extend VLLMValidationError to additional validation paramete…
R3hankhan123 Jan 7, 2026
cc6dafa
[Perf][Kernels] Enable FlashInfer DeepGEMM swapAB on SM90 (for W8A8 L…
katec846 Jan 7, 2026
b7036c8
[Refactor] Clean up pooler modules (#31897)
DarkLight1337 Jan 7, 2026
1d9e9ae
[Bugfix]: prevent leaking tokens in crash log (#30751)
dr75 Jan 7, 2026
b89443b
[KVConnector]: Enable Cross-layers KV cache layout for MultiConnector…
kfirtoledo Jan 7, 2026
30399cc
UX: add vLLM env info in '/server_info' (#31899)
jeejeelee Jan 7, 2026
bf184a6
Enable quantized attention in NemotronH models (#31898)
roikoren755 Jan 7, 2026
05f47bd
[Doc] Fix: Correct vLLM announcing blog post link in docs (#31868)
Ayobami-00 Jan 7, 2026
f347ac6
[Perf] Fuse stride preparation for NVFP4 cutlass_moe (#31837)
mgoin Jan 7, 2026
c907d22
[refactor] refactor memory constants usage (#31865)
andyxning Jan 7, 2026
0ada960
[Kernel] Support bias type in grouped_topk kernel (#31781)
xyang16 Jan 7, 2026
6170d47
[EPLB] Optimize EPLB with numpy (#29499)
ilmarkov Jan 7, 2026
10ef65e
[BugFix] Fix bad words with speculative decoding (#31908)
njhill Jan 7, 2026
ffc0a27
Add back missing DeepEP LL params (#31911)
elvircrn Jan 7, 2026
5dcd7ef
[MoE Refactor][15/N] Apply Refactor to Fp8 (#31415)
robertgshaw2-redhat Jan 8, 2026
0d76674
[0/N][Attention] Fix miscellaneous pre-commit issues (#31924)
MatthewBonanni Jan 8, 2026
25eef3d
feat(moe): Add is_act_and_mul=False support for Triton MoE kernels (#…
rabi Jan 8, 2026
39d8200
fix(rocm): add early return in get_flash_attn_version for ROCm (#31286)
rabi Jan 8, 2026
8dd2419
[CI] Skip Qwen-VL in multimodal processing tests due to flaky externa…
AndreasKaratzas Jan 8, 2026
9f6dcb7
[MoE Refactor][16/N] Apply Refactor to NVFP4 (#31692)
robertgshaw2-redhat Jan 8, 2026
a79079f
[BugFix] Fix flakiness in test_eagle_dp for PyTorch 2.10 (#31915)
zou3519 Jan 8, 2026
c4041f3
[ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router…
AndreasKaratzas Jan 8, 2026
087a138
[ROCm][CI] Fix attention backend test flakiness from uninitialized KV…
AndreasKaratzas Jan 8, 2026
cddbc2b
[ROCm][CI] Add rocm support for run-multi-node-test.sh (#31922)
charlifu Jan 8, 2026
f1b1bea
[CI][BugFix][AMD] Actually skip tests marked @pytest.mark.skip_v1 (#3…
rasmith Jan 8, 2026
6b2a672
[Doc] Add Claude code usage example (#31188)
mgoin Jan 8, 2026
5f2a473
[ROCm][CI] v1 cpu offloading attention backend fix (#31833)
AndreasKaratzas Jan 8, 2026
9572f74
[Model] Enable LoRA support for tower and connector in DotsOCR (#31825)
ShaanveerS Jan 8, 2026
2ab441b
[platform] add dp_metadata arg to set_additional_forward_context (#31…
Ronald1995 Jan 8, 2026
be6a81f
[chore] Update FA commit (#30460)
LucasWilkinson Jan 8, 2026
791b2fc
[grpc] Support gRPC server entrypoint (#30190)
CatherineSue Jan 8, 2026
287b37c
[BugFix] Fix spec decoding edge case bugs (#31944)
njhill Jan 8, 2026
d3235cb
[Fix] Enable mm_processor_cache with vision LoRA (#31927)
prashanth058 Jan 8, 2026
e5173d3
[Bugfix] Remove the num_hidden_layers override for glm4_moe (#31745)
andyl98 Jan 8, 2026
63baa28
[Model] Enable LoRA support for tower and connector in GLM4-V (#31652)
Zyyeric Jan 8, 2026
107cf8e
fix(rocm): Add get_supported_kernel_block_sizes() to ROCM_ATTN (#31712)
rabi Jan 8, 2026
33156f5
[docker] A follow-up patch to fix #30913: `[docker] install cuda13 ve…
wangshangsam Jan 8, 2026
573a1d1
[ROCm]Skip test_torchao.py::test_pre_quantized_model on CDNA3 arch (#…
ZhiweiYan-96 Jan 8, 2026
eac3b96
[Models] Allow converting Qwen3-VL into Reranker model (#31890)
Isotr0py Jan 8, 2026
b634e61
Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA Com…
Lumosis Jan 8, 2026
8cbdc7e
[CI/Build] Enable test_kv_cache_events_dp for AMD (#31834)
rjrock Jan 8, 2026
1f21429
fix(compile): apply partition wrapper when loading AOT cached functio…
devbyteai Jan 8, 2026
96fcd3c
[Misc] Support qwen3-next lora (#31719)
BJWang-ant Jan 8, 2026
04a4966
RayLLM Bugfix - Preserve obj store URL for multi engine_config creati…
omer-dayan Jan 8, 2026
d1b6fe0
[Chore] Further cleanup pooler (#31951)
DarkLight1337 Jan 8, 2026
5576227
[Model] Standardize common vision encoders (#31947)
DarkLight1337 Jan 8, 2026
2972a05
[MM Encoder]: Make MMEncoderAttention's `scale` takes effect properly…
Isotr0py Jan 8, 2026
18d4e48
[Voxtral] Fix speech transcription api (#31388)
patrickvonplaten Jan 8, 2026
59d260f
[Model] Add Grok-2 (#31847)
dangoldbj Jan 8, 2026
03fd76c
[Model] Add LFM2-VL model support (#31758)
tianshu-Michael-yu Jan 8, 2026
1123a87
[Model] Enable LoRA support for Pixtral (#31724)
A1c0r-Z Jan 8, 2026
7645bc5
[OpenAI] Fix tool_choice=required streaming when output has trailing …
maylikenoother Jan 8, 2026
72c068b
[CI] [Bugfix] Fix unbounded variable in `run-multi-node-test.sh` (#31…
tjtanaa Jan 8, 2026
1da3a54
[Docs]: update claude code url (#31971)
chaunceyjiang Jan 8, 2026
fe86be6
[Model] Support IQuestCoder model (#31575)
yxing-bj Jan 8, 2026
eaba8ec
[Bugfix]: Fix Step3ReasoningParser missing is_reasoning_end_streaming…
chaunceyjiang Jan 8, 2026
b8112c1
[Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 (#31960)
danisereb Jan 8, 2026
49568d5
[Doc] Improve MM models LoRA notes (#31979)
jeejeelee Jan 8, 2026
a3d909a
[Misc] Tidy up some spec decode logic in GPUModelRunner (#31591)
njhill Jan 8, 2026
a563866
Fix ijson build for Power. (#31702)
npanpaliya Jan 8, 2026
83e1c76
[CI][ROCm] Fix NIXL tests on ROCm (#31728)
NickLucche Jan 8, 2026
7508243
[Model Runner V2] Simplify BlockTables with UVA (#31965)
WoosukKwon Jan 8, 2026
87e07a6
Revert "feat(moe): Add is_act_and_mul=False support for Triton MoE ke…
mgoin Jan 8, 2026
f16bfbe
[Documentation][torch.compile] Add documentation for torch.compile + …
Lucaskabela Jan 8, 2026
aa125ec
[Frontend] Improve error message (#31987)
DarkLight1337 Jan 8, 2026
e74698c
[Misc][Refactor] Add FusedMoERouter object (#30519)
bnellnm Jan 8, 2026
5d3b609
[Compressed-Tensors] Simplify NVFP4 Conditions, enable marlin support…
dsikka Jan 8, 2026
6cdf015
[Misc] Fix `Current vLLM config is not set.` warnings, assert to avoi…
LucasWilkinson Jan 8, 2026
d62cfe5
[MoE Refactoring][Bugfix]Wrap WNA16 Triton kernel into mk and change …
zyongye Jan 9, 2026
5825bbc
[Quantization] Deprecate Long Tail of Schemes (#31688)
robertgshaw2-redhat Jan 9, 2026
11cec29
[BugFix] Add spec-decode-incompatible request param validation (#31982)
njhill Jan 9, 2026
6ebe34d
[Feature] Add iteration level logging and enhance nvtx marker (#31193)
maxyanghu Jan 9, 2026
0fa8dd2
[Bugfix] Fix Typo from NVFP4 Refactor (#31977)
robertgshaw2-redhat Jan 9, 2026
a4ec0c5
[Frontend] Add MCP tool streaming support to Responses API (#31761)
daniel-salib Jan 9, 2026
6e41966
Resolve merge conflict
Ri0S Jan 9, 2026
a35d727
Merge remote-tracking branch 'origin/main' into bugfix/responses_stre…
Ri0S Jan 9, 2026
24 changes: 21 additions & 3 deletions .buildkite/scripts/run-multi-node-test.sh
@@ -2,6 +2,17 @@

set -euox pipefail

# To detect ROCm
# Check multiple indicators:
if [ -e /dev/kfd ] || \
[ -d /opt/rocm ] || \
command -v rocm-smi &> /dev/null || \
[ -n "${ROCM_HOME:-}" ]; then
IS_ROCM=1
else
IS_ROCM=0
fi

if [[ $# -lt 4 ]]; then
echo "Usage: .buildkite/scripts/run-multi-node-test.sh WORKING_DIR NUM_NODES NUM_GPUS DOCKER_IMAGE COMMAND1 COMMAND2 ... COMMANDN"
exit 1
@@ -26,21 +37,28 @@ for command in "${COMMANDS[@]}"; do
echo "$command"
done


start_network() {
docker network create --subnet=192.168.10.0/24 docker-net
}

start_nodes() {
for node in $(seq 0 $(($NUM_NODES-1))); do
GPU_DEVICES='"device='
if [ "$IS_ROCM" -eq 1 ]; then
GPU_DEVICES='--device /dev/kfd --device /dev/dri -e HIP_VISIBLE_DEVICES='
else
GPU_DEVICES='--gpus "device='
fi
for node_gpu in $(seq 0 $(($NUM_GPUS - 1))); do
DEVICE_NUM=$(($node * $NUM_GPUS + $node_gpu))
GPU_DEVICES+=$(($DEVICE_NUM))
if [ "$node_gpu" -lt $(($NUM_GPUS - 1)) ]; then
GPU_DEVICES+=','
fi
done
GPU_DEVICES+='"'
if [ "$IS_ROCM" -eq 0 ]; then
GPU_DEVICES+='"'
fi

# start the container in detached mode
# things to note:
@@ -49,7 +67,7 @@ start_nodes() {
# 3. map the huggingface cache directory to the container
# 3. assign ip addresses to the containers (head node: 192.168.10.10, worker nodes:
# starting from 192.168.10.11)
docker run -d --gpus "$GPU_DEVICES" --shm-size=10.24gb -e HF_TOKEN \
docker run -d $GPU_DEVICES --shm-size=10.24gb -e HF_TOKEN \
-v ~/.cache/huggingface:/root/.cache/huggingface --name "node$node" \
--network docker-net --ip 192.168.10.$((10 + $node)) --rm "$DOCKER_IMAGE" \
/bin/bash -c "tail -f /dev/null"
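The device-flag logic this hunk adds can be exercised in isolation. The sketch below extracts it into a standalone helper (the `build_gpu_devices` function name is illustrative, not part of the PR): on ROCm, containers receive the `/dev/kfd` and `/dev/dri` device nodes plus a `HIP_VISIBLE_DEVICES` list, while on CUDA the `--gpus "device=…"` form is built, including its closing quote.

```shell
# Sketch of the ROCm-aware device-flag construction from run-multi-node-test.sh.
# Prints the docker-run fragment selecting one node's GPUs.
build_gpu_devices() {
    is_rocm=$1; node=$2; num_gpus=$3
    if [ "$is_rocm" -eq 1 ]; then
        # ROCm path: pass through KFD/DRI device nodes, select GPUs via env var
        devices='--device /dev/kfd --device /dev/dri -e HIP_VISIBLE_DEVICES='
    else
        # CUDA path: use docker's --gpus flag with a quoted device list
        devices='--gpus "device='
    fi
    node_gpu=0
    while [ "$node_gpu" -lt "$num_gpus" ]; do
        # Global GPU index = node * GPUs-per-node + local index
        devices="$devices$((node * num_gpus + node_gpu))"
        if [ "$node_gpu" -lt "$((num_gpus - 1))" ]; then
            devices="$devices,"
        fi
        node_gpu=$((node_gpu + 1))
    done
    # Only the CUDA form needs the closing quote appended
    if [ "$is_rocm" -eq 0 ]; then
        devices="$devices\""
    fi
    printf '%s\n' "$devices"
}

build_gpu_devices 0 1 2   # node 1, 2 GPUs per node, CUDA  -> --gpus "device=2,3"
build_gpu_devices 1 1 2   # same node on ROCm             -> ... HIP_VISIBLE_DEVICES=2,3
```

In the script itself the fragment is then expanded unquoted into `docker run -d $GPU_DEVICES …`, which is why the flags (rather than just the device list) are baked into the variable.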
31 changes: 19 additions & 12 deletions .buildkite/test-amd.yaml
@@ -163,9 +163,7 @@ steps:
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/test_vision_embeds.py
# Need tf32 to avoid conflicting precision issue with terratorch on ROCm.
# TODO: Remove after next torch update
- VLLM_FLOAT32_MATMUL_PRECISION="tf32" pytest -v -s entrypoints/openai/test_vision_embeds.py
- pytest -v -s entrypoints/openai/test_vision_embeds.py
- pytest -v -s entrypoints/test_chat_utils.py

- label: Entrypoints Integration Test (API Server 2)
@@ -519,8 +517,7 @@ steps:
- tests/samplers
- tests/conftest.py
commands:
- pytest -v -s samplers
- VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
- pytest -v -s -m 'not skip_v1' samplers

- label: LoRA Test %N # 20min each
timeout_in_minutes: 30
@@ -989,9 +986,7 @@ steps:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pip freeze | grep -E 'torch'
- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing --ignore models/multimodal/pooling/test_prithvi_mae.py
# Need tf32 to avoid conflicting precision issue with terratorch on ROCm.
# TODO: Remove after next torch update
- VLLM_FLOAT32_MATMUL_PRECISION="tf32" pytest -v -s models/multimodal/pooling/test_prithvi_mae.py -m core_model
- pytest -v -s models/multimodal/pooling/test_prithvi_mae.py -m core_model
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work

- label: Multi-Modal Accuracy Eval (Small Models) # 5min
@@ -1356,9 +1351,7 @@ steps:
# end platform plugin tests
# begin io_processor plugins test, all the code in between uses the prithvi_io_processor plugin
- pip install -e ./plugins/prithvi_io_processor_plugin
# Need tf32 to avoid conflicting precision issue with terratorch on ROCm.
# TODO: Remove after next torch update
- VLLM_FLOAT32_MATMUL_PRECISION="tf32" pytest -v -s plugins_tests/test_io_processor_plugins.py
- pytest -v -s plugins_tests/test_io_processor_plugins.py
- pip uninstall prithvi_io_processor_plugin -y
# end io_processor plugins test
# begin stat_logger plugins test
@@ -1455,7 +1448,21 @@ steps:
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt
- VLLM_ATTENTION_BACKEND=ROCM_ATTN bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
- VLLM_ATTENTION_BACKEND=ROCM_ATTN bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh

- label: DP EP NixlConnector PD accuracy tests (Distributed) # 15min
mirror_hardwares: [amdexperimental]
agent_pool: mi325_4
# grade: Blocking
timeout_in_minutes: 15
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt
- VLLM_ATTENTION_BACKEND=ROCM_ATTN DP_EP=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh

##### multi gpus test #####
##### A100 test #####
32 changes: 28 additions & 4 deletions .buildkite/test-pipeline.yaml
@@ -1104,6 +1104,7 @@ steps:
- vllm/model_executor/models/
- tests/distributed/
- tests/examples/offline_inference/data_parallel.py
- .buildkite/scripts/run-multi-node-test.sh
commands:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
@@ -1266,8 +1267,8 @@ steps:
commands:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt

- label: NixlConnector PD accuracy tests (Distributed) # 30min
timeout_in_minutes: 30
- label: NixlConnector PD accuracy tests (Distributed) # 40min
timeout_in_minutes: 40
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
@@ -1277,8 +1278,8 @@
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh

- label: DP EP NixlConnector PD accuracy tests (Distributed)
timeout_in_minutes: 30
- label: DP EP NixlConnector PD accuracy tests (Distributed) # 15min
timeout_in_minutes: 15
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
@@ -1406,3 +1407,26 @@ steps:
working_dir: "/vllm-workspace"
commands:
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020 2 1

##### MoE Refactor (Temporary) Tests #####

- label: MoE Refactor Integration Test (H100 - TEMPORARY) # optional
gpu: h100
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt

- label: MoE Refactor Integration Test (B200 - TEMPORARY) # optional
gpu: b200
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-b200.txt

- label: MoE Refactor Integration Test (B200 DP - TEMPORARY) # optional
gpu: b200
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor-dp-ep/config-b200.txt
2 changes: 1 addition & 1 deletion .buildkite/test_areas/distributed.yaml
@@ -182,7 +182,7 @@ steps:
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
- bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh

- label: Pipeline + Context Parallelism (4 GPUs))
timeout_in_minutes: 60
5 changes: 5 additions & 0 deletions .gitignore
@@ -227,3 +227,8 @@ ep_kernels_workspace/

# Allow tracked library source folders under submodules (e.g., benchmarks/lib)
!vllm/benchmarks/lib/

# Generated gRPC protobuf files (compiled at build time from vllm_engine.proto)
vllm/grpc/vllm_engine_pb2.py
vllm/grpc/vllm_engine_pb2_grpc.py
vllm/grpc/vllm_engine_pb2.pyi
47 changes: 15 additions & 32 deletions RELEASE.md
@@ -1,47 +1,30 @@
# Releasing vLLM

vLLM releases offer a reliable version of the code base, packaged into a binary format that can be conveniently accessed via PyPI. These releases also serve as key milestones for the development team to communicate with the community about newly available features, improvements, and upcoming changes that could affect users, including potential breaking changes.
vLLM releases offer a reliable version of the code base, packaged into a binary format that can be conveniently accessed via [PyPI](https://pypi.org/project/vllm). These releases also serve as key milestones for the development team to communicate with the community about newly available features, improvements, and upcoming changes that could affect users, including potential breaking changes.

## Release Versioning
## Release Cadence and Versioning

vLLM uses a “right-shifted” versioning scheme where a new patch release is out every 2 weeks. And patch releases contain features and bug fixes (as opposed to semver where patch release contains only backwards-compatible bug fixes). When critical fixes need to be made, special release post1 is released.
We aim to have a regular release every 2 weeks. Since v0.12.0, regular releases increment the minor version rather than patch version. The list of past releases can be found [here](https://vllm.ai/releases).

* _major_ major architectural milestone and when incompatible API changes are made, similar to PyTorch 2.0.
* _minor_ major features
* _patch_ features and backwards-compatible bug fixes
* _post1_ or _patch-1_ backwards-compatible bug fixes, either explicit or implicit post release
Our version numbers are expressed in the form `vX.Y.Z`, where `X` is the major version, `Y` is the minor version, and `Z` is the patch version. They are incremented according to the following rules:

## Release Cadence
* _Major_ releases are reserved for architectural milestones involving sweeping API changes, similar to PyTorch 2.0.
* _Minor_ releases correspond to regular releases, which include new features, bug fixes and other backwards-compatible changes.
* _Patch_ releases correspond to special releases for new models, as well as emergency patches for critical performance, functionality and security issues.

Patch release is released on bi-weekly basis. Post release 1-3 days after patch release and uses same branch as patch release.
Following is the release cadence for year 2025. All future release dates below are tentative. Please note: Post releases are optional.
This versioning scheme is similar to [SemVer](https://semver.org/) for compatibility purposes, except that backwards compatibility is only guaranteed for a limited number of minor releases (see our [deprecation policy](https://docs.vllm.ai/en/latest/contributing/deprecation_policy) for details).

| Release Date | Patch release versions | Post Release versions |
| --- | --- | --- |
| Jan 2025 | 0.7.0 | --- |
| Feb 2025 | 0.7.1, 0.7.2, 0.7.3 | --- |
| Mar 2025 | 0.7.4, 0.7.5 | --- |
| Apr 2025 | 0.7.6, 0.7.7 | --- |
| May 2025 | 0.7.8, 0.7.9 | --- |
| Jun 2025 | 0.7.10, 0.7.11 | --- |
| Jul 2025 | 0.7.12, 0.7.13 | --- |
| Aug 2025 | 0.7.14, 0.7.15 | --- |
| Sep 2025 | 0.7.16, 0.7.17 | --- |
| Oct 2025 | 0.7.18, 0.7.19 | --- |
| Nov 2025 | 0.7.20, 0.7.21 | --- |
| Dec 2025 | 0.7.22, 0.7.23 | --- |

## Release branch
## Release Branch

Each release is built from a dedicated release branch.

* For _major_, _minor_, _patch_ releases, the release branch cut is performed 1-2 days before release is live.
* For post releases, previously cut release branch is reused
* Release builds are triggered via push to RC tag like vX.Y.Z-rc1 . This enables us to build and test multiple RCs for each release.
* Final tag : vX.Y.Z does not trigger the build but used for Release notes and assets.
* After branch cut is created we monitor the main branch for any reverts and apply these reverts to a release branch.
* For _major_ and _minor_ releases, the release branch cut is performed 1-2 days before release is live.
* For _patch_ releases, previously cut release branch is reused.
* Release builds are triggered via push to RC tag like `vX.Y.Z-rc1`. This enables us to build and test multiple RCs for each release.
* Final tag: `vX.Y.Z` does not trigger the build but used for Release notes and assets.
* After branch cut is created, we monitor the main branch for any reverts and apply these reverts to a release branch.

## Release Cherry-Pick Criteria
### Cherry-Pick Criteria

After branch cut, we approach finalizing the release branch with clear criteria on what cherry picks are allowed in. Note: a cherry pick is a process to land a PR in the release branch after branch cut. These are typically limited to ensure that the team has sufficient time to complete a thorough round of testing on a stable code base.

4 changes: 3 additions & 1 deletion benchmarks/cutlass_benchmarks/sparse_benchmarks.py
@@ -343,7 +343,9 @@ def bench(
return bench_int8(dtype, m, k, n, label, sub_label)
if dtype == torch.float8_e4m3fn:
return bench_fp8(dtype, m, k, n, label, sub_label)
raise ValueError("unsupported type")
raise ValueError(
f"Unsupported dtype {dtype}: should be one of torch.int8, torch.float8_e4m3fn."
)


# runner
5 changes: 2 additions & 3 deletions benchmarks/kernels/benchmark_activation.py
@@ -8,10 +8,9 @@

import vllm.model_executor.layers.activation # noqa F401
from vllm.model_executor.custom_op import CustomOp
from vllm.platforms import current_platform
from vllm.triton_utils import triton
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed

batch_size_range = [1, 16, 128]
seq_len_range = [1, 16, 64, 1024, 4096]
@@ -30,7 +29,7 @@ def benchmark_activation(
device = "cuda"
num_tokens = batch_size * seq_len
dim = intermediate_size
current_platform.seed_everything(42)
set_random_seed(42)
torch.set_default_device(device)

if func_name == "gelu_and_mul":