forked from opendatahub-io/vllm
[do not merge] pr test for nm changes into 2.20 #107
Closed
Conversation
Signed-off-by: Alexander Matveev <[email protected]>
Signed-off-by: Alexander Matveev <[email protected]>
Signed-off-by: Chengji Yao <[email protected]>
Signed-off-by: Matthew Vine <[email protected]>
Signed-off-by: ElizaWszola <[email protected]>
Signed-off-by: ElizaWszola <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Chenyaaang <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: weizeng <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Signed-off-by: Cody Yu <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
add retries and get rid of progress meter
Signed-off-by: Gregory Shtrasberg <[email protected]>
…QS (vllm-project#15583)
Signed-off-by: Chengji Yao <[email protected]>
Signed-off-by: Bella kira <[email protected]>
…oject#15587)
Signed-off-by: ElizaWszola <[email protected]>
Signed-off-by: ElizaWszola <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Co-authored-by: youkaichao <[email protected]>
…lm-project#15616)
Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
… vllm/v1 (vllm-project#15211)
Signed-off-by: h-sugi <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
…6071)
Signed-off-by: Bill Nell <[email protected]>
Signed-off-by: Michael Goin <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
…put queue (vllm-project#15906)"
This reverts commit 651cf0f.
Author: /build-from-odh

Author: /build-from-odh
* add PR pipeline
* add correct default value for additional build secret
* update pull request pipeline to use remote pipeline ref
* add 4h timeout
* call out the remote build platform
* rename pipeline, disable making an image index for PR pipeline
Author force-pushed the branch from c23a05b to 0fed6b0.
Author: /build-from-odh

3 similar comments

Author: /build-from-odh

Author: /build-from-odh

Author: /build-from-odh

Author: /build-from-odh

1 similar comment

Author: /build-from-odh
Upstream vllm-project/vllm changes included in this PR:

* Update pre-commit's isort version to remove warnings (vllm-project/vllm#13614)
* Use pre-commit to update requirements-test.txt (vllm-project/vllm#13617)
* [Bugfix] Add mm_processor_kwargs to chat-related protocols (vllm-project/vllm#13644)
* [V1][Metrics] Support vllm:cache_config_info (vllm-project/vllm#13299)
* [Metrics] Add --show-hidden-metrics-for-version CLI arg (vllm-project/vllm#13295)
* [Misc] Deprecate --dataset from benchmark_serving.py (vllm-project/vllm#13708)
* [Misc][Docs] Raise error when flashinfer is not installed and VLLM_ATTENTION_BACKEND is set (vllm-project/vllm#12513)
* [Misc] Clean Up EngineArgs.create_engine_config (vllm-project/vllm#13734)
* [Misc][Chore] Clean Up AsyncOutputProcessing Logs (vllm-project/vllm#13780)
* Fix /v1/audio/transcriptions Bad Request Error (vllm-project/vllm#13811)
* Fix failing MyGemma2Embedding test (vllm-project/vllm#13820)
* DeepSeek V2/V3/R1 only place lm_head on last pp rank (vllm-project/vllm#13833)
* Add comments on accessing kv_cache and attn_metadata (vllm-project/vllm#13887)
* [CI/Build] Add examples/ directory to be labelled by mergify (vllm-project/vllm#13944)
* Deduplicate .pre-commit-config.yaml's exclude (vllm-project/vllm#13967)
* [V1] SupportsV0Only protocol for model definitions (vllm-project/vllm#13959)
* [Docs] Add pipeline_parallel_size to optimization docs (vllm-project/vllm#14059)
* [Doc] Consolidate whisper and florence2 examples (vllm-project/vllm#14050)
* [v1] Add __repr__ to KVCacheBlock to avoid recursive print (vllm-project/vllm#14081)
* Improve the docs for TransformersModel (vllm-project/vllm#14147)
* Fix head_dim not existing in all model configs (Transformers backend) (vllm-project/vllm#14141)
* [V0][Metrics] Remove unimplemented vllm:tokens_total (vllm-project/vllm#14134)
* Fix performance when --generation-config is not None (vllm-project/vllm#14223)
* [Frontend] Do prompt_logprobs clamping for chat as well as completions (vllm-project/vllm#14225)
* [Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing (vllm-project/vllm#14256)
* Deprecate best_of Sampling Parameter in anticipation for vLLM V1 (vllm-project/vllm#13997)
* [misc] Mention ray list nodes command to troubleshoot ray issues (vllm-project/vllm#14318)
* [Core] Optimizing cross-attention QKVParallelLinear computation (vllm-project/vllm#12325)
* Reinstate best_of for V0 (vllm-project/vllm#14356)
* [Bugfix] Correctly call cudaProfilerStop in benchmarks script (vllm-project/vllm#14183)
* Fix missing kv_caches and attn_metadata in OpenVINOCausalLM (vllm-project/vllm#14271)
* [core] add extra_args to SamplingParams (vllm-project/vllm#13300) (a usage sketch follows this list)
* Default to generation_config from model (vllm-project/vllm#12622)
* [Misc] add use_tqdm_on_load to reduce logs (vllm-project/vllm#14407)
* [Docs] Mention model_impl arg when explaining Transformers fallback (vllm-project/vllm#14552)
* Correct capitalisation: Github -> GitHub (vllm-project/vllm#14561)
* [V1][Bugfix] Fix handling of second_per_grid_ts for Qwen2-VL & Qwen2.5-VL (vllm-project/vllm#14548)
* Correct capitalisation: VLLM -> vLLM (vllm-project/vllm#14562)
* [Bugfix] Update --hf-overrides for Alibaba-NLP/gte-Qwen2 (vllm-project/vllm#14609)
* [Feature] Add vllm bench CLI (vllm-project/vllm#13993)
* [Core] Refactor QKVCrossParallelLinear implementation to support BNB 4-bit quantization (vllm-project/vllm#14545)
* … not include_stop_str_in_output (vllm-project/vllm#14624)
* [Bugfix][IPEX] Add VLLM_CPU_MOE_PREPACK to allow disabling MoE prepack when CPU does not support it (vllm-project/vllm#14681)
* [Misc][Minor] Simplify SamplingParams.__post_init__() (vllm-project/vllm#14772)
* [Misc] Clean up type annotation for SupportsMultiModal (vllm-project/vllm#14794)
* [Core] Expose API endpoint /is_sleeping (vllm-project/vllm#14312)
* [Doc] Add guidance for using ccache with pip install -e . in doc (vllm-project/vllm#14901)
* … main branch (#14692)
* … --seed option to offline multi-modal examples (#14934)
* … AutoModelForImageTextToText to load VLMs in tests (#14945)
* … logprobs in ChatCompletionRequest (#14352)
* … do_rescale warning when passing dummy data (#15107)
* … tokenizer_mode (#15040)
* … merge_async_iterators fast-path for single-prompt requests (#15150)
* … misc issues with link to forum (#15226)
* … extra_body as a way to pass vLLM-only parameters using the OpenAI client (#15240) (a usage sketch follows this list)
* … max_num_seqs is between cudagraph capture sizes (#15308)
* … disable-any-whitespace option support for xgrammar (#15316)
* … generation_config by default (#15281)
* … /v1/audio/transcriptions OpenAI API endpoint (#12909) (a usage sketch follows this list)
* … fastsafetensors loader for loading model weights (#10647)
* … TransformersModel (#12832)
* … auto fallback mode (#14779)
* … num_embeds (#15443)
* … SchedulerInterface type for engine scheduler field (#15499)
* … TransformersModel (#15467)
* … scatter_patch_features (#15559)
* … is_encoder_decoder_inputs with split_enc_dec_inputs (#15620)
* … mm_registry in compute_encoder_budget (#15621)
* … tpu label (#15634)
* … mm_hashes forgetting to be passed (#15668)
* … embed_is_patch for Idefics3 (#15696)
* … mm_counts for dummy data creation (#15703)
* … embed_is_patch mask for fuyu model (#15731)
* … _try_schedule_encoder_inputs for every request (#15778)
* … transformers to v4.50.3 (#13905)
* … MultiModalDataParser (#15828)
* … format.sh as it's been unsupported >70 days (#15884)
* … format.sh and make pre-commit installation simpler (#15890)
* … k_index is int64 for apply_top_k_only (#15907)
* … huggingface_hub to enable Xet downloads (#15873)
* … async_request_deepspeed_mii uses the OpenAI choices key (#15926)
* … tool_choice='required' (#13483)
* … huggingface-cli[hf-xet] -> huggingface-cli[hf_xet] (#15969)
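For context on the extra_args item above (vllm-project/vllm#13300), here is a minimal sketch of how the new field might be used. It is not code from this PR: the model name and the dict key are illustrative assumptions, and extra_args is treated as an opaque pass-through dict that vLLM carries along rather than interprets itself.

```python
# Hedged sketch for vllm-project/vllm#13300: SamplingParams gains an
# extra_args dict that is carried through to custom components.
# "my_custom_option" is a hypothetical key, not a real vLLM parameter.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative small model

params = SamplingParams(
    temperature=0.8,
    max_tokens=32,
    extra_args={"my_custom_option": True},  # opaque pass-through data
)

outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```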
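Similarly, for the extra_body item (#15240): a minimal sketch of passing vLLM-only parameters through the official OpenAI Python client. The server URL, model name, and the top_k parameter are assumptions for illustration; extra_body is the openai-python argument that merges extra fields into the request JSON, which is how the server can see parameters the OpenAI spec does not define.

```python
# Hedged sketch for #15240: a vLLM-only parameter (here, top_k) rides
# along in extra_body, which openai-python merges into the request body.
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is running on localhost:8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever the server loaded
    messages=[{"role": "user", "content": "Say hello."}],
    extra_body={"top_k": 20},  # not in the OpenAI spec; read by the server
)
print(completion.choices[0].message.content)
```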
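Finally, for the /v1/audio/transcriptions endpoint (#12909): a minimal sketch using the same client, assuming the server was started with a Whisper-style model; the model name and audio file path are illustrative.

```python
# Hedged sketch for #12909: the OpenAI-compatible transcription endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:  # illustrative local file
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # illustrative model name
        file=audio,
    )
print(transcription.text)
```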