Releases: vllm-project/vllm
v0.6.6.post1
This release restores functionality for other quantized MoE models, which was broken as part of the initial DeepSeek V3 support 🙇.
What's Changed
- [Docs] Document Deepseek V3 support by @simon-mo in #11535
- Update openai_compatible_server.md by @robertgshaw2-neuralmagic in #11536
- [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in #11394
- [V1] Fix yapf by @WoosukKwon in #11538
- [CI] Fix broken CI by @robertgshaw2-neuralmagic in #11543
- [misc] fix typing by @youkaichao in #11540
- [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly by @robertgshaw2-neuralmagic in #11534
- [BugFix] Deepseekv3 broke quantization for all other methods by @robertgshaw2-neuralmagic in #11547
Full Changelog: v0.6.6...v0.6.6.post1
v0.6.6
Highlights
- Support for the DeepSeek V3 model (#11523, #11502).
  - On 8xH200s or MI300x: `vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code --max-model-len 8192`. The context length can be increased to about 32K before running into memory issues; a sketch is shown below.
  - For other devices, follow our distributed inference guide to enable tensor parallel and/or pipeline parallel inference.
  - We are just getting started on enhancing the support and unlocking more performance. See #11539 for planned work.
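A minimal sketch of the larger-context setup mentioned above; whether 32768 actually fits depends on the free GPU memory on your nodes:

```bash
# Hedged sketch: same command as above, extended to the ~32K context mentioned.
vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --max-model-len 32768
```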
- Last-mile stretch for the V1 engine refactoring: API Server (#11529, #11530), penalties for the sampler (#10681), prefix caching for vision language models (#11187, #11305), TP Ray executor (#11107, #11472)
- Breaking change: `X-Request-ID` echoing is now opt-in instead of on by default, for performance reasons. Set `--enable-request-id-headers` to enable it; a sketch follows.
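A hedged sketch of opting back in to request-ID echoing; the model name is a placeholder:

```bash
# Start the server with request-ID echoing re-enabled (model name is illustrative).
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-request-id-headers

# Send a request with an X-Request-ID header and inspect the echoed response headers.
curl -i http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -H "X-Request-ID: my-trace-id-123" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 8}'
```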
Model Support
- IBM Granite 3.1 (#11307), JambaForSequenceClassification model (#10860)
- Add `QVQ` and `QwQ` to the list of supported models (#11509)
Performance
- Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
Production Engine
- Support streaming models from S3 using the RunAI Model Streamer as an optional loader (#10192)
- Online Pooling API (#11457); a request sketch follows this list
- Load video from base64 (#11492)
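A hedged request sketch for the Online Pooling API above; the endpoint path and model name are our assumptions rather than a verbatim example from the release:

```bash
# Assumes a pooling/embedding-capable model is being served (model name is a placeholder).
curl http://localhost:8000/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "intfloat/e5-mistral-7b-instruct", "input": "vLLM is fast"}'
```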
Others
- Add pypi index for every commit and nightly build (#11404)
What's Changed
- [Bugfix] Set temperature=0.7 in test_guided_choice_chat by @mgoin in #11264
- [V1] Prefix caching for vision language models by @comaniac in #11187
- [Bugfix] Restore support for larger block sizes by @kzawora-intel in #11259
- [Bugfix] Fix guided decoding with tokenizer mode mistral by @wallashss in #11046
- [MISC][XPU]update ipex link for CI fix by @yma11 in #11278
- [Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support by @dsikka in #10995
- [Bugfix] Fix broken phi3-v mm_processor_kwargs tests by @Isotr0py in #11263
- [CI][Misc] Remove Github Action Release Workflow by @simon-mo in #11274
- [FIX] update openai version by @jikunshang in #11287
- [Bugfix] fix minicpmv test by @joerunde in #11304
- [V1] VLM - enable processor cache by default by @alexm-neuralmagic in #11305
- [Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) by @tlrmchlsmth in #11311
- [Model] IBM Granite 3.1 by @tjohnson31415 in #11307
- [CI] Expand test_guided_generate to test all backends by @mgoin in #11313
- [V1] Simplify prefix caching logic by removing `num_evictable_computed_blocks` by @heheda12345 in #11310
- [VLM] Merged multimodal processor for Qwen2-Audio by @DarkLight1337 in #11303
- [Kernel] Refactor Cutlass c3x by @varun-sundar-rabindranath in #10049
- [Misc] Optimize ray worker initialization time by @ruisearch42 in #11275
- [misc] benchmark_throughput : Add LoRA by @varun-sundar-rabindranath in #11267
- [Feature] Add load generation config from model by @liuyanyi in #11164
- [Bugfix] Cleanup Pixtral HF code by @DarkLight1337 in #11333
- [Model] Add JambaForSequenceClassification model by @yecohn in #10860
- [V1] Fix multimodal profiling for `Molmo` by @ywang96 in #11325
- [Model] Refactor Qwen2-VL to use merged multimodal processor by @Isotr0py in #11258
- [Misc] Clean up and consolidate LRUCache by @DarkLight1337 in #11339
- [Bugfix] Fix broken CPU compressed-tensors test by @Isotr0py in #11338
- [Misc] Remove unused vllm/block.py by @Ghjk94522 in #11336
- [CI] Adding CPU docker pipeline by @zhouyuan in #11261
- [Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10 by @Akashcodes732 in #11331
- [ci][gh200] dockerfile clean up by @youkaichao in #11351
- [Misc] Add tqdm progress bar during graph capture by @mgoin in #11349
- [Bugfix] Fix spec decoding when seed is none in a batch by @wallashss in #10863
- [misc] add early error message for custom ops by @youkaichao in #11355
- [doc] backward compatibility for 0.6.4 by @youkaichao in #11359
- [V1] Fix profiling for models with merged input processor by @ywang96 in #11370
- [CI/Build] fix pre-compiled wheel install for exact tag by @dtrifiro in #11373
- [Core] Loading model from S3 using RunAI Model Streamer as optional loader by @omer-dayan in #10192
- [Bugfix] Don't log OpenAI field aliases as ignored by @mgoin in #11378
- [doc] explain nccl requirements for rlhf by @youkaichao in #11381
- Add ray[default] to wget to run distributed inference out of box by @Jeffwan in #11265
- [V1][Bugfix] Skip hashing empty or None mm_data by @WoosukKwon in #11386
- [Bugfix] update should_ignore_layer by @horheynm in #11354
- [V1] Make AsyncLLMEngine v1-v0 opaque by @rickyyx in #11383
- [Bugfix] Fix issues for `Pixtral-Large-Instruct-2411` by @ywang96 in #11393
- [CI] Fix flaky entrypoint tests by @ywang96 in #11403
- [cd][release] add pypi index for every commit and nightly build by @youkaichao in #11404
- [cd][release] fix race conditions by @youkaichao in #11407
- [Bugfix] Fix fully sharded LoRAs with Mixtral by @n1hility in #11390
- [CI] Unblock H100 Benchmark by @simon-mo in #11419
- [misc][perf] remove old code by @youkaichao in #11425
- mypy type checking for vllm/worker by @lucas-tucker in #11418
- [Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF by @mgoin in #11389
- [Bugfix] torch nightly version in ROCm installation guide by @terrytangyuan in #11423
- [Misc] Add assertion and helpful message for marlin24 compressed models by @dsikka in #11388
- [Misc] add w8a8 asym models by @dsikka in #11075
- [CI] Expand OpenAI test_chat.py guided decoding tests by @mgoin in #11048
- [Bugfix] Add kv cache scales to gemma2.py by @mgoin in #11269
- [Doc] Fix typo in the help message of '--guided-decoding-backend' by @yansh97 in #11440
- [Docs] Convert rST to MyST (Markdown) by @rafvasq in #11145
- [V1] TP Ray executor by @ruisearch42 in #11107
- [Misc] Suppress irrelevant exception stack trace information when CUDA… by @shiquan1988 in #11438
- [Frontend] Online Pooling API by @DarkLight1337 in #11457
- [Bugfix] Fix Qwen2-VL LoRA weight loading by @jeejeelee in #11430
- [Bugfix][Hardware][CPU] Fix CPU `input_positions` creation for text-only inputs with mrope by @Isotr0py in #11434
- [OpenVINO] Fixed installation conflicts by @ilya-lavrenov in #11458
- [attn][tiny fix] fix attn backend in MultiHeadAttention by @MengqingCao in #11463
- [Misc] Move weights mapper by @jeejeelee in #11443
- [Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 by @terrytangyuan in #11435
- [Model] Automatic conversion of classification and reward models by @DarkLight1337 in #11469
- [V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor by @ruisearch42 in #11472
- [Misc] Update disaggregation benchmark scripts and test logs by @Jeffwan in #11456
- [Frontend] Enable decord to load video from base64 by @DarkLight1337 in #11492
- [Doc] Improve GitHub links by @DarkLight1337 in #11491
- [Misc] Move ...
v0.6.5
Highlights
- Significant progress on the V1 engine refactor and multimodal support: New model executable interfaces for text-only and multimodal models, multiprocessing, improved configuration handling, and profiling enhancements (#10374, #10570, #10699, #11074, #11076, #10382, #10665, #10564, #11125, #11185, #11242).
- Major improvements in `torch.compile` integration: support for all attention backends, encoder-based models, dynamic FP8 fusion, shape specialization fixes, and performance optimizations (#10558, #10613, #10121, #10383, #10399, #10406, #10437, #10460, #10552, #10622, #10722, #10620, #10906, #11108, #11059, #11005, #10838, #11081, #11110).
- Expanded model support, including Aria, Cross Encoders, GLM-4, OLMo November 2024, Telechat2, LoRA improvements, and multimodal Granite models (#10514, #10400, #10561, #10503, #10311, #10291, #9057, #10418, #5064).
- Use xgrammar as the default guided decoding backend (#10785)
- Improved hardware enablement for AMD ROCm, ARM AARCH64, TPU prefix caching, XPU AWQ/GPTQ, and various CPU/Gaudi/HPU/NVIDIA enhancements (#10254, #9228, #10307, #10107, #10667, #10565, #10239, #11016, #9735, #10355, #10700).
- Note: Changed the default temperature for ChatCompletionRequest from 0.7 to 1.0 to align with OpenAI (#11219); a request sketch follows this list.
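Two of the default changes above can be pinned explicitly; a hedged sketch (the model name is a placeholder):

```bash
# Select the guided decoding backend explicitly instead of the new xgrammar default.
vllm serve meta-llama/Llama-3.1-8B-Instruct --guided-decoding-backend outlines

# Requests that relied on the old 0.7 default temperature should now pass it explicitly.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
         "messages": [{"role": "user", "content": "Hello"}],
         "temperature": 0.7}'
```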
Model Support
- Added Aria (#10514), Cross Encoder (#10400), GLM-4 (#10561), OLMo (#10503), Telechat2 (#10311), Cohere R7B (#11203), GritLM embeddings (#10816)
- LoRA support for Internlm2, glm-4v, Pixtral-HF (#5064, #10418, #10795).
- Improved quantization (BNB, bitsandbytes) for multiple models (#10795, #10842, #10682, #10549)
- Expanded multimodal support (#10291, #11142).
Hardware Support
- AMD ROCm GGUF quantization (#10254), ARM AARCH64 enablement (#9228), TPU prefix caching (#10307), XPU AWQ/GPTQ (#10107), CPU/Gaudi/HPU enhancements (#10355, #10667, #10565, #10239, #11016, #9735, #10541, #10394, #10700).
Performance & Scheduling
- Prefix-cache aware scheduling (#10128), sliding window support (#10462), disaggregated prefill enhancements (#10502, #10884), evictor optimization (#7209).
Benchmark & Frontend
- Benchmark structured outputs and vision datasets (#10804, #10557, #10880, #10547).
- Frontend: Automatic chat format detection (#9919), input_audio support (#11027; a request sketch follows), CLI --version (#10369), extra fields in requests (#10463).
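A hedged sketch of the new input_audio chat content (#11027); the payload shape follows the OpenAI-style convention and the model name is a placeholder:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2-Audio-7B-Instruct",
         "messages": [{"role": "user", "content": [
           {"type": "input_audio", "input_audio": {"data": "<BASE64_WAV>", "format": "wav"}},
           {"type": "text", "text": "Transcribe this clip."}]}]}'
```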
Documentation & Plugins
- Architecture overview (#10368), Helm chart (#9199), KubeAI integration (#10837), plugin system docs (#10372), disaggregated prefilling (#11197), structured outputs (#9943), usage section (#10827).
Bugfixes & Misc
What's Changed
- Add default value to avoid Falcon crash (#5363) by @wchen61 in #10347
- [Misc] Fix import error in tensorizer tests and cleanup some code by @DarkLight1337 in #10349
- [Doc] Remove float32 choice from --lora-dtype by @xyang16 in #10348
- [Bugfix] Fix fully sharded LoRA bug by @jeejeelee in #10352
- [Misc] Fix some help info of arg_utils to improve readability by @ShangmingCai in #10362
- [core][misc] keep compatibility for old-style classes by @youkaichao in #10356
- [Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer by @gcalmettes in #10363
- [Misc] Bump up test_fused_moe tolerance by @ElizaWszola in #10364
- [Misc] bump mistral common version by @simon-mo in #10367
- [Docs] Add Nebius as sponsors by @simon-mo in #10371
- [Frontend] Add --version flag to CLI by @russellb in #10369
- [Doc] Move PR template content to docs by @russellb in #10159
- [Docs] Misc updates to TPU installation instructions by @mikegre-google in #10165
- [Frontend] Automatic detection of chat content format from AST by @DarkLight1337 in #9919
- [doc] add doc for the plugin system by @youkaichao in #10372
- [misc][plugin] improve log messages by @youkaichao in #10386
- [BugFix] [Kernel] Fix GPU SEGV occurring in fused_moe kernel by @rasmith in #10385
- [Misc] Update benchmark to support image_url file or http by @kakao-steve-ai in #10287
- [Misc] Medusa supports custom bias by @skylee-01 in #10361
- [Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled by @imkero in #10388
- [V1] Add code owners for V1 by @WoosukKwon in #10397
- [2/N][torch.compile] make compilation cfg part of vllm cfg by @youkaichao in #10383
- [V1] Refactor model executable interface for all text-only language models by @ywang96 in #10374
- [CI/Build] Fix IDC hpu [Device not found] issue by @xuechendi in #10384
- [Bugfix][Hardware][CPU] Fix CPU embedding runner with tensor parallel by @Isotr0py in #10394
- [platforms] refactor cpu code by @youkaichao in #10402
- [Hardware] [HPU] Add `mark_step` for hpu by @jikunshang in #10239
- [Bugfix] Fix mrope_position_delta in non-last prefill chunk by @imkero in #10403
- [Misc] Enhance offline_inference to support user-configurable paramet… by @wchen61 in #10392
- [Misc] Add uninitialized params tracking for `AutoWeightsLoader` by @Isotr0py in #10327
- [Bugfix] Ignore ray reinit error when current platform is ROCm or XPU by @HollowMan6 in #10375
- [4/N][torch.compile] clean up set_torch_compile_backend by @youkaichao in #10401
- [VLM] Report multi_modal_placeholders in output by @lk-chen in #10407
- [Model] Remove redundant softmax when using PoolingType.STEP by @Maybewuss in #10415
- [Model][LoRA]LoRA support added for glm-4v by @B-201 in #10418
- [Model] Remove transformers attention porting in VITs by @Isotr0py in #10414
- [Doc] Update doc for LoRA support in GLM-4V by @B-201 in #10425
- [5/N][torch.compile] torch.jit.script --> torch.compile by @youkaichao in #10406
- [Doc] Add documentation for Structured Outputs by @ismael-dm in #9943
- Fix open_collective value in FUNDING.yml by @andrew in #10426
- [Model][Bugfix] Support TP for PixtralHF ViT by @mgoin in #10405
- [Hardware][XPU] AWQ/GPTQ support for xpu backend by @yma11 in #10107
- [Kernel] Explicitly specify other value in tl.load calls by @angusYuhao in #9014
- [Kernel] Initial Machete W4A8 support + Refactors by @LucasWilkinson in #9855
- [3/N][torch.compile] consolidate custom op logging by @youkaichao in #10399
- [ci][bugfix] fix kernel tests by @youkaichao in #10431
- [misc] Allow partial prefix benchmarking & random input generation for prefix benchmarking by @rickyyx in #9929
- [ci/build] Have dependabot ignore all patch update by @khluu in #10436
- [Bugfix]Fix Phi-3 BNB online quantization by @jeejeelee in #10417
- [Platform][Refactor] Extract func `get_default_attn_backend` to `Platform` by @MengqingCao in #10358
- Add openai.beta.chat.completions.parse example to structured_outputs.rst by @mgoin in #10433
- [Bugfix] Guard for negative counter metrics to prevent crash by @tjohnson31415 in #10430
- [Misc] Avoid misleading warning messages by @jeejeelee in #10438
- [Doc] Add the start of an arch overview page by @russellb in #10368
- [misc][plugin] improve plugin loading by @youkaichao in #10443
- [CI][CPU] adding numa node number as container name suffix by @zhouyuan in #10441
- [BugFix] Fix hermes tool parser output error stream arguments in some cases (#10395) by @xiyuan-lee in #10398
- [Pixtral-Large] Pixtral actually has no bias in vision-lang adapter by @patrickvonplaten in #10449
- Fix: Build error seen on Power Architecture by @mikejuliet13 in #10421
- [Doc] fix link for page that was renamed by @russellb in #10455
- [6/N] to...
v0.6.4.post1
This patch release covers bug fixes (#10347, #10349, #10348, #10352, #10363) and keeps compatibility for `vLLMConfig` usage in out-of-tree models (#10356).
What's Changed
- Add default value to avoid Falcon crash (#5363) by @wchen61 in #10347
- [Misc] Fix import error in tensorizer tests and cleanup some code by @DarkLight1337 in #10349
- [Doc] Remove float32 choice from --lora-dtype by @xyang16 in #10348
- [Bugfix] Fix fully sharded LoRA bug by @jeejeelee in #10352
- [Misc] Fix some help info of arg_utils to improve readability by @ShangmingCai in #10362
- [core][misc] keep compatibility for old-style classes by @youkaichao in #10356
- [Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer by @gcalmettes in #10363
- [Misc] Bump up test_fused_moe tolerance by @ElizaWszola in #10364
- [Misc] bump mistral common version by @simon-mo in #10367
New Contributors
Full Changelog: v0.6.4...v0.6.4.post1
v0.6.4
Highlights
- Significant progress in the V1 engine core refactor (#9826, #10135, #10288, #10211, #10225, #10228, #10268, #9954, #10272, #9971, #10224, #10166, #9289, #10058, #9888, #9972, #10059, #9945, #9679, #9871, #10227, #10245, #9629, #10097, #10203, #10148). You can check out more details regarding the design and the plan ahead in our recent meetup slides
- Significant progress in `torch.compile` support. Many models now support torch.compile with TorchInductor. You can check out our meetup slides for more details. (#9775, #9614, #9639, #9641, #9876, #9946, #9589, #9896, #9637, #9300, #9947, #9138, #9715, #9866, #9632, #9858, #9889)
Model Support
- New LLMs and VLMs: Idefics3 (#9767), H2OVL-Mississippi (#9747), Qwen2-Audio (#9248), Pixtral models in the HF Transformers format (#9036), FalconMamba (#9325), Florence-2 language backbone (#9555)
- New encoder-decoder embedding models: BERT (#9056), RoBERTa & XLM-RoBERTa (#9387)
- Expanded task support: Llama embeddings (#9806), Math-Shepherd (Mistral reward modeling) (#9697), Qwen2 classification (#9704), Qwen2 embeddings (#10184), VLM2Vec (Phi-3-Vision embeddings) (#9303), E5-V (LLaVA-NeXT embeddings) (#9576), Qwen2-VL embeddings (#9944)
- Tool calling parser for Granite 3.0 (#9027), Jamba (#9154), granite-20b-functioncalling (#8339)
- LoRA support for Granite 3.0 MoE (#9673), Idefics3 (#10281), Llama embeddings (#10071), Qwen (#9622), Qwen2-VL (#10022)
- BNB quantization support for Idefics3 (#10310), Mllama (#9720), Qwen2 (#9467, #9574), MiniCPMV (#9891)
- Unified multi-modal processor for VLM (#10040, #10044)
- Simplify model interface (#9933, #10237, #9938, #9958, #10007, #9978, #9983, #10205)
Hardware Support
- Gaudi: Add Intel Gaudi (HPU) inference backend (#6143)
- CPU: Add embedding models support for CPU backend (#10193)
- TPU: Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438)
- Triton: Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857)
Performance
Engine Core
- Override HF `config.json` via CLI (#5836)
- Add goodput metric support (#9338)
- Move parallel sampling out of the vLLM core, paving the way for the V1 engine (#9302)
- Add stateless process group for easier integration with RLHF and disaggregated prefill (#10216, #10072)
Others
- Improvements to the pull request experience with DCO, mergify, stale bot, etc. (#9436, #9512, #9513, #9259, #10082, #10285, #9803)
- Dropped support for Python 3.8 (#10038, #8464)
- Basic Integration Test For TPU (#9968)
- Document the class hierarchy in vLLM (#10240), explain the integration with Hugging Face (#10173).
- Benchmark throughput now supports image input (#9851)
What's Changed
- [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
- [Frontend] merge beam search implementations by @LunrEclipse in #9296
- [Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
- [Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
- [Frontend] Clarify model_type error messages by @stevegrubb in #9345
- [Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
- [Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
- [BugFix] Fix chat API continuous usage stats by @njhill in #9357
- pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
- [Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
- [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
- [Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
- [Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
- [Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
- [Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
- [CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
- [Core] Rename input data types by @DarkLight1337 in #8688
- [Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
- [Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
- Support mistral interleaved attn by @patrickvonplaten in #9414
- [Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
- [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
- [Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
- [CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
- [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
- [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
- Add notes on the use of Slack by @terrytangyuan in #9442
- [Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
- [Misc] Print stack trace using `logger.exception` by @DarkLight1337 in #9461
- [misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
- [Bugfix] Allow prefill of assistant response when using `mistral_common` by @sasha0552 in #9446
- [TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
- [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
- [Core] Deprecating block manager v1 and make block manager v2 default by @KuntaiDu in #8704
- [CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375
- [Misc] Remove commit id file by @DarkLight1337 in #9470
- [torch.compile] Fine-grained CustomOp enabling mechanism by @ProExpertProg in #9300
- [Bugfix] Fix support for dimension like integers and ScalarType by @bnellnm in #9299
- [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script by @wukaixingxp in #9013
- [Bugfix] Print warnings related to `mistral_common` tokenizer only once by @sasha0552 in #9468
- [Hardware][Neuron] Simplify model load for transformers-neuronx library by @sssrijan-amazon in #9380
- Support `BERTModel` (first `encoder-only` embedding model) by @robertgshaw2-neuralmagic in #9056
- [BugFix] Stop silent failures on compressed-tensors parsing by @dsikka in #9381
- [Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage by @joerunde in #9352
- [Qwen2.5] Support bnb quant for Qwen2.5 by @blueyo0 in #9467
- [CI/Build] Use commit hash references for github actions by @russellb in #9430
- [BugFix] Typing fixes to RequestOutput.prompt and beam search by @njhill in #9473
- [Frontend][Feature] Add jamba tool parser by @tomeras91 in #9154
- [BugFix] Fix and simplify completion API usage streaming by @njhill in #9475
- [CI/Build] Fix lint errors in mistral tokenizer by @DarkLight1337 in #9504
- [Bugfix] Fix offline_inference_with_prefix.py by @tlrmchlsmth in #9505
- [Misc] benchmark: Add option to set max concurrency by @russellb in #9390
- [Model] Add user-configurable task for models that support both generation and embedding by @DarkLight1337 in #9424
- [CI/Build] Add error matching config for mypy by @russellb in #9512
- [Model] Support Pixtral models ...
v0.6.3.post1
Highlights
New Models
- Support Ministral 3B and Ministral 8B via interleaved attention (#9414)
- Support multiple and interleaved images for Llama3.2 (#9095)
- Support VLM2Vec, the first multimodal embedding model in vLLM (#9303)
Important bug fix
- Fix chat API continuous usage stats (#9357)
- Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
- Fix Molmo text-only input bug (#9397)
- Fix CUDA 11.8 Build (#9386)
- Fix `_version.py` not found issue (#9375)
Other Enhancements
- Remove block manager v1 and make block manager v2 default (#8704)
- Spec decode: optimize ngram lookup performance (#9333)
What's Changed
- [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
- [Frontend] merge beam search implementations by @LunrEclipse in #9296
- [Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
- [Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
- [Frontend] Clarify model_type error messages by @stevegrubb in #9345
- [Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
- [Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
- [BugFix] Fix chat API continuous usage stats by @njhill in #9357
- pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
- [Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
- [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
- [Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
- [Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
- [Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
- [Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
- [CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
- [Core] Rename input data types by @DarkLight1337 in #8688
- [Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
- [Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
- Support mistral interleaved attn by @patrickvonplaten in #9414
- [Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
- [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
- [Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
- [CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
- [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
- [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
- Add notes on the use of Slack by @terrytangyuan in #9442
- [Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
- [Misc] Print stack trace using `logger.exception` by @DarkLight1337 in #9461
- [misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
- [Bugfix] Allow prefill of assistant response when using `mistral_common` by @sasha0552 in #9446
- [TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
- [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
- [Core] Deprecating block manager v1 and make block manager v2 default by @KuntaiDu in #8704
- [CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375
New Contributors
- @gracehonv made their first contribution in #9349
- @streaver91 made their first contribution in #9396
Full Changelog: v0.6.3...v0.6.3.post1
v0.6.3
Highlights
Model Support
- New Models:
- Expansion in functionality:
- Out-of-tree support enhancements: explicit interface for vLLM models and support for OOT embedding models (#9108)
Documentation
- New compatibility matrix for mutually exclusive features (#8512)
- Reorganized installation doc, note that we publish a per-commit docker image (#8931)
Hardware Support:
- Cross-attention and Encoder-Decoder models support on x86 CPU backend (#9089)
- Support AWQ for CPU backend (#7515)
- Add async output processor for xpu (#8897)
- Add on-device sampling support for Neuron (#8746)
Architectural Enhancements
- Progress in vLLM's core refactoring:
- Spec decode removing batch expansion (#8839, #9298).
- We have made block manager V2 the default. This is an internal refactoring for cleaner and more tested code path (#8678).
- Moving beam search from the core to the API level (#9105, #9087, #9117, #8928)
- Move guided decoding params into sampling params (#8252)
- Torch Compile:
  - You can now set the env var `VLLM_TORCH_COMPILE_LEVEL` to control the level of `torch.compile` integration (#9058). Along with various improvements (#8982, #9258, #906, #8875), using `VLLM_TORCH_COMPILE_LEVEL=3` can turn on Inductor's full-graph compilation without vLLM's custom ops. A sketch follows.
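A hedged sketch of the compile-level toggle described above; the model name is a placeholder and the level semantics are per #9058:

```bash
# Turn on Inductor full-graph compilation without vLLM's custom ops.
VLLM_TORCH_COMPILE_LEVEL=3 vllm serve meta-llama/Llama-3.1-8B-Instruct
```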
Others
- Performance enhancements to turn on multi-step scheduling by default (#8804, #8645, #8378)
- Enhancements towards priority scheduling (#8965, #8956, #8850)
What's Changed
- [Misc] Update config loading for Qwen2-VL and remove Granite by @ywang96 in #8837
- [Build/CI] Upgrade to gcc 10 in the base build Docker image by @tlrmchlsmth in #8814
- [Docs] Add README to the build docker image by @mgoin in #8825
- [CI/Build] Fix missing ci dependencies by @fyuan1316 in #8834
- [misc][installation] build from source without compilation by @youkaichao in #8818
- [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM by @khluu in #8872
- [Bugfix] Include encoder prompts len to non-stream api usage response by @Pernekhan in #8861
- [Misc] Change dummy profiling and BOS fallback warns to log once by @mgoin in #8820
- [Bugfix] Fix print_warning_once's line info by @tlrmchlsmth in #8867
- fix validation: Only set tool_choice `auto` if at least one tool is provided by @chiragjn in #8568
- [Bugfix] Fixup advance_step.cu warning by @tlrmchlsmth in #8815
- [BugFix] Fix test breakages from transformers 4.45 upgrade by @njhill in #8829
- [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility by @DarkLight1337 in #8764
- [Feature] Add support for Llama 3.1 and 3.2 tool use by @maxdebayser in #8343
- [Core] Rename `PromptInputs` and `inputs` with backward compatibility by @DarkLight1337 in #8876
- [misc] fix collect env by @youkaichao in #8894
- [MISC] Fix invalid escape sequence '' by @panpan0000 in #8830
- [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` by @Isotr0py in #8892
- [TPU] Update pallas.py to support trillium by @bvrockwell in #8871
- [torch.compile] use empty tensor instead of None for profiling by @youkaichao in #8875
- [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method by @ProExpertProg in #7271
- [Bugfix] fix for deepseek w4a16 by @LucasWilkinson in #8906
- [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path by @varun-sundar-rabindranath in #8378
- [misc][distributed] add VLLM_SKIP_P2P_CHECK flag by @youkaichao in #8911
- [Core] Priority-based scheduling in async engine by @schoennenbeck in #8850
- [misc] fix wheel name by @youkaichao in #8919
- [Bugfix][Intel] Fix XPU Dockerfile Build by @tylertitsworth in #7824
- [Misc] Remove vLLM patch of `BaichuanTokenizer` by @DarkLight1337 in #8921
- [Bugfix] Fix code for downloading models from modelscope by @tastelikefeet in #8443
- [Bugfix] Fix PP for Multi-Step by @varun-sundar-rabindranath in #8887
- [CI/Build] Update models tests & examples by @DarkLight1337 in #8874
- [Frontend] Make beam search emulator temperature modifiable by @nFunctor in #8928
- [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 by @heheda12345 in #8891
- [doc] organize installation doc and expose per-commit docker by @youkaichao in #8931
- [Core] Improve choice of Python multiprocessing method by @russellb in #8823
- [Bugfix] Block manager v2 with preemption and lookahead slots by @sroy745 in #8824
- [Bugfix] Fix Marlin MoE act order when is_k_full == False by @ElizaWszola in #8741
- [CI/Build] Add test decorator for minimum GPU memory by @DarkLight1337 in #8925
- [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching by @tlrmchlsmth in #8930
- [Model] Support Qwen2.5-Math-RM-72B by @zhuzilin in #8896
- [Model][LoRA]LoRA support added for MiniCPMV2.5 by @jeejeelee in #7199
- [BugFix] Fix seeded random sampling with encoder-decoder models by @njhill in #8870
- [Misc] Fix typo in BlockSpaceManagerV1 by @juncheoll in #8944
- [Frontend] Added support for HF's new `continue_final_message` parameter by @danieljannai21 in #8942
- [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model by @mzusman in #8533
- [Model] support input embeddings for qwen2vl by @whyiug in #8856
- [Misc][CI/Build] Include `cv2` via `mistral_common[opencv]` by @ywang96 in #8951
- [Model][LoRA]LoRA support added for MiniCPMV2.6 by @jeejeelee in #8943
- [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg by @Isotr0py in #8946
- [Core] Make scheduling policy settable via EngineArgs by @schoennenbeck in #8956
- [Misc] Adjust max_position_embeddings for LoRA compatibility by @jeejeelee in #8957
- [ci] Add CODEOWNERS for test directories by @khluu in #8795
- [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. by @LiuXiaoxuanPKU in #8975
- [Frontend][Core] Move guided decoding params into sampling params by @joerunde in #8252
- [CI/Build] Fix machete generated kernel files ordering by @khluu in #8976
- [torch.compile] fix tensor alias by @youkaichao in #8982
- [Misc] add process_weights_after_loading for DummyLoader by @divakar-amd in #8969
- [Bugfix] Fix Fuyu tensor parallel inference by @Isotr0py in #8986
- [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders by @alex-jw-brooks in #8991
- [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API by @schoennenbeck in #8965
- [Doc] Update list of supported models by @DarkLight1337 in #8987
- Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows by @vlsav in https://github.com...
v0.6.2
Highlights
Model Support
- Support Llama 3.2 models (#8811, #8822)
  - `vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16`
- Beam search has been soft-deprecated. We are moving towards a version of beam search that's more performant while also simplifying vLLM's core. (#8684, #8763, #8713)
  - ⚠️ You will now see the following error; this is a breaking change! "Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the `vllm.LLM.use_beam_search` method for dedicated beam search instead, or set the environment variable `VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1` to suppress this error." For more details, see #8306. A sketch of the temporary opt-out is shown after this list.
- Support for Solar Model (#8386), minicpm3 (#8297), and LLaVA-Onevision (#8486)
- Enhancements: PP for Qwen2-VL (#8696), multiple images for Qwen-VL (#8247), Mistral function calling (#8515), bitsandbytes support for Gemma2 (#8338), tensor parallelism with bitsandbytes quantization (#8434)
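A hedged sketch of the temporary opt-out for the beam search deprecation above, to use while migrating to the dedicated beam search API:

```bash
# Suppress the deprecation error for now (see #8306 for the migration path),
# then serve as usual; the serve command mirrors the Llama 3.2 example above.
export VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1
vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
```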
Hardware Support
- TPU: implement multi-step scheduling (#8489), use Ray for default distributed backend (#8389)
- CPU: Enable mrope and support Qwen2-VL on CPU backend (#8770)
- AMD: custom paged attention kernel for rocm (#8310), and fp8 kv cache support (#8577)
Production Engine
- Initial support for priority scheduling (#5958)
- Support LoRA lineage and base model metadata management (#6315)
- Batch inference for llm.chat() API (#8648)
Performance
- Introduce `MQLLMEngine` for the API server, boosting throughput 30% in single-step and 7% in multi-step mode (#8157, #8761, #8584)
- Multi-step scheduling enhancements
- Add cuda graph support during decoding for encoder-decoder models (#7631)
Others
- Support sample from HF datasets and image input for benchmark_serving (#8495)
- Progress in torch.compile integration (#8488, #8480, #8384, #8526, #8445)
What's Changed
- [MISC] Dump model runner inputs when crashing by @comaniac in #8305
- [misc] remove engine_use_ray by @youkaichao in #8126
- [TPU] Use Ray for default distributed backend by @WoosukKwon in #8389
- Fix the AMD weight loading tests by @mgoin in #8390
- [Bugfix]: Fix the logic for deciding if tool parsing is used by @tomeras91 in #8366
- [Gemma2] add bitsandbytes support for Gemma2 by @blueyo0 in #8338
- [Misc] Raise error when using encoder/decoder model with cpu backend by @kevin314 in #8355
- [Misc] Use RoPE cache for MRoPE by @WoosukKwon in #8396
- [torch.compile] hide slicing under custom op for inductor by @youkaichao in #8384
- [Hotfix][VLM] Fixing max position embeddings for Pixtral by @ywang96 in #8399
- [Bugfix] Fix InternVL2 inference with various num_patches by @Isotr0py in #8375
- [Model] Support multiple images for qwen-vl by @alex-jw-brooks in #8247
- [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance by @lnykww in #8403
- [BugFix] Fix Duplicate Assignment of Class Variable in Hermes2ProToolParser by @vegaluisjose in #8423
- [Bugfix] Offline mode fix by @joerunde in #8376
- [multi-step] add flashinfer backend by @SolitaryThinker in #7928
- [Core] Add engine option to return only deltas or final output by @njhill in #7381
- [Bugfix] multi-step + flashinfer: ensure cuda graph compatible by @alexm-neuralmagic in #8427
- [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models by @ywang96 in #8425
- [CI/Build] Disable multi-node test for InternVL2 by @ywang96 in #8428
- [Hotfix][Pixtral] Fix multiple images bugs by @patrickvonplaten in #8415
- [Bugfix] Fix weight loading issue by rename variable. by @wenxcs in #8293
- [Misc] Update Pixtral example by @ywang96 in #8431
- [BugFix] fix group_topk by @dsikka in #8430
- [Core] Factor out input preprocessing to a separate class by @DarkLight1337 in #7329
- [Bugfix] Mapping physical device indices for e2e test utils by @ShangmingCai in #8290
- [Bugfix] Bump fastapi and pydantic version by @DarkLight1337 in #8435
- [CI/Build] Update pixtral tests to use JSON by @DarkLight1337 in #8436
- [Bugfix] Fix async log stats by @alexm-neuralmagic in #8417
- [bugfix] torch profiler bug for single gpu with GPUExecutor by @SolitaryThinker in #8354
- bump version to v0.6.1.post1 by @simon-mo in #8440
- [CI/Build] Enable InternVL2 PP test only on single node by @Isotr0py in #8437
- [doc] recommend pip instead of conda by @youkaichao in #8446
- [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 by @jeejeelee in #8442
- [misc][ci] fix quant test by @youkaichao in #8449
- [Installation] Gate FastAPI version for Python 3.8 by @DarkLight1337 in #8456
- [plugin][torch.compile] allow to add custom compile backend by @youkaichao in #8445
- [CI/Build] Reorganize models tests by @DarkLight1337 in #7820
- [Doc] Add oneDNN installation to CPU backend documentation by @Isotr0py in #8467
- [HotFix] Fix final output truncation with stop string + streaming by @njhill in #8468
- bump version to v0.6.1.post2 by @simon-mo in #8473
- [Hardware][intel GPU] bump up ipex version to 2.3 by @jikunshang in #8365
- [Kernel][Hardware][Amd]Custom paged attention kernel for rocm by @charlifu in #8310
- [Model] support minicpm3 by @SUDA-HLT-ywfang in #8297
- [torch.compile] fix functionalization by @youkaichao in #8480
- [torch.compile] add a flag to disable custom op by @youkaichao in #8488
- [TPU] Implement multi-step scheduling by @WoosukKwon in #8489
- [Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations by @chrisociepa in #8490
- [Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel by @Isotr0py in #8357
- [Kernel] Enable 8-bit weights in Fused Marlin MoE by @ElizaWszola in #8032
- [Frontend] Expose revision arg in OpenAI server by @lewtun in #8501
- [BugFix] Fix clean shutdown issues by @njhill in #8492
- [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel by @sasha0552 in #8506
- [Kernel] AQ AZP 3/4: Asymmetric quantization kernels by @ProExpertProg in #7270
- [doc] update doc on testing and debugging by @youkaichao in #8514
- [Bugfix] Bind api server port before starting engine by @kevin314 in #8491
- [perf bench] set timeout to debug hanging by @simon-mo in #8516
- [misc] small qol fixes for release process by @simon-mo in #8517
- [Bugfix] Fix 3.12 builds on main by @joerunde in #8510
- [refactor] remove triton based sampler by @simon-mo in #8524
- [Frontend] Improve Nullable kv Arg Parsing by @alex-jw-brooks in #8525
- [Misc][Bugfix] Disable guided decoding for mistral tokenizer by @ywang96 in #8521
- [torch.compile] register allreduce operations as custom ops by @youkaichao in #8526
- [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change by @ruisearch42 in #8509
- [Benchmark] Support sample from HF datasets and image input for benchmark_serving by @Isotr0py in #8495
- [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models by @sroy745 in #7631
- [Feature][kernel] tensor parallelism with bitsandbytes quantizati...
v0.6.1.post2
Highlights
- This release contains an important bugfix related to token streaming combined with stop string (#8468)
What's Changed
- [CI/Build] Enable InternVL2 PP test only on single node by @Isotr0py in #8437
- [doc] recommend pip instead of conda by @youkaichao in #8446
- [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 by @jeejeelee in #8442
- [misc][ci] fix quant test by @youkaichao in #8449
- [Installation] Gate FastAPI version for Python 3.8 by @DarkLight1337 in #8456
- [plugin][torch.compile] allow to add custom compile backend by @youkaichao in #8445
- [CI/Build] Reorganize models tests by @DarkLight1337 in #7820
- [Doc] Add oneDNN installation to CPU backend documentation by @Isotr0py in #8467
- [HotFix] Fix final output truncation with stop string + streaming by @njhill in #8468
- bump version to v0.6.1.post2 by @simon-mo in #8473
Full Changelog: v0.6.1.post1...v0.6.1.post2
v0.6.1.post1
Highlights
This release features important bug fixes and enhancements for
- Pixtral models. (#8415, #8425, #8399, #8431)
- Chunked scheduling has been turned off for vision models. Please replace `--max_num_batched_tokens 16384` with `--max-model-len 16384` (see the sketch after this list).
- Multistep scheduling. (#8417, #7928, #8427)
- Tool use. (#8423, #8366)
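A hedged sketch of the flag swap for vision models noted above; the model name and the `--tokenizer-mode` flag are illustrative:

```bash
# Previously (chunked prefill enabled for vision models):
#   vllm serve mistralai/Pixtral-12B-2409 --tokenizer-mode mistral --max_num_batched_tokens 16384
# Now:
vllm serve mistralai/Pixtral-12B-2409 --tokenizer-mode mistral --max-model-len 16384
```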
Also
- support multiple images for qwen-vl (#8247)
- removes `engine_use_ray` (#8126)
- add engine option to return only deltas or final output (#7381)
- add bitsandbytes support for Gemma2 (#8338)
What's Changed
- [MISC] Dump model runner inputs when crashing by @comaniac in #8305
- [misc] remove engine_use_ray by @youkaichao in #8126
- [TPU] Use Ray for default distributed backend by @WoosukKwon in #8389
- Fix the AMD weight loading tests by @mgoin in #8390
- [Bugfix]: Fix the logic for deciding if tool parsing is used by @tomeras91 in #8366
- [Gemma2] add bitsandbytes support for Gemma2 by @blueyo0 in #8338
- [Misc] Raise error when using encoder/decoder model with cpu backend by @kevin314 in #8355
- [Misc] Use RoPE cache for MRoPE by @WoosukKwon in #8396
- [torch.compile] hide slicing under custom op for inductor by @youkaichao in #8384
- [Hotfix][VLM] Fixing max position embeddings for Pixtral by @ywang96 in #8399
- [Bugfix] Fix InternVL2 inference with various num_patches by @Isotr0py in #8375
- [Model] Support multiple images for qwen-vl by @alex-jw-brooks in #8247
- [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance by @lnykww in #8403
- [BugFix] Fix Duplicate Assignment of Class Variable in Hermes2ProToolParser by @vegaluisjose in #8423
- [Bugfix] Offline mode fix by @joerunde in #8376
- [multi-step] add flashinfer backend by @SolitaryThinker in #7928
- [Core] Add engine option to return only deltas or final output by @njhill in #7381
- [Bugfix] multi-step + flashinfer: ensure cuda graph compatible by @alexm-neuralmagic in #8427
- [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models by @ywang96 in #8425
- [CI/Build] Disable multi-node test for InternVL2 by @ywang96 in #8428
- [Hotfix][Pixtral] Fix multiple images bugs by @patrickvonplaten in #8415
- [Bugfix] Fix weight loading issue by rename variable. by @wenxcs in #8293
- [Misc] Update Pixtral example by @ywang96 in #8431
- [BugFix] fix group_topk by @dsikka in #8430
- [Core] Factor out input preprocessing to a separate class by @DarkLight1337 in #7329
- [Bugfix] Mapping physical device indices for e2e test utils by @ShangmingCai in #8290
- [Bugfix] Bump fastapi and pydantic version by @DarkLight1337 in #8435
- [CI/Build] Update pixtral tests to use JSON by @DarkLight1337 in #8436
- [Bugfix] Fix async log stats by @alexm-neuralmagic in #8417
- [bugfix] torch profiler bug for single gpu with GPUExecutor by @SolitaryThinker in #8354
- bump version to v0.6.1.post1 by @simon-mo in #8440
New Contributors
- @blueyo0 made their first contribution in #8338
- @lnykww made their first contribution in #8403
- @vegaluisjose made their first contribution in #8423
Full Changelog: v0.6.1...v0.6.1.post1