
Releases: vllm-project/vllm

v0.6.6.post1

27 Dec 06:24
2339d59

This release restores functionality for other quantized MoEs that was broken as part of the initial DeepSeek V3 support 🙇.

What's Changed

Full Changelog: v0.6.6...v0.6.6.post1

v0.6.6

27 Dec 00:12
f49777b

Highlights

  • Support for the DeepSeek V3 model (#11523, #11502).

    • On 8xH200s or MI300x: vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code --max-model-len 8192. The context length can be increased to about 32K before running into memory issues. See the sketch after this list for querying the server.
    • For other devices, follow our distributed inference guide to enable tensor parallel and/or pipeline parallel inference
    • We are just getting started on enhancing the support and unlocking more performance. See #11539 for planned work.
  • Last mile stretch for V1 engine refactoring: API Server (#11529, #11530), penalties for sampler (#10681), prefix caching for vision language models (#11187, #11305), TP Ray executor (#11107,#11472)

  • Breaking change: X-Request-ID echoing is now opt-in instead of on by default for performance reasons. Set --enable-request-id-headers to enable it, as shown in the sketch below.
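
The following is a minimal sketch of querying the server started above through its OpenAI-compatible API (default port 8000 assumed) and of passing an X-Request-ID header; the header is echoed back only when the server was launched with --enable-request-id-headers:

      from openai import OpenAI

      # Talk to the OpenAI-compatible server started with `vllm serve ...`.
      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

      # with_raw_response exposes the HTTP response headers alongside the body.
      raw = client.chat.completions.with_raw_response.create(
          model="deepseek-ai/DeepSeek-V3",
          messages=[{"role": "user", "content": "Explain tensor parallelism in one sentence."}],
          max_tokens=64,
          extra_headers={"X-Request-ID": "demo-request-1"},  # echoed only if enabled
      )
      print(raw.headers.get("X-Request-ID"))  # None unless --enable-request-id-headers is set
      print(raw.parse().choices[0].message.content)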

Model Support

  • IBM Granite 3.1 (#11307), JambaForSequenceClassification model (#10860)
  • Add QVQ and QwQ to the list of supported models (#11509)

Performance

  • Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)

Production Engine

  • Support streaming models from S3 using the RunAI Model Streamer as an optional loader (#10192); see the sketch after this list
  • Online Pooling API (#11457)
  • Load video from base64 (#11492)
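
The sketch below illustrates the S3 streaming loader mentioned above via the offline LLM API; the bucket path is hypothetical and the optional RunAI Model Streamer dependencies must be installed:

      from vllm import LLM, SamplingParams

      # Stream model weights directly from S3 instead of downloading them first.
      llm = LLM(
          model="s3://my-bucket/Llama-3.1-8B-Instruct/",  # hypothetical S3 path
          load_format="runai_streamer",
      )
      out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
      print(out[0].outputs[0].text)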

Others

  • Add pypi index for every commit and nightly build (#11404)

What's Changed


v0.6.5

17 Dec 23:10
2d1b9ba

Highlights

Model Support

Hardware Support

Performance & Scheduling

  • Prefix-cache aware scheduling (#10128), sliding window support (#10462), disaggregated prefill enhancements (#10502, #10884), evictor optimization (#7209).
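
A minimal sketch of enabling prefix caching so that the prefix-cache aware scheduler can reuse KV cache across requests that share a prompt prefix (model name illustrative):

      from vllm import LLM, SamplingParams

      # Requests sharing a prompt prefix reuse cached KV blocks when
      # enable_prefix_caching is on.
      llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

      shared = "You are a terse assistant. "
      outputs = llm.generate(
          [shared + "Summarize KV caching.", shared + "Summarize paged attention."],
          SamplingParams(max_tokens=32),
      )
      for o in outputs:
          print(o.outputs[0].text)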

Benchmark & Frontend

Documentation & Plugins

Bugfixes & Misc

What's Changed


v0.6.4.post1

15 Nov 17:50
a6221a1

This patch release covers bug fixes (#10347, #10349, #10348, #10352, #10363) and keeps compatibility for vLLMConfig usage in out-of-tree models (#10356).

What's Changed

New Contributors

Full Changelog: v0.6.4...v0.6.4.post1

v0.6.4

15 Nov 07:32
02dbf30

Highlights

Model Support

  • New LLMs and VLMs: Idefics3 (#9767), H2OVL-Mississippi (#9747), Qwen2-Audio (#9248), Pixtral models in the HF Transformers format (#9036), FalconMamba (#9325), Florence-2 language backbone (#9555)
  • New encoder-decoder embedding models: BERT (#9056), RoBERTa & XLM-RoBERTa (#9387)
  • Expanded task support: Llama embeddings (#9806), Math-Shepherd (Mistral reward modeling) (#9697), Qwen2 classification (#9704), Qwen2 embeddings (#10184), VLM2Vec (Phi-3-Vision embeddings) (#9303), E5-V (LLaVA-NeXT embeddings) (#9576), Qwen2-VL embeddings (#9944)
    • Add a user-configurable --task parameter for models that support both generation and embedding (#9424); see the sketch after this list
    • Chat-based Embeddings API (#9759)
  • Tool calling parser for Granite 3.0 (#9027), Jamba (#9154), granite-20b-functioncalling (#8339)
  • LoRA support for Granite 3.0 MoE (#9673), Idefics3 (#10281), Llama embeddings (#10071), Qwen (#9622), Qwen2-VL (#10022)
  • BNB quantization support for Idefics3 (#10310), Mllama (#9720), Qwen2 (#9467, #9574), MiniCPMV (#9891)
  • Unified multi-modal processor for VLM (#10040, #10044)
  • Simplify model interface (#9933, #10237, #9938, #9958, #10007, #9978, #9983, #10205)
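
The sketch below illustrates the --task selection and the embedding path mentioned above via the offline API; the model name is illustrative, and "embedding" is the task value used around this release:

      from vllm import LLM

      # Pick the embedding task for a model whose architecture also supports generation.
      llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embedding")

      outputs = llm.encode(["vLLM now exposes a --task flag."])
      vector = outputs[0].outputs.embedding  # list of floats
      print(len(vector))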

Hardware Support

  • Gaudi: Add Intel Gaudi (HPU) inference backend (#6143)
  • CPU: Add embedding models support for CPU backend (#10193)
  • TPU: Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438)
  • Triton: Add a Triton kernel, scaled_mm_triton, to support fp8 and int8 SmoothQuant (symmetric case) (#9857)

Performance

  • Combine chunked prefill with speculative decoding (#9291)
  • fused_moe Performance Improvement (#9384)

Engine Core

  • Override HF config.json via CLI (#5836)
  • Add goodput metric support (#9338)
  • Move parallel sampling out of the vLLM core, paving the way for the V1 engine (#9302)
  • Add stateless process group for easier integration with RLHF and disaggregated prefill (#10216, #10072)

Others

  • Improvements to the pull request experience with DCO, mergify, stale bot, etc. (#9436, #9512, #9513, #9259, #10082, #10285, #9803)
  • Dropped support for Python 3.8 (#10038, #8464)
  • Basic Integration Test For TPU (#9968)
  • Document the class hierarchy in vLLM (#10240) and explain the integration with Hugging Face (#10173).
  • Benchmark throughput now supports image input (#9851)

What's Changed


v0.6.3.post1

17 Oct 17:26
a2c71c5

Highlights

New Models

  • Support Ministral 3B and Ministral 8B via interleaved attention (#9414)
  • Support multiple and interleaved images for Llama3.2 (#9095)
  • Support VLM2Vec, the first multimodal embedding model in vLLM (#9303)

Important bug fix

  • Fix chat API continuous usage stats (#9357)
  • Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
  • Fix Molmo text-only input bug (#9397)
  • Fix CUDA 11.8 Build (#9386)
  • Fix _version.py not found issue (#9375)

Other Enhancements

  • Remove block manager v1 and make block manager v2 default (#8704)
  • Spec Decode: Optimize ngram lookup performance (#9333)
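
A minimal sketch of n-gram speculative decoding, whose lookup path the change above optimizes; the parameter names and values (speculative_model="[ngram]", ngram_prompt_lookup_max) are based on the speculative decoding interface of this era and should be treated as illustrative, as is the model name:

      from vllm import LLM, SamplingParams

      # Draft tokens come from prompt n-gram lookup rather than a separate draft model.
      llm = LLM(
          model="meta-llama/Llama-3.1-8B-Instruct",
          speculative_model="[ngram]",
          num_speculative_tokens=4,
          ngram_prompt_lookup_max=3,
      )
      out = llm.generate(["The quick brown fox"], SamplingParams(max_tokens=32))
      print(out[0].outputs[0].text)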

What's Changed

New Contributors

Full Changelog: v0.6.3...v0.6.3.post1

v0.6.3

14 Oct 20:20
fd47e57

Highlights

Model Support

  • New Models:
  • Expansion in functionality:
    • Add Gemma2 embedding model (#9004)
    • Support input embeddings for qwen2vl (#8856), minicpmv (#9237)
    • LoRA:
      • LoRA support for MiniCPMV2.5 (#7199), MiniCPMV2.6 (#8943)
      • Expand LoRA modules for Mixtral (#9008)
    • Pipeline parallelism support to remaining text and embedding models (#7168, #9090)
    • Expanded bitsandbytes quantization support for Falcon, OPT, Gemma, Gemma2, and Phi (#9148)
    • Tool use:
      • Add support for Llama 3.1 and 3.2 tool use (#8343); see the sketch after this list
      • Support tool calling for InternLM2.5 (#8405)
  • Out of tree support enhancements: Explicit interface for vLLM models and support OOT embedding models (#9108)
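
The tool-use support above can be exercised through the OpenAI-compatible API. A minimal sketch, assuming the server was started with tool parsing enabled (e.g. --enable-auto-tool-choice plus a matching --tool-call-parser); the get_weather tool and model name are illustrative:

      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

      tools = [{
          "type": "function",
          "function": {
              "name": "get_weather",  # hypothetical tool
              "description": "Get the current weather for a city.",
              "parameters": {
                  "type": "object",
                  "properties": {"city": {"type": "string"}},
                  "required": ["city"],
              },
          },
      }]

      resp = client.chat.completions.create(
          model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative
          messages=[{"role": "user", "content": "What is the weather in Paris?"}],
          tools=tools,
      )
      # A recognized tool call is surfaced here instead of plain text.
      print(resp.choices[0].message.tool_calls)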

Documentation

  • New compatibility matrix for mutually exclusive features (#8512)
  • Reorganized the installation doc; note that we publish a per-commit Docker image (#8931)

Hardware Support

  • Cross-attention and Encoder-Decoder models support on x86 CPU backend (#9089)
  • Support AWQ for CPU backend (#7515)
  • Add async output processor for xpu (#8897)
  • Add on-device sampling support for Neuron (#8746)

Architectural Enhancements

  • Progress in vLLM's core refactoring:
    • Spec decode removing batch expansion (#8839, #9298).
    • We have made block manager V2 the default. This is an internal refactoring for cleaner and more tested code path (#8678).
    • Moving beam search from the core to the API level (#9105, #9087, #9117, #8928)
    • Move guided decoding params into sampling params (#8252)
  • Torch Compile:
    • You can now set the environment variable VLLM_TORCH_COMPILE_LEVEL to control the level of torch.compile compilation and integration (#9058). Along with various improvements (#8982, #9258, #906, #8875), setting VLLM_TORCH_COMPILE_LEVEL=3 turns on Inductor's full-graph compilation without vLLM's custom ops; see the sketch below.
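
A minimal sketch of using the higher compilation level for offline inference; the variable must be visible before the engine is constructed, the model name is illustrative, and this integration was still experimental at this release:

      import os

      # Level 3 enables Inductor's full-graph compilation without vLLM's custom ops.
      os.environ["VLLM_TORCH_COMPILE_LEVEL"] = "3"

      from vllm import LLM, SamplingParams

      llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
      print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)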

Others

  • Performance enhancements to turn on multi-step scheduling by default (#8804, #8645, #8378)
  • Enhancements towards priority scheduling (#8965, #8956, #8850)

What's Changed


v0.6.2

25 Sep 21:50
7193774

Highlights

Model Support

  • Support Llama 3.2 models (#8811, #8822); see the sketch after this list for sending an image request

     vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
    
  • Beam search has been soft-deprecated. We are moving towards a version of beam search that is more performant and that also simplifies vLLM's core. (#8684, #8763, #8713)

    • ⚠️ You will now see the following error; this is a breaking change!

      Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the vllm.LLM.use_beam_search method for dedicated beam search instead, or set the environment variable VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 to suppress this error. For more details, see #8306

  • Support for Solar Model (#8386), minicpm3 (#8297), LLaVA-Onevision model support (#8486)

  • Enhancements: pipeline parallelism for Qwen2-VL (#8696), multiple images for Qwen-VL (#8247), Mistral function calling (#8515), bitsandbytes support for Gemma2 (#8338), tensor parallelism with bitsandbytes quantization (#8434)
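
The vision models above can be queried through the OpenAI-compatible chat API with image_url content parts. A minimal sketch against the Llama 3.2 Vision server started with the command above (image URL is hypothetical):

      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

      resp = client.chat.completions.create(
          model="meta-llama/Llama-3.2-11B-Vision-Instruct",
          messages=[{
              "role": "user",
              "content": [
                  {"type": "text", "text": "What is in this image?"},
                  {"type": "image_url",
                   "image_url": {"url": "https://example.com/cat.png"}},  # hypothetical URL
              ],
          }],
          max_tokens=64,
      )
      print(resp.choices[0].message.content)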

Hardware Support

  • TPU: implement multi-step scheduling (#8489), use Ray for default distributed backend (#8389)
  • CPU: Enable mrope and support Qwen2-VL on CPU backend (#8770)
  • AMD: custom paged attention kernel for rocm (#8310), and fp8 kv cache support (#8577)

Production Engine

  • Initial support for priority scheduling (#5958)
  • Support LoRA lineage and base model metadata management (#6315)
  • Batch inference for llm.chat() API (#8648)
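
A minimal sketch of the batched llm.chat() path, assuming chat() accepts a list of conversations as per the batch-inference change above (model name illustrative):

      from vllm import LLM, SamplingParams

      llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

      # Each inner list is one conversation; passing several runs them as a batch.
      conversations = [
          [{"role": "user", "content": "Give me one fact about GPUs."}],
          [{"role": "user", "content": "Give me one fact about TPUs."}],
      ]
      outputs = llm.chat(conversations, SamplingParams(max_tokens=32))
      for out in outputs:
          print(out.outputs[0].text)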

Performance

  • Introduce MQLLMEngine for API Server, boost throughput 30% in single step and 7% in multistep (#8157, #8761, #8584)
  • Multi-step scheduling enhancements
    • Prompt logprobs support in Multi-step (#8199)
    • Add output streaming support to multi-step + async (#8335)
    • Add flashinfer backend (#7928)
  • Add cuda graph support during decoding for encoder-decoder models (#7631)

Others

  • Support sample from HF datasets and image input for benchmark_serving (#8495)
  • Progress in torch.compile integration (#8488, #8480, #8384, #8526, #8445)

What's Changed


v0.6.1.post2

13 Sep 18:35
9ba0817

Highlights

  • This release contains an important bugfix related to token streaming combined with stop string (#8468)

What's Changed

Full Changelog: v0.6.1.post1...v0.6.1.post2

v0.6.1.post1

13 Sep 04:40
acda0b3

Highlights

This release features important bug fixes and enhancements for

  • Pixtral models. (#8415, #8425, #8399, #8431)
    • Chunked scheduling has been turned off for vision models. Please replace --max_num_batched_tokens 16384 with --max-model-len 16384
  • Multistep scheduling. (#8417, #7928, #8427)
  • Tool use. (#8423, #8366)

Also

  • Support multiple images for Qwen-VL (#8247)
  • Remove engine_use_ray (#8126)
  • Add an engine option to return only deltas or final output (#7381)
  • Add bitsandbytes support for Gemma2 (#8338)

What's Changed

New Contributors

Full Changelog: v0.6.1...v0.6.1.post1