Releases: vllm-project/vllm

v0.2.3

03 Dec 20:30
0f90eff

Major changes

  • Refactoring of Worker, InputMetadata, and Attention
  • Fixed tensor parallelism (TP) support for AWQ models
  • Added support for Prometheus metrics (see the sketch after this list)
  • Fixed the Baichuan and Baichuan 2 models

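A minimal sketch of consuming the new metrics, assuming the OpenAI-compatible API server is running locally on port 8000 and exposing the standard Prometheus /metrics endpoint (port and path are typical defaults, not guarantees):

```python
# Scrape vLLM's Prometheus metrics endpoint and print the vLLM-specific
# series, skipping the HELP/TYPE comment lines of the exposition format.
import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
resp.raise_for_status()

for line in resp.text.splitlines():
    if line.startswith("vllm"):
        print(line)
```
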
Full Changelog: v0.2.2...v0.2.3

v0.2.2

19 Nov 05:58
c5f7740

Major changes

  • Bumped up to PyTorch v2.1 + CUDA 12.1 (a vLLM build with CUDA 11.8 is also provided)
  • Extensive refactoring for better tensor parallelism and quantization support
  • New models: Yi, ChatGLM, Phi
  • Scheduler change: moved from a 1D flattened input tensor to a 2D input tensor
  • AWQ support for all models (see the sketch after this list)
  • Added the LogitsProcessor API (also covered in the sketch below)
  • Preliminary support for SqueezeLLM

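A minimal sketch exercising both the broadened AWQ support and the new LogitsProcessor API. The checkpoint name and the suppressed token id are illustrative assumptions; a logits processor is a callable that receives the token ids generated so far plus the next-token logits and returns (possibly modified) logits:

```python
from vllm import LLM, SamplingParams

def suppress_token(token_ids, logits):
    # Illustrative processor: push one (assumed) token id to -inf so it
    # can never be sampled. The id depends on the model's tokenizer.
    logits[13] = -float("inf")
    return logits

# Any AWQ-quantized checkpoint should work; this name is illustrative.
llm = LLM(model="TheBloke/Mistral-7B-v0.1-AWQ", quantization="awq")
params = SamplingParams(temperature=0.8, max_tokens=64,
                        logits_processors=[suppress_token])

for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)
```
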
Full Changelog: v0.2.1...v0.2.2

v0.2.1.post1

17 Oct 16:31

This is an emergency release to fix a bug in tensor parallelism support.

v0.2.1

16 Oct 20:01
651c614

Major Changes

  • PagedAttention V2 kernel: up to 20% end-to-end latency reduction
  • Support for log probabilities on prompt tokens (see the sketch after this list)
  • AWQ support for Mistral 7B

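A minimal sketch of requesting log probabilities for the prompt tokens via SamplingParams(prompt_logprobs=...); the model name is illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# prompt_logprobs=1 asks for the log probability of each prompt token
# (plus the top-1 alternative at each position).
params = SamplingParams(max_tokens=1, prompt_logprobs=1)

output = llm.generate(["Hello, my name is"], params)[0]
# One entry per prompt token; the first is None because the first token
# has no preceding context to condition on.
for token_id, logprobs in zip(output.prompt_token_ids, output.prompt_logprobs):
    print(token_id, logprobs)
```
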
Full Changelog: v0.2.0...v0.2.1

v0.2.0

28 Sep 22:31
e2fb71e

Major changes

  • Up to 60% performance improvement from optimized de-tokenization and sampling
  • Initial support for AWQ (performance not yet optimized)
  • Support for RoPE scaling and LongChat
  • Support for Mistral-7B (see the sketch after this list)
  • Many bug fixes

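A minimal sketch of running the newly supported Mistral-7B through the offline LLM API (the checkpoint name is the standard Hugging Face id; adjust for your setup):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-v0.1")
params = SamplingParams(temperature=0.7, max_tokens=64)

for out in llm.generate(["Explain paged attention in one sentence."], params):
    print(out.outputs[0].text)
```
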
Full Changelog: v0.1.7...v0.2.0

v0.1.7

11 Sep 07:56
90eb3f4

A minor release to fix bugs in ALiBi, Falcon-40B, and Code Llama.

Full Changelog: v0.1.6...v0.1.7

v0.1.6

08 Sep 07:08
1117aa1

Note: this is an emergency release to revert a breaking API change that broke much existing code using AsyncLLMServer.

Full Changelog: v0.1.5...v0.1.6

v0.1.5

07 Sep 23:16
852ef5b

Major Changes

  • Aligned beam search with hf_model.generate (see the sketch after this list).
  • Stabilized AsyncLLMEngine with a background engine loop.
  • Added support for Code Llama.
  • Added many model correctness tests.
  • Many other correctness fixes.

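A minimal sketch of beam search as exposed in this era of vLLM, where use_beam_search was a SamplingParams flag (it was removed in much later releases); best_of sets the beam width, and the model name is illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# Beam search here required best_of > 1 and greedy-style temperature.
params = SamplingParams(use_beam_search=True, best_of=4,
                        temperature=0.0, max_tokens=32)

print(llm.generate(["The meaning of life is"], params)[0].outputs[0].text)
```
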
Full Changelog: v0.1.4...v0.1.5

vLLM v0.1.4

25 Aug 03:31
791d79d

Major changes

  • From now on, vLLM is published with pre-built CUDA binaries, so users no longer have to compile vLLM's CUDA kernels on their machines.
  • New models: InternLM, Qwen, Aquila.
  • Optimized CUDA kernels for paged attention and GELU.
  • Many bug fixes.

Full Changelog: v0.1.3...v0.1.4

vLLM v0.1.3

02 Aug 23:56
aa84c92

Major changes

  • More model support: LLaMA 2, Falcon, GPT-J, Baichuan, etc.
  • Efficient support for MQA and GQA.
  • Scheduling change: vLLM now uses TGI-style continuous batching.
  • Many bug fixes.

Full Changelog: v0.1.2...v0.1.3