Releases · vllm-project/vllm
v0.2.3
Major changes
- Refactoring on Worker, InputMetadata, and Attention
- Fix TP support for AWQ models
- Support Prometheus metrics (see the sketch after this list)
- Fix Baichuan & Baichuan 2
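With Prometheus metrics support (#1890), the OpenAI-compatible API server exposes production metrics for scraping. A minimal sketch, assuming the server is running on the default port 8000 and serves metrics at the conventional `/metrics` path (both are assumptions, not confirmed here):

```python
# Hypothetical scrape of the metrics endpoint; port 8000 and the /metrics path
# follow Prometheus conventions and are assumptions for this sketch.
import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
resp.raise_for_status()
# Prometheus exposition format: "metric_name{labels} value", one sample per line.
for line in resp.text.splitlines():
    if line and not line.startswith("#"):
        print(line)
```

A Prometheus server would be pointed at the same endpoint via its scrape configuration.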
What's Changed
- Add instructions to install vllm+cu118 by @WoosukKwon in #1717
- Documentation about official docker image by @simon-mo in #1709
- Fix the code block's format in deploying_with_docker page by @HermitSun in #1722
- Migrate linter from `pylint` to `ruff` by @simon-mo in #1665
- [FIX] Update the doc link in README.md by @zhuohan123 in #1730
- [BugFix] Fix a bug in loading safetensors by @WoosukKwon in #1732
- Fix hanging in the scheduler caused by long prompts by @chenxu2048 in #1534
- [Fix] Fix bugs in scheduler by @linotfan in #1727
- Rewrite torch.repeat_interleave to remove cpu synchronization by @beginlner in #1599
- fix RAM OOM when load large models in tensor parallel mode. by @boydfd in #1395
- [BugFix] Fix TP support for AWQ by @WoosukKwon in #1731
- [FIX] Fix the case when `input_is_parallel=False` for `ScaledActivation` by @zhuohan123 in #1737
- Add stop_token_ids in SamplingParams.repr by @chenxu2048 in #1745
- [DOCS] Add engine args documentation by @casper-hansen in #1741
- Set top_p=0 and top_k=-1 in greedy sampling by @beginlner in #1748
- Fix repetition penalty aligned with huggingface by @beginlner in #1577
- [build] Avoid building too many extensions by @ymwangg in #1624
- [Minor] Fix model docstrings by @WoosukKwon in #1764
- Added echo function to OpenAI API server. by @wanmok in #1504
- Init model on GPU to reduce CPU memory footprint by @beginlner in #1796
- Correct comments in parallel_state.py by @explainerauthors in #1818
- Fix OPT weight loading by @WoosukKwon in #1819
- [FIX] Fix class naming by @zhuohan123 in #1803
- Move the definition of BlockTable a few lines above so we could use it in BlockAllocator by @explainerauthors in #1791
- [FIX] Fix formatting error in main branch by @zhuohan123 in #1822
- [Fix] Fix RoPE in ChatGLM-32K by @WoosukKwon in #1841
- Better integration with Ray Serve by @FlorianJoncour in #1821
- Refactor Attention by @WoosukKwon in #1840
- [Docs] Add information about using shared memory in docker by @simon-mo in #1845
- Disable Logs Requests should Disable Logging of requests. by @MichaelMcCulloch in #1779
- Refactor worker & InputMetadata by @WoosukKwon in #1843
- Avoid multiple instantiations of the RoPE class by @jeejeeli in #1828
- [FIX] Fix docker build error (#1831) by @allenhaozi in #1832
- Add profile option to latency benchmark by @WoosukKwon in #1839
- Remove `max_num_seqs` in latency benchmark by @WoosukKwon in #1855
- Support max-model-len argument for throughput benchmark by @aisensiy in #1858
- Fix rope cache key error by @esmeetu in #1867
- docs: add instructions for Langchain by @mspronesti in #1162
- Support chat template and `echo` for chat API by @Tostino in #1756
- Fix Baichuan tokenizer error by @WoosukKwon in #1874
- Add weight normalization for Baichuan 2 by @WoosukKwon in #1876
- Fix the typo in SamplingParams' docstring. by @xukp20 in #1886
- [Docs] Update the AWQ documentation to highlight performance issue by @simon-mo in #1883
- Fix the broken sampler tests by @WoosukKwon in #1896
- Add Production Metrics in Prometheus format by @simon-mo in #1890
- Add PyTorch-native implementation of custom layers by @WoosukKwon in #1898
- Fix broken worker test by @WoosukKwon in #1900
- chore(examples-docs): upgrade to OpenAI V1 by @mspronesti in #1785
- Fix num_gpus when TP > 1 by @WoosukKwon in #1852
- Bump up to v0.2.3 by @WoosukKwon in #1903
New Contributors
- @boydfd made their first contribution in #1395
- @explainerauthors made their first contribution in #1818
- @FlorianJoncour made their first contribution in #1821
- @MichaelMcCulloch made their first contribution in #1779
- @jeejeeli made their first contribution in #1828
- @allenhaozi made their first contribution in #1832
- @aisensiy made their first contribution in #1858
- @xukp20 made their first contribution in #1886
Full Changelog: v0.2.2...v0.2.3
v0.2.2
Major changes
- Bump up to PyTorch v2.1 + CUDA 12.1 (vLLM+CUDA 11.8 is also provided)
- Extensive refactoring for better tensor parallelism & quantization support
- New models: Yi, ChatGLM, Phi
- Changes in scheduler: from 1D flattened input tensor to 2D tensor
- AWQ support for all models
- Added LogitsProcessor API (see the sketch after this list)
- Preliminary support for SqueezeLLM
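The LogitsProcessor API (#1469) lets callers hook into sampling by transforming the logits before each sampling step. A minimal sketch, assuming a processor is a callable taking the generated token ids and the next-token logits and returning modified logits, passed via `SamplingParams(logits_processors=...)`; check the `SamplingParams` docstring for the exact signature in your version:

```python
from vllm import LLM, SamplingParams

def ban_token(banned_id: int):
    # Assumed processor signature: (token_ids_so_far, logits) -> logits.
    def processor(token_ids, logits):
        logits[banned_id] = float("-inf")  # make the banned token unsampleable
        return logits
    return processor

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=32, logits_processors=[ban_token(42)])
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```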
What's Changed
- Change scheduler & input tensor shape by @WoosukKwon in #1381
- Add Mistral 7B to `test_models` by @WoosukKwon in #1366
- fix typo by @WrRan in #1383
- Fix TP bug by @WoosukKwon in #1389
- Fix type hints by @lxrite in #1427
- remove useless statements by @WrRan in #1408
- Pin dependency versions by @thiagosalvatore in #1429
- SqueezeLLM Support by @chooper1 in #1326
- aquila model add rope_scaling by @Sanster in #1457
- fix: don't skip first special token. by @gesanqiu in #1497
- Support repetition_penalty by @beginlner in #1424
- Fix bias in InternLM by @WoosukKwon in #1501
- Delay GPU->CPU sync in sampling by @Yard1 in #1337
- Refactor LLMEngine demo script for clarity and modularity by @iongpt in #1413
- Fix logging issues by @Tostino in #1494
- Add py.typed so consumers of vLLM can get type checking by @jroesch in #1509
- vLLM always places spaces between special tokens by @blahblahasdf in #1373
- [Fix] Fix duplicated logging messages by @zhuohan123 in #1524
- Add dockerfile by @skrider in #1350
- Fix integer overflows in attention & cache ops by @WoosukKwon in #1514
- [Small] Formatter only checks lints in changed files by @cadedaniel in #1528
- Add `MptForCausalLM` key in model_loader by @wenfeiy-db in #1526
- [BugFix] Fix a bug when engine_use_ray=True and worker_use_ray=False and TP>1 by @beginlner in #1531
- Adding a health endpoint by @Fluder-Paradyne in #1540
- Remove `MPTConfig` by @WoosukKwon in #1529
- Force paged attention v2 for long contexts by @Yard1 in #1510
- docs: add description by @lots-o in #1553
- Added logits processor API to sampling params by @noamgat in #1469
- YaRN support implementation by @Yard1 in #1264
- Add Quantization and AutoAWQ to docs by @casper-hansen in #1235
- Support Yi model by @esmeetu in #1567
- ChatGLM2 Support by @GoHomeToMacDonal in #1261
- Upgrade to CUDA 12 by @zhuohan123 in #1527
- [Worker] Fix input_metadata.selected_token_indices in worker by @ymwangg in #1546
- Build CUDA11.8 wheels for release by @WoosukKwon in #1596
- Add Yi model to quantization support by @forpanyang in #1600
- Dockerfile: Upgrade Cuda to 12.1 by @GhaziSyed in #1609
- config parser: add ChatGLM2 seq_length to `_get_and_verify_max_len` by @irasin in #1617
- Fix cpu heavy code in async function _AsyncLLMEngine._run_workers_async by @dominik-schwabe in #1628
- Fix #1474 - gptj AssertionError : assert param_slice.shape == loaded_weight.shape by @lihuahua123 in #1631
- [Minor] Move RoPE selection logic to `get_rope` by @WoosukKwon in #1633
- Add DeepSpeed MII backend to benchmark script by @WoosukKwon in #1649
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models by @zhuohan123 in #1622
- Remove `MptConfig` by @megha95 in #1668
- feat(config): support parsing torch.dtype by @aarnphm in #1641
- Fix loading error when safetensors contains empty tensor by @twaka in #1687
- [Minor] Fix duplication of ignored seq group in engine step by @simon-mo in #1666
- [models] Microsoft Phi 1.5 by @maximzubkov in #1664
- [Fix] Update Supported Models List by @zhuohan123 in #1690
- Return usage for openai requests by @ichernev in #1663
- [Fix] Fix comm test by @zhuohan123 in #1691
- Update the adding-model doc according to the new refactor by @zhuohan123 in #1692
- Add 'not' to this annotation: "#FIXME(woosuk): Do not use internal method" by @linotfan in #1704
- Support Min P Sampler by @esmeetu in #1642
- Read quantization_config in hf config by @WoosukKwon in #1695
- Support download models from www.modelscope.cn by @liuyhwangyh in #1588
- follow up of #1687 when safetensors model contains 0-rank tensors by @twaka in #1696
- Add AWQ support for all models by @WoosukKwon in #1714
- Support fused add rmsnorm for LLaMA by @beginlner in #1667
- [Fix] Fix warning msg on quantization by @WoosukKwon in #1715
- Bump up the version to v0.2.2 by @WoosukKwon in #1689
New Contributors
- @lxrite made their first contribution in #1427
- @thiagosalvatore made their first contribution in #1429
- @chooper1 made their first contribution in #1326
- @beginlner made their first contribution in #1424
- @iongpt made their first contribution in #1413
- @Tostino made their first contribution in #1494
- @jroesch made their first contribution in #1509
- @skrider made their first contribution in #1350
- @cadedaniel made their first contribution in #1528
- @wenfeiy-db made their first contribution in #1526
- @Fluder-Paradyne made their first contribution in #1540
- @lots-o made their first contribution in #1553
- @noamgat made their first contribution in #1469
- @casper-hansen made their first contribution in #1235
- @GoHomeToMacDonal made their first contribution in #1261
- @ymwangg made their first contribution in #1546
- @forpanyang made their first contribution in #1600
- @GhaziSyed made their first contribution in #1609
- @irasin made their first contribution in #1617
- @dominik-schwabe made their first contribution in #1628
- @lihuahua123 made their first contribution in #1631
- @megha95 made their first contribution in #1668
- @aarnphm made their first contribution in #1641
- @simon-mo made their first contribution in #1666
- @maximzubkov made their first contribution in #1664
- @ichernev made their first contribution in #1663
- @linotfan made their first contribution in #1704
- @liuyhwangyh made their first contribution in #1588
Full Changelog: v0.2.1...v0.2.2
v0.2.1.post1
This is an emergency release to fix a bug in tensor parallelism support.
v0.2.1
Major Changes
- PagedAttention V2 kernel: Up to 20% end-to-end latency reduction
- Support log probabilities for prompt tokens (see the sketch after this list)
- AWQ support for Mistral 7B
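Prompt logprobs (#1328) return per-token log probabilities for the prompt itself, not just the completion. A minimal sketch, assuming the field is named `prompt_logprobs` on `SamplingParams` and that results are attached to the request output; verify the names against this version's docstrings:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# prompt_logprobs=1: return the top-1 logprob for each prompt token (assumed semantics).
params = SamplingParams(max_tokens=1, logprobs=1, prompt_logprobs=1)
output = llm.generate(["vLLM is a fast inference engine."], params)[0]
print(output.prompt_logprobs)
```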
What's Changed
- fixing typo in `tiiuae/falcon-rw-7b` model name by @0ssamaak0 in #1226
- Added `dtype` arg to benchmarks by @kg6-sleipnir in #1228
- fix vulnerable memory modification to gpu shared memory by @soundOfDestiny in #1241
- support sharding llama2-70b on more than 8 GPUs by @zhuohan123 in #1209
- [Minor] Fix type annotations by @WoosukKwon in #1238
- TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic by @zhuohan123 in #1181
- add support for tokenizer revision by @cassanof in #1163
- Use monotonic time where appropriate by @Yard1 in #1249
- API server support ipv4 / ipv6 dualstack by @yunfeng-scale in #1288
- Move bfloat16 check to worker by @Yard1 in #1259
- [FIX] Explain why the finished_reason of ignored sequences are length by @zhuohan123 in #1289
- Update README.md by @zhuohan123 in #1292
- [Minor] Fix comment in mistral.py by @zhuohan123 in #1303
- lock torch version to 2.0.1 when build for #1283 by @yanxiyue in #1290
- minor update by @WrRan in #1311
- change the timing of sorting logits by @yhlskt23 in #1309
- workaround of AWQ for Turing GPUs by @twaka in #1252
- Fix overflow in awq kernel by @chu-tianxiang in #1295
- Update model_loader.py by @AmaleshV in #1278
- Add blacklist for model checkpoint by @WoosukKwon in #1325
- Update README.md Aquila2. by @ftgreat in #1331
- Improve detokenization performance by @Yard1 in #1338
- Bump up transformers version & Remove MistralConfig by @WoosukKwon in #1254
- Fix the issue for AquilaChat2-* models by @lu-wang-dl in #1339
- Fix error message on `TORCH_CUDA_ARCH_LIST` by @WoosukKwon in #1239
- Minor fix on AWQ kernel launch by @WoosukKwon in #1356
- Implement PagedAttention V2 by @WoosukKwon in #1348
- Implement prompt logprobs & Batched topk for computing logprobs by @zhuohan123 in #1328
- Fix PyTorch version to 2.0.1 in workflow by @WoosukKwon in #1377
- Fix PyTorch index URL in workflow by @WoosukKwon in #1378
- Fix sampler test by @WoosukKwon in #1379
- Bump up the version to v0.2.1 by @zhuohan123 in #1355
New Contributors
- @0ssamaak0 made their first contribution in #1226
- @kg6-sleipnir made their first contribution in #1228
- @soundOfDestiny made their first contribution in #1241
- @cassanof made their first contribution in #1163
- @yunfeng-scale made their first contribution in #1288
- @yanxiyue made their first contribution in #1290
- @yhlskt23 made their first contribution in #1309
- @chu-tianxiang made their first contribution in #1295
- @AmaleshV made their first contribution in #1278
- @lu-wang-dl made their first contribution in #1339
Full Changelog: v0.2.0...v0.2.1
v0.2.0
Major changes
- Up to 60% performance improvement by optimizing de-tokenization and sampler
- Initial support for AWQ (performance not optimized; see the sketch after this list)
- Support for RoPE scaling and LongChat
- Support for Mistral-7B
- Many bug fixes
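AWQ support is initial and not yet performance-optimized, but a quantized checkpoint can already be loaded for offline inference. A minimal sketch, assuming the `quantization="awq"` argument on `LLM`; the model name below is only an example of an AWQ checkpoint on the Hugging Face Hub, so substitute one that matches an architecture with AWQ support in this release (LLaMA-family at the time):

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint name; replace with the AWQ-quantized model you actually use.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
params = SamplingParams(temperature=0.8, max_tokens=64)
print(llm.generate(["Explain paged attention in one sentence."], params)[0].outputs[0].text)
```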
What's Changed
- add option to shorten prompt print in log by @leiwen83 in #991
- Make `max_model_len` configurable by @Yard1 in #972
- Fix typo in README.md by @eltociear in #1033
- Use TGI-like incremental detokenization by @Yard1 in #984
- Add Model Revision Support in #1014
- [FIX] Minor bug fixes by @zhuohan123 in #1035
- Announce paper release by @WoosukKwon in #1036
- Fix detokenization leaving special tokens by @Yard1 in #1044
- Add pandas to requirements.txt by @WoosukKwon in #1047
- OpenAI-Server: Only fail if logit_bias has actual values by @LLukas22 in #1045
- Fix warning message on LLaMA FastTokenizer by @WoosukKwon in #1037
- Abort when coroutine is cancelled by @rucyang in #1020
- Implement AWQ quantization support for LLaMA by @WoosukKwon in #1032
- Remove AsyncLLMEngine busy loop, shield background task by @Yard1 in #1059
- Fix hanging when prompt exceeds limit by @chenxu2048 in #1029
- [FIX] Don't initialize parameter by default by @zhuohan123 in #1067
- added support for quantize on LLM module by @orellavie1212 in #1080
- align llm_engine and async_engine step method. by @esmeetu in #1081
- Fix get_max_num_running_seqs for waiting and swapped seq groups by @zhuohan123 in #1068
- Add safetensors support for quantized models by @WoosukKwon in #1073
- Add minimum capability requirement for AWQ by @WoosukKwon in #1064
- [Community] Add vLLM Discord server by @zhuohan123 in #1086
- Add pyarrow to dependencies & Print warning on Ray import error by @WoosukKwon in #1094
- Add gpu_memory_utilization and swap_space to LLM by @WoosukKwon in #1090
- Add documentation to Triton server tutorial by @tanmayv25 in #983
- rope_theta and max_position_embeddings from config by @Yard1 in #1096
- Replace torch.cuda.DtypeTensor with torch.tensor by @WoosukKwon in #1123
- Add float16 and float32 to dtype choices by @WoosukKwon in #1115
- clean api code, remove redundant background task. by @esmeetu in #1102
- feat: support stop_token_ids parameter. by @gesanqiu in #1097
- Use `--ipc=host` in `docker run` for distributed inference by @WoosukKwon in #1125
- Docs: Fix broken link to openai example by @nkpz in #1145
- Announce the First vLLM Meetup by @WoosukKwon in #1148
- [Sampler] Vectorized sampling (simplified) by @zhuohan123 in #1048
- [FIX] Simplify sampler logic by @zhuohan123 in #1156
- Fix config for Falcon by @WoosukKwon in #1164
- Align `max_tokens` behavior with openai by @HermitSun in #852
- [Setup] Enable `TORCH_CUDA_ARCH_LIST` for selecting target GPUs by @WoosukKwon in #1074
- Add comments on RoPE initialization by @WoosukKwon in #1176
- Allocate more shared memory to attention kernel by @Yard1 in #1154
- Support Longchat by @LiuXiaoxuanPKU in #555
- fix typo (?) by @WrRan in #1184
- fix qwen-14b model by @Sanster in #1173
- Automatically set `max_num_batched_tokens` by @WoosukKwon in #1198
- Use standard extras for `uvicorn` by @danilopeixoto in #1166
- Keep special sampling params by @blahblahasdf in #1186
- qwen add rope_scaling by @Sanster in #1210
- [Mistral] Mistral-7B-v0.1 support by @Bam4d in #1196
- Fix Mistral model by @WoosukKwon in #1220
- [Fix] Remove false assertion by @WoosukKwon in #1222
- Add Mistral to supported model list by @WoosukKwon in #1221
- Fix OOM in attention kernel test by @WoosukKwon in #1223
- Provide default max model length by @WoosukKwon in #1224
- Bump up the version to v0.2.0 by @WoosukKwon in #1212
New Contributors
- @leiwen83 made their first contribution in #991
- @LLukas22 made their first contribution in #1045
- @rucyang made their first contribution in #1020
- @chenxu2048 made their first contribution in #1029
- @orellavie1212 made their first contribution in #1080
- @tanmayv25 made their first contribution in #983
- @nkpz made their first contribution in #1145
- @WrRan made their first contribution in #1184
- @danilopeixoto made their first contribution in #1166
- @blahblahasdf made their first contribution in #1186
- @Bam4d made their first contribution in #1196
Full Changelog: v0.1.7...v0.2.0
v0.1.7
A minor release to fix the bugs in ALiBi, Falcon-40B, and Code Llama.
What's Changed
- fix "tansformers_module" ModuleNotFoundError when load model with
trust_remote_code=True
by @Jingru in #871 - Fix wrong dtype in PagedAttentionWithALiBi bias by @Yard1 in #996
- fix: CUDA error when inferencing with Falcon-40B base model by @kyujin-cho in #992
- [Docs] Update installation page by @WoosukKwon in #1005
- Update setup.py by @WoosukKwon in #1006
- Use FP32 in RoPE initialization by @WoosukKwon in #1004
- Bump up the version to v0.1.7 by @WoosukKwon in #1013
New Contributors
- @Jingru made their first contribution in #871
- @kyujin-cho made their first contribution in #992
Full Changelog: v0.1.6...v0.1.7
v0.1.6
Note: This is an emergency release to revert a breaking API change that would break much existing code using AsyncLLMServer.
What's Changed
- faster startup of vLLM by @ri938 in #982
- Start background task in `AsyncLLMEngine.generate` by @Yard1 in #988
- Bump up the version to v0.1.6 by @zhuohan123 in #989
New Contributors
Full Changelog: v0.1.5...v0.1.6
v0.1.5
Major Changes
- Align beam search with `hf_model.generate` (see the sketch after this list).
- Stabilize AsyncLLMEngine with a background engine loop.
- Add support for CodeLLaMA.
- Add many model correctness tests.
- Many other correctness fixes.
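Beam search results are now aligned with `hf_model.generate`. A minimal sketch of invoking it, assuming the `use_beam_search` and `best_of` fields on `SamplingParams` (with `best_of` acting as the beam width) and that beam search requires temperature 0; check this version's parameter validation for the exact constraints:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# best_of is assumed to act as the beam width when use_beam_search=True.
params = SamplingParams(use_beam_search=True, best_of=4, temperature=0.0, max_tokens=32)
print(llm.generate(["Once upon a time"], params)[0].outputs[0].text)
```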
What's Changed
- Add support for CodeLlama by @Yard1 in #854
- [Fix] Fix a condition for ignored sequences by @zhuohan123 in #867
- use flash-attn via xformers by @tmm1 in #877
- Enable request body OpenAPI spec for OpenAI endpoints by @Peilun-Li in #865
- Accelerate LLaMA model loading by @JF-D in #234
- Improve _prune_hidden_states micro-benchmark by @tmm1 in #707
- fix: bug fix when penalties are negative by @pfldy2850 in #913
- [Docs] Minor fixes in supported models by @WoosukKwon in #920
- Fix README.md Link by @zhuohan123 in #927
- Add tests for models by @WoosukKwon in #922
- Avoid compiling kernels for double data type by @WoosukKwon in #933
- [BugFix] Fix NaN errors in paged attention kernel by @WoosukKwon in #936
- Refactor AsyncLLMEngine by @Yard1 in #880
- Only emit warning about internal tokenizer if it isn't being used by @nelson-liu in #939
- Align vLLM's beam search implementation with HF generate by @zhuohan123 in #857
- Initialize AsyncLLMEngine bg loop correctly by @Yard1 in #943
- FIx vLLM cannot launch by @HermitSun in #948
- Clean up kernel unit tests by @WoosukKwon in #938
- Use queue for finished requests by @Yard1 in #957
- [BugFix] Implement RoPE for GPT-J by @WoosukKwon in #941
- Set torch default dtype in a context manager by @Yard1 in #971
- Bump up transformers version in requirements.txt by @WoosukKwon in #976
- Make `AsyncLLMEngine` more robust & fix batched abort by @Yard1 in #969
- Enable safetensors loading for all models by @zhuohan123 in #974
- [FIX] Fix Alibi implementation in PagedAttention kernel by @zhuohan123 in #945
- Bump up the version to v0.1.5 by @WoosukKwon in #944
New Contributors
- @tmm1 made their first contribution in #877
- @Peilun-Li made their first contribution in #865
- @JF-D made their first contribution in #234
- @pfldy2850 made their first contribution in #913
- @nelson-liu made their first contribution in #939
Full Changelog: v0.1.4...v0.1.5
vLLM v0.1.4
Major changes
- From now on, vLLM is published with pre-built CUDA binaries. Users no longer have to compile vLLM's CUDA kernels on their machine.
- New models: InternLM, Qwen, Aquila.
- Optimized CUDA kernels for paged attention and GELU.
- Many bug fixes.
What's Changed
- Fix gibberish outputs of GPT-BigCode-based models by @HermitSun in #676
- [OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel by @naed90 in #420
- add QWen-7b support by @Sanster in #685
- add internlm model by @gqjia in #528
- Check the max prompt length for the OpenAI completions API by @nicobasile in #472
- [Fix] unwantted bias in InternLM Model by @wangruohui in #740
- Supports tokens and arrays of tokens as inputs to the OpenAI completion API by @wanmok in #715
- Fix baichuan doc style by @UranusSeven in #748
- Fix typo in tokenizer.py by @eltociear in #750
- Align with huggingface Top K sampling by @Abraham-Xu in #753
- explicitly del state by @cauyxy in #784
- Fix typo in sampling_params.py by @wangcx18 in #788
- [Feature | CI] Added a github action to build wheels by @Danielkinz in #746
- set default compute capability according to cuda version by @zxdvd in #773
- Fix mqa is false case in gpt_bigcode by @zhaoyang-star in #806
- Add support for aquila by @shunxing1234 in #663
- Update Supported Model List by @zhuohan123 in #825
- Fix 'GPTBigCodeForCausalLM' object has no attribute 'tensor_model_parallel_world_size' by @HermitSun in #827
- Add compute capability 8.9 to default targets by @WoosukKwon in #829
- Implement approximate GELU kernels by @WoosukKwon in #828
- Fix typo of Aquila in README.md by @ftgreat in #836
- Fix for breaking changes in xformers 0.0.21 by @WoosukKwon in #834
- Clean up code by @wenjun93 in #844
- Set replacement=True in torch.multinomial by @WoosukKwon in #858
- Bump up the version to v0.1.4 by @WoosukKwon in #846
New Contributors
- @naed90 made their first contribution in #420
- @gqjia made their first contribution in #528
- @nicobasile made their first contribution in #472
- @wanmok made their first contribution in #715
- @UranusSeven made their first contribution in #748
- @eltociear made their first contribution in #750
- @Abraham-Xu made their first contribution in #753
- @cauyxy made their first contribution in #784
- @wangcx18 made their first contribution in #788
- @Danielkinz made their first contribution in #746
- @zhaoyang-star made their first contribution in #806
- @shunxing1234 made their first contribution in #663
- @ftgreat made their first contribution in #836
- @wenjun93 made their first contribution in #844
Full Changelog: v0.1.3...v0.1.4
vLLM v0.1.3
What's Changed
Major changes
- More model support: LLaMA 2, Falcon, GPT-J, Baichuan, etc.
- Efficient support for MQA and GQA.
- Changes in the scheduling algorithm: vLLM now uses a TGI-style continuous batching.
- And many bug fixes.
All changes
- fix: only response [DONE] once when streaming response. by @gesanqiu in #378
- [Fix] Change /generate response-type to json for non-streaming by @nicolasf in #374
- Add trust-remote-code flag to handle remote tokenizers by @codethazine in #364
- avoid python list copy in sequence initialization by @LiuXiaoxuanPKU in #401
- [Fix] Sort LLM outputs by request ID before return by @WoosukKwon in #402
- Add trust_remote_code arg to get_config by @WoosukKwon in #405
- Don't try to load training_args.bin by @lpfhs in #373
- [Model] Add support for GPT-J by @AndreSlavescu in #226
- fix: freeze pydantic to v1 by @kemingy in #429
- Fix handling of special tokens in decoding. by @xcnick in #418
- add vocab padding for LLama(Support WizardLM) by @esmeetu in #411
- Fix the `KeyError` when loading bloom-based models by @HermitSun in #441
- Optimize MQA Kernel by @zhuohan123 in #452
- Offload port selection to OS by @zhangir-azerbayev in #467
- [Doc] Add doc for running vLLM on the cloud by @Michaelvll in #426
- [Fix] Fix the condition of max_seq_len by @zhuohan123 in #477
- Add support for baichuan by @codethazine in #365
- fix max seq len by @LiuXiaoxuanPKU in #489
- Fixed old name reference for max_seq_len by @MoeedDar in #498
- hotfix attn alibi wo head mapping by @Oliver-ss in #496
- fix(ray_utils): ignore re-init error by @mspronesti in #465
- Support `trust_remote_code` in benchmark by @wangruohui in #518
- fix: enable trust-remote-code in api server & benchmark. by @gesanqiu in #509
- Ray placement group support by @Yard1 in #397
- Fix bad assert in initialize_cluster if PG already exists by @Yard1 in #526
- Add support for LLaMA-2 by @zhuohan123 in #505
- GPTJConfig has no attribute rotary. by @leegohi04517 in #532
- [Fix] Fix GPTBigcoder for distributed execution by @zhuohan123 in #503
- Fix paged attention testing. by @shanshanpt in #495
- fixed tensor parallel is not defined by @MoeedDar in #564
- Add Baichuan-7B to README by @zhuohan123 in #494
- [Fix] Add chat completion Example and simplify dependencies by @zhuohan123 in #576
- [Fix] Add model sequence length into model config by @zhuohan123 in #575
- [Fix] fix import error of RayWorker (#604) by @zxdvd in #605
- fix ModuleNotFoundError by @mklf in #599
- [Doc] Change old max_seq_len to max_model_len in docs by @SiriusNEO in #622
- fix biachuan-7b tp by @Sanster in #598
- [Model] support baichuan-13b based on baichuan-7b by @Oliver-ss in #643
- Fix log message in scheduler by @LiuXiaoxuanPKU in #652
- Add Falcon support (new) by @zhuohan123 in #592
- [BUG FIX] upgrade fschat version to 0.2.23 by @YHPeter in #650
- Refactor scheduler by @WoosukKwon in #658
- [Doc] Add Baichuan 13B to supported models by @zhuohan123 in #656
- Bump up version to 0.1.3 by @zhuohan123 in #657
New Contributors
- @nicolasf made their first contribution in #374
- @codethazine made their first contribution in #364
- @lpfhs made their first contribution in #373
- @AndreSlavescu made their first contribution in #226
- @kemingy made their first contribution in #429
- @xcnick made their first contribution in #418
- @esmeetu made their first contribution in #411
- @HermitSun made their first contribution in #441
- @zhangir-azerbayev made their first contribution in #467
- @MoeedDar made their first contribution in #498
- @Oliver-ss made their first contribution in #496
- @mspronesti made their first contribution in #465
- @wangruohui made their first contribution in #518
- @Yard1 made their first contribution in #397
- @leegohi04517 made their first contribution in #532
- @shanshanpt made their first contribution in #495
- @zxdvd made their first contribution in #605
- @mklf made their first contribution in #599
- @SiriusNEO made their first contribution in #622
- @Sanster made their first contribution in #598
- @YHPeter made their first contribution in #650
Full Changelog: v0.1.2...v0.1.3