Releases · vllm-project/vllm
v0.2.3
Major changes
- Refactoring on Worker, InputMetadata, and Attention
- Fix TP support for AWQ models
- Support Prometheus metrics (see the sketch after this list)
- Fix Baichuan & Baichuan 2
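With Prometheus metrics support (#1890), the OpenAI-compatible API server exposes production metrics for scraping. A minimal sketch, assuming the server is running on the default port 8000 and serves metrics at the conventional `/metrics` path (both are assumptions, not confirmed here):

```python
# Hypothetical scrape of the metrics endpoint; port 8000 and the /metrics path
# follow Prometheus conventions and are assumptions for this sketch.
import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
resp.raise_for_status()
# Prometheus exposition format: "metric_name{labels} value", one sample per line.
for line in resp.text.splitlines():
    if line and not line.startswith("#"):
        print(line)
```

A Prometheus server would be pointed at the same endpoint via its scrape configuration.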
What's Changed
- Add instructions to install vllm+cu118 by @WoosukKwon in #1717
- Documentation about official docker image by @simon-mo in #1709
- Fix the code block's format in deploying_with_docker page by @HermitSun in #1722
- Migrate linter from `pylint` to `ruff` by @simon-mo in #1665
- [FIX] Update the doc link in README.md by @zhuohan123 in #1730
- [BugFix] Fix a bug in loading safetensors by @WoosukKwon in #1732
- Fix hanging in the scheduler caused by long prompts by @chenxu2048 in #1534
- [Fix] Fix bugs in scheduler by @linotfan in #1727
- Rewrite torch.repeat_interleave to remove cpu synchronization by @beginlner in #1599
- fix RAM OOM when load large models in tensor parallel mode. by @boydfd in #1395
- [BugFix] Fix TP support for AWQ by @WoosukKwon in #1731
- [FIX] Fix the case when `input_is_parallel=False` for `ScaledActivation` by @zhuohan123 in #1737
- Add stop_token_ids in SamplingParams.repr by @chenxu2048 in #1745
- [DOCS] Add engine args documentation by @casper-hansen in #1741
- Set top_p=0 and top_k=-1 in greedy sampling by @beginlner in #1748
- Fix repetition penalty aligned with huggingface by @beginlner in #1577
- [build] Avoid building too many extensions by @ymwangg in #1624
- [Minor] Fix model docstrings by @WoosukKwon in #1764
- Added echo function to OpenAI API server. by @wanmok in #1504
- Init model on GPU to reduce CPU memory footprint by @beginlner in #1796
- Correct comments in parallel_state.py by @explainerauthors in #1818
- Fix OPT weight loading by @WoosukKwon in #1819
- [FIX] Fix class naming by @zhuohan123 in #1803
- Move the definition of BlockTable a few lines above so we could use it in BlockAllocator by @explainerauthors in #1791
- [FIX] Fix formatting error in main branch by @zhuohan123 in #1822
- [Fix] Fix RoPE in ChatGLM-32K by @WoosukKwon in #1841
- Better integration with Ray Serve by @FlorianJoncour in #1821
- Refactor Attention by @WoosukKwon in #1840
- [Docs] Add information about using shared memory in docker by @simon-mo in #1845
- Disable Logs Requests should Disable Logging of requests. by @MichaelMcCulloch in #1779
- Refactor worker & InputMetadata by @WoosukKwon in #1843
- Avoid multiple instantiations of the RoPE class by @jeejeeli in #1828
- [FIX] Fix docker build error (#1831) by @allenhaozi in #1832
- Add profile option to latency benchmark by @WoosukKwon in #1839
- Remove `max_num_seqs` in latency benchmark by @WoosukKwon in #1855
- Support max-model-len argument for throughput benchmark by @aisensiy in #1858
- Fix rope cache key error by @esmeetu in #1867
- docs: add instructions for Langchain by @mspronesti in #1162
- Support chat template and `echo` for chat API by @Tostino in #1756
- Fix Baichuan tokenizer error by @WoosukKwon in #1874
- Add weight normalization for Baichuan 2 by @WoosukKwon in #1876
- Fix the typo in SamplingParams' docstring. by @xukp20 in #1886
- [Docs] Update the AWQ documentation to highlight performance issue by @simon-mo in #1883
- Fix the broken sampler tests by @WoosukKwon in #1896
- Add Production Metrics in Prometheus format by @simon-mo in #1890
- Add PyTorch-native implementation of custom layers by @WoosukKwon in #1898
- Fix broken worker test by @WoosukKwon in #1900
- chore(examples-docs): upgrade to OpenAI V1 by @mspronesti in #1785
- Fix num_gpus when TP > 1 by @WoosukKwon in #1852
- Bump up to v0.2.3 by @WoosukKwon in #1903
New Contributors
- @boydfd made their first contribution in #1395
- @explainerauthors made their first contribution in #1818
- @FlorianJoncour made their first contribution in #1821
- @MichaelMcCulloch made their first contribution in #1779
- @jeejeeli made their first contribution in #1828
- @allenhaozi made their first contribution in #1832
- @aisensiy made their first contribution in #1858
- @xukp20 made their first contribution in #1886
Full Changelog: v0.2.2...v0.2.3
v0.2.2
Major changes
- Bump up to PyTorch v2.1 + CUDA 12.1 (vLLM+CUDA 11.8 is also provided)
- Extensive refactoring for better tensor parallelism & quantization support
- New models: Yi, ChatGLM, Phi
- Changes in scheduler: from 1D flattened input tensor to 2D tensor
- AWQ support for all models
- Added LogitsProcessor API (see the sketch after this list)
- Preliminary support for SqueezeLLM
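The LogitsProcessor API (#1469) lets callers hook into sampling by transforming the logits before each sampling step. A minimal sketch, assuming a processor is a callable taking the generated token ids and the next-token logits and returning modified logits, passed via `SamplingParams(logits_processors=...)`; check the `SamplingParams` docstring for the exact signature in your version:

```python
from vllm import LLM, SamplingParams

def ban_token(banned_id: int):
    # Assumed processor signature: (token_ids_so_far, logits) -> logits.
    def processor(token_ids, logits):
        logits[banned_id] = float("-inf")  # make the banned token unsampleable
        return logits
    return processor

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=32, logits_processors=[ban_token(42)])
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```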
What's Changed
- Change scheduler & input tensor shape by @WoosukKwon in #1381
- Add Mistral 7B to `test_models` by @WoosukKwon in #1366
- fix typo by @WrRan in #1383
- Fix TP bug by @WoosukKwon in #1389
- Fix type hints by @lxrite in #1427
- remove useless statements by @WrRan in #1408
- Pin dependency versions by @thiagosalvatore in #1429
- SqueezeLLM Support by @chooper1 in #1326
- aquila model add rope_scaling by @Sanster in #1457
- fix: don't skip first special token. by @gesanqiu in #1497
- Support repetition_penalty by @beginlner in #1424
- Fix bias in InternLM by @WoosukKwon in #1501
- Delay GPU->CPU sync in sampling by @Yard1 in #1337
- Refactor LLMEngine demo script for clarity and modularity by @iongpt in #1413
- Fix logging issues by @Tostino in #1494
- Add py.typed so consumers of vLLM can get type checking by @jroesch in #1509
- vLLM always places spaces between special tokens by @blahblahasdf in #1373
- [Fix] Fix duplicated logging messages by @zhuohan123 in #1524
- Add dockerfile by @skrider in #1350
- Fix integer overflows in attention & cache ops by @WoosukKwon in #1514
- [Small] Formatter only checks lints in changed files by @cadedaniel in #1528
- Add `MptForCausalLM` key in model_loader by @wenfeiy-db in #1526
- [BugFix] Fix a bug when engine_use_ray=True and worker_use_ray=False and TP>1 by @beginlner in #1531
- Adding a health endpoint by @Fluder-Paradyne in #1540
- Remove `MPTConfig` by @WoosukKwon in #1529
- Force paged attention v2 for long contexts by @Yard1 in #1510
- docs: add description by @lots-o in #1553
- Added logits processor API to sampling params by @noamgat in #1469
- YaRN support implementation by @Yard1 in #1264
- Add Quantization and AutoAWQ to docs by @casper-hansen in #1235
- Support Yi model by @esmeetu in #1567
- ChatGLM2 Support by @GoHomeToMacDonal in #1261
- Upgrade to CUDA 12 by @zhuohan123 in #1527
- [Worker] Fix input_metadata.selected_token_indices in worker by @ymwangg in #1546
- Build CUDA11.8 wheels for release by @WoosukKwon in #1596
- Add Yi model to quantization support by @forpanyang in #1600
- Dockerfile: Upgrade Cuda to 12.1 by @GhaziSyed in #1609
- config parser: add ChatGLM2 seq_length to `_get_and_verify_max_len` by @irasin in #1617
- Fix cpu heavy code in async function _AsyncLLMEngine._run_workers_async by @dominik-schwabe in #1628
- Fix #1474 - gptj AssertionError : assert param_slice.shape == loaded_weight.shape by @lihuahua123 in #1631
- [Minor] Move RoPE selection logic to `get_rope` by @WoosukKwon in #1633
- Add DeepSpeed MII backend to benchmark script by @WoosukKwon in #1649
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models by @zhuohan123 in #1622
- Remove `MptConfig` by @megha95 in #1668
- feat(config): support parsing torch.dtype by @aarnphm in #1641
- Fix loading error when safetensors contains empty tensor by @twaka in #1687
- [Minor] Fix duplication of ignored seq group in engine step by @simon-mo in #1666
- [models] Microsoft Phi 1.5 by @maximzubkov in #1664
- [Fix] Update Supported Models List by @zhuohan123 in #1690
- Return usage for openai requests by @ichernev in #1663
- [Fix] Fix comm test by @zhuohan123 in #1691
- Update the adding-model doc according to the new refactor by @zhuohan123 in #1692
- Add 'not' to this annotation: "#FIXME(woosuk): Do not use internal method" by @linotfan in #1704
- Support Min P Sampler by @esmeetu in #1642
- Read quantization_config in hf config by @WoosukKwon in #1695
- Support download models from www.modelscope.cn by @liuyhwangyh in #1588
- follow up of #1687 when safetensors model contains 0-rank tensors by @twaka in #1696
- Add AWQ support for all models by @WoosukKwon in #1714
- Support fused add rmsnorm for LLaMA by @beginlner in #1667
- [Fix] Fix warning msg on quantization by @WoosukKwon in #1715
- Bump up the version to v0.2.2 by @WoosukKwon in #1689
New Contributors
- @lxrite made their first contribution in #1427
- @thiagosalvatore made their first contribution in #1429
- @chooper1 made their first contribution in #1326
- @beginlner made their first contribution in #1424
- @iongpt made their first contribution in #1413
- @Tostino made their first contribution in #1494
- @jroesch made their first contribution in #1509
- @skrider made their first contribution in #1350
- @cadedaniel made their first contribution in #1528
- @wenfeiy-db made their first contribution in #1526
- @Fluder-Paradyne made their first contribution in #1540
- @lots-o made their first contribution in #1553
- @noamgat made their first contribution in #1469
- @casper-hansen made their first contribution in #1235
- @GoHomeToMacDonal made their first contribution in #1261
- @ymwangg made their first contribution in #1546
- @forpanyang made their first contribution in #1600
- @GhaziSyed made their first contribution in #1609
- @irasin made their first contribution in #1617
- @dominik-schwabe made their first contribution in #1628
- @lihuahua123 made their first contribution in #1631
- @megha95 made their first contribution in #1668
- @aarnphm made their first contribution in #1641
- @simon-mo made their first contribution in #1666
- @maximzubkov made their first contribution in #1664
- @ichernev made their first contribution in #1663
- @linotfan made their first contribution in #1704
- @liuyhwangyh made their first contribution in #1588
Full Changelog: v0.2.1...v0.2.2
v0.2.1.post1
This is an emergency release to fix a bug in tensor parallelism support.
v0.2.1
Major Changes
- PagedAttention V2 kernel: Up to 20% end-to-end latency reduction
- Support log probabilities for prompt tokens (see the sketch after this list)
- AWQ support for Mistral 7B
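Prompt logprobs (#1328) return per-token log probabilities for the prompt itself, not just the completion. A minimal sketch, assuming the field is named `prompt_logprobs` on `SamplingParams` and that results are attached to the request output; verify the names against this version's docstrings:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# prompt_logprobs=1: return the top-1 logprob for each prompt token (assumed semantics).
params = SamplingParams(max_tokens=1, logprobs=1, prompt_logprobs=1)
output = llm.generate(["vLLM is a fast inference engine."], params)[0]
print(output.prompt_logprobs)
```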
What's Changed
- fixing typo in `tiiuae/falcon-rw-7b` model name by @0ssamaak0 in #1226
- Added `dtype` arg to benchmarks by @kg6-sleipnir in #1228
- fix vulnerable memory modification to gpu shared memory by @soundOfDestiny in #1241
- support sharding llama2-70b on more than 8 GPUs by @zhuohan123 in #1209
- [Minor] Fix type annotations by @WoosukKwon in #1238
- TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic by @zhuohan123 in #1181
- add support for tokenizer revision by @cassanof in #1163
- Use monotonic time where appropriate by @Yard1 in #1249
- API server support ipv4 / ipv6 dualstack by @yunfeng-scale in #1288
- Move bfloat16 check to worker by @Yard1 in #1259
- [FIX] Explain why the finished_reason of ignored sequences are length by @zhuohan123 in #1289
- Update README.md by @zhuohan123 in #1292
- [Minor] Fix comment in mistral.py by @zhuohan123 in #1303
- lock torch version to 2.0.1 when build for #1283 by @yanxiyue in #1290
- minor update by @WrRan in #1311
- change the timing of sorting logits by @yhlskt23 in #1309
- workaround of AWQ for Turing GPUs by @twaka in #1252
- Fix overflow in awq kernel by @chu-tianxiang in #1295
- Update model_loader.py by @AmaleshV in #1278
- Add blacklist for model checkpoint by @WoosukKwon in #1325
- Update README.md Aquila2. by @ftgreat in #1331
- Improve detokenization performance by @Yard1 in #1338
- Bump up transformers version & Remove MistralConfig by @WoosukKwon in #1254
- Fix the issue for AquilaChat2-* models by @lu-wang-dl in #1339
- Fix error message on `TORCH_CUDA_ARCH_LIST` by @WoosukKwon in #1239
- Minor fix on AWQ kernel launch by @WoosukKwon in #1356
- Implement PagedAttention V2 by @WoosukKwon in #1348
- Implement prompt logprobs & Batched topk for computing logprobs by @zhuohan123 in #1328
- Fix PyTorch version to 2.0.1 in workflow by @WoosukKwon in #1377
- Fix PyTorch index URL in workflow by @WoosukKwon in #1378
- Fix sampler test by @WoosukKwon in #1379
- Bump up the version to v0.2.1 by @zhuohan123 in #1355
New Contributors
- @0ssamaak0 made their first contribution in #1226
- @kg6-sleipnir made their first contribution in #1228
- @soundOfDestiny made their first contribution in #1241
- @cassanof made their first contribution in #1163
- @yunfeng-scale made their first contribution in #1288
- @yanxiyue made their first contribution in #1290
- @yhlskt23 made their first contribution in #1309
- @chu-tianxiang made their first contribution in #1295
- @AmaleshV made their first contribution in #1278
- @lu-wang-dl made their first contribution in #1339
Full Changelog: v0.2.0...v0.2.1
v0.2.0
Major changes
- Up to 60% performance improvement by optimizing de-tokenization and sampler
- Initial support for AWQ (performance not optimized; see the sketch after this list)
- Support for RoPE scaling and LongChat
- Support for Mistral-7B
- Many bug fixes
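AWQ support is initial and not yet performance-optimized, but a quantized checkpoint can already be loaded for offline inference. A minimal sketch, assuming the `quantization="awq"` argument on `LLM`; the model name below is only an example of an AWQ checkpoint on the Hugging Face Hub, so substitute one that matches an architecture with AWQ support in this release (LLaMA-family at the time):

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint name; replace with the AWQ-quantized model you actually use.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
params = SamplingParams(temperature=0.8, max_tokens=64)
print(llm.generate(["Explain paged attention in one sentence."], params)[0].outputs[0].text)
```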
What's Changed
- add option to shorten prompt print in log by @leiwen83 in #991
- Make `max_model_len` configurable by @Yard1 in #972
- Fix typo in README.md by @eltociear in #1033
- Use TGI-like incremental detokenization by @Yard1 in #984
- Add Model Revision Support in #1014
- [FIX] Minor bug fixes by @zhuohan123 in #1035
- Announce paper release by @WoosukKwon in #1036
- Fix detokenization leaving special tokens by @Yard1 in #1044
- Add pandas to requirements.txt by @WoosukKwon in #1047
- OpenAI-Server: Only fail if logit_bias has actual values by @LLukas22 in #1045
- Fix warning message on LLaMA FastTokenizer by @WoosukKwon in #1037
- Abort when coroutine is cancelled by @rucyang in #1020
- Implement AWQ quantization support for LLaMA by @WoosukKwon in #1032
- Remove AsyncLLMEngine busy loop, shield background task by @Yard1 in #1059
- Fix hanging when prompt exceeds limit by @chenxu2048 in #1029
- [FIX] Don't initialize parameter by default by @zhuohan123 in #1067
- added support for quantize on LLM module by @orellavie1212 in #1080
- align llm_engine and async_engine step method. by @esmeetu in #1081
- Fix get_max_num_running_seqs for waiting and swapped seq groups by @zhuohan123 in #1068
- Add safetensors support for quantized models by @WoosukKwon in #1073
- Add minimum capability requirement for AWQ by @WoosukKwon in #1064
- [Community] Add vLLM Discord server by @zhuohan123 in #1086
- Add pyarrow to dependencies & Print warning on Ray import error by @WoosukKwon in #1094
- Add gpu_memory_utilization and swap_space to LLM by @WoosukKwon in #1090
- Add documentation to Triton server tutorial by @tanmayv25 in #983
- rope_theta and max_position_embeddings from config by @Yard1 in #1096
- Replace torch.cuda.DtypeTensor with torch.tensor by @WoosukKwon in #1123
- Add float16 and float32 to dtype choices by @WoosukKwon in #1115
- clean api code, remove redundant background task. by @esmeetu in #1102
- feat: support stop_token_ids parameter. by @gesanqiu in #1097
- Use `--ipc=host` in `docker run` for distributed inference by @WoosukKwon in #1125
- Docs: Fix broken link to openai example by @nkpz in #1145
- Announce the First vLLM Meetup by @WoosukKwon in #1148
- [Sampler] Vectorized sampling (simplified) by @zhuohan123 in #1048
- [FIX] Simplify sampler logic by @zhuohan123 in #1156
- Fix config for Falcon by @WoosukKwon in #1164
- Align `max_tokens` behavior with openai by @HermitSun in #852
- [Setup] Enable `TORCH_CUDA_ARCH_LIST` for selecting target GPUs by @WoosukKwon in #1074
- Add comments on RoPE initialization by @WoosukKwon in #1176
- Allocate more shared memory to attention kernel by @Yard1 in #1154
- Support Longchat by @LiuXiaoxuanPKU in #555
- fix typo (?) by @WrRan in #1184
- fix qwen-14b model by @Sanster in #1173
- Automatically set `max_num_batched_tokens` by @WoosukKwon in #1198
- Use standard extras for `uvicorn` by @danilopeixoto in #1166
- Keep special sampling params by @blahblahasdf in #1186
- qwen add rope_scaling by @Sanster in #1210
- [Mistral] Mistral-7B-v0.1 support by @Bam4d in #1196
- Fix Mistral model by @WoosukKwon in #1220
- [Fix] Remove false assertion by @WoosukKwon in #1222
- Add Mistral to supported model list by @WoosukKwon in #1221
- Fix OOM in attention kernel test by @WoosukKwon in #1223
- Provide default max model length by @WoosukKwon in #1224
- Bump up the version to v0.2.0 by @WoosukKwon in #1212
New Contributors
- @leiwen83 made their first contribution in #991
- @LLukas22 made their first contribution in #1045
- @rucyang made their first contribution in #1020
- @chenxu2048 made their first contribution in #1029
- @orellavie1212 made their first contribution in #1080
- @tanmayv25 made their first contribution in #983
- @nkpz made their first contribution in #1145
- @WrRan made their first contribution in #1184
- @danilopeixoto made their first contribution in #1166
- @blahblahasdf made their first contribution in #1186
- @Bam4d made their first contribution in #1196
Full Changelog: v0.1.7...v0.2.0
v0.1.7
A minor release to fix the bugs in ALiBi, Falcon-40B, and Code Llama.
What's Changed
- fix "tansformers_module" ModuleNotFoundError when load model with
trust_remote_code=True
by @Jingru in #871 - Fix wrong dtype in PagedAttentionWithALiBi bias by @Yard1 in #996
- fix: CUDA error when inferencing with Falcon-40B base model by @kyujin-cho in #992
- [Docs] Update installation page by @WoosukKwon in #1005
- Update setup.py by @WoosukKwon in #1006
- Use FP32 in RoPE initialization by @WoosukKwon in #1004
- Bump up the version to v0.1.7 by @WoosukKwon in #1013
New Contributors
- @Jingru made their first contribution in #871
- @kyujin-cho made their first contribution in #992
Full Changelog: v0.1.6...v0.1.7
v0.1.6
Note: This is an emergency release to revert a breaking API change that would break much existing code using AsyncLLMServer.
What's Changed
- faster startup of vLLM by @ri938 in #982
- Start background task in `AsyncLLMEngine.generate` by @Yard1 in #988
- Bump up the version to v0.1.6 by @zhuohan123 in #989
New Contributors
Full Changelog: v0.1.5...v0.1.6
v0.1.5
Major Changes
- Align beam search with `hf_model.generate` (see the sketch after this list).
- Stabilize AsyncLLMEngine with a background engine loop.
- Add support for CodeLLaMA.
- Add many model correctness tests.
- Many other correctness fixes.
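Beam search results are now aligned with `hf_model.generate`. A minimal sketch of invoking it, assuming the `use_beam_search` and `best_of` fields on `SamplingParams` (with `best_of` acting as the beam width) and that beam search requires temperature 0; check this version's parameter validation for the exact constraints:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# best_of is assumed to act as the beam width when use_beam_search=True.
params = SamplingParams(use_beam_search=True, best_of=4, temperature=0.0, max_tokens=32)
print(llm.generate(["Once upon a time"], params)[0].outputs[0].text)
```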
What's Changed
- Add support for CodeLlama by @Yard1 in #854
- [Fix] Fix a condition for ignored sequences by @zhuohan123 in #867
- use flash-attn via xformers by @tmm1 in #877
- Enable request body OpenAPI spec for OpenAI endpoints by @Peilun-Li in #865
- Accelerate LLaMA model loading by @JF-D in #234
- Improve _prune_hidden_states micro-benchmark by @tmm1 in #707
- fix: bug fix when penalties are negative by @pfldy2850 in #913
- [Docs] Minor fixes in supported models by @WoosukKwon in #920
- Fix README.md Link by @zhuohan123 in #927
- Add tests for models by @WoosukKwon in #922
- Avoid compiling kernels for double data type by @WoosukKwon in #933
- [BugFix] Fix NaN errors in paged attention kernel by @WoosukKwon in #936
- Refactor AsyncLLMEngine by @Yard1 in #880
- Only emit warning about internal tokenizer if it isn't being used by @nelson-liu in #939
- Align vLLM's beam search implementation with HF generate by @zhuohan123 in #857
- Initialize AsyncLLMEngine bg loop correctly by @Yard1 in #943
- FIx vLLM cannot launch by @HermitSun in #948
- Clean up kernel unit tests by @WoosukKwon in #938
- Use queue for finished requests by @Yard1 in #957
- [BugFix] Implement RoPE for GPT-J by @WoosukKwon in #941
- Set torch default dtype in a context manager by @Yard1 in #971
- Bump up transformers version in requirements.txt by @WoosukKwon in #976
- Make `AsyncLLMEngine` more robust & fix batched abort by @Yard1 in #969
- Enable safetensors loading for all models by @zhuohan123 in #974
- [FIX] Fix Alibi implementation in PagedAttention kernel by @zhuohan123 in #945
- Bump up the version to v0.1.5 by @WoosukKwon in #944
New Contributors
- @tmm1 made their first contribution in #877
- @Peilun-Li made their first contribution in #865
- @JF-D made their first contribution in #234
- @pfldy2850 made their first contribution in #913
- @nelson-liu made their first contribution in #939
Full Changelog: v0.1.4...v0.1.5
vLLM v0.1.4
Major changes
- From now on, vLLM is published with pre-built CUDA binaries. Users no longer have to compile vLLM's CUDA kernels on their machine.
- New models: InternLM, Qwen, Aquila.
- Optimized CUDA kernels for paged attention and GELU.
- Many bug fixes.
What's Changed
- Fix gibberish outputs of GPT-BigCode-based models by @HermitSun in #676
- [OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel by @naed90 in #420
- add QWen-7b support by @Sanster in #685
- add internlm model by @gqjia in #528
- Check the max prompt length for the OpenAI completions API by @nicobasile in #472
- [Fix] unwantted bias in InternLM Model by @wangruohui in #740
- Supports tokens and arrays of tokens as inputs to the OpenAI completion API by @wanmok in #715
- Fix baichuan doc style by @UranusSeven in #748
- Fix typo in tokenizer.py by @eltociear in #750
- Align with huggingface Top K sampling by @Abraham-Xu in #753
- explicitly del state by @cauyxy in #784
- Fix typo in sampling_params.py by @wangcx18 in #788
- [Feature | CI] Added a github action to build wheels by @Danielkinz in #746
- set default compute capability according to cuda version by @zxdvd in #773
- Fix mqa is false case in gpt_bigcode by @zhaoyang-star in #806
- Add support for aquila by @shunxing1234 in #663
- Update Supported Model List by @zhuohan123 in #825
- Fix 'GPTBigCodeForCausalLM' object has no attribute 'tensor_model_parallel_world_size' by @HermitSun in #827
- Add compute capability 8.9 to default targets by @WoosukKwon in #829
- Implement approximate GELU kernels by @WoosukKwon in #828
- Fix typo of Aquila in README.md by @ftgreat in #836
- Fix for breaking changes in xformers 0.0.21 by @WoosukKwon in #834
- Clean up code by @wenjun93 in #844
- Set replacement=True in torch.multinomial by @WoosukKwon in #858
- Bump up the version to v0.1.4 by @WoosukKwon in #846
New Contributors
- @naed90 made their first contribution in #420
- @gqjia made their first contribution in #528
- @nicobasile made their first contribution in #472
- @wanmok made their first contribution in #715
- @UranusSeven made their first contribution in #748
- @eltociear made their first contribution in #750
- @Abraham-Xu made their first contribution in #753
- @cauyxy made their first contribution in #784
- @wangcx18 made their first contribution in #788
- @Danielkinz made their first contribution in #746
- @zhaoyang-star made their first contribution in #806
- @shunxing1234 made their first contribution in #663
- @ftgreat made their first contribution in #836
- @wenjun93 made their first contribution in #844
Full Changelog: v0.1.3...v0.1.4
vLLM v0.1.3
What's Changed
Major changes
- More model support: LLaMA 2, Falcon, GPT-J, Baichuan, etc.
- Efficient support for MQA and GQA.
- Changes in the scheduling algorithm: vLLM now uses a TGI-style continuous batching.
- And many bug fixes.
All changes
- fix: only response [DONE] once when streaming response. by @gesanqiu in #378
- [Fix] Change /generate response-type to json for non-streaming by @nicolasf in #374
- Add trust-remote-code flag to handle remote tokenizers by @codethazine in #364
- avoid python list copy in sequence initialization by @LiuXiaoxuanPKU in #401
- [Fix] Sort LLM outputs by request ID before return by @WoosukKwon in #402
- Add trust_remote_code arg to get_config by @WoosukKwon in #405
- Don't try to load training_args.bin by @lpfhs in #373
- [Model] Add support for GPT-J by @AndreSlavescu in #226
- fix: freeze pydantic to v1 by @kemingy in #429
- Fix handling of special tokens in decoding. by @xcnick in #418
- add vocab padding for LLama(Support WizardLM) by @esmeetu in #411
- Fix the `KeyError` when loading bloom-based models by @HermitSun in #441
- Optimize MQA Kernel by @zhuohan123 in #452
- Offload port selection to OS by @zhangir-azerbayev in #467
- [Doc] Add doc for running vLLM on the cloud by @Michaelvll in #426
- [Fix] Fix the condition of max_seq_len by @zhuohan123 in #477
- Add support for baichuan by @codethazine in #365
- fix max seq len by @LiuXiaoxuanPKU in #489
- Fixed old name reference for max_seq_len by @MoeedDar in #498
- hotfix attn alibi wo head mapping by @Oliver-ss in #496
- fix(ray_utils): ignore re-init error by @mspronesti in #465
- Support `trust_remote_code` in benchmark by @wangruohui in #518
- fix: enable trust-remote-code in api server & benchmark. by @gesanqiu in #509
- Ray placement group support by @Yard1 in #397
- Fix bad assert in initialize_cluster if PG already exists by @Yard1 in #526
- Add support for LLaMA-2 by @zhuohan123 in #505
- GPTJConfig has no attribute rotary. by @leegohi04517 in #532
- [Fix] Fix GPTBigcoder for distributed execution by @zhuohan123 in #503
- Fix paged attention testing. by @shanshanpt in #495
- fixed tensor parallel is not defined by @MoeedDar in #564
- Add Baichuan-7B to README by @zhuohan123 in #494
- [Fix] Add chat completion Example and simplify dependencies by @zhuohan123 in #576
- [Fix] Add model sequence length into model config by @zhuohan123 in #575
- [Fix] fix import error of RayWorker (#604) by @zxdvd in #605
- fix ModuleNotFoundError by @mklf in #599
- [Doc] Change old max_seq_len to max_model_len in docs by @SiriusNEO in #622
- fix biachuan-7b tp by @Sanster in #598
- [Model] support baichuan-13b based on baichuan-7b by @Oliver-ss in #643
- Fix log message in scheduler by @LiuXiaoxuanPKU in #652
- Add Falcon support (new) by @zhuohan123 in #592
- [BUG FIX] upgrade fschat version to 0.2.23 by @YHPeter in #650
- Refactor scheduler by @WoosukKwon in #658
- [Doc] Add Baichuan 13B to supported models by @zhuohan123 in #656
- Bump up version to 0.1.3 by @zhuohan123 in #657
New Contributors
- @nicolasf made their first contribution in #374
- @codethazine made their first contribution in #364
- @lpfhs made their first contribution in #373
- @AndreSlavescu made their first contribution in #226
- @kemingy made their first contribution in #429
- @xcnick made their first contribution in #418
- @esmeetu made their first contribution in #411
- @HermitSun made their first contribution in #441
- @zhangir-azerbayev made their first contribution in #467
- @MoeedDar made their first contribution in #498
- @Oliver-ss made their first contribution in #496
- @mspronesti made their first contribution in #465
- @wangruohui made their first contribution in #518
- @Yard1 made their first contribution in #397
- @leegohi04517 made their first contribution in #532
- @shanshanpt made their first contribution in #495
- @zxdvd made their first contribution in #605
- @mklf made their first contribution in #599
- @SiriusNEO made their first contribution in #622
- @Sanster made their first contribution in #598
- @YHPeter made their first contribution in #650
Full Changelog: v0.1.2...v0.1.3