Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: simon-mo <simon.mo@hey.com>
…default to None (#4586)
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
… Dynamic/Static Activations) (#4527)

Follow-on to #4332 to enable FP8 checkpoint loading for Mixtral; supersedes #4436. This PR enables the following checkpoint-loading features for Mixtral:
- Loading fp8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Static or dynamic activation quantization with static weight quantization (all per tensor)
- Different scales for each expert weight
- Fp8 in the QKV layer

Notes:
- The expert gate/router always runs at half/full precision for now.
- If there are different weight scales between the QKV layers (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can use a single gemm for performance.
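A rough sketch of the re-quantization step described in the notes above, assuming per-tensor fp8 weights with per-tensor scales (the function name and flow are illustrative, not the actual vLLM implementation; `torch.float8_e4m3fn` needs a recent PyTorch):

```python
import torch

def requantize_to_max_scale(weights, scales):
    """Re-quantize separately scaled fp8 tensors (e.g. Q, K, V) to one shared scale."""
    max_scale = torch.stack(list(scales)).max()
    requantized = []
    for w, s in zip(weights, scales):
        dequant = w.to(torch.float32) * s                    # back to real values
        requantized.append((dequant / max_scale).to(torch.float8_e4m3fn))
    return requantized, max_scale
```

With all shards sharing `max_scale`, the fused QKV projection can run as a single fp8 gemm instead of three.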
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
`format.sh` now has mypy checks after pulling in upstream changes. This PR makes the mypy-suggested modifications to our code. --------- Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
…ubi (#23)

Changes:
- vLLM v0.4.2 was published today; update our build to use pre-built libs from their wheel
- bump other dependencies in the image build (base UBI image, miniforge, flash attention, grpcio-tools, accelerate)
- little cleanup to remove `PYTORCH_` args that are no longer used

--------- Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Cherry-pick of fix commit 6100f4b from ODH: opendatahub-io/vllm#17 --------- Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Daniele Trifirò <dtrifiro@redhat.com>
Install and configure use of the NCCL version recommended by vLLM via the [vllm-nccl](https://github.com/vllm-project/vllm-nccl) package. The install is a little wonky... but this set of changes should work. Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
…4927) This PR enables the fused topk_softmax kernel used in the MoE layer for HIP
Signed-off-by: kevin <kevin@anyscale.com>
…g the env var that indicates the vLLM backend (#5210)
njhill (Contributor) reviewed on Jun 6, 2024:
Thanks @maxdebayser this looks much better!
The other thing we may need to look at is what to do when sequences are forked. I think this only applies to beam search. Is deep-copying the LPs the right thing to do? Could there be problems deep-copying arbitrary LPs?
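(A toy illustration of the forking question, with hypothetical names rather than vLLM classes: each beam-search fork needs its own copy of any stateful processor, and `copy.deepcopy` works for simple state but can be costly or fail outright for processors holding unpicklable objects such as locks.)

```python
import copy

class StatefulLogitsProcessor:
    """Hypothetical stateful LP: remembers the tokens of the sequence it serves."""
    def __init__(self):
        self.tokens_seen = []

    def __call__(self, token_ids, logits):
        self.tokens_seen = list(token_ids)   # state is only valid for one sequence
        return logits

parent_lp = StatefulLogitsProcessor()
fork_lp = copy.deepcopy(parent_lp)   # a fork would need its own copy; deep-copying
                                     # arbitrary LPs may fail (e.g. ones holding locks)
```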
Contributor
@maxdebayser after addressing the simple comments above (not necessarily the pooling thing yet), maybe you could open an upstream PR? Then we can continue the discussions with others...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Contributor (Author)
Here's the upstream PR: vllm-project/vllm#5329
Member
Closing as stale; this repo has been repurposed for Spyre enablement.
tdoublep pushed a commit that referenced this pull request on Jan 20, 2025
This PR fixes the model loading, adapting to the new fms convention.

### Changes:
- passing data_type FP16 for AIU and FP32 on CPU

#### Code Tests
Model loading has been tested for all models on CPU and for some on AIU (all models up to 7B in size).

#### Note
Please report any issues regarding loading models / running on AIU immediately.
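A hypothetical sketch of the dtype choice described in the changes above (the real loader and its device detection differ):

```python
import torch

def select_load_dtype(device_type: str) -> torch.dtype:
    # FP16 when loading for the AIU, FP32 when falling back to CPU.
    return torch.float16 if device_type == "aiu" else torch.float32
```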
This is a proposed solution to the guided decoding crash problem.
In vLLM there are two request types that can submit more than one sequence at the same time: the completion request in the legacy OpenAI API, and the generate-batch request in the tgis gRPC API. In both cases the sampling params are validated only once and a single SamplingParams object is shared between all sequences, even when they belong to different sequence groups. SamplingParams is mostly a data object, but it carries a list of logits processors that are executable. If the logits processors are stateless there is no problem. However, the CFGFSM used by the CFGLogitsProcessor has internal state that depends on the sequence generated so far and is updated at each iteration. Sharing it across sequences causes it to raise a KeyError and crash the asyncio event loop.

The first attempted solution added a seq_id parameter to the logits processor call so that it could manage state internally, but that changed an interface that is already used by some external libraries.
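A simplified toy reproduction of the failure mode, not the actual CFGLogitsProcessor code: a single shared processor keeps parser state only for the one generation it has been tracking, so a lookup for a second, interleaved sequence fails.

```python
class ToyCFGProcessor:
    """Toy stand-in for a CFG-style stateful logits processor (illustrative only)."""
    def __init__(self):
        # Like CFGFSM, it tracks a single generation-in-progress.
        self._state_for_prefix = {(): 0}

    def __call__(self, token_ids, logits):
        prefix = tuple(token_ids[:-1])
        state = self._state_for_prefix.pop(prefix)               # KeyError if another
        self._state_for_prefix = {tuple(token_ids): state + 1}   # sequence advanced it
        return logits

shared = ToyCFGProcessor()
shared([5], None)   # sequence A, step 1: OK, parser now tracks prefix (5,)
shared([7], None)   # sequence B, step 1: KeyError, the () state was consumed by A
```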
The solution proposed here is based on adding factories for stateful logits processors. The basic idea is:
- allow sampling_params.logits_processors to contain factories for stateful processors;
- when a sequence group is created, copy the logits_processors and call the factories to populate SequenceGroupState.logits_processors;
- at sampling time, use the per-group processors from SequenceGroupState instead of the sampling_params.logits_processors (a minimal sketch follows this list).
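A minimal sketch of that idea, assuming a hypothetical StatefulLogitsProcessorFactory marker class and build() method (SequenceGroupState.logits_processors is the field mentioned above; the real PR wires this through the engine):

```python
from typing import Callable, List

class StatefulLogitsProcessorFactory:
    """Marker base class: builds a fresh stateful processor per sequence group."""
    def build(self) -> Callable:
        raise NotImplementedError

class SequenceGroupState:
    """Per-group container; the sampler would read these instead of
    sampling_params.logits_processors."""
    def __init__(self, shared_logits_processors: List[Callable]):
        self.logits_processors: List[Callable] = []
        for lp in shared_logits_processors or []:
            if isinstance(lp, StatefulLogitsProcessorFactory):
                self.logits_processors.append(lp.build())  # fresh instance per group
            else:
                self.logits_processors.append(lp)          # stateless: safe to share
```

Each sequence group then owns its own stateful processors while stateless ones remain shared, which avoids the cross-sequence state corruption described above.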
Here are some diagrams of the current code structure to help visualize the proposed changes:


The idea is quite simple, but the execution is a bit tricky due to the nature of async code in Python, where an async call can't go through a non-async function that in turn needs to call an async function. In the PR I tried to support both using LLMEngine directly as well as the AsyncLLMEngine used for serving.

@njhill, I was going to add support for returning the processors to the factory, but I realized that it is a bit more complicated because only the scheduler knows when a sequence is done. Maybe we can add a callback somewhere in the scheduler where we can add the deallocation call. That might actually be required, because I realized there is another hidden bug: when sequences are preempted with the recompute policy, the state of the logits processor becomes invalid.
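To illustrate the asyncio constraint mentioned above (a generic Python example, not the PR's code): a plain synchronous function cannot await an async factory, and starting a new event loop from inside an already-running one raises an error, so the sync LLMEngine path and the AsyncLLMEngine path need separate entry points.

```python
import asyncio

async def build_processor_async():
    await asyncio.sleep(0)                    # e.g. compile a grammar asynchronously
    return lambda token_ids, logits: logits

def build_processor_sync():
    # A plain function cannot use `await`; asyncio.run() only works when no event
    # loop is already running, so this path cannot be reused from async code.
    return asyncio.run(build_processor_async())

async def main():
    lp = await build_processor_async()        # fine from async code
    # build_processor_sync()  # RuntimeError: asyncio.run() cannot be called
    #                         # from a running event loop

build_processor_sync()    # fine from sync code
asyncio.run(main())
```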