[Draft][Core] Refactor _prepare_model_input_tensors #5972
Conversation
rkooo567 left a comment
I remember the goal was to write logic agnostic to prefill/decode (mainly because prefill is a special case of decode). At least that was the direction we wanted last time (and this PR seems to revert that direction). That's also why the existing prepare_inputs avoids distinguishing prefill/decode as much as possible. That will enable features such as https://github.com/vllm-project/vllm/pull/6052/files#diff-d3df23c3e3bcfe97ee8507061c6de54f0eff23a8c75d7f5999062c42245290f8
How difficult would it be to not distinguish prefill/decode, at least at the metadata level? Also, cc @zhuohan123
The reason I separated prefill/decode is that I observed the following:
Meanwhile, this separation shouldn't affect #6052, which focuses on the forward logic that is orthogonal to prepare_input. And some attention backends (e.g. xformers) cannot be unified in this way anyway. However, if you feel it's still better not to separate them, I can revert that in this PR. Happy to discuss :)
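To make the trade-off concrete, the metadata-level separation being discussed could look something like the sketch below. This is purely illustrative: the class and field names are hypothetical and simplified, not vLLM's actual `AttentionMetadata` API — the point is only that separate prefill/decode sub-structures let each attention backend consume just the fields it needs, at the cost of diverging from a unified code path.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sketch: separate metadata for the prefill and decode
# phases, grouped under one top-level metadata object. Names are
# illustrative and do not match vLLM's real classes.

@dataclass
class PrefillMetadata:
    seq_lens: List[int]   # full prompt lengths for prefill sequences
    max_seq_len: int

@dataclass
class DecodeMetadata:
    context_lens: List[int]   # already-computed context per decode seq
    max_context_len: int

@dataclass
class AttentionMetadata:
    num_prefill_tokens: int
    num_decode_tokens: int
    # Either half may be absent when a batch is pure prefill or
    # pure decode, so backends must handle None.
    prefill: Optional[PrefillMetadata]
    decode: Optional[DecodeMetadata]
```

A backend that cannot unify the two phases (e.g. the xformers case mentioned above) would then branch on `metadata.prefill is not None` / `metadata.decode is not None` instead of inspecting flattened per-token fields.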
Let me cc @zhuohan123 and @simon-mo for this one. We discussed this before, and I combined prepare_prefill/decode into a single API, and that was the direction they wanted at the time. It is the second item in this proposal: https://docs.google.com/document/d/1rg8CoOnrtz1LT-hCK86ZsHuhoTDtqSEGs8KrN4wbITo/edit I agree the logic is complex, but I think this is actually not fundamental but rather due to tech debt.
Moved to #6164
NOTE: This PR will be rebased after the following PRs are merged: #4628 #5942.
Meanwhile, reviews and comments are welcome.
This PR refactors `_prepare_model_input_tensors`. Specifically, we introduce `ModelRunnerInputBuilder`, mainly for logic isolation and modularization. `ModelRunnerInputBuilder` manages all processed input data, including token IDs, positions, sequence lengths, etc., in one place, and isolates the following logic:

Note that the purpose of this PR is to enable follow-up refactoring and optimizations, so we don't expect an obvious performance improvement at this moment, although the following optimizations may be slightly helpful:

- `.extend()`.

With this isolation, we could further have follow-up optimizations:

- Refactor `AttentionMetadata` to only include on-device tensors, and move all related logic from `ModelRunnerInputBuilder`.
- `for seq_id in seq_ids` in `ModelRunnerInputBuilder._add_decode_seq_group()` by leveraging tensor processing.
- `for seq_group_metadata in seq_group_metadata_list`.
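The builder pattern described above can be sketched roughly as follows. This is a minimal illustrative mock, not the PR's actual implementation: the method names (`add_seq_group`, `build`) and the plain-list/dict representation are assumptions for clarity, standing in for the real per-sequence-group processing and tensor construction.

```python
from typing import Dict, List


class ModelRunnerInputBuilder:
    """Hypothetical sketch: accumulate all processed model inputs
    (token IDs, positions, sequence lengths, ...) in one place,
    then materialize them at the end via build()."""

    def __init__(self) -> None:
        self.input_tokens: List[int] = []
        self.input_positions: List[int] = []
        self.seq_lens: List[int] = []

    def add_seq_group(self, token_ids: List[int]) -> None:
        # Append one sequence group's data. Using list.extend()
        # batches the per-sequence work instead of appending
        # token by token.
        self.input_tokens.extend(token_ids)
        self.input_positions.extend(range(len(token_ids)))
        self.seq_lens.append(len(token_ids))

    def build(self) -> Dict[str, List[int]]:
        # In the real code this step would produce on-device
        # tensors; here we just return the accumulated lists.
        return {
            "input_tokens": self.input_tokens,
            "input_positions": self.input_positions,
            "seq_lens": self.seq_lens,
        }
```

The isolation benefit is that the loop over `seq_group_metadata_list` only calls `add_seq_group`, so later optimizations (vectorizing the per-sequence loops, moving tensor construction out of `AttentionMetadata`) can happen inside the builder without touching the caller.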