[V1] Implement vLLM V1 [1/N] #9289
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@WoosukKwon thanks for the hard work on this, looks like you made good progress. Left some comments/clarifications.
vllm/entrypoints/llm.py
Outdated
# FIXME:
engine_args.max_num_seqs = max(engine_args.max_num_seqs, 2048)
engine_args.enable_chunked_prefill = False
self.llm_engine = LLMEngineV1.from_engine_args(
Is this where we want to switch between vllm_v1 and the old vllm?
Consider adding an if here?
I introduced a new env variable VLLM_USE_V1, which is 0 by default. By setting this env variable, users can use the V1 code path.
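For illustration, the switch can be driven by reading the variable once at import time; a minimal sketch under that assumption (the flag name handling and import placement here are illustrative, not necessarily the PR's exact code):
import os

# Hypothetical: treat any non-zero VLLM_USE_V1 value as opting into the V1 path.
USE_V1 = bool(int(os.getenv("VLLM_USE_V1", "0")))

if USE_V1:
    from vllm_v1.engine.llm_engine import LLMEngine
else:
    from vllm.engine.llm_engine import LLMEngine

# The entrypoint can then build whichever engine was selected:
# self.llm_engine = LLMEngine.from_engine_args(engine_args)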
vllm_v1/worker/gpu_model_runner.py
Outdated
# Calculate the slot mapping.
block_numbers = self.persistent_batch.block_table_cpu_tensor.flatten()[
    token_indices // self.block_size]
How does having M inside token_indices (to separate requests) affect the block_numbers we get here? Doesn't this result in a "jump"?
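For context, a toy example of how such a flattened lookup can stay within each request's own blocks, assuming the per-request offset stride matches the block table's row width (the sizes and values below are illustrative assumptions, not the PR's actual ones):
import torch

# Toy sizes: each request owns one row of the block table.
block_size = 4
max_model_len = 8  # assumed to be a multiple of block_size
block_table = torch.tensor([[10, 11],   # request 0's block ids
                            [20, 21]])  # request 1's block ids

# Global token index = per-request position + req_index * max_model_len.
positions = torch.tensor([0, 1, 2, 0, 1])
req_indices = torch.tensor([0, 0, 0, 1, 1])
token_indices = positions + req_indices * max_model_len

block_numbers = block_table.flatten()[token_indices // block_size]
print(block_numbers)  # tensor([10, 10, 10, 20, 20]) -- each request stays in its own row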
Done with most of the review; no brain left for GPUModelRunner. Will look into it more tomorrow.
vllm_v1/worker/gpu_worker.py
Outdated
def _get_cache_block_size(
    cache_config: CacheConfig,
    model_config: ModelConfig,
    parallel_config: ParallelConfig,
) -> int:
    head_size = model_config.get_head_size()
    num_heads = model_config.get_num_kv_heads(parallel_config)
    num_attention_layers = model_config.get_num_attention_layers(
        parallel_config)

    key_cache_block = cache_config.block_size * num_heads * head_size
    value_cache_block = key_cache_block
    total = num_attention_layers * (key_cache_block + value_cache_block)
    if cache_config.cache_dtype == "auto":
        dtype = model_config.dtype
    else:
        dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]
    dtype_size = get_dtype_size(dtype)
    return dtype_size * total
Probably not in this PR/re-arch, but eventually should we move this to the model code?
Hmm, yes? I actually didn't care much, because the code is small and doesn't add any complexity.
vllm_v1/worker/gpu_model_runner.py
Outdated
self.top_p = torch.empty((max_num_reqs, ),
                         dtype=torch.float32,
                         device=device)
self.top_p_cpu_tensor = torch.empty((max_num_reqs, ),
                                    dtype=torch.float32,
                                    device="cpu",
                                    pin_memory=pin_memory)
Can we have an abstraction for the logic around self.x, self.x_cpu_tensor, self.x_cpu, self.x_reqs for different x?
Can you please elaborate more?
vllm_v1/worker/gpu_model_runner.py
Outdated
    scheduler_output: "SchedulerOutput",
) -> ModelRunnerOutput:
    self._update_states(scheduler_output)
    inputs = self._prepare_inputs(scheduler_output)
Why do we need scheduler_output to prepare the inputs if we cache all the request states in the model runner?
The scheduler output contains 1) the scheduling decision (req id -> num_tokens), 2) all the data for new requests, and 3) the new block ids for in-flight requests.
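For illustration, an output shaped like that description might look roughly as follows (a hypothetical sketch; the class and field names are assumptions, not the PR's actual definitions):
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SchedulerOutputSketch:
    # 1) Scheduling decision: request id -> number of tokens to run this step.
    num_scheduled_tokens: Dict[str, int] = field(default_factory=dict)
    # 2) Full data for requests that are new this step (prompt token ids, sampling params, block ids).
    new_requests: List[dict] = field(default_factory=list)
    # 3) Newly allocated block ids for requests already in flight.
    new_block_ids: Dict[str, List[int]] = field(default_factory=dict)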
vllm_v1/worker/gpu_model_runner.py
Outdated
# NOTE: CPU-GPU synchronization happens here.
sampled_token_ids = sampler_output.sampled_token_ids.cpu()
sampled_token_ids_list = sampled_token_ids.tolist()
# TODO: Optimize.
Can you be a bit more specific on what to optimize?
Added more comments.
vllm/entrypoints/llm.py
Outdated
from vllm_v1.engine.llm_engine import LLMEngine as LLMEngineV1
from vllm_v1.outputs import RequestOutput as RequestOutputV1
If the interface is compatible, would the following be easier?
if USE_V1:
    from vllm_v1.engine.llm_engine import LLMEngine
    from vllm_v1.outputs import RequestOutput
else:
    from vllm.engine.llm_engine import LLMEngine
    from vllm.outputs import RequestOutput
Yeah I introduced the VLLM_USE_V1 env variable and added a similar if statement. PTAL.
vllm_v1/core/kv_cache_manager.py
Outdated
def get_computed_blocks(self, request: Request) -> List[int]:
    if not self.enable_caching:
        # No prefix caching.
        return []
    # TODO(woosuk): Implement hash-based caching.
    return []
One thing to think about before implementing hash-based caching: where should the hash be calculated?
In block manager v1, the hash was calculated in the sequence (a.k.a. Request), while in block manager v2 it is calculated in the block manager. Calculating the hash in Request ensures it is computed only once during the request's life cycle, but calculating it in the block manager makes more sense because the hash should be attached to cache blocks rather than to sequences.
cc @rickyyx who is working on prefix-caching aware scheduler in v0.
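For illustration, wherever it ends up living, the hash of a full block is typically chained with the previous block's hash so that a block is only shared when its entire prefix matches; a minimal sketch (the function names are assumptions, not the PR's API):
from typing import List, Optional, Tuple

def block_hash(prev_hash: Optional[int], token_ids: Tuple[int, ...]) -> int:
    # Chain the previous block's hash so identical token blocks with
    # different prefixes do not collide into the same cache entry.
    return hash((prev_hash, token_ids))

def hash_full_blocks(token_ids: List[int], block_size: int) -> List[int]:
    # Only fully-filled blocks are worth hashing/caching.
    num_full = len(token_ids) // block_size
    hashes: List[int] = []
    prev: Optional[int] = None
    for i in range(num_full):
        block = tuple(token_ids[i * block_size:(i + 1) * block_size])
        prev = block_hash(prev, block)
        hashes.append(prev)
    return hashes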
vllm_v1/core/kv_cache_manager.py
Outdated
num_blocks = cdiv(request.num_computed_tokens + num_tokens,
                  self.block_size)
req_block_ids = self.req_to_block_ids[request.request_id]
num_new_blocks = num_blocks - len(req_block_ids)
Can this be calculated incrementally? The only information missing here to determine how many new blocks we need is the number of empty slots in the last block, and the block manager should have that information, so maybe we could do something like the following:
req_block_ids = self.req_to_block_ids[request.request_id]
empty_slots = req_block_ids[-1].empty_slots
if num_tokens <= empty_slots:
    # No new block is needed.
    return []
num_new_blocks = (num_tokens - empty_slots) // self.block_size
...
Sorry, could you explain why you prefer that over this implementation? I personally find the current implementation more concise and intuitive (if I didn't miss anything).
vllm_v1/worker/gpu_model_runner.py
Outdated
self.max_num_tokens = scheduler_config.max_num_batched_tokens

# Lazy initialization
self.model: nn.Module  # Set after load_model
Same question. Is this a new feature supported only by later Python versions? I got an error with Python 3.9:
>>> class A:
...     def __init__(self):
...         self.data: str
...
>>> a = A()
>>> a.data
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'A' object has no attribute 'data'
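For reference, this behavior is not version-specific: a bare annotation such as self.data: str is only a type hint and never creates the attribute in any Python 3 version; the attribute exists only once something assigns to it (here, after load_model). A small illustration:
class A:
    def __init__(self):
        # Annotation only: documents the intended type, but assigns nothing,
        # so the attribute does not exist until it is actually set.
        self.data: str

a = A()
print(hasattr(a, "data"))  # False on 3.9 and on newer versions alike
a.data = "now it exists"
print(a.data)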
vllm_v1/worker/gpu_model_runner.py
Outdated
    if removed_req_indices:
        self.persistent_batch.condense(removed_req_indices)

def _prepare_inputs(self, scheduler_output: "SchedulerOutput"):
This function is unfortunately already complicated, and I can't imagine how complicated it will be once other features (e.g., LoRA, multi-modal) are added... It also seems hard to layer those features on after this function, e.g. something like
model_inputs = self._prepare_inputs(...)
model_inputs = self._prepare_lora_inputs(model_inputs)
@zhuohan123 Can you please take another look?
# OPTIMIZATION: Cache the request output and update it incrementally.
# This is used to avoid creating a new RequestOutput object every step.
# Request id -> RequestOutput
self.request_outputs: Dict[str, RequestOutput] = {}
One (very late) note: this caching may cause a bug. With AsyncLLMEngine, we put these RequestOutput objects into the per-request output queues, which the OpenAI server then uses to build the responses sent back to the client. If the LLMEngine gets ahead of the AsyncLLMEngine, we will mutate the object before the OpenAI server has had a chance to produce its output.
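For illustration, one way to avoid such a race is to hand the consumer its own snapshot rather than the shared cached object; a minimal sketch under that assumption (not the PR's actual fix):
import copy
from asyncio import Queue

async def emit_request_output(queue: Queue, cached_output) -> None:
    # Deep-copy the cached RequestOutput before enqueueing, so later
    # in-place updates by the engine cannot change what the server reads.
    await queue.put(copy.deepcopy(cached_output))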