
[V1] AsyncLLM Implementation #9826

Merged: 175 commits into vllm-project:main on Nov 11, 2024

Conversation

robertgshaw2-neuralmagic (Collaborator) commented Oct 30, 2024

SUMMARY:

  • AsyncLLM in V1 - better overlapping of GPU and CPU

TODO:

  • Make mypy happy
  • Remove debugging polling in io threads
  • LM Eval testing
  • Cleaner shutdown
  • Get the test passing in the CI :)

FOLLOW UP PRS:

  • Benchmarking with CUDAGraphs (todo as follow up given cudagraphs are broken)
  • Robustness (health checks, make sure abort is working properly everywhere)
  • More AsyncLLM and LLMEngine tests (abort, stop string, other unit)
  • Enable multiprocessing for LLM by default (need to figure out a way around fork) - currently, need to set VLLM_ENABLE_V1_MULTIPROCESSING=1

DIAGRAM:

  • Note: this diagram is a bit dated. There is an EngineCoreClient class that is used by the AsyncLLM to interact with the EngineCore, but the overall architecture is close to what we have.
  • Note: stop strings are detected in the detokenizer and we send an abort message from output_handler_loop to EngineCore
[image: architecture diagram]
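To illustrate the second note, here is a highly simplified sketch of that abort path (the class and queue names follow the notes above and are not the actual vLLM implementation):

# Toy version of the output handler: stream detokenized text to the caller
# and tell the engine core to abort a request once a stop string is hit.
import asyncio


class FakeEngineCoreClient:
    async def abort(self, request_id: str) -> None:
        print(f"abort sent to EngineCore for {request_id}")


async def output_handler_loop(outputs: asyncio.Queue,
                              core: FakeEngineCoreClient) -> None:
    while True:
        request_id, text, stop_hit = await outputs.get()
        if request_id is None:  # shutdown sentinel
            return
        print(f"{request_id}: {text!r}")
        if stop_hit:
            # The detokenizer flagged a stop string: abort the request in
            # the engine core so it stops being scheduled.
            await core.abort(request_id)


async def main() -> None:
    outputs: asyncio.Queue = asyncio.Queue()
    handler = asyncio.create_task(
        output_handler_loop(outputs, FakeEngineCoreClient()))
    await outputs.put(("req-0", "Hello", False))
    await outputs.put(("req-0", " world.", True))  # stop string detected
    await outputs.put((None, None, None))
    await handler


asyncio.run(main())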

if envs.VLLM_USE_V1:
    from vllm.v1.engine.async_llm import AsyncLLMEngine  # type: ignore
else:
    from vllm.engine.async_llm_engine import AsyncLLMEngine  # type: ignore
Contributor

This won't work with tests, anything that tries to monkeypatch with m.setenv("VLLM_USE_V1", True) won't take effect because this runs the check at import time
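For illustration, one way around this would be to defer the check to call time — a minimal sketch, reusing the module paths from the snippet above (not necessarily how vLLM ended up resolving it):

# Minimal sketch: re-read the environment at call time instead of at import
# time, so tests that monkeypatch VLLM_USE_V1 see the updated value.
import os


def get_async_engine_cls():
    if os.getenv("VLLM_USE_V1", "0") == "1":
        from vllm.v1.engine.async_llm import AsyncLLMEngine  # type: ignore
    else:
        from vllm.engine.async_llm_engine import AsyncLLMEngine
    return AsyncLLMEngine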

joerunde (Contributor) commented Nov 8, 2024

@robertgshaw2-neuralmagic @njhill this looks super 🚀🚀🚀

Couple questions for my own understanding:

  • How should I interpret the name core? There's both a v1/core package as well as v1/engine/core.py
  • It looks like this is way faster than v0 + multistep decoding, are we planning on ditching multistep in v1 or is that still TBD?

Thanks!

with monkeypatch.context() as m:
    m.setenv("VLLM_USE_V1", "1")

    engine = AsyncLLM.from_engine_args(ENGINE_ARGS)
Contributor

Definitely out of scope for this PR but regarding the followup point

More AsyncLLM and LLMEngine tests (abort, stop string, other unit)

We should be able to reuse all the existing tests since the interfaces are the same, right? It'd just be a matter of hooking up a fixture to set the appropriate v1 environment variables and making sure we're initializing the engine under test with a method that returns the appropriate one. I'd be happy to make that my project for the next few days while y'all focus on building the fast stuff
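A fixture along the lines described might look like this (a sketch only; names such as use_v1 are illustrative, not what the test suite ultimately adopted):

# Sketch of a parametrized fixture that runs each test against both the V0
# and V1 engines by toggling the environment variable.
import pytest


@pytest.fixture(params=["0", "1"], ids=["v0-engine", "v1-engine"])
def use_v1(request, monkeypatch):
    monkeypatch.setenv("VLLM_USE_V1", request.param)
    yield request.param == "1"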

Collaborator (Author)

Yes, but there are features (e.g. logprobs) that need to land before we can turn on the existing unit tests. Additionally, some of the internal classes used for testing (e.g. the scheduler) now live in the engine core, so we can no longer access them directly from llm_engine. A lot of the tests may need refactoring given these changes.

Contributor

Ah, yeah of course. There's a @skip_v1 mark for tests of features V1 doesn't support yet.

some of the internal classes used for testing (e.g. the scheduler) now live in the engine core, so we can no longer access them directly from llm_engine

Oof, yeah, those sound like some less-than-ideal, nosy tests; I'll have to take a closer look at them.
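For reference, a custom marker like the skip_v1 mark mentioned above is typically applied as follows (a generic sketch; the actual marker registration lives in the test suite's conftest and is not shown in this thread):

import pytest


@pytest.mark.skip_v1  # assumed spelling; the skip logic would live in conftest
def test_feature_not_yet_in_v1():
    ...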


mergify bot commented Nov 9, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Nov 9, 2024

…s-proto
# Conflicts:
#   vllm/v1/engine/llm_engine.py
#   vllm/v1/tokenizer/detokenizer.py

mergify bot removed the needs-rebase label Nov 11, 2024
WoosukKwon (Collaborator) left a comment

Thanks for the great work!

Comment on lines +479 to +480
"VLLM_ENABLE_V1_MULTIPROCESSING":
lambda: bool(int(os.getenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0"))),
Collaborator

QQ: In which case should we turn this on?

Collaborator (Author)

VLLM_ENABLE_V1_MULTIPROCESSING=1 enables multiprocessing for EngineCore inside LLM (multiprocessing is always used for AsyncLLM right now). It is faster than the current implementation.

[image]

We will want to enable VLLM_ENABLE_V1_MULTIPROCESSING=1, but right now it is a problem for LLM since we cannot spawn without an if __name__ == "__main__" guard. We left solving this issue for follow up work.
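For context, the guard requirement comes from how the spawn start method works: the child process re-imports the parent's main module, so any unguarded top-level code (such as constructing an LLM) would run again in the child. A minimal, generic sketch (not vLLM-specific):

# With the "spawn" start method, the child re-imports this module, so the
# process-launching code must sit behind the __main__ guard to avoid
# recursively spawning children.
import multiprocessing as mp


def engine_core_proc() -> None:
    print("engine core running in a separate process")


if __name__ == "__main__":
    mp.set_start_method("spawn")
    proc = mp.Process(target=engine_core_proc)
    proc.start()
    proc.join()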

robertgshaw2-neuralmagic merged commit 6ace6fb into vllm-project:main on Nov 11, 2024
72 checks passed
robertgshaw2-neuralmagic (Collaborator, Author), replying to @joerunde's questions above (on the "core" naming and on multistep decoding):

  • It's a bit of an unfortunate naming conflict; we can consider moving some files from this diff into v1/core.
  • The goal of V1 is to simplify vLLM and make it fast enough that multistep is not needed, since that code is complex and hard to maintain.

lixiaolx commented Nov 13, 2024

@robertgshaw2-neuralmagic I'm glad to see this optimization PR. I ran into some issues during testing and would like to ask for advice. I tested llama2-7b on 1 GPU with batch=256 using the V1 engine, and compared PR #9289 against this PR. The inter-token gap breaks down as follows:

pr-9289:
[image]

this-pr:
[image]

I am very happy that the new implementation removes the token enqueue and dequeue time, but I found that update_schedule and schedule take longer in the new version, so there is no major change in the total gap time. I compared the code implementations carefully and did not find any big changes there. I wonder if the new multi-threading for encoding and decoding is what makes these steps take longer.

robertgshaw2-neuralmagic (Collaborator, Author):

Hey @lixiaolx - thanks for taking a look. I am having a hard time understanding your analysis - could you clarify?

njhill (Member) commented Nov 13, 2024

Thanks @lixiaolx, nice profiles! What you observe is not unexpected since the scheduling logic currently contends for the GIL with the IPC message serialization/deserialization.

Our intention is to improve this very soon, but doing the IPC work in a separate thread is still a big win as a first step, since much of that work overlaps with parts of the critical loop that don't contend for the GIL, primarily the forward pass on the GPU.
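As an illustration of that overlap, a generic sketch of the pattern (plain threading and pickle stand in for the actual IPC path; this is not the vLLM code):

# The serialization/IPC work runs on a background thread so it can overlap
# with main-thread work that releases the GIL (e.g. the GPU forward pass).
import pickle
import queue
import threading

output_queue: queue.Queue = queue.Queue()


def io_thread() -> None:
    while True:
        item = output_queue.get()
        if item is None:  # shutdown sentinel
            break
        payload = pickle.dumps(item)  # serialization contends for the GIL
        # ... in vLLM this payload would go out over the IPC socket ...


t = threading.Thread(target=io_thread, daemon=True)
t.start()

for step in range(3):
    # The forward pass would run here and release the GIL inside CUDA
    # kernels, letting the IO thread serialize the previous step's outputs.
    output_queue.put({"step": step, "tokens": [1, 2, 3]})

output_queue.put(None)
t.join()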

rickyyx pushed a commit to rickyyx/vllm that referenced this pull request Nov 13, 2024
lixiaolx, replying to @njhill's comment above:

Thank you very much for your answer. I tried to compare against this solution: even if we solve the GIL problem, the remaining gap would still be 2-3 ms according to the calculation above. I would like to ask whether there are any plans for asynchronous scheduling? Compared with sglang's asynchronous approach there is still a gap; I recently measured sglang's overall gap under the same conditions at around 200-300 us. If there is a plan, what is the timeline?

lixiaolx, replying to @robertgshaw2-neuralmagic's request for clarification:

@robertgshaw2-neuralmagic I compared the previous PR with your current PR and did an nsys analysis. I added nvtx ranges to measure the time overhead where the mainloop function is called, and broke down the CPU overhead between consecutive GPU forward passes.
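For reference, NVTX ranges of the kind described can be added with PyTorch's bindings — a minimal sketch in which run_schedule and run_forward are placeholder callables, not the actual vLLM loop:

# Wrap each phase of a step in an NVTX range so nsys can attribute the CPU
# time between consecutive GPU forward passes.
import torch


def step_with_ranges(run_schedule, run_forward, batch):
    torch.cuda.nvtx.range_push("schedule")
    scheduler_output = run_schedule()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("forward")
    model_output = run_forward(batch)
    torch.cuda.nvtx.range_pop()

    return scheduler_output, model_output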

omer-dayan pushed a commit to omer-dayan/vllm that referenced this pull request Nov 14, 2024
lixiaolx:

@robertgshaw2-neuralmagic @njhill Hello, does this PR support multiple GPUs? When testing llama2-70b on 8 GPUs, the server log got stuck here:
[image]
Using nvidia-smi, I found that only GPU 0 was occupied, with only about 500 MB in use.

njhill (Member) commented Nov 14, 2024

@lixiaolx the V1 path is still in an alpha state and does not yet support multiple GPUs, but it will soon.

lixiaolx, following up on the asynchronous-scheduling question above:

@njhill, is there any plan for this asynchronous scheduling?

lixiaolx, replying to @njhill:

OK, thank you.

sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
njhill (Member) commented Nov 14, 2024, replying to the asynchronous-scheduling question above:

Not yet; our plan is to optimize other aspects first, since it will be complex to combine this with certain other optimizations.

KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
Labels: ci/build, frontend, ready
8 participants