[V1] AsyncLLM Implementation #9826
Conversation
```python
if envs.VLLM_USE_V1:
    from vllm.v1.engine.async_llm import AsyncLLMEngine  # type: ignore
else:
    from vllm.engine.async_llm_engine import AsyncLLMEngine  # type: ignore
```
This won't work with tests: anything that tries to monkeypatch with m.setenv("VLLM_USE_V1", True) won't take effect, because this runs the check at import time.
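One way to make the selection respect monkeypatching (a minimal sketch, not necessarily how the PR addresses it) is to defer the env check to call time:

```python
import os

def get_async_engine_cls():
    # Read the environment at call time rather than at module import time,
    # so a test's monkeypatch.setenv("VLLM_USE_V1", "1") takes effect.
    if os.getenv("VLLM_USE_V1", "0") == "1":
        from vllm.v1.engine.async_llm import AsyncLLMEngine  # type: ignore
    else:
        from vllm.engine.async_llm_engine import AsyncLLMEngine  # type: ignore
    return AsyncLLMEngine
```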
@robertgshaw2-neuralmagic @njhill this looks super 🚀🚀🚀 A couple of questions for my own understanding:
Thanks!
```python
with monkeypatch.context() as m:
    m.setenv("VLLM_USE_V1", "1")

    engine = AsyncLLM.from_engine_args(ENGINE_ARGS)
```
Definitely out of scope for this PR, but regarding the follow-up point

More AsyncLLM and LLMEngine tests (abort, stop string, other unit)

We should be able to reuse all the existing tests since the interfaces are the same, right? It'd just be a matter of hooking up a fixture to set the appropriate V1 environment variables and making sure we're initializing the engine under test with a method that returns the appropriate one. I'd be happy to make that my project for the next few days while y'all focus on building the fast stuff.
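A rough fixture along these lines could do it (a sketch only; the exact class names and import paths are assumptions):

```python
import pytest

@pytest.fixture(params=["0", "1"], ids=["v0", "v1"])
def async_engine_cls(request, monkeypatch):
    # Set the env var first, then resolve the engine class, so the V0/V1
    # selection happens after the environment is in place.
    monkeypatch.setenv("VLLM_USE_V1", request.param)
    if request.param == "1":
        from vllm.v1.engine.async_llm import AsyncLLM
        return AsyncLLM
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    return AsyncLLMEngine
```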
Yes, but there are features (e.g. logprobs) that need to land before we can turn on the existing unit tests. Additionally, some of the internal classes that are used for testing (e.g. the scheduler) are now in engine core, so we cannot access them directly from llm_engine. So a lot of the tests may need refactoring given these changes.
Ah, yeah, of course. There's a @skip_v1 mark for tests of unsupported features.

some of the internal classes that are used for testing (e.g. the scheduler) are now in engine core, so we cannot access them directly from llm_engine

Oof, yeah, those sound like some less-than-ideal, nosy tests; I'll have to look at them more.
This pull request has merge conflicts that must be resolved before it can be merged.
…s-proto

# Conflicts:
#	vllm/v1/engine/llm_engine.py
#	vllm/v1/tokenizer/detokenizer.py
Thanks for the great work!
"VLLM_ENABLE_V1_MULTIPROCESSING": | ||
lambda: bool(int(os.getenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0"))), |
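Because the value is registered as a lambda, it is presumably evaluated when the attribute is accessed rather than at import time. A minimal sketch of that registry pattern (not vLLM's actual implementation):

```python
import os

# Map of env var name -> zero-arg callable that parses the current value.
environment_variables = {
    "VLLM_ENABLE_V1_MULTIPROCESSING":
    lambda: bool(int(os.getenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0"))),
}

def __getattr__(name: str):
    # Module-level __getattr__ (PEP 562): evaluate the lambda on every access,
    # so changes to the environment made after import are still picked up.
    if name in environment_variables:
        return environment_variables[name]()
    raise AttributeError(f"module has no attribute {name!r}")
```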
QQ: In which case should we turn this on?
VLLM_ENABLE_V1_MULTIPROCESSING=1 enables multiprocessing for EngineCore inside LLM (multiprocessing is always used for AsyncLLM right now). It is faster than the current implementation.

We will want to enable VLLM_ENABLE_V1_MULTIPROCESSING=1, but right now it is a problem for LLM, since we cannot spawn without an if __name__ == "__main__" guard. We left solving this issue for follow-up work.
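For background, the spawn start method re-imports the launching module in the child process, so script-level code needs the usual guard. A generic Python sketch of the constraint (not vLLM-specific):

```python
import multiprocessing as mp

def engine_core_worker():
    # Stand-in for an engine-core busy loop running in its own process.
    pass

if __name__ == "__main__":
    # Without this guard, the spawned child re-executes the module's top-level
    # code and process creation recurses/fails. A plain script that constructs
    # LLM at module level cannot rely on users adding this guard, which is why
    # enabling multiprocessing by default for LLM is deferred to follow-up work.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=engine_core_worker)
    proc.start()
    proc.join()
```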
@robertgshaw2-neuralmagic I'm glad to see your optimized PR. I found some issues during testing and wanted to ask for advice. I ran llama2-7b on 1 GPU with batch=256, using the V1 engine for testing and analysis, and compared it against your PR. Analyzing the per-token gap: I am very happy that the new implementation has removed the token enqueue and dequeue time, but I found that the new version's update_schedule and schedule take longer, so there is no major change in the total gap time.
Hey @lixiaolx - thanks for taking a look. I am having a hard time understanding your analysis - could you clarify?
Thanks @lixiaolx, nice profiles! What you observe is not unexpected, since the scheduling logic currently contends for the GIL with the IPC message serialization/deserialization. Our intention is to improve this very soon, but doing the IPC work in a separate thread is still a big win as a first step, since much of that work overlaps with parts of the critical loop that don't contend for the GIL, primarily the forward pass on the GPU.
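To illustrate the idea with a toy sketch (not vLLM code): the forward pass releases the GIL inside native kernels, so a background IO thread can serialize/deserialize messages concurrently, even though it still contends with pure-Python scheduling work.

```python
import pickle
import queue
import threading
import time

outbound: "queue.Queue[bytes]" = queue.Queue()

def io_thread():
    # Hypothetical IPC worker: deserializes messages off the critical loop.
    while True:
        payload = outbound.get()
        if payload is None:
            break
        _ = pickle.loads(payload)

def engine_loop(steps: int):
    for _ in range(steps):
        scheduler_output = {"scheduled": list(range(256))}  # GIL-bound scheduling work
        time.sleep(0.01)  # stand-in for the GPU forward pass (GIL released)
        outbound.put(pickle.dumps(scheduler_output))  # hand results to the IO thread

t = threading.Thread(target=io_thread, daemon=True)
t.start()
engine_loop(steps=10)
outbound.put(None)
t.join()
```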
Thank you very much for your answer. I tried comparing this solution: if we solve the GIL problem, the remaining gap time would be 2-3 ms according to the above calculation.
@robertgshaw2-neuralmagic @njhill Hello, does this PR support multiple GPU cards? When testing llama2-70b on 8 GPUs, the server log got stuck here.
@lixiaolx the V1 path is still in an alpha state and does not yet support multiple GPUs, but it will soon.
@njhill, is there any plan for asynchronous scheduling?
OK, thank you.
Not yet; our plan is to optimize other aspects first, since it will be complex to combine this with certain other optimizations.
SUMMARY:
- AsyncLLM in V1 - better overlapping of GPU and CPU

TODO:
- io threads

FOLLOW UP PRS:
- More AsyncLLM and LLMEngine tests (abort, stop string, other unit)
- Enable multiprocessing for LLM by default (need to figure out a way around fork) - currently, need to set VLLM_ENABLE_V1_MULTIPROCESSING=1

DIAGRAM:
- The EngineCoreClient class is used by the AsyncLLM to interact with the EngineCore, but the overall architecture is close to what we have (a simplified sketch follows below).
- output_handler_loop to EngineCore
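A heavily simplified, in-process sketch of the layering described above; the class and method bodies here are assumptions for illustration, not the PR's actual interfaces:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class EngineCoreRequest:
    request_id: str
    prompt: str

class EngineCore:
    """Owns scheduling + model execution; in the real design it may run in its own process."""
    def __init__(self) -> None:
        self.requests: list[EngineCoreRequest] = []
    def add_request(self, req: EngineCoreRequest) -> None:
        self.requests.append(req)
    def step(self) -> list[str]:
        # Schedule and "run" one iteration, returning any outputs.
        return [f"output for {r.request_id}" for r in self.requests]

class EngineCoreClient:
    """Thin client the frontend uses to talk to EngineCore (IPC in the real design)."""
    def __init__(self) -> None:
        self._core = EngineCore()
    def add_request(self, req: EngineCoreRequest) -> None:
        self._core.add_request(req)
    def get_output(self) -> list[str]:
        return self._core.step()

class AsyncLLM:
    """Asyncio frontend: tokenization/detokenization plus an output handler loop."""
    def __init__(self) -> None:
        self.client = EngineCoreClient()
    async def generate(self, request_id: str, prompt: str) -> str:
        self.client.add_request(EngineCoreRequest(request_id, prompt))
        while True:
            # In the PR, a background output_handler_loop pulls these outputs.
            outputs = self.client.get_output()
            if outputs:
                return outputs[0]
            await asyncio.sleep(0)

# Usage (toy): asyncio.run(AsyncLLM().generate("req-0", "Hello"))
```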