vLLM's V1 Engine Architecture #8779
Comments
I want to highlight that the re-architecture will only affect vLLM developers who need to change vLLM's code, and it should make their lives easier. For users who use vLLM directly, there will be no breaking changes except for beam search. We hope to bring better performance for users as well as an extensible architecture for developers.
As vLLM supports more and more models and features, they require different attention implementations, schedulers, executors, and input/output processors. These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results. Take support for encode-only models as an example. Although encode-only models are much simpler than decoder models, they are very different. The simplest way to support encode-only models is to implement different modules for models of different architectures and load the required modules on demand. I call this architecture Workflow Defined Engine, or WDE for short. I'm implementing an async scheduler (async single-step scheduling).
mark
The Workflow Defined Engine draft pull request is almost complete, at almost 10,000 lines of code. As @DarkLight1337 said:
Therefore, we hope to invite more people to participate, including but not limited to providing suggestions, joining discussions, aligning the work with vLLM's V2 engine architecture goals, discussing how to break it into stages, and helping review code for future PRs. Let me briefly introduce the content of this PR, including:
What new models need to be supported
These models all come from issues and are also very well known:
These models are roughly divided into three categories:
What new features these new models have
What the above three categories have in common is that they have only a prefill stage. To make the terminology more precise, "prefill only" is used below. You can think of "prefill only" as a fancier way of writing "encode only". New features:
How the engine architecture needs to support these features flexibly and efficiently
If we directly add new functions to existing modules, these modules become increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results. The most flexible and efficient way to support prefill-only models is to implement different modules for models of different architectures and load the required modules on demand. I call this architecture Workflow Defined Engine, or WDE for short. I divided the engine into the following modules.
With WDE, there is no need for one module to be compatible with every function. You can use Python's dynamic loading to select different modules at the highest level, for different models and different needs.
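The on-demand loading idea can be pictured with a minimal sketch. Everything named here (`Workflow`, `WORKFLOWS`, `load_object`, `build_engine`, and the `module:Name` path format) is hypothetical and is not taken from the WDE pull request; standard-library classes stand in for real scheduler, executor, and output-processor implementations so that the example runs on its own.

```python
import importlib
from dataclasses import asdict, dataclass


def load_object(dotted_path: str):
    """Import 'package.module:Name' on demand and return the named attribute."""
    module_name, _, attr = dotted_path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


@dataclass(frozen=True)
class Workflow:
    """Hypothetical workflow definition: one import path per engine role."""
    scheduler: str
    executor: str
    output_processor: str


# Assumed registry for illustration only. The stdlib targets are stand-ins;
# a real engine would point these at its own module implementations.
WORKFLOWS = {
    "prefill_only": Workflow(
        scheduler="queue:Queue",
        executor="concurrent.futures:ThreadPoolExecutor",
        output_processor="json:dumps",
    ),
    "decode": Workflow(
        scheduler="queue:PriorityQueue",
        executor="concurrent.futures:ThreadPoolExecutor",
        output_processor="json:dumps",
    ),
}


def build_engine(kind: str) -> dict:
    """Resolve only the modules that the selected workflow names."""
    workflow = WORKFLOWS[kind]
    return {role: load_object(path) for role, path in asdict(workflow).items()}


if __name__ == "__main__":
    engine = build_engine("prefill_only")
    for role, obj in engine.items():
        print(role, "->", obj)
```

The point of the sketch is that each workflow lists its own modules, so a prefill-only scheduler never has to carry decode-specific logic; modules for other workflows are simply never resolved.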
Given the driver process + SPMD workers design, is there a chance to separate the LLMEngine process and the worker processes onto different nodes (servers)? To be more concrete, the OpenAPI server process and the LLMEngine process would live on a CPU-only node with a high-performance CPU, while the worker processes would live on normal GPU node(s). I guess this idea is somewhat related to the Ray SPMD worker (#6556), though I suspect its current implementation does not support a distributed LLMEngine process.
@simon-mo is the team considering moving away from Python?
mark
It is probably easier to Cythonize the critical bits and wait for Python 3.13 support in torch.
We notice that when input lengths are short, for example less than 200, the prefill stage leaves the GPU idle too much of the time.
@simon-mo V1 seems to be a huge performance bump in terms of sampling and multi-modality support. However, beam search provides flexibility for users who don't care about overall speed. Do we have a solution for stand-alone beam search right now?
This issue describes the high-level directions for creating LLM Engine V1. We want the design to be as transparent as possible, and created this issue to track progress and solicit feedback.
Goal:
Non-goals, the following are important but orthogonal:
The scope is exclusively the scheduler, memory manager, and distributed architecture. We will not touch APIs, models, kernels, or most parts of the model runner.
Highlights of the new design:
Lessons we learned from V1:
`self.running` queue and performs some operations for each request (e.g., allocating a new block). And this is written in Python.
Timeline-wise, we plan to execute the changes incrementally. Over time we will add PRs and issues related to the new architecture here.
The design is led by the vLLM maintainers @WoosukKwon @zhuohan123 @youkaichao @simon-mo @LiuXiaoxuanPKU @comaniac @alexm-neuralmagic @njhill @robertgshaw2-neuralmagic @rkooo567 and many others!