[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step#6338
Conversation
Force-pushed from 9711114 to bfa1647
|
@comaniac addressed review comments. Looks much better now. |
comaniac
left a comment
It's much better now!
Left some minor comments. The major question I have is whether it's beneficial to also include attention metadata updating in the CUDA kernel.
|
@comaniac addressed the second set of review comments, ready for final pass. Now working on getting all tests green. |
comaniac
left a comment
LGTM. Thanks! Last batch of minor comments. If possible, it'd be great to have some e2e benchmarks (I could help with that if needed — just let me know).
I'll leave it to @cadedaniel to take a pass.
|
I will take a look at this tomorrow. One thing that needs validation is that the draft acceptance rate is still 100% when using the same draft and target model. I ran this PR locally and got <100%, which indicates some correctness issue. |
|
That's a good way to evaluate. We should add this test to the CI in this PR. |
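A CI check along the lines discussed here could compare the number of accepted draft tokens against the number proposed when the draft and target models are identical (the helper below is illustrative, not vLLM's actual API):

```python
def acceptance_rate(num_accepted: int, num_proposed: int) -> float:
    """Fraction of proposed draft tokens accepted by the target model."""
    if num_proposed == 0:
        return 0.0
    return num_accepted / num_proposed

# With an identical draft and target model, every proposed token should be
# accepted, so the rate must be exactly 1.0; anything lower signals a
# correctness bug in the speculative decoding path.
assert acceptance_rate(128, 128) == 1.0
assert acceptance_rate(100, 128) < 1.0
```

This is the property the reviewers used to catch the initial correctness issue in this PR.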
|
@cadedaniel @comaniac addressed the acceptance rate issue; it is now 100% for both the GPU and non-GPU prepare_inputs paths. |
|
Addressing other review comments now |
|
Addressed review comments |
cadedaniel
left a comment
Looks good, all comments are nits/docs. The missing piece is a testing story: can we guarantee that this codepath runs in CI, and fail a test if it doesn't? e.g. in the test for this codepath, assert that the GPU backend is used at least once. This is important because otherwise we could silently lose coverage of this codepath if a dependency changes in CI.
Would be great to have an explicit case for when the GPU path is disabled as well, so both are tested regardless of CI conditions.
If these are too hard, let's add a big docstring about it and revisit.
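One way to guarantee the codepath runs in CI, as suggested above, is to spy on the GPU entry point with a mock and fail the test if it was never invoked. A minimal sketch (the `Runner` class and method names below are hypothetical stand-ins, not vLLM's actual classes):

```python
from unittest import mock


class Runner:
    # Hypothetical stand-in for the draft model runner; the real class
    # chooses between a GPU kernel and a CPU fallback internally.
    def _advance_step_gpu(self) -> None:
        pass

    def execute(self, use_gpu: bool) -> None:
        if use_gpu:
            self._advance_step_gpu()


def test_gpu_path_is_exercised() -> None:
    runner = Runner()
    with mock.patch.object(
            Runner, "_advance_step_gpu",
            wraps=runner._advance_step_gpu) as spy:
        runner.execute(use_gpu=True)
    # Fail loudly if a dependency change silently disables the GPU path.
    assert spy.call_count >= 1
```

The same pattern with `use_gpu=False` covers the explicit fallback case, so both paths are tested regardless of CI conditions.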
FYI @LiuXiaoxuanPKU , want to make sure you're aware for CUDA graph <> MQA integration
|
Added a test and fixed the fallback |
|
btw can you rebase or merge the latest main to make sure everything is up to date? |
Force-pushed from bb9c4d8 to e5f4265
|
rebased |
|
Added Cody's mock test |
cadedaniel
left a comment
Great contribution @alexm-neuralmagic! Thanks also to @comaniac for all the help.
|
I'll merge this first and make a small patch for the variable naming. |
This PR moves the implementation of prepare_inputs(..) for advance_step(..) (inside draft_model_runner.py) to the GPU. It adds a new GPU kernel that modifies the prepare_inputs tensors directly on-device, and also improves the sampler inside draft_model_runner to reduce unnecessary GPU<=>CPU transfer overhead.
A quick performance check on an A100 GPU, measuring 50th-percentile latency: current main sits at 500ms; the sampler improvements alone bring it to 477ms; the sampler improvements plus the GPU lowering of prepare_inputs bring it to 329ms — a 65% improvement vs. main.
TODOs
- draft_model_runner: new — prepare_inputs on GPU + sampler improvements
- draft_model_runner: new — prepare_inputs on CPU + sampler improvements
- draft_model_runner: main — prepare_inputs on CPU
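The on-device update this PR describes can be sketched with plain PyTorch tensor ops; this is a simplified stand-in for the fused CUDA kernel, and the tensor names mirror prepare_inputs but are illustrative rather than vLLM's exact signatures:

```python
import torch


def advance_step_gpu(input_tokens: torch.Tensor,
                     input_positions: torch.Tensor,
                     seq_lens: torch.Tensor,
                     slot_mapping: torch.Tensor,
                     sampled_token_ids: torch.Tensor,
                     block_size: int,
                     block_tables: torch.Tensor) -> None:
    """Advance one speculative decode step entirely on-device.

    Every tensor stays on the GPU, so no GPU<=>CPU synchronization is
    needed between draft-model forward passes.
    """
    # The next input token for each sequence is the token just sampled.
    input_tokens.copy_(sampled_token_ids.flatten())
    # The new token's position equals the current sequence length.
    input_positions.copy_(seq_lens)
    # Map each new position to its KV-cache slot via the block table.
    block_idx = seq_lens // block_size
    block_off = seq_lens % block_size
    blocks = block_tables.gather(1, block_idx.unsqueeze(1)).squeeze(1)
    slot_mapping.copy_(blocks * block_size + block_off)
    # Each sequence grew by one token.
    seq_lens.add_(1)
```

In the real kernel these updates are fused into a single launch, which is what removes the per-step CPU round trips measured in the benchmark above.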