[Performance][Spec Decode] Optimize ngram lookup performance #9333
Conversation
LGTM!
btw does this approach still have a speedup if the prompt length is much longer? I'm just thinking about the trade-off between CPU-GPU sync overhead and (maybe) slower CPU computation.
Yeah, will do more benchmarks here.
I think this is worth considering just for the simplicity aspect. It could even make sense to write a CPU kernel in C++ instead of trying to do it on the GPU.
Will change the PR so that we can switch the device based on the sequence length.
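A minimal sketch of what such a device-selection heuristic might look like (the function name and threshold are hypothetical, not from this PR):

```python
def choose_lookup_device(seq_len: int, threshold: int = 2048) -> str:
    """Pick where to run the ngram lookup.

    Hypothetical heuristic: for short sequences a CPU scan is cheap and
    avoids CPU <-> GPU sync overhead; for long sequences the parallel
    GPU kernels can amortize that overhead.
    """
    return "cpu" if seq_len < threshold else "cuda"
```

The threshold would need to be tuned empirically from benchmarks like the ones below.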
After benchmarking the performance of ngram speculative decoding in vLLM, the proposal time is longer than expected. The main reasons are: (1) CPU <-> GPU communication when building the ngram lookup table, and (2) building the ngram involves many small kernels (duration < 5 microseconds), as shown below:
Zooming in on the propose time:
This PR tries to:
(1) perform the lookup operation on the CPU.
(2) trigger CPU <-> GPU communication only when there is a match in the lookup.
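For illustration, the CPU-side lookup could look roughly like the following pure-Python sketch (the function name and parameters are illustrative; the actual PR operates on token tensors): find the most recent earlier occurrence of the trailing ngram and propose the tokens that followed it. Only when a match is found would the proposed tokens need to be copied to the GPU.

```python
from typing import Optional

def ngram_propose(
    token_ids: list[int],
    ngram_size: int = 3,
    num_speculative_tokens: int = 5,
) -> Optional[list[int]]:
    """Look for an earlier occurrence of the last `ngram_size` tokens;
    if found, propose the tokens that followed that occurrence.
    Returns None when there is no match (no GPU sync needed)."""
    if len(token_ids) < ngram_size + 1:
        return None
    tail = token_ids[-ngram_size:]
    # Scan backwards so the most recent match wins.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            follow = token_ids[start + ngram_size:
                               start + ngram_size + num_speculative_tokens]
            if follow:
                return follow
    return None
```

The key point matching the PR: when this returns None, no CPU <-> GPU transfer is triggered at all, whereas the previous GPU-side approach paid the sync cost on every step.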
Some performance numbers on a single H100:
input_len: 550, output_len: 150
I changed the prompt to try different system efficiencies (which may affect the number of CPU <-> GPU syncs).
input_len: 2048, output_len: 150