[Kernel] Optimize sample_recovered_tokens_kernel #34974
vllm-bot merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request significantly optimizes the sample_recovered_tokens_kernel by implementing tiled reduction over the vocabulary and replacing division with multiplication for improved efficiency. The changes align well with the stated purpose of reducing register pressure and enhancing occupancy, leading to a reported 5x kernel-level improvement. The addition of a native Python reference implementation and comprehensive unit tests ensures the correctness of the optimized kernel. The profiling and benchmarking results demonstrate the positive impact of these optimizations. Overall, this is a well-executed and beneficial performance improvement.
Hi @xyang16, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Xin Yang <xyangx@amazon.com>
mgoin left a comment:
Really good find, using a block size of vocab_size is indeed a very bad idea. This makes sense to me and I don't think it should be controversial
inv_q = q.reciprocal()

recovered_token_ids = torch.empty_like(draft_token_ids)
BLOCK_SIZE = 8192
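The snippet above precomputes a reciprocal so the kernel can multiply instead of divide. A minimal plain-Python sketch of the same rewrite (illustrative names, not the PR's code):

```python
# Sketch of the division-to-multiplication rewrite (illustrative, not the PR's code).
# Instead of dividing by q for every candidate token, precompute 1/q once
# and multiply inside the hot loop, since division is the slower operation.
def scores_div(probs, q):
    return [p / q for p in probs]

def scores_mul(probs, q):
    inv_q = 1.0 / q  # computed once, outside the hot loop
    return [p * inv_q for p in probs]

probs = [0.1, 0.2, 0.7]
assert scores_div(probs, 0.5) == scores_mul(probs, 0.5)
```

In the kernel itself, q is a tensor and the reciprocal is taken once on the host side (q.reciprocal() in the diff above), so every program instance pays only a multiply per element.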
nit: since you do test with a vocab_size of 100, this does seem wasteful to not take the min or something
Thanks for the review!
Yes, I thought of having BLOCK_SIZE = min(8192, triton.next_power_of_2(vocab_size)).
But since most models' vocab_size is very big and a vocab of 100 only happens in tests, I thought next_power_of_2 would add some overhead.
But I'm happy to make the change. Please let me know what you think.
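The alternative discussed here can be sketched in plain Python; next_power_of_2 below mirrors what triton.next_power_of_2 computes (this is a sketch of the idea, not the code merged in the PR):

```python
# Sketch of the discussed alternative: cap BLOCK_SIZE at the next power of two
# above vocab_size, so tiny test vocabularies (e.g. 100) don't get an
# 8192-wide block. Mirrors triton.next_power_of_2; not the PR's final code.
def next_power_of_2(n: int) -> int:
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def pick_block_size(vocab_size: int, cap: int = 8192) -> int:
    return min(cap, next_power_of_2(vocab_size))

assert pick_block_size(100) == 128      # small test vocab uses a small block
assert pick_block_size(151936) == 8192  # typical LLM vocab stays at the cap
```

Since real model vocabularies are far above 8192, the min() only changes behavior in tests, which is why the reviewers agreed it was not worth worrying over.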
I agree it is probably not worth worrying over
Signed-off-by: joezuo <qianzhou.zuo@gmail.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Purpose
This PR optimizes the sample_recovered_tokens_kernel in the rejection sampler.
- Instead of creating a tl.arange(0, PADDED_VOCAB_SIZE) vector and loading the whole vocab in one program, load it in chunks of BLOCK_SIZE. This gives lower register pressure and better occupancy.
- Replace the score = prob / q division with score = prob * inv_q, because division is slower than multiplication (inv_q is precomputed in gpu_model_runner.sample()).

Test Plan
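The tiled reduction described above can be sketched in plain Python: rather than scanning the whole vocabulary in one shot, each BLOCK_SIZE chunk is reduced locally and combined with a running best. (An illustrative sketch of the strategy, not the actual Triton kernel.)

```python
# Illustrative sketch of tiled (chunked) argmax over the vocab dimension,
# mirroring the kernel's strategy of processing BLOCK_SIZE tokens at a time
# instead of one vocab_size-wide tl.arange. Not the actual Triton code.
def chunked_argmax(scores, block_size):
    best_idx, best_val = 0, float("-inf")
    for start in range(0, len(scores), block_size):
        chunk = scores[start:start + block_size]
        # local reduction over one tile
        local_idx = max(range(len(chunk)), key=chunk.__getitem__)
        if chunk[local_idx] > best_val:
            best_val = chunk[local_idx]
            best_idx = start + local_idx
    return best_idx

scores = [0.01, 0.3, 0.05, 0.9, 0.2]
assert chunked_argmax(scores, block_size=2) == 3
```

In the Triton kernel the same idea means each program only holds BLOCK_SIZE elements in registers at a time, which is what lowers register pressure and improves occupancy compared with materializing the full padded vocab width.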
Added unit test in test_rejection_sampler.py.

Test Result
Unit tests passed.
Profiling
(Profiling screenshots for Main and PR omitted.)
Main: sample_recovered_tokens_kernel takes around 159µs.
PR: sample_recovered_tokens_kernel takes around 29µs.

Benchmarking
e2e output token throughput improves by 1%.
Accuracy Testing
Essential Elements of an Effective PR Description Checklist
(If applicable) Update supported_models.md and examples for a new model.