[Kernel] Triton-based Top-k and Top-p sampler kernels #33538
njhill merged 118 commits into vllm-project:main
Signed-off-by: js_park <cakeng@naver.com>
Hi @cakeng, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Nick Hill <nickhill123@gmail.com>
This particular commit causes the first large-context inference attempt to OOM if VRAM is tight. Maybe a Triton JIT compilation spike is causing this? It fails during the DCP all-gather phase, which dynamically allocates memory for the gathered tensor. Sounds like there should be some kind of warmup triggered during the startup phase, if possible, instead of rolling the dice in production?
…3538) Signed-off-by: js_park <cakeng@naver.com> Signed-off-by: Jongseok Park <37990712+cakeng@users.noreply.github.com> Signed-off-by: Sunga Kim <sunga.kim@berkeley.edu> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Sunga Kim <sunga.kim@berkeley.edu> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Re-opening PR #25824, with correctness and benchmark scripts from @njhill's PR #32558.
It passes all correctness tests and is faster overall than #32558, except in Top-p-only cases. Compared to #32558, this algorithm adds a truncation step that uses a stochastic cutoff to gather a small "outlier" subset of the logits, reducing the search space. The kernel gathers the outlier subset into buffers of shape [num_program, vocab_size], which requires roughly 80 MiB of extra VRAM.
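To illustrate why a truncation step can shrink the search space without changing the result, here is a minimal pure-Python sketch for the Top-k case. It is not the kernel's code: the function name `top_k_with_truncation`, the `num_sigmas` parameter, and the choice of a mean-plus-standard-deviation cutoff are all assumptions made for illustration, and the PR's actual cutoff may be computed differently.

```python
import statistics


def top_k_with_truncation(logits, k, num_sigmas=2.0):
    """Mask everything below the k-th largest logit with -inf.

    Hypothetical sketch: the cutoff rule (mean + num_sigmas * std) is an
    assumption for illustration, not the rule used by the PR's kernel.
    """
    mean = statistics.fmean(logits)
    std = statistics.pstdev(logits)
    cutoff = mean + num_sigmas * std

    # Gather the small "outlier" subset that lies above the cutoff.
    outliers = [x for x in logits if x >= cutoff]

    # If the subset holds at least k values, the k largest logits are all
    # inside it, so the k-th largest can be found there. Otherwise fall
    # back to searching the full row.
    candidates = outliers if len(outliers) >= k else logits
    kth_largest = sorted(candidates, reverse=True)[k - 1]

    # Values tied with the k-th largest are kept in this sketch.
    return [x if x >= kth_largest else float("-inf") for x in logits]


row = [0.0] * 20 + [5.0, 6.0, 7.0]
masked = top_k_with_truncation(row, k=2)
```

In this example only three of the 23 values exceed the cutoff, so the k-th largest logit is found by inspecting three candidates rather than the whole row; when the subset is too small, the sketch simply searches everything.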
This implementation also uses
p_over_pivots_sum >= p AND p_over_pivots_sum - (p_over_pivots_min * num_p_over_pivots_min) < p
for its Top-p search termination condition. This condition looks for the pivot where "the sum of probabilities over the pivot is larger than p, but exclusion of the smallest probability over the pivot pushes the sum below p", which should be a more accurate Top-p condition than PR #32558. The algorithm also includes handling of duplicate logits or probabilities.

Below are the execution latency and memory usage comparisons against PR #32558 and PyTorch.
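For reference, the Top-p termination condition described above can be written out as a small pure-Python check. This is a sketch under assumptions rather than the kernel's code: it treats "over the pivot" as `>= pivot`, the helper names `top_p_pivot_is_valid` and `find_top_p_pivot` are hypothetical, and the linear scan over candidate pivots stands in for whatever pivot search the kernel actually performs.

```python
def top_p_pivot_is_valid(probs, pivot, p):
    """Check the termination condition for one candidate pivot.

    Assumption: "over the pivot" means prob >= pivot.
    """
    over = [x for x in probs if x >= pivot]
    if not over:
        return False
    p_over_pivots_sum = sum(over)
    p_over_pivots_min = min(over)
    # Duplicates of the smallest kept probability are removed together.
    num_p_over_pivots_min = over.count(p_over_pivots_min)
    return (
        p_over_pivots_sum >= p
        and p_over_pivots_sum - p_over_pivots_min * num_p_over_pivots_min < p
    )


def find_top_p_pivot(probs, p):
    """Hypothetical helper: try each distinct probability, largest first."""
    for pivot in sorted(set(probs), reverse=True):
        if top_p_pivot_is_valid(probs, pivot, p):
            return pivot
    return None


probs = [0.5, 0.2, 0.2, 0.1]
pivot = find_top_p_pivot(probs, p=0.8)
```

With these values the pivot lands on 0.2: keeping 0.5 and both 0.2 entries covers at least 0.8 of the mass, while dropping the two tied 0.2 entries together leaves only 0.5, which is below p. Multiplying the minimum by its count is what makes the check behave sensibly when probabilities are duplicated.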