
Conversation

@hyeygit (Contributor) commented Mar 29, 2025

Top-k and top-p sampling are slow on TPU because the existing algorithms rely on torch.scatter, which is extremely slow on TPU. There is ongoing work to optimize it, but until that lands we need an alternative algorithm that avoids scattering.

The algorithm in this PR avoids torch.scatter by finding a "cut-off" element in the original logits; after thresholding the logits with this cut-off value, the remaining elements constitute the top-p set. This is inspired by the apply_top_k_only algorithm created by @njhill in #15478.
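
For intuition, here is a minimal sketch of the threshold-based idea (illustrative only; the function name is hypothetical, and the exact boundary and tie handling in the PR differ, e.g. the perturbation discussed below):

import torch

def apply_top_p_by_threshold(logits: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    # Sort once; softmax of the sorted logits gives sorted probabilities.
    logits_sorted, _ = logits.sort(dim=-1, descending=True)
    cum_probs = logits_sorted.softmax(dim=-1).cumsum(dim=-1)
    # Smallest prefix whose cumulative probability reaches p (always >= 1 token).
    top_p_count = (cum_probs < p.unsqueeze(-1)).sum(dim=-1) + 1
    cutoff_index = (top_p_count - 1).clamp(max=logits.shape[-1] - 1).unsqueeze(-1)
    # The cut-off value is the logit of the last token inside the top-p set.
    cutoff = logits_sorted.gather(-1, cutoff_index)
    # Threshold the original logits instead of scattering sorted values back.
    return logits.masked_fill(logits < cutoff, float("-inf"))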

Benchmark

Microbenchmark (on v6e-1) shows a significant speedup -- "Running 32 elapsed time" is ~5 ms, down from ~500 ms with the original scatter-based algorithm, a ~100x improvement.

Microbenchmark full results on v6e-1
$ VLLM_USE_V1=1 python sampler_microbenchmark.py 
INFO 03-31 14:40:10 [__init__.py:239] Automatically detected platform tpu.
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
INFO 03-31 14:40:15 [topk_topp_sampler.py:82] Using approximate top-p optimized for TPU. Result may in theory differ from the exact algorithm if there are tokens with near-identical probabilities (< 1e-9 diff).
Compiling/Warmup 1 elapsed time: 9.270433902740479
Compiling/Warmup 4 elapsed time: 9.104885816574097
Compiling/Warmup 16 elapsed time: 8.811976194381714
Compiling/Warmup 32 elapsed time: 8.926635026931763
Running 1 elapsed time: 0.004515171051025391
Running 1 elapsed time: 0.003937482833862305
Running 1 elapsed time: 0.003930091857910156
Running 1 elapsed time: 0.0038993358612060547
Average time:  0.0040705204010009766
Running 4 elapsed time: 0.0042819976806640625
Running 4 elapsed time: 0.004051685333251953
Running 4 elapsed time: 0.00403141975402832
Running 4 elapsed time: 0.004080057144165039
Average time:  0.004111289978027344
Running 16 elapsed time: 0.00475311279296875
Running 16 elapsed time: 0.0045032501220703125
Running 16 elapsed time: 0.0044858455657958984
Running 16 elapsed time: 0.19616365432739258
Average time:  0.052476465702056885
Running 32 elapsed time: 0.00586247444152832
Running 32 elapsed time: 0.005380868911743164
Running 32 elapsed time: 0.005262851715087891
Running 32 elapsed time: 0.0052433013916015625
Average time:  0.005437374114990234

Extra notes

The VLLM_TPU_DISABLE_TOPK_TOPP_OPTIMIZATION environment variable (introduced in #15242) can now be removed. Not done in this PR since @NickLucche's pending PR #15489 already handles it (thanks!).

@github-actions bot

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 29, 2025
@hyeygit hyeygit force-pushed the tpu_topp branch 4 times, most recently from 15534fe to 00ab67b Compare March 30, 2025 18:53
@mergify mergify bot added the tpu Related to Google TPUs label Mar 30, 2025
@hyeygit hyeygit changed the title from "[V1][TPU] Speed up top-p for TPU by avoiding scattering." to "[V1][TPU] TPU-optimized top-p implementation (avoids scattering)." Mar 30, 2025
@hyeygit hyeygit marked this pull request as ready for review March 30, 2025 19:06
@brittrock left a comment

Great work, thank you for adding this @hyeygit, including the notes on expected speedups. I just have minor usability feedback :)

Thanks @hyeygit , this seems reasonable to me in the interim but I'll let the other folks chime in on the appropriateness. Cc @yaochengji @yarongmu-google

Given this can slightly impact the generated output during ties, this really feels like something we should be warning the user about. Not every time the function is called, of course, but at a minimum, we should be warning users when the argument is set to anything other than the default. I couldn't find a warning log but I'm also on my phone, so apologies if I just missed it.

@hyeygit (Contributor Author)

Thanks for the review @brittrock. I added a one-time log message noting that this algorithm is approximate in theory.

In practice, I think the tiny 1e-9 probability perturbation doesn't alter the result in any meaningful way. The only situation where the output differs from the exact algorithm is when there are multiple tokens whose probabilities are within 1e-9 (one in a billion) of each other. In that case they practically have the same probability, so including either one of them in the top-p set should be acceptable.
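
(A hypothetical numeric illustration of the only case where results can differ -- the numbers below are made up:)

import torch

p = 0.8
# Tokens 1 and 2 differ by less than 1e-9, i.e. they are effectively tied.
probs = torch.tensor([0.6, 0.2 + 5e-10, 0.2 - 5e-10])
# Exact top-p keeps tokens 0 and 1 (cumulative probability reaches 0.8).
# After a random perturbation of up to ~1e-9, tokens 1 and 2 may swap order,
# so the approximate set could instead be tokens 0 and 2 -- practically
# equivalent, since the two tokens have essentially the same probability.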

Contributor

thanks for adding!

I agree in practice, probably ok, but this could break accuracy tests and so good idea to include in any case.

nice job, again!

Contributor

I signed in from my phone and must have created another github account >.<

ignore my alter ego's request for review @hyeygit 😆

@hyeygit (Contributor Author) commented Mar 31, 2025

I agree in practice, probably ok, but this could break accuracy tests and so good idea to include in any case.

Agreed, makes sense!

ignore my alter ego's request for review @hyeygit 😆

Haha no worries!

@NickLucche (Collaborator) left a comment

Great job here @hyeygit ! Left some comments about tests.

Please remember to enable top-p like I've done for top-k here: https://github.com/vllm-project/vllm/pull/15489/files. Otherwise I can enable both in the same PR.

Collaborator

I think if the test is under v1/tpu we shouldn't test for CUDA but skip if platform is not tpu. Otherwise we can move the test into the shared directory.

@hyeygit (Contributor Author)

Yep makes sense. Updated to TPU only.

Collaborator

nice test!

@hyeygit (Contributor Author)

Credit to @njhill's apply_top_k_only test!

Collaborator

can we add this test to run-tpu-v1-test.sh?

@hyeygit (Contributor Author)

Done.

Comment on lines 142 to 143
Collaborator

is this top-p (to be used as topp-only) implementation also needed on gpu? @njhill

@hyeygit (Contributor Author)

IIUC scattering isn't a bottleneck on GPU, so this impl wouldn't bring much benefit (plus, this impl still involves a full vocab sort, same as the forward_native version).

Collaborator

good point

@hyeygit hyeygit mentioned this pull request Mar 31, 2025
@mergify mergify bot added the ci/build label Mar 31, 2025

mergify bot commented Mar 31, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @hyeygit.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 31, 2025
@hyeygit (Contributor Author) commented Mar 31, 2025

Please remember to enable top-p like I've done for top-k here: https://github.com/vllm-project/vllm/pull/15489/files. Otherwise I can enable both in the same PR.

It might be cleaner to enable top-p in your PR (perhaps rename it to "Enable both Top-k and Top-p") since it's blocked by this and the to-be-sent-out top-k PR. Let me know if that sounds alright! @NickLucche

@NickLucche (Collaborator)

@hyeygit Works for me!
I may still do the enablement one at a time to track benchmarks.

@njhill (Member) commented Mar 31, 2025

Thanks @hyeygit, this looks good. However, I don't really understand the need for the random perturbation / tie-breaking.

The behaviour without doing this will already be effectively the same as an arbitrary tie-break. The random perturbation doesn't even make it deterministic. If we want it to be deterministic we can set stable=True in the sort operation (but not suggesting that's needed).
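
(For reference, a minimal illustration, assuming the sampler's sort is the standard torch.sort call:)

# stable=True preserves the original order among equal logits, which would
# make the implicit tie-break deterministic.
sorted_logits, sorted_indices = torch.sort(logits, dim=-1, descending=True, stable=True)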

Also, this implicit tie-breaking behaviour is the same in the existing implementation; I don't see how it's something specific to the new one.

Re your related comment on the top_k PR:

However I think one corner case where this would break is if there are duplicate elements in the logit that equal the cut off value (i.e. top_k_mask). For example, given an input of [1, 2, 2, 2, 3] and k=3, the current apply_top_k_only would return [-inf, 2, 2, 2, 3] while the correct result should be [-inf, -inf, 2, 2, 3].

Again this is no different to how the pre-existing impl works. And I think it's ok for the actual shortlist to comprise more than k tokens in the case that there's a tie for the kth highest probability. I would not characterize that as "incorrect" since this case is not really well-defined. And it's reasonable from an intuition pov since the tied tokens have equal likelihood.

So hopefully this PR can be simplified to remove that part?

It would be interesting to also test whether this is meaningfully faster on GPUs. I assume most of the overhead is the sort, but if this is even slightly faster we might as well change the existing impl to use the count + mask approach.
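
(For concreteness, a hedged sketch of that count/threshold-and-mask idea for top-k -- the function name is illustrative, not existing vLLM code. Note that any tokens tied with the k-th value are kept, matching the [1, 2, 2, 2, 3] example above:)

import torch

def apply_top_k_by_threshold(logits: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # Per-row threshold = the k-th largest logit; mask everything below it.
    # No scatter: the original logits are compared against the threshold.
    logits_sorted, _ = logits.sort(dim=-1, descending=True)
    cutoff = logits_sorted.gather(-1, (k - 1).unsqueeze(-1).long())
    return logits.masked_fill(logits < cutoff, float("-inf"))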

@yaochengji (Collaborator) left a comment

LGTM, thanks!

@mgoin (Member) commented Apr 2, 2025

Please see the failing v1 test; it looks like the CUDA sampler tests are failing: https://buildkite.com/vllm/ci/builds/16819/steps?jid=0195f7bd-4a4f-4517-af78-dc4a6772ba71

# topk.values tensor has shape [batch_size, max_top_k].
# Convert top k to 0-based index in range [0, max_top_k).
- k_index = k.sub_(1).unsqueeze(1)
+ k_index = k.sub_(1).unsqueeze(1).expand(logits.shape[0], 1)
Member

@hyeygit is this because of a TPU torch broadcasting limitation?

@hyeygit (Contributor Author) Apr 2, 2025

Yes, I think so. Without the explicit expand this fails on XLA due to a shape mismatch.
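
(For illustration, a hedged sketch of the pattern under discussion; topk_values and the surrounding shapes are assumptions, not the exact PR code:)

# topk_values: [batch_size, max_top_k]; k: [batch_size] per-request top-k.
# On XLA the gather index needs an explicit [batch_size, 1] shape; relying on
# implicit broadcasting can fail with a shape-mismatch error.
k_index = k.sub_(1).unsqueeze(1).expand(logits.shape[0], 1)
cutoff = topk_values.gather(dim=1, index=k_index.long())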

@hyeygit (Contributor Author) commented Apr 2, 2025

Please see the failing v1 test; it looks like the CUDA sampler tests are failing: https://buildkite.com/vllm/ci/builds/16819/steps?jid=0195f7bd-4a4f-4517-af78-dc4a6772ba71

Oh this is probably caused by my incorrect rebase -- had some duplicate lines in the sampler. After resolving the conflicts the tests seem to pass.

Thanks for the approval!

@robertgshaw2-redhat robertgshaw2-redhat merged commit 1b84eff into vllm-project:main Apr 3, 2025
31 checks passed
hyeygit added a commit to hyeygit/vllm that referenced this pull request Apr 3, 2025
Previously we found that using torch.topk resulted in a significant
speedup for TPU. Turns out that's not a viable solution because the
return shape of torch.topk depends on k, which means an XLA recompilation
is triggered every time k changes.

Additionally, we realized that torch.scatter was the main bottleneck for
the original top-k impl on TPU. This PR circumvents both problems by using
a threshold-based approach to find the top-k set. The algorithm is nearly
identical to that of top-p; see vllm-project#15736 for more details.

Signed-off-by: Hyesoo Yang <[email protected]>