Revert "[Kernel] Fuse temperature + softmax in sampling for decode speedup"#22046
Revert "[Kernel] Fuse temperature + softmax in sampling for decode speedup (#…)"

This reverts commit 7a59e05.
Code Review
This pull request removes the fused Triton temperature-softmax kernel and its associated warmup logic, reverting to standard PyTorch operations for temperature scaling and softmax. In the sampler implementation, the code was updated to perform these operations using standard PyTorch calls. I have provided a suggestion to optimize the memory usage by avoiding a redundant copy operation during the softmax calculation.
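For context, the unfused path that this revert restores amounts to two standard PyTorch ops. The sketch below is illustrative only (the function name and call shape are assumptions, not the repo's actual sampler code):

```python
import torch

# Illustrative sketch, not the repository's implementation: temperature
# scaling followed by softmax as two separate standard PyTorch operations,
# which is what the revert falls back to instead of the fused Triton kernel.
def temperature_softmax(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    scaled = logits / temperature           # temperature scaling
    return torch.softmax(scaled, dim=-1)    # normalize to probabilities

probs = temperature_softmax(torch.randn(2, 8), temperature=0.7)
```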
```python
# In-place op to save memory
logits[:] = torch.softmax(logits, dim=-1)
probs = logits
```
The comment # In-place op to save memory is misleading because torch.softmax is not an in-place operation and still requires a temporary allocation of the same size as the input. Additionally, logits[:] = ... performs a redundant data copy into the existing buffer.
To improve efficiency, you can avoid the copy by assigning the result of softmax directly to probs and updating the reference in logits_output.next_token_logits. This avoids the element-wise copy while still ensuring the output object contains the probabilities.
```diff
-# In-place op to save memory
-logits[:] = torch.softmax(logits, dim=-1)
-probs = logits
+probs = torch.softmax(logits, dim=-1)
+logits_output.next_token_logits = probs
```
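A minimal standalone demonstration of the reviewer's point: `torch.softmax` always allocates a fresh output tensor, so writing it back through `logits[:] = ...` adds an element-wise copy into the original buffer, while simply keeping the returned tensor does not. (The tensor shapes here are arbitrary examples.)

```python
import torch

logits = torch.randn(4, 16)
buf_ptr = logits.data_ptr()

# Variant 1: copy the softmax result back into the existing buffer.
# torch.softmax still allocated a temporary of the same size; the slice
# assignment then performs an extra element-wise copy into `logits`.
logits[:] = torch.softmax(logits, dim=-1)
assert logits.data_ptr() == buf_ptr  # same storage as before

# Variant 2: rebind to the tensor softmax returns; no copy into the old buffer.
probs = torch.softmax(logits, dim=-1)
assert probs.data_ptr() != buf_ptr   # new allocation, distinct from `logits`
```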
Reverts #20501