[Bugfix] Fix int32 overflow in DeepGEMM SiLU/mul FP8 Triton kernel#42201

Merged
yewentao256 merged 3 commits into vllm-project:main from Flink-ddd:fix/deepgemm-silu-fp8-int32-overflow
May 11, 2026

@Flink-ddd Flink-ddd commented May 10, 2026

Contributor

Purpose

Fixes #42173

_silu_mul_per_token_group_quant_fp8_colmajor computes row/column offsets using int32 arithmetic:

m_offset = pid_m * BLOCK_M
n_offset = pid_n * BLOCK_N

With large DeepGEMM MoE warmup/workspace shapes (e.g. DPEP=16, 36k max tokens per rank), the maximum element offset M * N - 1 = 18,882,756,607 far exceeds the int32 limit of 2,147,483,647, causing the Triton kernel to access illegal memory addresses.

This PR promotes m_offset and n_offset to tl.int64 before pointer arithmetic to ensure correct 64-bit memory addressing.
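
To make the failure mode concrete, here is a minimal sketch of the wraparound in pure Python. It emulates int32 arithmetic with `ctypes.c_int32`; the helper name `wrap_int32` and the contiguous row-major offset formula `row * N + col` are illustrative assumptions, not code taken from the kernel.

```python
import ctypes

INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Emulate two's-complement int32 wraparound (hypothetical helper)."""
    return ctypes.c_int32(value).value


# Shape taken from the reproducer in this PR.
M, N = 524_416, 4096

# Offset of the last element, assuming a contiguous row-major layout.
true_offset = (M - 1) * N + (N - 1)

# What the same computation yields if it is carried out in int32.
wrapped_offset = wrap_int32(true_offset)

print(f"true offset  = {true_offset}")
print(f"int32 offset = {wrapped_offset}")
print(f"overflows?   = {true_offset > INT32_MAX}")
```

A wrapped offset added to the base pointer lands outside the tensor, which is consistent with the illegal memory access reported in this PR.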

Test Plan

  1. Verified on NVIDIA H100 PCIe (80GB) using a minimal single-GPU reproducer with the first aligned overflow shape:
     • M = 524_416, N = 4096
     • max_offset = M * N - 1 = 2,148,007,935 (exceeds int32 max)
     • Environment: vllm 0.19.0, torch 2.10.0+cu128, triton 3.6.0, cuda 12.8
  2. Reproduction script:
import torch
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
    silu_mul_per_token_group_quant_fp8_colmajor,
)

# Single-card minimum overflow shape
M = 524_416
N = 4096

print(f"M={M}, N={N}")
print(f"max_offset = {M*N-1}")
print(f"int32_max  = {2**31-1}")
print(f"overflow?  = {M*N-1 > 2**31-1}")
print()

x = torch.empty((M, N), device="cuda", dtype=torch.bfloat16)
torch.cuda.synchronize()

print("Calling silu_mul_per_token_group_quant_fp8_colmajor ...")
y, scales = silu_mul_per_token_group_quant_fp8_colmajor(x, use_ue8m0=False)
torch.cuda.synchronize()

print(f"successful!output shape={y.shape}, scales shape={scales.shape}")
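
The shape used in the script can be derived rather than guessed. The sketch below finds the smallest row count, rounded up to a multiple of an alignment, for which M * N - 1 exceeds the int32 maximum. The alignment of 128 and the function name `first_overflow_rows` are assumptions made for illustration; the PR only states that 524_416 is the first aligned overflow shape.

```python
INT32_MAX = 2**31 - 1


def first_overflow_rows(n_cols: int, align: int = 128) -> int:
    """Smallest multiple of `align` rows whose last offset exceeds int32.

    Hypothetical helper; `align=128` is an assumption, not from the PR.
    """
    # Need rows * n_cols - 1 > INT32_MAX, i.e. rows * n_cols > INT32_MAX + 1.
    min_rows = (INT32_MAX + 1) // n_cols + 1
    # Round up to the next multiple of the alignment.
    return -(-min_rows // align) * align


M = first_overflow_rows(4096)
print(f"M={M}, max_offset={M * 4096 - 1}")
```

Under these assumptions, one alignment step fewer (M - 128 rows) gives a last offset that still fits in int32, so this shape sits right at the boundary.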

Test Result

Before fix:

M=524416, N=4096
max_offset = 2148007935
int32_max  = 2147483647
overflow?  = True
Calling silu_mul_per_token_group_quant_fp8_colmajor ...
File ".../fp8_utils.py", line 785, in silu_mul_per_token_group_quant_fp8_colmajor
_silu_mul_per_token_group_quant_fp8_colmajor[grid](...)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered

After fix (testing in progress on H100 PCIe):

M=524416, N=4096
max_offset = 2148007935
int32_max  = 2147483647
overflow?  = True
Calling silu_mul_per_token_group_quant_fp8_colmajor ...
successful!output shape=torch.Size([524416, 2048]), scales shape=torch.Size([524416, 16])
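
The output shapes above are consistent with a SiLU-and-multiply that halves the last dimension, followed by one scale per group of output columns. A small sketch, assuming a quantization group size of 128 (inferred from 2048 / 16, not stated in the PR) and a hypothetical helper name `expected_shapes`:

```python
def expected_shapes(m: int, n: int, group_size: int = 128):
    """Return (output_shape, scales_shape) under the stated assumptions."""
    # SiLU-and-mul combines the two halves of the last dimension.
    out_cols = n // 2
    # One scale per group of `group_size` output columns in each row
    # (group_size=128 is an assumption inferred from the logged shapes).
    scale_cols = out_cols // group_size
    return (m, out_cols), (m, scale_cols)


y_shape, s_shape = expected_shapes(524_416, 4096)
print(y_shape, s_shape)
```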

@mergify mergify Bot added the "bug" label (Something isn't working) May 10, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the Triton kernels in fp8_utils.py to use int64 for offset calculations to prevent potential integer overflows. The review feedback correctly points out that casting to int64 after the multiplication is insufficient, as the intermediate 32-bit product could still overflow. The reviewer suggests casting the program IDs to int64 before the multiplication to ensure robust overflow protection, consistent with other kernels in the codebase.
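
The point about cast order can be illustrated without a GPU. In the sketch below, int32 arithmetic is emulated with `ctypes.c_int32`; the program ID and block size are hypothetical values chosen only so that their product exceeds the int32 range, and they are not shapes from this PR.

```python
import ctypes


def mul_int32(a: int, b: int) -> int:
    """Multiply as if both operands and the result were int32 (emulation)."""
    return ctypes.c_int32(a * b).value


# Hypothetical values, chosen only so that pid * BLOCK exceeds 2**31 - 1.
pid = 20_000_000
BLOCK = 128

# Cast after the multiply: the product has already wrapped in 32 bits,
# so widening it to 64 bits afterwards preserves the wrong value.
cast_after = mul_int32(pid, BLOCK)

# Cast before the multiply: Python ints are unbounded, standing in for
# an int64 operand, so the product is computed at full width.
cast_before = pid * BLOCK

print(f"cast after multiply  = {cast_after}")
print(f"cast before multiply = {cast_before}")
```

In Triton terms this corresponds to widening the program ID before it is multiplied by the block size, which is the form the review recommends.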

Comment thread vllm/model_executor/layers/quantization/utils/fp8_utils.py Outdated
Comment thread vllm/model_executor/layers/quantization/utils/fp8_utils.py Outdated
@Flink-ddd Flink-ddd changed the title from "[Bugfix] Fix int32 overflow in DeepGEMM SiLU/mul FP8 colmajor Triton kernel for large MoE warmup shapes" to "[Bugfix] Fix int32 overflow in DeepGEMM SiLU/mul FP8 Triton kernel" May 10, 2026
@Flink-ddd Flink-ddd force-pushed the fix/deepgemm-silu-fp8-int32-overflow branch from 23fc557 to ac90f94 Compare May 10, 2026 04:26
@Flink-ddd

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the Triton kernels in fp8_utils.py to cast program IDs to int64 before calculating memory offsets. This change prevents potential integer overflow issues during offset computation in large-scale operations. As there were no review comments provided, I have no feedback to provide.

@Flink-ddd

Contributor Author

The pre-commit failures appear to be pre-existing issues on the main branch that are unrelated to this PR. All checks pass for the modified file when running pre-commit run --files vllm/model_executor/layers/quantization/utils/fp8_utils.py.

Screenshot 2026-05-10 at 12 38 35

@Flink-ddd Flink-ddd marked this pull request as ready for review May 10, 2026 04:41

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

mergify Bot commented May 10, 2026

Contributor

Hi @Flink-ddd, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@yewentao256 yewentao256 left a comment

Member

LGTM, thanks for the work! Also CC @ivanium

ivanium commented May 10, 2026

Collaborator

LGTM too. Thanks for the fix! cc @zyongye as well

Flink-ddd and others added 3 commits May 11, 2026 10:57
…_group_quant_fp8_colmajor to fix int32 overflow for large DeepGEMM MoE warmup shapes

Signed-off-by: vensen <vensenmu@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Vensen <vensenmu@gmail.com>
Signed-off-by: vensen <vensenmu@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Vensen <vensenmu@gmail.com>
Signed-off-by: vensen <vensenmu@gmail.com>
@Flink-ddd Flink-ddd force-pushed the fix/deepgemm-silu-fp8-int32-overflow branch from ac90f94 to d0729a2 Compare May 11, 2026 03:37
@Flink-ddd Flink-ddd requested a review from zyongye as a code owner May 11, 2026 03:37
@zyongye zyongye added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) May 11, 2026

@zyongye zyongye left a comment

Member

LGTM

@Flink-ddd

Contributor Author

Hi @yewentao256 @ivanium @zyongye, all 69 CI checks have passed and the PR is ready to merge. Thanks!

@yewentao256 yewentao256 merged commit 6fdb493 into vllm-project:main May 11, 2026
71 checks passed
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026
…llm-project#42201)

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensen <vensenmu@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
…llm-project#42201)

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensen <vensenmu@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
…llm-project#42201)

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensen <vensenmu@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026
…llm-project#42201)

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensen <vensenmu@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
…llm-project#42201)

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensen <vensenmu@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026
…llm-project#42201)

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensen <vensenmu@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Labels

bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed)

Development

Successfully merging this pull request may close these issues.

DeepGEMM SiLU/mul FP8 quant Triton kernel overflows int32 addresses for large DPEP warmup shapes

4 participants