[ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition#30719
vllm-bot merged 10 commits into vllm-project:main
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This PR correctly identifies and fixes a race condition in the GPTQ GEMM kernels by moving the output tensor zeroing from inside the CUDA kernels to the host side before the kernel launch. The changes are in the right direction, but the fix is incomplete. I've left a critical comment pointing out that the in-kernel zeroing logic also needs to be removed from several other kernel variants to fully resolve the bug.
`csrc/quantization/gptq/q_gemm.cu` (lines 236-239)
This change correctly addresses the race condition for the 4-bit kernel. However, the fix is incomplete as the same problematic in-kernel zeroing logic exists in several other kernel variants. This is a critical issue because the bug will persist for other quantization bit-widths.
Please apply the same fix (i.e., remove the in-kernel zeroing) to the following kernels as well:
- `gemm_half_q_half_gptq_2bit_kernel` (lines 375-378)
- `gemm_half_q_half_gptq_3bit_kernel` (lines 497-500)
- `gemm_half_q_half_gptq_8bit_kernel` (lines 626-629)
- `gemm_half_q_half_alt_4bit_kernel` (lines 1227-1229)
- `gemm_half_q_half_alt_8bit_kernel` (lines 1322-1324)
Additionally, it's better to remove this dead code entirely rather than commenting it out.
Already done :)

maybe we should just deprecate this kernel...

For now, I think we can merge this PR though, since it resolves the GPTQ test bug on ROCm.
mgoin
left a comment
This makes sense and simplifies the kernel, LGTM! I do think we were planning to deprecate this kernel now that @jinzhen-lin added SM75 support for Marlin #29901, but if ROCm needs this kernel we can keep it for now. I would highly recommend the ROCm team investigating if they can reuse the Marlin kernels
@mgoin There are some failures due to

cc @tjtanaa Can we merge this one too? The failing tests are known to be problematic.
…on (vllm-project#30719) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…on (vllm-project#30719) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Summary
Fixes a race condition in the GPTQ GEMM kernels that caused incorrect results when `input_size > BLOCK_KN_SIZE` (128).

Problem

The GPTQ GEMM kernels use multiple thread blocks along the k-dimension (`gridDim.z > 1`) that accumulate partial results via `atomicAdd`. The output tensor was being zeroed inside the kernel by the block with `blockIdx.z == 0`. Since `__syncthreads()` only synchronizes threads within the same block, not across different blocks, this creates a race condition where:

- Blocks with `z > 0` may `atomicAdd` their results before block `z = 0` finishes zeroing
- Block `z = 0` may overwrite results that other blocks have already added

This caused numerical errors up to 45x the expected values, particularly when:

- `input_size > 128` (triggers multiple k-blocks)

Solution

Allocate the output tensor with `torch::zeros()` instead of `torch::empty()`, so it is already zeroed on the host side before the kernel launch.