webgpu: Fix buffer overflow in BufferManager::Upload causing data corruption #27948

Merged
guschmue merged 3 commits into main from fix/webgpu-upload-buffer-overflow
Apr 8, 2026

Conversation

@qjia7
Contributor

@qjia7 qjia7 commented Apr 2, 2026

Description

BufferManager::Upload() used NormalizeBufferSize() (16-byte alignment) to determine both the staging buffer size and the CopyBufferToBuffer copy size. When the actual data size was not a multiple of 16, the extra padding bytes in the staging buffer were uninitialized, and CopyBufferToBuffer would copy those garbage bytes into the destination GPU buffer beyond the intended range.

This caused data corruption when external code (e.g., onnxruntime-genai) uploaded partial data to a pre-allocated static GPU buffer using ORT's CopyTensors API. For example, uploading 24 bytes (3 x int64) of attention mask data would copy 32 bytes (24 rounded up to the next multiple of 16), writing 8 garbage bytes at byte offsets 24-31 of the destination buffer and corrupting the 4th element.
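The arithmetic in this example can be checked with a small sketch. `AlignUp` and `OverwrittenBytes` are hypothetical helpers written for illustration, not the actual ORT functions; they only assume that both normalization functions round a size up to a power-of-two alignment (16 and 4 bytes respectively, as stated in this description).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not the ORT implementation): round `size` up to the
// next multiple of `alignment`. `alignment` must be a power of two.
constexpr std::size_t AlignUp(std::size_t size, std::size_t alignment) {
  return (size + alignment - 1) & ~(alignment - 1);
}

// Bytes written past the real payload when the copy size is the aligned size.
constexpr std::size_t OverwrittenBytes(std::size_t size, std::size_t alignment) {
  return AlignUp(size, alignment) - size;
}

// Checks the numbers quoted in the PR description for a 24-byte upload.
inline void CheckAttentionMaskExample() {
  assert(AlignUp(24, 16) == 32);          // 16-byte alignment copies 32 bytes
  assert(OverwrittenBytes(24, 16) == 8);  // offsets 24-31 receive garbage
  assert(OverwrittenBytes(24, 4) == 0);   // 4-byte alignment copies exactly 24
}
```

With 4-byte alignment a 24-byte upload needs no padding at all, although sizes that are not a multiple of 4 would still be rounded up.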

This manifested as a 'device lost' crash in FlashAttention when running LLM inference with graph capture enabled and odd prompt lengths (e.g., 1 or 3 tokens), because the corrupted attention mask caused ReduceSum to produce wrong seqlen_k values, leading to out-of-bounds GPU memory access.

Fix:

  • Use NormalizeCopySize() (4-byte alignment, the WebGPU minimum for CopyBufferToBuffer) instead of NormalizeBufferSize() (16-byte alignment) for both the staging buffer allocation and the copy command.
  • Zero any padding bytes between actual size and copy size to prevent garbage from being written to the destination buffer.
  • Apply the same 4-byte alignment fix to MemCpy() for consistency.
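The first two bullets can be modeled on the CPU as follows. This is a minimal sketch under stated assumptions, not the code in buffer_manager.cc: a `std::vector` stands in for the mapped staging buffer, and `NormalizeCopySizeSketch` and `StageForUpload` are hypothetical names. No WebGPU calls are made.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed copy alignment: 4 bytes, per the description above.
constexpr std::size_t kCopyAlignment = 4;

// Hypothetical stand-in for NormalizeCopySize(): round up to a multiple of 4.
constexpr std::size_t NormalizeCopySizeSketch(std::size_t size) {
  return (size + kCopyAlignment - 1) & ~(kCopyAlignment - 1);
}

// Builds the bytes that would be handed to the copy command. The vector is
// pre-filled with 0xCD to imitate uninitialized staging memory.
std::vector<std::uint8_t> StageForUpload(const void* src, std::size_t size) {
  const std::size_t copy_size = NormalizeCopySizeSketch(size);
  std::vector<std::uint8_t> staging(copy_size, 0xCD);
  if (size > 0) {
    std::memcpy(staging.data(), src, size);  // real payload
  }
  if (copy_size > size) {
    // Zero the padding between the payload and the aligned copy size.
    std::memset(staging.data() + size, 0, copy_size - size);
  }
  return staging;
}
```

For a 6-byte payload the sketch stages 8 bytes, with the last two set to zero rather than left as 0xCD; for a 24-byte payload it stages exactly 24 bytes.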

@qjia7 qjia7 requested review from fs-eire and guschmue April 2, 2026 09:00
guschmue previously approved these changes Apr 6, 2026
Comment thread onnxruntime/core/providers/webgpu/buffer_manager.cc Outdated
The zero-padding of trailing bytes when copy_size > size does not prevent
dirty data in the destination buffer, since the destination may already
have non-zero values in those positions that get overwritten by the
aligned CopyBufferToBuffer. Replace with a comment documenting the issue
and noting that a CopyBuffer + compute shader approach could fix it.
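The point made in this commit message can be illustrated with a CPU-only model. The sketch below is an assumption-laden illustration, not ORT code: `SimulatedAlignedCopy` and `CountClobberedBytes` are hypothetical, and plain vectors stand in for the staging and destination GPU buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Models an aligned copy: every byte of `staging`, padding included, is
// written to the start of `dst`.
void SimulatedAlignedCopy(const std::vector<std::uint8_t>& staging,
                          std::vector<std::uint8_t>& dst) {
  assert(staging.size() <= dst.size());
  if (!staging.empty()) {
    std::memcpy(dst.data(), staging.data(), staging.size());
  }
}

// Counts destination bytes beyond `size` that change when a zero-padded
// staging buffer of `copy_size` bytes is copied over existing 0xFF data.
std::size_t CountClobberedBytes(std::size_t size, std::size_t copy_size) {
  assert(size <= copy_size);
  std::vector<std::uint8_t> dst(copy_size, 0xFF);      // pre-existing data
  std::vector<std::uint8_t> staging(copy_size, 0x00);  // zero-padded staging
  for (std::size_t i = 0; i < size; ++i) {
    staging[i] = 0xFF;  // payload matches dst, so only padding can alter it
  }
  SimulatedAlignedCopy(staging, dst);
  std::size_t clobbered = 0;
  for (std::size_t i = size; i < copy_size; ++i) {
    if (dst[i] != 0xFF) ++clobbered;
  }
  return clobbered;
}
```

In this model a 6-byte upload with an 8-byte copy size still changes two destination bytes, even though the padding is zeroed, which is why zeroing alone does not protect existing destination data.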
@qjia7 qjia7 requested review from fs-eire and guschmue April 8, 2026 07:26
@guschmue guschmue enabled auto-merge (squash) April 8, 2026 15:22
@guschmue guschmue merged commit f7751fe into main Apr 8, 2026
100 of 102 checks passed
@guschmue guschmue deleted the fix/webgpu-upload-buffer-overflow branch April 8, 2026 19:40