webgpu: Fix buffer overflow in BufferManager::Upload causing data corruption #27948
Merged
Conversation
BufferManager::Upload() used NormalizeBufferSize() (16-byte alignment) to determine both the staging buffer size and the CopyBufferToBuffer copy size. When the actual data size was not a multiple of 16, the extra padding bytes in the staging buffer were uninitialized, and CopyBufferToBuffer would copy those garbage bytes into the destination GPU buffer beyond the intended range.

This caused data corruption when external code (e.g., onnxruntime-genai) uploaded partial data to a pre-allocated static GPU buffer using ORT's CopyTensors API. For example, uploading 24 bytes (3 x int64) of attention mask data would copy 32 bytes (24 rounded up to a multiple of 16), writing 8 garbage bytes at positions 24-31 of the destination buffer and corrupting the 4th element.

This manifested as a "device lost" crash in FlashAttention when running LLM inference with graph capture enabled and odd prompt lengths (e.g., 1 or 3 tokens): the corrupted attention mask caused ReduceSum to produce wrong seqlen_k values, leading to out-of-bounds GPU memory access.

Fix:
- Use NormalizeCopySize() (4-byte alignment, the WebGPU minimum for CopyBufferToBuffer) instead of NormalizeBufferSize() (16-byte alignment) for both the staging buffer allocation and the copy command.
- Zero any padding bytes between the actual size and the copy size to prevent garbage from being written to the destination buffer.
- Apply the same 4-byte alignment fix to MemCpy() for consistency.
guschmue
previously approved these changes
Apr 6, 2026
fs-eire
reviewed
Apr 7, 2026
The zero-padding of trailing bytes when copy_size > size does not prevent dirty data in the destination buffer, since the destination may already have non-zero values in those positions that get overwritten by the aligned CopyBufferToBuffer. Replace with a comment documenting the issue and noting that a CopyBuffer + compute shader approach could fix it.
guschmue
approved these changes
Apr 8, 2026