[AMD] fix CI: workspace buffer OOM and tuned GEMM torchao compatibility#20890
michaelzhang-ai wants to merge 1 commit into `main` from
Conversation
Two fixes for aiter backend failures surfaced by PR #20392:

1. `aiter_backend.py`: Cap `max_num_partitions` by `min(max_context_len, max_total_num_tokens)`. The workspace buffer was sized for the model's theoretical max context (e.g. 131K tokens = 512 partitions = 16 GiB) when the KV cache only held 25K tokens (100 partitions = 3 GiB), causing OOM on memory-constrained CI GPUs.
2. `unquant.py`: Add an aiter `tgemm.mm` fast path for unquantized linear ops, guarded by `type(layer.weight.data) is torch.Tensor`. Torchao-quantized weights (`AffineQuantizedTensor`) fail the strict `type()` check and fall through to `F.linear`, preventing `NotImplementedError` on `gemm_a16w16`.
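The guard in fix 2 works because `type(obj) is Cls` rejects subclasses, while `isinstance` accepts them. A minimal illustration with stand-in classes (these are not the real torch/torchao types, just an analogy for the dispatch behavior):

```python
class Tensor:                         # stand-in for torch.Tensor
    pass

class AffineQuantizedTensor(Tensor):  # stand-in for torchao's Tensor subclass
    pass

weight = AffineQuantizedTensor()

# isinstance() accepts subclasses, so it would wrongly send torchao
# weights down the aiter tgemm.mm fast path:
print(isinstance(weight, Tensor))   # True

# The strict type() check rejects subclasses, so quantized weights
# fall through to F.linear instead:
print(type(weight) is Tensor)       # False
print(type(Tensor()) is Tensor)     # True
```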
@amd-bot review
Note: This is a simpler fix for the same root cause described in #18262 / #18263. Rather than jointly solving the memory partition equation upfront (as in #18263), this caps `max_num_partitions` directly.
@amd-bot review
Claude Code Review
Code Review: [AMD] fix CI: workspace buffer OOM and tuned GEMM torchao compatibility

1. Summary

This PR fixes two CI failures on AMD (ROCm) GPUs when using the aiter backend: (1) an OOM error caused by over-allocating the workspace buffer for paged attention based on `max_context_len` rather than the actual KV cache capacity, and (2) a `NotImplementedError` on torchao-quantized weights in the tuned GEMM path.

2. Code Quality

Bugs / Logic Errors
```python
# Potentially safer:
if _use_aiter and type(layer.weight) is torch.Tensor:
    return tgemm.mm(x, layer.weight, bias, otype=x.dtype)
```

Code Style and Readability
Error Handling
3. Performance

Workspace buffer change:
Tuned GEMM addition:
4. Security

No security concerns. Changes are internal computation/allocation logic with no user-facing input handling changes.

5. Testing
6. Suggestions

Suggestion 1: Fix the logic bug mentioned in PR description

The PR description identifies a logic bug in the workspace buffer guard condition:

```python
# In aiter_backend.py, the workspace allocation guard should be:
if not (self.use_mla or self.use_triton_unified_attention):
    # allocate workspace_buffer
```

If this is intentionally left for a follow-up, a TODO comment would be helpful.

Suggestion 2: Use
@amd-bot review
Claude Code Review
Code Review: [AMD] fix CI: workspace buffer OOM and tuned GEMM torchao compatibility

1. Summary

This PR fixes two AMD (ROCm) CI failures: (1) an OOM crash caused by oversized workspace buffer allocation in the aiter attention backend, where the buffer was sized based on `max_context_len` rather than the actual KV cache capacity, and (2) a torchao compatibility crash in the tuned GEMM fast path for unquantized linear ops.

2. Code Quality

Workspace buffer fix (
Open again if needed
Motivation
Two aiter backend failures were surfaced by PR #20392, which defaults AMD HIP GPUs to the aiter backend:
- Shard 8 (CI log): `test_no_overlap_scheduler.py` OOM during `AiterAttnBackend.__init__`:
- Shard 10 (CI log): `test_torchao.py` crash on quantized weights:

Modifications
1. `aiter_backend.py` — workspace buffer OOM

The workspace buffer for `paged_attention_ragged` was sized using `max_context_len` (e.g. 131K for Llama 3.1 → 512 partitions → 16.25 GiB), but the CI GPU's KV cache only held 25K tokens. Since no single sequence can exceed `max_total_num_tokens`, we cap `max_num_partitions` accordingly:

- Before: `max_num_partitions = ceil(131072 / 256) = 512` → workspace ~16.25 GiB
- After: `max_num_partitions = ceil(25432 / 256) = 100` → workspace ~3.2 GiB
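The capping arithmetic above can be sketched as follows (a minimal illustration; the function name and the 256-token `PARTITION_SIZE` are inferred from the figures quoted here, not the exact backend code):

```python
import math

PARTITION_SIZE = 256  # implied by ceil(131072 / 256) = 512 above


def capped_max_num_partitions(max_context_len: int, max_total_num_tokens: int) -> int:
    # No single sequence can exceed the KV cache capacity, so size the
    # workspace for min(model max context, KV cache tokens).
    return math.ceil(min(max_context_len, max_total_num_tokens) / PARTITION_SIZE)


before = capped_max_num_partitions(131072, 131072)  # uncapped: 512 partitions
after = capped_max_num_partitions(131072, 25432)    # capped:   100 partitions
print(before, after)  # 512 100

# The workspace scales linearly with partition count, so the ~16.25 GiB
# buffer shrinks proportionally (back-of-envelope estimate):
print(round(16.25 * after / before, 2))  # 3.17 GiB, matching the ~3.2 GiB above
```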
2. `unquant.py` — tuned GEMM with torchao guard

Add aiter's `tgemm.mm` fast path for unquantized linear ops on AMD, guarded by `type(layer.weight.data) is torch.Tensor`. Torchao-quantized weights (`AffineQuantizedTensor`, a `torch.Tensor` subclass) fail the strict `type()` check and correctly fall through to `F.linear`.

Additional Note for PR #20392
The PR also has a logic bug in its workspace buffer guard condition:
`workspace_buffer` is only used by `paged_attention_ragged`, which is only called when both `use_mla=False` AND `use_triton_unified_attention=False`.

Checklist