
fix(aiter): cap workspace buffer partitions by KV cache capacity to prevent OOM #20888

Closed
michaelzhang-ai wants to merge 1 commit into sgl-project:main from michaelzhang-ai:fix/aiter-workspace-buffer-oom

Conversation

@michaelzhang-ai (Collaborator)

Motivation

The aiter attention backend allocates a workspace_buffer for paged_attention_ragged during __init__, sized proportionally to max_num_partitions, which is derived from max_context_len (the model's theoretical maximum, e.g. 131,072 for Llama 3.1). On memory-constrained GPUs — such as CI runners where only ~24 GB of a 256 GB GPU is available — this produces a workspace buffer (~16 GiB) that exceeds the remaining free memory (~4 GiB), causing an OOM crash during server startup.

This was surfaced by PR #20392 which defaults AMD HIP GPUs to the aiter backend, triggering the OOM on CI tests like test_no_overlap_scheduler.py (shard 8):

```
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 16.25 GiB.
GPU 0 has a total capacity of 255.98 GiB of which 4.23 GiB is free.
```

Modifications

Cap the effective sequence length used to compute max_num_partitions by min(max_context_len, max_total_num_tokens). Since no single sequence can exceed the actual KV cache capacity (max_total_num_tokens), this is always sufficient and right-sizes the allocation.

For the failing CI case:

  • Before: max_num_partitions = ceil(131072 / 256) = 512 → workspace ~16.25 GiB
  • After: max_num_partitions = ceil(25432 / 256) = 100 → workspace ~3.2 GiB (fits in 4.23 GiB)

Uses getattr with a fallback to max_context_len for safety, in case max_total_num_tokens is not yet set on the model runner.
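The capping logic described above can be sketched as follows. This is a minimal standalone sketch, not the actual patch: the attribute name max_total_num_tokens and the getattr fallback follow the PR description, while the function name and the partition size constant (256, matching the arithmetic above) are illustrative assumptions.

```python
import math

# Partition size assumed from the arithmetic above (ceil(len / 256)).
_PARTITION_SIZE = 256


def compute_max_num_partitions(max_context_len: int, model_runner) -> int:
    """Compute max_num_partitions, capped by actual KV cache capacity."""
    # Fall back to max_context_len if max_total_num_tokens is not yet
    # set on the model runner (the safety fallback the PR describes).
    max_total_num_tokens = getattr(
        model_runner, "max_total_num_tokens", max_context_len
    )
    # No single sequence can exceed the KV cache capacity, so the min
    # is always sufficient and right-sizes the workspace allocation.
    effective_len = min(max_context_len, max_total_num_tokens)
    return math.ceil(effective_len / _PARTITION_SIZE)
```

With the failing CI configuration (max_total_num_tokens = 25,432), this yields 100 partitions instead of 512, shrinking the workspace proportionally.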

Additional Note for PR #20392

The PR also has a logic bug in its workspace buffer guard condition:

```python
# Current (incorrect): skips allocation only when BOTH are true
if not (self.use_mla and self.use_triton_unified_attention):

# Should be (correct): skips allocation when EITHER is true
if not (self.use_mla or self.use_triton_unified_attention):
```

The workspace_buffer is only used by paged_attention_ragged, which is only called when both use_mla=False and use_triton_unified_attention=False. The current `and` condition unnecessarily allocates the buffer when use_mla=True and use_triton_unified_attention=False (MLA models that don't use SWA).
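The difference between the two guards can be shown with a small truth-table sketch. These are standalone functions standing in for the `self.` attributes in the real code, written only to illustrate the boolean logic:

```python
def guard_incorrect(use_mla: bool, use_triton: bool) -> bool:
    # Allocates (returns True) unless BOTH flags are true.
    return not (use_mla and use_triton)


def guard_correct(use_mla: bool, use_triton: bool) -> bool:
    # Allocates (returns True) only when NEITHER flag is true,
    # i.e. exactly when paged_attention_ragged will actually run.
    return not (use_mla or use_triton)


# The guards diverge on the mixed cases, e.g. an MLA model without
# triton unified attention: the incorrect guard still allocates.
for mla, triton in [(False, False), (False, True), (True, False), (True, True)]:
    print(mla, triton, guard_incorrect(mla, triton), guard_correct(mla, triton))
```

By De Morgan's law, `not (a or b)` equals `not a and not b`, which matches the stated condition "both use_mla=False and use_triton_unified_attention=False".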

…revent OOM

The aiter backend computes max_num_partitions from max_context_len
(e.g. 131K for Llama 3.1), which can produce a workspace buffer
exceeding available GPU memory on constrained setups. Since no single
sequence can exceed max_total_num_tokens (the actual KV cache capacity),
use that as an upper bound to right-size the allocation.
