
fix(aiter): cap workspace buffer partitions by KV cache capacity to prevent OOM #20888

Closed
michaelzhang-ai wants to merge 1 commit into sgl-project:main from michaelzhang-ai:fix/aiter-workspace-buffer-oom

Conversation

@michaelzhang-ai (Collaborator)

Motivation

The aiter attention backend allocates a workspace_buffer for paged_attention_ragged during __init__, sized proportionally to max_num_partitions, which is derived from max_context_len (the model's theoretical maximum, e.g. 131,072 for Llama 3.1). On memory-constrained GPUs — such as CI runners where only ~24 GB of a 256 GB GPU is available — this produces a workspace buffer (~16 GiB) that exceeds the remaining free memory (~4 GiB), causing an OOM crash during server startup.

This was surfaced by PR #20392 which defaults AMD HIP GPUs to the aiter backend, triggering the OOM on CI tests like test_no_overlap_scheduler.py (shard 8):

```
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 16.25 GiB.
GPU 0 has a total capacity of 255.98 GiB of which 4.23 GiB is free.
```

Modifications

Cap the effective sequence length used to compute max_num_partitions by min(max_context_len, max_total_num_tokens). Since no single sequence can exceed the actual KV cache capacity (max_total_num_tokens), this is always sufficient and right-sizes the allocation.

For the failing CI case:

  • Before: max_num_partitions = ceil(131072 / 256) = 512 → workspace ~16.25 GiB
  • After: max_num_partitions = ceil(25432 / 256) = 100 → workspace ~3.2 GiB (fits in 4.23 GiB)

Uses getattr with a fallback to max_context_len for safety, in case max_total_num_tokens is not yet set on the model runner.
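The capping logic described above can be sketched as follows. This is a minimal standalone sketch, not the actual patch: the attribute name max_total_num_tokens and the getattr fallback follow the PR description, while the function name and the partition size constant (256, matching the arithmetic above) are illustrative assumptions.

```python
import math

# Partition size assumed from the arithmetic above (ceil(len / 256)).
_PARTITION_SIZE = 256


def compute_max_num_partitions(max_context_len: int, model_runner) -> int:
    """Compute max_num_partitions, capped by actual KV cache capacity."""
    # Fall back to max_context_len if max_total_num_tokens is not yet
    # set on the model runner (the safety fallback the PR describes).
    max_total_num_tokens = getattr(
        model_runner, "max_total_num_tokens", max_context_len
    )
    # No single sequence can exceed the KV cache capacity, so the min
    # is always sufficient and right-sizes the workspace allocation.
    effective_len = min(max_context_len, max_total_num_tokens)
    return math.ceil(effective_len / _PARTITION_SIZE)
```

With the failing CI configuration (max_total_num_tokens = 25,432), this yields 100 partitions instead of 512, shrinking the workspace proportionally.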

Additional Note for PR #20392

The PR also has a logic bug in its workspace buffer guard condition:

```python
# Current (incorrect): skips allocation only when BOTH are true
if not (self.use_mla and self.use_triton_unified_attention):

# Should be (correct): skips allocation when EITHER is true
if not (self.use_mla or self.use_triton_unified_attention):
```

The workspace_buffer is only used by paged_attention_ragged, which is only called when both use_mla=False and use_triton_unified_attention=False. The current `and` condition unnecessarily allocates the buffer when use_mla=True and use_triton_unified_attention=False (MLA models that don't use SWA).
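The difference between the two guards can be shown with a small truth-table sketch. These are standalone functions standing in for the `self.` attributes in the real code, written only to illustrate the boolean logic:

```python
def guard_incorrect(use_mla: bool, use_triton: bool) -> bool:
    # Allocates (returns True) unless BOTH flags are true.
    return not (use_mla and use_triton)


def guard_correct(use_mla: bool, use_triton: bool) -> bool:
    # Allocates (returns True) only when NEITHER flag is true,
    # i.e. exactly when paged_attention_ragged will actually run.
    return not (use_mla or use_triton)


# The guards diverge on the mixed cases, e.g. an MLA model without
# triton unified attention: the incorrect guard still allocates.
for mla, triton in [(False, False), (False, True), (True, False), (True, True)]:
    print(mla, triton, guard_incorrect(mla, triton), guard_correct(mla, triton))
```

By De Morgan's law, `not (a or b)` equals `not a and not b`, which matches the stated condition "both use_mla=False and use_triton_unified_attention=False".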

…revent OOM

The aiter backend computes max_num_partitions from max_context_len
(e.g. 131K for Llama 3.1), which can produce a workspace buffer
exceeding available GPU memory on constrained setups. Since no single
sequence can exceed max_total_num_tokens (the actual KV cache capacity),
use that as an upper bound to right-size the allocation.
