fix(aiter): cap workspace buffer partitions by KV cache capacity to prevent OOM#20888
Closed
michaelzhang-ai wants to merge 1 commit into sgl-project:main from
Conversation
…revent OOM The aiter backend computes max_num_partitions from max_context_len (e.g. 131K for Llama 3.1), which can produce a workspace buffer exceeding available GPU memory on constrained setups. Since no single sequence can exceed max_total_num_tokens (the actual KV cache capacity), use that as an upper bound to right-size the allocation.
Motivation
The aiter attention backend allocates a `workspace_buffer` for `paged_attention_ragged` during `__init__`, sized proportional to `max_num_partitions`, which is derived from `max_context_len` (the model's theoretical maximum, e.g. 131,072 for Llama 3.1). On memory-constrained GPUs, such as CI runners where only ~24 GB of a 256 GB GPU is available, this produces a workspace buffer (~16 GiB) that exceeds the remaining free memory (~4 GiB), causing an OOM crash during server startup.

This was surfaced by PR #20392, which defaults AMD HIP GPUs to the aiter backend, triggering the OOM on CI tests like `test_no_overlap_scheduler.py` (shard 8).

Modifications
Cap the effective sequence length used to compute `max_num_partitions` by `min(max_context_len, max_total_num_tokens)`. Since no single sequence can exceed the actual KV cache capacity (`max_total_num_tokens`), this bound is always sufficient and right-sizes the allocation.

For the failing CI case:

- Before: `max_num_partitions = ceil(131072 / 256) = 512` → workspace ~16.25 GiB
- After: `max_num_partitions = ceil(25432 / 256) = 100` → workspace ~3.2 GiB (fits in 4.23 GiB)

Uses `getattr` with a fallback to `max_context_len` for safety, in case `max_total_num_tokens` is not yet set on the model runner.

Additional Note for PR #20392
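The capping logic described above can be sketched as follows. The attribute names (`max_context_len`, `max_total_num_tokens`) and the 256-token partition size are taken from the PR description; the helper function itself is hypothetical, not the actual code in the aiter backend:

```python
import math

_PARTITION_SIZE = 256  # partition size implied by the PR's ceil(.../256) arithmetic


def capped_max_num_partitions(max_context_len: int, model_runner) -> int:
    """Sketch of the proposed cap (assumed names, not the real implementation).

    No single sequence can exceed the KV cache capacity (max_total_num_tokens),
    so it is a safe upper bound on the sequence length used to size the
    workspace buffer. getattr guards against the attribute not being set yet,
    falling back to the original max_context_len behavior.
    """
    kv_capacity = getattr(model_runner, "max_total_num_tokens", max_context_len)
    effective_len = min(max_context_len, kv_capacity)
    return math.ceil(effective_len / _PARTITION_SIZE)
```

With the CI numbers from the PR, a runner exposing `max_total_num_tokens = 25432` yields 100 partitions instead of the 512 derived from a 131,072-token context.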
The PR also has a logic bug in its workspace buffer guard condition: the `workspace_buffer` is only used by `paged_attention_ragged`, which is only called when both `use_mla=False` AND `use_triton_unified_attention=False`. The current `and` condition unnecessarily allocates the buffer when `use_mla=True, use_triton_unified_attention=False` (MLA models that don't use SWA).
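A minimal sketch of the guard implied by the note above, assuming the two boolean flags are available under these names (the helper and the surrounding allocation code are hypothetical):

```python
def needs_ragged_workspace(use_mla: bool, use_triton_unified_attention: bool) -> bool:
    """Hypothetical guard: the workspace buffer is needed only when
    paged_attention_ragged will actually run, i.e. both flags are False."""
    return not use_mla and not use_triton_unified_attention
```

By contrast, a guard written as `not (use_mla and use_triton_unified_attention)` would still allocate for `use_mla=True, use_triton_unified_attention=False`, which is exactly the unnecessary case the note identifies.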