Hybrid KV offload: planner, MultiConnector, and mamba alignment for hybrid models #38261
malaiwah wants to merge 35 commits into
Code Review
This pull request introduces hybrid KV cache offloading (HMA) support, enabling more granular management of KV cache groups like Mamba and Attention. It implements a HybridOffloadPlanner to handle fixed-size offload units, a HybridChunkBlockHashList for multi-group hashing, and updates the MultiConnector with weighted selection logic and per-connector metrics. The offloading scheduler and worker are enhanced to support partial block transfers, backpressure for concurrent I/O, and improved error handling for stale cache files. Additionally, the PR includes extensive unit tests for the new hashing and planning logic, and ensures robustness in metrics reporting by clamping negative token counts. I have no feedback to provide.
if any(block_size <= 0 for block_size in self.gpu_block_sizes):
    raise ValueError("gpu_block_sizes must be positive")
assert all(
    offloaded_block_size_int % gpu_block_size == 0
    for gpu_block_size in self.gpu_block_size
), (
    "If 'block_size' is specified in kv_connector_extra_config, "
    "it must be divisible by every KV cache group block size."
)
This assertion is critical for correctness. If block_size is specified in kv_connector_extra_config, it must be divisible by every KV cache group block size. Failure to meet this condition would lead to incorrect block calculations and memory management, potentially causing data corruption or crashes.
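The divisibility requirement can be exercised in isolation. The following is a minimal sketch, assuming a hypothetical helper named `validate_offloaded_block_size` (not a function from the PR) that raises `ValueError` rather than relying on `assert`, so the check survives `python -O`:

```python
def validate_offloaded_block_size(
    offloaded_block_size: int, gpu_block_sizes: tuple[int, ...]
) -> None:
    """Raise if the offloaded block size is not a multiple of every group size.

    Hypothetical helper for illustration only; the PR performs the
    equivalent check with an assert inside the offloading spec.
    """
    if not gpu_block_sizes:
        raise ValueError("at least one KV cache group is required")
    # Collect every group size that does not divide the offloaded block size.
    bad = [size for size in gpu_block_sizes if offloaded_block_size % size != 0]
    if bad:
        raise ValueError(
            f"block_size={offloaded_block_size} is not divisible by "
            f"KV cache group block sizes {bad}"
        )
```

For example, `validate_offloaded_block_size(64, (16, 32))` passes silently, while `validate_offloaded_block_size(48, (16, 32))` raises because 48 is not a multiple of 32.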
raise ValueError(
    "fixed_chunk_size must be greater than or equal to "
    "hash_block_size"
)
logger.error(
    "Hybrid offloading is effectively disabled: "
    "first_hashable_chunk_idx=%d requires %d tokens "
    "but max_model_len=%d. No chunks can ever be "
    "stored. Set max_model_len to a multiple of "
    "hybrid_chunk_size=%d (e.g. %d).",
    self.hybrid_planner.first_hashable_chunk_idx,
    self.hybrid_planner.first_hashable_chunk_idx
    * chunk_size_int,
    max_model_len,
    chunk_size_int,
    (max_model_len // chunk_size_int) * chunk_size_int,
)
Logging an error when hybrid offloading is effectively disabled due to first_hashable_chunk_idx being too large relative to max_model_len is critical. This indicates a severe misconfiguration where no chunks can ever be stored, rendering the feature useless. The error message provides clear guidance for resolution.
raise TypeError(
    "OffloadingConnector requires metadata with reqs_to_load, "
    "reqs_to_store, and reqs_to_flush fields."
)
raise TypeError(
    f"MultiConnector has HMA enabled but these child "
    f"connectors do not support it: {non_hma}. Either "
    f"use --disable-hybrid-kv-cache-manager or replace "
    f"the non-HMA connectors."
)
logger.warning(
    "KV group %d has gpu_block_size=%d which is not "
    "divisible by hybrid_chunk_size=%d. This group "
    "cannot be split into chunks and will require "
    "%d tokens before any offloading can occur. "
    "Consider setting max_model_len to a multiple "
    "of hybrid_chunk_size.",
    i, gbs, chunk_size_int,
    gbs,
)
Logging a warning when gpu_block_size is not divisible by hybrid_chunk_size is important. This configuration can lead to inefficient offloading or unexpected behavior, as the group cannot be split into chunks as intended. The warning helps users understand the implications and adjust their max_model_len accordingly.
logger.warning(
    "offloading worker load submission failed for "
    "req_id=%s job_id=%s (stale cache files?), "
    "falling back to recompute",
Logging a warning when an offloading worker load submission fails is important. This indicates that a cache file might be stale or corrupted, and the system is falling back to recompute. This warning provides crucial information for debugging cache-related issues and understanding performance implications.
logger.warning(
    "KV groups disagree on computed prefix length: %s. "
    "Using minimum (%d tokens) to avoid partial loads.",
    computed_tokens_per_group,
    num_computed_tokens,
)
logger.warning(
    "Negative prompt_tokens_by_source[%s]=%d "
    "(external KV transfer accounting skew), clamping to 0",
    source, value,
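The clamping behaviour behind this warning can be sketched as follows. This is an illustration only: `clamp_token_counts` is a hypothetical name, and the PR applies the clamp inside its metrics logger rather than in a standalone function. The motivation is that Prometheus counters reject negative increments.

```python
import logging

logger = logging.getLogger(__name__)


def clamp_token_counts(prompt_tokens_by_source: dict[str, int]) -> dict[str, int]:
    """Return a copy of the counts with negative values replaced by 0.

    Hypothetical helper: a negative value is logged and clamped so that it
    is never passed to a counter increment.
    """
    clamped: dict[str, int] = {}
    for source, value in prompt_tokens_by_source.items():
        if value < 0:
            logger.warning(
                "Negative prompt_tokens_by_source[%s]=%d, clamping to 0",
                source,
                value,
            )
            value = 0
        clamped[source] = value
    return clamped
```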
02f458a to bfdc78f
Hi @malaiwah, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Thanks for the pre-commit note. Will fix the formatting in the next push; this machine doesn't have a full dev environment (bare Ubuntu with GPU), so running pre-commit locally requires some setup. Working on it.
Re: Gemini's review: all the flagged items are intentional validations and safety checks. No changes needed there.
Pre-commit failures addressed in the latest push:
All 22 pre-commit hooks pass locally.
This pull request has merge conflicts that must be resolved before it can be merged.
0a24daa to 9bddaef
Rebased onto latest main: one conflict in
9bddaef to 7d52303
Instead of "first hit wins", MultiConnector now scores each connector's hit as tokens * load_weight and picks the highest score. This lets a fast CPU cache (high weight) beat a slow disk cache unless the disk hit is substantially larger. Configured via "load_weight" in each connector's kv_connector_extra_config (default 1.0).
Also adds runtime HMA validation: if HMA is enabled, all child connectors must support it or a clear TypeError is raised at init.
Co-authored-by: Claude
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
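The weighted selection described in this commit can be illustrated with a short sketch. The names `ConnectorHit` and `pick_connector` are hypothetical and exist only for this example; the only details taken from the commit message are the `tokens * load_weight` score and the default weight of 1.0.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConnectorHit:
    """One connector's answer to a lookup (illustrative type, not from the PR)."""

    name: str
    matched_tokens: int
    load_weight: float = 1.0  # default weight, as stated in the commit message


def pick_connector(hits: list[ConnectorHit]) -> Optional[ConnectorHit]:
    """Return the hit with the highest matched_tokens * load_weight score."""
    candidates = [hit for hit in hits if hit.matched_tokens > 0]
    if not candidates:
        return None  # no connector has a cache hit
    # max() keeps the first candidate on ties, so list order breaks ties.
    return max(candidates, key=lambda hit: hit.matched_tokens * hit.load_weight)
```

With a CPU weight of 4.0, a 1000-token CPU hit (score 4000) beats a 3000-token disk hit (score 3000), while a 5000-token disk hit (score 5000) wins over the same CPU hit.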
Stock vLLM 0.18.0 calls handle_preemptions(preempted_req_ids: set[str]) but MultiConnector asserted MultiKVConnectorMetadata, crashing under memory pressure when preemption is triggered.
Accept both: forward raw set[str] to all children, or unwrap MultiKVConnectorMetadata per-child as before. Same pattern as the OffloadingConnector preemption fix (8ca977d67).
Co-authored-by: Claude
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Expose five new metrics on /metrics so operators can see which child
connector is serving cache hits, how many tokens are being loaded, and
how many requests are waiting, broken down by connector name:
  vllm_kv_connector_queries_total{connector}
      Total lookup queries issued to each child connector.
  vllm_kv_connector_hits_total{connector}
      Requests where this connector won the weighted selection.
  vllm_kv_connector_hit_tokens_total{connector}
      Total matched tokens served by the winning connector.
  vllm_kv_connector_misses_total{connector}
      Requests where this connector had no cache hit.
  vllm_kv_connector_pending_loads{connector} (gauge)
      Requests currently in-flight for an external load from this connector.
Previously all external hits were aggregated under
vllm:external_prefix_cache_hits_total with no connector breakdown,
making it impossible to distinguish LMCache (CPU) vs llm-d (disk)
contributions.
Co-authored-by: Claude
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
…line
Module-level prometheus_client Counter/Gauge objects registered in the
EngineCore subprocess do not appear in the APIServer's /metrics endpoint.
Replace them with a _SelectionStats dataclass that accumulates per-connector
queries/hits/hit_tokens/misses in the scheduler and flows the data through
vLLM's existing KVConnectorStats pipeline:
  EngineCore: MultiConnector.get_kv_connector_stats()
      → MultiKVConnectorStats["__selection__"] = _SelectionStats
  cross-process (msgspec pickle): stats.data sent in SchedulerStats
  APIServer: MultiKVConnectorPromMetrics.observe()
      → _observe_selection() → Prometheus Counters registered in APIServer
New metrics (all labelled with model_name, engine, connector):
  vllm:kv_connector_mc_queries_total
  vllm:kv_connector_mc_hits_total
  vllm:kv_connector_mc_hit_tokens_total
  vllm:kv_connector_mc_misses_total
Co-authored-by: Claude
Signed-off-by: mbelleau
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
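The accumulation step in the scheduler might look roughly like the sketch below. It is not the PR's `_SelectionStats`; the class name `SelectionStats`, its `record` and `to_dict` methods, and the input shape are assumptions made for illustration. Only the four counter meanings (queries, hits, hit tokens, misses) come from the commit messages above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SelectionStats:
    """Per-connector selection counters (illustrative, not the PR's class)."""

    queries: Counter = field(default_factory=Counter)
    hits: Counter = field(default_factory=Counter)
    hit_tokens: Counter = field(default_factory=Counter)
    misses: Counter = field(default_factory=Counter)

    def record(self, matched: dict[str, int], winner: Optional[str]) -> None:
        """Record one lookup: matched maps connector name to matched tokens."""
        for name, tokens in matched.items():
            self.queries[name] += 1
            if tokens == 0:
                self.misses[name] += 1
        if winner is not None:
            self.hits[winner] += 1
            self.hit_tokens[winner] += matched[winner]

    def to_dict(self) -> dict[str, dict[str, int]]:
        """Plain-dict form, suitable for sending to another process."""
        return {
            key: dict(getattr(self, key))
            for key in ("queries", "hits", "hit_tokens", "misses")
        }
```

Keeping the accumulator as plain data and registering the Prometheus objects only on the receiving side matches the motivation given in the commit: counters created in the EngineCore subprocess are not visible on the APIServer's /metrics endpoint.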
offload_unit_sizes, first_hashable_chunk_idx, and group_hash_factors were recomputed on every property access. offload_unit_sizes in particular is called inside the chunk_count_for_tokens binary search loop (via group_covered_tokens_for_chunk_count), which runs per-request per-step.
Pre-compute all three in __post_init__ using object.__setattr__ (the standard pattern for frozen dataclasses with derived cached values). Properties now return the cached tuple directly: O(1) instead of O(groups).
Co-authored-by: Claude
Signed-off-by: mbelleau
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
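The caching pattern named in this commit can be shown on a reduced example. `PlannerSketch` and its fields are hypothetical, and the rule used to derive the unit sizes (use the chunk size when it divides the group's block size, otherwise keep the whole block) is an assumption based on the warning text earlier in the review, not the planner's actual code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PlannerSketch:
    """Frozen dataclass with a derived value computed once at construction."""

    gpu_block_sizes: tuple[int, ...]
    chunk_size: int
    # Derived field: excluded from __init__ and filled in by __post_init__.
    _unit_sizes: tuple[int, ...] = field(init=False, repr=False)

    def __post_init__(self) -> None:
        units = tuple(
            self.chunk_size if size % self.chunk_size == 0 else size
            for size in self.gpu_block_sizes
        )
        # Frozen dataclasses block normal assignment, so bypass __setattr__.
        object.__setattr__(self, "_unit_sizes", units)

    @property
    def offload_unit_sizes(self) -> tuple[int, ...]:
        # O(1): returns the tuple cached in __post_init__.
        return self._unit_sizes
```

Because the instance stays frozen after construction, the cached tuple cannot drift out of sync with the fields it was derived from.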
_get_block_hashes returns a lazy iterator (islice over HybridChunkBlockHashList). The previous code called it twice with the same start/end indices: once for prepare_load (which fully consumed the iterator) and a second time to update _reqs_being_loaded.
Materialise to a list on the first call so both consumers share the same object. This removes one full HybridChunkBlockHashList construction and one islice traversal per load-scheduled request.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
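The reason a second call was needed at all is that an islice iterator can be consumed only once. A small sketch of that behaviour and of the materialise-once fix, using a stand-in function `get_block_hashes` and made-up hash values rather than the PR's code:

```python
from itertools import islice
from typing import Iterator, Sequence


def get_block_hashes(hashes: Sequence[int], start: int, end: int) -> Iterator[int]:
    """Stand-in for a lazy lookup: returns a one-shot iterator."""
    return islice(hashes, start, end)


all_hashes = [11, 22, 33, 44]

# A lazy iterator can feed only one consumer.
lazy = get_block_hashes(all_hashes, 1, 3)
first_pass = list(lazy)   # consumes the iterator
second_pass = list(lazy)  # the iterator is already exhausted

# Materialising once lets both consumers share the same list.
shared = list(get_block_hashes(all_hashes, 1, 3))
consumer_a = shared
consumer_b = shared
```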
…duler
Previously, every call to _get_block_hashes created a new HybridChunkBlockHashList with fresh RequestBlockHashList instances. Those per-group lists lazily cache computed hashes, but the cache was discarded at the end of each call, forcing a full recomputation from token index 0 on every subsequent call, including the 2–3 calls per request per scheduler step in get_num_new_matched_tokens.
For a 23k-token prompt with attention block_size=1056, each fresh RequestBlockHashList had to compute up to ~22 group-level hashes per call. With 2 groups and 3 calls per step, that is ~130 hash_function invocations that can be avoided after the first step.
Fix: store one HybridChunkBlockHashList per active request in _hybrid_hash_lists. The same instance is returned on subsequent calls, so RequestBlockHashList's internal cache is reused. The instance is seeded lazily on the first _get_block_hashes call and cleaned up in request_finished.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
_get_value_at previously recomputed the combined chunk hash (a hash_function call over a tuple of per-group BlockHashes) on every invocation, even for indices that had been computed before.
Add _chunk_hashes, a lazily-grown list mirroring RequestBlockHashList's _hashes cache. Sequential accesses (the common case via islice) are served from the list after the first visit. Out-of-order accesses skip caching to keep the list dense.
Combined with the scheduler-level _hybrid_hash_lists cache (which preserves the HybridChunkBlockHashList instance across scheduler steps), this eliminates all redundant hash_function calls for previously-seen chunk indices, both within a step and across steps.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
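A dense, lazily-grown cache of this kind can be sketched in a few lines. `DenseHashCache` is a hypothetical class written for this example; it mirrors only the behaviour described in the commit message (append on sequential access, skip caching on out-of-order access) and is not the PR's HybridChunkBlockHashList.

```python
from typing import Callable


class DenseHashCache:
    """Cache values by index in a list that never has gaps."""

    def __init__(self, compute: Callable[[int], int]) -> None:
        self._compute = compute
        self._hashes: list[int] = []

    def get(self, idx: int) -> int:
        if idx < len(self._hashes):
            return self._hashes[idx]  # previously seen index: no recomputation
        value = self._compute(idx)
        if idx == len(self._hashes):
            # Sequential access: extend the cache by exactly one entry.
            self._hashes.append(value)
        # Out-of-order access (idx beyond the end): return without caching
        # so the list stays dense.
        return value

    def __len__(self) -> int:
        return len(self._hashes)
```

Keeping the list dense means a plain length comparison is enough to tell whether an index is cached, at the cost of recomputing values that are requested ahead of the sequential frontier.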
The assert inside _build_gpu_transfer_spec_from_chunk_range checked that gpu_block_size % unit_size == 0 on every call, for every group. Both values are constants derived from the spec at construction time, so the check only needs to run once.
Move it to __init__ where it fails early with a clear message at startup rather than silently passing (when asserts are optimised away with -O) or checking repeatedly on the hot path.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Verify that _chunk_hashes grows as indices are accessed and that repeated accesses return the cached value without growing the cache further.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Stock vLLM 0.18.0 guards _mamba_block_aligned_split with:
    assert num_external_computed_tokens == 0, "External KV connector is not verified yet"
This blocks any external KV cache connector (including our hybrid offloading path) from working with mamba/hybrid models.
The function already correctly adds external tokens to num_computed_tokens (lines 298-301) and the alignment logic is agnostic to the source of those tokens: it only cares about the total count to decide block-aligned splitting.
External tokens are loaded into GPU KV cache before the forward pass; from the model's perspective they are indistinguishable from locally computed tokens. The mamba state for external chunks is already populated at block boundaries by the offloading connector's store/load path.
Validated: Qwen3.5-397B-A17B-NVFP4 on 4×GPU TP=4, the crash occurred after ~23 minutes of successful serving when a request finally triggered a cache hit path through the mamba block alignment codepath.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
…h logs to DEBUG
- Remove _ensure_transfer_supported() stub (body was just `return`) and its two call sites in update_state_after_alloc and _get_reqs_to_store.
- Downgrade per-transfer-job logger.info calls in offloading/worker.py to DEBUG; these fire on every store/load submission and completion and are too noisy at INFO level under real inference load.
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
…nment
- MultiConnector: don't block all connectors when one returns None (backpressured). Let resolved connectors answer immediately; only defer if ALL connectors are unresolved.
- OffloadingConnectorWorker: graceful fallback when load submission fails (stale cache files) instead of assert. Reports failed loads as finished so the scheduler falls back to recompute.
- Scheduler: skip mamba block-aligned split during async KV load (num_new_tokens is intentionally 0, not a real prefill).
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
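The first item, deferring only when every connector is unresolved, can be sketched as a small filter. `resolve_lookups` is a hypothetical function written for illustration; it assumes each connector reports either a matched-token count or None when it is backpressured.

```python
from typing import Optional


def resolve_lookups(results: dict[str, Optional[int]]) -> Optional[dict[str, int]]:
    """Drop unresolved (None) answers; defer only if nothing resolved.

    Returns a dict of resolved connector results, or None to signal that
    the caller should retry on a later scheduler step.
    """
    resolved = {name: tokens for name, tokens in results.items() if tokens is not None}
    if not resolved:
        return None  # every connector is still unresolved: defer
    return resolved
```

For example, `resolve_lookups({"cpu": 512, "disk": None})` returns `{"cpu": 512}` immediately instead of waiting for the disk connector.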
When some groups fail file size validation on load (e.g., attention group kernel block size changed between restarts), the scheduler now takes the minimum computed tokens across groups and warns instead of asserting. The request falls back to recomputing the unloaded portion.
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
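Taking the minimum across groups can be sketched as below. `reconcile_computed_tokens` is a hypothetical name, and the function covers only the "take the minimum and warn" step described in this commit, not the surrounding scheduler logic.

```python
import logging

logger = logging.getLogger(__name__)


def reconcile_computed_tokens(computed_tokens_per_group: list[int]) -> int:
    """Return the prefix length that every KV group can actually serve."""
    if not computed_tokens_per_group:
        raise ValueError("at least one KV group is required")
    num_computed_tokens = min(computed_tokens_per_group)
    if any(tokens != num_computed_tokens for tokens in computed_tokens_per_group):
        # Groups disagree: warn and fall back to the shortest prefix so no
        # group is asked for tokens it did not load.
        logger.warning(
            "KV groups disagree on computed prefix length: %s. "
            "Using minimum (%d tokens) to avoid partial loads.",
            computed_tokens_per_group,
            num_computed_tokens,
        )
    return num_computed_tokens
```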
When KV groups disagree on computed prefix length (e.g., attention group loaded from disk but mamba groups didn't), the fallback must set both num_computed_tokens AND num_external_tokens to 0. Previously only num_computed_tokens was zeroed, leaving num_external_tokens non-zero which failed the chunk_prefix_tokens round-trip assertion.
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
When a load from external storage (NFS, shared disk) takes longer than load_timeout_seconds (default 30s), cancel the wait and fall back to recompute. Prevents requests from stalling indefinitely on slow or hung storage.
The timeout is checked each scheduler step in _try_promote_blocked_waiting_request. Timed-out loads are marked as failed, their block hashes released, and the request re-enters the WAITING queue for recompute.
Configurable via kv_connector_extra_config: "load_timeout_seconds": 30.0
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
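The per-step timeout check could be structured along these lines. `LoadTimeoutTracker` and its methods are hypothetical names for this sketch; only the 30-second default and the idea of polling once per scheduler step come from the commit message. The clock is injectable so the behaviour can be tested without sleeping.

```python
import time
from typing import Callable


class LoadTimeoutTracker:
    """Track in-flight loads and report the ones that exceeded the timeout."""

    def __init__(
        self,
        load_timeout_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = load_timeout_seconds
        self._clock = clock
        self._started: dict[str, float] = {}

    def start(self, req_id: str) -> None:
        self._started[req_id] = self._clock()

    def finish(self, req_id: str) -> None:
        self._started.pop(req_id, None)

    def pop_timed_out(self) -> set[str]:
        """Return and forget every request whose load has run too long."""
        now = self._clock()
        expired = {
            req_id
            for req_id, started in self._started.items()
            if now - started > self._timeout
        }
        for req_id in expired:
            del self._started[req_id]
        return expired
```

A scheduler step would call `pop_timed_out()` and treat each returned request as a failed load that goes back to recompute.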
- typos: rename storeable_prefix_tokens → storable_prefix_tokens in planner.py and test_planner.py
- ruff E501: break long log format strings in offloading/worker.py and offloading/scheduler.py
- mypy [no-redef]: annotate variables redefined after early return in _build_gpu_transfer_spec_from_chunk_range
- mypy [attr-defined]: add type: ignore for hasattr-guarded accesses and rename offloaded_hashes variable in non-hybrid path to avoid type mismatch with HybridChunkBlockHashList | None
- mypy [assignment]: annotate reconstructed_data as dict[str, KVConnectorStats] in MultiKVConnectorStats.from_dict_data; annotate extra_config as dict[str, Any] in test helper
- mypy [arg-type]: guard get_timed_out_loads with hasattr check in scheduler; type: ignore on set[str] forwarded to handle_preemptions
- cpu.py: use block_size_factors[0] instead of removed block_size_factor
- spec.py: add type: ignore[arg-type] on int(hybrid_chunk_size) where extra_config.get() can return Any | None
- ruff format: reformat long lines throughout
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Rebased onto latest
Cross-node + cross-restart validation — 10/10 PASS (Creativity RTX 4080 Super SM 8.9 ↔ AIBoss RTX 5090 SM 12.0,
Both containers now run
…solution
After rebasing onto upstream/main, the automated conflict resolver merged both sides of import conflicts, leaving:
- vllm/v1/kv_offload/spec.py: unused AttentionBackend import
- vllm/v1/kv_offload/worker/cpu_gpu.py: duplicate BlockIDsLoadStoreSpec import and unused AttentionBackend import, causing I001/F401/F811
- vllm/v1/kv_offload/cpu/spec.py: get_handlers() passed undefined attn_backends and gpu_block_size to CpuGpuOffloadingHandlers constructor (F821); also used removed self.block_size_factor (singular) instead of self.block_size_factors[0]
All ruff checks now pass on all changed files.
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Pushed a follow-up commit (
All pre-commit hooks pass locally: Re the The Intel XPU failures (
The merge-both conflict resolver during rebase onto upstream/main incorrectly appended old test_offloading_connector.py content to the upstream's offloading_connector/utils.py, producing syntax errors from line 567 onward. Restore to the upstream version.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Latest push fixes a mangled CI gate: the Intel XPU failures (
This pull request has merge conflicts that must be resolved before it can be merged.
This PR is the fix a real user is currently blocked on — see LMCache/LMCache#3655 (and #45268 / #3655's thread): Qwen3.6-class hybrid + Two notes:
Summary
Enables external KV cache offloading for hybrid models (mamba + attention) like Qwen3.5. The stock offload path requires the LCM of all group block sizes, which is impractical when mamba groups have very different sizes from attention groups.
Core changes
HybridOffloadPlanner (v1/kv_offload/planner.py):
hybrid_chunk_size splits groups where gpu_block_size % chunk_size == 0
MultiConnector (multi_connector.py):
… MultiConnector … matched_tokens × load_weight scoring … SupportsHMA mixin) for hybrid memory allocator compatibility … set[str] signature
Scheduler (scheduler.py):
Metrics (loggers.py):
… prompt_tokens_by_source values that crash Prometheus counters under concurrent external cache hits
Test plan
Validated on Qwen3.5-4B-FP8 (24 mamba + 8 attention layers, RTX 4080 Super):
… PYTHONHASHSEED=0)
Closes #38230
🤖 Generated with Claude Code