
[Metrics] [KVConnector] Add Offloading Connector metrics#27942

Merged
markmc merged 51 commits into vllm-project:main from omerpaz95:offload-connector-metrics
Jan 27, 2026
Conversation

Contributor

@omerpaz95 omerpaz95 commented Nov 2, 2025

Added query and hit metrics for the Offloading Connector, plus timing metrics for store and load operations that report the average per-token load/store time.
The metrics are exposed through both Prometheus and the StatLogger.

Purpose

Allows collection of timing metrics for the Offloading Connector, which is essential for future development.
@orozery please review.

Test Plan

Test Result




Note

Introduces metrics and instrumentation for KV offloading transfers.

  • Adds OffloadingConnectorStats and OffloadPromMetrics with histograms, gauges, and counters for CPU_to_GPU and GPU_to_CPU transfer size and throughput
  • Instruments worker: TransferResult now includes job id, success, num blocks, duration, and transfer type; CUDA start/end events capture timings; results aggregated per request to bytes via computed bytes_per_block
  • Extends APIs: LoadStoreSpec.num_blocks and implementation in BlockIDsLoadStoreSpec; OffloadingConnector now requires kv_cache_config for WORKER, computes bytes_per_block, exposes get_kv_connector_stats, and builders for stats/Prom metrics
  • Updates handlers to validate BlockIDsLoadStoreSpec usage and record transfer types; scheduler/worker paths unchanged functionally aside from stats collection

Written by Cursor Bugbot for commit 1981e15.




github-actions bot commented Nov 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests that quickly catches errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist bot left a comment
Code Review

This pull request adds metrics for the Offloading Connector, including queries, hits, and timing for store/load operations. The changes are generally good, but I've found a critical bug in a type hint that would cause a NameError, and a flaw in the timing metric calculation that would lead to inaccurate results. I've provided suggestions to fix these issues. I also suggested a small improvement for robustness in the stats reset logic.

# req_id -> (job_id, store)
self._jobs: dict[int, tuple[ReqId, bool]] = {}
# req_id -> (job_id, store, start_time, num_blocks)
self._jobs: dict[int, tuple[ReqId, bool, float, num_blocks]] = {}
Contributor
critical

The type hint for self._jobs uses num_blocks, which is an undefined variable at class scope, not a type. This will cause a NameError. It should be int.

Suggested change
self._jobs: dict[int, tuple[ReqId, bool, float, num_blocks]] = {}
self._jobs: dict[int, tuple[ReqId, bool, float, int]] = {}

Comment on lines +39 to +64
@dataclass
class OffloadTiming:
    data: dict[str, Any] = field(default_factory=dict)
    num_stores: int = 0
    num_loads: int = 0

    def __post_init__(self):
        if not self.data:
            # Empty container init, no data is passed in.
            self.reset()

    def reset(self):
        self.data: dict[str, float] = {
            "total_load_time": 0,
            "total_store_time": 0,
        }
        self.num_stores = 0
        self.num_loads = 0

    # Time is already normalized by the number of blocks.
    def record_time(self, time: float, is_store: bool):
        if is_store:
            self.data["total_store_time"] += time
            self.num_stores += 1
        else:
            self.data["total_load_time"] += time
            self.num_loads += 1

Contributor
high

The current implementation for calculating average load/store time is flawed. It computes the average of per-operation rates (sum(duration_i / num_blocks_i) / num_ops), which is not mathematically equivalent to the true average rate (sum(duration_i) / sum(num_blocks_i)) when num_blocks_i varies across operations. This can lead to inaccurate metrics.

To fix this, OffloadTiming should accumulate the total duration and total number of blocks for all operations, and then the average time per token can be correctly calculated. This comment refactors OffloadTiming to support this. Subsequent comments will adjust the call sites.

@dataclass
class OffloadTiming:
    data: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        if not self.data:
            # Empty container init, no data is passed in.
            self.reset()
    
    def reset(self):
        self.data: dict[str, float] = {
            "total_load_duration": 0.0,
            "total_store_duration": 0.0,
            "total_loaded_blocks": 0,
            "total_stored_blocks": 0,
            }
        
    def record_op(self, duration: float, num_blocks: int, is_store: bool):
        if is_store:
            self.data["total_store_duration"] += duration
            self.data["total_stored_blocks"] += num_blocks
        else:
            self.data["total_load_duration"] += duration
            self.data["total_loaded_blocks"] += num_blocks

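The gap between the two formulas is easy to see with a short numeric sketch (the durations and block counts below are made up for illustration, not taken from this PR):

```python
# Two hypothetical transfer ops: a tiny one and a large one.
ops = [
    (0.010, 1),    # (duration_s, num_blocks): 10 ms for 1 block
    (0.100, 100),  # 100 ms for 100 blocks
]

# Average of per-op rates (the original approach).
avg_of_rates = sum(d / n for d, n in ops) / len(ops)

# True average rate over all blocks (the suggested approach).
true_avg = sum(d for d, _ in ops) / sum(n for _, n in ops)

print(f"{avg_of_rates:.5f}")  # 0.00550 s/block -- skewed by the tiny op
print(f"{true_avg:.5f}")      # 0.00109 s/block
```

The small 1-block op dominates the average of rates, overstating the per-block cost by roughly 5x here.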
Comment on lines +76 to +80
def reset(self):
    self.data: dict[str, float] = {
        "total_queries": 0,
        "total_hits": 0,
    }
Contributor
high

It's good practice to initialize all keys that will be accessed later in the reset method. The reduce method accesses avg_load_time and avg_store_time, but they are not initialized here. While the current code flow seems to ensure they are set before reduce is called, adding them here makes the code more robust against future changes.

Suggested change
def reset(self):
    self.data: dict[str, float] = {
        "total_queries": 0,
        "total_hits": 0,
    }

def reset(self):
    self.data: dict[str, float] = {
        "total_queries": 0,
        "total_hits": 0,
        "avg_load_time": 0.0,
        "avg_store_time": 0.0,
    }

Comment on lines +119 to +129
def aggregate_time_data(self, offload_timing: OffloadTiming):
    # Avoid division by zero:
    if offload_timing.num_loads == 0 or self.offloaded_block_size == 0:
        self.data["avg_load_time"] = 0
    else:
        self.data["avg_load_time"] = offload_timing.data["total_load_time"] / (
            offload_timing.num_loads * self.offloaded_block_size)
    if offload_timing.num_stores == 0 or self.gpu_block_size == 0:
        self.data["avg_store_time"] = 0
    else:
        self.data["avg_store_time"] = offload_timing.data["total_store_time"] / (
            offload_timing.num_stores * self.gpu_block_size)

Contributor
high

Following the refactoring of OffloadTiming, this method needs to be updated to correctly calculate the average load and store times. The average time per token should be total_duration / total_tokens.

Suggested change
def aggregate_time_data(self, offload_timing: OffloadTiming):
    # Avoid division by zero:
    if offload_timing.num_loads == 0 or self.offloaded_block_size == 0:
        self.data["avg_load_time"] = 0
    else:
        self.data["avg_load_time"] = offload_timing.data["total_load_time"] / (
            offload_timing.num_loads * self.offloaded_block_size)
    if offload_timing.num_stores == 0 or self.gpu_block_size == 0:
        self.data["avg_store_time"] = 0
    else:
        self.data["avg_store_time"] = offload_timing.data["total_store_time"] / (
            offload_timing.num_stores * self.gpu_block_size)

def aggregate_time_data(self, offload_timing: OffloadTiming):
    # Avoid division by zero:
    total_loaded_tokens = offload_timing.data["total_loaded_blocks"] * self.offloaded_block_size
    if total_loaded_tokens > 0:
        self.data["avg_load_time"] = offload_timing.data["total_load_duration"] / total_loaded_tokens
    else:
        self.data["avg_load_time"] = 0.0
    total_stored_tokens = offload_timing.data["total_stored_blocks"] * self.gpu_block_size
    if total_stored_tokens > 0:
        self.data["avg_store_time"] = offload_timing.data["total_store_duration"] / total_stored_tokens
    else:
        self.data["avg_store_time"] = 0.0

@@ -450,7 +575,8 @@ def get_finished(self, finished_req_ids: set[str]) -> tuple[set[str], set[str]]:
for job_id, success in self.worker.get_finished():
    # we currently do not support job failures
    assert success
    req_id, store = self._jobs.pop(job_id)
    req_id, store, start_time, num_blocks = self._jobs.pop(job_id)
    self._timing_stats.record_time((time.perf_counter() - start_time) / num_blocks, store)
Contributor
high

Following the refactoring of OffloadTiming, this needs to be updated to call the new record_op method with the total duration and number of blocks for the finished operation. This also fixes a potential ZeroDivisionError if num_blocks is 0.

Suggested change
self._timing_stats.record_time((time.perf_counter() - start_time) / num_blocks, store)

duration = time.perf_counter() - start_time
if num_blocks > 0:
    self._timing_stats.record_op(duration, num_blocks, store)

@chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +225 to +230
def get_kv_connector_stats(self) -> KVConnectorStats | None:
    # if self.connector_worker is None:
    #     return None
    if self.connector_worker:
        self.kv_connector_stats.aggregate_time_data(self.connector_worker._timing_stats)
    return self.kv_connector_stats

P1: Reset KV offload metrics before logging

The connector counts cache queries/hits in get_num_new_matched_tokens and stores them on self.kv_connector_stats, but get_kv_connector_stats just returns the same instance without cloning or clearing it. The Prometheus logger treats the returned dict as a delta and calls counter_offload_kv_connector_* .inc(...) on every iteration. Because the same cumulative totals are reported over and over, the counters grow faster than the real number of queries/hits (e.g., the second call increments by the whole lifetime total instead of the new activity). The stats object should be reset or snapshot before being handed to the logger to avoid double‑counting.
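For illustration, a minimal sketch of the snapshot-and-reset pattern this review asks for (the `OffloadStats` class and its fields here are hypothetical stand-ins, not the PR's actual `OffloadingConnectorStats`):

```python
class OffloadStats:
    """Delta-style stats: each snapshot returns only activity since the last one."""

    def __init__(self):
        self.data = {"total_queries": 0, "total_hits": 0}

    def record_query(self, hit: bool):
        self.data["total_queries"] += 1
        if hit:
            self.data["total_hits"] += 1

    def clone_and_reset(self):
        # Hand the accumulated delta to the logger and start fresh, so a
        # Prometheus-style Counter.inc() never sees lifetime totals twice.
        snapshot, self.data = self.data, {"total_queries": 0, "total_hits": 0}
        return snapshot


stats = OffloadStats()
stats.record_query(hit=True)
stats.record_query(hit=False)
print(stats.clone_and_reset())  # {'total_queries': 2, 'total_hits': 1}
print(stats.clone_and_reset())  # {'total_queries': 0, 'total_hits': 0}
```

The second snapshot is empty because the first one consumed the accumulated delta, which is exactly what keeps the logged counters from double-counting.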


Omer Paz and others added 3 commits November 2, 2025 17:44
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
@omerpaz95 force-pushed the offload-connector-metrics branch from ff71bba to fb23282 on November 2, 2025 15:44
omerpaz95 and others added 2 commits November 2, 2025 18:28
)

#
# Offloading connector
Member
Connectors can now add their own metrics via the build_prom_metrics() added in #26811

Signed-off-by: omerpaz95 <73347585+omerpaz95@users.noreply.github.com>
@mergify mergify bot removed the needs-rebase label Jan 21, 2026
mergify bot commented Jan 21, 2026

Hi @omerpaz95, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Comment on lines +94 to +105
for k, v in self.data.items():
    assert isinstance(v, list)
    for op in v:
        for stat, value in [
            ("_total_bytes", op["op_size"]),
            ("_total_time", op["op_time"]),
        ]:
            log_key = k + stat
            if log_key not in return_dict:
                return_dict[log_key] = value
            else:
                return_dict[log_key] += value
Collaborator
nit:

Suggested change
for k, v in self.data.items():
    assert isinstance(v, list)
    for op in v:
        for stat, value in [
            ("_total_bytes", op["op_size"]),
            ("_total_time", op["op_time"]),
        ]:
            log_key = k + stat
            if log_key not in return_dict:
                return_dict[log_key] = value
            else:
                return_dict[log_key] += value

for transfer_type, ops_list in self.data.items():
    assert isinstance(ops_list, list)
    return_dict[f"{transfer_type}_total_bytes"] = sum(op["op_size"] for op in ops_list)
    return_dict[f"{transfer_type}_total_time"] = sum(op["op_time"] for op in ops_list)

    return return_dict

def is_empty(self) -> bool:
    return len(self.data.items()) == 0
Collaborator
nit:

Suggested change
return len(self.data.items()) == 0
return not self.data

Comment on lines +148 to +149
if kv_cache_config is None:
    raise ValueError("kv_cache_config cannot be None for WORKER role")
Collaborator
Why do we need this addition?

Contributor Author
Deleted, no idea how it got here. Sorry.

Comment on lines +231 to +232
if self.connector_worker is None:
    return None
Collaborator
Suggested change
if self.connector_worker is None:
    return None

assert self.connector_worker is not None

Collaborator
my bad, I just noticed that get_kv_connector_stats is used by both the scheduler and worker.
We just emit from the worker.
So can you just add a comment above the return None?
e.g. # We only emit stats from the worker-side

if self.kv_connector_stats.is_empty():
    return None
# Clear stats for next iteration
return self.kv_connector_stats.clone_and_reset()
Collaborator
This triggers a copy of the old KVConnectorStats.
I see NixlConnector also does it, but it seems to me that a more simpler and efficient way is to simply:

Suggested change
return self.kv_connector_stats.clone_and_reset()

kv_connector_stats = self.kv_connector_stats
self.kv_connector_stats = OffloadingConnectorStats()
return kv_connector_stats

@markmc your thoughts?

Comment on lines +55 to +57

def __getitem__(self, key):
    return getattr(self, key)
Collaborator
Remove?

Contributor Author
I added this so that we can access the op's fields in a dict-like manner from reduce(), to conform with how we access them in observe(). Otherwise, it fails pre-commit.

Collaborator
I just tested it, and it seems that after de-serialization, type(op) is actually a dict.
So I think we can remove the code here, and instead assert(isinstance(op, dict)) on the observe function.
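To illustrate the point, a small self-contained sketch (the `TransferOp` dataclass is a hypothetical stand-in for the PR's transfer-op record) showing how a serialization round-trip turns a dataclass into a plain dict:

```python
import dataclasses
import json


@dataclasses.dataclass
class TransferOp:  # hypothetical stand-in for the PR's transfer-op record
    op_size: int
    op_time: float


op = TransferOp(op_size=4096, op_time=0.012)
# Crossing a serialization boundary (as between worker and scheduler)
# loses the dataclass type: what arrives on the other side is a plain dict.
wire = json.dumps(dataclasses.asdict(op))
restored = json.loads(wire)
assert isinstance(restored, dict)
print(restored["op_size"])  # dict-style access needs no __getitem__ shim
```

Since the deserialized ops are already dicts, the `__getitem__` shim on the dataclass never runs on that path, which is why asserting `isinstance(op, dict)` in observe() is sufficient.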

    self.reset()
    return old

def aggregate(self, other: KVConnectorStats) -> KVConnectorStats:
Collaborator
see cursor's comment below

Contributor Author
I already fixed the issue (To my understanding)

omerpaz95 and others added 5 commits January 21, 2026 16:14
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
mergify bot commented Jan 22, 2026

Hi @omerpaz95, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.


Comment on lines +93 to +95
transfer_size=None, # For now - just to get the test working
transfer_time=None, # For now - just to get the test working
transfer_type=None, # For now - just to get the test working
Collaborator
nit: I would remove the comments



class TestOffloadingConnectorStats:
    """Tests for MultiConnector stats reconstruction and operations."""
Collaborator
Suggested change
"""Tests for MultiConnector stats reconstruction and operations."""
"""Tests for OffloadingConnector stats reconstruction and operations."""

stream=stream,
start_event=start_event,
end_event=end_event,
num_bytes=dst_sub_block_count * self.block_size_in_bytes,
Collaborator
I think self.block_size_in_bytes is a list.
Can you define self.total_block_size_in_bytes which is the sum of this list and use it instead?

== handler.block_size_in_bytes
* handler.dst_block_size_factor
* len(dst_blocks)
)
Collaborator
can you also assert 0 < finished[0].time < (time.time() - start_time)?

omerpaz95 and others added 2 commits January 22, 2026 18:06
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
Collaborator

@orozery left a comment
LGTM! Thanks for all the hard work!!!
cc @markmc

@omerpaz95 omerpaz95 requested review from markmc and sagearc January 25, 2026 07:36
@github-project-automation github-project-automation bot moved this from In progress to Ready in gpt-oss Issues & Enhancements Jan 26, 2026
@markmc markmc enabled auto-merge (squash) January 26, 2026 08:17
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 26, 2026
@markmc removed the documentation, performance, new-model, rocm, frontend, speculative-decoding, and ci/build labels Jan 26, 2026

Labels

kv-connector, ready, v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants