Refactor omni metrics into stats → aggregator → logger pipeline #1
Copilot wants to merge 7 commits into
Co-authored-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com>
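@copilot How can I obtain the vision encoder time, audio encoder time, thinker prefill time, thinker decode time, talker prefill time, and talker decode time?
The current framework does not yet collect fine-grained timings for the vision encoder, audio encoder, thinker prefill/decode, or talker prefill/decode. To obtain them, you need to add instrumentation in the engine or output-processing code of the corresponding stage, write the timings into the metrics passed to the orchestrator (for example, as custom fields on StageRequestStats/StageRequestMetrics), and let the aggregator/logger emit them in one place. The existing code does not produce these timings automatically.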
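@copilot The vision encoder / audio encoder work happens in _process_image_input and _process_audio_input, called from embed_multimodal in vllm_omni\model_executor\models\qwen3_omni\qwen3_omni_moe_thinker.py. For thinker prefill & decode / talker prefill & decode, _run_output_handler in vllm_omni\entrypoints\async_omni_llm.py calls logger_manager.record to record the data. It seems to me that these functions are invoked on the vLLM side and that modifying vLLM's code is inconvenient, so is passing the values through prometheus_client the only option?
These timings still need to be instrumented at the call sites.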
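One way to instrument a call site without editing the callee is to wrap the function from the outside. The sketch below is only an illustration under stated assumptions: `timed_call`, `fake_process_image_input`, and the `vision_encoder_ms` key are hypothetical names, and the stand-in function does no real encoding.

```python
import time
from functools import wraps


def timed_call(fn, sink, key):
    """Wrap fn so every call adds its wall-clock duration (ms) to sink[key]."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Accumulate, because an encoder may run several times per request.
            sink[key] = sink.get(key, 0.0) + (time.perf_counter() - t0) * 1000.0
    return wrapper


# Hypothetical stand-in for a method such as _process_image_input.
def fake_process_image_input(pixels):
    return [p * 2 for p in pixels]


timings = {}
fake_process_image_input = timed_call(
    fake_process_image_input, timings, "vision_encoder_ms"
)
result = fake_process_image_input([1, 2, 3])
```

For GPU work, kernels are launched asynchronously, so a wall-clock measurement like this only reflects the encoder's cost if the device is synchronized before the clock is read (for example with torch.cuda.synchronize()), which itself adds overhead.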
@copilot In async_omni_llm._run_output_handler, capture the thinker/talker prefill/decode timestamps and write them into the same metrics structure (e.g. thinker_prefill_ms/thinker_decode_ms/talker_prefill_ms/talker_decode_ms); logger_manager.record will pass them along with the stats
Once _run_output_handler has the per-phase timestamps, you can write them straight into the metrics dict. Example (illustrative code):
import time
# Inside the loop that processes a request
start_prefill = time.perf_counter()
# ... call thinker/talker prefill ...
end_prefill = time.perf_counter()
start_decode = time.perf_counter()
# ... call thinker/talker decode ...
end_decode = time.perf_counter()
m = result.get("metrics") or {}
# thinker
m["thinker_prefill_ms"] = (end_prefill - start_prefill) * 1000
m["thinker_decode_ms"] = (end_decode - start_decode) * 1000
# For talker, write talker_prefill_ms / talker_decode_ms the same way
# Hand off to the orchestrator
metrics.on_stage_metrics(stage_id, req_id, m)
Key points:
@copilot How do I change this in the code?
You can do it like this:
import time
# Assume we are currently in the result-handling branch
start_prefill = time.perf_counter()
# Call the thinker/talker prefill logic
end_prefill = time.perf_counter()
start_decode = time.perf_counter()
# Call the thinker/talker decode logic
end_decode = time.perf_counter()
m = result.get("metrics") or {}
m["thinker_prefill_ms"] = (end_prefill - start_prefill) * 1000
m["thinker_decode_ms"] = (end_decode - start_decode) * 1000
# For talker, write talker_prefill_ms / talker_decode_ms the same way
metrics.on_stage_metrics(stage_id, req_id, m)
The core idea is to use …
@copilot How do I use perf_counter to record the duration and write the milliseconds into the metrics dict? Give me code I can use directly.
The example can be dropped in directly:
import time
# ... obtain result, stage_id, req_id, and the metrics object (OrchestratorMetrics) ...
# Thinker prefill
prefill_t0 = time.perf_counter()
# Call the thinker prefill logic
prefill_t1 = time.perf_counter()
# Thinker decode
decode_t0 = time.perf_counter()
# Call the thinker decode logic
decode_t1 = time.perf_counter()
m = result.get("metrics") or {}
# Write milliseconds
m["thinker_prefill_ms"] = (prefill_t1 - prefill_t0) * 1000
m["thinker_decode_ms"] = (decode_t1 - decode_t0) * 1000
# For talker, record talker_prefill_ms / talker_decode_ms the same way
# Pass it on to the aggregator
metrics.on_stage_metrics(stage_id, req_id, m)
Key point: use …
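The paired perf_counter calls in the snippets above can be folded into a small context manager so each phase needs only one `with` line. This is a minimal sketch, not code from the PR: `record_ms` is a hypothetical helper, and the `sum(range(...))` calls are placeholders for the real prefill and decode work.

```python
import time
from contextlib import contextmanager


@contextmanager
def record_ms(metrics, key):
    """Time the enclosed block and store the elapsed milliseconds in metrics[key]."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so a failed phase still gets a timing.
        metrics[key] = (time.perf_counter() - t0) * 1000.0


m = {}
with record_ms(m, "thinker_prefill_ms"):
    sum(range(1000))  # placeholder for the thinker prefill call
with record_ms(m, "thinker_decode_ms"):
    sum(range(1000))  # placeholder for the thinker decode call
```

The resulting dict `m` has the same shape as the one passed to metrics.on_stage_metrics in the snippets above.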
@copilot Describe the roles of StageRequestStats, TransferEdgeStats, RequestE2EStats, RunSummary, OrchestratorAggregator, OmniStatLoggerBase, and OmniLoggingStatLogger.
@copilot I want to write a merge function that computes and outputs the averages over the events in StageRequestStats, TransferEdgeStats, and RequestE2EStats.
You can aggregate the existing objects directly. Example (place it wherever you need it; the inputs are the three event lists):
from statistics import mean
from vllm_omni.metrics import StageRequestStats, TransferEdgeStats, RequestE2EStats

def merge_and_avg(stage_events: list[StageRequestStats],
                  transfer_events: list[TransferEdgeStats],
                  request_events: list[RequestE2EStats]) -> dict:
    # Each group is skipped when its list is empty, because mean() raises on empty input.
    out: dict[str, float] = {}
    if stage_events:
        out["stage_tokens_out_avg"] = mean(ev.num_tokens_out for ev in stage_events)
        out["stage_latency_ms_avg"] = mean(ev.stage_gen_time_ms for ev in stage_events)
    if transfer_events:
        out["transfer_size_bytes_avg"] = mean(ev.size_bytes for ev in transfer_events)
        out["transfer_total_ms_avg"] = mean(ev.total_time_ms for ev in transfer_events)
    if request_events:
        out["e2e_time_ms_avg"] = mean(ev.e2e_time_ms for ev in request_events)
        out["e2e_tokens_avg"] = mean(ev.e2e_total_tokens for ev in request_events)
    return out
If you are already using …
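If the event classes are dataclasses (an assumption; the conversation does not show their definitions), the averaging can also be written generically so that new numeric fields are picked up without editing the merge function. `FakeStageEvent` and `average_numeric_fields` below are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass
class FakeStageEvent:  # hypothetical stand-in for StageRequestStats
    num_tokens_out: int
    stage_gen_time_ms: float


def average_numeric_fields(events):
    """Return {"<field>_avg": mean} for every int/float field; {} for no events."""
    if not events:
        return {}
    out = {}
    for f in fields(events[0]):
        values = [getattr(ev, f.name) for ev in events]
        # bool is a subclass of int, so exclude it explicitly.
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            out[f.name + "_avg"] = mean(values)
    return out


events = [FakeStageEvent(10, 100.0), FakeStageEvent(30, 300.0)]
averages = average_numeric_fields(events)
```

The trade-off is that output keys follow the field names rather than hand-picked labels such as stage_latency_ms_avg.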
@copilot Summarize the event flow
Event data flow (when enable_debug_events is enabled):
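@copilot I now have a problem: in one interval I received the stage0 information, and in the next interval I received stage1 and stage2, so the final stage statistics are missing stage0. What should I do?
Two ways to avoid dropping stage0: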
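As an illustration of the general idea, and not necessarily one of the two approaches the reply listed, per-request stage stats can be kept in a store that survives interval flushes and is only emptied when the request completes. `CumulativeStageStore` and its methods are hypothetical names for this sketch.

```python
from collections import defaultdict


class CumulativeStageStore:
    """Keeps per-request stage stats across logging intervals until the request ends."""

    def __init__(self):
        self._by_request = defaultdict(dict)  # req_id -> {stage_id: stats}

    def record(self, req_id, stage_id, stats):
        self._by_request[req_id][stage_id] = stats

    def finish(self, req_id):
        """Remove req_id and return every stage seen for it, ordered by stage id."""
        stages = self._by_request.pop(req_id, {})
        return dict(sorted(stages.items()))


store = CumulativeStageStore()
store.record("req-1", 0, {"stage_gen_time_ms": 12.0})  # arrives in interval 1
# Interval boundary: the interval logger may flush here, but the store is not reset.
store.record("req-1", 2, {"stage_gen_time_ms": 7.0})   # arrives in interval 2
store.record("req-1", 1, {"stage_gen_time_ms": 30.0})  # arrives in interval 2
summary = store.finish("req-1")
```

Because the store is keyed by request rather than by interval, the final summary contains stage0 even though it was reported one interval earlier.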
Refines vllm-omni metrics to mirror vLLM’s three-layer design (Stats → Aggregator → Loggers/Exporters) with per-transfer, per-stage, and e2e summaries plus optional Prometheus export.
vllm_omni.metrics package with StageRequestStats, TransferEdgeStats, RequestE2EStats, RunSummary, and OrchestratorAggregator (keeps OrchestratorMetrics alias for compatibility).
OmniStatLoggerBase, OmniLoggingStatLogger, OmniPrometheusStatLogger, and OmniStatLoggerManager for periodic interval logging/export.
entrypoints/log_utils.py now re-exports the new metrics API for backward compatibility.
Example:
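The three-layer shape described here (stats records feed an aggregator, which feeds loggers) can be sketched with stand-in classes. Everything below is hypothetical and self-contained: `StageStats`, `Aggregator`, and `ListLogger` are not the vllm_omni.metrics classes, and their methods are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StageStats:  # layer 1: a plain per-stage stats record
    stage_id: int
    stage_gen_time_ms: float


class Aggregator:  # layer 2: collects records and builds interval summaries
    def __init__(self):
        self._events = []

    def on_stage(self, ev):
        self._events.append(ev)

    def snapshot(self):
        """Return the per-stage average latency and clear the interval buffer."""
        by_stage = {}
        for ev in self._events:
            by_stage.setdefault(ev.stage_id, []).append(ev.stage_gen_time_ms)
        self._events = []
        return {sid: mean(v) for sid, v in sorted(by_stage.items())}


class ListLogger:  # layer 3: a logger that stores formatted lines
    def __init__(self):
        self.lines = []

    def log(self, summary):
        for sid, avg in summary.items():
            self.lines.append(f"stage={sid} avg_gen_ms={avg:.1f}")


agg, logger = Aggregator(), ListLogger()
for ev in (StageStats(0, 10.0), StageStats(0, 30.0), StageStats(1, 5.0)):
    agg.on_stage(ev)
logger.log(agg.snapshot())
```

Keeping the logger behind a small interface is what lets a logging backend and a Prometheus exporter consume the same summary.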
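Root cause 1: the persistent _proj_buf buffer is contaminated across requests
PR vllm-project#1758 used a persistent self._proj_buf, so multiple concurrent requests share the same block of GPU memory. Embedding data written by request A is overwritten by request B, so later AR steps read the wrong historical embeddings. This is the main cause of the intermittent failures in the concurrency tests.
Fix: allocate proj_buf = torch.zeros(...) locally on every forward() call.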
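The aliasing pattern behind this bug can be reproduced without a GPU. In the sketch below, plain Python lists stand in for tensors and sequential calls stand in for concurrent requests; `SharedBufferModel` and `LocalBufferModel` are hypothetical names, not the classes from the PR.

```python
class SharedBufferModel:
    """Reuses one persistent buffer for every call (the buggy pattern)."""

    def __init__(self, size):
        self._buf = [0.0] * size

    def forward(self, values):
        for i, v in enumerate(values):
            self._buf[i] = v
        return self._buf  # every caller receives the same underlying storage


class LocalBufferModel:
    """Allocates a fresh buffer per call (the pattern used by the fix)."""

    def forward(self, values):
        buf = [0.0] * len(values)
        for i, v in enumerate(values):
            buf[i] = v
        return buf


shared = SharedBufferModel(2)
a = shared.forward([1.0, 1.0])  # "request A"
b = shared.forward([2.0, 2.0])  # "request B" overwrites A's data in place

local = LocalBufferModel()
c = local.forward([1.0, 1.0])
d = local.forward([2.0, 2.0])   # c is unaffected
```

After the second call, `a` and `b` are the same object holding request B's values, while `c` and `d` stay independent.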
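Root cause 2: broadcasting error between 3D summed_embeddings and 2D text_step
summed_embeddings has shape [B, S, H] (3D) and text_step has shape [B*S, H] (2D). When the two are added, PyTorch does not raise an error but silently broadcasts; when B > 1 the result is completely wrong.
Fix: summed_embeddings.reshape(-1, H) so that both are 2D before they are added.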
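To see which shapes broadcast silently, the standard NumPy/PyTorch broadcasting rule (align trailing dimensions; each pair must be equal or contain a 1) can be evaluated on shape tuples alone. `broadcast_shape` below is a hypothetical helper written for this sketch, and the choice S = 1 (a single AR decode step) is an assumption.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Result shape under NumPy/PyTorch-style broadcasting, or ValueError."""
    out = []
    # Walk dimensions from the last to the first, padding the shorter shape with 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or x == 1 or y == 1:
            out.append(x if y == 1 else y)
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
    return tuple(reversed(out))


B, S, H = 4, 1, 8
mismatched = broadcast_shape((B, S, H), (B * S, H))  # (4, 4, 8): no error, wrong shape
fixed = broadcast_shape((B * S, H), (B * S, H))      # (4, 8): after reshape(-1, H)
```

With S = 1 the addition yields a [B, B, H] result in which every request is combined with every other request's text step; under the same rule, S > 1 with B > 1 would raise a shape error instead of broadcasting.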