[Metrics] Adding vllm-omni diffusion metrics support#1977
Conversation
Signed-off-by: Chen Yang <2082464740@qq.com>
hsliuustc0106
left a comment
There was a problem hiding this comment.
wait for PR1908 refactoring merged
|
@Bounty-hunter @david6666666 @ZJY0516 @lishunyang12 PTAL,thanks |
ok |
Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Removed metrics from the output representation. Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Signed-off-by: Chen Yang <2082464740@qq.com>
lishunyang12
left a comment
There was a problem hiding this comment.
Left a couple comments — the stats normalization looks good but the omni_base changes need a rebase.
| stage_meta["stage_type"], | ||
| req_id, | ||
| engine_outputs, | ||
| ) |
There was a problem hiding this comment.
This will conflict with main — process_stage_metrics already handles both the accumulate_diffusion_metrics call and final_output_type passing. Needs a rebase after #1908.
Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Refactor output handling and metrics accumulation in the Omni request processing. Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
|
I do not understand the meaning of the diffusion examples: in your test result, some of them contain prepocecesing but some does not inlcude it. do we have a arg/param list design for the log stats? In addition, what's the relationship with profiler? @gcanlin |
Not really related actually. This PR is focusing on rough profiling, such as e2e time. Torch profiler is for kernel level profiling. |
|
image_num/resolution should be in integer |
hsliuustc0106
left a comment
There was a problem hiding this comment.
Review Summary
This PR adds important profiling capabilities for vLLM-Omni diffusion pipelines, but has blocking issues that must be resolved before merge.
✅ What Works Well
- Clean addition of
--log-statsflag to example scripts - Better type handling in metrics (int → int | float)
- All CI checks pass
- Good example test coverage in PR description
🚫 Blocking Issues
1. Merge Conflicts
The PR is in CONFLICTING state. Please rebase against main and resolve conflicts.
2. Missing Unit Tests
The core changes to vllm_omni/metrics/stats.py lack unit test coverage:
- New
_normalize_diffusion_metric_value()function - Modified
accumulate_diffusion_metrics()logic - Edge cases: bool conversion, Real types, invalid types
Required: Add unit tests in tests/metrics/test_stats.py (or equivalent) covering:
- Bool → int conversion
- Real → float conversion
- Invalid type handling (should return None)
- Accumulation with various metric types
- None value filtering in
_as_stage_request_stats
⚠️ Code Quality Issues
3. Breaking Change: stage_durations Removal
OmniRequestOutput.stage_durations was removed but:
- No deprecation warning or migration guide
- Could break existing code relying on this field
- Not documented in PR description
Recommendation: If this field is no longer needed, document the breaking change. If still useful, restore it.
4. Metric Accumulation Timing
In omni_base.py:252-263, accumulate_diffusion_metrics() is called before checking if finished. This means metrics may be accumulated for incomplete requests.
Recommendation: Move the accumulation call inside the if finished block to ensure we only accumulate completed requests.
5. Silent Type Conversion Failures
The normalization function silently skips invalid types without logging. This could hide data quality issues.
Recommendation: Add debug-level logging when skipping unexpected types:
if normalized_value is None:
logger.debug("Skipping unsupported metric value type: %s for key %s", type(value).__name__, key)Next Steps
- Resolve merge conflicts
- Add unit tests for stats.py changes
- Address the code quality issues above
- Update PR description to document any breaking changes
| _m = result.get("metrics") | ||
| if finished and _m is not None: | ||
| metrics.on_stage_metrics(stage_id, req_id, _m) | ||
| metrics.accumulate_diffusion_metrics( |
There was a problem hiding this comment.
This accumulates metrics before checking if the request is finished. Should this be moved inside the if finished and _m is not None: block below? Otherwise we might accumulate metrics for incomplete/partial requests.
| logger = init_logger(__name__) | ||
|
|
||
|
|
||
| def _normalize_diffusion_metric_value(value: Any) -> int | float | None: |
There was a problem hiding this comment.
Consider adding debug logging when returning None:
if normalized_value is None:
logger.debug("Skipping unsupported metric type: %s", type(value).__name__)This helps with debugging if unexpected types appear in production.
| if diffusion_metrics: | ||
| for key, value in diffusion_metrics.items(): | ||
| self.diffusion_metrics[req_id][key] += value | ||
| normalized_value = _normalize_diffusion_metric_value(value) |
There was a problem hiding this comment.
Using pop() has side effects. Consider whether a defensive copy would be safer:
if req_id in self.diffusion_metrics:
metrics = self.diffusion_metrics[req_id].copy()
del self.diffusion_metrics[req_id]
stats.diffusion_metrics = {k: normalized_value for k, v in metrics.items() ...}This makes the mutation explicit and avoids surprises if the dict is accessed elsewhere.
| f"prompt={self.prompt!r}", | ||
| f"latents={self.latents}", | ||
| f"metrics={self.metrics}", | ||
| f"multimodal_output={self._multimodal_output}", |
There was a problem hiding this comment.
Breaking Change: stage_durations is removed from OmniRequestOutput.
If this field is no longer needed, please document this in the PR description as a breaking change. If it's still useful for debugging/profiling, consider restoring it or providing an alternative way to access this data.
|
I remember there was a doc written by @LJH-LBJ about the log-stats, please check and change accordingly |
Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Signed-off-by: Chen Yang <2082464740@qq.com>
Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Signed-off-by: Chen Yang <2082464740@qq.com>
|
I think this can close this PR since #3069 opened |
Adding profiling for vllm-omni
Purpose
In the vllm-omni project, the logs printed by the Diffusion/DiT Single diffusion Pipeline model lack some diffusion feature information. This PR supplements this information and improves the log printing format.
Test Plan
Test Result glm_image
Test Result text_to_image
Test Result image_to_image
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)