[Log] Wire stat loggers into AsyncOmniEngine to match AsyncLLM (#2551)
gcanlin merged 11 commits into vllm-project:main
Conversation
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
This PR is totally vibe coded. But it looks clean. @princepride @fake0fan @yinpeiqi Could you help take a look?
```python
manager = getattr(self.engine, "logger_manager", None)
if manager is None:
    return
try:
    manager.log()
except Exception:
    logger.exception("[AsyncOmni] do_log_stats failed")
```
Cross-thread data race on StatLoggerManager between record() and log()
StatLoggerManager is accessed from two different threads without synchronization. record() is called from the orchestrator's background thread in Orchestrator._process_stage_outputs() (vllm_omni/engine/orchestrator.py:621), while log() is called from the main caller's thread in AsyncOmni.do_log_stats() (vllm_omni/entrypoints/async_omni.py:755). The orchestrator runs in a dedicated threading.Thread created at vllm_omni/engine/async_omni_engine.py:268. In upstream vLLM's AsyncLLM, both record() and log() execute within the same asyncio event loop / thread context. Here they are split across threads, creating a data race on the internal accumulators of StatLoggerManager.
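One straightforward way to close such a race (a minimal sketch, not the PR's actual fix; `LockedStatLoggerManager` and the `_Counter` stand-in below are hypothetical, not vLLM's real API) is to serialize `record()` and `log()` behind a single lock:

```python
import threading


class LockedStatLoggerManager:
    """Wrap a stat logger manager so record() (orchestrator thread)
    and log() (caller thread) never interleave on the accumulators."""

    def __init__(self, inner):
        self._inner = inner
        self._lock = threading.Lock()

    def record(self, *args, **kwargs):
        with self._lock:
            return self._inner.record(*args, **kwargs)

    def log(self):
        with self._lock:
            return self._inner.log()


# Minimal stand-in accumulator to exercise the wrapper.
class _Counter:
    def __init__(self):
        self.n = 0

    def record(self, delta):
        self.n += delta

    def log(self):
        return self.n


manager = LockedStatLoggerManager(_Counter())
threads = [
    threading.Thread(target=lambda: [manager.record(1) for _ in range(1000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(manager.log())  # 4000: no lost updates across threads
```

The lock makes every read-modify-write on the accumulators atomic with respect to the flushing thread, at the cost of a small contention window on each `record()`.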
```diff
 async def do_log_stats(self) -> None:
-    """Log statistics.
+    """Log statistics by flushing per-stage StatLoggerManagers.
```
I think we'd better not operate on the orchestrator thread directly from the AsyncOmni thread. In my view, it should be something like:

```python
class AsyncOmniEngine:
    async def do_log_stats(self):
        # let the orchestrator thread do the call
        await self.engine.do_log_stats()
```

Maybe it could be regarded as a collect RPC call? I am not very sure. But we definitely shouldn't operate on the orchestrator thread directly from AsyncOmni.
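The "collect RPC" idea could be sketched roughly like this (all names, including `OrchestratorLoop` and the `"step"`/`"log_stats"` ops, are illustrative, not the real vllm-omni API): the caller's thread posts a request onto the orchestrator's queue and blocks for the reply, so the stats accumulators are only ever touched on the orchestrator thread.

```python
import queue
import threading


class OrchestratorLoop:
    """Toy orchestrator whose state is only touched on its own thread."""

    def __init__(self):
        self._requests: queue.Queue = queue.Queue()
        self._stats = {"steps": 0}
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            op, reply = self._requests.get()
            if op == "stop":
                reply.put(None)
                return
            if op == "step":
                # Normally driven by the loop itself; modeled as an op here.
                self._stats["steps"] += 1
                reply.put(None)
            elif op == "log_stats":
                # Runs on the orchestrator thread, so no lock is needed
                # around the stats accumulators.
                reply.put(dict(self._stats))

    def call(self, op):
        """Post a request from any thread and block for the reply."""
        reply: queue.Queue = queue.Queue()
        self._requests.put((op, reply))
        return reply.get(timeout=5)


loop = OrchestratorLoop()
loop.call("step")
loop.call("step")
print(loop.call("log_stats"))  # {'steps': 2}
loop.call("stop")
```

This trades a little latency per flush for the guarantee that no shared state crosses threads, which is the property the review is asking for.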
```python
self.output_processors: list[Any] = output_processors
self.stage_vllm_configs: list[Any] = stage_vllm_configs
self.log_stats = log_stats
self.logger_manager: StatLoggerManager | None = logger_manager
```
Is `log_stats` still useful in the orchestrator? Could we just do:

```python
self.log_stats = self.logger_manager is not None
```
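As a side note on the suggested expression: `!= None` and `is not None` are not always equivalent, because `__eq__` can be overridden while identity cannot. A small illustration (`DummyManager` is made up, not part of vLLM):

```python
class DummyManager:
    def __eq__(self, other):
        # A pathological __eq__ that claims equality with everything,
        # including None. Python 3 derives __ne__ from this by negation.
        return True


m = DummyManager()
print(m != None)      # False: the overridden __eq__ says m equals None
print(m is not None)  # True: the identity check is not fooled
```

This is why the identity comparison is the idiomatic way to test for `None`.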
```python
# Mirror vLLM AsyncLLM output_handler: feed stats to the logger
# manager so LoggingStatLogger can periodically print KV cache /
# prefix cache hit rate, and PrometheusStatLogger can publish.
if self.logger_manager is not None:
```
There was a problem hiding this comment.
The diffusion engine doesn't go into this branch. Do we have any plans for diffusion?
Currently, no idea about a diffusion logger. Reusing vLLM's logger keeps this PR simple, but something like KV cache stats isn't appropriate for diffusion.
```python
self.num_stages = len(self.stage_configs)
stage0_args = getattr(self.stage_configs[0], "engine_args", None) if self.num_stages > 0 else None
self.async_chunk = bool(getattr(stage0_args, "async_chunk", False))
self.log_stats = not bool(getattr(stage0_args, "disable_log_stats", False))
```
Overall this looks fine to me. If the StatLoggerManager concurrency issue has been properly resolved, I don't have other blockers.
One small nit: this seems to rely too heavily on the stage0 configuration, which feels somewhat awkward. Probably okay for now, but worth cleaning up later. cc @yinpeiqi
Also, it may be worth taking another look at the logging/stat system for the diffusion path in a follow-up as well, since it seems not fully covered by the current branch yet. @chickeyton
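One possible cleanup for the stage0 reliance mentioned above (a hedged sketch, not the current code; `resolve_log_stats` and the `SimpleNamespace` stand-ins are hypothetical) would derive `log_stats` from all stages, enabling logging only when no stage opts out:

```python
from types import SimpleNamespace


def resolve_log_stats(stage_configs) -> bool:
    """Enable stat logging unless any stage sets disable_log_stats."""
    for cfg in stage_configs:
        args = getattr(cfg, "engine_args", None)
        if bool(getattr(args, "disable_log_stats", False)):
            return False
    return bool(stage_configs)


stages = [
    SimpleNamespace(engine_args=SimpleNamespace(disable_log_stats=False)),
    SimpleNamespace(engine_args=SimpleNamespace(disable_log_stats=True)),
]
print(resolve_log_stats(stages))      # False: the second stage opted out
print(resolve_log_stats(stages[:1]))  # True: no stage disabled logging
```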
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@princepride @fake0fan @yinpeiqi Thanks for the valuable review! I've fixed them now. Please take another look.

overall LGTM, please fix the CI
Revert "[Log] Wire stat loggers into AsyncOmniEngine to match AsyncLLM (vllm-project#2551)" This reverts commit 5d58abb. Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
…project#2551) Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Purpose
Stat logging wired end-to-end (mirrors AsyncLLM)
Single-threaded StatLoggerManager access
StatLoggerManager is not thread-safe, and record() runs in the orchestrator thread while do_log_stats() is called from the API-server main thread — a data race on the internal accumulators.
Test Plan
Test Result