[Misc][DP] support customized aggregated logger for dp#24354
[Misc][DP] support customized aggregated logger for dp#24354luccafong merged 6 commits intovllm-project:mainfrom
Conversation
|
@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764. |
There was a problem hiding this comment.
Code Review
This pull request introduces support for a customized global logger for data parallel (DP) training, which is a valuable addition for monitoring distributed training. The changes primarily affect the V1 engine, with corresponding updates to examples and tests. My review has identified a few issues: a bug in the updated example code where a request ID is not unique within a loop, a leftover debug print statement in the new logger logic, and an unused parameter in the V0 engine's API that could cause confusion. Addressing these points will improve the quality and correctness of the implementation.
|
@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764. |
cb2d450 to
893ea0f
Compare
|
@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764. |
cb2d450 to
7bfddba
Compare
|
@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764. |
|
This pull request has merge conflicts that must be resolved before it can be |
7bfddba to
8156746
Compare
8633cf4 to
562f552
Compare
| # Each EngineCore's metrics are expressed as a unique label. | ||
| self.prometheus_logger = prometheus_factory(vllm_config, engine_idxs) | ||
| self.global_logger: Optional[StatLoggerBase] = None |
There was a problem hiding this comment.
let's name this as aggregated_logger?
There was a problem hiding this comment.
Perhaps type here better as GlobalStatLoggerBase
| @@ -145,6 +156,63 @@ def log_engine_initialized(self): | |||
| self.vllm_config.cache_config.num_gpu_blocks) | |||
|
|
|||
|
|
|||
| class GlobalStatLogger(LoggingStatLogger, GlobalStatLoggerBase): | |||
There was a problem hiding this comment.
AggregatedLogger is a better name.
There was a problem hiding this comment.
Add some comments to explain the difference?
| @@ -222,6 +230,7 @@ def from_engine_args( | |||
| start_engine_loop: bool = True, | |||
| usage_context: UsageContext = UsageContext.ENGINE_CONTEXT, | |||
| stat_loggers: Optional[list[StatLoggerFactory]] = None, | |||
| stat_logger_global: Optional[GlobalStatLoggerFactory] = None, | |||
There was a problem hiding this comment.
I think we should rename this per_engine_loggers and aggregated_logger.
There was a problem hiding this comment.
Thanks @luccafong!
I think this is ok but I think there's probably a cleaner way of organizing the interface overall (even the current state isn't great).
For example, I think it would be reasonable to just change the interface to be global/aggregated for all loggers, and modify the default LoggingStatLogger to have two modes - aggregated or per engine. Need to think how that would actually be configured.. I guess it could be an env var.
Then we also wouldn't need a special case/field for the PrometheusLogger.
A more annoying problem is the multi-api-server case where we currently disable the logging logger, but that's kind of orthogonal to this I guess.
(btw my comments below might not be applicable in the same way if we refactor per above suggestion).
| # Each EngineCore's metrics are expressed as a unique label. | ||
| self.prometheus_logger = prometheus_factory(vllm_config, engine_idxs) | ||
| self.global_logger: Optional[StatLoggerBase] = None |
There was a problem hiding this comment.
Perhaps type here better as GlobalStatLoggerBase
| now = time.monotonic() | ||
| prompt_throughput = self._get_throughput(self.num_prompt_tokens, now) | ||
| generation_throughput = self._get_throughput( | ||
| self.num_generation_tokens, now) | ||
|
|
||
| self._reset(now) | ||
|
|
||
| scheduler_stats = self.last_scheduler_stats | ||
|
|
||
| log_fn = logger.info | ||
| if not any( | ||
| (prompt_throughput, generation_throughput, | ||
| self.last_prompt_throughput, self.last_generation_throughput)): | ||
| # Avoid log noise on an idle production system | ||
| log_fn = logger.debug | ||
| self.last_generation_throughput = generation_throughput | ||
| self.last_prompt_throughput = prompt_throughput |
There was a problem hiding this comment.
Could we restructure so that this isn't duplicated with the superclass?
thanks @njhill for the review and suggestions. Good point to provide an aggregation option, I am wondering if this enabled, is it weird we have multiple logger? or we skip logging for non zero DP engine in that mode ? I am also not sure how refactoring impact current use cases who rely on PrometheusLogger. |
Re PrometheusLogger... this is already a "global" one right? so it would just mean we can include that in the list as one of the default ones and not have a separate field. Thinking more, maybe we can keep the global stats logger abstract subclass after all (but I agree that it would be better to name it And then have a simple adapter - if a provided stats logger is not an AggregateStatsLogger, wrap it in this adapter which converts non-aggregate into aggregate (just contains dict with n instances). Then our internal field can just be a list of AggregateStatsLoggers and we invoke them all in the same way. PrometheusLogger would just be updated to extend AggregateStatsLogger. And your concrete impl of the aggregate logging one we could name e.g. I hope that makes sense! |
|
@luccafong it looks like there's a test that needs updating: https://buildkite.com/vllm/ci/builds/34468#0199d15b-035d-435e-a030-a453e91e436e and will need another rebase now that all those formatting changes have been made :( |
|
This pull request has merge conflicts that must be resolved before it can be |
5484535 to
ce5c02f
Compare
Signed-off-by: Lu Fang <fanglu@fb.com> fix the test Signed-off-by: Lu Fang <fanglu@fb.com> address comments Signed-off-by: Lu Fang <fanglu@fb.com> add aggregator interface and abstract common logic Signed-off-by: Lu Fang <fanglu@fb.com> add corrupted request aggregation Signed-off-by: Lu Fang <fanglu@fb.com> more refactor Signed-off-by: Lu Fang <fanglu@fb.com> fix ut Signed-off-by: Lu Fang <fanglu@fb.com> fix kv_connector_logging Signed-off-by: Lu Fang <fanglu@fb.com> fix merge conflicts Signed-off-by: Lu Fang <fanglu@fb.com> fix lint Signed-off-by: Lu Fang <fanglu@fb.com> address comments Signed-off-by: Lu Fang <fanglu@fb.com> address comments Signed-off-by: Lu Fang <fanglu@fb.com> address comments Signed-off-by: Lu Fang <fanglu@fb.com> address commnet Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
7ab416c to
554625d
Compare
…24354) Signed-off-by: Lu Fang <fanglu@fb.com> Signed-off-by: 1994 <1994@users.noreply.github.com>
…24354) Signed-off-by: Lu Fang <fanglu@fb.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
…24354) Signed-off-by: Lu Fang <fanglu@fb.com> Signed-off-by: bbartels <benjamin@bartels.dev>
…24354) Signed-off-by: Lu Fang <fanglu@fb.com>
…24354) Signed-off-by: Lu Fang <fanglu@fb.com>
…24354) Signed-off-by: Lu Fang <fanglu@fb.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
…24354) Signed-off-by: Lu Fang <fanglu@fb.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
…24354) Signed-off-by: Lu Fang <fanglu@fb.com>
…24354) Signed-off-by: Lu Fang <fanglu@fb.com>
Purpose
Test Plan
-aggregate-engine-loggingTest Result
logs:
Add option
--aggregate-engine-loggingto enable aggregated log stats.serve
benchmark
Logs:
Essential Elements of an Effective PR Description Checklist