[misc] feat: mlflow improvements #2153
📝 Walkthrough
The PR extends MLflow logging integration throughout the training pipeline. It adds metric logging to evaluation results, implements a more robust metric key sanitizer that handles special characters and slashes, and introduces conditional guards in training utilities so that metrics are logged only when a logging backend is configured.
Sequence Diagram(s)
sequenceDiagram
participant Eval as Evaluation<br/>Process
participant Sanitizer as Metric<br/>Sanitizer
participant MLflow as MLflow<br/>Logger
participant TensorBoard as TensorBoard/<br/>WandB
Eval->>Eval: calculate validation metrics
Eval->>Sanitizer: sanitize metric keys
Sanitizer->>Sanitizer: replace @ with _at_<br/>collapse slashes<br/>clean invalid chars
Sanitizer-->>Eval: return sanitized key map
Eval->>MLflow: log val/{key} metrics
Eval->>TensorBoard: log to WandB/TB<br/>(existing path)
MLflow-->>Eval: acknowledged
TensorBoard-->>Eval: acknowledged
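The evaluation flow in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `log_validation_metrics`, `RecordingLogger`, and the default `sanitize` callable are hypothetical names, and the only assumption about the backend is that it exposes `log_metrics(metrics, step=...)`, which the `mlflow` module does.

```python
from typing import Callable, Dict, List, Optional, Tuple


class RecordingLogger:
    """Hypothetical stand-in for an MLflow-like backend; stores calls in memory."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Dict[str, float], Optional[int]]] = []

    def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
        self.calls.append((dict(metrics), step))


def log_validation_metrics(
    metrics: Dict[str, float],
    step: int,
    logger: Optional[RecordingLogger] = None,
    sanitize: Callable[[str], str] = lambda key: key.replace("@", "_at_"),
) -> Dict[str, float]:
    """Prefix sanitized keys with "val/" and send them to the logger, if any."""
    prefixed = {f"val/{sanitize(key)}": float(value) for key, value in metrics.items()}
    if logger is not None:
        # Only log when a backend is configured; otherwise just return the mapping.
        logger.log_metrics(prefixed, step=step)
    return prefixed
```

Passing `logger=None` returns the prefixed mapping without logging anything, which keeps the function usable when MLflow is not configured.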
sequenceDiagram
participant Training as Training<br/>Loop
participant Guard as Conditional<br/>Check
participant Logger as Logger<br/>Backend
Training->>Guard: check if writer exists?
alt writer configured
Guard->>Logger: log learning rate
Guard->>Logger: log loss metrics
Guard->>Logger: log throughput
Guard->>Logger: log gradients/norms
Logger-->>Guard: metrics recorded
else no writer configured
Guard->>Guard: skip all logging
end
Guard-->>Training: continue iteration
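A sketch of the guard pattern from the second diagram, under stated assumptions: `log_training_step` and `RecordingWriter` are hypothetical names, the TensorBoard-style writer is assumed to expose `add_scalar(tag, value, step)` as `torch.utils.tensorboard.SummaryWriter` does, and the MLflow-style logger is assumed to expose `log_metrics(metrics, step=...)`. The actual checks in the training utilities may be organized differently.

```python
from typing import Any, Dict, List, Optional, Tuple


class RecordingWriter:
    """Hypothetical TensorBoard-style writer that records calls instead of writing."""

    def __init__(self) -> None:
        self.scalars: List[Tuple[str, float, int]] = []

    def add_scalar(self, tag: str, value: float, step: int) -> None:
        self.scalars.append((tag, value, step))


def log_training_step(
    metrics: Dict[str, float],
    step: int,
    tb_writer: Optional[Any] = None,
    mlflow_logger: Optional[Any] = None,
) -> List[str]:
    """Log to each configured backend and return the names of those used."""
    used: List[str] = []
    if tb_writer is not None:
        # TensorBoard-style writers take one scalar per call.
        for tag, value in metrics.items():
            tb_writer.add_scalar(tag, value, step)
        used.append("tensorboard")
    if mlflow_logger is not None:
        # MLflow-style loggers accept the whole dict at once.
        mlflow_logger.log_metrics(metrics, step=step)
        used.append("mlflow")
    return used
```

Because each backend is checked independently, MLflow logging does not depend on a TensorBoard writer being present, and a run with no backend configured simply skips logging.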
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (1 passed)
|
/ok to test 6e6e846 |
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/utils/mlflow_utils.py`:
- Around line 87-99: The helper _sanitize_key inside _sanitize_mlflow_metrics
currently reuses the original key when handling "/" which discards earlier
transforms; change _sanitize_key to accept and return str (add type hints) and
perform the "@ -> _at_" replacement and slash-collapsing first, then use that
sanitized string for the "/"-branch split (i.e., split the already-sanitized
value and replace remaining "/" in the rest), keep the final regex replacement
on the sanitized string, and ensure _sanitize_mlflow_metrics returns the
sanitized-key typed dict as before.
|
@yaoyu-33 @paul-gibbons is there anything else blocking from merging? |
|
/ok to test 543303c |
543303c to 9822b90
|
updated unit tests @yaoyu-33 |
|
@yaoyu-33 could you help run ci again? |
|
/ok to test 9822b90 |
|
/ok to test 5c0e565 |
|
Any blockers for merging this? @yaoyu-33 The workflow failures are unrelated to this pr. Thanks |
|
/ok to test dd6651f |
|
/ok to test 683cda3 |
|
/ok to test a976137 |
|
@ryxli please help check if my code merge was done correctly |
Signed-off-by: Chen Cui <chcui@nvidia.com>
|
@cuichenx fixed some linting issues due to addition of comet logger |
|
/ok to test 73ea5dd |
|
@cuichenx is code coverage okay? I took a look and seems majority is from merging |
cuichenx left a comment
LGTM, low code cov is due to L1 tests not running. they're not required for this pr
What does this PR do?
Ability to use mlflow logging without enabling tensorboard logging.
Add missing log_metrics equivalents for mlflow.
Stylistic choices to organize metrics better in mlflow.
Also recommend setting the environment variable MLFLOW_ENABLE_ASYNC_LOGGING=True.
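For launch scripts that configure the environment from Python, the recommended variable could be set along these lines. The helper name `enable_async_mlflow_logging` is hypothetical; the sketch only sets the variable and assumes it runs before any MLflow logging calls are made.

```python
import os
from typing import MutableMapping


def enable_async_mlflow_logging(env: MutableMapping[str, str] = os.environ) -> str:
    """Set MLFLOW_ENABLE_ASYNC_LOGGING unless the user already chose a value."""
    # setdefault keeps an explicit user setting (for example "False") intact.
    return env.setdefault("MLFLOW_ENABLE_ASYNC_LOGGING", "True")
```

Exporting the variable in the shell before launching training has the same effect.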
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger
the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information