Skip to content

[misc] feat: mlflow improvements#2153

Merged
cuichenx merged 7 commits intoNVIDIA-NeMo:mainfrom
ryxli:mlflow-changes
Mar 23, 2026
Merged

[misc] feat: mlflow improvements#2153
cuichenx merged 7 commits intoNVIDIA-NeMo:mainfrom
ryxli:mlflow-changes

Conversation

@ryxli
Copy link
Contributor

@ryxli ryxli commented Jan 30, 2026

What does this PR do ?

Ability to use mlflow logging without enabling tensorboard logging.
Add missing log_metrics equivalents for mlflow.
Stylistic choices to organize metrics better in mlflow.

Also recommend env MLFLOW_ENABLE_ASYNC_LOGGING=True

Changelog

  • remove conditional check for existence of tensorboard logger, to allow for mlflow only tracking
  • (quality of life change) change mlflow metrics dict key sanitizing
    • mlflow treats "/" as directories, meaning that they can be organized in their own section
    • see attached screenshot for example
  • add missing metrics logged to mlflow which exist for wandb / tensorboard (such as critical validation loss metrics)
Screenshot 2026-01-30 at 12 24 43 PM

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger

the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • [ x] Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added MLflow integration to log validation metrics, providing an additional monitoring and tracking option alongside existing systems
  • Improvements

    • Enhanced metric key formatting with improved special character handling for better compatibility
    • Improved logging reliability by conditionally executing logging operations only when appropriate backends are configured and available

✏️ Tip: You can customize this high-level summary in your review settings.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Jan 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

Walkthrough

The PR extends MLflow logging integration throughout the training pipeline. It adds metric logging to evaluation results, implements a more robust metric key sanitizer handling special characters and slashes, and introduces conditional guards in training utilities to safely log metrics only when logging backends are configured.

Changes

Cohort / File(s) Summary
MLflow Evaluation Logging
src/megatron/bridge/training/eval.py
Adds MLflow integration to evaluation result logging, sanitizing and pushing per-metric validation loss and optional validation perplexity metrics to MLflow, mirroring existing WandB/TensorBoard logging behavior on the last rank.
Metric Sanitization
src/megatron/bridge/training/utils/mlflow_utils.py
Replaces simple key sanitization with robust multi-step logic: converts @ to _at_, collapses consecutive slashes, preserves first segment before slash while replacing subsequent slashes with underscores, and replaces invalid characters with underscores using regex patterns.
Training Logging Guards
src/megatron/bridge/training/utils/train_utils.py
Adds conditional guards throughout to prevent logging when writers/loggers are absent. Reorganizes TensorBoard interval logic, wraps per-metric logging (learning-rate, loss, throughput, gradients, etc.) with writer existence checks, and refines throughput logging with separate per-device vs. global breakdown in MLflow.

Sequence Diagram(s)

sequenceDiagram
    participant Eval as Evaluation<br/>Process
    participant Sanitizer as Metric<br/>Sanitizer
    participant MLflow as MLflow<br/>Logger
    participant TensorBoard as TensorBoard/<br/>WandB

    Eval->>Eval: calculate validation metrics
    Eval->>Sanitizer: sanitize metric keys
    Sanitizer->>Sanitizer: replace @ with _at_<br/>collapse slashes<br/>clean invalid chars
    Sanitizer-->>Eval: return sanitized key map
    
    Eval->>MLflow: log val/{key} metrics
    Eval->>TensorBoard: log to WandB/TB<br/>(existing path)
    
    MLflow-->>Eval: acknowledged
    TensorBoard-->>Eval: acknowledged
Loading
sequenceDiagram
    participant Training as Training<br/>Loop
    participant Guard as Conditional<br/>Check
    participant Logger as Logger<br/>Backend

    Training->>Guard: check if writer exists?
    
    alt writer configured
        Guard->>Logger: log learning rate
        Guard->>Logger: log loss metrics
        Guard->>Logger: log throughput
        Guard->>Logger: log gradients/norms
        Logger-->>Guard: metrics recorded
    else no writer configured
        Guard->>Guard: skip all logging
    end
    
    Guard-->>Training: continue iteration
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • MLFlow Integration #2112: Directly related as this PR continues the MLflow integration work, expanding logging coverage in evaluation and training modules with improved metric sanitization.

Suggested reviewers

  • cuichenx
  • ananthsub
  • ko3n1g
🚥 Pre-merge checks | ✅ 1 | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes ⚠️ Warning PR includes instrumentation changes (MLflow logging) but lacks explicit test results or testing documentation despite checklist marking testing incomplete. Include test results demonstrating MLflow logging works correctly across configurations. Fix identified sanitization bug in _sanitize_key function. Mark testing checklist as completed.
Title check ❓ Inconclusive The title '[misc] feat: mlflow improvements' is vague and generic, using non-descriptive terms that don't clearly convey the specific nature of the changes. Consider a more specific title that highlights the primary change, such as 'Enable MLflow logging independently of TensorBoard' or 'Add MLflow metric logging and improve key sanitization'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ananthsub
Copy link
Contributor

/ok to test 6e6e846

yaoyu-33
yaoyu-33 previously approved these changes Jan 30, 2026
@yaoyu-33 yaoyu-33 enabled auto-merge (squash) January 30, 2026 20:16
@yaoyu-33 yaoyu-33 changed the title mlflow improvements [misc] feat: mlflow improvements Jan 30, 2026
@ryxli ryxli marked this pull request as draft January 30, 2026 20:18
auto-merge was automatically disabled January 30, 2026 20:18

Pull request was converted to draft

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/utils/mlflow_utils.py`:
- Around line 87-99: The helper _sanitize_key inside _sanitize_mlflow_metrics
currently reuses the original key when handling "/" which discards earlier
transforms; change _sanitize_key to accept and return str (add type hints) and
perform the "@ -> _at_" replacement and slash-collapsing first, then use that
sanitized string for the "/"-branch split (i.e., split the already-sanitized
value and replace remaining "/" in the rest), keep the final regex replacement
on the sanitized string, and ensure _sanitize_mlflow_metrics returns the
sanitized-key typed dict as before.

@ryxli ryxli marked this pull request as ready for review January 30, 2026 20:25
@ryxli ryxli requested a review from yaoyu-33 January 30, 2026 20:53
@ryxli
Copy link
Contributor Author

ryxli commented Feb 19, 2026

@yaoyu-33 @paul-gibbons is there anything else blocking from merging?

@yaoyu-33
Copy link
Contributor

/ok to test 543303c

@ryxli
Copy link
Contributor Author

ryxli commented Feb 20, 2026

updated unit tests @yaoyu-33

@ryxli
Copy link
Contributor Author

ryxli commented Feb 23, 2026

@yaoyu-33 could you help run ci again?

@yaoyu-33
Copy link
Contributor

/ok to test 9822b90

yaoyu-33
yaoyu-33 previously approved these changes Feb 24, 2026
@yaoyu-33
Copy link
Contributor

/ok to test 5c0e565

@ryxli
Copy link
Contributor Author

ryxli commented Mar 18, 2026

Any blockers for merging this? @yaoyu-33

The workflow failures are unrelated to this pr. Thanks

@cuichenx
Copy link
Contributor

/ok to test dd6651f

@cuichenx
Copy link
Contributor

/ok to test 683cda3

@cuichenx
Copy link
Contributor

/ok to test a976137

@cuichenx
Copy link
Contributor

@ryxli please help check if my code merge was done correctly

Signed-off-by: Chen Cui <chcui@nvidia.com>
@ryxli
Copy link
Contributor Author

ryxli commented Mar 22, 2026

@cuichenx fixed some linting issues due to addition of comet logger

@cuichenx
Copy link
Contributor

/ok to test 73ea5dd

@ryxli
Copy link
Contributor Author

ryxli commented Mar 23, 2026

@cuichenx is code coverage okay? I took a look and seems majority is from merging

Copy link
Contributor

@cuichenx cuichenx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, low code cov is due to L1 tests not running. they're not required for this pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request ready-to-merge PR is approved, current, and only waiting for CI to pass before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants