[misc] feat: mlflow improvements by ryxli · Pull Request #2153 · NVIDIA-NeMo/Megatron-Bridge

ryxli · 2026-01-30T20:10:43Z

What does this PR do ?

Allows MLflow logging to be used without enabling TensorBoard logging.
Adds the missing log_metrics equivalents for MLflow.
Makes stylistic changes so that metrics are better organized in MLflow.

Also recommends setting the environment variable MLFLOW_ENABLE_ASYNC_LOGGING=True.
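For reference, a minimal sketch of setting this variable from Python rather than from the launch environment. Where to set it in a Megatron-Bridge run is an assumption, not something this PR prescribes; exporting it in the shell before launching works equally well.

```python
import os

# Assumption: set this early, before any MLflow logging call is made,
# so that MLflow sees the value when it decides how to log.
os.environ["MLFLOW_ENABLE_ASYNC_LOGGING"] = "True"
```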

Changelog

Remove the conditional check for the existence of the TensorBoard logger, to allow MLflow-only tracking
(Quality-of-life change) Change how MLflow metrics dict keys are sanitized
- MLflow treats "/" as a directory separator, so metrics that share a prefix can be grouped in their own section
- See the attached screenshot for an example
Add metrics to MLflow that were already logged to WandB / TensorBoard but missing for MLflow (such as critical validation loss metrics)

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

[x] Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Related to # (issue)

Summary by CodeRabbit

Release Notes

New Features
- Added MLflow integration to log validation metrics, providing an additional monitoring and tracking option alongside existing systems
Improvements
- Enhanced metric key formatting with improved special character handling for better compatibility
- Improved logging reliability by conditionally executing logging operations only when appropriate backends are configured and available

copy-pr-bot · 2026-01-30T20:10:47Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-01-30T20:13:21Z

📝 Walkthrough

The PR extends MLflow logging integration throughout the training pipeline. It adds metric logging to evaluation results, implements a more robust metric key sanitizer handling special characters and slashes, and introduces conditional guards in training utilities to safely log metrics only when logging backends are configured.

Changes

| Cohort / File(s) | Summary |
|---|---|
| MLflow Evaluation Logging `src/megatron/bridge/training/eval.py` | Adds MLflow integration to evaluation result logging, sanitizing and pushing per-metric validation loss and optional validation perplexity metrics to MLflow, mirroring existing WandB/TensorBoard logging behavior on the last rank. |
| Metric Sanitization `src/megatron/bridge/training/utils/mlflow_utils.py` | Replaces simple key sanitization with robust multi-step logic: converts `@` to `_at_`, collapses consecutive slashes, preserves first segment before slash while replacing subsequent slashes with underscores, and replaces invalid characters with underscores using regex patterns. |
| Training Logging Guards `src/megatron/bridge/training/utils/train_utils.py` | Adds conditional guards throughout to prevent logging when writers/loggers are absent. Reorganizes TensorBoard interval logic, wraps per-metric logging (learning-rate, loss, throughput, gradients, etc.) with writer existence checks, and refines throughput logging with separate per-device vs. global breakdown in MLflow. |
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Eval as Evaluation<br/>Process
    participant Sanitizer as Metric<br/>Sanitizer
    participant MLflow as MLflow<br/>Logger
    participant TensorBoard as TensorBoard/<br/>WandB

    Eval->>Eval: calculate validation metrics
    Eval->>Sanitizer: sanitize metric keys
    Sanitizer->>Sanitizer: replace @ with _at_<br/>collapse slashes<br/>clean invalid chars
    Sanitizer-->>Eval: return sanitized key map

    Eval->>MLflow: log val/{key} metrics
    Eval->>TensorBoard: log to WandB/TB<br/>(existing path)

    MLflow-->>Eval: acknowledged
    TensorBoard-->>Eval: acknowledged
```
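The evaluation flow above can be sketched roughly as follows. This is an illustration only, not the code in `eval.py`: the function name `build_val_metrics`, the ` ppl` key suffix, and the clamp applied before exponentiation are all assumptions made for the example. Only the `val/{key}` prefix comes from the diagram.

```python
import math
from typing import Dict


def build_val_metrics(loss_dict: Dict[str, float], log_perplexity: bool = False) -> Dict[str, float]:
    """Build a flat dict of validation metrics keyed as val/{key} (hypothetical helper)."""
    metrics: Dict[str, float] = {}
    for key, loss in loss_dict.items():
        metrics[f"val/{key}"] = float(loss)
        if log_perplexity:
            # Assumption: clamp the loss so math.exp cannot overflow on a diverged run.
            metrics[f"val/{key} ppl"] = math.exp(min(20.0, float(loss)))
    return metrics


# The resulting dict would then be sanitized and passed to the MLflow logger.
val_metrics = build_val_metrics({"lm loss": 0.0}, log_perplexity=True)
```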

```mermaid
sequenceDiagram
    participant Training as Training<br/>Loop
    participant Guard as Conditional<br/>Check
    participant Logger as Logger<br/>Backend

    Training->>Guard: check if writer exists?

    alt writer configured
        Guard->>Logger: log learning rate
        Guard->>Logger: log loss metrics
        Guard->>Logger: log throughput
        Guard->>Logger: log gradients/norms
        Logger-->>Guard: metrics recorded
    else no writer configured
        Guard->>Guard: skip all logging
    end

    Guard-->>Training: continue iteration
```
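The guard pattern in the diagram above can be sketched as below. `RecordingWriter` and `log_training_metrics` are hypothetical names used only for this illustration; the real writers in `train_utils.py` have their own interfaces. The point is that each backend is checked independently, so an MLflow-only setup still logs.

```python
from typing import Dict, List, Optional, Tuple


class RecordingWriter:
    """Hypothetical stand-in for a TensorBoard, WandB, or MLflow writer."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, float, int]] = []

    def log(self, name: str, value: float, step: int) -> None:
        self.records.append((name, value, step))


def log_training_metrics(
    metrics: Dict[str, float],
    step: int,
    tb_writer: Optional[RecordingWriter] = None,
    mlflow_writer: Optional[RecordingWriter] = None,
) -> int:
    """Log to each configured backend and return how many backends were written."""
    backends_written = 0
    for writer in (tb_writer, mlflow_writer):
        if writer is None:
            # No writer configured for this backend: skip it without failing.
            continue
        for name, value in metrics.items():
            writer.log(name, value, step)
        backends_written += 1
    return backends_written


# MLflow-only tracking: no TensorBoard writer is passed at all.
mlflow_only = RecordingWriter()
log_training_metrics({"learning-rate": 1e-4}, step=10, mlflow_writer=mlflow_only)
```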

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

MLFlow Integration #2112: Directly related as this PR continues the MLflow integration work, expanding logging coverage in evaluation and training modules with improved metric sanitization.

Suggested reviewers

cuichenx
ananthsub
ko3n1g

🚥 Pre-merge checks | ✅ 1 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Results For Major Changes | ⚠️ Warning | PR includes instrumentation changes (MLflow logging) but lacks explicit test results or testing documentation despite checklist marking testing incomplete. | Include test results demonstrating MLflow logging works correctly across configurations. Fix identified sanitization bug in _sanitize_key function. Mark testing checklist as completed. |
| Title check | ❓ Inconclusive | The title '[misc] feat: mlflow improvements' is vague and generic, using non-descriptive terms that don't clearly convey the specific nature of the changes. | Consider a more specific title that highlights the primary change, such as 'Enable MLflow logging independently of TensorBoard' or 'Add MLflow metric logging and improve key sanitization'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

ananthsub · 2026-01-30T20:14:40Z

/ok to test 6e6e846

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/megatron/bridge/training/utils/mlflow_utils.py`:
- Around line 87-99: The helper _sanitize_key inside _sanitize_mlflow_metrics
currently reuses the original key when handling "/" which discards earlier
transforms; change _sanitize_key to accept and return str (add type hints) and
perform the "@ -> _at_" replacement and slash-collapsing first, then use that
sanitized string for the "/"-branch split (i.e., split the already-sanitized
value and replace remaining "/" in the rest), keep the final regex replacement
on the sanitized string, and ensure _sanitize_mlflow_metrics returns the
sanitized-key typed dict as before.
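
A minimal sketch of the ordering described above, for illustration only. The names `sanitize_key` and `sanitize_metrics` are hypothetical stand-ins for the PR's helpers, and the allowed-character set in the final regex (alphanumerics, `_`, `-`, `.`, space, `/`) is an assumption rather than the pattern used in `mlflow_utils.py`.

```python
import re
from typing import Dict


def sanitize_key(key: str) -> str:
    """Sanitize one metric key, applying every step to the already-sanitized value."""
    sanitized = key.replace("@", "_at_")
    # Collapse runs of slashes so "val//loss" becomes "val/loss".
    sanitized = re.sub(r"/+", "/", sanitized)
    if "/" in sanitized:
        # Keep the first segment as the group; flatten any later slashes.
        head, rest = sanitized.split("/", 1)
        sanitized = head + "/" + rest.replace("/", "_")
    # Assumed allowed set; anything else becomes an underscore.
    return re.sub(r"[^A-Za-z0-9_\-. /]", "_", sanitized)


def sanitize_metrics(metrics: Dict[str, float]) -> Dict[str, float]:
    """Return a new dict with sanitized keys and unchanged values."""
    return {sanitize_key(k): v for k, v in metrics.items()}


print(sanitize_key("val//loss/top@1"))  # val/loss_top_at_1
```

Because the split operates on the sanitized string rather than the original key, the `@` replacement and slash collapsing are not lost when a key contains "/".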

src/megatron/bridge/training/utils/mlflow_utils.py

ryxli · 2026-02-19T00:30:02Z

@yaoyu-33 @paul-gibbons is there anything else blocking from merging?

yaoyu-33 · 2026-02-20T22:28:14Z

/ok to test 543303c

ryxli · 2026-02-20T23:17:20Z

updated unit tests @yaoyu-33

ryxli · 2026-02-23T22:32:54Z

@yaoyu-33 could you help run ci again?

yaoyu-33 · 2026-02-24T00:26:03Z

/ok to test 9822b90

yaoyu-33 · 2026-02-25T18:40:33Z

/ok to test 5c0e565

ryxli · 2026-03-18T20:31:17Z

Any blockers for merging this? @yaoyu-33

The workflow failures are unrelated to this pr. Thanks

cuichenx · 2026-03-18T22:21:41Z

/ok to test dd6651f

cuichenx · 2026-03-19T23:37:26Z

/ok to test 683cda3

cuichenx · 2026-03-20T23:23:07Z

/ok to test a976137

cuichenx · 2026-03-20T23:23:27Z

@ryxli please help check if my code merge was done correctly

Signed-off-by: Chen Cui <chcui@nvidia.com>

ryxli · 2026-03-22T04:57:55Z

@cuichenx fixed some linting issues due to addition of comet logger

cuichenx · 2026-03-23T03:12:57Z

/ok to test 73ea5dd

ryxli · 2026-03-23T20:29:37Z

@cuichenx is code coverage okay? I took a look and seems majority is from merging

cuichenx

LGTM, low code cov is due to L1 tests not running. they're not required for this pr

github-actions bot added the community-request label Jan 30, 2026

ryxli force-pushed the mlflow-changes branch from 7dd91c3 to 6e6e846 Compare January 30, 2026 20:12

copy-pr-bot bot temporarily deployed to nemo-ci January 30, 2026 20:15 Inactive

copy-pr-bot bot temporarily deployed to test January 30, 2026 20:15 Inactive

yaoyu-33 previously approved these changes Jan 30, 2026

yaoyu-33 enabled auto-merge (squash) January 30, 2026 20:16

yaoyu-33 changed the title ~~mlflow improvements~~ [misc] feat: mlflow improvements Jan 30, 2026

ryxli marked this pull request as draft January 30, 2026 20:18

auto-merge was automatically disabled January 30, 2026 20:18
Pull request was converted to draft

coderabbitai bot reviewed Jan 30, 2026

src/megatron/bridge/training/utils/mlflow_utils.py (comment thread resolved)

ryxli marked this pull request as ready for review January 30, 2026 20:25

ryxli dismissed yaoyu-33’s stale review via 04211ab January 30, 2026 20:28

ryxli force-pushed the mlflow-changes branch from 6e6e846 to 04211ab Compare January 30, 2026 20:28

ryxli requested a review from yaoyu-33 January 30, 2026 20:53

ananthsub mentioned this pull request Feb 3, 2026

Bridging perf from NeMo2 to Mbridge for certain configs #2199

Merged

5 tasks

copy-pr-bot bot temporarily deployed to test February 20, 2026 22:28 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 20, 2026 22:38 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci February 20, 2026 22:46 Failure

ryxli force-pushed the mlflow-changes branch from 543303c to 9822b90 Compare February 20, 2026 23:16

yaoyu-33 previously approved these changes Feb 24, 2026

copy-pr-bot bot temporarily deployed to test February 24, 2026 00:26 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 17, 2026 20:38 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci March 17, 2026 20:38 Error

copy-pr-bot bot requested a deployment to nemo-ci March 17, 2026 20:38 In progress

copy-pr-bot bot had a problem deploying to nemo-ci March 17, 2026 20:38 Error

Merge branch 'main' into mlflow-changes

dd6651f

copy-pr-bot bot temporarily deployed to test March 18, 2026 22:22 Inactive

Merge branch 'main' into mlflow-changes

683cda3

Merge branch 'main' into mlflow-changes

2828c9e

Signed-off-by: Chen Cui <chcui@nvidia.com>

Merge branch 'main' into mlflow-changes

73ea5dd

cuichenx approved these changes Mar 23, 2026

5 participants
