Conversation
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Co-authored-by: Philip Petrakian <pgpetrak@gmail.com>
Hi @Phlip79 - implemented all the changes.

Thanks for the contribution, all look good. Had one comment.
maanug-nv left a comment:

Thanks for adding the support. Left 1 suggestion. Can I also ask you to add unit tests similar to what we do for wandb?
/ok to test 3148469
```python
logger.log(*args, **kwargs)
# ...
def safe_serialize(obj) -> str:
```
I'm not sure if we need this anymore after NVIDIA/Megatron-LM#2055
cc @suiyoubi
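For context, `safe_serialize` is a last-resort serializer for logging. A minimal sketch of what such a function might look like — the fallback strings here are assumptions, and the real implementation in log_utils.py may differ:

```python
def safe_serialize(obj: object) -> str:
    """Best-effort serialization that never raises.

    Sketch only: falls back progressively when __str__/__repr__ are broken.
    """
    try:
        return str(obj)
    except Exception:
        # str() can raise for partially initialized or exotic objects.
        try:
            return repr(obj)
        except Exception:
            return f"<unserializable {type(obj).__name__}>"
```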
Hello, is the MLFlow integration to be released soon?

@Phlip79 / @maanug-nv - Can we run the CI?

/ok to test e07db35

/ok to test cd4a36a
📝 Walkthrough

This PR adds comprehensive MLFlow logging integration to the training framework, parallel to the existing W&B support. Changes include MLFlow configuration options, logger initialization, checkpoint artifact logging, and metric collection throughout the training loop.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Config as Config Validation
    participant GlobalState as GlobalState
    participant MLFlow as MLFlow
    participant Training as Training Loop
    participant Checkpoint as Checkpoint Manager
    Config->>Config: Validate MLFlow config<br/>(finalize())
    Config->>GlobalState: Config ready
    Training->>GlobalState: Request mlflow_logger
    GlobalState->>MLFlow: Initialize run<br/>(experiment, tracking_uri)
    MLFlow-->>GlobalState: MLFlow run handle
    GlobalState-->>Training: Return mlflow_logger
    loop Each training iteration
        Training->>Training: Compute metrics<br/>(loss, throughput, etc.)
        Training->>MLFlow: Log sanitized metrics<br/>(log_metrics)
        Training->>MLFlow: Log timer data<br/>(write_to_mlflow)
    end
    Training->>Checkpoint: Save checkpoint
    Checkpoint->>MLFlow: Log checkpoint artifacts<br/>(on_save_checkpoint_success)
    MLFlow-->>Checkpoint: Artifacts logged
    Training->>Checkpoint: Load checkpoint
    Checkpoint->>MLFlow: Set MLFlow tags<br/>(on_load_checkpoint_success)
    MLFlow-->>Checkpoint: Tags set
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@tutorials/recipes/llama/conf/llama32_1b_finetune.yaml`:
- Around line 75-82: The mlflow_tracking_uri is currently nested under
mlflow_tags in the YAML, which will treat it as a tag instead of a top-level
LoggerConfig field; update the YAML so mlflow_tracking_uri is a sibling of
mlflow_tags (same indentation level) rather than nested under it, ensuring it
matches the LoggerConfig field name used in config.py and will be parsed as the
logger's tracking URI.
In `@tutorials/recipes/llama/conf/llama32_1b_pretrain.yaml`:
- Around line 65-71: The mlflow_tracking_uri key is incorrectly nested under
mlflow_tags; move mlflow_tracking_uri out of the mlflow_tags mapping and place
it as a top-level MLflow logger field (peer to mlflow_experiment /
mlflow_run_name), i.e., remove mlflow_tracking_uri from under mlflow_tags and
add mlflow_tracking_uri: <your_uri> at the same indentation level as
mlflow_experiment so the logger can read it correctly.
🧹 Nitpick comments (6)
src/megatron/bridge/training/utils/log_utils.py (1)

178-194: Consider adding a type hint for the `obj` parameter.

The function is well-designed for its purpose of safely serializing objects that may have a broken `__str__` method. The broad `Exception` catch is appropriate here since this is a last-resort fallback.

Optional: add the type hint.

```diff
-def safe_serialize(obj) -> str:
+def safe_serialize(obj: object) -> str:
```

docs/training/logging.md (2)
173-175: Improve the install command formatting.

The current format `pip install mlflow / uv add mlflow` is confusing. Consider separating these as distinct options.

Suggested improvement:

```diff
-pip install mlflow / uv add mlflow
+pip install mlflow
+# or with uv:
+# uv add mlflow
```
177-179: Fix markdown list indentation.

The list items should not have leading indentation, to comply with markdown linting standards.

Suggested fix:

```diff
 2) Configure the tracking server (Optional):
-   - Either set `MLFLOW_TRACKING_URI` in the environment, or
-   - Pass an explicit `mlflow_tracking_uri` in the logger config.
+- Either set `MLFLOW_TRACKING_URI` in the environment, or
+- Pass an explicit `mlflow_tracking_uri` in the logger config.
```

src/megatron/bridge/training/utils/train_utils.py (1)
538-540: Loss metrics should be sanitized for MLFlow.

The `loss_dict` keys may contain "/" characters (e.g., "lm_loss/validation"). Other metric dictionaries are sanitized via `_sanitize_mlflow_metrics`, but here the raw keys are used directly.

♻️ Proposed fix:

```diff
 if mlflow_logger:
-    loss_metrics = {key: float(val) for key, val in loss_dict.items()}
+    loss_metrics = _sanitize_mlflow_metrics({key: float(val) for key, val in loss_dict.items()})
     mlflow_logger.log_metrics(loss_metrics, step=iteration)
```

tests/unit_tests/training/utils/test_mlflow_utils.py (1)
26-27: Consider adding pytest markers for test categorization.

As per coding guidelines, tests should use `pytest.mark` to categorize tests (unit, integration, system). These are unit tests and should be marked accordingly.

♻️ Proposed fix:

```diff
+import pytest
+
+
+@pytest.mark.unit
 class TestOnSaveCheckpointSuccess:
     """Test cases for on_save_checkpoint_success function."""
```

Apply a similar `@pytest.mark.unit` decorator to the `TestOnLoadCheckpointSuccess` and `TestSanitizeMlflowMetrics` classes.

Also applies to: 137-138, 212-213
src/megatron/bridge/training/state.py (1)
446-449: Add a `stacklevel` parameter to `warnings.warn`.

The warning message will point to this line rather than the caller. Setting `stacklevel=2` would improve debuggability.

♻️ Proposed fix:

```diff
 except Exception:
     import warnings
-    warnings.warn("Failed to log timer metrics to MLFlow; continuing without timer metrics.")
+    warnings.warn("Failed to log timer metrics to MLFlow; continuing without timer metrics.", stacklevel=2)
```
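To illustrate what `stacklevel=2` buys: the warning is attributed to the caller of the helper rather than the helper's own line, which is where a user can actually act on the problem. A standalone sketch (the function names are illustrative, not from the PR):

```python
import warnings


def log_timers_to_mlflow():
    # With stacklevel=2 the reported location is the caller's line,
    # not this warnings.warn() call inside the helper.
    warnings.warn(
        "Failed to log timer metrics to MLFlow; continuing without timer metrics.",
        stacklevel=2,
    )


def training_step():
    log_timers_to_mlflow()  # the warning is attributed to this call site


training_step()
```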
```yaml
# mlflow_experiment: llama32_1b_finetuned # Uncomment to enable MLFlow logging
# mlflow_run_name: llama32_1b_finetuned
# mlflow_tags:
#   project: llama32
#   phase: finetune
#   variant: mlflow_example
#   mlflow_tracking_uri: http://localhost:5000 # Optional: use remote MLflow server
```
mlflow_tracking_uri is incorrectly nested under mlflow_tags.
Based on the LoggerConfig structure in config.py, mlflow_tracking_uri should be a sibling field to mlflow_tags, not nested within it. The current indentation would cause the tracking URI to be treated as a tag value rather than the configuration parameter.
Suggested fix
```diff
 # mlflow_experiment: llama32_1b_finetuned # Uncomment to enable MLFlow logging
 # mlflow_run_name: llama32_1b_finetuned
 # mlflow_tags:
 #   project: llama32
 #   phase: finetune
 #   variant: mlflow_example
-#   mlflow_tracking_uri: http://localhost:5000 # Optional: use remote MLflow server
+# mlflow_tracking_uri: http://localhost:5000 # Optional: use remote MLflow server
```
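Equivalently from Python, a hedged sketch of how these fields would be set. The `LoggerConfig` class below is a simplified stand-in carrying only the `mlflow_*` field names added by this PR, not the real class from config.py:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoggerConfig:
    """Simplified stand-in; the real class in config.py has many more fields."""

    mlflow_experiment: Optional[str] = None
    mlflow_run_name: Optional[str] = None
    mlflow_tracking_uri: Optional[str] = None  # sibling of mlflow_tags, not nested in it
    mlflow_tags: dict = field(default_factory=dict)


cfg = LoggerConfig(
    mlflow_experiment="llama32_1b_finetuned",
    mlflow_run_name="llama32_1b_finetuned",
    mlflow_tracking_uri="http://localhost:5000",
    mlflow_tags={"project": "llama32", "phase": "finetune", "variant": "mlflow_example"},
)
```

Note that the tracking URI is a top-level field, mirroring the YAML fix: it must not end up inside the `mlflow_tags` mapping.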
```yaml
# mlflow_experiment: llama32_1b_pretrain # Uncomment to enable MLFlow logging
# mlflow_run_name: llama32_1b_pretrain_run
# mlflow_tags:
#   project: llama32
#   phase: pretrain
#   variant: mlflow_example
#   mlflow_tracking_uri: http://localhost:5000 # Optional: use remote MLflow server
```
mlflow_tracking_uri is incorrectly nested under mlflow_tags.
Same issue as in the finetune config - mlflow_tracking_uri should be a top-level logger field, not nested under mlflow_tags.
Suggested fix
```diff
 # mlflow_experiment: llama32_1b_pretrain # Uncomment to enable MLFlow logging
 # mlflow_run_name: llama32_1b_pretrain_run
 # mlflow_tags:
 #   project: llama32
 #   phase: pretrain
 #   variant: mlflow_example
-#   mlflow_tracking_uri: http://localhost:5000 # Optional: use remote MLflow server
+# mlflow_tracking_uri: http://localhost:5000 # Optional: use remote MLflow server
```
Add comprehensive MLFlow support to Megatron Bridge for experiment tracking and artifact logging.

- Add MLFlow logger support in GlobalState with configurable experiment, run name, tracking URI, and tags
- Log training metrics (losses, learning rate, batch size, throughput, timers, memory, runtime, norms, energy) to MLFlow
- Log checkpoint artifacts to MLFlow with iteration-based artifact paths
- Add MLFlow configuration options to LoggerConfig (mlflow_experiment, mlflow_run_name, mlflow_tracking_uri, mlflow_tags)
- Add validation in LoggerConfig.finalize() to check MLFlow availability
- Move safe_serialize to log_utils.py for reuse across WandB and MLFlow
- Add comprehensive unit tests for MLFlow utilities
- Add documentation for MLFlow logging configuration

Based on community contribution from @therealnaveenkamal in PR #1542.

Co-authored-by: Naveenraj Kamalakannan <therealnaveenkamal@users.noreply.github.com>
What does this PR do?
Add MLFlow integration for experiment tracking and artifact logging.
This PR adds comprehensive MLFlow support to Megatron Bridge, enabling users to log training metrics, configuration parameters, and checkpoint artifacts to MLFlow tracking servers. The integration includes automatic checkpoint artifact logging with iteration-based artifact paths.
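As a sketch of the "iteration-based artifact paths" idea, a small helper like the following could build the path — the function name and zero-padding scheme are assumptions for illustration, not the PR's exact code:

```python
def checkpoint_artifact_path(iteration: int) -> str:
    """Build an MLflow artifact path that groups checkpoints by iteration."""
    # Zero-padding keeps lexicographic ordering consistent with numeric order
    # when browsing artifacts in the MLflow UI.
    return f"checkpoints/iter_{iteration:07d}"
```

The resulting string (e.g. `checkpoints/iter_0002000` for iteration 2000) would then be passed as the `artifact_path` argument of `mlflow.log_artifact` when a checkpoint is saved.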
MLFlow logging follows the same pattern as the existing W&B integration and can be enabled via LoggerConfig with mlflow_experiment and mlflow_run_name parameters.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open a "Draft" PR.
Additional Information
cc @Phlip79
Summary by CodeRabbit
New Features
Documentation
Tests
Chores