73 changes: 66 additions & 7 deletions docs/design-docs/logger.md
@@ -1,6 +1,6 @@
# Logger

The logger is designed to track key training metrics (including distributed metrics with reductions and timing), as well as providing integration with logging backends like WandB and Tensorboard.
The logger is designed to track key training metrics (including distributed metrics with reductions and timing), as well as providing integration with logging backends like WandB, Tensorboard, and MLflow.

## Requirements

@@ -9,12 +9,13 @@ The logger is designed to track key training metrics (including distributed metr
* Logging:
* WandB
* Tensorboard
* MLflow

## Overall Design

Since there is a single controller, the single process running the main training loop will gather the metrics and do the logging.

To handle multiple logger backends, we will have a {py:class}`LoggerInterface <nemo_rl.utils.logger.LoggerInterface>` interface that the {py:class}`TensorboardLogger <nemo_rl.utils.logger.TensorboardLogger>` and {py:class}`WandbLogger <nemo_rl.utils.logger.WandbLogger>` will implement:
To handle multiple logger backends, we will have a {py:class}`LoggerInterface <nemo_rl.utils.logger.LoggerInterface>` interface that the {py:class}`TensorboardLogger <nemo_rl.utils.logger.TensorboardLogger>`, {py:class}`WandbLogger <nemo_rl.utils.logger.WandbLogger>`, and {py:class}`MLflowLogger <nemo_rl.utils.logger.MLflowLogger>` will implement:

```python
class LoggerInterface(ABC):
    ...  # remaining abstract methods collapsed in this diff view
```

@@ -34,10 +35,11 @@ class LoggerInterface(ABC):
A {py:class}`Logger <nemo_rl.utils.logger.Logger>` wrapper class will also implement {py:class}`LoggerInterface <nemo_rl.utils.logger.LoggerInterface>` and maintain a list of loggers to which it delegates writing logs. This will be the main class the user uses in the training loop. Usage example:

```python
# Initialize logger with both wandb and tensorboard enabled
# Initialize logger with wandb and mlflow enabled
logging_config = {
    "wandb_enabled": True,
    "tensorboard_enabled": False,
    "mlflow_enabled": True,

    "wandb": {
        "project": "grpo-dev",
@@ -46,17 +48,72 @@ logging_config = {
    "tensorboard": {
        "log_dir": "logs",
    },
    "mlflow": {
        "experiment_name": "nemo-rl-experiment",
        "run_name": "grpo-dev-run",
        "tracking_uri": None,  # Use local tracking
    },
}
logger = Logger(
    cfg=logging_config,
)

# Log metrics, will go to both wandb and tensorboard
# Log metrics, will go to all enabled backends
logger.log_metrics({
    "loss": 0.123,
}, step=10)
```
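The delegation described above can be sketched minimally as follows. This is an illustrative stand-in, not the actual nemo_rl implementation: the backend list constructor, `PrintLogger`, and the recorded tuples are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Mapping


class LoggerInterface(ABC):
    """Minimal stand-in for the backend interface described above."""

    @abstractmethod
    def log_metrics(self, metrics: Mapping[str, Any], step: int) -> None: ...


class PrintLogger(LoggerInterface):
    """Toy backend that just records what it was asked to log."""

    def __init__(self) -> None:
        self.records: list[tuple[int, dict]] = []

    def log_metrics(self, metrics: Mapping[str, Any], step: int) -> None:
        self.records.append((step, dict(metrics)))


class Logger(LoggerInterface):
    """Wrapper that fans each call out to every enabled backend."""

    def __init__(self, backends: list[LoggerInterface]) -> None:
        self.backends = backends

    def log_metrics(self, metrics: Mapping[str, Any], step: int) -> None:
        for backend in self.backends:
            backend.log_metrics(metrics, step)
```

Because every backend implements the same interface, the training loop only ever talks to the wrapper; adding a new backend (such as MLflow) means implementing `LoggerInterface` and appending it to the list.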

## Supported Logging Backends

The logger supports three logging backends:

### WandB (Weights & Biases)
- Provides cloud-based experiment tracking
- Supports custom step metrics for better visualization
- Includes built-in hyperparameter logging
- Offers rich visualization and collaboration features

### Tensorboard
- Local file-based logging
- Standard TensorBoard visualization
- Supports hyperparameter logging via HParams
- Lightweight and self-contained

### MLflow
- Comprehensive platform for experiment tracking and model management
- Supports both local and remote tracking servers
- Provides model versioning and artifact management
- Includes a web UI for experiment visualization
- Supports model deployment and serving

#### MLflow Configuration

MLflow can be configured with the following parameters:

```yaml
mlflow:
  experiment_name: "nemo-rl-experiment"  # Name of the MLflow experiment
  run_name: "my-training-run"            # Run name
  tracking_uri: "http://localhost:5000"  # Optional tracking server URI
```
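As a rough illustration of how these settings might be resolved before being handed to MLflow: when no tracking URI is set, MLflow falls back to a local `./mlruns` directory. The helper below is hypothetical (not part of nemo_rl), sketched under that documented default.

```python
def resolve_mlflow_config(cfg: dict) -> dict:
    """Fill in defaults for the optional MLflow settings (hypothetical helper)."""
    return {
        "experiment_name": cfg.get("experiment_name", "Default"),
        "run_name": cfg.get("run_name"),  # MLflow auto-generates a name if None
        # With no tracking URI configured, MLflow writes to a local ./mlruns directory.
        "tracking_uri": cfg.get("tracking_uri") or "file:./mlruns",
    }
```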


#### MLflow UI

After starting training with MLflow enabled, you can view the MLflow UI to monitor your experiments:

```bash
# Start MLflow UI (run in a separate terminal)
mlflow ui --host 0.0.0.0 --port 5000
```

Then access the UI at `http://127.0.0.1:5000/` to view:
- Training runs and experiments
- Metrics (loss, validation metrics, etc.)
- Hyperparameters
- Model artifacts and checkpoints

## Validation Pretty Logging

The logger supports pretty-formatted logging of validation samples to help visualize model outputs during training. This feature is controlled by the `num_val_samples_to_print` configuration parameter.
@@ -65,6 +122,7 @@ The logger supports pretty-formatted logging of validation samples to help visua

```yaml
logger:
  wandb_enabled: false
  tensorboard_enabled: false
  mlflow_enabled: false
  num_val_samples_to_print: 10
```
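A minimal sketch of what `num_val_samples_to_print` might control, namely truncating the logged samples to the first N. The function name and sample fields here are assumptions for illustration, not the actual formatting code.

```python
def format_val_samples(samples: list[dict], num_to_print: int) -> str:
    """Pretty-format the first `num_to_print` validation samples (illustrative)."""
    lines = []
    for i, sample in enumerate(samples[:num_to_print]):
        lines.append(f"=== Validation sample {i} ===")
        for key, value in sample.items():
            lines.append(f"{key}: {value}")
    return "\n".join(lines)
```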

@@ -82,16 +140,17 @@ When enabled, the pretty logging will generate formatted text similar to:

## GPU Metric Logging

NeMo RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard and/or WandB.
NeMo RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard, WandB, and/or MLflow.

This approach allows us to offer the same GPU metric tracking on all loggers (not just Wandb) and simplifies the implementation greatly.
This approach allows us to offer the same GPU metric tracking on all loggers and simplifies the implementation greatly.

This feature is enabled with the `monitor_gpus` configuration parameter. The frequency of data collection and flushing to the loggers is controlled by the `gpu_monitoring.collection_interval` and `gpu_monitoring.flush_interval` parameters, both specified in seconds.

```yaml
logger:
  wandb_enabled: false
  tensorboard_enabled: false
  mlflow_enabled: false
  monitor_gpus: true
  gpu_monitoring:
    collection_interval: 10
```

@@ -103,7 +162,7 @@ While it is feasible to monitor using remote workers, the implementation require
* Logs sent back to the driver do not introduce significant overhead.
* Metrics remain clear and interpretable, avoiding issues like double counting caused by colocated workers.
* Workers can gracefully flush their logs in case of failure.
* Logging behaves consistently across TensorBoard and Wandb.
* Logging behaves consistently across TensorBoard, WandB, and MLflow.
* Workers that spawn other workers accurately report the total resource usage of any grandchild workers.

Due to these complexities, we opted for a simpler approach: collecting metrics exposed by the Ray metrics server from the driver.
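The driver-side collect-and-flush loop this section describes could be sketched as follows. This is a simplified model under stated assumptions: the function name is hypothetical, and the actual Ray metrics query is stubbed out as `poll_fn`.

```python
import time


def monitor_gpus(poll_fn, log_fn, collection_interval, flush_interval, duration):
    """Collect a GPU sample every `collection_interval` seconds and flush the
    buffered samples to the loggers every `flush_interval` seconds (sketch)."""
    buffer = []
    last_flush = time.monotonic()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        buffer.append(poll_fn())  # e.g. query Ray's metrics endpoint
        now = time.monotonic()
        if now - last_flush >= flush_interval and buffer:
            log_fn(buffer)        # fan out to all enabled backends
            buffer = []
            last_flush = now
        time.sleep(collection_interval)
```

Separating the two intervals lets the driver sample frequently without paying the logging-backend write cost on every sample.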
1 change: 1 addition & 0 deletions examples/configs/dpo.yaml
@@ -153,6 +153,7 @@ logger:
log_dir: "logs" # Base directory for all logs
wandb_enabled: false # Make sure you do a ``wandb login [Your API key]'' before running
tensorboard_enabled: false
mlflow_enabled: false # Disable MLflow logging
monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "dpo-dev"
1 change: 1 addition & 0 deletions examples/configs/grpo-deepscaler-1.5b-8K.yaml
@@ -127,6 +127,7 @@ logger:
num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
wandb_enabled: false
tensorboard_enabled: false
mlflow_enabled: false
monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "grpo-dev"
1 change: 1 addition & 0 deletions examples/configs/grpo_math_1B.yaml
@@ -132,6 +132,7 @@ logger:
num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
wandb_enabled: false
tensorboard_enabled: false
mlflow_enabled: false # Disable MLflow logging
monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "grpo-dev"
1 change: 1 addition & 0 deletions examples/configs/grpo_math_1B_megatron.yaml
@@ -153,6 +153,7 @@ logger:
num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
wandb_enabled: false
tensorboard_enabled: false
mlflow_enabled: false # Disable MLflow logging
monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "grpo-dev"
1 change: 1 addition & 0 deletions examples/configs/grpo_sliding_puzzle.yaml
@@ -52,6 +52,7 @@ logger:
num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
wandb_enabled: false
tensorboard_enabled: false
mlflow_enabled: false
monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "grpo-dev"
@@ -78,6 +78,7 @@ logger:
log_dir: "logs"
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -78,6 +78,7 @@ logger:
log_dir: "logs"
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -111,6 +111,7 @@ logger:
log_dir: "logs"
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -111,6 +111,7 @@ logger:
log_dir: "logs"
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -78,6 +78,7 @@ logger:
log_dir: "logs"
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -108,6 +108,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -58,6 +58,7 @@ logger:
log_dir: logs/sft-llama3.1-8b-instruct-1n8g-fsdp2tp1-long
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -58,6 +58,7 @@ logger:
log_dir: logs/sft-llama3.1-8b-instruct-1n8g-fsdp2tp2sp
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -102,6 +102,7 @@ logger:
log_dir: logs/sft-llama3.1-8b-instruct-1n8g-fsdp1
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -50,14 +50,15 @@ policy:
fused: false
data:
max_input_seq_length: 1024
dataset_name: squad
dataset_name: squad
add_bos: true
add_eos: true
add_generation_prompt: false
logger:
log_dir: logs/sft-llama3.2-1b-1n8g-fsdp2tp1
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -58,6 +58,7 @@ logger:
log_dir: logs/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
1 change: 1 addition & 0 deletions examples/configs/sft.yaml
@@ -133,6 +133,7 @@ logger:
log_dir: "logs" # Base directory for all logs
wandb_enabled: true # Make sure you do a ``wandb login [Your API key]'' before running
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "sft-dev"
1 change: 1 addition & 0 deletions examples/configs/sft_openmathinstruct2.yaml
@@ -71,6 +71,7 @@ logger:
log_dir: "logs" # Base directory for all logs
wandb_enabled: true # Make sure you do a ``wandb login [Your API key]'' before running
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "sft-dev"