73 changes: 66 additions & 7 deletions docs/design-docs/logger.md
@@ -1,6 +1,6 @@
# Logger

The logger is designed to track key training metrics (including distributed metrics with reductions and timing), as well as providing integration with logging backends like WandB and Tensorboard.
The logger is designed to track key training metrics (including distributed metrics with reductions and timing), as well as providing integration with logging backends like WandB, Tensorboard, and MLflow.

## Requirements

@@ -9,12 +9,13 @@ The logger is designed to track key training metrics (including distributed metr
* Logging:
* WandB
* Tensorboard
* MLflow

## Overall Design

Since there is a single controller, the single process running the main training loop will gather the metrics and do the logging.

To handle multiple logger backends, we will have a {py:class}`LoggerInterface <nemo_rl.utils.logger.LoggerInterface>` interface that the {py:class}`TensorboardLogger <nemo_rl.utils.logger.TensorboardLogger>` and {py:class}`WandbLogger <nemo_rl.utils.logger.WandbLogger>` will implement:
To handle multiple logger backends, we will have a {py:class}`LoggerInterface <nemo_rl.utils.logger.LoggerInterface>` interface that the {py:class}`TensorboardLogger <nemo_rl.utils.logger.TensorboardLogger>`, {py:class}`WandbLogger <nemo_rl.utils.logger.WandbLogger>`, and {py:class}`MLflowLogger <nemo_rl.utils.logger.MLflowLogger>` will implement:

```python
class LoggerInterface(ABC):
    ...  # remaining abstract methods collapsed in this diff view
```

@@ -34,10 +35,11 @@ class LoggerInterface(ABC):
A {py:class}`Logger <nemo_rl.utils.logger.Logger>` wrapper class will also implement {py:class}`LoggerInterface <nemo_rl.utils.logger.LoggerInterface>` and maintain a list of loggers to which it delegates writing logs. This will be the main class the user uses in the training loop. Usage example:

```python
# Initialize logger with both wandb and tensorboard enabled
# Initialize logger with wandb and mlflow enabled
logging_config = {
    "wandb_enabled": True,
    "tensorboard_enabled": False,
    "mlflow_enabled": True,

    "wandb": {
        "project": "grpo-dev",
@@ -46,17 +48,72 @@ logging_config = {
    "tensorboard": {
        "log_dir": "logs",
    },
    "mlflow": {
        "experiment_name": "nemo-rl-experiment",
        "run_name": "grpo-dev-run",
        "tracking_uri": None,  # Use local tracking
    },
}
logger = Logger(
    cfg=logging_config,
)

# Log metrics, will go to both wandb and tensorboard
# Log metrics, will go to all enabled backends
logger.log_metrics({
    "loss": 0.123,
}, step=10)
```
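The delegation described above can be sketched minimally as follows. This is an illustrative stand-in, not the actual nemo_rl implementation: the backend list constructor, `PrintLogger`, and the recorded tuples are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Mapping


class LoggerInterface(ABC):
    """Minimal stand-in for the backend interface described above."""

    @abstractmethod
    def log_metrics(self, metrics: Mapping[str, Any], step: int) -> None: ...


class PrintLogger(LoggerInterface):
    """Toy backend that just records what it was asked to log."""

    def __init__(self) -> None:
        self.records: list[tuple[int, dict]] = []

    def log_metrics(self, metrics: Mapping[str, Any], step: int) -> None:
        self.records.append((step, dict(metrics)))


class Logger(LoggerInterface):
    """Wrapper that fans each call out to every enabled backend."""

    def __init__(self, backends: list[LoggerInterface]) -> None:
        self.backends = backends

    def log_metrics(self, metrics: Mapping[str, Any], step: int) -> None:
        for backend in self.backends:
            backend.log_metrics(metrics, step)
```

Because every backend implements the same interface, the training loop only ever talks to the wrapper; adding a new backend (such as MLflow) means implementing `LoggerInterface` and appending it to the list.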

## Supported Logging Backends

The logger supports three logging backends:

### WandB (Weights & Biases)
- Provides cloud-based experiment tracking
- Supports custom step metrics for better visualization
- Includes built-in hyperparameter logging
- Offers rich visualization and collaboration features

### Tensorboard
- Local file-based logging
- Standard TensorBoard visualization
- Supports hyperparameter logging via HParams
- Lightweight and self-contained

### MLflow
- Comprehensive platform for experiment tracking and model management
- Supports both local and remote tracking servers
- Provides model versioning and artifact management
- Includes a web UI for experiment visualization
- Supports model deployment and serving

#### MLflow Configuration

MLflow can be configured with the following parameters:

```yaml
mlflow:
  experiment_name: "nemo-rl-experiment"  # Name of the MLflow experiment
  run_name: "my-training-run"            # Run name
  tracking_uri: "http://localhost:5000"  # Optional tracking server URI
```
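As a rough illustration of how these settings might be resolved before being handed to MLflow: when no tracking URI is set, MLflow falls back to a local `./mlruns` directory. The helper below is hypothetical (not part of nemo_rl), sketched under that documented default.

```python
def resolve_mlflow_config(cfg: dict) -> dict:
    """Fill in defaults for the optional MLflow settings (hypothetical helper)."""
    return {
        "experiment_name": cfg.get("experiment_name", "Default"),
        "run_name": cfg.get("run_name"),  # MLflow auto-generates a name if None
        # With no tracking URI configured, MLflow writes to a local ./mlruns directory.
        "tracking_uri": cfg.get("tracking_uri") or "file:./mlruns",
    }
```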


#### MLflow UI

After starting training with MLflow enabled, you can view the MLflow UI to monitor your experiments:

```bash
# Start MLflow UI (run in a separate terminal)
mlflow ui --host 0.0.0.0 --port 5000
```

Then access the UI at `http://127.0.0.1:5000/` to view:
- Training runs and experiments
- Metrics (loss, validation metrics, etc.)
- Hyperparameters
- Model artifacts and checkpoints

## Validation Pretty Logging

The logger supports pretty-formatted logging of validation samples to help visualize model outputs during training. This feature is controlled by the `num_val_samples_to_print` configuration parameter.
@@ -65,6 +122,7 @@ The logger supports pretty-formatted logging of validation samples to help visua

```yaml
logger:
  wandb_enabled: false
  tensorboard_enabled: false
  mlflow_enabled: false
  num_val_samples_to_print: 10
```
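A minimal sketch of what `num_val_samples_to_print` might control, namely truncating the logged samples to the first N. The function name and sample fields here are assumptions for illustration, not the actual formatting code.

```python
def format_val_samples(samples: list[dict], num_to_print: int) -> str:
    """Pretty-format the first `num_to_print` validation samples (illustrative)."""
    lines = []
    for i, sample in enumerate(samples[:num_to_print]):
        lines.append(f"=== Validation sample {i} ===")
        for key, value in sample.items():
            lines.append(f"{key}: {value}")
    return "\n".join(lines)
```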

@@ -82,16 +140,17 @@ When enabled, the pretty logging will generate formatted text similar to:

## GPU Metric Logging

NeMo RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard and/or WandB.
NeMo RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard, WandB, and/or MLflow.

This approach allows us to offer the same GPU metric tracking on all loggers (not just Wandb) and simplifies the implementation greatly.
This approach allows us to offer the same GPU metric tracking on all loggers and simplifies the implementation greatly.

This feature is enabled with the `monitor_gpus` configuration parameter. The frequency of data collection and flushing to the loggers is controlled by the `gpu_monitoring.collection_interval` and `gpu_monitoring.flush_interval` parameters, both specified in seconds.

```yaml
logger:
  wandb_enabled: false
  tensorboard_enabled: false
  mlflow_enabled: false
  monitor_gpus: true
  gpu_monitoring:
    collection_interval: 10
```

@@ -103,7 +162,7 @@ While it is feasible to monitor using remote workers, the implementation require
* Logs sent back to the driver do not introduce significant overhead.
* Metrics remain clear and interpretable, avoiding issues like double counting caused by colocated workers.
* Workers can gracefully flush their logs in case of failure.
* Logging behaves consistently across TensorBoard and Wandb.
* Logging behaves consistently across TensorBoard, WandB, and MLflow.
* Workers that spawn other workers accurately report the total resource usage of any grandchild workers.

Due to these complexities, we opted for a simpler approach: collecting metrics exposed by the Ray metrics server from the driver.
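The driver-side collect-and-flush loop this section describes could be sketched as follows. This is a simplified model under stated assumptions: the function name is hypothetical, and the actual Ray metrics query is stubbed out as `poll_fn`.

```python
import time


def monitor_gpus(poll_fn, log_fn, collection_interval, flush_interval, duration):
    """Collect a GPU sample every `collection_interval` seconds and flush the
    buffered samples to the loggers every `flush_interval` seconds (sketch)."""
    buffer = []
    last_flush = time.monotonic()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        buffer.append(poll_fn())  # e.g. query Ray's metrics endpoint
        now = time.monotonic()
        if now - last_flush >= flush_interval and buffer:
            log_fn(buffer)        # fan out to all enabled backends
            buffer = []
            last_flush = now
        time.sleep(collection_interval)
```

Separating the two intervals lets the driver sample frequently without paying the logging-backend write cost on every sample.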
1 change: 1 addition & 0 deletions examples/configs/dpo.yaml
@@ -153,6 +153,7 @@ logger:
log_dir: "logs" # Base directory for all logs
wandb_enabled: false # Make sure you do a ``wandb login [Your API key]'' before running
tensorboard_enabled: false
mlflow_enabled: false # Disable MLflow logging
monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "dpo-dev"
1 change: 1 addition & 0 deletions examples/configs/grpo-deepscaler-1.5b-8K.yaml
@@ -127,6 +127,7 @@ logger:
num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
wandb_enabled: false
tensorboard_enabled: false
mlflow_enabled: false
monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "grpo-dev"
1 change: 1 addition & 0 deletions examples/configs/grpo_math_1B.yaml
@@ -132,6 +132,7 @@ logger:
num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
wandb_enabled: false
tensorboard_enabled: false
mlflow_enabled: false # Disable MLflow logging
monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "grpo-dev"
1 change: 1 addition & 0 deletions examples/configs/grpo_math_1B_megatron.yaml
@@ -153,6 +153,7 @@ logger:
num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
wandb_enabled: false
tensorboard_enabled: false
mlflow_enabled: false # Disable MLflow logging
monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "grpo-dev"
1 change: 1 addition & 0 deletions examples/configs/grpo_sliding_puzzle.yaml
@@ -52,6 +52,7 @@ logger:
num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
wandb_enabled: false
tensorboard_enabled: false
mlflow_enabled: false
monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "grpo-dev"
@@ -78,6 +78,7 @@ logger:
log_dir: "logs"
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -78,6 +78,7 @@ logger:
log_dir: "logs"
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -111,6 +111,7 @@ logger:
log_dir: "logs"
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -111,6 +111,7 @@ logger:
log_dir: "logs"
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -78,6 +78,7 @@ logger:
log_dir: "logs"
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -108,6 +108,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -109,6 +109,7 @@ logger:
num_val_samples_to_print: 0
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -58,6 +58,7 @@ logger:
log_dir: logs/sft-llama3.1-8b-instruct-1n8g-fsdp2tp1-long
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -58,6 +58,7 @@ logger:
log_dir: logs/sft-llama3.1-8b-instruct-1n8g-fsdp2tp2sp
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -102,6 +102,7 @@ logger:
log_dir: logs/sft-llama3.1-8b-instruct-1n8g-fsdp1
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -50,14 +50,15 @@ policy:
fused: false
data:
max_input_seq_length: 1024
dataset_name: squad
dataset_name: squad
add_bos: true
add_eos: true
add_generation_prompt: false
logger:
log_dir: logs/sft-llama3.2-1b-1n8g-fsdp2tp1
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
@@ -58,6 +58,7 @@ logger:
log_dir: logs/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt
wandb_enabled: true
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true
wandb:
project: nemo-rl
1 change: 1 addition & 0 deletions examples/configs/sft.yaml
@@ -133,6 +133,7 @@ logger:
log_dir: "logs" # Base directory for all logs
wandb_enabled: true # Make sure you do a ``wandb login [Your API key]'' before running
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "sft-dev"
1 change: 1 addition & 0 deletions examples/configs/sft_openmathinstruct2.yaml
@@ -71,6 +71,7 @@ logger:
log_dir: "logs" # Base directory for all logs
wandb_enabled: true # Make sure you do a ``wandb login [Your API key]'' before running
tensorboard_enabled: true
mlflow_enabled: false
monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "sft-dev"