Closed
24 commits
a4a4519
feat: code execution & tool use (#322)
KiddoZhu Jul 29, 2025
d8bf537
unittest(configs): add swanlab
tpoisonooo Jul 18, 2025
232c751
unittest(logger.py): test passed
tpoisonooo Jul 30, 2025
7df3743
Update uv.lock
tpoisonooo Jul 30, 2025
85aeeab
Update pyproject.toml
tpoisonooo Jul 30, 2025
1466cc9
Update nemo_rl/utils/logger.py
tpoisonooo Jul 31, 2025
420a580
Update nemo_rl/utils/logger.py
tpoisonooo Jul 31, 2025
0141b5b
uv lock
terrykong Jul 31, 2025
95c7ed8
feat: add AIME-2025 eval dataset. (#777)
xxman-google Jul 30, 2025
a0ab666
docs: clarification of where you can find nsys profiles (#771)
terrykong Jul 30, 2025
8f35f39
feat: refit metadata optimization (#686)
ZhiyuLi-Nvidia Jul 30, 2025
b051b70
fix: fix the return type of `execute()`. (#808)
xxman-google Jul 31, 2025
8636efa
docs: add usage example for mcore --> hf converter (#807)
ashors1 Jul 31, 2025
dc93f58
docs: documentation and unit test for env var precedence (#806)
ashors1 Jul 31, 2025
9ddf0db
fix: Fix incorrect indexing of message which cuts off user message wh…
parthchadha Jul 31, 2025
b48faf3
docs: add a section on our config design (#810)
terrykong Jul 31, 2025
843735f
fix: fix dynamo cache (#784)
yuki-97 Jul 31, 2025
2fb7f84
feat: add throughput/prompt_length/total_num_tokens metrics (#781)
ZhiyuLi-Nvidia Aug 1, 2025
98774d3
fix: avoid duplicate bos by adding add_special_tokens=False (#747)
ZhiyuLi-Nvidia Aug 3, 2025
21a70ee
ci: Refactor unit tests to run in concurrent jobs (#617)
chtruong814 Aug 3, 2025
11cdfe3
style(tests): linkt check
tpoisonooo Aug 4, 2025
4302aa7
Merge branch 'main' into support-swanlab
tpoisonooo Aug 4, 2025
ffaea62
ci(tests): pre-commit check
tpoisonooo Aug 6, 2025
f5e4e0c
style(tests): remote empty line
tpoisonooo Aug 6, 2025
26 changes: 20 additions & 6 deletions docs/design-docs/logger.md
@@ -1,6 +1,6 @@
 # Logger

-The logger is designed to track key training metrics (including distributed metrics with reductions and timing), as well as providing integration with logging backends like WandB, Tensorboard, and MLflow.
+The logger is designed to track key training metrics (including distributed metrics with reductions and timing), as well as providing integration with logging backends like WandB, Tensorboard, MLflow and Swanlab.

 ## Requirements

@@ -10,12 +10,13 @@ The logger is designed to track key training metrics (including distributed metr
 * WandB
 * Tensorboard
 * MLflow
+* Swanlab

 ## Overall Design

 Since there is a single controller, the single process running the main training loop will gather the metrics and do the logging.

-To handle multiple logger backends, we will have a {py:class}`LoggerInterface <nemo_rl.utils.logger.LoggerInterface>` interface that the {py:class}`TensorboardLogger <nemo_rl.utils.logger.TensorboardLogger>`, {py:class}`WandbLogger <nemo_rl.utils.logger.WandbLogger>`, and {py:class}`MLflowLogger <nemo_rl.utils.logger.MLflowLogger>` will implement:
+To handle multiple logger backends, we will have a {py:class}`LoggerInterface <nemo_rl.utils.logger.LoggerInterface>` interface that the {py:class}`TensorboardLogger <nemo_rl.utils.logger.TensorboardLogger>`, {py:class}`WandbLogger <nemo_rl.utils.logger.WandbLogger>`, {py:class}`MLflowLogger <nemo_rl.utils.logger.MLflowLogger>` and {py:class}`SwanlabLogger <nemo_rl.utils.logger.SwanlabLogger>` will implement:

 ```python
 class LoggerInterface(ABC):
@@ -35,7 +36,7 @@ class LoggerInterface(ABC):
 A {py:class}`Logger <nemo_rl.utils.logger.Logger>` wrapper class will also implement {py:class}`LoggerInterface <nemo_rl.utils.logger.LoggerInterface>` and maintain a list of loggers to which it delegates writing logs. This will be the main class the user uses in the training loop. Usage example:

 ```python
-# Initialize logger with wandb, tensorboard, and mlflow enabled
+# Initialize logger with wandb, tensorboard, mlflow and swanlab enabled
 logging_config = {
 "wandb_enabled": True,
 "tensorboard_enabled": False,
@@ -45,6 +46,10 @@ logging_config = {
 "project": "grpo-dev",
 "name": "grpo-dev-logging",
 },
+"swanlab": {
+"project": "nemo-rl",
+"name": "grpo-dev-logging",
+},
 "tensorboard": {
 "log_dir": "logs",
 },
@@ -74,6 +79,13 @@ The logger supports three main logging backends:
 - Includes built-in hyperparameter logging
 - Offers rich visualization and collaboration features

+### Swanlab
+- Training visualization (Android, iOS, Wechat public account and Web)
+- Automatic logging
+- Hyperparameter recording
+- Experiment comparison
+- Multi-user collaboration
+
 ### Tensorboard
 - Local file-based logging
 - Standard TensorBoard visualization
@@ -121,6 +133,7 @@ The logger supports pretty-formatted logging of validation samples to help visua
 ```python
 logger:
 wandb_enabled: false
+swanlab_enabled: false
 tensorboard_enabled: false
 mlflow_enabled: false
 num_val_samples_to_print: 10
@@ -140,7 +153,7 @@ When enabled, the pretty logging will generate formatted text similar to:

 ## GPU Metric Logging

-NeMo RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard, WandB, and/or MLflow.
+NeMo RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard, WandB, MLflow and/or SwanLab.

 This approach allows us to offer the same GPU metric tracking on all loggers and simplifies the implementation greatly.

@@ -149,6 +162,7 @@ This feature is enabled with the `monitor_gpus` configuration parameter. The fre
 ```python
 logger:
 wandb_enabled: false
+swanlab_enabled: false
 tensorboard_enabled: false
 mlflow_enabled: false
 monitor_gpus: true
@@ -162,8 +176,8 @@ While it is feasible to monitor using remote workers, the implementation require
 * Logs sent back to the driver do not introduce significant overhead.
 * Metrics remain clear and interpretable, avoiding issues like double counting caused by colocated workers.
 * Workers can gracefully flush their logs in case of failure.
-* Logging behaves consistently across TensorBoard, WandB, and MLflow.
+* Logging behaves consistently across TensorBoard, WandB, MLflow and Swanlab.
 * Workers that spawn other workers accurately report the total resource usage of any grandchild workers.

 Due to these complexities, we opted for a simpler approach: collecting metrics exposed by the Ray metrics server from the driver.
-:::
+:::
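The design doc changed above describes a wrapper `Logger` that implements `LoggerInterface` and delegates to every enabled backend, with GPU metrics collected on the driver so that all backends receive the same values. The following is a minimal sketch of that fan-out, not the NeMo RL implementation. The diff does not show the interface body, so the `log_metrics` method and its parameters are assumptions, and `InMemoryLogger`, `FanOutLogger`, and `poll_gpu_metrics` are hypothetical names that stand in for real backends and for the Ray metrics query.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Mapping, Tuple


class LoggerInterface(ABC):
    """Assumed shape of the backend interface (the diff elides the real body)."""

    @abstractmethod
    def log_metrics(
        self, metrics: Mapping[str, Any], step: int, prefix: str = ""
    ) -> None:
        """Record a batch of metrics at a given training step."""


class InMemoryLogger(LoggerInterface):
    """Hypothetical stand-in for a real backend such as TensorBoard or SwanLab."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: List[Tuple[int, Dict[str, Any]]] = []

    def log_metrics(self, metrics, step, prefix=""):
        # Namespace each key with the prefix so metric groups stay separate.
        named = {
            (f"{prefix}/{key}" if prefix else key): value
            for key, value in metrics.items()
        }
        self.records.append((step, named))


class FanOutLogger(LoggerInterface):
    """Hypothetical wrapper that forwards every call to each backend."""

    def __init__(self, backends: Iterable[LoggerInterface]) -> None:
        self.backends = list(backends)

    def log_metrics(self, metrics, step, prefix=""):
        for backend in self.backends:
            backend.log_metrics(metrics, step, prefix)


def poll_gpu_metrics() -> Dict[str, float]:
    # Placeholder values; a real driver would read these from Ray's metrics server.
    return {"gpu.0.util": 87.0, "gpu.0.mem_gb": 31.5}


backends = [InMemoryLogger("tensorboard"), InMemoryLogger("swanlab")]
logger = FanOutLogger(backends)
logger.log_metrics(poll_gpu_metrics(), step=10, prefix="ray")
```

Because the wrapper only depends on the interface, adding a backend in this sketch means adding one more object to the list; the polling code does not change.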
1 change: 1 addition & 0 deletions examples/configs/dpo.yaml
@@ -155,6 +155,7 @@ logger:
 wandb_enabled: false # Make sure you do a ``wandb login [Your API key]'' before running
 tensorboard_enabled: false
 mlflow_enabled: false # Disable MLflow logging
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
1 change: 1 addition & 0 deletions examples/configs/grpo-deepscaler-1.5b-8K.yaml
@@ -129,6 +129,7 @@ logger:
 wandb_enabled: false
 tensorboard_enabled: false
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
 wandb:
 project: "grpo-dev"
1 change: 1 addition & 0 deletions examples/configs/grpo_math_1B.yaml
@@ -138,6 +138,7 @@ logger:
 wandb_enabled: false
 tensorboard_enabled: false
 mlflow_enabled: false # Disable MLflow logging
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
 wandb:
 project: "grpo-dev"
1 change: 1 addition & 0 deletions examples/configs/grpo_math_1B_megatron.yaml
@@ -157,6 +157,7 @@ logger:
 wandb_enabled: false
 tensorboard_enabled: false
 mlflow_enabled: false # Disable MLflow logging
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
 wandb:
 project: "grpo-dev"
1 change: 1 addition & 0 deletions examples/configs/grpo_sliding_puzzle.yaml
@@ -53,6 +53,7 @@ logger:
 wandb_enabled: false
 tensorboard_enabled: false
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
 wandb:
 project: "grpo-dev"
@@ -79,6 +79,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
@@ -79,6 +79,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
@@ -112,6 +112,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
@@ -112,6 +112,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
@@ -78,6 +78,7 @@ logger:
 log_dir: "logs"
 wandb_enabled: true
 tensorboard_enabled: true
+swanlab_enabled: false # Disable SwanLab logging
 mlflow_enabled: false
 monitor_gpus: true
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
@@ -108,6 +108,7 @@ logger:
 num_val_samples_to_print: 0
 wandb_enabled: true
 tensorboard_enabled: true
+swanlab_enabled: false # Disable SwanLab logging
 mlflow_enabled: false
 monitor_gpus: true
 wandb:
@@ -110,6 +110,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 wandb:
 project: nemo-rl
@@ -110,6 +110,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 wandb:
 project: nemo-rl
@@ -110,6 +110,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 wandb:
 project: nemo-rl
@@ -110,6 +110,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 wandb:
 project: nemo-rl
@@ -110,6 +110,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 wandb:
 project: nemo-rl
@@ -110,6 +110,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 wandb:
 project: nemo-rl
@@ -110,6 +110,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 wandb:
 project: nemo-rl
@@ -59,6 +59,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
@@ -59,6 +59,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
@@ -103,6 +103,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
@@ -59,6 +59,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
@@ -59,6 +59,7 @@ logger:
 wandb_enabled: true
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
1 change: 1 addition & 0 deletions examples/configs/sft.yaml
@@ -138,6 +138,7 @@ logger:
 wandb_enabled: true # Make sure you do a ``wandb login [Your API key]'' before running
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
1 change: 1 addition & 0 deletions examples/configs/sft_openmathinstruct2.yaml
@@ -73,6 +73,7 @@ logger:
 wandb_enabled: true # Make sure you do a ``wandb login [Your API key]'' before running
 tensorboard_enabled: true
 mlflow_enabled: false
+swanlab_enabled: false # Disable SwanLab logging
 monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
 num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
 wandb:
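Every config touched in this change gains a `swanlab_enabled` flag alongside the existing `wandb_enabled`, `tensorboard_enabled`, and `mlflow_enabled` flags. As a rough sketch of how such flags could be resolved into a list of backends, the helper below reads a `logger` section expressed as a plain dict. The `enabled_backends` function, the rule that an enabled backend must have a settings section of the same name, and the choice to treat a missing flag as disabled are all assumptions made for illustration and are not taken from NeMo RL.

```python
from typing import Any, Dict, List

# Backend names as they appear in the `<name>_enabled` flags of the configs.
BACKENDS = ("wandb", "tensorboard", "mlflow", "swanlab")


def enabled_backends(logger_cfg: Dict[str, Any]) -> List[str]:
    """Return the backends whose `<name>_enabled` flag is true.

    Raises ValueError when an enabled backend has no settings section
    (a hypothetical validation rule, chosen for this sketch).
    """
    selected = []
    for name in BACKENDS:
        # A missing flag is treated as disabled (an assumption of this sketch).
        if not logger_cfg.get(f"{name}_enabled", False):
            continue
        if name not in logger_cfg:
            raise ValueError(
                f"{name}_enabled is true but no '{name}' section was given"
            )
        selected.append(name)
    return selected


# Mirrors the shape of the YAML `logger:` sections, with SwanLab off by default.
example_cfg = {
    "wandb_enabled": True,
    "tensorboard_enabled": True,
    "mlflow_enabled": False,
    "swanlab_enabled": False,
    "monitor_gpus": True,
    "wandb": {"project": "nemo-rl"},
    "tensorboard": {"log_dir": "logs"},
}

print(enabled_backends(example_cfg))  # ['wandb', 'tensorboard']

# Opting in to SwanLab: flip the flag and supply its section.
with_swanlab = {
    **example_cfg,
    "swanlab_enabled": True,
    "swanlab": {"project": "nemo-rl", "name": "grpo-dev-logging"},
}
print(enabled_backends(with_swanlab))  # ['wandb', 'tensorboard', 'swanlab']
```

Keeping the flag explicit and defaulting it to `false` in every example config, as this change does, means existing runs behave the same until a user opts in.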