Skip to content

feat(logger.py): support swanlab#716

Closed
tpoisonooo wants to merge 24 commits intoNVIDIA-NeMo:mainfrom
tpoisonooo:support-swanlab
Closed

feat(logger.py): support swanlab#716
tpoisonooo wants to merge 24 commits intoNVIDIA-NeMo:mainfrom
tpoisonooo:support-swanlab

Conversation

@tpoisonooo
Copy link
Contributor

What does this PR do ?

Same as #691 , wandb is unreachable for our district, add swanlab by instead. Just like github --> gitee

Issues

List issues that this PR closes (syntax):

No

Usage

  • You can potentially add a usage example below

Here is my example config:

logger:
  log_dir: "logs"  # Base directory for all logs
  num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
  wandb_enabled: false
  swanlab_enabled: true
  tensorboard_enabled: true
  monitor_gpus: false  # If true, will monitor GPU usage and log to wandb and/or tensorboard
  wandb:
    project: "grpo-dev"
    name: "sj_megatron_1B"
  swanlab:
    project: "grpo-dev"
    name: "sj_megatron_1B"

Here is swanlab UI and public project

image
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 23, 2025
@tpoisonooo
Copy link
Contributor Author

@terrykong
I already read #534 and update the code, but how to run the pytest.

When I uv run pytest tests/unit/utils/test_logger.py, it calls ray env and download some gated models. Looks like not a unittest.

run ray cluster

image

download some models

image

@terrykong
Copy link
Collaborator

@tpoisonooo can you try these instructions:

https://docs.nvidia.com/nemo/rl/latest/testing.html#unit-tests

So to just test the logger, i'd run:

uv run --group test bash tests/run_unit.sh -k test_logger

You may also run bare pytest, but we've had some issues where pytest was sensitive to the current working directory, so we have this script that can be run more robustly for convenience.

@SahilJain314
Copy link
Contributor

SahilJain314 commented Jul 23, 2025

I think you do actually need HF access to those models @terrykong, the unit tests use them (fwiw, I've been able to run bare pytest in the past so long as I have the HF key). Maybe we should switch fully to qwen models in the tests to make it easier for community members to run locally?

@tpoisonooo I'd recommend doing a hf login with a key that has access to the repo (it's easy to get access)

@tpoisonooo
Copy link
Contributor Author

image
  • Meta reject the download request
  • and CN blocked HF and GitHub

What ancestral grave did we open-source developers dig up to deserve this .. ?

@terrykong
Copy link
Collaborator

@tpoisonooo Sorry you're running into this. I didn't realize they actually rejected anyone...

I'll PR something today that'll filter out the gated repos so you can at least run the tests that don't require those models.

@SahilJain314
Copy link
Contributor

@tpoisonooo Updating here, PR #755 is merged. You should be able to run tests without HF with this.

@tpoisonooo tpoisonooo force-pushed the support-swanlab branch 2 times, most recently from e51fe34 to ec98ae4 Compare July 30, 2025 06:13
@tpoisonooo
Copy link
Contributor Author

image

@SahilJain314 @terrykong unittest passed.

Copy link
Collaborator

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome work @tpoisonooo ! thanks for adding support, i've left some comments

@tpoisonooo
Copy link
Contributor Author

awesome work @tpoisonooo ! thanks for adding support, i've left some comments

done.

terrykong
terrykong previously approved these changes Aug 1, 2025
@terrykong terrykong enabled auto-merge August 1, 2025 20:50
@terrykong terrykong added this pull request to the merge queue Aug 1, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 1, 2025
KiddoZhu and others added 11 commits August 4, 2025 14:18
Signed-off-by: KiddoZhu <zhaochengz@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
Signed-off-by: Xuehan <xxman@google.com>
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
@tpoisonooo
Copy link
Contributor Author

@terrykong lint error fixed.

terrykong
terrykong previously approved these changes Aug 5, 2025
Copy link
Collaborator

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work! thanks for the contribution!

@terrykong terrykong enabled auto-merge August 5, 2025 17:35
@terrykong terrykong added this pull request to the merge queue Aug 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 6, 2025
@terrykong
Copy link
Collaborator

@tpoisonooo looks like the linter failed. can you run it through pre-commit and make sure everything passes and we can try again?

Signed-off-by: tpoisonooo <khj.application@aliyun.com>
@tpoisonooo
Copy link
Contributor Author

tpoisonooo commented Aug 6, 2025

For this PR

ok, I have installed pre-commit and passed lint check part, pls try again. @terrykong

pyrefly failed on other irrevalant files (convert_dcp_to_hf.py)

image

For code lint

I have seen a better practice: github actions bot merge format-code to human PR.

Here is an example: Tencent/ncnn@a2e23d9 in Tencent/ncnn#4782

@terrykong terrykong enabled auto-merge August 6, 2025 04:26
terrykong
terrykong previously approved these changes Aug 6, 2025
@terrykong terrykong added this pull request to the merge queue Aug 6, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 6, 2025
Signed-off-by: tpoisonooo <khj.application@aliyun.com>
@tpoisonooo
Copy link
Contributor Author

my local pre-commit allows 2 empty lines here (that is why I dislike pre-commit), my python version is 3.10

image

@terrykong
Copy link
Collaborator

@tpoisonooo could you install pre-commit hooks like this?

uv run --group dev pre-commit install

if it doesn't work for you, i will help shepherd this PR in

@tpoisonooo
Copy link
Contributor Author

@tpoisonooo could you install pre-commit hooks like this?

uv run --group dev pre-commit install

if it doesn't work for you, i will help shepherd this PR in

My mistake, python3 -m pip install pre-commit does not work for 2 empty lines.

Since I only changed 2 .py files and already manully fixed the lint error, nothing can be committed, please retry the CI. @terrykong

@terrykong terrykong enabled auto-merge August 7, 2025 07:41
@terrykong terrykong added this pull request to the merge queue Aug 7, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 7, 2025
@tpoisonooo
Copy link
Contributor Author

finally.. the CI passed.. QvQ

@Zeyi-Lin
Copy link
Contributor

Zeyi-Lin commented Aug 8, 2025

finally.. the CI passed.. QvQ

It seems not to have passed the Docs-Test QAQ, https://github.com/NVIDIA-NeMo/RL/actions/runs/16798324457/job/47576893936

@tpoisonooo
Copy link
Contributor Author

tpoisonooo commented Aug 11, 2025

finally.. the CI passed.. QvQ

It seems not to have passed the Docs-Test QAQ, https://github.com/NVIDIA-NeMo/RL/actions/runs/16798324457/job/47576893936

This looks like some errors in CI https://github.com/NVIDIA-NeMo/RL/actions/runs/16798324457/job/47576893936

  • ModuleNotFoundError: No module named 'swanlab' , I already add dependency in pyproject.yaml
  • NameError: name 'flatten_dict' is not defined

@terrykong
Copy link
Collaborator

hey @tpoisonooo . I'll take a look at this later today and help you get your PR in (i'll make sure to credit you for the contribution)

@terrykong
Copy link
Collaborator

closing since this is replaced with #923

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request documentation Improvements or additions to documentation external x-shailab

Projects

None yet

Development

Successfully merging this pull request may close these issues.