add support for nsys profile #667

Merged
Kipok merged 5 commits into main from wedu/nsys_profile on Aug 14, 2025

@wedu-nvidia
Collaborator

Added support for nsys_profile

Signed-off-by: Wei Du <wedu@nvidia.com>
@wedu-nvidia requested a review from Kipok on August 13, 2025 20:24
@wedu-nvidia
Collaborator Author

wedu-nvidia commented Aug 13, 2025

@Kipok let me know if you have any comments. I have only added support for SFT for now and will wait for #653 to be merged before revising GRPO.

Collaborator

@Kipok left a comment

Please also do the same for the GRPO script. Ideally, reuse the logic so that the nsight_cmd is defined in one function and called from both scripts.
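A minimal sketch of what such a shared helper could look like. Everything here is an assumption for illustration: the function names `get_nsight_cmd` and `build_train_cmd`, the environment variable names, and the worker pattern are not taken from this thread, and the actual merged code may differ.

```python
import shlex
from typing import Optional


def get_nsight_cmd(profile_step_range: Optional[str]) -> str:
    """Return a shell prefix that enables nsys profiling, or "" when disabled.

    The env var names and the worker pattern below are assumed for
    illustration; they are not confirmed by this PR conversation.
    """
    if not profile_step_range:
        return ""
    return (
        f"export NRL_NSYS_PROFILE_STEP_RANGE={shlex.quote(profile_step_range)} && "
        f"export NRL_NSYS_WORKER_PATTERNS={shlex.quote('*policy*')} && "
    )


def build_train_cmd(script: str, profile_step_range: Optional[str] = None) -> str:
    """Hypothetical caller: both the SFT and GRPO scripts would reuse the helper."""
    return f"{get_nsight_cmd(profile_step_range)}python {script}"
```

With the helper in one place, the SFT and GRPO entry points only differ in the script they pass, e.g. `build_train_cmd("train_grpo.py", "3:5")` (script name hypothetical).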

wandb_project: str = typer.Option("nemo-skills", help="Weights & Biases project name"),
wandb_group: str = typer.Option(None, help="Weights & Biases group name."),
disable_wandb: bool = typer.Option(False, help="Disable wandb logging"),
nsys_profile: bool = typer.Option(False, help="Profile GPU with Nsight Systems for selected Ray workers via env var matching."),
Collaborator

We don't need this. Let's just have a single nsys_step_range option and, if it's specified, profile with that range.


if nsys_profile:
    if "RAY_LOG_SYNC_FREQUENCY" not in env_variables:
        raise typer.BadParameter(
Collaborator

Don't raise an error; just add it directly to the cluster config's env vars, using logic like this: https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/pipeline/utils/exp.py#L452
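A rough sketch of the "add it instead of raising" idea. It assumes the cluster config is a dict whose `env_vars` key holds a list of `NAME=value` (or bare `NAME`) strings; that layout and the helper name `ensure_env_var` are assumptions for illustration, not the code at the linked line.

```python
def ensure_env_var(cluster_config: dict, name: str, value) -> None:
    """Add NAME=value to the config's env vars unless NAME is already present.

    Assumes cluster_config["env_vars"] is a list of "NAME=value" or "NAME"
    strings (hypothetical layout for this sketch).
    """
    env_vars = cluster_config.setdefault("env_vars", [])
    already_set = any(
        entry == name or entry.startswith(f"{name}=") for entry in env_vars
    )
    if not already_set:
        # Supply a default instead of failing the whole run with an error.
        env_vars.append(f"{name}={value}")
```

A value the user already set is left untouched, so the default only applies when nothing was configured.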

@wedu-nvidia
Collaborator Author

@Kipok I revised according to the comments

disable_wandb: bool = typer.Option(False, help="Disable wandb logging"),
profile_step_range: str = typer.Option(
None,
help="Environment variable to control which training steps the profiler captures. "
Collaborator

Suggested change
- help="Environment variable to control which training steps the profiler captures. "
+ help="Controls which training steps the nsys profiler captures. "

with_ray=True,
installation_command=installation_command,
)
with temporary_env_update(cluster_config, {"RAY_LOG_SYNC_FREQUENCY": 20}):
Collaborator

We should only do this if profiling is specified. In other cases you can pass {} as the second argument.
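A small sketch of the conditional update being asked for. The `temporary_env_update` below is a stand-in written for this example with assumed semantics (apply the updates inside the `with` block, restore afterwards); it is not the NeMo-Skills implementation. The helper `profiling_env` and the `env_vars` list layout are likewise hypothetical.

```python
from contextlib import contextmanager
from typing import Optional


@contextmanager
def temporary_env_update(cluster_config: dict, updates: dict):
    """Stand-in with assumed semantics, not the real NeMo-Skills helper."""
    original = list(cluster_config.get("env_vars", []))
    cluster_config["env_vars"] = original + [f"{k}={v}" for k, v in updates.items()]
    try:
        yield
    finally:
        # Restore the config even if the body raises.
        cluster_config["env_vars"] = original


def profiling_env(profile_step_range: Optional[str]) -> dict:
    """Return the extra env vars only when profiling was requested."""
    return {"RAY_LOG_SYNC_FREQUENCY": 20} if profile_step_range else {}


cluster_config = {"env_vars": ["HF_HOME=/hf"]}
with temporary_env_update(cluster_config, profiling_env(None)):
    env_without_profiling = list(cluster_config["env_vars"])
with temporary_env_update(cluster_config, profiling_env("3:5")):
    env_with_profiling = list(cluster_config["env_vars"])
```

Passing an empty dict keeps a single code path: the `with` block is always entered, but the config only changes when a step range was given.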

Collaborator Author

Done

wandb_project: str = typer.Option("nemo-skills", help="Weights & Biases project name"),
wandb_group: str = typer.Option(None, help="Weights & Biases group name."),
disable_wandb: bool = typer.Option(False, help="Disable wandb logging"),
profile_step_range: str = typer.Option(
Collaborator

same comments here as in the grpo file

Collaborator Author

Done

Signed-off-by: Wei Du <wedu@nvidia.com>
Collaborator

@Kipok left a comment

Thanks!

@Kipok merged commit 925367a into main on Aug 14, 2025
4 checks passed
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Aug 15, 2025
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
shtoshni pushed a commit that referenced this pull request Aug 15, 2025
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Aug 18, 2025
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Aug 18, 2025
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Oct 1, 2025
Signed-off-by: Wei Du <wedu@nvidia.com>