Conversation
Signed-off-by: Wei Du <wedu@nvidia.com>
Kipok
left a comment
There was a problem hiding this comment.
please also do the same for grpo script. And ideally reuse the logic, so that the nsight_cmd is defined in one function and called in both scripts
nemo_skills/pipeline/nemo_rl/sft.py
Outdated
| wandb_project: str = typer.Option("nemo-skills", help="Weights & Biases project name"), | ||
| wandb_group: str = typer.Option(None, help="Weights & Biases group name."), | ||
| disable_wandb: bool = typer.Option(False, help="Disable wandb logging"), | ||
| nsys_profile: bool = typer.Option(False, help="Profile GPU with Nsight Systems for selected Ray workers via env var matching."), |
There was a problem hiding this comment.
we don't need this, let's just have one nsys_step_range and if it's specified, we do profiling with that range
nemo_skills/pipeline/nemo_rl/sft.py
Outdated
|
|
||
| if nsys_profile: | ||
| if "RAY_LOG_SYNC_FREQUENCY" not in env_variables: | ||
| raise typer.BadParameter( |
There was a problem hiding this comment.
don't raise error, just add it directly to cluster config's env vars with a logic like this https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/pipeline/utils/exp.py#L452
Signed-off-by: Wei Du <wedu@nvidia.com>
|
@Kipok I revised according to the comments |
nemo_skills/pipeline/nemo_rl/grpo.py
Outdated
| disable_wandb: bool = typer.Option(False, help="Disable wandb logging"), | ||
| profile_step_range: str = typer.Option( | ||
| None, | ||
| help="Environment variable to control which training steps the profiler captures. " |
There was a problem hiding this comment.
| help="Environment variable to control which training steps the profiler captures. " | |
| help="Controls which training steps the nsys profiler captures. " |
nemo_skills/pipeline/nemo_rl/grpo.py
Outdated
| with_ray=True, | ||
| installation_command=installation_command, | ||
| ) | ||
| with temporary_env_update(cluster_config, {"RAY_LOG_SYNC_FREQUENCY": 20}): |
There was a problem hiding this comment.
we should only do this if profile is specified. You can use {} as a second argument in other cases
| wandb_project: str = typer.Option("nemo-skills", help="Weights & Biases project name"), | ||
| wandb_group: str = typer.Option(None, help="Weights & Biases group name."), | ||
| disable_wandb: bool = typer.Option(False, help="Disable wandb logging"), | ||
| profile_step_range: str = typer.Option( |
There was a problem hiding this comment.
same comments here as in the grpo file
Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Added support for nsys_profile