Merged
2 changes: 1 addition & 1 deletion docs/nsys-profiling.md
@@ -91,7 +91,7 @@ If you are not using model parallelism in Vllm, you should directly refer to `vl

3. **File Location**: Profile files are saved in `/tmp/ray/session*/logs/nsight/` directory on each worker node. Ensure you check both `ls /tmp/ray/session_[0-9]*/logs/nsight` and `ls /tmp/ray/session_latest/logs/nsight` for the profiles, since the "latest" pointer may be stale.

**Note for SLURM users with `ray.sub`**: When using `ray.sub` on SLURM, set `RAY_LOG_SYNC_FREQUENCY=$NUM_SEC` (e.g., `RAY_LOG_SYNC_FREQUENCY=30`) to ensure that the nsight profile files get copied from the container's ephemeral filesystem (`/tmp/ray`) to the persistent `$SLURM_JOB_ID-logs/ray` directory.
**Note for SLURM users with `ray.sub`**: When using `ray.sub` on SLURM, set `RAY_LOG_SYNC_FREQUENCY=$NUM_SEC` (e.g., `RAY_LOG_SYNC_FREQUENCY=30`) to ensure that the nsight profile files get copied from the container's ephemeral filesystem (`/tmp/ray`) to the persistent directory. The head node's files are synced to `$SLURM_JOB_ID-logs/ray`, and other nodes' files are synced to `$SLURM_JOB_ID-logs/ray/$node_ip/`, where `$node_ip` is the IP address of the node.
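
As a rough illustration of the layout described in the note, the sketch below prints the directory where a node's synced Ray logs would be expected. The helper name `synced_ray_log_dir` and its arguments are hypothetical and are not part of `ray.sub`; the paths simply follow the note's wording (job ID prefix, optional per-node subdirectory).

```shell
# Hypothetical helper (not part of ray.sub): print the directory where the
# synced Ray logs for a node are expected, following the note above.
#   $1: SLURM job ID
#   $2: node IP for non-head nodes; omit for the head node
synced_ray_log_dir() {
  job_id="$1"
  node_ip="${2:-}"
  if [ -z "$node_ip" ]; then
    # Head node: logs land directly under <job_id>-logs/ray
    printf '%s-logs/ray\n' "$job_id"
  else
    # Other nodes: logs land in a per-node subdirectory
    printf '%s-logs/ray/%s\n' "$job_id" "$node_ip"
  fi
}
```

Within each of these directories, the profiles would then sit under the per-session `logs/nsight` subdirectory, mirroring the `/tmp/ray/session*/logs/nsight/` layout from item 3.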

## Analyze Profile Files

31 changes: 31 additions & 0 deletions ray.sub
@@ -266,6 +266,37 @@ monitor-sidecar() {
}
monitor-sidecar &

# Background process to sync ray logs every $RAY_LOG_SYNC_FREQUENCY seconds
log-sync-sidecar() {
  set +x
  if [[ -z "$RAY_LOG_SYNC_FREQUENCY" ]]; then
    echo "RAY_LOG_SYNC_FREQUENCY is not set, skipping log sync sidecar"
    return
  fi
  mkdir -p $LOG_DIR/ray/$node_i
  while true; do
    sleep $RAY_LOG_SYNC_FREQUENCY
    if ls /tmp/ray/session_[0-9]* > /dev/null 2>&1; then
      for session_dir in /tmp/ray/session_[0-9]*/; do
        if [[ -d "\$session_dir/logs" ]]; then
          session_name=\$(basename "\$session_dir")
          mkdir -p "$LOG_DIR/ray/$node_i/\$session_name"
          if command -v rsync > /dev/null 2>&1; then
            rsync -ahP "\$session_dir/logs/" $LOG_DIR/ray/$node_i/\$session_name/logs/ 2>/dev/null || true
          else
            cp -r "\$session_dir/logs" $LOG_DIR/ray/$node_i/\$session_name/
          fi
        fi
      done
    fi
    if [[ -f "$LOG_DIR/ENDED" ]]; then
      echo "Log sync sidecar terminating..."
      break
    fi
  done
}
log-sync-sidecar &
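
For readers who want to try the copy step in isolation, here is a minimal sketch of a single sync pass, written outside the heredoc so no `\$` escaping is needed. The function name `sync_ray_logs_once` and its two arguments are illustrative assumptions; the sidecar above hard-codes `/tmp/ray` as the source and builds the destination from `$LOG_DIR` and `$node_i`. The sketch also uses plain `rsync -a` rather than `-ahP`, since progress output is not needed here.

```shell
# Sketch of one sync pass (hypothetical helper, not part of ray.sub).
#   $1: source root containing session_* directories (the sidecar uses /tmp/ray)
#   $2: destination root on persistent storage
sync_ray_logs_once() {
  src_root="$1"
  dest_root="$2"
  for session_dir in "$src_root"/session_[0-9]*/; do
    # An unmatched glob stays literal, so this test also covers "no sessions yet".
    [ -d "${session_dir}logs" ] || continue
    session_name=$(basename "$session_dir")
    mkdir -p "$dest_root/$session_name"
    if command -v rsync > /dev/null 2>&1; then
      # Trailing slash on the source copies the directory contents into logs/
      rsync -a "${session_dir}logs/" "$dest_root/$session_name/logs/"
    else
      # Fallback: copy the logs directory itself under the session directory
      cp -r "${session_dir}logs" "$dest_root/$session_name/"
    fi
  done
}
```

Because the glob requires a digit after `session_`, a `session_latest` entry is skipped, which matches the pattern used in the sidecar.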

# Patch nsight.py before starting Ray worker
sed -i 's/context\.py_executable = " "\.join(self\.nsight_cmd) + " python"/context.py_executable = " ".join(self.nsight_cmd) + f" {context.py_executable}"/g' /opt/nemo_rl_venv/lib64/python*/site-packages/ray/_private/runtime_env/nsight.py
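
To see what this substitution does without touching an installed Ray, the same expression can be run over a sample line on stdin. The sample input below is inferred from the sed pattern itself and is an assumption, not a line copied from Ray's `nsight.py`; `patch_nsight_line` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: apply the same substitution to stdin instead of
# editing nsight.py in place.
patch_nsight_line() {
  sed 's/context\.py_executable = " "\.join(self\.nsight_cmd) + " python"/context.py_executable = " ".join(self.nsight_cmd) + f" {context.py_executable}"/g'
}

# Sample input inferred from the pattern (an assumption about the original line).
printf '%s\n' 'context.py_executable = " ".join(self.nsight_cmd) + " python"' | patch_nsight_line
# prints: context.py_executable = " ".join(self.nsight_cmd) + f" {context.py_executable}"
```

In effect, the hard-coded ` python` suffix is replaced by whatever `context.py_executable` already holds, so the nsight command wraps the interpreter the runtime environment selected rather than whichever `python` is first on the path.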
