feat: add multi-node sandbox support for SLURM clusters #1
gwarmstrong wants to merge 6 commits into main
Conversation
Enable the sandbox (code execution environment) to scale across multiple SLURM nodes for large-scale RL training jobs. Key changes:

- Auto-detect SLURM multi-node environments and expand nodelists
- Allocate unique TCP ports per worker with parallel startup and automatic port conflict retry
- Coordinate port reporting between nodes via shared filesystem
- Configure nginx upstream to load-balance across all nodes' workers
- Worker nodes run a local nginx proxy forwarding to the master's LB
- Parallel health checks for faster startup with many workers
- Backward-compatible: single-node mode auto-detected when SLURM vars are absent

Validated on DFW with 16-node (128 workers/node) runs: 9594 successful requests, 0 errors.
- Pass SLURM nodelist via sys.argv instead of shell interpolation into a Python triple-quoted string (prevents injection)
- Fix trap overwrite: fold temp dir cleanup into cleanup() instead of a separate EXIT trap that overwrote the SIGTERM/SIGINT handler
- Remove unused is_port_free() and find_free_port() dead code
- Move network blocking (ld.so.preload) outside the master-only branch so it applies on all nodes (worker nodes also run user code)
- Clean stale port files on startup to handle SANDBOX_PORTS_DIR reuse
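The first fix above (argv instead of interpolation) can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the function name `expand_nodelist` and the parser details are assumptions; the PR only states that the nodelist is passed via `sys.argv` and expanded with a built-in Python parser.

```shell
# Sketch: pass the nodelist as an argv element rather than interpolating
# "$SLURM_JOB_NODELIST" into the Python source text, so hostnames containing
# quotes cannot break out of a triple-quoted string (injection-safe).
expand_nodelist() {
    python3 - "$1" <<'PYEOF'
import re, sys

def expand(nodelist):
    # Minimal parser for compressed forms like "node[001-003],login1".
    # Real SLURM nodelists can nest; this handles the common single-bracket case.
    out = []
    for part in re.findall(r'[^,\[]+(?:\[[^\]]*\])?', nodelist):
        m = re.match(r'(.+)\[(.+)\]$', part)
        if not m:
            if part:
                out.append(part)
            continue
        prefix, ranges = m.groups()
        for r in ranges.split(','):
            if '-' in r:
                lo, hi = r.split('-')
                for i in range(int(lo), int(hi) + 1):
                    # preserve zero-padding of the range's low end
                    out.append(prefix + str(i).zfill(len(lo)))
            else:
                out.append(prefix + r)
    return out

print(' '.join(expand(sys.argv[1])))
PYEOF
}
```

Because the nodelist arrives as `sys.argv[1]`, no shell or Python quoting rules apply to its contents.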
Document all required and optional environment variables grouped by category: worker configuration, multi-node/SLURM, and security.
```shell
# Force Lustre cache invalidation - more aggressive than stat/ls
# 1. Create and delete a temp file to invalidate directory cache
_tmp_invalidate="$PORTS_REPORT_DIR/.cache_invalidate_$$_$(date +%s%N)"
touch "$_tmp_invalidate" 2>/dev/null && rm -f "$_tmp_invalidate" 2>/dev/null || true
```
is this truly necessary? I'm not sure we're going to run up against the Lustre caching often
```shell
> $UPSTREAM_FILE

for node in $ALL_NODES; do
    node_short="${node%%.*}"
```
is it possible to create some utilities/functions for things like this? I am worried it makes it hard to read as is
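A utility of the kind this comment asks for might look like the sketch below. The function name is hypothetical; it simply gives the repeated `${node%%.*}` parameter expansion a readable name.

```shell
# Hypothetical helper illustrating the reviewer's suggestion: isolate the
# "short hostname" rule in one named function instead of repeating the
# parameter expansion inline throughout the script.
short_hostname() {
    # Strip the domain suffix: "node001.cluster.local" -> "node001"
    printf '%s\n' "${1%%.*}"
}
```

Call sites then read as `node_short="$(short_hostname "$node")"`, which documents intent at each use.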
```shell
echo "Starting worker load monitor (updates every 60s)..."
while true; do
    sleep 60
    echo "--- Worker Load Stats (Top 10) at $(date) ---"
```
the $(hostname) should be here
```shell
# Applied on ALL nodes since worker nodes run sandboxed user code too
# =============================================================================
BLOCK_NETWORK_LIB="/usr/lib/libblock_network.so"
if [ "${NEMO_SKILLS_SANDBOX_BLOCK_NETWORK:-0}" = "1" ]; then
```
I'm a little worried that network blocking will mess up internode communication in the multinode deployment, scope this out
```shell
    rm -rf "$REMOTE_HEALTH_DIR" "$ENDPOINTS_FILE"
else
    # Single-node mode: generate config from local ports only
    echo "Single-node mode: generating nginx config from local ports"
```
the control flow to get here is pretty confusing -- is it also possible to have an override to force running in single-node mode here? That way, in case multi-node is bad for e.g. network blocking, we can force it seamlessly
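An override of the kind this comment requests could look like the sketch below. The variable name matches the `SANDBOX_FORCE_SINGLE_NODE` env var a later commit in this PR introduces; the detection function itself is a hypothetical sketch, not the PR's code.

```shell
# Sketch of a forced single-node override: the env var wins over any
# SLURM-based multi-node detection, giving a seamless escape hatch.
is_multi_node() {
    if [ "${SANDBOX_FORCE_SINGLE_NODE:-0}" = "1" ]; then
        return 1   # explicit override: always single-node
    fi
    # fall back to SLURM's node count (assumed detection rule)
    [ "${SLURM_NNODES:-1}" -gt 1 ]
}
```

The single-node branch then becomes reachable directly, regardless of how the surrounding control flow evaluates the SLURM environment.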
Address PR review comments:
1. Remove aggressive Lustre cache invalidation (touch/rm/ls/sync dance).
The cat-based file read already forces Lustre to fetch content; the
extra invalidation was unnecessary overhead.
2. Extract utility functions for readability:
- generate_nginx_config() — template substitution + nginx -t
- read_port_file() — parse port files, emit node:port lines
- wait_for_port_reports() — poll shared storage for all nodes
- verify_remote_workers() — parallel health checks via xargs
This makes the nginx setup section a clear linear flow:
wait_for_port_reports → build upstream → generate_nginx_config →
verify_remote_workers
3. Add $(hostname) to load monitor stats output.
4. Skip network blocking in multi-node mode. ld.so.preload intercepts
socket() in all new exec'd processes — if the monitoring loop restarts
a crashed worker, the new uWSGI process would be unable to bind its
listening socket. Document this limitation.
5. Add SANDBOX_FORCE_SINGLE_NODE env var to override multi-node detection.
Useful for debugging or when multi-node sandbox causes issues.
Also: trim verbose debug logging, reduce file from ~1000 to ~710 lines.
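One of the extracted utilities named above, `read_port_file()`, might look like the following sketch. The exact implementation is not shown in the PR; this assumes port files hold one port number per line and that the function emits `node:port` lines for the nginx upstream builder.

```shell
# Hedged sketch of read_port_file(): parse a node's port file and emit
# "node:port" lines, skipping blanks and malformed entries.
read_port_file() {
    node="$1"
    file="$2"
    while IFS= read -r port; do
        # keep only bare decimal port numbers
        case "$port" in
            ''|*[!0-9]*) continue ;;
        esac
        printf '%s:%s\n' "$node" "$port"
    done < "$file"
}
```

With helpers like this, the `wait_for_port_reports → build upstream → generate_nginx_config → verify_remote_workers` flow described in the commit reads as a short linear pipeline.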
```shell
# Node discovery
# =============================================================================
_H=$(hostname)
echo "[$_H] SLURM_JOB_NODELIST=${SLURM_JOB_NODELIST:-<not set>} SLURM_NNODES=${SLURM_NNODES:-<not set>}"
```
wouldn't this mean setting SLURM_NNODES is kind of required? but it is not specified above, same for JOB_NODELIST -- at least in order to use multinode. I think this <not set> stuff is going to silence misconfigurations
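A louder check of the kind this comment suggests could look like the sketch below (the function name and error format are assumptions): fail fast when multi-node mode needs SLURM variables that are absent, rather than defaulting through `<not set>` placeholders.

```shell
# Hypothetical validation: refuse to proceed in multi-node mode when the
# required SLURM variables are missing, instead of silently falling back.
require_slurm_vars() {
    missing=""
    [ -n "${SLURM_JOB_NODELIST:-}" ] || missing="$missing SLURM_JOB_NODELIST"
    [ -n "${SLURM_NNODES:-}" ] || missing="$missing SLURM_NNODES"
    if [ -n "$missing" ]; then
        echo "ERROR: multi-node mode requires:$missing" >&2
        return 1
    fi
}
```

Calling this at the top of the multi-node branch turns a misconfiguration into an immediate, explicit error.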
```shell
if [ -z "$UWSGI_CHEAPER" ]; then
    UWSGI_CHEAPER=1
elif [ "$UWSGI_CHEAPER" -le 0 ]; then
```
why are the uWSGI cheaper warnings silenced?
```shell
# =============================================================================
# Write port assignments to shared storage
# =============================================================================
PORTS_FILE="$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt"
```
shouldn't this only happen in multi-node mode?
```shell
# --- Worker node: local nginx proxy forwarding to master ---
echo "[$_H] Starting nginx proxy to master $MASTER_NODE:$NGINX_PORT..."
cat > /etc/nginx/nginx.conf << EOF
events {
```
can a lot of this content go in the template? or is this different?
```shell
# when nginx/uWSGI are already running. However, in multi-node mode, if the
# monitoring loop restarts a crashed worker, the new uWSGI process would be
# unable to bind its listening socket. We therefore only enable network blocking
# in single-node mode where worker restarts are less critical.
```
I don't think we should ONLY do it in single-node mode, but we should at least clarify the behavior here and maybe make the errors more transparent
1. Fix misleading log output: only show SLURM vars when they're set, and emit a clear diagnostic when SLURM_JOB_NODELIST expansion fails instead of a silent fallback.
2. Restore uWSGI cheaper validation warnings that were lost in the restructuring. Invalid values are still auto-corrected but now log what happened.
3. Only write port files to shared storage in multi-node mode: the port coordination protocol is unnecessary overhead in single-node.
4. Extract worker proxy nginx config to a separate template file (nginx-worker-proxy.conf.template) instead of an inline heredoc.
5. Enable network blocking in all modes (not just single-node). Add NETWORK_BLOCKING_ACTIVE flag so the monitoring loop emits a clear diagnostic when a worker restart fails due to ld.so.preload blocking socket() in new processes.
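Item 5's diagnostic flag might be wired up like the sketch below. The helper function name and message wording are assumptions; the commit only states that a flag is set and that the monitoring loop uses it to explain restart failures.

```shell
# Sketch of the NETWORK_BLOCKING_ACTIVE diagnostic: set once when
# ld.so.preload network blocking is enabled, then consulted by the
# (hypothetical) restart path so the failure mode is explained, not silent.
NETWORK_BLOCKING_ACTIVE=1

explain_restart_failure() {
    if [ "${NETWORK_BLOCKING_ACTIVE:-0}" = "1" ]; then
        echo "WARNING: worker restart failed while network blocking is active;" >&2
        echo "ld.so.preload intercepts socket() in newly exec'd processes, so a" >&2
        echo "restarted uWSGI worker cannot bind its listening socket" >&2
    fi
}
```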
```diff
     UWSGI_CHEAPER=1
 elif [ "$UWSGI_CHEAPER" -le 0 ]; then
-    echo "WARNING: UWSGI_CHEAPER ($UWSGI_CHEAPER) must be at least 1"
+    echo "WARNING: UWSGI_CHEAPER=$UWSGI_CHEAPER must be >= 1, setting to 1"
```
I realize you've made these uWSGI settings warnings more precise, but I would rather just remove the diff if possible since it doesn't seem to materially change functionality
```diff
-if kill -0 "$pid" 2>/dev/null; then
-    kill -TERM "$pid" 2>/dev/null || true
-fi
+kill -0 "$pid" 2>/dev/null && kill -TERM "$pid" 2>/dev/null || true
```
is this actually a necessary, functional change?
Minimize diff by matching origin/main's exact wording for uWSGI validation warnings and using the original if/then/fi form in the cleanup function.
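The single-trap cleanup pattern discussed in this PR (folding temp-dir removal into `cleanup()` so a second EXIT trap cannot overwrite the SIGTERM/SIGINT handler, and keeping the original `if/then/fi` form) might look like this sketch. `WORKER_PID` and the exact body are assumptions for illustration.

```shell
# Sketch: one cleanup() registered for EXIT, TERM, and INT in a single
# trap call, so later code cannot clobber the signal handler by
# registering a separate EXIT trap.
TMP_DIR="$(mktemp -d)"

cleanup() {
    # terminate a tracked child if it is still alive (if/then/fi form)
    if [ -n "${WORKER_PID:-}" ] && kill -0 "$WORKER_PID" 2>/dev/null; then
        kill -TERM "$WORKER_PID" 2>/dev/null || true
    fi
    # temp dir removal lives here rather than in its own EXIT trap
    rm -rf "$TMP_DIR"
}
trap cleanup EXIT TERM INT
```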
Per review feedback: all benchmark-specific packages should go to core for now since JIT install is not yet implemented.

Previously only PythonTool-specific deps were in core, while benchmark deps like datasets, sacrebleu, faiss-cpu, etc. were only in main.txt. This led to an inconsistent boundary where math grader deps were in core but BFCL deps were not, despite both being benchmark-specific.

Addresses review comments #1, NVIDIA-NeMo#4, NVIDIA-NeMo#6 on PR NVIDIA-NeMo#1229.

Signed-off-by: George Armstrong <georgea@nvidia.com>
Summary
Enable the sandbox (code execution environment) to scale across multiple SLURM nodes for large-scale RL training jobs.
- Auto-detect `SLURM_JOB_NODELIST` and expand compressed nodelists using a built-in Python parser
- Allocate `NUM_WORKERS` uWSGI workers on TCP ports (base 50001+), with automatic port conflict detection and retry
- Report ports via shared filesystem (`/nemo_run/sandbox_ports_*`), with Lustre cache invalidation
- Parallel health checks via `xargs -P 64`

Architecture (multi-node)
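The parallel health-check step mentioned above could be sketched as follows. The `/health` endpoint path, the `PROBE_CMD` override, and the function name are assumptions for illustration; the PR only states that checks run in parallel via `xargs -P 64`.

```shell
# Sketch: probe every node:port endpoint in parallel, up to 64 in flight,
# printing a line for each unhealthy endpoint. PROBE_CMD defaults to curl
# and can be overridden (e.g. for dry runs or tests).
check_endpoints() {
    # reads node:port lines on stdin
    : "${PROBE_CMD:=curl -sf -o /dev/null}"
    export PROBE_CMD
    xargs -P 64 -I {} sh -c \
        '$PROBE_CMD "http://$1/health" >/dev/null 2>&1 || echo "UNHEALTHY $1"' _ {}
}
```

Feeding it the `node:port` lines built from the shared-storage port reports yields a failure list in one pass, instead of probing hundreds of workers sequentially.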
Commits
- `feat: add multi-node sandbox support` — main rewrite of `start-with-nginx.sh`, Dockerfile, nginx template
- `fix: address review findings` — self-review fixes: pass nodelist via `sys.argv` (not string interpolation), fold temp dir cleanup into `cleanup()`, remove dead code (`is_port_free`, `find_free_port`)

Validation