
feat: add multi-node sandbox support for SLURM clusters#1

Open
gwarmstrong wants to merge 6 commits into main from georgea/multinode-sandbox-pr

Conversation

@gwarmstrong (Owner):

Summary

Enable the sandbox (code execution environment) to scale across multiple SLURM nodes for large-scale RL training jobs.

  • SLURM multi-node detection: Auto-detect SLURM_JOB_NODELIST and expand compressed nodelists using built-in Python parser
  • Per-node TCP workers with unique ports: Each node starts NUM_WORKERS uWSGI workers on TCP ports (base 50001+), with automatic port conflict detection and retry
  • Cross-node port coordination: Workers report ports to shared filesystem (/nemo_run/sandbox_ports_*), with Lustre cache invalidation
  • Nginx load balancing across all nodes: Master collects port reports, generates upstream config. Worker nodes proxy to master
  • Parallel startup: Simultaneous worker spawn, parallel health checks, parallel remote verification via xargs -P 64
  • Backward-compatible: Single-node auto-detected when SLURM vars absent
  • Dockerfile: Cache optimization (moved COPY late), exec-form CMD
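The nodelist expansion mentioned in the first bullet can be sketched roughly as below. This is a hypothetical illustration, not the PR's actual parser: the function name and the exact bracket grammar handled are assumptions (SLURM's real syntax also allows nested cases this sketch ignores).

```shell
# Hypothetical sketch of expanding a compressed SLURM nodelist such as
# "dfw[001-003,010]" into one hostname per line, via an inline Python parser.
expand_nodelist() {
    python3 - "$1" <<'PY'
import re, sys

def expand(nodelist):
    hosts = []
    # Top-level entries are "prefix[ranges]" or bare hostnames, comma-separated.
    for m in re.finditer(r'([^,\[]+)(?:\[([^\]]+)\])?(?:,|$)', nodelist):
        prefix, ranges = m.group(1), m.group(2)
        if ranges is None:
            hosts.append(prefix)
            continue
        for part in ranges.split(','):
            if '-' in part:
                lo, hi = part.split('-')
                width = len(lo)  # preserve zero-padding, e.g. 001..003
                hosts.extend(f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1))
            else:
                hosts.append(prefix + part)
    return hosts

print('\n'.join(expand(sys.argv[1])))
PY
}

expand_nodelist "dfw[001-003,010]"   # prints dfw001 dfw002 dfw003 dfw010, one per line
```

Using a built-in parser avoids depending on `scontrol` being present inside the container.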

Architecture (multi-node)

Master Node:                           Worker Nodes:
┌─────────────────────┐               ┌─────────────────────┐
│ nginx LB (:6000)    │               │ nginx proxy (:6000) │──→ Master nginx
│  ├→ master:50001    │               │                     │
│  ├→ ...             │               │ uWSGI workers       │
│  ├→ worker1:50001   │               │  ├→ :50001          │
│  └→ ...             │               │  └→ ...             │
│                     │               └─────────────────────┘
│ uWSGI workers       │
│  ├→ :50001          │               Port files (shared FS):
│  └→ ...             │               /nemo_run/sandbox_ports_<JOB>/
└─────────────────────┘                 ├─ master_ports.txt
                                        └─ worker1_ports.txt

Commits

  1. feat: add multi-node sandbox support — main rewrite of start-with-nginx.sh, Dockerfile, nginx template
  2. fix: address review findings — self-review fixes:
    • Shell injection fix (pass nodelist via sys.argv not string interpolation)
    • Trap overwrite fix (fold temp cleanup into cleanup())
    • Remove dead code (is_port_free, find_free_port)
    • Network blocking on all nodes (was master-only)
    • Stale port file cleanup on startup
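The shell-injection fix in commit 2 comes down to passing untrusted input as an argument instead of splicing it into the script text. A minimal sketch of the unsafe vs. safe pattern (variable names are illustrative, not the PR's actual code):

```shell
# UNSAFE (sketch): the nodelist is interpolated into the Python source itself,
# so a crafted SLURM_JOB_NODELIST could break out of the string literal:
#   python3 -c "nodelist = '''$SLURM_JOB_NODELIST'''; ..."

# SAFE: pass the value as an argv entry; Python receives it as data, never code.
NODELIST='node[01-02]'
python3 -c 'import sys; print("got:", sys.argv[1])' "$NODELIST"
# prints: got: node[01-02]
```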

Validation

  • 16-node DFW run (128 workers/node = 2048 total), 9594 requests, 0 errors
  • 2-node DFW run successful
  • Single-node backward compatible

Enable the sandbox (code execution environment) to scale across multiple
SLURM nodes for large-scale RL training jobs.

Key changes:
- Auto-detect SLURM multi-node environments and expand nodelists
- Allocate unique TCP ports per worker with parallel startup and
  automatic port conflict retry
- Coordinate port reporting between nodes via shared filesystem
- Configure nginx upstream to load-balance across all nodes' workers
- Worker nodes run local nginx proxy forwarding to master's LB
- Parallel health checks for faster startup with many workers
- Backward-compatible: single-node mode auto-detected when SLURM
  vars are absent

Validated on DFW with 16-node (128 workers/node) runs: 9594 successful
requests, 0 errors.
- Pass SLURM nodelist via sys.argv instead of shell interpolation
  into Python triple-quoted string (prevents injection)
- Fix trap overwrite: fold temp dir cleanup into cleanup() instead
  of a separate EXIT trap that overwrote SIGTERM/SIGINT handler
- Remove unused is_port_free() and find_free_port() dead code
- Move network blocking (ld.so.preload) outside master-only branch
  so it applies on all nodes (worker nodes also run user code)
- Clean stale port files on startup to handle SANDBOX_PORTS_DIR reuse
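The trap-overwrite fix matters because `trap ... EXIT` replaces any previously installed handler for that event rather than stacking. A minimal sketch of the consolidated pattern described above (function and variable names are illustrative):

```shell
#!/bin/bash
TMP_DIR="$(mktemp -d)"

cleanup() {
    # One handler owns all teardown: temp dir removal plus (elided here)
    # shutting down nginx/uWSGI workers.
    rm -rf "$TMP_DIR"
}

# A second `trap ... EXIT` registered later would silently drop this handler,
# which is why the temp cleanup is folded into cleanup() instead.
trap cleanup EXIT SIGTERM SIGINT

echo "tmp dir: $TMP_DIR"
```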
Document all required and optional environment variables grouped by
category: worker configuration, multi-node/SLURM, and security.
# Force Lustre cache invalidation - more aggressive than stat/ls
# 1. Create and delete a temp file to invalidate directory cache
_tmp_invalidate="$PORTS_REPORT_DIR/.cache_invalidate_$$_$(date +%s%N)"
touch "$_tmp_invalidate" 2>/dev/null && rm -f "$_tmp_invalidate" 2>/dev/null || true
gwarmstrong (Owner Author):

Is this truly necessary? I'm not sure we're going to run up against Lustre caching often.

> $UPSTREAM_FILE

for node in $ALL_NODES; do
node_short="${node%%.*}"
gwarmstrong (Owner Author):

Is it possible to create some utility functions for things like this? I'm worried it makes the script hard to read as is.

echo "Starting worker load monitor (updates every 60s)..."
while true; do
sleep 60
echo "--- Worker Load Stats (Top 10) at $(date) ---"
gwarmstrong (Owner Author):

the $(hostname) should be here

# Applied on ALL nodes since worker nodes run sandboxed user code too
# =============================================================================
BLOCK_NETWORK_LIB="/usr/lib/libblock_network.so"
if [ "${NEMO_SKILLS_SANDBOX_BLOCK_NETWORK:-0}" = "1" ]; then
gwarmstrong (Owner Author):

I'm a little worried that network blocking will mess up inter-node communication in the multi-node deployment; scope this out.

rm -rf "$REMOTE_HEALTH_DIR" "$ENDPOINTS_FILE"
else
# Single-node mode: generate config from local ports only
echo "Single-node mode: generating nginx config from local ports"
gwarmstrong (Owner Author):

The control flow to get here is pretty confusing. Is it also possible to have an override that forces single-node mode here? That way, if multi-node turns out to be bad for, e.g., network blocking, we can force it seamlessly.

Address PR review comments:

1. Remove aggressive Lustre cache invalidation (touch/rm/ls/sync dance).
   The cat-based file read already forces Lustre to fetch content; the
   extra invalidation was unnecessary overhead.

2. Extract utility functions for readability:
   - generate_nginx_config() — template substitution + nginx -t
   - read_port_file() — parse port files, emit node:port lines
   - wait_for_port_reports() — poll shared storage for all nodes
   - verify_remote_workers() — parallel health checks via xargs
   This makes the nginx setup section a clear linear flow:
     wait_for_port_reports → build upstream → generate_nginx_config →
     verify_remote_workers

3. Add $(hostname) to load monitor stats output.

4. Skip network blocking in multi-node mode. ld.so.preload intercepts
   socket() in all new exec'd processes — if the monitoring loop restarts
   a crashed worker, the new uWSGI process would be unable to bind its
   listening socket. Document this limitation.

5. Add SANDBOX_FORCE_SINGLE_NODE env var to override multi-node detection.
   Useful for debugging or when multi-node sandbox causes issues.

Also: trim verbose debug logging, reduce file from ~1000 to ~710 lines.
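The PR doesn't show the extracted utilities themselves; a minimal sketch of what `read_port_file()` might look like, assuming one port number per line in `<node>_ports.txt` files (the exact file format is an assumption):

```shell
# Hypothetical sketch of the read_port_file() utility named in the commit.
# Parses a "<node>_ports.txt" port report and emits "node:port" lines
# suitable for building the nginx upstream block.
read_port_file() {
    local file="$1"
    local node port
    node="$(basename "$file" _ports.txt)"   # e.g. ".../nodeA_ports.txt" -> "nodeA"
    while IFS= read -r port; do
        # Skip blank lines and anything that isn't a bare port number.
        [[ "$port" =~ ^[0-9]+$ ]] || continue
        echo "${node}:${port}"
    done < "$file"
}
```

Each utility keeping a single, testable responsibility is what makes the nginx setup section read as the linear flow described above.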
# Node discovery
# =============================================================================
_H=$(hostname)
echo "[$_H] SLURM_JOB_NODELIST=${SLURM_JOB_NODELIST:-<not set>} SLURM_NNODES=${SLURM_NNODES:-<not set>}"
gwarmstrong (Owner Author):

Wouldn't this mean setting SLURM_NNODES is effectively required, at least in order to use multi-node? But it isn't specified above, and the same goes for SLURM_JOB_NODELIST. I think this <not set> fallback is going to silence misconfigurations.


if [ -z "$UWSGI_CHEAPER" ]; then
UWSGI_CHEAPER=1
elif [ "$UWSGI_CHEAPER" -le 0 ]; then
gwarmstrong (Owner Author):

Why are the uWSGI cheaper warnings silenced?

# =============================================================================
# Write port assignments to shared storage
# =============================================================================
PORTS_FILE="$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt"
gwarmstrong (Owner Author):

Shouldn't this only happen in multi-node mode?

# --- Worker node: local nginx proxy forwarding to master ---
echo "[$_H] Starting nginx proxy to master $MASTER_NODE:$NGINX_PORT..."
cat > /etc/nginx/nginx.conf << EOF
events {
gwarmstrong (Owner Author):

Can a lot of this content go in the template, or is this different?

# when nginx/uWSGI are already running. However, in multi-node mode, if the
# monitoring loop restarts a crashed worker, the new uWSGI process would be
# unable to bind its listening socket. We therefore only enable network blocking
# in single-node mode where worker restarts are less critical.
gwarmstrong (Owner Author):

I don't think we should ONLY do it in single-node mode; we should at least clarify the behavior here and maybe make the errors more transparent.

1. Fix misleading log output: only show SLURM vars when they're set,
   and emit a clear diagnostic when SLURM_JOB_NODELIST expansion fails
   instead of a silent fallback.

2. Restore uWSGI cheaper validation warnings that were lost in the
   restructuring. Invalid values are still auto-corrected but now log
   what happened.

3. Only write port files to shared storage in multi-node mode — the
   port coordination protocol is unnecessary overhead in single-node.

4. Extract worker proxy nginx config to a separate template file
   (nginx-worker-proxy.conf.template) instead of an inline heredoc.

5. Enable network blocking on all modes (not just single-node). Add
   NETWORK_BLOCKING_ACTIVE flag so the monitoring loop emits a clear
   diagnostic when a worker restart fails due to ld.so.preload blocking
   socket() in new processes.
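Combining item 1 here with the SANDBOX_FORCE_SINGLE_NODE override from the earlier commit, the detection guard might look roughly like this. Only SANDBOX_FORCE_SINGLE_NODE and the SLURM variable names come from the PR; the MULTINODE flag and the exact messages are assumptions:

```shell
# Sketch of node-mode detection with an explicit override and clear diagnostics,
# instead of a silent "<not set>" fallback.
MULTINODE=0
if [ "${SANDBOX_FORCE_SINGLE_NODE:-0}" = "1" ]; then
    echo "SANDBOX_FORCE_SINGLE_NODE=1: forcing single-node mode"
elif [ -n "${SLURM_JOB_NODELIST:-}" ] && [ "${SLURM_NNODES:-1}" -gt 1 ]; then
    MULTINODE=1
    echo "Multi-node mode: SLURM_NNODES=$SLURM_NNODES nodes ($SLURM_JOB_NODELIST)"
elif [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    # SLURM is present but only one node: say so explicitly rather than
    # silently falling through, so misconfigurations are visible in the logs.
    echo "SLURM detected but SLURM_NNODES=${SLURM_NNODES:-1}: single-node mode"
else
    echo "No SLURM environment detected: single-node mode"
fi
```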
UWSGI_CHEAPER=1
elif [ "$UWSGI_CHEAPER" -le 0 ]; then
echo "WARNING: UWSGI_CHEAPER ($UWSGI_CHEAPER) must be at least 1"
echo "WARNING: UWSGI_CHEAPER=$UWSGI_CHEAPER must be >= 1, setting to 1"
gwarmstrong (Owner Author):

I realize you've made these uWSGI warning messages more precise, but I would rather just remove the diff if possible since it doesn't seem to materially change functionality.

if kill -0 "$pid" 2>/dev/null; then
kill -TERM "$pid" 2>/dev/null || true
fi
kill -0 "$pid" 2>/dev/null && kill -TERM "$pid" 2>/dev/null || true
gwarmstrong (Owner Author):

Is this actually a necessary and functional change?

Minimize diff by matching origin/main's exact wording for uWSGI
validation warnings and using the original if/then/fi form in the
cleanup function.
gwarmstrong added a commit that referenced this pull request Feb 14, 2026
Per review feedback: all benchmark-specific packages should go to core
for now since JIT install is not yet implemented. Previously only
PythonTool-specific deps were in core while benchmark deps like datasets,
sacrebleu, faiss-cpu, etc. were only in main.txt. This led to an
inconsistent boundary where math grader deps were in core but BFCL deps
were not, despite both being benchmark-specific.

Addresses review comments #1, NVIDIA-NeMo#4, NVIDIA-NeMo#6 on PR NVIDIA-NeMo#1229.

Signed-off-by: George Armstrong <georgea@nvidia.com>
