
feat: add multi-node sandbox support for SLURM clusters#1

Open
gwarmstrong wants to merge 6 commits into main from georgea/multinode-sandbox-pr

Conversation

@gwarmstrong (Owner):

Summary

Enable the sandbox (code execution environment) to scale across multiple SLURM nodes for large-scale RL training jobs.

  • SLURM multi-node detection: Auto-detect SLURM_JOB_NODELIST and expand compressed nodelists using built-in Python parser
  • Per-node TCP workers with unique ports: Each node starts NUM_WORKERS uWSGI workers on TCP ports (base 50001+), with automatic port conflict detection and retry
  • Cross-node port coordination: Workers report ports to shared filesystem (/nemo_run/sandbox_ports_*), with Lustre cache invalidation
  • Nginx load balancing across all nodes: Master collects port reports, generates upstream config. Worker nodes proxy to master
  • Parallel startup: Simultaneous worker spawn, parallel health checks, parallel remote verification via xargs -P 64
  • Backward-compatible: Single-node auto-detected when SLURM vars absent
  • Dockerfile: Cache optimization (moved COPY late), exec-form CMD
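The nodelist expansion mentioned in the first bullet can be sketched roughly as below. This is a hypothetical illustration, not the PR's actual parser: the function name and the exact bracket grammar handled are assumptions (SLURM's real syntax also allows nested cases this sketch ignores).

```shell
# Hypothetical sketch of expanding a compressed SLURM nodelist such as
# "dfw[001-003,010]" into one hostname per line, via an inline Python parser.
expand_nodelist() {
    python3 - "$1" <<'PY'
import re, sys

def expand(nodelist):
    hosts = []
    # Top-level entries are "prefix[ranges]" or bare hostnames, comma-separated.
    for m in re.finditer(r'([^,\[]+)(?:\[([^\]]+)\])?(?:,|$)', nodelist):
        prefix, ranges = m.group(1), m.group(2)
        if ranges is None:
            hosts.append(prefix)
            continue
        for part in ranges.split(','):
            if '-' in part:
                lo, hi = part.split('-')
                width = len(lo)  # preserve zero-padding, e.g. 001..003
                hosts.extend(f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1))
            else:
                hosts.append(prefix + part)
    return hosts

print('\n'.join(expand(sys.argv[1])))
PY
}

expand_nodelist "dfw[001-003,010]"   # prints dfw001 dfw002 dfw003 dfw010, one per line
```

Using a built-in parser avoids depending on `scontrol` being present inside the container.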

Architecture (multi-node)

Master Node:                           Worker Nodes:
┌─────────────────────┐               ┌─────────────────────┐
│ nginx LB (:6000)    │               │ nginx proxy (:6000) │──→ Master nginx
│  ├→ master:50001    │               │                     │
│  ├→ ...             │               │ uWSGI workers       │
│  ├→ worker1:50001   │               │  ├→ :50001          │
│  └→ ...             │               │  └→ ...             │
│                     │               └─────────────────────┘
│ uWSGI workers       │
│  ├→ :50001          │               Port files (shared FS):
│  └→ ...             │               /nemo_run/sandbox_ports_<JOB>/
└─────────────────────┘                 ├─ master_ports.txt
                                        └─ worker1_ports.txt

Commits

  1. feat: add multi-node sandbox support — main rewrite of start-with-nginx.sh, Dockerfile, nginx template
  2. fix: address review findings — self-review fixes:
    • Shell injection fix (pass nodelist via sys.argv not string interpolation)
    • Trap overwrite fix (fold temp cleanup into cleanup())
    • Remove dead code (is_port_free, find_free_port)
    • Network blocking on all nodes (was master-only)
    • Stale port file cleanup on startup
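The shell-injection fix in commit 2 comes down to passing untrusted input as an argument instead of splicing it into the script text. A minimal sketch of the unsafe vs. safe pattern (variable names are illustrative, not the PR's actual code):

```shell
# UNSAFE (sketch): the nodelist is interpolated into the Python source itself,
# so a crafted SLURM_JOB_NODELIST could break out of the string literal:
#   python3 -c "nodelist = '''$SLURM_JOB_NODELIST'''; ..."

# SAFE: pass the value as an argv entry; Python receives it as data, never code.
NODELIST='node[01-02]'
python3 -c 'import sys; print("got:", sys.argv[1])' "$NODELIST"
# prints: got: node[01-02]
```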

Validation

  • 16-node DFW run (128 workers/node = 2048 total), 9594 requests, 0 errors
  • 2-node DFW run successful
  • Single-node backward compatible

Enable the sandbox (code execution environment) to scale across multiple
SLURM nodes for large-scale RL training jobs.

Key changes:
- Auto-detect SLURM multi-node environments and expand nodelists
- Allocate unique TCP ports per worker with parallel startup and
  automatic port conflict retry
- Coordinate port reporting between nodes via shared filesystem
- Configure nginx upstream to load-balance across all nodes' workers
- Worker nodes run local nginx proxy forwarding to master's LB
- Parallel health checks for faster startup with many workers
- Backward-compatible: single-node mode auto-detected when SLURM
  vars are absent

Validated on DFW with 16-node (128 workers/node) runs: 9594 successful
requests, 0 errors.
- Pass SLURM nodelist via sys.argv instead of shell interpolation
  into Python triple-quoted string (prevents injection)
- Fix trap overwrite: fold temp dir cleanup into cleanup() instead
  of a separate EXIT trap that overwrote SIGTERM/SIGINT handler
- Remove unused is_port_free() and find_free_port() dead code
- Move network blocking (ld.so.preload) outside master-only branch
  so it applies on all nodes (worker nodes also run user code)
- Clean stale port files on startup to handle SANDBOX_PORTS_DIR reuse
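The trap-overwrite fix matters because `trap ... EXIT` replaces any previously installed handler for that event rather than stacking. A minimal sketch of the consolidated pattern described above (function and variable names are illustrative):

```shell
#!/bin/bash
TMP_DIR="$(mktemp -d)"

cleanup() {
    # One handler owns all teardown: temp dir removal plus (elided here)
    # shutting down nginx/uWSGI workers.
    rm -rf "$TMP_DIR"
}

# A second `trap ... EXIT` registered later would silently drop this handler,
# which is why the temp cleanup is folded into cleanup() instead.
trap cleanup EXIT SIGTERM SIGINT

echo "tmp dir: $TMP_DIR"
```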
Document all required and optional environment variables grouped by
category: worker configuration, multi-node/SLURM, and security.
# Force Lustre cache invalidation - more aggressive than stat/ls
# 1. Create and delete a temp file to invalidate directory cache
_tmp_invalidate="$PORTS_REPORT_DIR/.cache_invalidate_$$_$(date +%s%N)"
touch "$_tmp_invalidate" 2>/dev/null && rm -f "$_tmp_invalidate" 2>/dev/null || true
gwarmstrong (Owner Author):

Is this truly necessary? I'm not sure we're going to run up against Lustre caching often.

> $UPSTREAM_FILE

for node in $ALL_NODES; do
node_short="${node%%.*}"
gwarmstrong (Owner Author):

Is it possible to create some utility functions for things like this? I'm worried it makes the script hard to read as is.

echo "Starting worker load monitor (updates every 60s)..."
while true; do
sleep 60
echo "--- Worker Load Stats (Top 10) at $(date) ---"
gwarmstrong (Owner Author):

the $(hostname) should be here

# Applied on ALL nodes since worker nodes run sandboxed user code too
# =============================================================================
BLOCK_NETWORK_LIB="/usr/lib/libblock_network.so"
if [ "${NEMO_SKILLS_SANDBOX_BLOCK_NETWORK:-0}" = "1" ]; then
gwarmstrong (Owner Author):

I'm a little worried that network blocking will mess up inter-node communication in the multi-node deployment; scope this out.

rm -rf "$REMOTE_HEALTH_DIR" "$ENDPOINTS_FILE"
else
# Single-node mode: generate config from local ports only
echo "Single-node mode: generating nginx config from local ports"
gwarmstrong (Owner Author):

The control flow to get here is pretty confusing. Is it also possible to have an override that forces single-node mode here? That way, if multi-node turns out to be bad for, e.g., network blocking, we can force it seamlessly.

Address PR review comments:

1. Remove aggressive Lustre cache invalidation (touch/rm/ls/sync dance).
   The cat-based file read already forces Lustre to fetch content; the
   extra invalidation was unnecessary overhead.

2. Extract utility functions for readability:
   - generate_nginx_config() — template substitution + nginx -t
   - read_port_file() — parse port files, emit node:port lines
   - wait_for_port_reports() — poll shared storage for all nodes
   - verify_remote_workers() — parallel health checks via xargs
   This makes the nginx setup section a clear linear flow:
     wait_for_port_reports → build upstream → generate_nginx_config →
     verify_remote_workers

3. Add $(hostname) to load monitor stats output.

4. Skip network blocking in multi-node mode. ld.so.preload intercepts
   socket() in all new exec'd processes — if the monitoring loop restarts
   a crashed worker, the new uWSGI process would be unable to bind its
   listening socket. Document this limitation.

5. Add SANDBOX_FORCE_SINGLE_NODE env var to override multi-node detection.
   Useful for debugging or when multi-node sandbox causes issues.

Also: trim verbose debug logging, reduce file from ~1000 to ~710 lines.
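The PR doesn't show the extracted utilities themselves; a minimal sketch of what `read_port_file()` might look like, assuming one port number per line in `<node>_ports.txt` files (the exact file format is an assumption):

```shell
# Hypothetical sketch of the read_port_file() utility named in the commit.
# Parses a "<node>_ports.txt" port report and emits "node:port" lines
# suitable for building the nginx upstream block.
read_port_file() {
    local file="$1"
    local node port
    node="$(basename "$file" _ports.txt)"   # e.g. ".../nodeA_ports.txt" -> "nodeA"
    while IFS= read -r port; do
        # Skip blank lines and anything that isn't a bare port number.
        [[ "$port" =~ ^[0-9]+$ ]] || continue
        echo "${node}:${port}"
    done < "$file"
}
```

Each utility keeping a single, testable responsibility is what makes the nginx setup section read as the linear flow described above.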
# Node discovery
# =============================================================================
_H=$(hostname)
echo "[$_H] SLURM_JOB_NODELIST=${SLURM_JOB_NODELIST:-<not set>} SLURM_NNODES=${SLURM_NNODES:-<not set>}"
gwarmstrong (Owner Author):

Wouldn't this mean setting SLURM_NNODES is effectively required, at least in order to use multi-node? But it isn't specified above, and the same goes for SLURM_JOB_NODELIST. I think this <not set> fallback is going to silence misconfigurations.


if [ -z "$UWSGI_CHEAPER" ]; then
UWSGI_CHEAPER=1
elif [ "$UWSGI_CHEAPER" -le 0 ]; then
gwarmstrong (Owner Author):

Why are the uWSGI cheaper warnings silenced?

# =============================================================================
# Write port assignments to shared storage
# =============================================================================
PORTS_FILE="$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt"
gwarmstrong (Owner Author):

Shouldn't this only happen in multi-node mode?

# --- Worker node: local nginx proxy forwarding to master ---
echo "[$_H] Starting nginx proxy to master $MASTER_NODE:$NGINX_PORT..."
cat > /etc/nginx/nginx.conf << EOF
events {
gwarmstrong (Owner Author):

Can a lot of this content go in the template, or is this different?

# when nginx/uWSGI are already running. However, in multi-node mode, if the
# monitoring loop restarts a crashed worker, the new uWSGI process would be
# unable to bind its listening socket. We therefore only enable network blocking
# in single-node mode where worker restarts are less critical.
gwarmstrong (Owner Author):

I don't think we should ONLY do it in single-node mode; we should at least clarify the behavior here and maybe make the errors more transparent.

1. Fix misleading log output: only show SLURM vars when they're set,
   and emit a clear diagnostic when SLURM_JOB_NODELIST expansion fails
   instead of a silent fallback.

2. Restore uWSGI cheaper validation warnings that were lost in the
   restructuring. Invalid values are still auto-corrected but now log
   what happened.

3. Only write port files to shared storage in multi-node mode — the
   port coordination protocol is unnecessary overhead in single-node.

4. Extract worker proxy nginx config to a separate template file
   (nginx-worker-proxy.conf.template) instead of an inline heredoc.

5. Enable network blocking on all modes (not just single-node). Add
   NETWORK_BLOCKING_ACTIVE flag so the monitoring loop emits a clear
   diagnostic when a worker restart fails due to ld.so.preload blocking
   socket() in new processes.
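Combining item 1 here with the SANDBOX_FORCE_SINGLE_NODE override from the earlier commit, the detection guard might look roughly like this. Only SANDBOX_FORCE_SINGLE_NODE and the SLURM variable names come from the PR; the MULTINODE flag and the exact messages are assumptions:

```shell
# Sketch of node-mode detection with an explicit override and clear diagnostics,
# instead of a silent "<not set>" fallback.
MULTINODE=0
if [ "${SANDBOX_FORCE_SINGLE_NODE:-0}" = "1" ]; then
    echo "SANDBOX_FORCE_SINGLE_NODE=1: forcing single-node mode"
elif [ -n "${SLURM_JOB_NODELIST:-}" ] && [ "${SLURM_NNODES:-1}" -gt 1 ]; then
    MULTINODE=1
    echo "Multi-node mode: SLURM_NNODES=$SLURM_NNODES nodes ($SLURM_JOB_NODELIST)"
elif [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    # SLURM is present but only one node: say so explicitly rather than
    # silently falling through, so misconfigurations are visible in the logs.
    echo "SLURM detected but SLURM_NNODES=${SLURM_NNODES:-1}: single-node mode"
else
    echo "No SLURM environment detected: single-node mode"
fi
```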
UWSGI_CHEAPER=1
elif [ "$UWSGI_CHEAPER" -le 0 ]; then
echo "WARNING: UWSGI_CHEAPER ($UWSGI_CHEAPER) must be at least 1"
echo "WARNING: UWSGI_CHEAPER=$UWSGI_CHEAPER must be >= 1, setting to 1"
gwarmstrong (Owner Author):

I realize you've made these uWSGI warning messages more precise, but I would rather just remove the diff if possible since it doesn't seem to materially change functionality.

if kill -0 "$pid" 2>/dev/null; then
kill -TERM "$pid" 2>/dev/null || true
fi
kill -0 "$pid" 2>/dev/null && kill -TERM "$pid" 2>/dev/null || true
gwarmstrong (Owner Author):

Is this actually a necessary and functional change?

Minimize diff by matching origin/main's exact wording for uWSGI
validation warnings and using the original if/then/fi form in the
cleanup function.
gwarmstrong added a commit that referenced this pull request Feb 14, 2026
Per review feedback: all benchmark-specific packages should go to core
for now since JIT install is not yet implemented. Previously only
PythonTool-specific deps were in core while benchmark deps like datasets,
sacrebleu, faiss-cpu, etc. were only in main.txt. This led to an
inconsistent boundary where math grader deps were in core but BFCL deps
were not, despite both being benchmark-specific.

Addresses review comments #1, NVIDIA-NeMo#4, NVIDIA-NeMo#6 on PR NVIDIA-NeMo#1229.

Signed-off-by: George Armstrong <georgea@nvidia.com>
