Add multi-node sandbox support for SLURM clusters#1218
Enable the sandbox (code execution environment) to scale across multiple SLURM nodes for large-scale RL training jobs.
Key changes:
- Auto-detect SLURM multi-node environments and expand nodelists
- Allocate unique TCP ports per worker with parallel startup and automatic port conflict retry
- Coordinate port reporting between nodes via shared filesystem
- Configure nginx upstream to load-balance across all nodes' workers
- Worker nodes run local nginx proxy forwarding to master's LB
- Parallel health checks for faster startup with many workers
- Backward-compatible: single-node mode auto-detected when SLURM vars are absent
Validated on DFW with 16-node (128 workers/node) runs: 9594 successful requests, 0 errors.
Signed-off-by: George Armstrong <georgea@nvidia.com>
- Pass SLURM nodelist via sys.argv instead of shell interpolation into Python triple-quoted string (prevents injection)
- Fix trap overwrite: fold temp dir cleanup into cleanup() instead of a separate EXIT trap that overwrote SIGTERM/SIGINT handler
- Remove unused is_port_free() and find_free_port() dead code
- Move network blocking (ld.so.preload) outside master-only branch so it applies on all nodes (worker nodes also run user code)
- Clean stale port files on startup to handle SANDBOX_PORTS_DIR reuse
Signed-off-by: George Armstrong <georgea@nvidia.com>
Document all required and optional environment variables grouped by category: worker configuration, multi-node/SLURM, and security. Signed-off-by: George Armstrong <georgea@nvidia.com>
Address PR review comments:
1. Remove aggressive Lustre cache invalidation (touch/rm/ls/sync dance).
The cat-based file read already forces Lustre to fetch content; the
extra invalidation was unnecessary overhead.
2. Extract utility functions for readability:
- generate_nginx_config() — template substitution + nginx -t
- read_port_file() — parse port files, emit node:port lines
- wait_for_port_reports() — poll shared storage for all nodes
- verify_remote_workers() — parallel health checks via xargs
This makes the nginx setup section a clear linear flow:
wait_for_port_reports → build upstream → generate_nginx_config →
verify_remote_workers
3. Add $(hostname) to load monitor stats output.
4. Skip network blocking in multi-node mode. ld.so.preload intercepts
socket() in all new exec'd processes — if the monitoring loop restarts
a crashed worker, the new uWSGI process would be unable to bind its
listening socket. Document this limitation.
5. Add SANDBOX_FORCE_SINGLE_NODE env var to override multi-node detection.
Useful for debugging or when multi-node sandbox causes issues.
Also: trim verbose debug logging, reduce file from ~1000 to ~710 lines.
Signed-off-by: George Armstrong <georgea@nvidia.com>
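The detection order described in these commits (explicit override first, then SLURM variables, then a single-node fallback) can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `detect_nodes` is a hypothetical name, treating `SANDBOX_FORCE_SINGLE_NODE=1` as the "on" value is an assumption, and `scontrol show hostnames` stands in for the Python-based nodelist expansion the PR mentions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve the list of sandbox nodes.
# Prints one hostname per line; prints 127.0.0.1 in single-node mode.
detect_nodes() {
    # Assumed convention: the value "1" forces single-node mode.
    if [ "${SANDBOX_FORCE_SINGLE_NODE:-0}" = "1" ]; then
        echo "127.0.0.1"
        return 0
    fi
    # No SLURM nodelist (or no scontrol binary): single-node mode.
    if [ -z "${SLURM_JOB_NODELIST:-}" ] || ! command -v scontrol >/dev/null 2>&1; then
        echo "127.0.0.1"
        return 0
    fi
    # Expand compressed nodelists such as node[001-004].
    local expanded
    if expanded=$(scontrol show hostnames "$SLURM_JOB_NODELIST" 2>/dev/null) && [ -n "$expanded" ]; then
        printf '%s\n' "$expanded"
    else
        # Emit a diagnostic instead of falling back silently.
        echo "WARNING: could not expand SLURM_JOB_NODELIST='$SLURM_JOB_NODELIST'; using single-node mode" >&2
        echo "127.0.0.1"
    fi
}

ALL_NODES=$(detect_nodes)
NODE_COUNT=$(printf '%s\n' "$ALL_NODES" | wc -l | tr -d ' ')
echo "Nodes: $NODE_COUNT"
```

Keeping the override check first means a debugging run can ignore whatever SLURM exports without unsetting any variables.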
1. Fix misleading log output: only show SLURM vars when they're set, and emit a clear diagnostic when SLURM_JOB_NODELIST expansion fails instead of a silent fallback.
2. Restore uWSGI cheaper validation warnings that were lost in the restructuring. Invalid values are still auto-corrected but now log what happened.
3. Only write port files to shared storage in multi-node mode; the port coordination protocol is unnecessary overhead in single-node.
4. Extract worker proxy nginx config to a separate template file (nginx-worker-proxy.conf.template) instead of an inline heredoc.
5. Enable network blocking on all modes (not just single-node). Add NETWORK_BLOCKING_ACTIVE flag so the monitoring loop emits a clear diagnostic when a worker restart fails due to ld.so.preload blocking socket() in new processes.
Signed-off-by: George Armstrong <georgea@nvidia.com>
Minimize diff by matching origin/main's exact wording for uWSGI validation warnings and using the original if/then/fi form in the cleanup function. Signed-off-by: George Armstrong <georgea@nvidia.com>
| for node in $ALL_NODES; do | ||
| local node_short="${node%%.*}" | ||
| local port_file="$PORTS_REPORT_DIR/${node_short}_ports.txt" | ||
| if [ -f "$port_file" ] && cat "$port_file" 2>/dev/null | grep -q "PORT_REPORT_COMPLETE"; then |
useless cat with grep -q
| if [ -f "$port_file" ] && cat "$port_file" 2>/dev/null | grep -q "PORT_REPORT_COMPLETE"; then | |
| if [ -f "$port_file" ] && grep -q "PORT_REPORT_COMPLETE" "$port_file" 2>/dev/null; then |
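For context, a polling loop around this marker check might look like the sketch below. It is only an illustration of the idea behind `wait_for_port_reports`: the argument order, the one-second poll interval, and the assumption that a port file holds one port per line followed by `PORT_REPORT_COMPLETE` are not taken from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of wait_for_port_reports: poll until every node's port
# file contains the completion marker, or give up after a timeout.
# Arguments: report dir, timeout in seconds, then the node names.
wait_for_port_reports() {
    local dir="$1" timeout="$2"
    shift 2
    local deadline=$(( $(date +%s) + timeout ))
    local node node_short missing
    while :; do
        missing=0
        for node in "$@"; do
            node_short="${node%%.*}"   # strip any domain suffix
            # grep reads the file directly; a missing file counts as not ready.
            if ! grep -q "PORT_REPORT_COMPLETE" "$dir/${node_short}_ports.txt" 2>/dev/null; then
                missing=$((missing + 1))
            fi
        done
        [ "$missing" -eq 0 ] && return 0
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "ERROR: $missing node(s) did not report ports within ${timeout}s" >&2
            return 1
        fi
        sleep 1
    done
}
```

A caller would pass the expanded node list, for example `wait_for_port_reports "$PORTS_REPORT_DIR" 300 $ALL_NODES`, where the 300-second timeout is purely illustrative.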
| PORTS_REPORT_DIR="/tmp/sandbox_ports_$$" | ||
| fi | ||
| mkdir -p "$PORTS_REPORT_DIR" | ||
| rm -f "$PORTS_REPORT_DIR"/*_ports.txt 2>/dev/null || true |
race condition: all nodes execute rm -f on shared filesystem simultaneously
If multiple nodes start at nearly the same time, they can race to delete each other's port files. A node that completes writing its ports could have its file deleted by another node still executing this line.
| rm -f "$PORTS_REPORT_DIR"/*_ports.txt 2>/dev/null || true | |
| rm -f "$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt" 2>/dev/null || true |
| echo "=== Sandbox ready (MASTER) ===" | ||
| echo " Nginx LB: http://localhost:$NGINX_PORT" | ||
| echo " Nodes: $NODE_COUNT | Workers/node: $NUM_WORKERS | Total: $((NODE_COUNT * NUM_WORKERS))" | ||
| echo " Local ports: ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}" |
potential array index out of bounds if NUM_WORKERS is 0
If NUM_WORKERS=0 (either explicitly set or through nproc --all returning 0), this line would access ACTUAL_WORKER_PORTS[-1] which is invalid in bash.
Check NUM_WORKERS > 0 before array access or earlier in the script.
| echo "UWSGI cheaper mode: disabled" | ||
| echo "=== Sandbox ready (WORKER) ===" | ||
| echo " Proxy: localhost:$NGINX_PORT -> $MASTER_NODE:$NGINX_PORT" | ||
| echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]})" |
same array bounds issue as line 655
| echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]})" | |
| echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]:-none}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS>0?NUM_WORKERS-1:0))]:-none})" |
📝 Walkthrough
Reorganizes the sandbox Dockerfile and adds templates, and substantially rewrites the startup script to support TCP-based uWSGI workers for single-node and SLURM-detected multi-node deployments with dynamic Nginx config generation, cross-node port coordination, remote health checks, and master/worker orchestration.
Sequence Diagram
sequenceDiagram
participant SLURM as SLURM Cluster
participant Master as Master Node<br/>(start-with-nginx.sh)
participant Worker as Worker Node(s)<br/>(start-with-nginx.sh)
participant Nginx as Nginx LB
participant uWSGI as uWSGI Workers<br/>(TCP)
participant PortDir as Shared Port<br/>Coordination Dir
Master->>SLURM: Query node list (SLURM_NODELIST)
SLURM-->>Master: Node hostnames & count
Master->>Master: Determine master/worker role and init ports
par Multi-Node Startup
Master->>uWSGI: Start local uWSGI workers on BASE_PORT + offsets
uWSGI-->>PortDir: Write per-node port report
Worker->>uWSGI: Start local uWSGI workers on assigned ports
Worker-->>PortDir: Write per-node port report
Master->>Nginx: Generate & validate nginx.conf (template + collected ports)
Master->>Nginx: Start/Reload Nginx with TCP upstreams
Worker->>Nginx: Start local proxy to master (if non-master)
end
Master->>PortDir: wait_for_port_reports() and aggregate reports
PortDir-->>Master: Collected ports from all nodes
Master->>Master: Build upstreams with all worker endpoints
Master->>Worker: verify_remote_workers() (parallel health checks)
Worker-->>Master: Health responses
Master->>Master: Monitor loop: health checks, restarts, nginx status
Master->>Nginx: Monitor nginx and reload on config changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Around line 335-337: The startup cleanup removes every node's port report
files (rm -f "$PORTS_REPORT_DIR"/*_ports.txt) causing a race where slow nodes
delete other nodes' reports; change the removal to only the current node's
report (e.g., rm -f "$PORTS_REPORT_DIR/${NODE_NAME}_ports.txt" or use hostname:
"${PORTS_REPORT_DIR}/$(hostname)_ports.txt") so other nodes' files remain;
update start-with-nginx.sh where PORTS_REPORT_DIR is used and ensure
wait_for_port_reports still reads all reports from the directory.
🧹 Nitpick comments (4)
dockerfiles/sandbox/start-with-nginx.sh (4)
117-147: Workers bind to 0.0.0.0 unconditionally; consider restricting in single-node mode.
Line 126 binds uWSGI workers to 0.0.0.0:${WORKER_PORT}, which is required for multi-node cross-node communication but unnecessarily exposes workers on all interfaces in single-node mode. If the container's network isn't fully isolated, any host on the network can reach individual workers directly (bypassing nginx session affinity).
Suggested conditional bind address
+ if [ "$NODE_COUNT" -gt 1 ]; then
+     BIND_ADDR="0.0.0.0"
+ else
+     BIND_ADDR="127.0.0.1"
+ fi
+
  cat > /tmp/worker${i}_uwsgi.ini << EOF
  [uwsgi]
  module = main
  callable = app
  processes = ${UWSGI_PROCESSES}
- http-socket = 0.0.0.0:${WORKER_PORT}
+ http-socket = ${BIND_ADDR}:${WORKER_PORT}
Note: BIND_ADDR would need to be determined before start_worker_fast is called (e.g., set once after node discovery).
144-144: Shell redirections without commands (SC2188).
Lines 144, 560, 577, and 594 use bare > file to truncate files. While this works in bash, it's flagged by Shellcheck (SC2188) and can confuse readers. The idiomatic form is : > file (using the no-op builtin).
Example fix
- > /var/log/worker${i}.log
+ : > /var/log/worker${i}.log
- > "$PORTS_FILE"
+ : > "$PORTS_FILE"
- > $UPSTREAM_FILE
+ : > $UPSTREAM_FILE
Also applies to: 560-560, 577-577, 594-594
247-261: Exported temp dir + xargs {} substitution: minor robustness notes.
- export REMOTE_HEALTH_DIR (Line 245) leaks into the global environment. Since verify_remote_workers is called only once this is harmless, but local + explicit export or passing via env to xargs would be cleaner.
- In the xargs -I {} sh -c 'endpoint="{}"' pattern (Line 254–261), if an endpoint string ever contains a single quote, it would break the sh -c quoting. This can't happen with valid host:port values, but a defensive alternative is to pipe through sh -c 'endpoint="$1"; ...' _ {} using positional args.
Both are low-risk given the trusted input, mentioning for hardening only.
668-705: Monitoring loop: restarted worker uses same port but nginx upstream is static.
The monitoring loop restarts crashed workers via start_worker, which reuses the original port. This works because the restarted worker binds to the same port that nginx already has in its upstream, so no config reload is needed. This is a correct and practical design choice.
One edge consideration: if a restarted worker fails to bind (port still held by the dying process, or network blocking active), the monitoring loop will re-attempt every 10 seconds indefinitely. With network blocking, the warning is emitted (Lines 687–691). Without network blocking, the retry is silent after the first warning. Consider adding a retry counter or backoff to avoid log flooding for persistently-failing workers.
| break | ||
| fi | ||
|
|
||
| cat "$endpoints_file" | xargs -P 64 -I {} sh -c ' |
useless cat - xargs can read from file directly
| cat "$endpoints_file" | xargs -P 64 -I {} sh -c ' | |
| xargs -P 64 -I {} sh -c ' |
Kipok left a comment
please rebuild sandbox and run slurm tests. If those pass, we should be good to merge
Address PR review comments:
- Remove useless cat pipe in port report check (grep reads file directly)
- Scope rm to current node's port file only, preventing a race where a slow-starting node deletes port files already written by faster nodes
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>
…#1186) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>
4f89d40 to 5cdca71
Additional Comments (3)
This test invokes
|
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Around line 651-659: The port display is misleading because
ACTUAL_WORKER_PORTS may be non-contiguous after conflict retries; update the
echo logic in the IS_MASTER/WORKER blocks to either print the full list of
ACTUAL_WORKER_PORTS (iterating the ACTUAL_WORKER_PORTS array up to NUM_WORKERS)
or detect non-contiguity and append a "(non-contiguous ports: ...)" note;
specifically change the lines that currently emit
"${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}" and the
similar WORKER echo to instead join and print each ACTUAL_WORKER_PORTS[i] (or
print the range plus a non-contiguous warning) so operators see the real ports
when retries/offsets occurred.
- Around line 685-694: When a worker dies and NETWORK_BLOCKING_ACTIVE=1, avoid
immediately restarting it to prevent an infinite futile restart loop; modify the
monitor logic around the kill-check and restart (the block that calls
start_worker and updates WORKER_PIDS and ACTUAL_WORKER_PORTS) to either skip
restarting when NETWORK_BLOCKING_ACTIVE is set or implement a per-worker retry
cap (e.g., track restart counts in an array like WORKER_RETRIES and stop
restarting after N attempts) and log a single clear message when restarts are
suppressed; ensure you reference NETWORK_BLOCKING_ACTIVE, start_worker,
WORKER_PIDS, ACTUAL_WORKER_PORTS (and add WORKER_RETRIES if using the cap) when
making the change so the monitor stops spawning doomed processes.
- Line 144: The script uses a bare redirection of the form >
/var/log/worker${i}.log which is not POSIX-portable; replace each bare
redirection of the worker log (the occurrences that use >
/var/log/worker${i}.log) with an explicit no-op redirection using the shell
builtin (i.e., use a ":" no-op followed by the redirection) so the intent is
explicit and shellcheck SC2188 is satisfied; apply the same change for the other
matching occurrences with the worker log filenames.
🧹 Nitpick comments (2)
dockerfiles/sandbox/start-with-nginx.sh (2)
254-261: xargs -I {} sh -c '…{}' is a shell-injection anti-pattern.
Although endpoints are internally generated, interpolating {} directly inside sh -c breaks if the value ever contains quotes or shell metacharacters. Pass the value as a positional parameter instead.
Proposed fix
- cat "$endpoints_file" | xargs -P 64 -I {} sh -c '
-     endpoint="{}"
-     status_file="$REMOTE_HEALTH_DIR/$(echo "$endpoint" | tr ":" "_")"
+ cat "$endpoints_file" | xargs -P 64 -I {} sh -c '
+     endpoint="$1"
+     status_file="$REMOTE_HEALTH_DIR/$(echo "$endpoint" | tr ":" "_")"
      [ -f "$status_file" ] && exit 0
      if curl -s -f --connect-timeout 2 --max-time 5 "http://${endpoint}/health" > /dev/null 2>&1; then
          touch "$status_file"
      fi
- '
+ ' _ {}
668-679: Background stats subshell PID is not tracked by cleanup().
The monitoring subshell spawned here won't be killed by the cleanup trap. In a container this is benign (container stop kills all), but for completeness consider capturing the PID and adding it to the kill list.
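One way to act on this nitpick is to record each background PID as it is spawned and have cleanup() walk that list. The sketch below is an assumption-laden illustration: `BACKGROUND_PIDS` is a hypothetical variable, the loop body is a stand-in for the real stats reporter, and the actual cleanup() in start-with-nginx.sh does more than this.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: remember every background PID so cleanup() can stop it.
BACKGROUND_PIDS=""

cleanup() {
    local pid
    for pid in $BACKGROUND_PIDS; do
        kill "$pid" 2>/dev/null || true
    done
    # Reap the children that were just signalled.
    wait 2>/dev/null || true
}
trap cleanup EXIT
trap 'exit 143' TERM
trap 'exit 130' INT

# Stand-in for the stats loop; output is discarded to keep the sketch quiet.
(
    while :; do
        sleep 1
    done
) >/dev/null 2>&1 &
BACKGROUND_PIDS="$BACKGROUND_PIDS $!"
```

Routing TERM and INT through `exit` lets the single EXIT trap do the cleanup, which avoids the handler-overwrite problem an earlier commit in this PR fixed.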
| echo "cheaper = ${UWSGI_CHEAPER}" >> /tmp/worker${i}_uwsgi.ini | ||
| fi | ||
|
|
||
| > /var/log/worker${i}.log |
Bare redirection without a command (SC2188).
> file works in bash but is not POSIX-portable and triggers shellcheck SC2188. Use : > to make intent explicit.
Proposed fix
- > /var/log/worker${i}.log
+ : > /var/log/worker${i}.log
Same pattern on lines 560, 577, and 594.
| > /var/log/worker${i}.log | |
| : > /var/log/worker${i}.log |
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 144-144: This redirection doesn't have a command. Move to its command (or use 'true' as no-op).
(SC2188)
| if [ "$IS_MASTER" = "1" ]; then | ||
| echo "=== Sandbox ready (MASTER) ===" | ||
| echo " Nginx LB: http://localhost:$NGINX_PORT" | ||
| echo " Nodes: $NODE_COUNT | Workers/node: $NUM_WORKERS | Total: $((NODE_COUNT * NUM_WORKERS))" | ||
| echo " Local ports: ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}" | ||
| else | ||
| echo "UWSGI cheaper mode: disabled" | ||
| echo "=== Sandbox ready (WORKER) ===" | ||
| echo " Proxy: localhost:$NGINX_PORT -> $MASTER_NODE:$NGINX_PORT" | ||
| echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]})" |
Port range display is misleading after conflict retries.
After the retry algorithm applies offsets (e.g., +200), ports are no longer contiguous. Displaying ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]} implies a contiguous range, which could confuse operators. Consider listing all ports or noting "non-contiguous" when retries occurred.
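A small formatter along these lines could print a range only when the ports really are contiguous and fall back to the full list otherwise. `format_ports` is a hypothetical helper name, and the comma-separated fallback format is just one reasonable choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "first-last" when the ports form a contiguous
# ascending run, otherwise print the full comma-separated list.
format_ports() {
    if [ "$#" -eq 0 ]; then
        echo "none"
        return 0
    fi
    local first="$1" prev="$1" contiguous=1 list="$1" p
    shift
    for p in "$@"; do
        # Any gap or reordering means the range notation would mislead.
        [ "$p" -eq $((prev + 1)) ] || contiguous=0
        list="$list,$p"
        prev="$p"
    done
    if [ "$contiguous" -eq 1 ] && [ "$prev" -ne "$first" ]; then
        echo "${first}-${prev}"
    else
        echo "$list"
    fi
}
```

The status line would then read `echo "  Local ports: $(format_ports "${ACTUAL_WORKER_PORTS[@]}")"`, which also sidesteps the index arithmetic on NUM_WORKERS.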
| if ! kill -0 "$pid" 2>/dev/null; then | ||
| echo "WARNING: Worker $i (PID $pid) died - restarting..." | ||
| new_pid=$(start_worker $i) | ||
| WORKER_PIDS[$idx]=$new_pid | ||
| echo "[$_H] WARNING: Worker $i (PID $pid) died — restarting..." | ||
| if [ "$NETWORK_BLOCKING_ACTIVE" = "1" ]; then | ||
| echo "[$_H] WARNING: Network blocking (ld.so.preload) is active. The restarted" | ||
| echo "[$_H] worker may fail to bind its port because socket() is blocked for" | ||
| echo "[$_H] new processes. Remaining workers continue serving requests." | ||
| fi | ||
| result=$(start_worker $i) | ||
| WORKER_PIDS[$idx]="${result%%:*}" | ||
| ACTUAL_WORKER_PORTS[$idx]="${result##*:}" |
Futile restart loop when network blocking is active.
When NETWORK_BLOCKING_ACTIVE=1, restarted workers cannot bind their socket (as documented in lines 627-629 and the warning itself). Yet the restart still proceeds at line 692, the new process dies, and the next monitoring iteration restarts it again — an infinite loop of doomed spawns generating log noise every 10 seconds.
Consider skipping the restart (or capping per-worker retries) when network blocking is active:
Proposed fix
if ! kill -0 "$pid" 2>/dev/null; then
echo "[$_H] WARNING: Worker $i (PID $pid) died — restarting..."
if [ "$NETWORK_BLOCKING_ACTIVE" = "1" ]; then
echo "[$_H] WARNING: Network blocking (ld.so.preload) is active. The restarted"
echo "[$_H] worker may fail to bind its port because socket() is blocked for"
- echo "[$_H] new processes. Remaining workers continue serving requests."
+ echo "[$_H] new processes. Skipping restart — remaining workers continue serving."
+ continue
fi
result=$(start_worker $i)
WORKER_PIDS[$idx]="${result%%:*}"
ACTUAL_WORKER_PORTS[$idx]="${result##*:}"
| if ! kill -0 "$pid" 2>/dev/null; then | |
| echo "WARNING: Worker $i (PID $pid) died - restarting..." | |
| new_pid=$(start_worker $i) | |
| WORKER_PIDS[$idx]=$new_pid | |
| echo "[$_H] WARNING: Worker $i (PID $pid) died — restarting..." | |
| if [ "$NETWORK_BLOCKING_ACTIVE" = "1" ]; then | |
| echo "[$_H] WARNING: Network blocking (ld.so.preload) is active. The restarted" | |
| echo "[$_H] worker may fail to bind its port because socket() is blocked for" | |
| echo "[$_H] new processes. Remaining workers continue serving requests." | |
| fi | |
| result=$(start_worker $i) | |
| WORKER_PIDS[$idx]="${result%%:*}" | |
| ACTUAL_WORKER_PORTS[$idx]="${result##*:}" | |
| if ! kill -0 "$pid" 2>/dev/null; then | |
| echo "[$_H] WARNING: Worker $i (PID $pid) died — restarting..." | |
| if [ "$NETWORK_BLOCKING_ACTIVE" = "1" ]; then | |
| echo "[$_H] WARNING: Network blocking (ld.so.preload) is active. The restarted" | |
| echo "[$_H] worker may fail to bind its port because socket() is blocked for" | |
| echo "[$_H] new processes. Skipping restart — remaining workers continue serving." | |
| continue | |
| fi | |
| result=$(start_worker $i) | |
| WORKER_PIDS[$idx]="${result%%:*}" | |
| ACTUAL_WORKER_PORTS[$idx]="${result##*:}" |
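The alternative the review mentions, a per-worker retry cap, could be sketched as below. `WORKER_RETRIES` follows the name the reviewer suggested; `MAX_WORKER_RETRIES`, its value of 3, and `should_restart_worker` are assumptions made for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a per-worker restart cap.
MAX_WORKER_RETRIES=3      # assumed limit, not taken from the PR
WORKER_RETRIES=()         # restart attempts, indexed like WORKER_PIDS

# Returns 0 if worker $1 may be restarted, 1 once the cap is reached.
# The suppression message is logged only once per worker.
should_restart_worker() {
    local idx="$1"
    local tries="${WORKER_RETRIES[$idx]:-0}"
    if [ "$tries" -ge "$MAX_WORKER_RETRIES" ]; then
        if [ "$tries" -eq "$MAX_WORKER_RETRIES" ]; then
            echo "WARNING: worker $idx failed $tries restarts; giving up" >&2
            # Bump past the cap so the message is not repeated.
            WORKER_RETRIES[$idx]=$((tries + 1))
        fi
        return 1
    fi
    WORKER_RETRIES[$idx]=$((tries + 1))
    return 0
}
```

In the monitoring loop this would wrap the restart, for example `if should_restart_worker "$idx"; then result=$(start_worker $i); fi`. The function has to be called directly rather than inside `$(...)`, since a subshell would discard the counter update; resetting the counter once a worker has stayed healthy for a while is left out of the sketch.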
Signed-off-by: George Armstrong <georgea@nvidia.com>
415e024 to 44863ff
| PORTS_REPORT_DIR="/tmp/sandbox_ports_$$" | ||
| fi | ||
| mkdir -p "$PORTS_REPORT_DIR" | ||
| rm -f "$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt" 2>/dev/null || true |
Shared port file clobber
rm -f "$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt" runs on every node at startup. In a multi-node SLURM job, nodes can start at different times; a late-starting node can delete an already-written port report from an early node, causing the master to block until timeout in wait_for_port_reports(). Remove this unconditional delete or scope it so only the current node can clear its own file in a race-safe way (e.g., write to a temp file and mv atomically).
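The temp-file-and-mv approach suggested here might look like the following sketch. `write_port_report` is a hypothetical name, and the file layout (one port per line, then the `PORT_REPORT_COMPLETE` marker) is assumed rather than confirmed by the PR; how strictly rename visibility holds across nodes also depends on the shared filesystem.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: publish a node's port report atomically.
# Arguments: report dir, short node name, then at least one worker port.
write_port_report() {
    local dir="$1" node_short="$2"
    shift 2
    local final="$dir/${node_short}_ports.txt"
    local tmp
    # Create the temp file in the same directory so mv is a rename
    # within one filesystem rather than a copy.
    tmp=$(mktemp "$dir/.${node_short}_ports.XXXXXX") || return 1
    {
        printf '%s\n' "$@"
        echo "PORT_REPORT_COMPLETE"
    } > "$tmp"
    # Readers see either the old file or the complete new one.
    mv -f "$tmp" "$final"
}
```

Because the rename replaces any stale report in one step, the startup `rm -f` becomes unnecessary, and the temp name does not match the `*_ports.txt` pattern that readers look for.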
| echo "=== Sandbox ready (MASTER) ===" | ||
| echo " Nginx LB: http://localhost:$NGINX_PORT" | ||
| echo " Nodes: $NODE_COUNT | Workers/node: $NUM_WORKERS | Total: $((NODE_COUNT * NUM_WORKERS))" | ||
| echo " Local ports: ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}" |
NUM_WORKERS=0 breaks
This status line indexes ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]. If NUM_WORKERS is 0 (explicitly or via nproc --all), this becomes [-1] and errors in bash. Guard earlier (validate NUM_WORKERS > 0 and exit with a clear error) before any array indexing/loops that assume at least 1 worker.
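An early guard of the kind requested could be as small as the sketch below. `validate_num_workers` is a hypothetical name and the error wording is illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical guard: accept only a positive integer worker count.
validate_num_workers() {
    local n="$1"
    case "$n" in
        ''|*[!0-9]*)
            # Empty, negative, or non-numeric input.
            echo "ERROR: NUM_WORKERS must be a positive integer, got '$n'" >&2
            return 1
            ;;
    esac
    if [ "$n" -le 0 ]; then
        echo "ERROR: NUM_WORKERS must be at least 1, got '$n'" >&2
        return 1
    fi
}
```

Called once near the top of the script, for example `validate_num_workers "$NUM_WORKERS" || exit 1`, it would stop the run before any loop or array index assumes at least one worker.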
| if [ "$ALL_NODES" = "127.0.0.1" ] || [ "$CURRENT_NODE_SHORT" = "$MASTER_NODE_SHORT" ]; then | ||
| IS_MASTER=1 | ||
| echo "[$_H] Role: MASTER | Nodes: $NODE_COUNT | Master: $MASTER_NODE" | ||
| else | ||
| IS_MASTER=0 | ||
| echo "[$_H] Role: WORKER | Master: $MASTER_NODE" |
Master detection mismatch
In SLURM mode, MASTER_NODE is taken from the expanded nodelist (e.g. node001), but the current node is detected via hostname and compared as CURRENT_NODE_SHORT == MASTER_NODE_SHORT. On many clusters hostname returns a different alias/FQDN than scontrol/nodelist uses (e.g. node001.cluster vs node001 or vice versa), which can cause multiple nodes to think they're master (or none). Prefer using SLURM-provided identity (e.g. SLURMD_NODENAME or scontrol show hostname $SLURM_NODEID output) for the local node name, and compare using the same naming scheme used to build ALL_NODES.
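A sketch of the naming fix suggested above: take the local name from `SLURMD_NODENAME` when SLURM sets it, fall back to `hostname`, and normalise both sides to the short form before comparing. The helper names `short_name`, `current_node`, and `is_master` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for consistent node naming.
short_name() {
    printf '%s\n' "${1%%.*}"      # node001.cluster -> node001
}

# Prefer SLURM's own name for this node; fall back to hostname.
current_node() {
    short_name "${SLURMD_NODENAME:-$(hostname)}"
}

# Succeeds when this node is the master (first entry of the node list).
is_master() {
    local master_node="$1"
    # Single-node mode uses the loopback address and is always master.
    [ "$master_node" = "127.0.0.1" ] && return 0
    [ "$(current_node)" = "$(short_name "$master_node")" ]
}
```

Usage would be along the lines of `if is_master "$MASTER_NODE"; then IS_MASTER=1; else IS_MASTER=0; fi`. The loopback check comes first because stripping at the first dot would otherwise reduce 127.0.0.1 to 127.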
| for endpoint in $(read_port_file "$node" "$port_file"); do | ||
| echo " server ${endpoint} max_fails=3 fail_timeout=30s;" >> $UPSTREAM_FILE | ||
| echo "$endpoint" >> "$ENDPOINTS_FILE" | ||
| done |
Port file parsing unsafe
for endpoint in $(read_port_file ...) word-splits on whitespace. If a port file is empty/partial, read_port_file can emit blank lines and this loop will silently skip/merge tokens. Also, any unexpected whitespace will corrupt endpoints. Use a while IFS= read -r endpoint; do ...; done < <(read_port_file ...) pattern to preserve lines exactly and handle empty output deterministically.
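The line-by-line pattern recommended here can be sketched as a small filter. `emit_upstream_servers` is a hypothetical name, and the validation rules (skip blank lines, lines containing whitespace, and anything whose port part is not numeric) are assumptions about what a well-formed `host:port` entry looks like.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: turn "host:port" lines into nginx upstream entries,
# reading line by line and skipping anything that is not host:port.
# Reads endpoints on stdin, writes server lines on stdout.
emit_upstream_servers() {
    local endpoint
    # The extra test keeps a final line that lacks a trailing newline.
    while IFS= read -r endpoint || [ -n "$endpoint" ]; do
        case "$endpoint" in
            ''|*[[:space:]]*) continue ;;     # blank or contains whitespace
            *:*[!0-9]*|*:) continue ;;        # port part missing or not numeric
            *:*) echo "    server ${endpoint} max_fails=3 fail_timeout=30s;" ;;
        esac
    done
}
```

In the script it could be wired up as `emit_upstream_servers < <(read_port_file "$node" "$port_file") >> "$UPSTREAM_FILE"`, so a partial or empty port file produces no upstream entries instead of malformed ones.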
@Kipok the gpt-oss test and super_49b tests pass (minus a couple unrelated RULER fluctuations) |
commit a5da597 Author: Igor Gitman <igitman@nvidia.com> Date: Fri Mar 6 12:13:36 2026 -0800 Revert "Eval kit support (#1239)" (#1294) Signed-off-by: Igor Gitman <igitman@nvidia.com>
commit b237e33 Author: George <37293288+Jorjeous@users.noreply.github.com> Date: Fri Mar 6 20:25:37 2026 +0400 Eval kit support (#1239) Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
commit dc28bbf Author: George Armstrong <georgea@nvidia.com> Date: Thu Mar 5 10:17:44 2026 -0800 Python direct tool calling without MCP (#1286) Signed-off-by: George Armstrong <georgea@nvidia.com>
commit 12454dd Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Wed Mar 4 13:06:21 2026 -0800 Allow het servers for nemo-rl jobs (#1223) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
commit 8884a68 Author: Prasoon Varshney <prasoon1995@gmail.com> Date: Wed Mar 4 10:24:02 2026 -0800 Support source_lang param for translation recipe (#1290) Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
commit 4618b19 Author: Meriem B. <113170426+ka00ri@users.noreply.github.com> Date: Wed Mar 4 18:59:28 2026 +0100 Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285) Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
commit 5ac8609 Author: Talor Abramovich <talor19@gmail.com> Date: Wed Mar 4 02:30:06 2026 +0200 Add SPEED-Bench (within repo) (#1279) Signed-off-by: Talor Abramovich <talora@nvidia.com> Signed-off-by: talora <talora@nvidia.com> Signed-off-by: Talor Abramovich <talor19@gmail.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>
commit c31eec5 Author: George Armstrong <georgea@nvidia.com> Date: Tue Mar 3 12:18:15 2026 -0800 Fix os.getlogin() crash in ns setup (#1289) Signed-off-by: George Armstrong <georgea@nvidia.com>
commit c228e66 Author: George Armstrong <georgea@nvidia.com> Date: Tue Mar 3 11:04:54 2026 -0800 Fix streaming TypeError when delta.content is None (#1267) (#1288) Signed-off-by: George Armstrong <georgea@nvidia.com>
commit aa47923 Author: Matvei Novikov <mnovikov@nvidia.com> Date: Mon Mar 2 16:28:41 2026 -0800 Add LibTrace recipe for generating domain-specific reasoning data (#1224) Signed-off-by: jubick1337 <mnovikov@nvidia.com> Signed-off-by: mnovikov <mnovikov@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
commit 313cad7 Author: Stephen Ge <stepheng@nvidia.com> Date: Mon Mar 2 18:28:49 2026 -0500 fix: clean parse-failure retries in prover (#1284) Signed-off-by: Stephen Ge <stepheng@nvidia.com>
commit 813cfa3 Author: George Armstrong <georgea@nvidia.com> Date: Mon Mar 2 15:10:08 2026 -0800 tst: rollback inference-api to integrate (#1287) Signed-off-by: George Armstrong <georgea@nvidia.com>
commit 31735f9 Author: Valentin Mendelev <vmendelev@nvidia.com> Date: Mon Mar 2 23:11:25 2026 +0100 Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250) Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
commit d4ef8c0 Author: George <37293288+Jorjeous@users.noreply.github.com> Date: Fri Feb 27 23:58:54 2026 +0400 Update promt_config to working with openai format + inline setup (#1210) Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
commit e879cbc Author: George Armstrong <georgea@nvidia.com> Date: Fri Feb 27 10:41:23 2026 -0800 Update noc tutorial (#1282) Signed-off-by: George Armstrong <georgea@nvidia.com>
commit f6e3505 Author: George Armstrong <georgea@nvidia.com> Date: Fri Feb 27 10:17:33 2026 -0800 Add noc reasoning tutorial (#1278) Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com> Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com> Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com> Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>
commit fc2072a Author: Jiacheng Xu <jcxu@utexas.edu> Date: Fri Feb 27 10:10:25 2026 -0800 CritPt generation add prompt_format=None (#1280) Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
commit c8abe5d Author: Igor Gitman <igitman@nvidia.com> Date: Fri Feb 27 09:31:26 2026 -0800 New slurm customization parameters (account, containers) (#1209) Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
commit 2b38cce Author: George Armstrong <georgea@nvidia.com> Date: Wed Feb 25 17:59:52 2026 -0800 Add nemo-skills-core subpackage for lightweight installs (#1229) Signed-off-by: George Armstrong <georgea@nvidia.com>
commit 9fa8e83 Author: Dheeraj Peri <peri.dheeraj@gmail.com> Date: Wed Feb 25 12:56:35 2026 -0800 feat: add custom judge type support for external repo integration (#1274) Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Dheeraj Peri <dperi@nvidia.com> Signed-off-by: suriya <sgunasekar@nvidia.com> Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Minho Ryu <ryumin93@gmail.com> Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com> Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Jiacheng Xu <jcxu@utexas.edu> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
commit 8a32b13 Author: Igor Gitman <igitman@nvidia.com> Date: Tue Feb 24 15:24:42 2026 -0800 Exclude numb3rs form test_eval.py (#1275) Signed-off-by: Igor Gitman <igitman@nvidia.com>
commit 6da2219 Author: George <37293288+Jorjeous@users.noreply.github.com> Date: Mon Feb 23 18:37:46 2026 +0400 Numb3rs ds addition (#1174) Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
commit ad034b5 Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com> Date: Sun Feb 22 11:55:24 2026 -0800 Add DSBench-DA evaluation (#1254) Squash merge of changes during code-review. Signed-off-by: suriya <sgunasekar@nvidia.com>
commit 7593ab3 Author: Jiacheng Xu <jcxu@utexas.edu> Date: Fri Feb 20 16:42:01 2026 -0800 Add CritPt benchmark (#1200) Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
commit 58c31b2 Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com> Date: Fri Feb 20 16:19:22 2026 -0800 Fix no_answer metric overcounting in _compute_pass_at_k (#1245) Signed-off-by: suriya <sgunasekar@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
commit 1f1a2e7 Author: Igor Gitman <igitman@nvidia.com> Date: Fri Feb 20 15:58:40 2026 -0800 Fix incorrect prompt tokens count due to HF api update (#1264) Signed-off-by: Igor Gitman <igitman@nvidia.com>
commit 8ebc6f5 Author: Igor Gitman <igitman@nvidia.com> Date: Fri Feb 20 09:05:33 2026 -0800 Remove deprecated dataset group (#1263) Signed-off-by: Igor Gitman <igitman@nvidia.com>
commit ea4177f Author: Yongqiang Wang <yongqiang.seagull@gmail.com> Date: Thu Feb 19 19:57:25 2026 -0500 fix deps (#1258)
commit 60905a7 Author: Minho Ryu <ryumin93@gmail.com> Date: Fri Feb 20 09:39:39 2026 +0900 Add aime26 (#1256) Signed-off-by: bzantium <ryumin93@gmail.com>
commit b28afc5 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 19 16:18:25 2026 -0800 Rename custom -> external benchmarks (#1262) Signed-off-by: Igor Gitman <igitman@nvidia.com>
commit 6cc9c45 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 19 16:10:33 2026 -0800 Add reference to internal benchmarks repo (#1261) Signed-off-by: Igor Gitman <igitman@nvidia.com>
commit 5202af6 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 19 16:08:05 2026 -0800 Remove incorrect presence-penalty setting (#1259) Signed-off-by: Igor Gitman <igitman@nvidia.com>
commit 144c70b Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 19 15:26:33 2026 -0800 Adding an option to store benchmarks in external repo (#1240) Signed-off-by:
Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> commit 10e6e39 Author: George <37293288+Jorjeous@users.noreply.github.com> Date: Thu Feb 19 19:57:21 2026 +0400 update vllm miltimodal for api calls convenience (#1213) Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com> Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com> commit 1ba4219 Author: Nick Ludwig <nliudvig@nvidia.com> Date: Wed Feb 18 03:28:23 2026 +0400 Fix --server_container not being applied to dependent jobs (#1244) Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit 9517614 Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Mon Feb 16 11:13:24 2026 -0800 Support mini-swe-agent as agent harness (#1212) Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: i-vainn <imoshkov@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Stephen Ge <stepheng@nvidia.com> Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com> Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com> Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com> Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Ivan <imoshkov@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Charlie Truong <chtruong@nvidia.com> Co-authored-by: Nick Ludwig <nliudvig@nvidia.com> Co-authored-by: Wojciech 
Prazuch <wojciechprazuch3@gmail.com> Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Co-authored-by: Minho Ryu <ryumin93@gmail.com> Co-authored-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: Jiacheng Xu <jcxu@utexas.edu> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com> Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com> Co-authored-by: Wei Du <wedu@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Sean Naren <snarenthiran@nvidia.com> Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com> Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> commit a3d44dc Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com> Date: Fri Feb 13 22:32:15 2026 -0800 Add --installation_command support to prepare_data (#1243) Signed-off-by: suriya <sgunasekar@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> commit e80d524 Author: George Armstrong <georgea@nvidia.com> Date: Thu Feb 12 17:26:00 2026 -0800 Fix CI disk space for Docker image builds (#1241) Signed-off-by: George Armstrong <georgea@nvidia.com> commit d22236c Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Wed Feb 11 17:55:00 2026 -0800 Fix answerbench prompt parsing (#1235) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 2401628 Author: George Armstrong <georgea@nvidia.com> Date: Wed Feb 11 14:56:43 2026 -0800 feat: add lockfiles for reproducible sandbox builds (#1233) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 5a0a84d Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Wed Feb 11 13:30:03 2026 -0800 removing datasets version restriction for LCB eval (#1230) 
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> commit ef0a890 Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Date: Wed Feb 11 12:03:16 2026 +0400 Gnalbandyan/add physics (#1214) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> commit bd9d30c Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Tue Feb 10 15:13:27 2026 -0800 LCB generic prompting (#1215) Signed-off-by: wasiahmad <wasiahmad@ucla.edu> commit 7d6c49a Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Sat Feb 7 08:45:46 2026 -0800 Add support for different variations of nemo-rl (#1220) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit b19ba96 Author: George Armstrong <georgea@nvidia.com> Date: Fri Feb 6 21:40:56 2026 -0800 Add multi-node sandbox support for SLURM clusters (#1218) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 8950bb0 Author: anowaczynski-nvidia <anowaczynski@nvidia.com> Date: Sat Feb 7 01:38:00 2026 +0100 support structured outputs in hle judge for optional AA compatibility (#1186) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit b84f7a2 Author: Igor Gitman <igitman@nvidia.com> Date: Fri Feb 6 14:51:02 2026 -0800 A small update on running tests docs (#1219) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 8e838e1 Author: George Armstrong <georgea@nvidia.com> Date: Thu Feb 5 18:01:35 2026 -0800 feat: add flag to disable sandbox replay (#1217) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 5fd9085 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 5 15:57:01 2026 -0800 Add an option to limit number of tool calls (#1216) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit d820200 Author: Igor Gitman <igitman@nvidia.com> Date: Tue Feb 3 10:43:55 2026 -0800 Add arena-hard v2 (#1205) Signed-off-by: bzantium <ryumin93@gmail.com> 
Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: bzantium <ryumin93@gmail.com> commit a30920e Author: Igor Gitman <igitman@nvidia.com> Date: Mon Feb 2 10:53:55 2026 -0800 Fix mkdocs warnings (#1204) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 19d7788 Author: Ivan <imoshkov@nvidia.com> Date: Mon Feb 2 23:25:13 2026 +0500 Fix infinite wait in sandbox.wait_for_sandbox (#1206) Signed-off-by: i-vainn <imoshkov@nvidia.com> commit 3e65fbf Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Fri Jan 30 19:38:38 2026 -0800 Improve tts (#1203) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 250c862 Author: Nick Ludwig <nliudvig@nvidia.com> Date: Fri Jan 30 22:12:29 2026 +0400 SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202) Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> commit 7ded756 Author: Ivan <imoshkov@nvidia.com> Date: Fri Jan 30 09:57:41 2026 +0500 Add proper token counting to code execution model (#1184) Signed-off-by: i-vainn <imoshkov@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit b986304 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Jan 29 17:57:07 2026 -0800 Upgrade containers (#1198) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 3b44f02 Author: Dan Lord <blahblahasdf@gmail.com> Date: Thu Jan 29 16:40:47 2026 -0800 Fix incorrect string format (#1199) Signed-off-by: dlord <dlord@nvidia.com> commit c4854b8 Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Thu Jan 29 13:43:36 2026 -0800 Update nemo-rl to latest (#1087) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Summary
SLURM_JOB_NODELIST), expands compressed nodelists, and coordinates port assignments across nodes via shared filesystem
Changes
dockerfiles/sandbox/start-with-nginx.sh (major rewrite)
- node[001-016],gpu[01-02],cpu[01-03], etc.)
- /nemo_run/ or /workspace/)
- SANDBOX_FORCE_SINGLE_NODE override for debugging
dockerfiles/sandbox/nginx-worker-proxy.conf.template (new file)
dockerfiles/Dockerfile.sandbox (minor)
- COPY start-with-nginx.sh after dependency layers for better Docker cache
- CMD ["/start-with-nginx.sh"]
- nginx-worker-proxy.conf.template
dockerfiles/sandbox/nginx.conf.template (minor)
Test plan
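One way to sanity-check the nodelist expansion listed under Changes (patterns such as node[001-016]) is a small standalone sketch like the one below. It is not the code from start-with-nginx.sh: the function names `split_top_level` and `expand_nodelist` are hypothetical, and the sketch assumes at most one bracket group per entry. On a real cluster, `scontrol show hostnames` is SLURM's own tool for this and should be treated as the reference.

```python
import re


def split_top_level(nodelist: str) -> list[str]:
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, []
    for ch in nodelist:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_nodelist(nodelist: str) -> list[str]:
    """Expand e.g. 'node[001-003,007],gpu01' into individual hostnames.

    Assumption of this sketch: each entry has at most one bracket group.
    """
    hosts = []
    for part in split_top_level(nodelist):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if match is None:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            start, _, end = item.partition("-")
            end = end or start  # a single index such as '007'
            width = len(start)  # preserve zero padding, e.g. 001
            for n in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    for host in expand_nodelist("node[001-003],gpu[01-02],cpu01"):
        print(host)
```

Splitting on top-level commas first matters because a compressed nodelist uses commas both between entries and inside brackets, so a plain `str.split(",")` would break entries like node[001-003,007] apart.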
Summary by CodeRabbit
New Features
Chores