Add multi-node sandbox support for SLURM clusters by gwarmstrong · Pull Request #1218 · NVIDIA-NeMo/Skills

gwarmstrong · 2026-02-06T20:17:39Z

Summary

Adds multi-node sandbox support, enabling code execution workers to be distributed across multiple SLURM nodes with nginx load balancing and session affinity
Auto-detects SLURM multi-node environments (via SLURM_JOB_NODELIST), expands compressed nodelists, and coordinates port assignments across nodes via shared filesystem
Backward-compatible: single-node mode works identically when SLURM variables are absent

Changes

dockerfiles/sandbox/start-with-nginx.sh (major rewrite)

SLURM nodelist expansion (supports node[001-016], gpu[01-02],cpu[01-03], etc.)
TCP socket workers (replacing unix sockets) with configurable base port
Cross-node port coordination via shared filesystem (/nemo_run/ or /workspace/)
Master/worker node roles: master runs nginx LB with cross-node upstream, workers proxy to master
Port conflict retry algorithm for TCP port binding failures
Parallel health checks for both local and remote workers
Network blocking support for all modes with transparent diagnostics
Comprehensive environment variable documentation header
SANDBOX_FORCE_SINGLE_NODE override for debugging
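As an illustration of the nodelist expansion listed above, the following is a minimal sketch of how compressed SLURM nodelists such as node[001-016] or gpu[01-02],cpu[01-03] can be expanded. It is not the code from this PR (the commit messages indicate the script hands the nodelist to Python via sys.argv); the function name expand_nodelist is hypothetical, and only single-bracket groups are handled. Where scontrol is available inside the container, `scontrol show hostnames "$SLURM_JOB_NODELIST"` is the standard alternative.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): expand a compressed SLURM nodelist
# into one hostname per line. Handles comma-separated groups and a single
# [..] range list per group, preserving zero padding.
expand_nodelist() {
    printf '%s\n' "$1" | awk '
    {
        depth = 0; tok = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "[") depth++
            if (c == "]") depth--
            # Only commas outside brackets separate node groups.
            if (c == "," && depth == 0) { expand(tok); tok = "" }
            else tok = tok c
        }
        if (tok != "") expand(tok)
    }
    function expand(t,    lb, rb, prefix, suffix, body, parts, r, k, m, w, j, fmt) {
        lb = index(t, "[")
        if (lb == 0) { print t; return }
        rb = index(t, "]")
        prefix = substr(t, 1, lb - 1)
        suffix = substr(t, rb + 1)
        body = substr(t, lb + 1, rb - lb - 1)
        m = split(body, parts, ",")
        for (k = 1; k <= m; k++) {
            if (split(parts[k], r, "-") == 2) {
                w = length(r[1])              # zero-padding width, e.g. 3 for 001
                fmt = "%s%0" w "d%s\n"
                for (j = r[1] + 0; j <= r[2] + 0; j++)
                    printf fmt, prefix, j, suffix
            } else {
                print prefix parts[k] suffix
            }
        }
    }'
}

# Example: expand_nodelist 'gpu[01-02],cpu[01-03]'
```

The first expanded hostname can then serve as the master node, which matches the master/worker split described in the list above.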

dockerfiles/sandbox/nginx-worker-proxy.conf.template (new file)

Extracted nginx config template for worker nodes that proxy to master's LB

dockerfiles/Dockerfile.sandbox (minor)

Move COPY start-with-nginx.sh after dependency layers for better Docker cache
Use exec-form CMD ["/start-with-nginx.sh"]
Add COPY for new nginx-worker-proxy.conf.template

dockerfiles/sandbox/nginx.conf.template (minor)

Updated comments to document both single-node and multi-node upstream modes

Test plan

Validated on DFW 16-node run (128 workers/node): 9594 successful requests, 0 5xx errors
Validated on DFW 2-node run
Single-node backward compatibility verified (no SLURM vars → same behavior as before)
CI tests pass

Summary by CodeRabbit

New Features
- Multi-node sandboxing with coordinated worker discovery, session-aware routing, health/readiness checks, and session-affinity for stable routing.
- TCP-based worker networking with proxying and an nginx status endpoint for monitoring.
Chores
- Optimized container startup ordering and command form for improved caching and startup behavior.
- Added new proxy configuration template and introduced a configuration flag to span group nodes.

Enable the sandbox (code execution environment) to scale across multiple SLURM nodes for large-scale RL training jobs.

Key changes:
- Auto-detect SLURM multi-node environments and expand nodelists
- Allocate unique TCP ports per worker with parallel startup and automatic port conflict retry
- Coordinate port reporting between nodes via shared filesystem
- Configure nginx upstream to load-balance across all nodes' workers
- Worker nodes run local nginx proxy forwarding to master's LB
- Parallel health checks for faster startup with many workers
- Backward-compatible: single-node mode auto-detected when SLURM vars are absent

Validated on DFW with 16-node (128 workers/node) runs: 9594 successful requests, 0 errors.

Signed-off-by: George Armstrong <georgea@nvidia.com>

- Pass SLURM nodelist via sys.argv instead of shell interpolation into Python triple-quoted string (prevents injection)
- Fix trap overwrite: fold temp dir cleanup into cleanup() instead of a separate EXIT trap that overwrote SIGTERM/SIGINT handler
- Remove unused is_port_free() and find_free_port() dead code
- Move network blocking (ld.so.preload) outside master-only branch so it applies on all nodes (worker nodes also run user code)
- Clean stale port files on startup to handle SANDBOX_PORTS_DIR reuse

Signed-off-by: George Armstrong <georgea@nvidia.com>

Document all required and optional environment variables grouped by category: worker configuration, multi-node/SLURM, and security. Signed-off-by: George Armstrong <georgea@nvidia.com>

Address PR review comments:

1. Remove aggressive Lustre cache invalidation (touch/rm/ls/sync dance). The cat-based file read already forces Lustre to fetch content; the extra invalidation was unnecessary overhead.
2. Extract utility functions for readability:
   - generate_nginx_config(): template substitution + nginx -t
   - read_port_file(): parse port files, emit node:port lines
   - wait_for_port_reports(): poll shared storage for all nodes
   - verify_remote_workers(): parallel health checks via xargs

   This makes the nginx setup section a clear linear flow: wait_for_port_reports → build upstream → generate_nginx_config → verify_remote_workers
3. Add $(hostname) to load monitor stats output.
4. Skip network blocking in multi-node mode. ld.so.preload intercepts socket() in all new exec'd processes; if the monitoring loop restarts a crashed worker, the new uWSGI process would be unable to bind its listening socket. Document this limitation.
5. Add SANDBOX_FORCE_SINGLE_NODE env var to override multi-node detection. Useful for debugging or when multi-node sandbox causes issues.

Also: trim verbose debug logging, reduce file from ~1000 to ~710 lines.

Signed-off-by: George Armstrong <georgea@nvidia.com>

1. Fix misleading log output: only show SLURM vars when they're set, and emit a clear diagnostic when SLURM_JOB_NODELIST expansion fails instead of a silent fallback.
2. Restore uWSGI cheaper validation warnings that were lost in the restructuring. Invalid values are still auto-corrected but now log what happened.
3. Only write port files to shared storage in multi-node mode; the port coordination protocol is unnecessary overhead in single-node.
4. Extract worker proxy nginx config to a separate template file (nginx-worker-proxy.conf.template) instead of an inline heredoc.
5. Enable network blocking on all modes (not just single-node). Add NETWORK_BLOCKING_ACTIVE flag so the monitoring loop emits a clear diagnostic when a worker restart fails due to ld.so.preload blocking socket() in new processes.

Signed-off-by: George Armstrong <georgea@nvidia.com>

Minimize diff by matching origin/main's exact wording for uWSGI validation warnings and using the original if/then/fi form in the cleanup function. Signed-off-by: George Armstrong <georgea@nvidia.com>

greptile-apps

_{1 file reviewed, 4 comments}


greptile-apps · 2026-02-06T20:19:48Z

dockerfiles/sandbox/start-with-nginx.sh

+        for node in $ALL_NODES; do
+            local node_short="${node%%.*}"
+            local port_file="$PORTS_REPORT_DIR/${node_short}_ports.txt"
+            if [ -f "$port_file" ] && cat "$port_file" 2>/dev/null | grep -q "PORT_REPORT_COMPLETE"; then


useless cat with grep -q

Suggested change

- if [ -f "$port_file" ] && cat "$port_file" 2>/dev/null | grep -q "PORT_REPORT_COMPLETE"; then
+ if [ -f "$port_file" ] && grep -q "PORT_REPORT_COMPLETE" "$port_file" 2>/dev/null; then


greptile-apps · 2026-02-06T20:19:49Z

dockerfiles/sandbox/start-with-nginx.sh

+    PORTS_REPORT_DIR="/tmp/sandbox_ports_$$"
+fi
+mkdir -p "$PORTS_REPORT_DIR"
+rm -f "$PORTS_REPORT_DIR"/*_ports.txt 2>/dev/null || true


race condition: all nodes execute rm -f on shared filesystem simultaneously

If multiple nodes start at nearly the same time, they can race to delete each other's port files. A node that completes writing its ports could have its file deleted by another node still executing this line.

Suggested change

- rm -f "$PORTS_REPORT_DIR"/*_ports.txt 2>/dev/null || true
+ rm -f "$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt" 2>/dev/null || true

greptile-apps · 2026-02-06T20:19:50Z

dockerfiles/sandbox/start-with-nginx.sh

+    echo "=== Sandbox ready (MASTER) ==="
+    echo "  Nginx LB: http://localhost:$NGINX_PORT"
+    echo "  Nodes: $NODE_COUNT | Workers/node: $NUM_WORKERS | Total: $((NODE_COUNT * NUM_WORKERS))"
+    echo "  Local ports: ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}"


potential array index out of bounds if NUM_WORKERS is 0

If NUM_WORKERS=0 (either explicitly set or through nproc --all returning 0), this line would access ACTUAL_WORKER_PORTS[-1] which is invalid in bash.

Check NUM_WORKERS > 0 before array access or earlier in the script.

greptile-apps · 2026-02-06T20:19:51Z

dockerfiles/sandbox/start-with-nginx.sh

-    echo "UWSGI cheaper mode: disabled"
+    echo "=== Sandbox ready (WORKER) ==="
+    echo "  Proxy: localhost:$NGINX_PORT -> $MASTER_NODE:$NGINX_PORT"
+    echo "  Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]})"


same array bounds issue as line 655

Suggested change

- echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]})"
+ echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]:-none}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS>0?NUM_WORKERS-1:0))]:-none})"

coderabbitai · 2026-02-06T20:24:10Z

Walkthrough

Reorganizes the sandbox Dockerfile and adds templates, and substantially rewrites the startup script to support TCP-based uWSGI workers for single-node and SLURM-detected multi-node deployments with dynamic Nginx config generation, cross-node port coordination, remote health checks, and master/worker orchestration.

Changes

Cohort / File(s)	Summary
Dockerfile `dockerfiles/Dockerfile.sandbox`	Reordered COPY operations for cache optimization, added copy of `nginx-worker-proxy.conf.template`, and changed `CMD` from shell form to exec form (`["/start-with-nginx.sh"]`).
Nginx templates `dockerfiles/sandbox/nginx.conf.template`, `dockerfiles/sandbox/nginx-worker-proxy.conf.template`	Expanded upstream documentation and session-affinity notes in `nginx.conf.template`; added new `nginx-worker-proxy.conf.template` that proxies to a master load balancer with consistent-hash affinity, proxy headers, extended timeouts, disabled buffering, logging, and /nginx-status stub_status.
Startup script `dockerfiles/sandbox/start-with-nginx.sh`	Full rewrite to support multi-node deployments: SLURM nodelist expansion, master/worker role detection, per-worker TCP ports with retry/offset, shared port reporting directory, wait/verify aggregation of remote ports, dynamic nginx config generation & validation, parallel remote health checks, worker spawn/restart loops, optional network-blocking, and proxy behavior for non-master workers.
Python config `nemo_skills/pipeline/utils/scripts.py`	Added new public boolean field `SandboxScript.span_group_nodes: bool = True` to indicate sandbox spans group nodes (config flag; no behavior changes within file).

Sequence Diagram

sequenceDiagram
    participant SLURM as SLURM Cluster
    participant Master as Master Node<br/>(start-with-nginx.sh)
    participant Worker as Worker Node(s)<br/>(start-with-nginx.sh)
    participant Nginx as Nginx LB
    participant uWSGI as uWSGI Workers<br/>(TCP)
    participant PortDir as Shared Port<br/>Coordination Dir

    Master->>SLURM: Query node list (SLURM_JOB_NODELIST)
    SLURM-->>Master: Node hostnames & count
    Master->>Master: Determine master/worker role and init ports

    par Multi-Node Startup
        Master->>uWSGI: Start local uWSGI workers on BASE_PORT + offsets
        uWSGI-->>PortDir: Write per-node port report
        Worker->>uWSGI: Start local uWSGI workers on assigned ports
        Worker-->>PortDir: Write per-node port report
        Master->>Nginx: Generate & validate nginx.conf (template + collected ports)
        Master->>Nginx: Start/Reload Nginx with TCP upstreams
        Worker->>Nginx: Start local proxy to master (if non-master)
    end

    Master->>PortDir: wait_for_port_reports() and aggregate reports
    PortDir-->>Master: Collected ports from all nodes
    Master->>Master: Build upstreams with all worker endpoints
    Master->>Worker: verify_remote_workers() (parallel health checks)
    Worker-->>Master: Health responses

    Master->>Master: Monitor loop: health checks, restarts, nginx status
    Master->>Nginx: Monitor nginx and reload on config changes

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Add multi-node sandbox support for SLURM clusters' directly and clearly summarizes the main change: introducing multi-node sandbox capabilities for SLURM environments.


coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Around line 335-337: The startup cleanup removes every node's port report
files (rm -f "$PORTS_REPORT_DIR"/*_ports.txt) causing a race where slow nodes
delete other nodes' reports; change the removal to only the current node's
report (e.g., rm -f "$PORTS_REPORT_DIR/${NODE_NAME}_ports.txt" or use hostname:
"${PORTS_REPORT_DIR}/$(hostname)_ports.txt") so other nodes' files remain;
update start-with-nginx.sh where PORTS_REPORT_DIR is used and ensure
wait_for_port_reports still reads all reports from the directory.

🧹 Nitpick comments (4)

dockerfiles/sandbox/start-with-nginx.sh (4)
117-147: Workers bind to 0.0.0.0 unconditionally — consider restricting in single-node mode.

Line 126 binds uWSGI workers to 0.0.0.0:${WORKER_PORT}, which is required for multi-node cross-node communication but unnecessarily exposes workers on all interfaces in single-node mode. If the container's network isn't fully isolated, any host on the network can reach individual workers directly (bypassing nginx session affinity).
Suggested conditional bind address
+    if [ "$NODE_COUNT" -gt 1 ]; then
+        BIND_ADDR="0.0.0.0"
+    else
+        BIND_ADDR="127.0.0.1"
+    fi
+
     cat > /tmp/worker${i}_uwsgi.ini << EOF
 [uwsgi]
 module = main
 callable = app
 processes = ${UWSGI_PROCESSES}
-http-socket = 0.0.0.0:${WORKER_PORT}
+http-socket = ${BIND_ADDR}:${WORKER_PORT}
Note: BIND_ADDR would need to be determined before start_worker_fast is called (e.g., set once after node discovery).
144-144: Shell redirections without commands (SC2188).

Lines 144, 560, 577, and 594 use bare > file to truncate files. While this works in bash, it's flagged by Shellcheck (SC2188) and can confuse readers. The idiomatic form is : > file (using the no-op builtin).
Example fix
-    > /var/log/worker${i}.log
+    : > /var/log/worker${i}.log
-    > "$PORTS_FILE"
+    : > "$PORTS_FILE"
-        > $UPSTREAM_FILE
+        : > $UPSTREAM_FILE
Also applies to: 560-560, 577-577, 594-594

247-261: Exported temp dir + xargs {} substitution — minor robustness notes.

export REMOTE_HEALTH_DIR (Line 245) leaks into the global environment. Since verify_remote_workers is called only once this is harmless, but local + explicit export or passing via env to xargs would be cleaner.

In the xargs -I {} sh -c 'endpoint="{}"' pattern (Line 254–261), if an endpoint string ever contains a single quote, it would break the sh -c quoting. This can't happen with valid host:port values, but a defensive alternative is to pipe through sh -c 'endpoint="$1"; ...' _ {} using positional args.

Both are low-risk given the trusted input, mentioning for hardening only.

668-705: Monitoring loop: restarted worker uses same port but nginx upstream is static.

The monitoring loop restarts crashed workers via start_worker which reuses the original port. This works because the restarted worker binds to the same port that nginx already has in its upstream — no config reload needed. This is a correct and practical design choice.

One edge consideration: if a restarted worker fails to bind (port still held by the dying process, or network blocking active), the monitoring loop will re-attempt every 10 seconds indefinitely. With network blocking, the warning is emitted (Lines 687–691). Without network blocking, the retry is silent after the first warning. Consider adding a retry counter or backoff to avoid log flooding for persistently-failing workers.
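The retry counter or backoff suggested above could look roughly like the following sketch. The helper names next_backoff and should_restart, the starting delay of 10 seconds, and the caps are illustrative assumptions, not values taken from the script.

```shell
#!/bin/sh
# Hypothetical helpers (not from the PR) for throttling worker restarts.

# Double the current delay, capped at a maximum (both in seconds).
next_backoff() (
    current=$1
    max=$2
    next=$((current * 2))
    if [ "$next" -gt "$max" ]; then
        next=$max
    fi
    echo "$next"
)

# Succeed only while the number of attempts so far is below the cap.
should_restart() {
    [ "$1" -lt "$2" ]
}

# Sketch of use inside a monitoring loop (delay/attempts tracked per worker):
#   if should_restart "$attempts" 5; then
#       attempts=$((attempts + 1))
#       delay=$(next_backoff "$delay" 300)
#   else
#       echo "worker $i: giving up after $attempts restarts" >&2
#   fi
```

Logging once when the cap is reached, rather than on every 10-second iteration, addresses the log-flooding concern without changing behavior for workers that recover on the first restart.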

dockerfiles/sandbox/start-with-nginx.sh

greptile-apps

_{1 file reviewed, 1 comment}


greptile-apps · 2026-02-07T00:33:39Z

dockerfiles/sandbox/start-with-nginx.sh

+            break
+        fi
+
+        cat "$endpoints_file" | xargs -P 64 -I {} sh -c '


useless cat - xargs can read from file directly

Suggested change

cat "$endpoints_file" | xargs -P 64 -I {} sh -c '

xargs -P 64 -I {} sh -c '


Kipok

please rebuild sandbox and run slurm tests. If those pass, we should be good to merge

Address PR review comments: - Remove useless cat pipe in port report check (grep reads file directly) - Scope rm to current node's port file only, preventing a race where a slow-starting node deletes port files already written by faster nodes Signed-off-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>

…#1186) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>

greptile-apps

_{3 files reviewed, 3 comments}


greptile-apps · 2026-02-07T01:08:02Z

Additional Comments (3)

nemo_skills/inference/generate.py
Missing key validation

STRUCTURED_OUTPUTS[self.cfg.structured_output] will raise a KeyError for any typo/unsupported value passed via ++structured_output=... (including from --extra_judge_args). Since this is user-facing CLI config, it should fail with a clear error listing valid keys rather than a raw KeyError stacktrace.


tests/test_generation.py
Integration test hits external API

This test invokes ns eval against https://integrate.api.nvidia.com/v1 and real models. In CI (and for most contributors) this will deterministically fail due to missing credentials/network access, and it’s also very slow/flaky. Unit tests here should mock the model client / generation call (or be marked to skip unless explicit env vars are present).

.github/workflows/tests.yml
CI now prepares HLE dataset

ns prepare_data ... hle changes CI to download/prepare an additional dataset. If HLE requires credentials, large downloads, or is otherwise not reliably available in CI, this will break the workflow. Please confirm hle is lightweight/public like the other CI datasets, or gate it behind a condition/fixture used only in dedicated tests.

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Around line 651-659: The port display is misleading because
ACTUAL_WORKER_PORTS may be non-contiguous after conflict retries; update the
echo logic in the IS_MASTER/WORKER blocks to either print the full list of
ACTUAL_WORKER_PORTS (iterating the ACTUAL_WORKER_PORTS array up to NUM_WORKERS)
or detect non-contiguity and append a "(non-contiguous ports: ...)" note;
specifically change the lines that currently emit
"${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}" and the
similar WORKER echo to instead join and print each ACTUAL_WORKER_PORTS[i] (or
print the range plus a non-contiguous warning) so operators see the real ports
when retries/offsets occurred.
- Around line 685-694: When a worker dies and NETWORK_BLOCKING_ACTIVE=1, avoid
immediately restarting it to prevent an infinite futile restart loop; modify the
monitor logic around the kill-check and restart (the block that calls
start_worker and updates WORKER_PIDS and ACTUAL_WORKER_PORTS) to either skip
restarting when NETWORK_BLOCKING_ACTIVE is set or implement a per-worker retry
cap (e.g., track restart counts in an array like WORKER_RETRIES and stop
restarting after N attempts) and log a single clear message when restarts are
suppressed; ensure you reference NETWORK_BLOCKING_ACTIVE, start_worker,
WORKER_PIDS, ACTUAL_WORKER_PORTS (and add WORKER_RETRIES if using the cap) when
making the change so the monitor stops spawning doomed processes.
- Line 144: The script uses a bare redirection of the form >
/var/log/worker${i}.log which is not POSIX-portable; replace each bare
redirection of the worker log (the occurrences that use >
/var/log/worker${i}.log) with an explicit no-op redirection using the shell
builtin (i.e., use a ":" no-op followed by the redirection) so the intent is
explicit and shellcheck SC2188 is satisfied; apply the same change for the other
matching occurrences with the worker log filenames.

🧹 Nitpick comments (2)

dockerfiles/sandbox/start-with-nginx.sh (2)
254-261: xargs -I {} sh -c '…{}' is a shell-injection anti-pattern.

Although endpoints are internally generated, interpolating {} directly inside sh -c breaks if the value ever contains quotes or shell metacharacters. Pass the value as a positional parameter instead.
Proposed fix
-        cat "$endpoints_file" | xargs -P 64 -I {} sh -c '
-            endpoint="{}"
-            status_file="$REMOTE_HEALTH_DIR/$(echo "$endpoint" | tr ":" "_")"
+        cat "$endpoints_file" | xargs -P 64 -I {} sh -c '
+            endpoint="$1"
+            status_file="$REMOTE_HEALTH_DIR/$(echo "$endpoint" | tr ":" "_")"
             [ -f "$status_file" ] && exit 0
             if curl -s -f --connect-timeout 2 --max-time 5 "http://${endpoint}/health" > /dev/null 2>&1; then
                 touch "$status_file"
             fi
-        '
+        ' _ {}
668-679: Background stats subshell PID is not tracked by cleanup().

The monitoring subshell spawned here won't be killed by the cleanup trap. In a container this is benign (container stop kills all), but for completeness consider capturing the PID and adding it to the kill list.

coderabbitai · 2026-02-07T01:12:30Z

dockerfiles/sandbox/start-with-nginx.sh

+        echo "cheaper = ${UWSGI_CHEAPER}" >> /tmp/worker${i}_uwsgi.ini
+    fi
+
+    > /var/log/worker${i}.log


⚠️ Potential issue | 🟡 Minor

Bare redirection without a command (SC2188).

> file works in bash but is not POSIX-portable and triggers shellcheck SC2188. Use : > to make intent explicit.

Proposed fix

-    > /var/log/worker${i}.log
+    : > /var/log/worker${i}.log

Same pattern on lines 560, 577, and 594.


🧰 Tools

🪛 Shellcheck (0.11.0)

[warning] 144-144: This redirection doesn't have a command. Move to its command (or use 'true' as no-op).

(SC2188)


coderabbitai · 2026-02-07T01:12:31Z

dockerfiles/sandbox/start-with-nginx.sh

+if [ "$IS_MASTER" = "1" ]; then
+    echo "=== Sandbox ready (MASTER) ==="
+    echo "  Nginx LB: http://localhost:$NGINX_PORT"
+    echo "  Nodes: $NODE_COUNT | Workers/node: $NUM_WORKERS | Total: $((NODE_COUNT * NUM_WORKERS))"
+    echo "  Local ports: ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}"
 else
-    echo "UWSGI cheaper mode: disabled"
+    echo "=== Sandbox ready (WORKER) ==="
+    echo "  Proxy: localhost:$NGINX_PORT -> $MASTER_NODE:$NGINX_PORT"
+    echo "  Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]})"


⚠️ Potential issue | 🟡 Minor

Port range display is misleading after conflict retries.

After the retry algorithm applies offsets (e.g., +200), ports are no longer contiguous. Displaying ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]} implies a contiguous range, which could confuse operators. Consider listing all ports or noting "non-contiguous" when retries occurred.
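One way to act on this comment is to print a range only when the ports really are consecutive and fall back to an explicit list otherwise. The sketch below is an assumption about how that could be done; format_ports is a hypothetical helper that takes the ports as positional arguments, so in the bash script it would be called as `format_ports "${ACTUAL_WORKER_PORTS[@]}"`.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): summarize a list of ports.
# Prints "first-last" for a consecutive run, a comma-separated list
# otherwise, and "none" when no ports are given.
format_ports() (
    if [ "$#" -eq 0 ]; then
        echo "none"
        exit 0
    fi
    first=$1
    prev=$1
    list=$1
    contiguous=1
    shift
    for port in "$@"; do
        # Any gap or reordering means a plain range would be misleading.
        if [ "$port" -ne $((prev + 1)) ]; then
            contiguous=0
        fi
        list="$list,$port"
        prev=$port
    done
    if [ "$contiguous" -eq 1 ] && [ "$first" != "$prev" ]; then
        echo "$first-$prev"
    else
        echo "$list"
    fi
)
```

Because the empty case prints "none", the same helper also avoids indexing the array at NUM_WORKERS-1 when no workers were started.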


coderabbitai · 2026-02-07T01:12:31Z

dockerfiles/sandbox/start-with-nginx.sh

        if ! kill -0 "$pid" 2>/dev/null; then
-            echo "WARNING: Worker $i (PID $pid) died - restarting..."
-            new_pid=$(start_worker $i)
-            WORKER_PIDS[$idx]=$new_pid
+            echo "[$_H] WARNING: Worker $i (PID $pid) died — restarting..."
+            if [ "$NETWORK_BLOCKING_ACTIVE" = "1" ]; then
+                echo "[$_H] WARNING: Network blocking (ld.so.preload) is active. The restarted"
+                echo "[$_H]   worker may fail to bind its port because socket() is blocked for"
+                echo "[$_H]   new processes. Remaining workers continue serving requests."
+            fi
+            result=$(start_worker $i)
+            WORKER_PIDS[$idx]="${result%%:*}"
+            ACTUAL_WORKER_PORTS[$idx]="${result##*:}"


⚠️ Potential issue | 🟠 Major

Futile restart loop when network blocking is active.

When NETWORK_BLOCKING_ACTIVE=1, restarted workers cannot bind their socket (as documented in lines 627-629 and the warning itself). Yet the restart still proceeds at line 692, the new process dies, and the next monitoring iteration restarts it again — an infinite loop of doomed spawns generating log noise every 10 seconds.

Consider skipping the restart (or capping per-worker retries) when network blocking is active:

Proposed fix

         if ! kill -0 "$pid" 2>/dev/null; then
             echo "[$_H] WARNING: Worker $i (PID $pid) died — restarting..."
             if [ "$NETWORK_BLOCKING_ACTIVE" = "1" ]; then
                 echo "[$_H] WARNING: Network blocking (ld.so.preload) is active. The restarted"
                 echo "[$_H]   worker may fail to bind its port because socket() is blocked for"
-                echo "[$_H]   new processes. Remaining workers continue serving requests."
+                echo "[$_H]   new processes. Skipping restart — remaining workers continue serving."
+                continue
             fi
             result=$(start_worker $i)
             WORKER_PIDS[$idx]="${result%%:*}"
             ACTUAL_WORKER_PORTS[$idx]="${result##*:}"


Signed-off-by: George Armstrong <georgea@nvidia.com>

greptile-apps

_{1 file reviewed, 4 comments}


greptile-apps · 2026-02-07T05:11:54Z

dockerfiles/sandbox/start-with-nginx.sh

+    PORTS_REPORT_DIR="/tmp/sandbox_ports_$$"
+fi
+mkdir -p "$PORTS_REPORT_DIR"
+rm -f "$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt" 2>/dev/null || true


Shared port file clobber

rm -f "$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt" runs on every node at startup. In a multi-node SLURM job, nodes can start at different times; a late-starting node can delete an already-written port report from an early node, causing the master to block until timeout in wait_for_port_reports(). Remove this unconditional delete or scope it so only the current node can clear its own file in a race-safe way (e.g., write to a temp file and mv atomically).
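The temp-file-and-mv approach mentioned in this comment can be sketched as follows. This assumes the report format is one port per line followed by the PORT_REPORT_COMPLETE marker that the master greps for; the exact line format used by the script is not shown in this thread, and write_port_report is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): publish a node's port report
# atomically. Usage: write_port_report DIR NODE PORT...
write_port_report() (
    dir=$1
    node=$2
    shift 2
    # The temp file lives in the same directory so the final mv is a rename,
    # and its name does not match the *_ports.txt pattern readers look for.
    tmp=$(mktemp "$dir/.${node}_ports.XXXXXX") || exit 1
    {
        for port in "$@"; do
            printf '%s\n' "$port"
        done
        printf 'PORT_REPORT_COMPLETE\n'
    } > "$tmp"
    # Replaces any stale report from a previous run in one step.
    mv -f "$tmp" "$dir/${node}_ports.txt"
)
```

With this pattern a reader sees either the previous complete report or the new complete report, so a separate startup rm of the node's own file may no longer be needed. Rename semantics on a particular shared filesystem should still be confirmed.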

greptile-apps · 2026-02-07T05:11:55Z

dockerfiles/sandbox/start-with-nginx.sh

+    echo "=== Sandbox ready (MASTER) ==="
+    echo "  Nginx LB: http://localhost:$NGINX_PORT"
+    echo "  Nodes: $NODE_COUNT | Workers/node: $NUM_WORKERS | Total: $((NODE_COUNT * NUM_WORKERS))"
+    echo "  Local ports: ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}"


NUM_WORKERS=0 breaks

This status line indexes ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]. If NUM_WORKERS is 0 (explicitly or via nproc --all), this becomes [-1] and errors in bash. Guard earlier (validate NUM_WORKERS > 0 and exit with a clear error) before any array indexing/loops that assume at least 1 worker.
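A guard of the kind requested here might look like the sketch below. validate_num_workers is a hypothetical name, and the error wording is illustrative.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): reject empty, non-numeric, or
# zero worker counts before any loop or array indexing depends on them.
validate_num_workers() {
    case "$1" in
        ''|*[!0-9]*)
            echo "ERROR: NUM_WORKERS must be a positive integer, got '$1'" >&2
            return 1
            ;;
    esac
    if [ "$1" -lt 1 ]; then
        echo "ERROR: NUM_WORKERS must be at least 1, got '$1'" >&2
        return 1
    fi
    return 0
}

# Intended use near the top of the startup script:
#   validate_num_workers "${NUM_WORKERS:-}" || exit 1
```

Failing early with a clear message means later status lines can assume at least one entry in ACTUAL_WORKER_PORTS.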

greptile-apps · 2026-02-07T05:11:56Z

dockerfiles/sandbox/start-with-nginx.sh

+if [ "$ALL_NODES" = "127.0.0.1" ] || [ "$CURRENT_NODE_SHORT" = "$MASTER_NODE_SHORT" ]; then
+    IS_MASTER=1
+    echo "[$_H] Role: MASTER | Nodes: $NODE_COUNT | Master: $MASTER_NODE"
+else
+    IS_MASTER=0
+    echo "[$_H] Role: WORKER | Master: $MASTER_NODE"


Master detection mismatch

In SLURM mode, MASTER_NODE is taken from the expanded nodelist (e.g. node001), but the current node is detected via hostname and compared as CURRENT_NODE_SHORT == MASTER_NODE_SHORT. On many clusters hostname returns a different alias/FQDN than scontrol/nodelist uses (e.g. node001.cluster vs node001 or vice versa), which can cause multiple nodes to think they're master (or none). Prefer using SLURM-provided identity (e.g. SLURMD_NODENAME or scontrol show hostname $SLURM_NODEID output) for the local node name, and compare using the same naming scheme used to build ALL_NODES.
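One way to make the comparison robust is to take the local name from `SLURMD_NODENAME` when Slurm sets it and to normalize both sides the same way. The sketch below assumes that stripping the domain suffix is the right normalization for the cluster in question; `short_name`, `local_node_name`, and `is_master_node` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch only: compare node identities using one naming scheme.
# All function names here are hypothetical, not from the PR.

# Strip a domain suffix from a hostname; leave IPv4-looking values untouched.
short_name() {
    case "$1" in
        *[!0-9.]*) printf '%s\n' "${1%%.*}" ;;  # hostname: drop everything after the first dot
        *)         printf '%s\n' "$1" ;;        # digits and dots only (e.g. 127.0.0.1): keep as-is
    esac
}

# Prefer the name Slurm assigned to this node; fall back to hostname otherwise.
local_node_name() {
    short_name "${SLURMD_NODENAME:-$(hostname)}"
}

# Returns 0 when the two names refer to the same node after normalization.
is_master_node() {
    [ "$(short_name "$1")" = "$(short_name "$2")" ]
}

# Assumed usage:
#   if is_master_node "$(local_node_name)" "$MASTER_NODE"; then IS_MASTER=1; else IS_MASTER=0; fi
```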

greptile-apps · 2026-02-07T05:11:57Z

dockerfiles/sandbox/start-with-nginx.sh

+            for endpoint in $(read_port_file "$node" "$port_file"); do
+                echo "        server ${endpoint} max_fails=3 fail_timeout=30s;" >> $UPSTREAM_FILE
+                echo "$endpoint" >> "$ENDPOINTS_FILE"
+            done


Port file parsing unsafe

for endpoint in $(read_port_file ...) word-splits on whitespace. If a port file is empty/partial, read_port_file can emit blank lines and this loop will silently skip/merge tokens. Also, any unexpected whitespace will corrupt endpoints. Use a while IFS= read -r endpoint; do ...; done < <(read_port_file ...) pattern to preserve lines exactly and handle empty output deterministically.
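The line-oriented loop suggested above could be sketched like this. `emit_upstream_servers` is a hypothetical helper that reads endpoints on stdin; skipping blank lines is an explicit choice made here, not something the PR specifies.

```bash
#!/usr/bin/env bash
# Sketch only: turn one endpoint per input line into nginx upstream entries.
# emit_upstream_servers is a hypothetical name, not from the PR.
emit_upstream_servers() {
    local endpoint
    # IFS= and -r keep each line exactly as written; the || clause also
    # handles a final line that lacks a trailing newline.
    while IFS= read -r endpoint || [ -n "$endpoint" ]; do
        if [ -z "$endpoint" ]; then
            continue  # blank line from an empty or partial port file
        fi
        printf '        server %s max_fails=3 fail_timeout=30s;\n' "$endpoint"
    done
}

# Assumed usage with the script's existing helper:
#   emit_upstream_servers < <(read_port_file "$node" "$port_file") >> "$UPSTREAM_FILE"
```

Empty input produces no output at all, so the caller can detect a node that contributed no endpoints instead of silently writing a malformed upstream block.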

gwarmstrong · 2026-02-07T05:40:13Z

@Kipok the gpt-oss test and super_49b tests pass (minus a couple unrelated RULER fluctuations)


Signed-off-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

gwarmstrong added 6 commits February 6, 2026 12:16

docs: add environment variable reference to start-with-nginx.sh

ada4f67

Document all required and optional environment variables grouped by category: worker configuration, multi-node/SLURM, and security. Signed-off-by: George Armstrong <georgea@nvidia.com>

fix: revert cosmetic changes to uWSGI validation and cleanup function

bc8b02b

Minimize diff by matching origin/main's exact wording for uWSGI validation warnings and using the original if/then/fi form in the cleanup function. Signed-off-by: George Armstrong <georgea@nvidia.com>

greptile-apps bot reviewed Feb 6, 2026

coderabbitai bot reviewed Feb 6, 2026

gwarmstrong requested a review from Kipok February 7, 2026 00:09

greptile-apps bot reviewed Feb 7, 2026

Kipok approved these changes Feb 7, 2026

gwarmstrong and others added 3 commits February 6, 2026 17:05

A small update on running tests docs (#1219)

c00a644

Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>

Kipok added the run GPU tests label Feb 7, 2026

gwarmstrong force-pushed the georgea/multinode-sandbox-pr branch from 4f89d40 to 5cdca71 Compare February 7, 2026 01:05

Merge branch 'main' into georgea/multinode-sandbox-pr

9156609

gwarmstrong added run GPU tests and removed run GPU tests labels Feb 7, 2026

greptile-apps bot reviewed Feb 7, 2026

coderabbitai bot reviewed Feb 7, 2026

fix: make sandbox span all group nodes by default

44863ff

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong force-pushed the georgea/multinode-sandbox-pr branch from 415e024 to 44863ff Compare February 7, 2026 05:10

greptile-apps bot reviewed Feb 7, 2026

gwarmstrong merged commit b19ba96 into main Feb 7, 2026
5 checks passed

gwarmstrong deleted the georgea/multinode-sandbox-pr branch February 7, 2026 05:40

coderabbitai bot mentioned this pull request Feb 11, 2026

feat: migrate sandbox from uwsgi to gunicorn #1232

Closed

coderabbitai bot mentioned this pull request Feb 11, 2026

feat: migrate sandbox from uwsgi to gunicorn #1234

Open

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

Add multi-node sandbox support for SLURM clusters (#1218)

f14126d

Signed-off-by: George Armstrong <georgea@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

Add multi-node sandbox support for SLURM clusters (#1218)

c15d304

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

-    if [ -f "$port_file" ] && cat "$port_file" 2>/dev/null | grep -q "PORT_REPORT_COMPLETE"; then
+    if [ -f "$port_file" ] && grep -q "PORT_REPORT_COMPLETE" "$port_file" 2>/dev/null; then

-    rm -f "$PORTS_REPORT_DIR"/*_ports.txt 2>/dev/null || true
+    rm -f "$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt" 2>/dev/null || true

-    echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]})"
+    echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]:-none}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS>0?NUM_WORKERS-1:0))]:-none})"

-    cat "$endpoints_file" | xargs -P 64 -I {} sh -c '
+    xargs -P 64 -I {} sh -c '
