
Add multi-node sandbox support for SLURM clusters #1218

Merged: gwarmstrong merged 11 commits into main from georgea/multinode-sandbox-pr on Feb 7, 2026


@gwarmstrong gwarmstrong commented Feb 6, 2026

Summary

  • Adds multi-node sandbox support, enabling code execution workers to be distributed across multiple SLURM nodes with nginx load balancing and session affinity
  • Auto-detects SLURM multi-node environments (via SLURM_JOB_NODELIST), expands compressed nodelists, and coordinates port assignments across nodes via shared filesystem
  • Backward-compatible: single-node mode works identically when SLURM variables are absent

Changes

dockerfiles/sandbox/start-with-nginx.sh (major rewrite)

  • SLURM nodelist expansion (supports node[001-016], gpu[01-02],cpu[01-03], etc.)
  • TCP socket workers (replacing unix sockets) with configurable base port
  • Cross-node port coordination via shared filesystem (/nemo_run/ or /workspace/)
  • Master/worker node roles: master runs nginx LB with cross-node upstream, workers proxy to master
  • Port conflict retry algorithm for TCP port binding failures
  • Parallel health checks for both local and remote workers
  • Network blocking support for all modes with transparent diagnostics
  • Comprehensive environment variable documentation header
  • SANDBOX_FORCE_SINGLE_NODE override for debugging
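
To make the nodelist expansion concrete, here is a minimal pure-bash sketch. It is not the script's implementation (the commits below say the real script passes the nodelist to Python via sys.argv), and on a real cluster `scontrol show hostnames` already does this job. The function name `expand_nodelist` is hypothetical, and only simple numeric ranges inside one bracket group per name are handled.

```bash
# Hypothetical helper: expand a compressed SLURM nodelist, one hostname per line.
expand_nodelist() {
    local list="$1" token="" depth=0 ch i
    local -a tokens=() parts=()
    # Split on commas that are not inside [...].
    for ((i = 0; i < ${#list}; i++)); do
        ch="${list:i:1}"
        case "$ch" in
            "[") depth=$((depth + 1)); token+="$ch" ;;
            "]") depth=$((depth - 1)); token+="$ch" ;;
            ",") if ((depth == 0)); then tokens+=("$token"); token=""; else token+="$ch"; fi ;;
            *) token+="$ch" ;;
        esac
    done
    [ -n "$token" ] && tokens+=("$token")

    # prefix[body]suffix, kept in a variable so the brackets reach the regex engine intact.
    local re='^([^[]*)\[([^]]*)\](.*)$'
    local t prefix body suffix part lo hi width n
    for t in "${tokens[@]}"; do
        if [[ "$t" =~ $re ]]; then
            prefix="${BASH_REMATCH[1]}"; body="${BASH_REMATCH[2]}"; suffix="${BASH_REMATCH[3]}"
            IFS=',' read -r -a parts <<< "$body"
            for part in "${parts[@]}"; do
                if [[ "$part" == *-* ]]; then
                    lo="${part%-*}"; hi="${part#*-}"; width=${#lo}
                    # 10# forces base 10 so zero-padded values are not read as octal.
                    for ((n = 10#$lo; n <= 10#$hi; n++)); do
                        printf '%s%0*d%s\n' "$prefix" "$width" "$n" "$suffix"
                    done
                else
                    printf '%s%s%s\n' "$prefix" "$part" "$suffix"
                fi
            done
        else
            printf '%s\n' "$t"
        fi
    done
}
```

For example, `expand_nodelist 'gpu[01-02],cpu[01-03]'` prints gpu01, gpu02, cpu01, cpu02, cpu03 on separate lines, preserving the zero padding.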

dockerfiles/sandbox/nginx-worker-proxy.conf.template (new file)

  • Extracted nginx config template for worker nodes that proxy to master's LB

dockerfiles/Dockerfile.sandbox (minor)

  • Move COPY start-with-nginx.sh after dependency layers for better Docker cache
  • Use exec-form CMD ["/start-with-nginx.sh"]
  • Add COPY for new nginx-worker-proxy.conf.template

dockerfiles/sandbox/nginx.conf.template (minor)

  • Updated comments to document both single-node and multi-node upstream modes

Test plan

  • Validated on DFW 16-node run (128 workers/node): 9594 successful requests, 0 5xx errors
  • Validated on DFW 2-node run
  • Single-node backward compatibility verified (no SLURM vars → same behavior as before)
  • CI tests pass

Summary by CodeRabbit

  • New Features

    • Multi-node sandboxing with coordinated worker discovery, session-aware routing, health/readiness checks, and session-affinity for stable routing.
    • TCP-based worker networking with proxying and an nginx status endpoint for monitoring.
  • Chores

    • Optimized container startup ordering and command form for improved caching and startup behavior.
    • Added new proxy configuration template and introduced a configuration flag to span group nodes.

Enable the sandbox (code execution environment) to scale across multiple
SLURM nodes for large-scale RL training jobs.

Key changes:
- Auto-detect SLURM multi-node environments and expand nodelists
- Allocate unique TCP ports per worker with parallel startup and
  automatic port conflict retry
- Coordinate port reporting between nodes via shared filesystem
- Configure nginx upstream to load-balance across all nodes' workers
- Worker nodes run local nginx proxy forwarding to master's LB
- Parallel health checks for faster startup with many workers
- Backward-compatible: single-node mode auto-detected when SLURM
  vars are absent
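
The port conflict retry mentioned above could work roughly as sketched here. This is an illustration only: `assign_port`, `try_bind`, `BUSY_PORTS`, the retry count, and the offset step of 200 are all assumptions (a later review comment mentions offsets such as +200), and the real script would detect a conflict from an actual failed bind rather than a lookup.

```bash
# Stand-in for a real bind attempt: a port counts as taken if it appears
# in the space-delimited BUSY_PORTS string (assumption for illustration).
BUSY_PORTS=" 50001 "
try_bind() { [[ "$BUSY_PORTS" != *" $1 "* ]]; }

# Print the first port that binds for worker <index>, jumping to a new
# offset block after each conflict. Returns 1 if every attempt fails.
assign_port() {
    local base="$1" index="$2" max_retries="${3:-3}" step="${4:-200}"
    local attempt port
    for ((attempt = 0; attempt <= max_retries; attempt++)); do
        port=$((base + index + attempt * step))
        if try_bind "$port"; then
            echo "$port"
            return 0
        fi
        echo "port $port busy, retrying with offset" >&2
    done
    return 1
}
```

With the sample data, worker 0 keeps 50000 while worker 1 collides on 50001 and lands on 50201, which is why the final port list is not guaranteed to be contiguous.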

Validated on DFW with 16-node (128 workers/node) runs: 9594 successful
requests, 0 errors.

Signed-off-by: George Armstrong <georgea@nvidia.com>

- Pass SLURM nodelist via sys.argv instead of shell interpolation
  into Python triple-quoted string (prevents injection)
- Fix trap overwrite: fold temp dir cleanup into cleanup() instead
  of a separate EXIT trap that overwrote SIGTERM/SIGINT handler
- Remove unused is_port_free() and find_free_port() dead code
- Move network blocking (ld.so.preload) outside master-only branch
  so it applies on all nodes (worker nodes also run user code)
- Clean stale port files on startup to handle SANDBOX_PORTS_DIR reuse

Signed-off-by: George Armstrong <georgea@nvidia.com>

Document all required and optional environment variables grouped by
category: worker configuration, multi-node/SLURM, and security.

Signed-off-by: George Armstrong <georgea@nvidia.com>

Address PR review comments:

1. Remove aggressive Lustre cache invalidation (touch/rm/ls/sync dance).
   The cat-based file read already forces Lustre to fetch content; the
   extra invalidation was unnecessary overhead.

2. Extract utility functions for readability:
   - generate_nginx_config() — template substitution + nginx -t
   - read_port_file() — parse port files, emit node:port lines
   - wait_for_port_reports() — poll shared storage for all nodes
   - verify_remote_workers() — parallel health checks via xargs
   This makes the nginx setup section a clear linear flow:
     wait_for_port_reports → build upstream → generate_nginx_config →
     verify_remote_workers

3. Add $(hostname) to load monitor stats output.

4. Skip network blocking in multi-node mode. ld.so.preload intercepts
   socket() in all new exec'd processes — if the monitoring loop restarts
   a crashed worker, the new uWSGI process would be unable to bind its
   listening socket. Document this limitation.

5. Add SANDBOX_FORCE_SINGLE_NODE env var to override multi-node detection.
   Useful for debugging or when multi-node sandbox causes issues.
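
As an illustration of the wait_for_port_reports step named in item 2, the following sketch polls the shared directory until every node's file carries the PORT_REPORT_COMPLETE marker. The argument order, the 300-second timeout, and the 2-second poll interval are assumptions, not values taken from the script.

```bash
# Sketch: block until all nodes have published a complete port report.
# Arguments: report dir, space-separated node names, timeout (s), poll interval (s).
wait_for_port_reports() {
    local dir="$1" nodes="$2" timeout="${3:-300}" interval="${4:-2}"
    local deadline=$((SECONDS + timeout)) node missing
    while :; do
        missing=0
        for node in $nodes; do
            # ${node%%.*} strips any domain suffix, matching the short-name files.
            grep -q "PORT_REPORT_COMPLETE" "$dir/${node%%.*}_ports.txt" 2>/dev/null \
                || missing=$((missing + 1))
        done
        [ "$missing" -eq 0 ] && return 0
        if ((SECONDS >= deadline)); then
            echo "timed out waiting for $missing node(s)" >&2
            return 1
        fi
        sleep "$interval"
    done
}
```

Checking for the marker rather than for the file's existence keeps the master from reading a report that a slow node is still writing.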

Also: trim verbose debug logging, reduce file from ~1000 to ~710 lines.

Signed-off-by: George Armstrong <georgea@nvidia.com>

1. Fix misleading log output: only show SLURM vars when they're set,
   and emit a clear diagnostic when SLURM_JOB_NODELIST expansion fails
   instead of a silent fallback.

2. Restore uWSGI cheaper validation warnings that were lost in the
   restructuring. Invalid values are still auto-corrected but now log
   what happened.

3. Only write port files to shared storage in multi-node mode — the
   port coordination protocol is unnecessary overhead in single-node.

4. Extract worker proxy nginx config to a separate template file
   (nginx-worker-proxy.conf.template) instead of an inline heredoc.

5. Enable network blocking on all modes (not just single-node). Add
   NETWORK_BLOCKING_ACTIVE flag so the monitoring loop emits a clear
   diagnostic when a worker restart fails due to ld.so.preload blocking
   socket() in new processes.

Signed-off-by: George Armstrong <georgea@nvidia.com>

Minimize diff by matching origin/main's exact wording for uWSGI
validation warnings and using the original if/then/fi form in the
cleanup function.

Signed-off-by: George Armstrong <georgea@nvidia.com>
@greptile-apps greptile-apps bot left a comment

1 file reviewed, 4 comments

for node in $ALL_NODES; do
local node_short="${node%%.*}"
local port_file="$PORTS_REPORT_DIR/${node_short}_ports.txt"
if [ -f "$port_file" ] && cat "$port_file" 2>/dev/null | grep -q "PORT_REPORT_COMPLETE"; then

useless cat with grep -q

Suggested change
- if [ -f "$port_file" ] && cat "$port_file" 2>/dev/null | grep -q "PORT_REPORT_COMPLETE"; then
+ if [ -f "$port_file" ] && grep -q "PORT_REPORT_COMPLETE" "$port_file" 2>/dev/null; then

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

PORTS_REPORT_DIR="/tmp/sandbox_ports_$$"
fi
mkdir -p "$PORTS_REPORT_DIR"
rm -f "$PORTS_REPORT_DIR"/*_ports.txt 2>/dev/null || true

race condition: all nodes execute rm -f on shared filesystem simultaneously

If multiple nodes start at nearly the same time, they can race to delete each other's port files. A node that completes writing its ports could have its file deleted by another node still executing this line.

Suggested change
- rm -f "$PORTS_REPORT_DIR"/*_ports.txt 2>/dev/null || true
+ rm -f "$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt" 2>/dev/null || true

echo "=== Sandbox ready (MASTER) ==="
echo " Nginx LB: http://localhost:$NGINX_PORT"
echo " Nodes: $NODE_COUNT | Workers/node: $NUM_WORKERS | Total: $((NODE_COUNT * NUM_WORKERS))"
echo " Local ports: ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}"

potential array index out of bounds if NUM_WORKERS is 0

If NUM_WORKERS=0 (either explicitly set or through nproc --all returning 0), this line would access ACTUAL_WORKER_PORTS[-1] which is invalid in bash.

Check NUM_WORKERS > 0 before array access or earlier in the script.

echo "UWSGI cheaper mode: disabled"
echo "=== Sandbox ready (WORKER) ==="
echo " Proxy: localhost:$NGINX_PORT -> $MASTER_NODE:$NGINX_PORT"
echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]})"

same array bounds issue as line 655

Suggested change
- echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]})"
+ echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]:-none}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS>0?NUM_WORKERS-1:0))]:-none})"


coderabbitai bot commented Feb 6, 2026

Walkthrough

Reorganizes the sandbox Dockerfile and adds templates, and substantially rewrites the startup script to support TCP-based uWSGI workers for single-node and SLURM-detected multi-node deployments with dynamic Nginx config generation, cross-node port coordination, remote health checks, and master/worker orchestration.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dockerfile: dockerfiles/Dockerfile.sandbox | Reordered COPY operations for cache optimization, added copy of nginx-worker-proxy.conf.template, and changed CMD from shell form to exec form (["/start-with-nginx.sh"]). |
| Nginx templates: dockerfiles/sandbox/nginx.conf.template, dockerfiles/sandbox/nginx-worker-proxy.conf.template | Expanded upstream documentation and session-affinity notes in nginx.conf.template; added new nginx-worker-proxy.conf.template that proxies to a master load balancer with consistent-hash affinity, proxy headers, extended timeouts, disabled buffering, logging, and /nginx-status stub_status. |
| Startup script: dockerfiles/sandbox/start-with-nginx.sh | Full rewrite to support multi-node deployments: SLURM nodelist expansion, master/worker role detection, per-worker TCP ports with retry/offset, shared port reporting directory, wait/verify aggregation of remote ports, dynamic nginx config generation & validation, parallel remote health checks, worker spawn/restart loops, optional network-blocking, and proxy behavior for non-master workers. |
| Python config: nemo_skills/pipeline/utils/scripts.py | Added new public boolean field SandboxScript.span_group_nodes: bool = True to indicate sandbox spans group nodes (config flag; no behavior changes within file). |

Sequence Diagram

sequenceDiagram
    participant SLURM as SLURM Cluster
    participant Master as Master Node<br/>(start-with-nginx.sh)
    participant Worker as Worker Node(s)<br/>(start-with-nginx.sh)
    participant Nginx as Nginx LB
    participant uWSGI as uWSGI Workers<br/>(TCP)
    participant PortDir as Shared Port<br/>Coordination Dir

    Master->>SLURM: Query node list (SLURM_NODELIST)
    SLURM-->>Master: Node hostnames & count
    Master->>Master: Determine master/worker role and init ports

    par Multi-Node Startup
        Master->>uWSGI: Start local uWSGI workers on BASE_PORT + offsets
        uWSGI-->>PortDir: Write per-node port report
        Worker->>uWSGI: Start local uWSGI workers on assigned ports
        Worker-->>PortDir: Write per-node port report
        Master->>Nginx: Generate & validate nginx.conf (template + collected ports)
        Master->>Nginx: Start/Reload Nginx with TCP upstreams
        Worker->>Nginx: Start local proxy to master (if non-master)
    end

    Master->>PortDir: wait_for_port_reports() and aggregate reports
    PortDir-->>Master: Collected ports from all nodes
    Master->>Master: Build upstreams with all worker endpoints
    Master->>Worker: verify_remote_workers() (parallel health checks)
    Worker-->>Master: Health responses

    Master->>Master: Monitor loop: health checks, restarts, nginx status
    Master->>Nginx: Monitor nginx and reload on config changes

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add multi-node sandbox support for SLURM clusters' directly and clearly summarizes the main change: introducing multi-node sandbox capabilities for SLURM environments. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Around line 335-337: The startup cleanup removes every node's port report
files (rm -f "$PORTS_REPORT_DIR"/*_ports.txt) causing a race where slow nodes
delete other nodes' reports; change the removal to only the current node's
report (e.g., rm -f "$PORTS_REPORT_DIR/${NODE_NAME}_ports.txt" or use hostname:
"${PORTS_REPORT_DIR}/$(hostname)_ports.txt") so other nodes' files remain;
update start-with-nginx.sh where PORTS_REPORT_DIR is used and ensure
wait_for_port_reports still reads all reports from the directory.
🧹 Nitpick comments (4)
dockerfiles/sandbox/start-with-nginx.sh (4)

117-147: Workers bind to 0.0.0.0 unconditionally — consider restricting in single-node mode.

Line 126 binds uWSGI workers to 0.0.0.0:${WORKER_PORT}, which is required for multi-node cross-node communication but unnecessarily exposes workers on all interfaces in single-node mode. If the container's network isn't fully isolated, any host on the network can reach individual workers directly (bypassing nginx session affinity).

Suggested conditional bind address
+    if [ "$NODE_COUNT" -gt 1 ]; then
+        BIND_ADDR="0.0.0.0"
+    else
+        BIND_ADDR="127.0.0.1"
+    fi
+
     cat > /tmp/worker${i}_uwsgi.ini << EOF
 [uwsgi]
 module = main
 callable = app
 processes = ${UWSGI_PROCESSES}
-http-socket = 0.0.0.0:${WORKER_PORT}
+http-socket = ${BIND_ADDR}:${WORKER_PORT}

Note: BIND_ADDR would need to be determined before start_worker_fast is called (e.g., set once after node discovery).


144-144: Shell redirections without commands (SC2188).

Lines 144, 560, 577, and 594 use bare > file to truncate files. While this works in bash, it's flagged by Shellcheck (SC2188) and can confuse readers. The idiomatic form is : > file (using the no-op builtin).

Example fix
-    > /var/log/worker${i}.log
+    : > /var/log/worker${i}.log
-    > "$PORTS_FILE"
+    : > "$PORTS_FILE"
-        > $UPSTREAM_FILE
+        : > $UPSTREAM_FILE

Also applies to: 560-560, 577-577, 594-594


247-261: Exported temp dir + xargs {} substitution — minor robustness notes.

  1. export REMOTE_HEALTH_DIR (Line 245) leaks into the global environment. Since verify_remote_workers is called only once this is harmless, but local + explicit export or passing via env to xargs would be cleaner.
  2. In the xargs -I {} sh -c 'endpoint="{}"' pattern (Line 254–261), if an endpoint string ever contains a single quote, it would break the sh -c quoting. This can't happen with valid host:port values, but a defensive alternative is to pipe through sh -c 'endpoint="$1"; ...' _ {} using positional args.

Both are low-risk given the trusted input, mentioning for hardening only.


668-705: Monitoring loop: restarted worker uses same port but nginx upstream is static.

The monitoring loop restarts crashed workers via start_worker which reuses the original port. This works because the restarted worker binds to the same port that nginx already has in its upstream — no config reload needed. This is a correct and practical design choice.

One edge consideration: if a restarted worker fails to bind (port still held by the dying process, or network blocking active), the monitoring loop will re-attempt every 10 seconds indefinitely. With network blocking, the warning is emitted (Lines 687–691). Without network blocking, the retry is silent after the first warning. Consider adding a retry counter or backoff to avoid log flooding for persistently-failing workers.
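
A retry counter along the lines suggested above might look like the sketch below. `MAX_WORKER_RESTARTS` and `may_restart_worker` are hypothetical names and the cap of 3 is arbitrary; `WORKER_RETRIES` follows the array name proposed later in this review.

```bash
# Sketch: cap restart attempts per worker so a persistently failing
# worker stops generating log noise.
MAX_WORKER_RESTARTS=3
declare -a WORKER_RETRIES=()

# Returns 0 if worker <idx> may be restarted, 1 once its budget is spent.
# The suppression message is logged once, on the first refused attempt.
may_restart_worker() {
    local idx="$1"
    local count="${WORKER_RETRIES[$idx]:-0}"
    if ((count >= MAX_WORKER_RESTARTS)); then
        if ((count == MAX_WORKER_RESTARTS)); then
            echo "worker $idx: restart limit reached, giving up" >&2
            WORKER_RETRIES[$idx]=$((count + 1))
        fi
        return 1
    fi
    WORKER_RETRIES[$idx]=$((count + 1))
    return 0
}
```

The monitoring loop would call `may_restart_worker "$idx" || continue` before start_worker. The call has to happen in the loop's own shell, not inside `$(...)`, or the counter update is lost with the subshell.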

@gwarmstrong gwarmstrong requested a review from Kipok February 7, 2026 00:09
@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment

break
fi

cat "$endpoints_file" | xargs -P 64 -I {} sh -c '

useless cat - xargs can read from file directly

Suggested change
- cat "$endpoints_file" | xargs -P 64 -I {} sh -c '
+ xargs -P 64 -I {} sh -c '


@Kipok Kipok left a comment

please rebuild sandbox and run slurm tests. If those pass, we should be good to merge

gwarmstrong and others added 3 commits February 6, 2026 17:05
Address PR review comments:
- Remove useless cat pipe in port report check (grep reads file directly)
- Scope rm to current node's port file only, preventing a race where
  a slow-starting node deletes port files already written by faster nodes

Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
…#1186)

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
@gwarmstrong gwarmstrong force-pushed the georgea/multinode-sandbox-pr branch from 4f89d40 to 5cdca71 Compare February 7, 2026 01:05
@greptile-apps greptile-apps bot left a comment

3 files reviewed, 3 comments


greptile-apps bot commented Feb 7, 2026

Additional Comments (3)

nemo_skills/inference/generate.py
Missing key validation

STRUCTURED_OUTPUTS[self.cfg.structured_output] will raise a KeyError for any typo/unsupported value passed via ++structured_output=... (including from --extra_judge_args). Since this is user-facing CLI config, it should fail with a clear error listing valid keys rather than a raw KeyError stacktrace.



tests/test_generation.py
Integration test hits external API

This test invokes ns eval against https://integrate.api.nvidia.com/v1 and real models. In CI (and for most contributors) this will deterministically fail due to missing credentials/network access, and it’s also very slow/flaky. Unit tests here should mock the model client / generation call (or be marked to skip unless explicit env vars are present).


.github/workflows/tests.yml
CI now prepares HLE dataset

ns prepare_data ... hle changes CI to download/prepare an additional dataset. If HLE requires credentials, large downloads, or is otherwise not reliably available in CI, this will break the workflow. Please confirm hle is lightweight/public like the other CI datasets, or gate it behind a condition/fixture used only in dedicated tests.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Around line 651-659: The port display is misleading because
ACTUAL_WORKER_PORTS may be non-contiguous after conflict retries; update the
echo logic in the IS_MASTER/WORKER blocks to either print the full list of
ACTUAL_WORKER_PORTS (iterating the ACTUAL_WORKER_PORTS array up to NUM_WORKERS)
or detect non-contiguity and append a "(non-contiguous ports: ...)" note;
specifically change the lines that currently emit
"${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}" and the
similar WORKER echo to instead join and print each ACTUAL_WORKER_PORTS[i] (or
print the range plus a non-contiguous warning) so operators see the real ports
when retries/offsets occurred.
- Around line 685-694: When a worker dies and NETWORK_BLOCKING_ACTIVE=1, avoid
immediately restarting it to prevent an infinite futile restart loop; modify the
monitor logic around the kill-check and restart (the block that calls
start_worker and updates WORKER_PIDS and ACTUAL_WORKER_PORTS) to either skip
restarting when NETWORK_BLOCKING_ACTIVE is set or implement a per-worker retry
cap (e.g., track restart counts in an array like WORKER_RETRIES and stop
restarting after N attempts) and log a single clear message when restarts are
suppressed; ensure you reference NETWORK_BLOCKING_ACTIVE, start_worker,
WORKER_PIDS, ACTUAL_WORKER_PORTS (and add WORKER_RETRIES if using the cap) when
making the change so the monitor stops spawning doomed processes.
- Line 144: The script uses a bare redirection of the form >
/var/log/worker${i}.log which is not POSIX-portable; replace each bare
redirection of the worker log (the occurrences that use >
/var/log/worker${i}.log) with an explicit no-op redirection using the shell
builtin (i.e., use a ":" no-op followed by the redirection) so the intent is
explicit and shellcheck SC2188 is satisfied; apply the same change for the other
matching occurrences with the worker log filenames.
🧹 Nitpick comments (2)
dockerfiles/sandbox/start-with-nginx.sh (2)

254-261: xargs -I {} sh -c '…{}' is a shell-injection anti-pattern.

Although endpoints are internally generated, interpolating {} directly inside sh -c breaks if the value ever contains quotes or shell metacharacters. Pass the value as a positional parameter instead.

Proposed fix
-        cat "$endpoints_file" | xargs -P 64 -I {} sh -c '
-            endpoint="{}"
-            status_file="$REMOTE_HEALTH_DIR/$(echo "$endpoint" | tr ":" "_")"
+        cat "$endpoints_file" | xargs -P 64 -I {} sh -c '
+            endpoint="$1"
+            status_file="$REMOTE_HEALTH_DIR/$(echo "$endpoint" | tr ":" "_")"
             [ -f "$status_file" ] && exit 0
             if curl -s -f --connect-timeout 2 --max-time 5 "http://${endpoint}/health" > /dev/null 2>&1; then
                 touch "$status_file"
             fi
-        '
+        ' _ {}

668-679: Background stats subshell PID is not tracked by cleanup().

The monitoring subshell spawned here won't be killed by the cleanup trap. In a container this is benign (container stop kills all), but for completeness consider capturing the PID and adding it to the kill list.
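
One way to cover both this point and the earlier trap-overwrite fix is a single cleanup() that owns every background PID and the temp dir. The sketch below assumes the names `BACKGROUND_PIDS`, `TEMP_DIR`, and `start_stats_loop`, none of which come from the script.

```bash
# Sketch: one cleanup() handles all background PIDs and the temp dir.
declare -a BACKGROUND_PIDS=()
TEMP_DIR=""

cleanup() {
    local pid
    for pid in "${BACKGROUND_PIDS[@]}"; do
        kill "$pid" 2>/dev/null || true
    done
    if [ -n "$TEMP_DIR" ]; then
        rm -rf "$TEMP_DIR"
    fi
}

# Signals only request an exit; the EXIT trap then runs cleanup exactly
# once per exit path, and no trap replaces another.
trap cleanup EXIT
trap 'exit 130' INT
trap 'exit 143' TERM

# Stand-in for the stats loop: start it and record its PID for cleanup.
start_stats_loop() {
    ( while :; do sleep 1; done ) &
    BACKGROUND_PIDS+=("$!")
}
```

Call `start_stats_loop` directly rather than through command substitution, so the PID lands in the parent shell's array.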

echo "cheaper = ${UWSGI_CHEAPER}" >> /tmp/worker${i}_uwsgi.ini
fi

> /var/log/worker${i}.log

⚠️ Potential issue | 🟡 Minor

Bare redirection without a command (SC2188).

> file works in bash, but the bare form behaves differently in some shells (for example, zsh runs its null command instead of truncating) and triggers shellcheck SC2188. Use : > to make intent explicit.

Proposed fix
-    > /var/log/worker${i}.log
+    : > /var/log/worker${i}.log

Same pattern on lines 560, 577, and 594.

🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 144-144: This redirection doesn't have a command. Move to its command (or use 'true' as no-op).

(SC2188)


Comment on lines +651 to +659
if [ "$IS_MASTER" = "1" ]; then
echo "=== Sandbox ready (MASTER) ==="
echo " Nginx LB: http://localhost:$NGINX_PORT"
echo " Nodes: $NODE_COUNT | Workers/node: $NUM_WORKERS | Total: $((NODE_COUNT * NUM_WORKERS))"
echo " Local ports: ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}"
else
echo "UWSGI cheaper mode: disabled"
echo "=== Sandbox ready (WORKER) ==="
echo " Proxy: localhost:$NGINX_PORT -> $MASTER_NODE:$NGINX_PORT"
echo " Local workers: $NUM_WORKERS (ports ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]})"

⚠️ Potential issue | 🟡 Minor

Port range display is misleading after conflict retries.

After the retry algorithm applies offsets (e.g., +200), ports are no longer contiguous. Displaying ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]} implies a contiguous range, which could confuse operators. Consider listing all ports or noting "non-contiguous" when retries occurred.
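
A small helper could cover both the non-contiguous case and the empty-array case raised earlier. The name `describe_ports` and its output format are illustrative assumptions.

```bash
# Sketch: summarize a port list without implying a range that does not exist.
describe_ports() {
    local -a ports=("$@")
    local n=${#ports[@]} i
    if ((n == 0)); then
        echo "none"
        return 0
    fi
    for ((i = 1; i < n; i++)); do
        if ((ports[i] != ports[i-1] + 1)); then
            # Gap found: list every port, comma-separated.
            local IFS=,
            echo "${ports[*]} (non-contiguous)"
            return 0
        fi
    done
    if ((n == 1)); then
        echo "${ports[0]}"
    else
        echo "${ports[0]}-${ports[n-1]}"
    fi
}
```

The ready banner would then read `echo " Local ports: $(describe_ports "${ACTUAL_WORKER_PORTS[@]}")"`, printing a range only when the ports really are consecutive.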


Comment on lines 685 to +694
  if ! kill -0 "$pid" 2>/dev/null; then
- echo "WARNING: Worker $i (PID $pid) died - restarting..."
- new_pid=$(start_worker $i)
- WORKER_PIDS[$idx]=$new_pid
+ echo "[$_H] WARNING: Worker $i (PID $pid) died — restarting..."
+ if [ "$NETWORK_BLOCKING_ACTIVE" = "1" ]; then
+ echo "[$_H] WARNING: Network blocking (ld.so.preload) is active. The restarted"
+ echo "[$_H] worker may fail to bind its port because socket() is blocked for"
+ echo "[$_H] new processes. Remaining workers continue serving requests."
+ fi
+ result=$(start_worker $i)
+ WORKER_PIDS[$idx]="${result%%:*}"
+ ACTUAL_WORKER_PORTS[$idx]="${result##*:}"

⚠️ Potential issue | 🟠 Major

Futile restart loop when network blocking is active.

When NETWORK_BLOCKING_ACTIVE=1, restarted workers cannot bind their socket (as documented in lines 627-629 and the warning itself). Yet the restart still proceeds at line 692, the new process dies, and the next monitoring iteration restarts it again — an infinite loop of doomed spawns generating log noise every 10 seconds.

Consider skipping the restart (or capping per-worker retries) when network blocking is active:

Proposed fix
         if ! kill -0 "$pid" 2>/dev/null; then
             echo "[$_H] WARNING: Worker $i (PID $pid) died — restarting..."
             if [ "$NETWORK_BLOCKING_ACTIVE" = "1" ]; then
                 echo "[$_H] WARNING: Network blocking (ld.so.preload) is active. The restarted"
                 echo "[$_H]   worker may fail to bind its port because socket() is blocked for"
-                echo "[$_H]   new processes. Remaining workers continue serving requests."
+                echo "[$_H]   new processes. Skipping restart — remaining workers continue serving."
+                continue
             fi
             result=$(start_worker $i)
             WORKER_PIDS[$idx]="${result%%:*}"
             ACTUAL_WORKER_PORTS[$idx]="${result##*:}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if ! kill -0 "$pid" 2>/dev/null; then
echo "WARNING: Worker $i (PID $pid) died - restarting..."
new_pid=$(start_worker $i)
WORKER_PIDS[$idx]=$new_pid
echo "[$_H] WARNING: Worker $i (PID $pid) died — restarting..."
if [ "$NETWORK_BLOCKING_ACTIVE" = "1" ]; then
echo "[$_H] WARNING: Network blocking (ld.so.preload) is active. The restarted"
echo "[$_H] worker may fail to bind its port because socket() is blocked for"
echo "[$_H] new processes. Remaining workers continue serving requests."
fi
result=$(start_worker $i)
WORKER_PIDS[$idx]="${result%%:*}"
ACTUAL_WORKER_PORTS[$idx]="${result##*:}"
if ! kill -0 "$pid" 2>/dev/null; then
echo "[$_H] WARNING: Worker $i (PID $pid) died — restarting..."
if [ "$NETWORK_BLOCKING_ACTIVE" = "1" ]; then
echo "[$_H] WARNING: Network blocking (ld.so.preload) is active. The restarted"
echo "[$_H] worker may fail to bind its port because socket() is blocked for"
echo "[$_H] new processes. Skipping restart — remaining workers continue serving."
continue
fi
result=$(start_worker $i)
WORKER_PIDS[$idx]="${result%%:*}"
ACTUAL_WORKER_PORTS[$idx]="${result##*:}"
🤖 Prompt for AI Agents
In `@dockerfiles/sandbox/start-with-nginx.sh` around lines 685 - 694, When a
worker dies and NETWORK_BLOCKING_ACTIVE=1, avoid immediately restarting it to
prevent an infinite futile restart loop; modify the monitor logic around the
kill-check and restart (the block that calls start_worker and updates
WORKER_PIDS and ACTUAL_WORKER_PORTS) to either skip restarting when
NETWORK_BLOCKING_ACTIVE is set or implement a per-worker retry cap (e.g., track
restart counts in an array like WORKER_RETRIES and stop restarting after N
attempts) and log a single clear message when restarts are suppressed; ensure
you reference NETWORK_BLOCKING_ACTIVE, start_worker, WORKER_PIDS,
ACTUAL_WORKER_PORTS (and add WORKER_RETRIES if using the cap) when making the
change so the monitor stops spawning doomed processes.
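
The retry-cap alternative described in the prompt above could look roughly like the sketch below. It is not code from the PR: `MAX_WORKER_RESTARTS`, `WORKER_RESTART_SUPPRESSED`, and `should_restart_worker` are hypothetical names, and `WORKER_RETRIES` is the array the prompt suggests adding. The function refuses a restart when `NETWORK_BLOCKING_ACTIVE=1` or when the per-worker cap is reached, and logs the suppression only once per worker.

```shell
#!/usr/bin/env bash
# Sketch only: MAX_WORKER_RESTARTS, WORKER_RESTART_SUPPRESSED and
# should_restart_worker are hypothetical names, not part of the PR.
MAX_WORKER_RESTARTS="${MAX_WORKER_RESTARTS:-3}"
declare -a WORKER_RETRIES=()
declare -a WORKER_RESTART_SUPPRESSED=()

# Returns 0 if the worker at array index $1 may be restarted, 1 otherwise.
# Must be called in the current shell (not in $(...)) so the counters persist.
should_restart_worker() {
    local idx=$1
    local count="${WORKER_RETRIES[$idx]:-0}"

    if [ "${NETWORK_BLOCKING_ACTIVE:-0}" = "1" ] || [ "$count" -ge "$MAX_WORKER_RESTARTS" ]; then
        # Log a single message the first time restarts are suppressed.
        if [ -z "${WORKER_RESTART_SUPPRESSED[$idx]:-}" ]; then
            echo "WARNING: restarts suppressed for worker index $idx (attempts so far: $count)" >&2
            WORKER_RESTART_SUPPRESSED[$idx]=1
        fi
        return 1
    fi

    WORKER_RETRIES[$idx]=$((count + 1))
    return 0
}
```

In the monitor loop this would presumably wrap the existing restart, for example `should_restart_worker "$idx" || continue` placed just before the `start_worker` call.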

Signed-off-by: George Armstrong <georgea@nvidia.com>
@gwarmstrong gwarmstrong force-pushed the georgea/multinode-sandbox-pr branch from 415e024 to 44863ff Compare February 7, 2026 05:10

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 4 comments

PORTS_REPORT_DIR="/tmp/sandbox_ports_$$"
fi
mkdir -p "$PORTS_REPORT_DIR"
rm -f "$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt" 2>/dev/null || true

Shared port file clobber

rm -f "$PORTS_REPORT_DIR/${CURRENT_NODE_SHORT}_ports.txt" runs on every node at startup. In a multi-node SLURM job, nodes can start at different times; a late-starting node can delete an already-written port report from an early node, causing the master to block until timeout in wait_for_port_reports(). Remove this unconditional delete or scope it so only the current node can clear its own file in a race-safe way (e.g., write to a temp file and mv atomically).
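
A minimal sketch of the write-to-temp-then-`mv` approach the comment suggests. `write_port_report` is a hypothetical helper, and the one-port-per-line file format is an assumption, since the real format used by `start-with-nginx.sh` is not shown here. A rename within one directory is atomic on local POSIX filesystems; on a shared filesystem other nodes may still see the new file with some caching delay.

```shell
#!/usr/bin/env bash
# Sketch only: write_port_report is a hypothetical helper, and the
# one-port-per-line format is an assumption about the report file.
# Usage: write_port_report DIR NODE PORT [PORT...]
write_port_report() {
    local dir=$1 node=$2
    shift 2
    [ "$#" -gt 0 ] || { echo "ERROR: no ports to report for $node" >&2; return 1; }

    local final="$dir/${node}_ports.txt"
    local tmp
    # Create the temp file in the same directory so mv is a rename,
    # not a cross-filesystem copy.
    tmp=$(mktemp "$dir/.${node}_ports.XXXXXX") || return 1

    printf '%s\n' "$@" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Readers see either the previous complete file or the new one.
    mv -f "$tmp" "$final"
}
```

Because the rename replaces any stale report from a previous run, the separate `rm -f` at startup would no longer be needed under this scheme.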

echo "=== Sandbox ready (MASTER) ==="
echo " Nginx LB: http://localhost:$NGINX_PORT"
echo " Nodes: $NODE_COUNT | Workers/node: $NUM_WORKERS | Total: $((NODE_COUNT * NUM_WORKERS))"
echo " Local ports: ${ACTUAL_WORKER_PORTS[0]}-${ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]}"

NUM_WORKERS=0 breaks

This status line indexes ACTUAL_WORKER_PORTS[$((NUM_WORKERS-1))]. If NUM_WORKERS is 0 (explicitly or via nproc --all), this becomes [-1] and errors in bash. Guard earlier (validate NUM_WORKERS > 0 and exit with a clear error) before any array indexing/loops that assume at least 1 worker.
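
The early guard could be sketched as follows. `validate_num_workers` is a hypothetical helper name; only `NUM_WORKERS` comes from the script under review.

```shell
#!/usr/bin/env bash
# Sketch only: validate_num_workers is a hypothetical helper.
# Returns 0 when $1 is a positive integer, 1 (with a message) otherwise.
validate_num_workers() {
    local n=$1
    case "$n" in
        ''|*[!0-9]*)
            echo "ERROR: NUM_WORKERS must be a positive integer, got '$n'" >&2
            return 1
            ;;
    esac
    if [ "$n" -le 0 ]; then
        echo "ERROR: NUM_WORKERS must be greater than 0, got '$n'" >&2
        return 1
    fi
    return 0
}
```

Calling `validate_num_workers "$NUM_WORKERS" || exit 1` right after `NUM_WORKERS` is resolved would stop the script before any loop or array index assumes at least one worker.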

Comment on lines +308 to +313
if [ "$ALL_NODES" = "127.0.0.1" ] || [ "$CURRENT_NODE_SHORT" = "$MASTER_NODE_SHORT" ]; then
IS_MASTER=1
echo "[$_H] Role: MASTER | Nodes: $NODE_COUNT | Master: $MASTER_NODE"
else
IS_MASTER=0
echo "[$_H] Role: WORKER | Master: $MASTER_NODE"

Master detection mismatch

In SLURM mode, MASTER_NODE is taken from the expanded nodelist (e.g. node001), but the current node is detected via hostname and compared as CURRENT_NODE_SHORT == MASTER_NODE_SHORT. On many clusters hostname returns a different alias/FQDN than scontrol/nodelist uses (e.g. node001.cluster vs node001 or vice versa), which can cause multiple nodes to think they're master (or none). Prefer using SLURM-provided identity (e.g. SLURMD_NODENAME or scontrol show hostname $SLURM_NODEID output) for the local node name, and compare using the same naming scheme used to build ALL_NODES.
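
One way to compare names under a single scheme is sketched below. It prefers `SLURMD_NODENAME` (set by SLURM in the task environment) and falls back to `hostname`, stripping any domain suffix on both sides. `short_name`, `local_node_name`, and `is_master` are hypothetical helpers, and this only addresses FQDN-versus-short-name mismatches, not clusters where the hostname is an unrelated alias.

```shell
#!/usr/bin/env bash
# Sketch only: short_name, local_node_name and is_master are hypothetical
# helpers, not functions from the PR.

# Strip everything from the first dot onward (node001.cluster -> node001).
short_name() {
    printf '%s\n' "${1%%.*}"
}

# Prefer the SLURM-provided node name; fall back to the system hostname.
local_node_name() {
    if [ -n "${SLURMD_NODENAME:-}" ]; then
        short_name "$SLURMD_NODENAME"
    else
        short_name "$(hostname)"
    fi
}

# Usage: is_master MASTER_NODE  (returns 0 when this node is the master)
is_master() {
    [ "$(local_node_name)" = "$(short_name "$1")" ]
}
```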

Comment on lines +582 to +585
for endpoint in $(read_port_file "$node" "$port_file"); do
echo " server ${endpoint} max_fails=3 fail_timeout=30s;" >> $UPSTREAM_FILE
echo "$endpoint" >> "$ENDPOINTS_FILE"
done

Port file parsing unsafe

for endpoint in $(read_port_file ...) word-splits on whitespace. If a port file is empty/partial, read_port_file can emit blank lines and this loop will silently skip/merge tokens. Also, any unexpected whitespace will corrupt endpoints. Use a while IFS= read -r endpoint; do ...; done < <(read_port_file ...) pattern to preserve lines exactly and handle empty output deterministically.
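
The `while IFS= read -r` pattern could be sketched like this. `emit_upstream_servers` is a hypothetical helper that reads endpoints from stdin, and the `host:port` shape it validates is an assumption about what `read_port_file` emits.

```shell
#!/usr/bin/env bash
# Sketch only: emit_upstream_servers is a hypothetical helper, and the
# host:port endpoint format is an assumption about read_port_file output.
# Reads one endpoint per line on stdin, prints nginx "server" lines on stdout.
emit_upstream_servers() {
    local endpoint
    local re='^[A-Za-z0-9._-]+:[0-9]+$'
    # The "|| [ -n ... ]" clause keeps a final line that lacks a newline.
    while IFS= read -r endpoint || [ -n "$endpoint" ]; do
        [ -n "$endpoint" ] || continue          # skip blank lines explicitly
        if [[ ! $endpoint =~ $re ]]; then
            echo "WARNING: skipping malformed endpoint '$endpoint'" >&2
            continue
        fi
        printf '    server %s max_fails=3 fail_timeout=30s;\n' "$endpoint"
    done
}
```

In the script this might be wired up as `emit_upstream_servers < <(read_port_file "$node" "$port_file") >> "$UPSTREAM_FILE"`, so malformed or partial lines are reported instead of being silently merged.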

@gwarmstrong

@Kipok the gpt-oss test and super_49b tests pass (minus a couple unrelated RULER fluctuations)

@gwarmstrong gwarmstrong merged commit b19ba96 into main Feb 7, 2026
5 checks passed
@gwarmstrong gwarmstrong deleted the georgea/multinode-sandbox-pr branch February 7, 2026 05:40
sgunasekar added a commit that referenced this pull request Mar 11, 2026
commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

     Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>

3 participants