feat: migrate sandbox from uwsgi to gunicorn #1234
Conversation
```shell
    --backlog 100 \
    --access-logfile /var/log/worker${i}.log \
    --error-logfile /var/log/worker${i}.log \
    --log-level info \
```
trailing backslash unnecessary (line continues to closing paren)
```diff
-    --log-level info \
+    --log-level info
```
📝 Walkthrough

Migrates sandbox runtime from uWSGI to Gunicorn: switches the Docker base image, replaces uWSGI env vars with GUNICORN equivalents, rewrites startup orchestration to spawn Gunicorn workers with port coordination and health checks, updates tests and dataset sandbox envs, plus minor network-blocking comment/handler tweaks.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Docker as Container
    participant Start as start-with-nginx.sh
    participant PortChecker as port_is_free
    participant Gunicorn as Gunicorn Workers
    participant Nginx as nginx
    participant Health as Health Checks
    Docker->>Start: container start
    Start->>PortChecker: check SANDBOX_WORKER_BASE_PORT
    PortChecker-->>Start: port free / retry offsets
    Start->>Gunicorn: spawn N workers (per-worker ports, logs)
    Start->>Nginx: generate upstreams & config
    Nginx-->>Start: nginx started / ready
    Start->>Health: perform local/remote health checks
    Health-->>Start: health OK -> continue monitoring
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Line 138: In start-with-nginx.sh there is a bare redirection line
">/var/log/worker${i}.log" which triggers ShellCheck SC2188; replace the bare
redirection with an explicit truncate command using the shell no-op builtin
(i.e., use ":" to truncate the file) so the intent is clear and ShellCheck is
satisfied, updating the line that performs the redirection for
/var/log/worker${i}.log.
In `@nemo_skills/code_execution/local_sandbox/local_sandbox_server.py`:
- Around line 104-108: The current call signal.signal(signal.SIGINT,
signal.SIG_DFL) will let the OS kill the process instead of raising
KeyboardInterrupt, so change the SIGINT handler to use Python's
default_int_handler instead; locate the signal setup in local_sandbox_server.py
(the block mentioning TerminalInteractiveShell) and replace the SIG_DFL
installation with signal.signal(signal.SIGINT, signal.default_int_handler) so
the except KeyboardInterrupt in the shell worker (referenced by the
run_cell/interrupt handling in the shell worker/ShellManager code) becomes
reachable and returns the {"status":"interrupted"} response.
🧹 Nitpick comments (6)
tests/test_sandbox_network_blocking.py (1)
41-55: LGTM — dynamic port assignment for worker base port prevents conflicts with other services.

Minor note: `get_free_port(strategy="random")` doesn't verify the port is actually free on the host; it just picks a random number. There's a tiny chance `port` and `worker_base_port` could collide. The `start-with-nginx.sh` port-conflict handling should recover, but you could pass `exclude=[port]` to the second call for safety:

```diff
- worker_base_port = get_free_port(strategy="random")
+ worker_base_port = get_free_port(strategy="random", exclude=[port])
```

dockerfiles/Dockerfile.sandbox (2)
34-36: Add `--no-install-recommends` to reduce image size.

Since the base is `python:3.10-slim`, adding `--no-install-recommends` avoids pulling in unnecessary transitive packages and keeps the image lean.

Proposed fix

```diff
 RUN apt-get update && \
-    apt-get install -y nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
+    apt-get install -y --no-install-recommends nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
     rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*
```
83-97: LGTM — lock-file-based installs via `uv` are a solid improvement for reproducibility.

One minor inconsistency: line 102 still uses bare `pip install gdown` while everything else uses `uv`. Consider `uv pip install gdown` for consistency, though this is a build-time-only tool and doesn't affect the lockfile.

dockerfiles/sandbox/start-with-nginx.sh (3)
139-153: Verify that the tracked PID corresponds to the Gunicorn master process.

The `( cd /app && env ... gunicorn ... ) >> logfile 2>&1 &` pattern captures the subshell PID via `$!`. Because the outer redirection is applied to the subshell itself, bash may not be able to exec-optimize gunicorn into the subshell process. If the subshell PID differs from gunicorn's master PID, `kill -TERM` in `cleanup()` would kill the subshell but might orphan gunicorn workers.

In practice, when the subshell's only remaining work is the gunicorn command, the subshell should exit when gunicorn exits, so liveness checks (`kill -0`) work. But `SIGTERM` delivery could be an issue — the subshell receives it and exits without forwarding to gunicorn.

Consider using `exec` to replace the subshell:

Proposed fix

```diff
-    ( cd /app && env WORKER_NUM=$i \
-        gunicorn main:app \
+    ( cd /app && exec env WORKER_NUM=$i \
+        gunicorn main:app \
```
421-437: No guard after exhausting retries — worker will attempt to bind a possibly occupied port.

If `port_is_free` never succeeds within `MAX_STARTUP_RETRIES` iterations, the loop exits and `start_worker_fast` is called with the last (likely occupied) port. Gunicorn will fail to bind and die — caught by the monitoring loop, but a log message would help diagnose the situation.

Also, ShellCheck SC2034: `retry` is unused. You can use `_` as the loop variable.

Proposed fix

```diff
-  for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
+  for _ in $(seq 0 $MAX_STARTUP_RETRIES); do
     if port_is_free $port; then
       break
     fi
     CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
     port=$((port + PORT_INCREMENT))
     echo "[$_H] Port conflict on worker $i, trying port $port"
   done
+
+  if ! port_is_free $port; then
+    echo "[$_H] WARNING: Could not find free port for worker $i after $MAX_STARTUP_RETRIES retries (last tried: $port)"
+  fi
```
356-368: STATEFUL_SANDBOX=1 silently overrides user-provided GUNICORN_WORKERS.

If a user sets both `STATEFUL_SANDBOX=1` and `GUNICORN_WORKERS=4`, the value is silently forced to 1. Consider logging a warning when overriding a user-provided value:

```diff
 if [ "$STATEFUL_SANDBOX" -eq 1 ]; then
+  [ -n "$GUNICORN_WORKERS" ] && [ "$GUNICORN_WORKERS" -ne 1 ] && \
+    echo "WARNING: STATEFUL_SANDBOX=1 requires GUNICORN_WORKERS=1, overriding GUNICORN_WORKERS=$GUNICORN_WORKERS"
   GUNICORN_WORKERS=1
```
```python
# TerminalInteractiveShell installs a SIGINT handler that calls sys.exit(0)
# instead of raising KeyboardInterrupt when _executing is False (which is
# the case when run_cell is called programmatically). Restore the default
# handler so SIGINT raises KeyboardInterrupt and our except clause catches it.
signal.signal(signal.SIGINT, signal.SIG_DFL)
```
🧩 Analysis chain
🌐 Web query:
Python signal.SIG_DFL SIGINT behavior process termination
💡 Result:
In Python, there are two different “defaults” in play for SIGINT (Ctrl+C):
- Python's default SIGINT handler: CPython installs a handler that translates SIGINT into a `KeyboardInterrupt` exception (as long as the parent process hasn't changed SIGINT handling). So the process usually does not immediately die at the OS level; instead Python raises `KeyboardInterrupt` in the main thread. [1]
- `signal.SIG_DFL` (restore OS default): `signal.SIG_DFL` means "use the signal's default action". If you do `signal.signal(signal.SIGINT, signal.SIG_DFL)`, then SIGINT is no longer translated into `KeyboardInterrupt`; it reverts to the platform/OS default behavior. Python documents `SIG_DFL` as "perform the default function for the signal". [1]
On POSIX/Unix, the default action for SIGINT is process termination (i.e., the kernel terminates the process as if by _exit(), with wait status indicating termination by signal). [2]
Net effect:
- Default Python: Ctrl+C ⇒ `KeyboardInterrupt` (catchable in Python). [1]
- After `signal(SIGINT, SIG_DFL)`: Ctrl+C / SIGINT ⇒ process is terminated by the OS signal default, typically without a Python exception/traceback. [1][2]
Sources:
[1] Python signal module docs (SIGINT → KeyboardInterrupt, SIG_DFL restores default) (docs.python.org)
[2] POSIX/Open Group description of SIG_DFL default actions and termination semantics (pubs.opengroup.org)
Citations:
- 1: https://docs.python.org/3/library/signal.html
- 2: https://pubs.opengroup.org/onlinepubs/9799919799/functions/V2_chap02.html?utm_source=openai
🌐 Web query:
Python signal.default_int_handler KeyboardInterrupt
💡 Result:
signal.default_int_handler is Python’s built-in default handler for SIGINT (typically generated by pressing Ctrl+C). Its behavior is simply to raise KeyboardInterrupt. [1]
Typical uses:
- Restore Python's Ctrl+C behavior after installing a custom handler:

  ```python
  import signal
  signal.signal(signal.SIGINT, signal.default_int_handler)
  ```

  This makes `SIGINT` raise `KeyboardInterrupt` again (instead of whatever your custom handler did). [1]
- Note: `signal.signal(SIGINT, signal.SIG_DFL)` does not restore Python's `KeyboardInterrupt` behavior; it restores the OS default action (usually terminating the process). To restore Python's behavior, use `signal.default_int_handler` (or save/restore the previous handler). [4]
- Why you might not see it installed: Python only installs its default `SIGINT` → `KeyboardInterrupt` translation if the parent process hasn't already set `SIGINT` to "ignore" (`SIG_IGN`). In that case, Python inherits the ignore disposition and won't translate `SIGINT` into `KeyboardInterrupt`. [1], [8]
Sources: Python signal docs [1]; Stack Overflow on restoring default SIGINT handling [4]; Stack Overflow on inherited SIG_IGN preventing KeyboardInterrupt [8].
🏁 Script executed:
```shell
find . -name "local_sandbox_server.py" | head -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 283
🏁 Script executed:
```shell
fd "local_sandbox_server.py" -t f
```

Repository: NVIDIA-NeMo/Skills
Length of output: 283
Bug: signal.SIG_DFL terminates the process instead of raising KeyboardInterrupt.
signal.SIG_DFL restores the OS-level default for SIGINT, which is process termination on POSIX systems — it does not raise KeyboardInterrupt. This means the except KeyboardInterrupt block on line 142 is unreachable; when the ShellManager sends SIGINT, the shell worker will be killed by the OS rather than gracefully returning an {"status": "interrupted"} response.
Use signal.default_int_handler — Python's built-in handler that raises KeyboardInterrupt:
Proposed fix
```diff
- signal.signal(signal.SIGINT, signal.SIG_DFL)
+ signal.signal(signal.SIGINT, signal.default_int_handler)
```
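The difference can be exercised with a small standalone sketch (hypothetical names; `run_with_interrupt` stands in for the shell worker's run loop, not the repo's actual code): with `default_int_handler` installed, a SIGINT delivered to the process surfaces as a catchable `KeyboardInterrupt`.

```python
import os
import signal

# Install Python's default handler so SIGINT raises KeyboardInterrupt
# (with SIG_DFL, the same signal would terminate the process instead).
signal.signal(signal.SIGINT, signal.default_int_handler)

def run_with_interrupt():
    try:
        # Simulate ShellManager sending SIGINT to the worker on timeout.
        os.kill(os.getpid(), signal.SIGINT)
        return {"status": "completed"}
    except KeyboardInterrupt:
        return {"status": "interrupted"}

print(run_with_interrupt())  # {'status': 'interrupted'}
```

With `SIG_DFL` instead, the `os.kill` line would terminate the process before the `except` clause could run, which is exactly the regression described above.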
```shell
# Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
  if port_is_free $port; then
    break
  fi
  CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
  port=$((port + PORT_INCREMENT))
```
Potential race condition between the `port_is_free` check and `start_worker_fast` binding: if another process binds the port in this gap, gunicorn will still retry 5 times (~5s delay). The old approach with post-spawn retry handled this by checking logs after the process started. Consider verifying that the worker actually started successfully on the expected port.
Suggested change:

```diff
 # Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
 for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
   if port_is_free $port; then
+    result=$(start_worker_fast $i $port)
+    WORKER_PIDS[$((i - 1))]="${result%%:*}"
+    ACTUAL_WORKER_PORTS[$((i - 1))]=$port
     break
   fi
   CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
   port=$((port + PORT_INCREMENT))
   echo "[$_H] Port conflict on worker $i, trying port $port"
 done
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Around line 414-437: The port-conflict loop may exhaust all retries and still
call start_worker_fast on an occupied port; update the loop that uses
port_is_free and MAX_STARTUP_RETRIES to use a throwaway loop variable (_)
instead of retry, and after the loop add an explicit exhaustion check: if no
free port was found within MAX_STARTUP_RETRIES increments, log a clear
error/warning including the worker index and attempted port(s), increment
CONFLICT_COUNT as needed, and either skip starting that worker (set
WORKER_PIDS[...] empty and ACTUAL_WORKER_PORTS[...] to an empty/invalid marker)
or exit with a nonzero status so failure is visible; keep references to
start_worker_fast, WORKER_PIDS, ACTUAL_WORKER_PORTS, PORT_INCREMENT and
CONFLICT_COUNT when implementing the handling.
In `@tests/test_sandbox_network_blocking.py`:
- Around line 41-42: The test picks two free ports independently causing a
possible collision: when creating port and worker_base_port using
get_free_port(strategy="random"), change the second allocation to avoid the
first (e.g., call get_free_port(strategy="random", exclude=[port]) or otherwise
pass an exclude argument with port) or derive worker_base_port deterministically
from port (e.g., port+1 after verifying availability) so that get_free_port and
the variables port and worker_base_port cannot return the same value.
🧹 Nitpick comments (4)
dockerfiles/Dockerfile.sandbox (2)
35-37: Add `--no-install-recommends` to reduce image size.

On a slim base image, recommended packages can add significant bloat. This was also flagged by Trivy (DS-0029).

Proposed fix

```diff
 RUN apt-get update && \
-    apt-get install -y nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
+    apt-get install -y --no-install-recommends nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
     rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*
```
84-85: Consider pinning the `uv` version for reproducible builds.

The lockfile approach gives you reproducible Python deps, but the installer itself (`uv`) is fetched at whatever version is latest at build time. A breaking change in `uv` could cause build failures.

Proposed fix (example version)

```diff
-RUN curl -LsSf https://astral.sh/uv/install.sh | sh
+RUN curl -LsSf https://astral.sh/uv/0.6.6/install.sh | sh
```

dockerfiles/sandbox/start-with-nginx.sh (2)
65-77: Good deprecation path for uWSGI env vars.

The fallback from `UWSGI_PROCESSES` → `GUNICORN_WORKERS` via `: "${GUNICORN_WORKERS:=$UWSGI_PROCESSES}"` is clean. Note that when `STATEFUL_SANDBOX=1` (the default), line 361 unconditionally sets `GUNICORN_WORKERS=1`, so the migrated value is silently overridden. This is correct behavior (stateful mode needs exactly 1 worker), but consider mentioning it in the deprecation warning so users aren't confused.
142-152: Gunicorn logs may be duplicated to the worker log file.

`--access-logfile` and `--error-logfile` both point to `/var/log/worker${i}.log`, while the subshell's stdout/stderr are also redirected (`>>`) to the same file. Gunicorn's console output (which includes startup messages and worker lifecycle events) will be captured by both mechanisms, resulting in duplicate lines.

Consider either dropping the `--access-logfile`/`--error-logfile` flags and relying solely on the shell redirect, or redirecting stdout/stderr to `/dev/null` and relying solely on gunicorn's log flags.
```shell
# Phase 1: Pre-check ports and spawn all workers simultaneously.
# Gunicorn retries binding 5 times with 1s sleep (hardcoded in gunicorn/sock.py),
# so a port conflict wastes ~5s per worker. Pre-checking with ss avoids this.
echo "[$_H] Starting $NUM_WORKERS workers (ports $SANDBOX_WORKER_BASE_PORT-$((SANDBOX_WORKER_BASE_PORT + NUM_WORKERS - 1)))..."
START_SPAWN=$(date +%s)
CONFLICT_COUNT=0

for i in $(seq 1 $NUM_WORKERS); do
  port=$((SANDBOX_WORKER_BASE_PORT + i - 1))
  result=$(start_worker_fast $i $port)
  WORKER_PIDS[$((i - 1))]="${result%%:*}"
  ACTUAL_WORKER_PORTS[$((i - 1))]=$port
done

echo "[$_H] All $NUM_WORKERS workers spawned in $(($(date +%s) - START_SPAWN))s"

# Phase 2: Retry workers that failed due to port conflicts
retry_round=0
while [ $retry_round -lt $MAX_STARTUP_RETRIES ]; do
  sleep 1

  FAILED_WORKERS=()
  for i in $(seq 1 $NUM_WORKERS); do
    idx=$((i - 1))
    worker_is_alive "${WORKER_PIDS[$idx]}" && continue
    worker_had_port_conflict $i && FAILED_WORKERS+=($i)
  done

  [ ${#FAILED_WORKERS[@]} -eq 0 ] && break

  PORT_OFFSET=$(( (retry_round + 1) * PORT_INCREMENT ))
  echo "[$_H] Retry $((retry_round + 1)): ${#FAILED_WORKERS[@]} port conflicts, offset +$PORT_OFFSET"

  for i in "${FAILED_WORKERS[@]}"; do
    idx=$((i - 1))
    new_port=$((SANDBOX_WORKER_BASE_PORT + i - 1 + PORT_OFFSET))
    result=$(start_worker_fast $i $new_port)
    WORKER_PIDS[$idx]="${result%%:*}"
    ACTUAL_WORKER_PORTS[$idx]=$new_port
  # Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
  for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
    if port_is_free $port; then
      break
    fi
    CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
    port=$((port + PORT_INCREMENT))
    echo "[$_H] Port conflict on worker $i, trying port $port"
  done

  retry_round=$((retry_round + 1))
  result=$(start_worker_fast $i $port)
  WORKER_PIDS[$((i - 1))]="${result%%:*}"
  ACTUAL_WORKER_PORTS[$((i - 1))]=$port
done
```
Port conflict resolution loop has no exhaustion guard.
If port_is_free never succeeds (all candidate ports occupied), the loop silently falls through and start_worker_fast is called on a busy port. After MAX_STARTUP_RETRIES offset increments, consider failing explicitly or logging a clear warning that the worker will likely fail to bind.
Also, the retry variable on line 425 is unused (ShellCheck SC2034) — use _ as the loop variable.
Proposed fix

```diff
-  for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
+  for _ in $(seq 0 $MAX_STARTUP_RETRIES); do
     if port_is_free $port; then
       break
     fi
     CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
     port=$((port + PORT_INCREMENT))
     echo "[$_H] Port conflict on worker $i, trying port $port"
   done
+
+  if ! port_is_free $port; then
+    echo "[$_H] WARNING: Could not find a free port for worker $i after $MAX_STARTUP_RETRIES retries (last tried $port)"
+  fi
```

📝 Committable suggestion
```shell
# Phase 1: Pre-check ports and spawn all workers simultaneously.
# Gunicorn retries binding 5 times with 1s sleep (hardcoded in gunicorn/sock.py),
# so a port conflict wastes ~5s per worker. Pre-checking with ss avoids this.
echo "[$_H] Starting $NUM_WORKERS workers (ports $SANDBOX_WORKER_BASE_PORT-$((SANDBOX_WORKER_BASE_PORT + NUM_WORKERS - 1)))..."
START_SPAWN=$(date +%s)
CONFLICT_COUNT=0
for i in $(seq 1 $NUM_WORKERS); do
  port=$((SANDBOX_WORKER_BASE_PORT + i - 1))
  # Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
  for _ in $(seq 0 $MAX_STARTUP_RETRIES); do
    if port_is_free $port; then
      break
    fi
    CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
    port=$((port + PORT_INCREMENT))
    echo "[$_H] Port conflict on worker $i, trying port $port"
  done
  if ! port_is_free $port; then
    echo "[$_H] WARNING: Could not find a free port for worker $i after $MAX_STARTUP_RETRIES retries (last tried $port)"
  fi
  result=$(start_worker_fast $i $port)
  WORKER_PIDS[$((i - 1))]="${result%%:*}"
  ACTUAL_WORKER_PORTS[$((i - 1))]=$port
done
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 425-425: retry appears unused. Verify use (or export if used externally).
(SC2034)
```python
port = get_free_port(strategy="random")
worker_base_port = get_free_port(strategy="random")
```
Potential port collision between port and worker_base_port.
Both ports are chosen independently with strategy="random" and neither uses the exclude parameter to avoid the other. In unlucky runs the two could collide, causing a flaky test failure.
Proposed fix

```diff
 port = get_free_port(strategy="random")
-worker_base_port = get_free_port(strategy="random")
+worker_base_port = get_free_port(exclude=[port], strategy="random")
```
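A standalone sketch of collision-free allocation (hypothetical `free_port` helper; the repo's `get_free_port` signature is taken from the review text and not verified here): bind to port 0 to get an OS-assigned free port, and retry when the result matches an excluded one.

```python
import socket

def free_port(exclude=()):
    # Ask the OS for a free port; retry if it collides with excluded ports.
    while True:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("127.0.0.1", 0))
            candidate = s.getsockname()[1]
        if candidate not in exclude:
            return candidate

port = free_port()
worker_base_port = free_port(exclude=[port])
print(port, worker_base_port)
```

Binding to port 0 also guarantees the port was free at allocation time, unlike a purely random pick.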
Replace tiangolo/uwsgi-nginx-flask with python:3.10-slim and gunicorn. Add nginx as system package, remove all uwsgi configuration (UWSGI_PROCESSES, UWSGI_CHEAPER, UWSGI_CPU_AFFINITY, uwsgi_read_timeout). Add deprecation warnings for old uwsgi env vars in start-with-nginx.sh. Fix SIGINT handler in shell_worker for gunicorn compatibility. Update icpc/ioi dataset env vars to use GUNICORN_WORKERS. Signed-off-by: George Armstrong <georgea@nvidia.com>
Kipok
left a comment
lgtm as long as all tests work well. Few AI comments from offline review that seem meaningful to check
- Gunicorn with 1024 workers (Medium concern)

  In icpc/`__init__.py` and ioi/`__init__.py`:

  `"GUNICORN_WORKERS=1024"`

  uWSGI's cheaper mode allowed scaling from 1 active process up to 1024 on demand. Gunicorn doesn't have an equivalent — it will fork all 1024 workers at startup. This could cause significant memory overhead and slow startup. Consider whether this is the intended behavior, or if a smaller worker count should be used. If the 1024 was only needed because uWSGI's cheaper mode made it cheap to declare a high max, this number likely needs to be reduced.
- worker_had_port_conflict is now dead code (Minor)
  The function at line 164 of start-with-nginx.sh:

  ```shell
  worker_had_port_conflict() {
    grep -q "Address already in use" /var/log/worker${1}.log 2>/dev/null
  }
  ```
This was used by the old Phase 2 retry loop which has been removed. The function is no longer called anywhere and should be deleted.
In shell_worker, resetting SIGINT to signal.SIG_DFL makes timeout interrupts terminate the shell process instead of raising KeyboardInterrupt. ShellManager.run_cell() sends SIGINT after timeout expecting the shell to handle interruption and return an interrupted response; with SIG_DFL, this path turns into EOFError + forced shell restart, dropping session state on timeouts. This regresses persistent-session behavior and should restore Python's default interrupt handler (signal.default_int_handler) rather than the OS default action.
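On the worker-count concern above: if a fixed 1024 is dropped, one hedged option is Gunicorn's commonly cited `2 * CPUs + 1` sizing heuristic, capped to keep eager forking cheap (the cap value here is illustrative, not from the PR):

```python
import os

def suggested_workers(cap=64):
    # Gunicorn docs suggest (2 x num_cores) + 1 as a starting point;
    # cap it because every worker is forked eagerly at startup.
    cpus = os.cpu_count() or 1
    return min(2 * cpus + 1, cap)

print(suggested_workers())
```

Since Gunicorn has no equivalent of uWSGI's cheaper mode, the configured count is the steady-state memory footprint, so sizing to hardware rather than to a high on-demand ceiling is the safer default.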
♻️ Duplicate comments (1)
nemo_skills/dataset/icpc/__init__.py (1)
24-28: Same lazy-scaling concern as `ioi/__init__.py` applies here.

The uWSGI cheaper-mode removal and the eager-spawn behaviour of `GUNICORN_WORKERS=1024` are identical to the change in `nemo_skills/dataset/ioi/__init__.py`. See that file's review comment for the full analysis and the verification script. The changes themselves are structurally consistent across both dataset modules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/icpc/__init__.py` around lines 24 - 28, The SANDBOX_ENV_VARS constant in nemo_skills/dataset/icpc/__init__.py is setting GUNICORN_WORKERS=1024 which forces eager worker spawn and defeats lazy-scaling; change that entry to a sensible default (or remove it) so the system can rely on runtime worker-scaling (match the fix applied in ioi/__init__.py), and ensure NUM_WORKERS and STATEFUL_SANDBOX entries remain correct—update the SANDBOX_ENV_VARS list to remove or lower "GUNICORN_WORKERS=1024" and verify behavior of the module-level SANDBOX_ENV_VARS constant after the change.
Summary
- Replace `tiangolo/uwsgi-nginx-flask` base image with `python:3.10-slim` + nginx + gunicorn
- `start-with-nginx.sh`
- `shell_worker` for gunicorn compatibility
- `icpc`/`ioi` dataset env vars to use `GUNICORN_WORKERS`
- `sys.executable` usage

Depends on #1233 (lockfiles). Split from #1226.
Summary by CodeRabbit
Chores
Bug Fixes