Skip to content

feat: migrate sandbox from uwsgi to gunicorn#1234

Open
gwarmstrong wants to merge 5 commits intomainfrom
georgea/sandbox-gunicorn
Open

feat: migrate sandbox from uwsgi to gunicorn#1234
gwarmstrong wants to merge 5 commits intomainfrom
georgea/sandbox-gunicorn

Conversation

@gwarmstrong
Copy link
Collaborator

@gwarmstrong gwarmstrong commented Feb 11, 2026

Summary

  • Replace tiangolo/uwsgi-nginx-flask base image with python:3.10-slim + nginx + gunicorn
  • Remove all uwsgi configuration (UWSGI_PROCESSES, UWSGI_CHEAPER, UWSGI_CPU_AFFINITY, uwsgi_read_timeout)
  • Add deprecation warnings for old uwsgi env vars in start-with-nginx.sh
  • Fix SIGINT handler in shell_worker for gunicorn compatibility
  • Update icpc/ioi dataset env vars to use GUNICORN_WORKERS
  • Test fixes for port conflicts and sys.executable usage

Depends on #1233 (lockfiles). Split from #1226.

Summary by CodeRabbit

  • Chores

    • Switched to a lighter base image and simplified worker-related environment configuration.
    • Replaced legacy worker orchestration with a unified worker model and updated sandbox environment defaults for scaling.
    • Improved startup reliability with port-availability checks and coordinated worker port assignment.
    • Improved graceful shutdown and lifecycle handling.
  • Bug Fixes

    • Restored interrupt handling in the local sandbox for proper KeyboardInterrupt behavior and updated test invocation to use the system Python.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

--backlog 100 \
--access-logfile /var/log/worker${i}.log \
--error-logfile /var/log/worker${i}.log \
--log-level info \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trailing backslash unnecessary (line continues to closing paren)

Suggested change
--log-level info \
--log-level info

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 11, 2026

📝 Walkthrough

Walkthrough

Migrates sandbox runtime from uWSGI to Gunicorn: switches Docker base image, replaces uWSGI env vars with GUNICORN equivalents, rewrites startup orchestration to spawn Gunicorn workers with port coordination and health checks, updates tests and dataset sandbox envs, and minor network-blocking comment/handler tweaks.

Changes

Cohort / File(s) Summary
Docker base & runtime image
dockerfiles/Dockerfile.sandbox
Base image changed to python:3.10-slim; added NGINX_PORT=6000; removed multiple uWSGI-related ARG/ENV entries; apt/nginx install adjusted.
Startup orchestration
dockerfiles/sandbox/start-with-nginx.sh
Replaced uWSGI orchestration with Gunicorn workers, added port_is_free checks and retry/offset logic, deprecation warnings for uWSGI vars, per-worker log handling, nginx graceful shutdown, and multi-node upstream coordination.
Network blocking helper
dockerfiles/sandbox/block_network.c
Comment updated to reference Gunicorn instead of uWSGI; no functional change.
Local sandbox server
nemo_skills/code_execution/local_sandbox/local_sandbox_server.py
Restores default SIGINT handler after creating TerminalInteractiveShell; updates comment referencing Gunicorn.
Local sandbox runner
nemo_skills/code_execution/local_sandbox/start_local_sandbox.sh
Removed injection of legacy UWSGI env vars and added propagation of GUNICORN_WORKERS to docker run env.
Dataset sandbox envs
nemo_skills/dataset/icpc/__init__.py, nemo_skills/dataset/ioi/__init__.py
Replaced UWSGI-specific env entries with single GUNICORN_WORKERS=1024 in SANDBOX_ENV_VARS.
Tests
tests/test_sandbox_network_blocking.py
Fixture now computes/passes SANDBOX_WORKER_BASE_PORT; subprocess runs use sys.executable instead of hardcoded python3.

Sequence Diagram(s)

sequenceDiagram
    participant Docker as Container
    participant Start as start-with-nginx.sh
    participant PortChecker as port_is_free
    participant Gunicorn as Gunicorn Workers
    participant Nginx as nginx
    participant Health as Health Checks

    Docker->>Start: container start
    Start->>PortChecker: check SANDBOX_WORKER_BASE_PORT
    PortChecker-->>Start: port free / retry offsets
    Start->>Gunicorn: spawn N workers (per-worker ports, logs)
    Start->>Nginx: generate upstreams & config
    Nginx-->>Start: nginx started / ready
    Start->>Health: perform local/remote health checks
    Health-->>Start: health OK -> continue monitoring
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • Kipok
  • stephencge
  • activatedgeek
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately and concisely describes the main change: migrating the sandbox infrastructure from uWSGI to Gunicorn, which is the primary objective across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch georgea/sandbox-gunicorn

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Line 138: In start-with-nginx.sh there is a bare redirection line
">/var/log/worker${i}.log" which triggers ShellCheck SC2188; replace the bare
redirection with an explicit truncate command using the shell no-op builtin
(i.e., use ":" to truncate the file) so the intent is clear and ShellCheck is
satisfied, updating the line that performs the redirection for
/var/log/worker${i}.log.

In `@nemo_skills/code_execution/local_sandbox/local_sandbox_server.py`:
- Around line 104-108: The current call signal.signal(signal.SIGINT,
signal.SIG_DFL) will let the OS kill the process instead of raising
KeyboardInterrupt, so change the SIGINT handler to use Python's
default_int_handler instead; locate the signal setup in local_sandbox_server.py
(the block mentioning TerminalInteractiveShell) and replace the SIG_DFL
installation with signal.signal(signal.SIGINT, signal.default_int_handler) so
the except KeyboardInterrupt in the shell worker (referenced by the
run_cell/interrupt handling in the shell worker/ShellManager code) becomes
reachable and returns the {"status":"interrupted"} response.
🧹 Nitpick comments (6)
tests/test_sandbox_network_blocking.py (1)

41-55: LGTM — dynamic port assignment for worker base port prevents conflicts with other services.

Minor note: get_free_port(strategy="random") doesn't verify the port is actually free on the host; it just picks a random number. There's a tiny chance port and worker_base_port could collide. The start-with-nginx.sh port-conflict handling should recover, but you could pass exclude=[port] to the second call for safety:

-    worker_base_port = get_free_port(strategy="random")
+    worker_base_port = get_free_port(strategy="random", exclude=[port])
dockerfiles/Dockerfile.sandbox (2)

34-36: Add --no-install-recommends to reduce image size.

Since the base is python:3.10-slim, adding --no-install-recommends avoids pulling in unnecessary transitive packages and keeps the image lean.

Proposed fix
 RUN apt-get update && \
-    apt-get install -y nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
+    apt-get install -y --no-install-recommends nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
     rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*

83-97: LGTM — lock-file-based installs via uv are a solid improvement for reproducibility.

One minor inconsistency: line 102 still uses bare pip install gdown while everything else uses uv. Consider uv pip install gdown for consistency, though this is a build-time-only tool and doesn't affect the lockfile.

dockerfiles/sandbox/start-with-nginx.sh (3)

139-153: Verify that the tracked PID corresponds to the Gunicorn master process.

The ( cd /app && env ... gunicorn ... ) >> logfile 2>&1 & pattern captures the subshell PID via $!. Because the outer redirection is applied to the subshell itself, bash may not be able to exec-optimize gunicorn into the subshell process. If the subshell PID differs from gunicorn's master PID, kill -TERM in cleanup() would kill the subshell but might orphan gunicorn workers.

In practice, when the subshell's only remaining work is the gunicorn command, the subshell should exit when gunicorn exits, so liveness checks (kill -0) work. But SIGTERM delivery could be an issue — the subshell receives it and exits without forwarding to gunicorn.

Consider using exec to replace the subshell:

Proposed fix
-    ( cd /app && env WORKER_NUM=$i \
-        gunicorn main:app \
+    ( cd /app && exec env WORKER_NUM=$i \
+        gunicorn main:app \

421-437: No guard after exhausting retries — worker will attempt to bind a possibly occupied port.

If port_is_free never succeeds within MAX_STARTUP_RETRIES iterations, the loop exits and start_worker_fast is called with the last (likely occupied) port. Gunicorn will fail to bind and die — caught by the monitoring loop, but a log message would help diagnose the situation.

Also, ShellCheck SC2034: retry is unused. You can use _ as the loop variable.

Proposed fix
-    for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
+    for _ in $(seq 0 $MAX_STARTUP_RETRIES); do
         if port_is_free $port; then
             break
         fi
         CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
         port=$((port + PORT_INCREMENT))
         echo "[$_H] Port conflict on worker $i, trying port $port"
     done
+
+    if ! port_is_free $port; then
+        echo "[$_H] WARNING: Could not find free port for worker $i after $MAX_STARTUP_RETRIES retries (last tried: $port)"
+    fi

356-368: STATEFUL_SANDBOX=1 silently overrides user-provided GUNICORN_WORKERS.

If a user sets both STATEFUL_SANDBOX=1 and GUNICORN_WORKERS=4, the value is silently forced to 1. Consider logging a warning when overriding a user-provided value:

 if [ "$STATEFUL_SANDBOX" -eq 1 ]; then
+    [ -n "$GUNICORN_WORKERS" ] && [ "$GUNICORN_WORKERS" -ne 1 ] && \
+        echo "WARNING: STATEFUL_SANDBOX=1 requires GUNICORN_WORKERS=1, overriding GUNICORN_WORKERS=$GUNICORN_WORKERS"
     GUNICORN_WORKERS=1

Comment on lines +104 to +108
# TerminalInteractiveShell installs a SIGINT handler that calls sys.exit(0)
# instead of raising KeyboardInterrupt when _executing is False (which is
# the case when run_cell is called programmatically). Restore the default
# handler so SIGINT raises KeyboardInterrupt and our except clause catches it.
signal.signal(signal.SIGINT, signal.SIG_DFL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Python signal.SIG_DFL SIGINT behavior process termination

💡 Result:

In Python, there are two different “defaults” in play for SIGINT (Ctrl+C):

  • Python’s default SIGINT handler: CPython installs a handler that translates SIGINT into a KeyboardInterrupt exception (as long as the parent process hasn’t changed SIGINT handling). So the process usually does not immediately die at the OS level; instead Python raises KeyboardInterrupt in the main thread. [1]

  • signal.SIG_DFL (restore OS default): signal.SIG_DFL means “use the signal’s default action”. If you do:

    signal.signal(signal.SIGINT, signal.SIG_DFL)

    then SIGINT is no longer translated into KeyboardInterrupt; it reverts to the platform/OS default behavior. Python documents SIG_DFL as “perform the default function for the signal”. [1]

On POSIX/Unix, the default action for SIGINT is process termination (i.e., the kernel terminates the process as if by _exit(), with wait status indicating termination by signal). [2]

Net effect:

  • Default Python: Ctrl+C ⇒ KeyboardInterrupt (catchable in Python). [1]
  • After signal(SIGINT, SIG_DFL): Ctrl+C / SIGINT ⇒ process is terminated by the OS signal default, typically without a Python exception/traceback. [1][2]

Sources:
[1] Python signal module docs (SIGINT → KeyboardInterrupt, SIG_DFL restores default) (docs.python.org)
[2] POSIX/Open Group description of SIG_DFL default actions and termination semantics (pubs.opengroup.org)

Citations:


🌐 Web query:

Python signal.default_int_handler KeyboardInterrupt

💡 Result:

signal.default_int_handler is Python’s built-in default handler for SIGINT (typically generated by pressing Ctrl+C). Its behavior is simply to raise KeyboardInterrupt. [1]

Typical uses:

  • Restore Python’s Ctrl+C behavior after installing a custom handler

    import signal
    signal.signal(signal.SIGINT, signal.default_int_handler)

    This makes SIGINT raise KeyboardInterrupt again (instead of whatever your custom handler did). [1]

  • Note: signal.signal(SIGINT, signal.SIG_DFL) does not restore Python’s KeyboardInterrupt behavior; it restores the OS default action (usually terminating the process). To restore Python’s behavior, use signal.default_int_handler (or save/restore the previous handler). [4]

  • Why you might not see it installed: Python only installs its default SIGINTKeyboardInterrupt translation if the parent process hasn’t already set SIGINT to “ignore” (SIG_IGN). In that case, Python inherits the ignore disposition and won’t translate SIGINT into KeyboardInterrupt. [1], [8]

Sources: Python signal docs [1]; Stack Overflow on restoring default SIGINT handling [4]; Stack Overflow on inherited SIG_IGN preventing KeyboardInterrupt [8].


🏁 Script executed:

find . -name "local_sandbox_server.py" | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 283


🏁 Script executed:

fd "local_sandbox_server.py" -t f

Repository: NVIDIA-NeMo/Skills

Length of output: 283


Bug: signal.SIG_DFL terminates the process instead of raising KeyboardInterrupt.

signal.SIG_DFL restores the OS-level default for SIGINT, which is process termination on POSIX systems — it does not raise KeyboardInterrupt. This means the except KeyboardInterrupt block on line 142 is unreachable; when the ShellManager sends SIGINT, the shell worker will be killed by the OS rather than gracefully returning an {"status": "interrupted"} response.

Use signal.default_int_handler — Python's built-in handler that raises KeyboardInterrupt:

Proposed fix
-    signal.signal(signal.SIGINT, signal.SIG_DFL)
+    signal.signal(signal.SIGINT, signal.default_int_handler)
🤖 Prompt for AI Agents
In `@nemo_skills/code_execution/local_sandbox/local_sandbox_server.py` around
lines 104 - 108, The current call signal.signal(signal.SIGINT, signal.SIG_DFL)
will let the OS kill the process instead of raising KeyboardInterrupt, so change
the SIGINT handler to use Python's default_int_handler instead; locate the
signal setup in local_sandbox_server.py (the block mentioning
TerminalInteractiveShell) and replace the SIG_DFL installation with
signal.signal(signal.SIGINT, signal.default_int_handler) so the except
KeyboardInterrupt in the shell worker (referenced by the run_cell/interrupt
handling in the shell worker/ShellManager code) becomes reachable and returns
the {"status":"interrupted"} response.

@gwarmstrong gwarmstrong force-pushed the georgea/sandbox-gunicorn branch from b5a8316 to 21daa10 Compare February 11, 2026 21:50
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +424 to +430
# Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
if port_is_free $port; then
break
fi
CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
port=$((port + PORT_INCREMENT))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potential race condition between port_is_free check and start_worker_fast binding. if another process binds to the port in this gap, gunicorn will still retry 5 times (~5s delay). the old approach with post-spawn retry handled this by checking logs after the process started. consider checking if worker actually started successfully on the expected port.

Suggested change
# Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
if port_is_free $port; then
break
fi
CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
port=$((port + PORT_INCREMENT))
# Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
if port_is_free $port; then
result=$(start_worker_fast $i $port)
WORKER_PIDS[$((i - 1))]="${result%%:*}"
ACTUAL_WORKER_PORTS[$((i - 1))]=$port
break
fi
CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
port=$((port + PORT_INCREMENT))
echo "[$_H] Port conflict on worker $i, trying port $port"
done

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Around line 414-437: The port-conflict loop may exhaust all retries and still
call start_worker_fast on an occupied port; update the loop that uses
port_is_free and MAX_STARTUP_RETRIES to use a throwaway loop variable (_)
instead of retry, and after the loop add an explicit exhaustion check: if no
free port was found within MAX_STARTUP_RETRIES increments, log a clear
error/warning including the worker index and attempted port(s), increment
CONFLICT_COUNT as needed, and either skip starting that worker (set
WORKER_PIDS[...] empty and ACTUAL_WORKER_PORTS[...] to an empty/invalid marker)
or exit with a nonzero status so failure is visible; keep references to
start_worker_fast, WORKER_PIDS, ACTUAL_WORKER_PORTS, PORT_INCREMENT and
CONFLICT_COUNT when implementing the handling.

In `@tests/test_sandbox_network_blocking.py`:
- Around line 41-42: The test picks two free ports independently causing a
possible collision: when creating port and worker_base_port using
get_free_port(strategy="random"), change the second allocation to avoid the
first (e.g., call get_free_port(strategy="random", exclude=[port]) or otherwise
pass an exclude argument with port) or derive worker_base_port deterministically
from port (e.g., port+1 after verifying availability) so that get_free_port and
the variables port and worker_base_port cannot return the same value.
🧹 Nitpick comments (4)
dockerfiles/Dockerfile.sandbox (2)

35-37: Add --no-install-recommends to reduce image size.

On a slim base image, recommended packages can add significant bloat. This was also flagged by Trivy (DS-0029).

Proposed fix
 RUN apt-get update && \
-    apt-get install -y nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
+    apt-get install -y --no-install-recommends nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
     rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*

84-85: Consider pinning the uv version for reproducible builds.

The lockfile approach gives you reproducible Python deps, but the installer itself (uv) is fetched at whatever version is latest at build time. A breaking change in uv could cause build failures.

Proposed fix (example version)
-RUN curl -LsSf https://astral.sh/uv/install.sh | sh
+RUN curl -LsSf https://astral.sh/uv/0.6.6/install.sh | sh
dockerfiles/sandbox/start-with-nginx.sh (2)

65-77: Good deprecation path for uWSGI env vars.

The fallback from UWSGI_PROCESSESGUNICORN_WORKERS via : "${GUNICORN_WORKERS:=$UWSGI_PROCESSES}" is clean. Note that when STATEFUL_SANDBOX=1 (the default), line 361 unconditionally sets GUNICORN_WORKERS=1, so the migrated value is silently overridden. This is correct behavior (stateful mode needs exactly 1 worker), but consider mentioning it in the deprecation warning so users aren't confused.


142-152: Gunicorn logs may be duplicated to the worker log file.

--access-logfile and --error-logfile both point to /var/log/worker${i}.log, while the subshell's stdout/stderr are also redirected (>>) to the same file. Gunicorn's console output (which includes startup messages and worker lifecycle events) will be captured by both mechanisms, resulting in duplicate lines.

Consider either dropping the --access-logfile/--error-logfile flags and relying solely on the shell redirect, or redirecting stdout/stderr to /dev/null and relying solely on gunicorn's log flags.

Comment on lines +414 to 437
# Phase 1: Pre-check ports and spawn all workers simultaneously.
# Gunicorn retries binding 5 times with 1s sleep (hardcoded in gunicorn/sock.py),
# so a port conflict wastes ~5s per worker. Pre-checking with ss avoids this.
echo "[$_H] Starting $NUM_WORKERS workers (ports $SANDBOX_WORKER_BASE_PORT-$((SANDBOX_WORKER_BASE_PORT + NUM_WORKERS - 1)))..."
START_SPAWN=$(date +%s)
CONFLICT_COUNT=0

for i in $(seq 1 $NUM_WORKERS); do
port=$((SANDBOX_WORKER_BASE_PORT + i - 1))
result=$(start_worker_fast $i $port)
WORKER_PIDS[$((i - 1))]="${result%%:*}"
ACTUAL_WORKER_PORTS[$((i - 1))]=$port
done

echo "[$_H] All $NUM_WORKERS workers spawned in $(($(date +%s) - START_SPAWN))s"

# Phase 2: Retry workers that failed due to port conflicts
retry_round=0
while [ $retry_round -lt $MAX_STARTUP_RETRIES ]; do
sleep 1

FAILED_WORKERS=()
for i in $(seq 1 $NUM_WORKERS); do
idx=$((i - 1))
worker_is_alive "${WORKER_PIDS[$idx]}" && continue
worker_had_port_conflict $i && FAILED_WORKERS+=($i)
done

[ ${#FAILED_WORKERS[@]} -eq 0 ] && break

PORT_OFFSET=$(( (retry_round + 1) * PORT_INCREMENT ))
echo "[$_H] Retry $((retry_round + 1)): ${#FAILED_WORKERS[@]} port conflicts, offset +$PORT_OFFSET"

for i in "${FAILED_WORKERS[@]}"; do
idx=$((i - 1))
new_port=$((SANDBOX_WORKER_BASE_PORT + i - 1 + PORT_OFFSET))
result=$(start_worker_fast $i $new_port)
WORKER_PIDS[$idx]="${result%%:*}"
ACTUAL_WORKER_PORTS[$idx]=$new_port
# Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
if port_is_free $port; then
break
fi
CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
port=$((port + PORT_INCREMENT))
echo "[$_H] Port conflict on worker $i, trying port $port"
done

retry_round=$((retry_round + 1))
result=$(start_worker_fast $i $port)
WORKER_PIDS[$((i - 1))]="${result%%:*}"
ACTUAL_WORKER_PORTS[$((i - 1))]=$port
done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Port conflict resolution loop has no exhaustion guard.

If port_is_free never succeeds (all candidate ports occupied), the loop silently falls through and start_worker_fast is called on a busy port. After MAX_STARTUP_RETRIES offset increments, consider failing explicitly or logging a clear warning that the worker will likely fail to bind.

Also, the retry variable on line 425 is unused (ShellCheck SC2034) — use _ as the loop variable.

Proposed fix
-    for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
+    for _ in $(seq 0 $MAX_STARTUP_RETRIES); do
         if port_is_free $port; then
             break
         fi
         CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
         port=$((port + PORT_INCREMENT))
         echo "[$_H] Port conflict on worker $i, trying port $port"
     done
+    if ! port_is_free $port; then
+        echo "[$_H] WARNING: Could not find a free port for worker $i after $MAX_STARTUP_RETRIES retries (last tried $port)"
+    fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Phase 1: Pre-check ports and spawn all workers simultaneously.
# Gunicorn retries binding 5 times with 1s sleep (hardcoded in gunicorn/sock.py),
# so a port conflict wastes ~5s per worker. Pre-checking with ss avoids this.
echo "[$_H] Starting $NUM_WORKERS workers (ports $SANDBOX_WORKER_BASE_PORT-$((SANDBOX_WORKER_BASE_PORT + NUM_WORKERS - 1)))..."
START_SPAWN=$(date +%s)
CONFLICT_COUNT=0
for i in $(seq 1 $NUM_WORKERS); do
port=$((SANDBOX_WORKER_BASE_PORT + i - 1))
result=$(start_worker_fast $i $port)
WORKER_PIDS[$((i - 1))]="${result%%:*}"
ACTUAL_WORKER_PORTS[$((i - 1))]=$port
done
echo "[$_H] All $NUM_WORKERS workers spawned in $(($(date +%s) - START_SPAWN))s"
# Phase 2: Retry workers that failed due to port conflicts
retry_round=0
while [ $retry_round -lt $MAX_STARTUP_RETRIES ]; do
sleep 1
FAILED_WORKERS=()
for i in $(seq 1 $NUM_WORKERS); do
idx=$((i - 1))
worker_is_alive "${WORKER_PIDS[$idx]}" && continue
worker_had_port_conflict $i && FAILED_WORKERS+=($i)
done
[ ${#FAILED_WORKERS[@]} -eq 0 ] && break
PORT_OFFSET=$(( (retry_round + 1) * PORT_INCREMENT ))
echo "[$_H] Retry $((retry_round + 1)): ${#FAILED_WORKERS[@]} port conflicts, offset +$PORT_OFFSET"
for i in "${FAILED_WORKERS[@]}"; do
idx=$((i - 1))
new_port=$((SANDBOX_WORKER_BASE_PORT + i - 1 + PORT_OFFSET))
result=$(start_worker_fast $i $new_port)
WORKER_PIDS[$idx]="${result%%:*}"
ACTUAL_WORKER_PORTS[$idx]=$new_port
# Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
if port_is_free $port; then
break
fi
CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
port=$((port + PORT_INCREMENT))
echo "[$_H] Port conflict on worker $i, trying port $port"
done
retry_round=$((retry_round + 1))
result=$(start_worker_fast $i $port)
WORKER_PIDS[$((i - 1))]="${result%%:*}"
ACTUAL_WORKER_PORTS[$((i - 1))]=$port
done
# Phase 1: Pre-check ports and spawn all workers simultaneously.
# Gunicorn retries binding 5 times with 1s sleep (hardcoded in gunicorn/sock.py),
# so a port conflict wastes ~5s per worker. Pre-checking with ss avoids this.
echo "[$_H] Starting $NUM_WORKERS workers (ports $SANDBOX_WORKER_BASE_PORT-$((SANDBOX_WORKER_BASE_PORT + NUM_WORKERS - 1)))..."
START_SPAWN=$(date +%s)
CONFLICT_COUNT=0
for i in $(seq 1 $NUM_WORKERS); do
port=$((SANDBOX_WORKER_BASE_PORT + i - 1))
# Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
for _ in $(seq 0 $MAX_STARTUP_RETRIES); do
if port_is_free $port; then
break
fi
CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
port=$((port + PORT_INCREMENT))
echo "[$_H] Port conflict on worker $i, trying port $port"
done
if ! port_is_free $port; then
echo "[$_H] WARNING: Could not find a free port for worker $i after $MAX_STARTUP_RETRIES retries (last tried $port)"
fi
result=$(start_worker_fast $i $port)
WORKER_PIDS[$((i - 1))]="${result%%:*}"
ACTUAL_WORKER_PORTS[$((i - 1))]=$port
done
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 425-425: retry appears unused. Verify use (or export if used externally).

(SC2034)

🤖 Prompt for AI Agents
In `@dockerfiles/sandbox/start-with-nginx.sh` around lines 414 - 437, The
port-conflict loop may exhaust all retries and still call start_worker_fast on
an occupied port; update the loop that uses port_is_free and MAX_STARTUP_RETRIES
to use a throwaway loop variable (_) instead of retry, and after the loop add an
explicit exhaustion check: if no free port was found within MAX_STARTUP_RETRIES
increments, log a clear error/warning including the worker index and attempted
port(s), increment CONFLICT_COUNT as needed, and either skip starting that
worker (set WORKER_PIDS[...] empty and ACTUAL_WORKER_PORTS[...] to an
empty/invalid marker) or exit with a nonzero status so failure is visible; keep
references to start_worker_fast, WORKER_PIDS, ACTUAL_WORKER_PORTS,
PORT_INCREMENT and CONFLICT_COUNT when implementing the handling.

Comment on lines 41 to +42
port = get_free_port(strategy="random")
worker_base_port = get_free_port(strategy="random")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential port collision between port and worker_base_port.

Both ports are chosen independently with strategy="random" and neither uses the exclude parameter to avoid the other. In unlucky runs the two could collide, causing a flaky test failure.

Proposed fix
     port = get_free_port(strategy="random")
-    worker_base_port = get_free_port(strategy="random")
+    worker_base_port = get_free_port(exclude=[port], strategy="random")
🤖 Prompt for AI Agents
In `@tests/test_sandbox_network_blocking.py` around lines 41 - 42, The test picks
two free ports independently causing a possible collision: when creating port
and worker_base_port using get_free_port(strategy="random"), change the second
allocation to avoid the first (e.g., call get_free_port(strategy="random",
exclude=[port]) or otherwise pass an exclude argument with port) or derive
worker_base_port deterministically from port (e.g., port+1 after verifying
availability) so that get_free_port and the variables port and worker_base_port
cannot return the same value.

Replace tiangolo/uwsgi-nginx-flask with python:3.10-slim and gunicorn.
Add nginx as system package, remove all uwsgi configuration (UWSGI_PROCESSES,
UWSGI_CHEAPER, UWSGI_CPU_AFFINITY, uwsgi_read_timeout).
Add deprecation warnings for old uwsgi env vars in start-with-nginx.sh.
Fix SIGINT handler in shell_worker for gunicorn compatibility.
Update icpc/ioi dataset env vars to use GUNICORN_WORKERS.

Signed-off-by: George Armstrong <georgea@nvidia.com>
@gwarmstrong gwarmstrong force-pushed the georgea/sandbox-gunicorn branch from 21daa10 to 321a0ba Compare February 11, 2026 23:04
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Collaborator

@Kipok Kipok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm as long as all tests work well. Few AI comments from offline review that seem meaningful to check

  1. Gunicorn with 1024 workers (Medium concern)

In icpc/init.py and ioi/init.py:
"GUNICORN_WORKERS=1024"
uWSGI's cheaper mode allowed scaling from 1 active process up to 1024 on demand. Gunicorn doesn't have an equivalent — it will fork
all 1024 workers at startup. This could cause significant memory overhead and slow startup. Consider whether this is the intended
behavior, or if a smaller worker count should be used. If the 1024 was only needed because uWSGI's cheaper mode made it cheap to
declare a high max, this number likely needs to be reduced.

  1. worker_had_port_conflict is now dead code (Minor)

The function at line 164 of start-with-nginx.sh:

worker_had_port_conflict() {
grep -q "Address already in use" /var/log/worker${1}.log 2>/dev/null
}

This was used by the old Phase 2 retry loop which has been removed. The function is no longer called anywhere and should be deleted.

In shell_worker, resetting SIGINT to signal.SIG_DFL makes timeout interrupts terminate
the shell process instead of raising KeyboardInterrupt. ShellManager.run_cell() sends SIGINT after timeout expecting the shell to handle interruption and return an
interrupted response; with SIG_DFL, this path turns into EOFError + forced shell restart, dropping session state on timeouts. This regresses persistent-session behavior and
should restore Python's default interrupt handler (signal.default_int_handler) rather than the OS default action.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
nemo_skills/dataset/icpc/__init__.py (1)

24-28: Same lazy-scaling concern as ioi/__init__.py applies here.

The uWSGI cheaper-mode removal and the eager-spawn behaviour of GUNICORN_WORKERS=1024 are identical to the change in nemo_skills/dataset/ioi/__init__.py. See that file's review comment for the full analysis and the verification script. The changes themselves are structurally consistent across both dataset modules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/icpc/__init__.py` around lines 24 - 28, The
SANDBOX_ENV_VARS constant in nemo_skills/dataset/icpc/__init__.py is setting
GUNICORN_WORKERS=1024 which forces eager worker spawn and defeats lazy-scaling;
change that entry to a sensible default (or remove it) so the system can rely on
runtime worker-scaling (match the fix applied in ioi/__init__.py), and ensure
NUM_WORKERS and STATEFUL_SANDBOX entries remain correct—update the
SANDBOX_ENV_VARS list to remove or lower "GUNICORN_WORKERS=1024" and verify
behavior of the module-level SANDBOX_ENV_VARS constant after the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@nemo_skills/dataset/icpc/__init__.py`:
- Around line 24-28: The SANDBOX_ENV_VARS constant in
nemo_skills/dataset/icpc/__init__.py is setting GUNICORN_WORKERS=1024 which
forces eager worker spawn and defeats lazy-scaling; change that entry to a
sensible default (or remove it) so the system can rely on runtime worker-scaling
(match the fix applied in ioi/__init__.py), and ensure NUM_WORKERS and
STATEFUL_SANDBOX entries remain correct—update the SANDBOX_ENV_VARS list to
remove or lower "GUNICORN_WORKERS=1024" and verify behavior of the module-level
SANDBOX_ENV_VARS constant after the change.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 321a0ba and 32e08a2.

📒 Files selected for processing (2)
  • nemo_skills/dataset/icpc/__init__.py
  • nemo_skills/dataset/ioi/__init__.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants