feat: migrate sandbox from uwsgi to gunicorn #1234
Conversation
```shell
    --backlog 100 \
    --access-logfile /var/log/worker${i}.log \
    --error-logfile /var/log/worker${i}.log \
    --log-level info \
```
trailing backslash unnecessary (line continues to closing paren)
```diff
-    --log-level info \
+    --log-level info
```
📝 Walkthrough

Migrates sandbox runtime from uWSGI to Gunicorn: switches the Docker base image, replaces uWSGI env vars with GUNICORN equivalents, rewrites startup orchestration to spawn Gunicorn workers with port coordination and health checks, updates tests and dataset sandbox envs, plus minor network-blocking comment/handler tweaks.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Docker as Container
    participant Start as start-with-nginx.sh
    participant PortChecker as port_is_free
    participant Gunicorn as Gunicorn Workers
    participant Nginx as nginx
    participant Health as Health Checks
    Docker->>Start: container start
    Start->>PortChecker: check SANDBOX_WORKER_BASE_PORT
    PortChecker-->>Start: port free / retry offsets
    Start->>Gunicorn: spawn N workers (per-worker ports, logs)
    Start->>Nginx: generate upstreams & config
    Nginx-->>Start: nginx started / ready
    Start->>Health: perform local/remote health checks
    Health-->>Start: health OK -> continue monitoring
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Line 138: In start-with-nginx.sh there is a bare redirection line
">/var/log/worker${i}.log" which triggers ShellCheck SC2188; replace the bare
redirection with an explicit truncate command using the shell no-op builtin
(i.e., use ":" to truncate the file) so the intent is clear and ShellCheck is
satisfied, updating the line that performs the redirection for
/var/log/worker${i}.log.
In `@nemo_skills/code_execution/local_sandbox/local_sandbox_server.py`:
- Around line 104-108: The current call signal.signal(signal.SIGINT,
signal.SIG_DFL) will let the OS kill the process instead of raising
KeyboardInterrupt, so change the SIGINT handler to use Python's
default_int_handler instead; locate the signal setup in local_sandbox_server.py
(the block mentioning TerminalInteractiveShell) and replace the SIG_DFL
installation with signal.signal(signal.SIGINT, signal.default_int_handler) so
the except KeyboardInterrupt in the shell worker (referenced by the
run_cell/interrupt handling in the shell worker/ShellManager code) becomes
reachable and returns the {"status":"interrupted"} response.
🧹 Nitpick comments (6)
tests/test_sandbox_network_blocking.py (1)
41-55: LGTM — dynamic port assignment for worker base port prevents conflicts with other services.

Minor note: `get_free_port(strategy="random")` doesn't verify the port is actually free on the host; it just picks a random number. There's a tiny chance `port` and `worker_base_port` could collide. The `start-with-nginx.sh` port-conflict handling should recover, but you could pass `exclude=[port]` to the second call for safety:

```diff
- worker_base_port = get_free_port(strategy="random")
+ worker_base_port = get_free_port(strategy="random", exclude=[port])
```

dockerfiles/Dockerfile.sandbox (2)
34-36: Add `--no-install-recommends` to reduce image size.

Since the base is `python:3.10-slim`, adding `--no-install-recommends` avoids pulling in unnecessary transitive packages and keeps the image lean.

Proposed fix

```diff
 RUN apt-get update && \
-    apt-get install -y nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
+    apt-get install -y --no-install-recommends nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
     rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*
```
83-97: LGTM — lock-file-based installs via `uv` are a solid improvement for reproducibility.

One minor inconsistency: line 102 still uses bare `pip install gdown` while everything else uses `uv`. Consider `uv pip install gdown` for consistency, though this is a build-time-only tool and doesn't affect the lockfile.

dockerfiles/sandbox/start-with-nginx.sh (3)
139-153: Verify that the tracked PID corresponds to the Gunicorn master process.

The `( cd /app && env ... gunicorn ... ) >> logfile 2>&1 &` pattern captures the subshell PID via `$!`. Because the outer redirection is applied to the subshell itself, bash may not be able to exec-optimize gunicorn into the subshell process. If the subshell PID differs from gunicorn's master PID, `kill -TERM` in `cleanup()` would kill the subshell but might orphan gunicorn workers.

In practice, when the subshell's only remaining work is the gunicorn command, the subshell should exit when gunicorn exits, so liveness checks (`kill -0`) work. But `SIGTERM` delivery could be an issue — the subshell receives it and exits without forwarding to gunicorn.

Consider using `exec` to replace the subshell:

Proposed fix

```diff
-    ( cd /app && env WORKER_NUM=$i \
-        gunicorn main:app \
+    ( cd /app && exec env WORKER_NUM=$i \
+        gunicorn main:app \
```
421-437: No guard after exhausting retries — worker will attempt to bind a possibly occupied port.

If `port_is_free` never succeeds within `MAX_STARTUP_RETRIES` iterations, the loop exits and `start_worker_fast` is called with the last (likely occupied) port. Gunicorn will fail to bind and die — caught by the monitoring loop, but a log message would help diagnose the situation.

Also, ShellCheck SC2034: `retry` is unused. You can use `_` as the loop variable.

Proposed fix

```diff
-  for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
+  for _ in $(seq 0 $MAX_STARTUP_RETRIES); do
     if port_is_free $port; then
       break
     fi
     CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
     port=$((port + PORT_INCREMENT))
     echo "[$_H] Port conflict on worker $i, trying port $port"
   done
+
+  if ! port_is_free $port; then
+    echo "[$_H] WARNING: Could not find free port for worker $i after $MAX_STARTUP_RETRIES retries (last tried: $port)"
+  fi
```
356-368: STATEFUL_SANDBOX=1 silently overrides user-provided GUNICORN_WORKERS.

If a user sets both `STATEFUL_SANDBOX=1` and `GUNICORN_WORKERS=4`, the value is silently forced to 1. Consider logging a warning when overriding a user-provided value:

```diff
 if [ "$STATEFUL_SANDBOX" -eq 1 ]; then
+  [ -n "$GUNICORN_WORKERS" ] && [ "$GUNICORN_WORKERS" -ne 1 ] && \
+    echo "WARNING: STATEFUL_SANDBOX=1 requires GUNICORN_WORKERS=1, overriding GUNICORN_WORKERS=$GUNICORN_WORKERS"
   GUNICORN_WORKERS=1
```
```python
# TerminalInteractiveShell installs a SIGINT handler that calls sys.exit(0)
# instead of raising KeyboardInterrupt when _executing is False (which is
# the case when run_cell is called programmatically). Restore the default
# handler so SIGINT raises KeyboardInterrupt and our except clause catches it.
signal.signal(signal.SIGINT, signal.SIG_DFL)
```
🧩 Analysis chain
🌐 Web query:
Python signal.SIG_DFL SIGINT behavior process termination
💡 Result:
In Python, there are two different “defaults” in play for SIGINT (Ctrl+C):
- Python's default SIGINT handler: CPython installs a handler that translates SIGINT into a `KeyboardInterrupt` exception (as long as the parent process hasn't changed SIGINT handling). So the process usually does not immediately die at the OS level; instead Python raises `KeyboardInterrupt` in the main thread. [1]
- `signal.SIG_DFL` (restore OS default): `signal.SIG_DFL` means "use the signal's default action". If you do `signal.signal(signal.SIGINT, signal.SIG_DFL)`, then SIGINT is no longer translated into `KeyboardInterrupt`; it reverts to the platform/OS default behavior. Python documents `SIG_DFL` as "perform the default function for the signal". [1]
On POSIX/Unix, the default action for SIGINT is process termination (i.e., the kernel terminates the process as if by _exit(), with wait status indicating termination by signal). [2]
Net effect:
- Default Python: Ctrl+C ⇒ `KeyboardInterrupt` (catchable in Python). [1]
- After `signal(SIGINT, SIG_DFL)`: Ctrl+C / SIGINT ⇒ process is terminated by the OS signal default, typically without a Python exception/traceback. [1][2]
Sources:
[1] Python signal module docs (SIGINT → KeyboardInterrupt, SIG_DFL restores default) (docs.python.org)
[2] POSIX/Open Group description of SIG_DFL default actions and termination semantics (pubs.opengroup.org)
Citations:
- 1: https://docs.python.org/3/library/signal.html
- 2: https://pubs.opengroup.org/onlinepubs/9799919799/functions/V2_chap02.html?utm_source=openai
🌐 Web query:
Python signal.default_int_handler KeyboardInterrupt
💡 Result:
signal.default_int_handler is Python’s built-in default handler for SIGINT (typically generated by pressing Ctrl+C). Its behavior is simply to raise KeyboardInterrupt. [1]
Typical uses:
- Restore Python's Ctrl+C behavior after installing a custom handler:

  ```python
  import signal
  signal.signal(signal.SIGINT, signal.default_int_handler)
  ```

  This makes `SIGINT` raise `KeyboardInterrupt` again (instead of whatever your custom handler did). [1]
- Note: `signal.signal(SIGINT, signal.SIG_DFL)` does not restore Python's `KeyboardInterrupt` behavior; it restores the OS default action (usually terminating the process). To restore Python's behavior, use `signal.default_int_handler` (or save/restore the previous handler). [4]
- Why you might not see it installed: Python only installs its default `SIGINT` → `KeyboardInterrupt` translation if the parent process hasn't already set `SIGINT` to "ignore" (`SIG_IGN`). In that case, Python inherits the ignore disposition and won't translate `SIGINT` into `KeyboardInterrupt`. [1], [8]
Sources: Python signal docs [1]; Stack Overflow on restoring default SIGINT handling [4]; Stack Overflow on inherited SIG_IGN preventing KeyboardInterrupt [8].
🏁 Script executed:
```shell
find . -name "local_sandbox_server.py" | head -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 283
🏁 Script executed:
```shell
fd "local_sandbox_server.py" -t f
```

Repository: NVIDIA-NeMo/Skills
Length of output: 283
Bug: signal.SIG_DFL terminates the process instead of raising KeyboardInterrupt.
signal.SIG_DFL restores the OS-level default for SIGINT, which is process termination on POSIX systems — it does not raise KeyboardInterrupt. This means the except KeyboardInterrupt block on line 142 is unreachable; when the ShellManager sends SIGINT, the shell worker will be killed by the OS rather than gracefully returning an {"status": "interrupted"} response.
Use signal.default_int_handler — Python's built-in handler that raises KeyboardInterrupt:
Proposed fix
```diff
- signal.signal(signal.SIGINT, signal.SIG_DFL)
+ signal.signal(signal.SIGINT, signal.default_int_handler)
```
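The difference can be exercised with a small standalone sketch (hypothetical names; `run_with_interrupt` stands in for the shell worker's run loop, not the repo's actual code): with `default_int_handler` installed, a SIGINT delivered to the process surfaces as a catchable `KeyboardInterrupt`.

```python
import os
import signal

# Install Python's default handler so SIGINT raises KeyboardInterrupt
# (with SIG_DFL, the same signal would terminate the process instead).
signal.signal(signal.SIGINT, signal.default_int_handler)

def run_with_interrupt():
    try:
        # Simulate ShellManager sending SIGINT to the worker on timeout.
        os.kill(os.getpid(), signal.SIGINT)
        return {"status": "completed"}
    except KeyboardInterrupt:
        return {"status": "interrupted"}

print(run_with_interrupt())  # {'status': 'interrupted'}
```

With `SIG_DFL` instead, the `os.kill` line would terminate the process before the `except` clause could run, which is exactly the regression described above.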
```shell
# Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
  if port_is_free $port; then
    break
  fi
  CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
  port=$((port + PORT_INCREMENT))
```
Potential race condition between the `port_is_free` check and `start_worker_fast` binding: if another process binds the port in this gap, gunicorn will still retry 5 times (~5s delay). The old approach with post-spawn retry handled this by checking logs after the process started. Consider verifying that the worker actually started successfully on the expected port.
Suggested change:

```diff
 # Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
 for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
   if port_is_free $port; then
+    result=$(start_worker_fast $i $port)
+    WORKER_PIDS[$((i - 1))]="${result%%:*}"
+    ACTUAL_WORKER_PORTS[$((i - 1))]=$port
     break
   fi
   CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
   port=$((port + PORT_INCREMENT))
   echo "[$_H] Port conflict on worker $i, trying port $port"
 done
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@dockerfiles/sandbox/start-with-nginx.sh`:
- Around line 414-437: The port-conflict loop may exhaust all retries and still
call start_worker_fast on an occupied port; update the loop that uses
port_is_free and MAX_STARTUP_RETRIES to use a throwaway loop variable (_)
instead of retry, and after the loop add an explicit exhaustion check: if no
free port was found within MAX_STARTUP_RETRIES increments, log a clear
error/warning including the worker index and attempted port(s), increment
CONFLICT_COUNT as needed, and either skip starting that worker (set
WORKER_PIDS[...] empty and ACTUAL_WORKER_PORTS[...] to an empty/invalid marker)
or exit with a nonzero status so failure is visible; keep references to
start_worker_fast, WORKER_PIDS, ACTUAL_WORKER_PORTS, PORT_INCREMENT and
CONFLICT_COUNT when implementing the handling.
In `@tests/test_sandbox_network_blocking.py`:
- Around line 41-42: The test picks two free ports independently causing a
possible collision: when creating port and worker_base_port using
get_free_port(strategy="random"), change the second allocation to avoid the
first (e.g., call get_free_port(strategy="random", exclude=[port]) or otherwise
pass an exclude argument with port) or derive worker_base_port deterministically
from port (e.g., port+1 after verifying availability) so that get_free_port and
the variables port and worker_base_port cannot return the same value.
🧹 Nitpick comments (4)
dockerfiles/Dockerfile.sandbox (2)
35-37: Add `--no-install-recommends` to reduce image size.

On a slim base image, recommended packages can add significant bloat. This was also flagged by Trivy (DS-0029).

Proposed fix

```diff
 RUN apt-get update && \
-    apt-get install -y nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
+    apt-get install -y --no-install-recommends nginx curl git net-tools bzip2 build-essential libseccomp-dev && \
     rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*
```
84-85: Consider pinning the `uv` version for reproducible builds.

The lockfile approach gives you reproducible Python deps, but the installer itself (`uv`) is fetched at whatever version is latest at build time. A breaking change in `uv` could cause build failures.

Proposed fix (example version)

```diff
-RUN curl -LsSf https://astral.sh/uv/install.sh | sh
+RUN curl -LsSf https://astral.sh/uv/0.6.6/install.sh | sh
```

dockerfiles/sandbox/start-with-nginx.sh (2)
65-77: Good deprecation path for uWSGI env vars.

The fallback from `UWSGI_PROCESSES` → `GUNICORN_WORKERS` via `: "${GUNICORN_WORKERS:=$UWSGI_PROCESSES}"` is clean. Note that when `STATEFUL_SANDBOX=1` (the default), line 361 unconditionally sets `GUNICORN_WORKERS=1`, so the migrated value is silently overridden. This is correct behavior (stateful mode needs exactly 1 worker), but consider mentioning it in the deprecation warning so users aren't confused.
142-152: Gunicorn logs may be duplicated to the worker log file.

`--access-logfile` and `--error-logfile` both point to `/var/log/worker${i}.log`, while the subshell's stdout/stderr are also redirected (`>>`) to the same file. Gunicorn's console output (which includes startup messages and worker lifecycle events) will be captured by both mechanisms, resulting in duplicate lines.

Consider either dropping the `--access-logfile`/`--error-logfile` flags and relying solely on the shell redirect, or redirecting stdout/stderr to `/dev/null` and relying solely on gunicorn's log flags.
```shell
# Phase 1: Pre-check ports and spawn all workers simultaneously.
# Gunicorn retries binding 5 times with 1s sleep (hardcoded in gunicorn/sock.py),
# so a port conflict wastes ~5s per worker. Pre-checking with ss avoids this.
echo "[$_H] Starting $NUM_WORKERS workers (ports $SANDBOX_WORKER_BASE_PORT-$((SANDBOX_WORKER_BASE_PORT + NUM_WORKERS - 1)))..."
START_SPAWN=$(date +%s)
CONFLICT_COUNT=0

for i in $(seq 1 $NUM_WORKERS); do
  port=$((SANDBOX_WORKER_BASE_PORT + i - 1))
  result=$(start_worker_fast $i $port)
  WORKER_PIDS[$((i - 1))]="${result%%:*}"
  ACTUAL_WORKER_PORTS[$((i - 1))]=$port
done

echo "[$_H] All $NUM_WORKERS workers spawned in $(($(date +%s) - START_SPAWN))s"

# Phase 2: Retry workers that failed due to port conflicts
retry_round=0
while [ $retry_round -lt $MAX_STARTUP_RETRIES ]; do
  sleep 1

  FAILED_WORKERS=()
  for i in $(seq 1 $NUM_WORKERS); do
    idx=$((i - 1))
    worker_is_alive "${WORKER_PIDS[$idx]}" && continue
    worker_had_port_conflict $i && FAILED_WORKERS+=($i)
  done

  [ ${#FAILED_WORKERS[@]} -eq 0 ] && break

  PORT_OFFSET=$(( (retry_round + 1) * PORT_INCREMENT ))
  echo "[$_H] Retry $((retry_round + 1)): ${#FAILED_WORKERS[@]} port conflicts, offset +$PORT_OFFSET"

  for i in "${FAILED_WORKERS[@]}"; do
    idx=$((i - 1))
    new_port=$((SANDBOX_WORKER_BASE_PORT + i - 1 + PORT_OFFSET))
    result=$(start_worker_fast $i $new_port)
    WORKER_PIDS[$idx]="${result%%:*}"
    ACTUAL_WORKER_PORTS[$idx]=$new_port
  # Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
  for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
    if port_is_free $port; then
      break
    fi
    CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
    port=$((port + PORT_INCREMENT))
    echo "[$_H] Port conflict on worker $i, trying port $port"
  done

  retry_round=$((retry_round + 1))
  result=$(start_worker_fast $i $port)
  WORKER_PIDS[$((i - 1))]="${result%%:*}"
  ACTUAL_WORKER_PORTS[$((i - 1))]=$port
done
```
Port conflict resolution loop has no exhaustion guard.
If port_is_free never succeeds (all candidate ports occupied), the loop silently falls through and start_worker_fast is called on a busy port. After MAX_STARTUP_RETRIES offset increments, consider failing explicitly or logging a clear warning that the worker will likely fail to bind.
Also, the retry variable on line 425 is unused (ShellCheck SC2034) — use _ as the loop variable.
Proposed fix

```diff
-  for retry in $(seq 0 $MAX_STARTUP_RETRIES); do
+  for _ in $(seq 0 $MAX_STARTUP_RETRIES); do
     if port_is_free $port; then
       break
     fi
     CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
     port=$((port + PORT_INCREMENT))
     echo "[$_H] Port conflict on worker $i, trying port $port"
   done
+
+  if ! port_is_free $port; then
+    echo "[$_H] WARNING: Could not find a free port for worker $i after $MAX_STARTUP_RETRIES retries (last tried $port)"
+  fi
```

📝 Committable suggestion
```shell
# Phase 1: Pre-check ports and spawn all workers simultaneously.
# Gunicorn retries binding 5 times with 1s sleep (hardcoded in gunicorn/sock.py),
# so a port conflict wastes ~5s per worker. Pre-checking with ss avoids this.
echo "[$_H] Starting $NUM_WORKERS workers (ports $SANDBOX_WORKER_BASE_PORT-$((SANDBOX_WORKER_BASE_PORT + NUM_WORKERS - 1)))..."
START_SPAWN=$(date +%s)
CONFLICT_COUNT=0
for i in $(seq 1 $NUM_WORKERS); do
  port=$((SANDBOX_WORKER_BASE_PORT + i - 1))
  # Try up to MAX_STARTUP_RETRIES offsets if the port is occupied
  for _ in $(seq 0 $MAX_STARTUP_RETRIES); do
    if port_is_free $port; then
      break
    fi
    CONFLICT_COUNT=$((CONFLICT_COUNT + 1))
    port=$((port + PORT_INCREMENT))
    echo "[$_H] Port conflict on worker $i, trying port $port"
  done
  if ! port_is_free $port; then
    echo "[$_H] WARNING: Could not find a free port for worker $i after $MAX_STARTUP_RETRIES retries (last tried $port)"
  fi
  result=$(start_worker_fast $i $port)
  WORKER_PIDS[$((i - 1))]="${result%%:*}"
  ACTUAL_WORKER_PORTS[$((i - 1))]=$port
done
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 425-425: retry appears unused. Verify use (or export if used externally).
(SC2034)
```python
port = get_free_port(strategy="random")
worker_base_port = get_free_port(strategy="random")
```
Potential port collision between port and worker_base_port.
Both ports are chosen independently with strategy="random" and neither uses the exclude parameter to avoid the other. In unlucky runs the two could collide, causing a flaky test failure.
Proposed fix

```diff
 port = get_free_port(strategy="random")
-worker_base_port = get_free_port(strategy="random")
+worker_base_port = get_free_port(exclude=[port], strategy="random")
```
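A standalone sketch of collision-free allocation (hypothetical `free_port` helper; the repo's `get_free_port` signature is taken from the review text and not verified here): bind to port 0 to get an OS-assigned free port, and retry when the result matches an excluded one.

```python
import socket

def free_port(exclude=()):
    # Ask the OS for a free port; retry if it collides with excluded ports.
    while True:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("127.0.0.1", 0))
            candidate = s.getsockname()[1]
        if candidate not in exclude:
            return candidate

port = free_port()
worker_base_port = free_port(exclude=[port])
print(port, worker_base_port)
```

Binding to port 0 also guarantees the port was free at allocation time, unlike a purely random pick.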
Replace tiangolo/uwsgi-nginx-flask with python:3.10-slim and gunicorn. Add nginx as system package, remove all uwsgi configuration (UWSGI_PROCESSES, UWSGI_CHEAPER, UWSGI_CPU_AFFINITY, uwsgi_read_timeout). Add deprecation warnings for old uwsgi env vars in start-with-nginx.sh. Fix SIGINT handler in shell_worker for gunicorn compatibility. Update icpc/ioi dataset env vars to use GUNICORN_WORKERS. Signed-off-by: George Armstrong <georgea@nvidia.com>
Kipok
left a comment
lgtm as long as all tests work well. Few AI comments from offline review that seem meaningful to check
- Gunicorn with 1024 workers (Medium concern)

  In icpc/`__init__.py` and ioi/`__init__.py`:

  `"GUNICORN_WORKERS=1024"`

  uWSGI's cheaper mode allowed scaling from 1 active process up to 1024 on demand. Gunicorn doesn't have an equivalent — it will fork all 1024 workers at startup. This could cause significant memory overhead and slow startup. Consider whether this is the intended behavior, or if a smaller worker count should be used. If the 1024 was only needed because uWSGI's cheaper mode made it cheap to declare a high max, this number likely needs to be reduced.
- worker_had_port_conflict is now dead code (Minor)
  The function at line 164 of start-with-nginx.sh:

  ```shell
  worker_had_port_conflict() {
    grep -q "Address already in use" /var/log/worker${1}.log 2>/dev/null
  }
  ```
This was used by the old Phase 2 retry loop which has been removed. The function is no longer called anywhere and should be deleted.
In shell_worker, resetting SIGINT to signal.SIG_DFL makes timeout interrupts terminate the shell process instead of raising KeyboardInterrupt. ShellManager.run_cell() sends SIGINT after timeout expecting the shell to handle interruption and return an interrupted response; with SIG_DFL, this path turns into EOFError + forced shell restart, dropping session state on timeouts. This regresses persistent-session behavior and should restore Python's default interrupt handler (signal.default_int_handler) rather than the OS default action.
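On the worker-count concern above: if a fixed 1024 is dropped, one hedged option is Gunicorn's commonly cited `2 * CPUs + 1` sizing heuristic, capped to keep eager forking cheap (the cap value here is illustrative, not from the PR):

```python
import os

def suggested_workers(cap=64):
    # Gunicorn docs suggest (2 x num_cores) + 1 as a starting point;
    # cap it because every worker is forked eagerly at startup.
    cpus = os.cpu_count() or 1
    return min(2 * cpus + 1, cap)

print(suggested_workers())
```

Since Gunicorn has no equivalent of uWSGI's cheaper mode, the configured count is the steady-state memory footprint, so sizing to hardware rather than to a high on-demand ceiling is the safer default.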
♻️ Duplicate comments (1)
nemo_skills/dataset/icpc/__init__.py (1)
24-28: Same lazy-scaling concern as `ioi/__init__.py` applies here.

The uWSGI cheaper-mode removal and the eager-spawn behaviour of `GUNICORN_WORKERS=1024` are identical to the change in `nemo_skills/dataset/ioi/__init__.py`. See that file's review comment for the full analysis and the verification script. The changes themselves are structurally consistent across both dataset modules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/icpc/__init__.py` around lines 24 - 28, The SANDBOX_ENV_VARS constant in nemo_skills/dataset/icpc/__init__.py is setting GUNICORN_WORKERS=1024 which forces eager worker spawn and defeats lazy-scaling; change that entry to a sensible default (or remove it) so the system can rely on runtime worker-scaling (match the fix applied in ioi/__init__.py), and ensure NUM_WORKERS and STATEFUL_SANDBOX entries remain correct—update the SANDBOX_ENV_VARS list to remove or lower "GUNICORN_WORKERS=1024" and verify behavior of the module-level SANDBOX_ENV_VARS constant after the change.
Summary
- Replace `tiangolo/uwsgi-nginx-flask` base image with `python:3.10-slim` + nginx + gunicorn
- `start-with-nginx.sh`
- `shell_worker` for gunicorn compatibility
- `icpc`/`ioi` dataset env vars to use `GUNICORN_WORKERS`
- `sys.executable` usage

Depends on #1233 (lockfiles). Split from #1226.
Summary by CodeRabbit
Chores
Bug Fixes