Skip to content

fix: sandbox lifecycle dispatch (per-agent / per-task container reuse)#2008

Merged
Aureliolo merged 8 commits into
mainfrom
fix/sandbox-lifecycle
May 18, 2026
Merged

fix: sandbox lifecycle dispatch (per-agent / per-task container reuse)#2008
Aureliolo merged 8 commits into
mainfrom
fix/sandbox-lifecycle

Conversation

@Aureliolo
Copy link
Copy Markdown
Owner

Summary

Wires the previously-ghost sandbox lifecycle strategy into DockerSandbox. EPIC #1955 child.

  • DockerSandbox.execute() now honours owner_id and dispatches to the configured SandboxLifecycleStrategy instead of cold-starting a one-shot container per call.
  • Unified keep-alive container + docker exec execution model: a long-lived idle container (tail -f /dev/null) is acquired per the strategy and every tool call runs via container.exec(); per-agent/per-task reuse the container across calls, per-call destroys it after the single exec (default, behaviour preserved).
  • Owner key resolution: explicit owner_id → structlog correlation context (agent_id for per-agent, task_id for per-task) → safe ephemeral per-call degrade, with owner-id format validation.
  • release_owner() wired at the task boundary in AgentEngineExecutionService (per-task: immediate destroy; per-agent: grace timer; per-call: no-op). DockerSandbox.cleanup() calls strategy.cleanup_all().
  • Strategy constructed at boot in workers/runtime_builder with the app clock and injected through the sandbox factory; create_lifecycle_strategy flipped PENDING→ENFORCED in scripts/_ghost_wiring_manifest.txt (same PR).
  • reuses_container discriminator added to the strategy protocol + 3 implementations.
  • docker_sandbox.py split (exec/dispatch machinery → docker_sandbox_exec.py); both files <800 lines, oversized functions decomposed.
  • Pre-PR review fixes folded in: dedicated event constants for track/untrack/stream-close/exec-inspect/logs-collect failures; error context on factory backend-build + cleanup_sandbox_backends; per_task release wrapped with destroy-failure logging; test parametrize ids + module marker.

Test plan

  • New TestLifecycleDispatch (8 tests) + TestSandboxOwnerRelease (5 tests): per-agent/per-task container reuse verified, grace teardown via FakeClock, per-call ephemeral, degraded-owner safety, cleanup_all, boot-boundary release. Amplified ×20 (160 runs, RuntimeWarning-as-error) — zero leaked tasks.
  • Existing sandbox suite migrated to the exec model; 529 sandbox + workers unit tests pass.
  • Full pre-push green: ruff, ruff-format, mypy strict, ghost-wiring gate, long-running-loop kill-switch gate, affected unit tests (-n8 --count 2), every convention gate.

Review coverage

Pre-PR pipeline: 12 review agents (code/python/conventions/async-concurrency/security/logging/silent-failure/test-quality/docs-consistency/comment-quality-rot/issue-resolution-verifier/ghost-wiring). issue-resolution-verifier: all 8 acceptance criteria RESOLVED. 9 valid findings implemented; conflicting/out-of-diff/design-approved items triaged with the user.

Closes #1965

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a long-awaited sandbox lifecycle strategy into the DockerSandbox backend. By transitioning from a cold-start per-call container model to a keep-alive container model using docker exec, the system now supports efficient container reuse per agent or per task. The changes include robust owner-key resolution, lifecycle strategy injection at boot, and automated teardown at task boundaries, ensuring resources are reclaimed properly while maintaining backward compatibility with the default ephemeral behavior.

Highlights

  • Sandbox Lifecycle Strategy Integration: Implemented a new lifecycle strategy for Docker sandboxes, allowing for container reuse across tool calls (per-agent or per-task) instead of creating a new container for every execution.
  • Execution Model Refactoring: Split docker_sandbox.py into docker_sandbox.py and docker_sandbox_exec.py to manage the keep-alive container and docker exec logic, improving maintainability.
  • Owner-Based Lifecycle Management: Introduced owner_id resolution based on structlog correlation context, enabling automatic container reuse for specific agents or tasks, with a safe fallback to ephemeral per-call behavior.
  • Lifecycle Release Mechanism: Wired release_owner() into AgentEngineExecutionService to ensure containers are correctly destroyed or grace-timed at the task boundary.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 18, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7b2efee5-2805-4f5a-b43c-1869617ae90e

📥 Commits

Reviewing files that changed from the base of the PR and between 05abbe3 and d8a873c.

📒 Files selected for processing (2)
  • docs/design/tools.md
  • src/synthorg/tools/sandbox/subprocess_sandbox.py

Walkthrough

This PR implements sandbox container reuse across per-agent and per-task boundaries. It adds DockerSandboxExecMixin for keep-alive container management with owner-key resolution, updates lifecycle strategies to accept destroy callbacks with improved error handling, integrates lifecycle dispatch into DockerSandbox.execute(), extends SandboxBackend and SandboxLifecycleStrategy protocols, wires lifecycle strategies at boot in runtime_builder, adds sandbox owner release at task boundaries in AgentEngineExecutionService, and provides comprehensive test coverage for lifecycle dispatch, protocol compliance, and edge cases.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Out of Scope Changes check ❓ Inconclusive Several changes appear tangentially related but not directly required by issue #1965: git-hook logging, test script --porcelain fix, and test file removal warrant clarification on scope. Confirm whether git-hooks/_run-hook.sh, run_affected_tests.py, and test_git_hooks_wrapper.py removal are intentional dependencies or separate cleanup that should be separated into a follow-up PR.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title clearly describes the main change: implementing sandbox lifecycle dispatch with per-agent/per-task container reuse, which is the core objective of this PR.
Description check ✅ Passed Description comprehensively explains the changes: wiring lifecycle strategy into DockerSandbox, keep-alive container execution model, owner resolution, release_owner integration, strategy construction, test coverage, and all pre-PR fixes.
Linked Issues check ✅ Passed PR fully addresses issue #1965 acceptance criteria: DockerSandbox.execute() now honours owner_id, per-agent/per-task container reuse works correctly, per-call default preserved, validated with new test suite and simulation harness.
Docstring Coverage ✅ Passed Docstring coverage is 46.88% which is sufficient. The required threshold is 40.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 18, 2026

Merging this PR will not alter performance

✅ 33 untouched benchmarks
⏩ 21 skipped benchmarks1


Comparing fix/sandbox-lifecycle (d8a873c) with main (4140fc5)2

Open in CodSpeed

Footnotes

  1. 21 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (9e4f5c1) during the generation of this report, so 4140fc5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a persistent container execution model for the Docker sandbox, allowing container reuse across tool calls via docker exec. It introduces a lifecycle strategy framework (supporting per-agent, per-task, and per-call behaviors) and refactors the DockerSandbox to use a new DockerSandboxExecMixin. The AgentEngineExecutionService is updated to manage the release of sandbox owners at task boundaries. Review feedback identifies a violation of the repository's exception-handling policy regarding catastrophic errors in the per-task strategy and points out code duplication in the environment merging logic between the sandbox backend and the execution mixin.

Comment thread src/synthorg/tools/sandbox/lifecycle/per_task.py
Comment thread src/synthorg/tools/sandbox/docker_sandbox.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/workers/test_execution_service.py (1)

304-388: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add a failure-path release test for execute_once.

These tests cover release behavior on successful runs, but the contract is “release at task boundary regardless of outcome.” Please add a case where engine.run raises and assert release_owner(...) is still awaited.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/workers/test_execution_service.py` around lines 304 - 388, Add a
new async test that simulates a failing engine run to verify the sandbox owner
is still released: create an AgentEngine with
run=AsyncMock(side_effect=RuntimeError("boom")), a SandboxBackend with
release_owner=AsyncMock, instantiate AgentEngineExecutionService (use
lifecycle_strategy_kind=STRATEGY_PER_TASK or STRATEGY_PER_AGENT as appropriate),
call execute_once and catch the RuntimeError if it propagates, then assert
release_owner was awaited once with the expected id; reference
AgentEngineExecutionService.execute_once, AgentEngine.run, and
SandboxBackend.release_owner to locate where to inject this test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/tools.md`:
- Around line 179-192: The docs contain a contradiction: the narrative in
workers/runtime_builder and DockerSandbox describes per-call as the
fallback/default-preserving behavior while the strategy table still lists
per-agent as the default; update the documentation so both places agree on a
single authoritative default (choose either per-call or per-agent), and make
corresponding edits to the strategy table and the explanatory paragraph that
mentions AgentEngineExecutionService, DockerSandbox.cleanup() and cleanup_all()
to reflect that choice (e.g., if choosing per-call, state that owner fallback
degrades to ephemeral per-call and mark per-call as the default in the table; if
choosing per-agent, remove the “degrades to ephemeral per-call” phrasing and
mark per-agent as default). Ensure mentions of DockerSandbox, cleanup_all(), and
agent/task ownership behavior remain consistent with the selected default.

In `@src/synthorg/tools/sandbox/docker_sandbox_exec.py`:
- Around line 783-795: The call to ship_container_logs in the end of the
function can raise and violate the docstring promise that this method "Never
raises ordinary errors", which would bypass execute()'s finally behavior and
prevent _teardown_unowned from running (leaking containers); wrap the
ship_container_logs call (and any post-processing that uses result/sidecar_logs)
in a try/except that catches Exception, logs the error via the same logger used
elsewhere, and ensures the function returns sidecar_logs (or a safe default)
instead of propagating—this guarantees execute()'s finally will run and
_teardown_unowned will not be skipped.

In `@src/synthorg/tools/sandbox/lifecycle/per_task.py`:
- Around line 98-106: The code currently calls destroy_fn(handle) after the
handle has been removed from tracking, and on exception it swallows the error
which can orphan a live container; update the logic so that either (A) you only
remove the handle from the tracking registry after destroy_fn succeeds, or (B)
if the handle was already removed, catch Exception as e, log
SANDBOX_LIFECYCLE_DESTROY_FAILED with the exception details, and reinsert the
handle back into the tracking registry so it remains eligible for retries;
reference the destroy_fn and handle variables and ensure the logger.warning call
includes the exception (e) and preserves owner_id/container_id context.

In `@src/synthorg/tools/sandbox/protocol.py`:
- Around line 66-80: The release_owner method currently types its owner_id
parameter as str allowing blank IDs; change the signature of release_owner to
accept NotBlankStr (matching execute(... owner_id: NotBlankStr | None)) so the
owner-key contract is strict, update any related type hints/imports in
src/synthorg/tools/sandbox/protocol.py to reference NotBlankStr, and adjust
callers/tests if needed to satisfy the new type (or use explicit casts/wrappers)
to preserve behavior while enforcing non-blank identifiers.

In `@src/synthorg/tools/sandbox/subprocess_sandbox.py`:
- Around line 614-616: The release_owner method currently takes an unused
owner_id; make its unused intent explicit to satisfy linters by renaming the
parameter to _owner_id or adding the same suppression comment used in execute()
(e.g., "# noqa: ARG002") next to the parameter, so update async def
release_owner(self, owner_id: str) -> None: to match the execute() pattern and
avoid ARG002 warnings while preserving the method signature semantics.

In `@tests/unit/tools/sandbox/test_docker_sandbox.py`:
- Around line 1205-1223: Add a companion test that mirrors
test_contextvar_fallback_when_no_owner_id but exercises the per-task
owner-resolution: create a new async test (e.g.,
test_contextvar_fallback_when_no_owner_id_per_task) that uses
_make_mock_docker(), FakeClock(), and _per_task_sandbox(...) instead of
_per_agent_sandbox, bind structlog.contextvars with task_id="ctx-task" (omit
owner_id), wrap calls to sandbox.execute(...) inside the
_patch_aiodocker(mock_docker) context, call execute twice (different args),
assert mock_docker.containers.create.await_count == 1, and finally
reset_contextvars using the tokens—this mirrors the per-agent test but verifies
task_id-derived owner fallback.

---

Outside diff comments:
In `@tests/unit/workers/test_execution_service.py`:
- Around line 304-388: Add a new async test that simulates a failing engine run
to verify the sandbox owner is still released: create an AgentEngine with
run=AsyncMock(side_effect=RuntimeError("boom")), a SandboxBackend with
release_owner=AsyncMock, instantiate AgentEngineExecutionService (use
lifecycle_strategy_kind=STRATEGY_PER_TASK or STRATEGY_PER_AGENT as appropriate),
call execute_once and catch the RuntimeError if it propagates, then assert
release_owner was awaited once with the expected id; reference
AgentEngineExecutionService.execute_once, AgentEngine.run, and
SandboxBackend.release_owner to locate where to inject this test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: ad4d43c4-ee3d-432f-924c-eba6715ae419

📥 Commits

Reviewing files that changed from the base of the PR and between 180b38a and b744714.

📒 Files selected for processing (23)
  • docs/design/tools.md
  • scripts/_ghost_wiring_manifest.txt
  • src/synthorg/observability/events/docker.py
  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/workers/runtime_builder.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py
  • tests/unit/workers/test_execution_service.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Build Backend
  • GitHub Check: Lighthouse Site
  • GitHub Check: Build Web Assets (melange)
  • GitHub Check: CodSpeed Python benchmarks
  • GitHub Check: Test Integration
  • GitHub Check: Test Conformance (SQLite)
  • GitHub Check: Test E2E
  • GitHub Check: Test Unit
  • GitHub Check: Build Preview
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (7)
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/**/*.py: Only src/synthorg/persistence/ may import sqlite/psycopg or emit raw SQL
Use DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default with read_only_post_init (Cat-2); Cat-3 bootstrap secrets are pure env at boot site. YAML is company-template ingestion format only, not a precedence tier. No os.environ.get outside startup; pre-init Cat-2 reads use settings.bootstrap_resolver.resolve_init_value
Use from synthorg.observability import get_logger; variable always named logger. Never import logging or use print() in app code. Event names from observability.events. constants; structured kwargs (logger.info(EVENT, key=value))
Error paths log WARNING/ERROR with context before raising; state transitions log INFO via *_STATUS_TRANSITIONED after persistence write
Never log error=str(exc) or interpolate {exc}; use error_type=type(exc).name + error=safe_error_description(exc). Never use exc_info=True. OTel: forbidden span.record_exception(exc); use span.set_attribute("exception.message", safe_error_description(exc)) + record_exception=False, set_status_on_exception=False. Enforced by check_logger_exception_str_exc.py

Files:

  • src/synthorg/observability/events/docker.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: Numerics live in settings/definitions/; allowlist 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal). Enforced by scripts/check_no_magic_numbers.py
Comments should explain WHY only; no reviewer citations, issue back-refs, or migration framing. Enforced by check_no_review_origin_in_code.py and check_no_migration_framing.py
Do not use from __future__ import annotations (Python 3.14 has PEP 649). Use PEP 758 except: except A, B: requires parens when binding
Type hints required on public functions; mypy strict. Google-style docstrings. Line length 88; functions <50 lines; files <800 lines
Define errors as <Domain><Condition>Error inheriting from DomainError; never inherit from Exception/RuntimeError/etc directly. Enforced by check_domain_error_hierarchy.py
Pydantic v2: use frozen + extra="forbid" on every frozen model project-wide (gate check_frozen_model_extra_forbid.py; @computed_field auto-exempt, per-line # lint-allow: frozen-extra-forbid -- for extra="allow"/"ignore" boundaries). Use @computed_field for derived fields. Use NotBlankStr for identifiers
Args models at every system boundary; parse_typed() for every external dict ingestion. Enforced by check_boundary_typed.py
Use immutability patterns: model_copy(update=...) or copy.deepcopy(); deepcopy at system boundaries
Use asyncio.TaskGroup for fan-out/fan-in async patterns; helpers must catch Exception (re-raise MemoryError/RecursionError)
Clock seam: include clock: Clock | None = None parameter; tests inject FakeClock. Lifecycle: services own _lifecycle_lock; timed-out stops mark unrestartable
For untrusted content (SEC-1): use wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML
Never use real vendor names in project code/tests; use example-provider, test-provider, example-{large,medium,small}-001. Allowed in .claude/, third-party imports, providers/presets.py, we...

Files:

  • src/synthorg/observability/events/docker.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/docker_sandbox.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

  • src/synthorg/observability/events/docker.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
src/synthorg/observability/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Telemetry: opt-in, off by default. Every event property must be in _ALLOWED_PROPERTIES. See telemetry.md

Files:

  • src/synthorg/observability/events/docker.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/observability/events/sandbox.py
{README.md,docs/**/*.md,data/**/*.md}

📄 CodeRabbit inference engine (CLAUDE.md)

Numerics in README and public docs must be sourced from data/runtime_stats.yaml via markers

Files:

  • docs/design/tools.md
docs/**/*.{md,d2}

📄 CodeRabbit inference engine (CLAUDE.md)

Use d2 for architecture/nested containers, mermaid for flowcharts/sequence/pipelines. Markdown tables for tabular data. D2 theme 200 (Dark Mauve), D2 CLI pinned to v0.7.1 in CI

Files:

  • docs/design/tools.md
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Test markers: @pytest.mark.{unit,integration,e2e,slow}. Async auto. Timeout 30s global. Coverage 80% min
Windows: unit tests use WindowsSelectorEventLoopPolicy (3.14 IOCP teardown race). Subprocess tests override back
Test doubles: use FakeClock for Clock seam, mock_ofT for typed-boundary substitutions, SimpleNamespace for attribute-bags. Never use bare MagicMock at typed boundary (constructor/fn arg/annotated local/typed fixture return); blocked by scripts/check_mock_spec.py (zero-tolerance, no baseline). Import FakeClock and mock_of from tests._shared
Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...))
Flaky tests: NEVER skip/xfail; fix fundamentally. Use asyncio.Event().wait() not sleep(large)

Files:

  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py
  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/workers/test_execution_service.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py
  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/workers/test_execution_service.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
src/synthorg/{workers,api}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Runtime services: synthorg.workers.runtime_builder.build_runtime_services selects behind ONE provider-present switch, returning RuntimeServices (AgentEngineExecutionService + coordinator or NoProviderExecutionService + None). _install_runtime_services appends FIRST after persistence/SettingsService hooks; swap_worker_execution_service / swap_coordinator / swap_provider_registry hold a lock. Empty-company rejects task creation (AgentRuntimeNotConfiguredError 4014) and /coordinate 503s

Files:

  • src/synthorg/workers/execution_service.py
  • src/synthorg/workers/runtime_builder.py
🧠 Learnings (5)
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

  • src/synthorg/observability/events/docker.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/observability/events/sandbox.py
  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/workers/runtime_builder.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py
  • tests/unit/tools/sandbox/test_protocol.py
  • src/synthorg/tools/sandbox/factory.py
  • tests/unit/workers/test_execution_service.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
📚 Learning: 2026-05-16T18:36:31.446Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/reference/conventions.md:787-789
Timestamp: 2026-05-16T18:36:31.446Z
Learning: In Aureliolo/synthorg, do not require adding `<!--RS:...-->` “Doc Numeric Claims (MANDATORY)” numeric macros for Python version numbers mentioned in documentation prose (e.g., “Python 3.14”, “Python 3.15”). The `scripts/check_doc_numeric_macros.py` gate only applies to `README.md`, `docs/index.md`, `docs/roadmap/index.md`, `docs/architecture/decisions.md`, and `docs/reference/convention-gates.md`, and it only flags digits adjacent to specific stat nouns (tests/providers/agents/stars/releases), not language version mentions like “Python 3.14”.

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo, account for the CI gate `check_doc_numeric_macros.py`: it skips fenced code blocks entirely, and it only flags digits that are adjacent to these stat nouns: `tests`, `providers`, `agents`, `stars`, `releases`. Therefore, numeric examples such as CLI flag values (e.g., `--num-workers=4` in fenced bash blocks) and prose version numbers (e.g., `3.14`/`3.15`) are not expected to trigger this check; prioritize changes only when digits appear next to one of the listed nouns (e.g., “5 tests”, “10 stars”, etc.).

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing markdown files for the "Doc Numeric Claims (MANDATORY)" RS-marker rule, only require/flag missing RS markers in the files that are actually in-scope for the rule. The scope is enforced via an identical _SCOPED_FILES allowlist in scripts/check_doc_numeric_macros.py and scripts/inject_runtime_stats.py, and currently includes: README.md; docs/index.md; docs/roadmap/index.md; docs/architecture/decisions.md; docs/reference/convention-gates.md. For any other markdown files (e.g., docs/getting_started.md, docs/guides/*), missing RS markers for numeric claims are no-ops and should NOT be flagged.

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo against the `check_doc_numeric_macros.py` gate, account for its documented behavior: it skips fenced code blocks entirely, and it only flags digits that are adjacent to specific stat nouns (`tests`, `providers`, `agents`, `stars`, `releases`). As a result, CLI-style numbers (e.g., `--num-workers=4`) inside fenced bash code blocks should never be treated as violations of this gate; only non-fenced text needs checking, and only around those specific nouns.

Applied to files:

  • docs/design/tools.md
🔇 Additional comments (34)
src/synthorg/tools/sandbox/factory.py (4)

52-76: LGTM!


79-101: LGTM!


115-166: LGTM!


299-307: LGTM!

src/synthorg/workers/execution_service.py (4)

54-62: LGTM!


218-245: LGTM!


284-310: LGTM!


373-420: LGTM!

src/synthorg/workers/runtime_builder.py (3)

40-41: LGTM!

Also applies to: 49-49, 56-56


129-166: LGTM!


354-357: LGTM!

Also applies to: 380-387

src/synthorg/tools/sandbox/lifecycle/protocol.py (1)

46-57: LGTM!

src/synthorg/tools/sandbox/lifecycle/config.py (1)

3-3: LGTM!

Also applies to: 7-13

src/synthorg/tools/sandbox/lifecycle/per_agent.py (1)

59-63: LGTM!

src/synthorg/tools/sandbox/lifecycle/per_call.py (1)

28-32: LGTM!

src/synthorg/tools/sandbox/lifecycle/per_task.py (1)

35-39: LGTM!

src/synthorg/observability/events/docker.py (1)

17-18: LGTM!

src/synthorg/observability/events/sandbox.py (1)

46-52: LGTM!

src/synthorg/observability/events/workers.py (1)

83-88: LGTM!

src/synthorg/tools/sandbox/docker_sandbox_exec.py (7)

1-78: LGTM!


80-157: LGTM!


162-245: LGTM!


247-273: LGTM!


278-439: LGTM!


445-700: LGTM!


706-741: LGTM!

src/synthorg/tools/sandbox/docker_sandbox.py (6)

109-124: LGTM!


126-190: LGTM!


226-252: LGTM!


341-421: LGTM!


527-567: LGTM!


569-681: LGTM!

src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py (2)

28-53: LGTM!


171-219: LGTM!

Comment thread docs/design/tools.md Outdated
Comment thread src/synthorg/tools/sandbox/docker_sandbox_exec.py
Comment thread src/synthorg/tools/sandbox/lifecycle/per_task.py
Comment thread src/synthorg/tools/sandbox/protocol.py Outdated
Comment on lines +614 to +616
async def release_owner(self, owner_id: str) -> None:
"""No-op -- subprocesses hold no per-owner resources."""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Unused parameter should be marked.

The owner_id parameter is intentionally unused. Add _ prefix or # noqa: ARG002 to suppress linting warnings consistently with how execute() handles its unused owner_id at line 526.

♻️ Proposed fix
-    async def release_owner(self, owner_id: str) -> None:
+    async def release_owner(self, owner_id: str) -> None:  # noqa: ARG002
         """No-op -- subprocesses hold no per-owner resources."""
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/synthorg/tools/sandbox/subprocess_sandbox.py` around lines 614 - 616, The
release_owner method currently takes an unused owner_id; make its unused intent
explicit to satisfy linters by renaming the parameter to _owner_id or adding the
same suppression comment used in execute() (e.g., "# noqa: ARG002") next to the
parameter, so update async def release_owner(self, owner_id: str) -> None: to
match the execute() pattern and avoid ARG002 warnings while preserving the
method signature semantics.

Comment thread tests/unit/tools/sandbox/test_docker_sandbox.py
@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 83.74233% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.04%. Comparing base (4140fc5) to head (d8a873c).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/synthorg/tools/sandbox/docker_sandbox_exec.py 86.66% 17 Missing and 9 partials ⚠️
src/synthorg/tools/sandbox/lifecycle/per_task.py 62.06% 11 Missing ⚠️
src/synthorg/tools/sandbox/lifecycle/per_agent.py 57.89% 8 Missing ⚠️
...synthorg/tools/sandbox/docker_sandbox_lifecycle.py 33.33% 4 Missing ⚠️
src/synthorg/tools/sandbox/factory.py 50.00% 2 Missing ⚠️
src/synthorg/tools/sandbox/docker_sandbox.py 95.45% 1 Missing ⚠️
src/synthorg/workers/execution_service.py 96.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2008      +/-   ##
==========================================
+ Coverage   85.03%   85.04%   +0.01%     
==========================================
  Files        1887     1888       +1     
  Lines      111644   111930     +286     
  Branches     9522     9549      +27     
==========================================
+ Hits        94933    95191     +258     
- Misses      14388    14415      +27     
- Partials     2323     2324       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/synthorg/tools/sandbox/docker_sandbox_exec.py`:
- Around line 355-357: The code untracks sidecar retry handles too early causing
leaked sidecars if deletion fails; update the flows in _create_started_container
and _destroy_handle (and the similar blocks around lines 370-373 and 720-741) so
that you do not call _untrack_container(f"_sidecar:{sidecar_id}") until the
sidecar deletion has actually succeeded (i.e., await the delete operation and
confirm success), and only then untrack; in other words, keep the tracked
_sidecar:{sidecar_id} entry until the paired sidecar removal completes, and
adjust the order of await self._track_container(...), container_obj.start(),
sidecar delete, and await self._untrack_container(...) accordingly to ensure
retries via cleanup_all() remain possible.
- Around line 653-664: The helper _exec_returncode currently catches Exception
around exec_obj.inspect(), which incorrectly swallows system-level failures;
update the except block to re-raise MemoryError and RecursionError (i.e. if
isinstance(exc, (MemoryError, RecursionError)): raise) before logging and
returning -1, keeping the existing logger.warning call
(DOCKER_EXEC_INSPECT_FAILED, container_id, error_type,
safe_error_description(exc)) for all other exceptions so non-critical inspect
failures still return -1.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 901c0f1c-c8a0-4bcd-be2c-3d932eca6d31

📥 Commits

Reviewing files that changed from the base of the PR and between b744714 and dd842af.

📒 Files selected for processing (12)
  • docs/design/tools.md
  • scripts/git-hooks/_run-hook.sh
  • scripts/run_affected_tests.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/tools/sandbox/protocol.py
  • tests/unit/scripts/test_git_hooks_wrapper.py
  • tests/unit/scripts/test_run_affected_tests.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/workers/test_execution_service.py
💤 Files with no reviewable changes (1)
  • tests/unit/scripts/test_git_hooks_wrapper.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Build Backend
  • GitHub Check: CodSpeed Python benchmarks
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Lighthouse Site
  • GitHub Check: Build Web Assets (melange)
  • GitHub Check: Test Integration
  • GitHub Check: Test Conformance (SQLite)
  • GitHub Check: Test E2E
  • GitHub Check: Test Unit
🧰 Additional context used
📓 Path-based instructions (5)
{README.md,docs/**/*.md,data/**/*.md}

📄 CodeRabbit inference engine (CLAUDE.md)

Numerics in README and public docs must be sourced from data/runtime_stats.yaml via markers

Files:

  • docs/design/tools.md
docs/**/*.{md,d2}

📄 CodeRabbit inference engine (CLAUDE.md)

Use d2 for architecture/nested containers, mermaid for flowcharts/sequence/pipelines. Markdown tables for tabular data. D2 theme 200 (Dark Mauve), D2 CLI pinned to v0.7.1 in CI

Files:

  • docs/design/tools.md
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/**/*.py: Only src/synthorg/persistence/ may import sqlite/psycopg or emit raw SQL
Use DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default with read_only_post_init (Cat-2); Cat-3 bootstrap secrets are pure env at boot site. YAML is company-template ingestion format only, not a precedence tier. No os.environ.get outside startup; pre-init Cat-2 reads use settings.bootstrap_resolver.resolve_init_value
Use from synthorg.observability import get_logger; variable always named logger. Never import logging or use print() in app code. Event names from observability.events. constants; structured kwargs (logger.info(EVENT, key=value))
Error paths log WARNING/ERROR with context before raising; state transitions log INFO via *_STATUS_TRANSITIONED after persistence write
Never log error=str(exc) or interpolate {exc}; use error_type=type(exc).name + error=safe_error_description(exc). Never use exc_info=True. OTel: forbidden span.record_exception(exc); use span.set_attribute("exception.message", safe_error_description(exc)) + record_exception=False, set_status_on_exception=False. Enforced by check_logger_exception_str_exc.py

Files:

  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: Numerics live in settings/definitions/; allowlist 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal). Enforced by scripts/check_no_magic_numbers.py
Comments should explain WHY only; no reviewer citations, issue back-refs, or migration framing. Enforced by check_no_review_origin_in_code.py and check_no_migration_framing.py
Do not use from __future__ import annotations (Python 3.14 has PEP 649). Use PEP 758 except: except A, B: requires parens when binding
Type hints required on public functions; mypy strict. Google-style docstrings. Line length 88; functions <50 lines; files <800 lines
Define errors as <Domain><Condition>Error inheriting from DomainError; never inherit from Exception/RuntimeError/etc directly. Enforced by check_domain_error_hierarchy.py
Pydantic v2: use frozen + extra="forbid" on every frozen model project-wide (gate check_frozen_model_extra_forbid.py; @computed_field auto-exempt, per-line # lint-allow: frozen-extra-forbid -- for extra="allow"/"ignore" boundaries). Use @computed_field for derived fields. Use NotBlankStr for identifiers
Args models at every system boundary; parse_typed() for every external dict ingestion. Enforced by check_boundary_typed.py
Use immutability patterns: model_copy(update=...) or copy.deepcopy(); deepcopy at system boundaries
Use asyncio.TaskGroup for fan-out/fan-in async patterns; helpers must catch Exception (re-raise MemoryError/RecursionError)
Clock seam: include clock: Clock | None = None parameter; tests inject FakeClock. Lifecycle: services own _lifecycle_lock; timed-out stops mark unrestartable
For untrusted content (SEC-1): use wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML
Never use real vendor names in project code/tests; use example-provider, test-provider, example-{large,medium,small}-001. Allowed in .claude/, third-party imports, providers/presets.py, we...

Files:

  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Test markers: @pytest.mark.{unit,integration,e2e,slow}. Async auto. Timeout 30s global. Coverage 80% min
Windows: unit tests use WindowsSelectorEventLoopPolicy (3.14 IOCP teardown race). Subprocess tests override back
Test doubles: use FakeClock for Clock seam, mock_ofT for typed-boundary substitutions, SimpleNamespace for attribute-bags. Never use bare MagicMock at typed boundary (constructor/fn arg/annotated local/typed fixture return); blocked by scripts/check_mock_spec.py (zero-tolerance, no baseline). Import FakeClock and mock_of from tests._shared
Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...))
Flaky tests: NEVER skip/xfail; fix fundamentally. Use asyncio.Event().wait() not sleep(large)

Files:

  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/workers/test_execution_service.py
  • tests/unit/scripts/test_run_affected_tests.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/workers/test_execution_service.py
  • tests/unit/scripts/test_run_affected_tests.py
🧠 Learnings (5)
📚 Learning: 2026-05-16T18:36:31.446Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/reference/conventions.md:787-789
Timestamp: 2026-05-16T18:36:31.446Z
Learning: In Aureliolo/synthorg, do not require adding `<!--RS:...-->` “Doc Numeric Claims (MANDATORY)” numeric macros for Python version numbers mentioned in documentation prose (e.g., “Python 3.14”, “Python 3.15”). The `scripts/check_doc_numeric_macros.py` gate only applies to `README.md`, `docs/index.md`, `docs/roadmap/index.md`, `docs/architecture/decisions.md`, and `docs/reference/convention-gates.md`, and it only flags digits adjacent to specific stat nouns (tests/providers/agents/stars/releases), not language version mentions like “Python 3.14”.

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo, account for the CI gate `check_doc_numeric_macros.py`: it skips fenced code blocks entirely, and it only flags digits that are adjacent to these stat nouns: `tests`, `providers`, `agents`, `stars`, `releases`. Therefore, numeric examples such as CLI flag values (e.g., `--num-workers=4` in fenced bash blocks) and prose version numbers (e.g., `3.14`/`3.15`) are not expected to trigger this check; prioritize changes only when digits appear next to one of the listed nouns (e.g., “5 tests”, “10 stars”, etc.).

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing markdown files for the "Doc Numeric Claims (MANDATORY)" RS-marker rule, only require/flag missing RS markers in the files that are actually in-scope for the rule. The scope is enforced via an identical _SCOPED_FILES allowlist in scripts/check_doc_numeric_macros.py and scripts/inject_runtime_stats.py, and currently includes: README.md; docs/index.md; docs/roadmap/index.md; docs/architecture/decisions.md; docs/reference/convention-gates.md. For any other markdown files (e.g., docs/getting_started.md, docs/guides/*), missing RS markers for numeric claims are no-ops and should NOT be flagged.

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo against the `check_doc_numeric_macros.py` gate, account for its documented behavior: it skips fenced code blocks entirely, and it only flags digits that are adjacent to specific stat nouns (`tests`, `providers`, `agents`, `stars`, `releases`). As a result, CLI-style numbers (e.g., `--num-workers=4`) inside fenced bash code blocks should never be treated as violations of this gate; only non-fenced text needs checking, and only around those specific nouns.

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/workers/test_execution_service.py
  • scripts/run_affected_tests.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • tests/unit/scripts/test_run_affected_tests.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
🔇 Additional comments (9)
docs/design/tools.md (1)

179-196: LGTM!

src/synthorg/tools/sandbox/protocol.py (1)

66-80: LGTM!

scripts/git-hooks/_run-hook.sh (1)

42-78: LGTM!

src/synthorg/tools/sandbox/lifecycle/per_agent.py (1)

13-13: LGTM!

Also applies to: 59-63, 132-142, 201-211, 252-261, 322-331

src/synthorg/tools/sandbox/lifecycle/per_task.py (1)

11-11: LGTM!

Also applies to: 35-39, 98-116, 132-141

tests/unit/tools/sandbox/test_docker_sandbox.py (1)

13-25: LGTM!

Also applies to: 32-104, 132-133, 238-243, 255-255, 260-260, 382-383, 601-614, 777-778, 990-991, 1043-1298

tests/unit/workers/test_execution_service.py (1)

25-31: LGTM!

Also applies to: 304-417

scripts/run_affected_tests.py (1)

86-115: LGTM!

Also applies to: 855-855

tests/unit/scripts/test_run_affected_tests.py (1)

902-919: LGTM!

Also applies to: 942-959, 1018-1020

Comment thread src/synthorg/tools/sandbox/docker_sandbox_exec.py
Comment thread src/synthorg/tools/sandbox/docker_sandbox_exec.py
@Aureliolo Aureliolo force-pushed the fix/sandbox-lifecycle branch from dd842af to b29a9e8 Compare May 18, 2026 19:26
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview May 18, 2026 19:27 — with GitHub Actions Inactive
@Keesan12

This comment was marked as spam.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/synthorg/tools/sandbox/lifecycle/per_agent.py (1)

96-150: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Concurrent first acquires can leak the losing container.

When two acquire() calls race for a new owner_id, the loser path looks up self._destroy_fns.get(owner_id) to tear down the extra handle. That map is only populated in release(), so the first collision just logs "no destroy_fn available" and leaves the losing container running. This needs a destroy callback that is available during acquire() (or another backend-owned teardown path), otherwise parallel calls for the same owner leak warm containers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/synthorg/tools/sandbox/lifecycle/per_agent.py` around lines 96 - 150, The
race in acquire() leaks the losing container because self._destroy_fns is only
populated in release(), so change acquire() (the code around the owner_id
collision check) to ensure a destroy callback is available immediately: when
creating/receiving the new handle in acquire(), store its destroy function into
self._destroy_fns[owner_id] (or attach the destroy callback to the handle and
read it back) before the re-check and insertion into self._containers, and on
the loser path use that stored callback (instead of relying on release()) to
destroy the losing handle; keep existing use of _cancel_timer/_cancel_idle_timer
and ensure any assignment/cleanup of self._destroy_fns mirrors the existing
release() teardown logic so no mapping is left stale.
tests/unit/tools/sandbox/test_sandbox_factory.py (1)

58-63: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add coverage for the injected lifecycle-strategy path.

These assertions only prove the fallback lifecycle_strategy=None case. Please add a case that passes a concrete lifecycle strategy into build_sandbox_backends() and asserts it is forwarded to DockerSandbox; otherwise the new boot-time wiring can regress without failing this suite.

Also applies to: 118-123

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/tools/sandbox/test_sandbox_factory.py` around lines 58 - 63, Add a
test case that verifies a provided lifecycle strategy is forwarded into
DockerSandbox: call build_sandbox_backends(...) with a concrete
lifecycle_strategy object (not None), then assert
mock_docker_cls.assert_called_once_with(...) includes lifecycle_strategy=<that
object> (and still matches config=config.docker, workspace=tmp_path,
tracked_container_repo=None). Repeat this same pattern for the other location
referenced (the block around lines 118-123) so both test paths cover injected
lifecycle_strategy forwarding.
♻️ Duplicate comments (1)
src/synthorg/tools/sandbox/docker_sandbox_exec.py (1)

278-287: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Sidecar failures can still orphan the last cleanup anchor.

_bring_up_sidecar() tracks a pseudo-key (_sidecar:{id}), but cleanup() later iterates _tracked_containers keys as real container IDs. Then Line 357 drops that pseudo-key before container_obj.start() has succeeded. If startup later fails and the sidecar delete also fails once, the sandbox entry can be untracked while no real sidecar ID remains in _tracked_containers, so the sidecar leaks until manual cleanup. Keep an actual sidecar ID tracked until its delete succeeds, or teach cleanup to unwrap the temporary key.

Also applies to: 300-317, 355-377

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/synthorg/tools/sandbox/docker_sandbox_exec.py` around lines 278 - 287,
_brief: The current flow uses a pseudo-key "_sidecar:{id}" in
_tracked_containers while bringing up a sidecar in _bring_up_sidecar and then
untracks that pseudo-key too early, which can lead to leaked containers if
startup fails and removal also fails; change the tracking semantics so the real
Docker container ID is kept in _tracked_containers until its removal succeeds,
and make cleanup aware of the pseudo-key. Concretely: modify _bring_up_sidecar
to not replace the pseudo-key with a non-existent key or to store a mapping from
pseudo-key -> real container id only after container_obj.start() succeeds;
ensure _cleanup_failed_sidecar and _remove_container only call
_untrack_container when the actual container has been successfully removed (keep
the pseudo-key until removal returns true); and update cleanup() to detect and
unwrap pseudo-keys (e.g., treat keys prefixed with "_sidecar:" by resolving them
to the real container id if present) so iterating _tracked_containers will
attempt removal of the real container id rather than silently dropping the
pseudo-key.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/synthorg/workers/execution_service.py`:
- Around line 372-381: The resumed-path in _resume_parked() currently awaits
resume_parked_run() without ensuring sandbox ownership is released, so add the
same release call used in execute_once(): wrap the await
self.resume_parked_run(...) in a try/finally and in the finally call await
self._release_sandbox_owner(identity=ctx.identity, task_id=task_id) so any
reusable container acquired during the resumed run is released (matching
execute_once() behavior).

---

Outside diff comments:
In `@src/synthorg/tools/sandbox/lifecycle/per_agent.py`:
- Around line 96-150: The race in acquire() leaks the losing container because
self._destroy_fns is only populated in release(), so change acquire() (the code
around the owner_id collision check) to ensure a destroy callback is available
immediately: when creating/receiving the new handle in acquire(), store its
destroy function into self._destroy_fns[owner_id] (or attach the destroy
callback to the handle and read it back) before the re-check and insertion into
self._containers, and on the loser path use that stored callback (instead of
relying on release()) to destroy the losing handle; keep existing use of
_cancel_timer/_cancel_idle_timer and ensure any assignment/cleanup of
self._destroy_fns mirrors the existing release() teardown logic so no mapping is
left stale.

In `@tests/unit/tools/sandbox/test_sandbox_factory.py`:
- Around line 58-63: Add a test case that verifies a provided lifecycle strategy
is forwarded into DockerSandbox: call build_sandbox_backends(...) with a
concrete lifecycle_strategy object (not None), then assert
mock_docker_cls.assert_called_once_with(...) includes lifecycle_strategy=<that
object> (and still matches config=config.docker, workspace=tmp_path,
tracked_container_repo=None). Repeat this same pattern for the other location
referenced (the block around lines 118-123) so both test paths cover injected
lifecycle_strategy forwarding.

---

Duplicate comments:
In `@src/synthorg/tools/sandbox/docker_sandbox_exec.py`:
- Around line 278-287: _brief: The current flow uses a pseudo-key
"_sidecar:{id}" in _tracked_containers while bringing up a sidecar in
_bring_up_sidecar and then untracks that pseudo-key too early, which can lead to
leaked containers if startup fails and removal also fails; change the tracking
semantics so the real Docker container ID is kept in _tracked_containers until
its removal succeeds, and make cleanup aware of the pseudo-key. Concretely:
modify _bring_up_sidecar to not replace the pseudo-key with a non-existent key
or to store a mapping from pseudo-key -> real container id only after
container_obj.start() succeeds; ensure _cleanup_failed_sidecar and
_remove_container only call _untrack_container when the actual container has
been successfully removed (keep the pseudo-key until removal returns true); and
update cleanup() to detect and unwrap pseudo-keys (e.g., treat keys prefixed
with "_sidecar:" by resolving them to the real container id if present) so
iterating _tracked_containers will attempt removal of the real container id
rather than silently dropping the pseudo-key.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: ab793aa3-a9b9-430a-a230-e575d74be9af

📥 Commits

Reviewing files that changed from the base of the PR and between dd842af and b29a9e8.

📒 Files selected for processing (27)
  • docs/design/tools.md
  • scripts/_ghost_wiring_manifest.txt
  • scripts/git-hooks/_run-hook.sh
  • scripts/run_affected_tests.py
  • src/synthorg/observability/events/docker.py
  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/workers/runtime_builder.py
  • tests/unit/scripts/test_git_hooks_wrapper.py
  • tests/unit/scripts/test_run_affected_tests.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py
  • tests/unit/workers/test_execution_service.py
💤 Files with no reviewable changes (1)
  • tests/unit/scripts/test_git_hooks_wrapper.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Deploy Preview
  • GitHub Check: Build Backend
  • GitHub Check: Build Web Assets (melange)
  • GitHub Check: Test E2E
  • GitHub Check: Test Unit
  • GitHub Check: Test Conformance (SQLite)
  • GitHub Check: Test Integration
  • GitHub Check: CodSpeed Python benchmarks
  • GitHub Check: Lighthouse Site
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (7)
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Test markers: @pytest.mark.{unit,integration,e2e,slow}. Async auto. Timeout 30s global. Coverage 80% min
Windows: unit tests use WindowsSelectorEventLoopPolicy (3.14 IOCP teardown race). Subprocess tests override back
Test doubles: use FakeClock for Clock seam, mock_ofT for typed-boundary substitutions, SimpleNamespace for attribute-bags. Never use bare MagicMock at typed boundary (constructor/fn arg/annotated local/typed fixture return); blocked by scripts/check_mock_spec.py (zero-tolerance, no baseline). Import FakeClock and mock_of from tests._shared
Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...))
Flaky tests: NEVER skip/xfail; fix fundamentally. Use asyncio.Event().wait() not sleep(large)

Files:

  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/scripts/test_run_affected_tests.py
  • tests/unit/workers/test_execution_service.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/scripts/test_run_affected_tests.py
  • tests/unit/workers/test_execution_service.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/**/*.py: Only src/synthorg/persistence/ may import sqlite/psycopg or emit raw SQL
Use DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default with read_only_post_init (Cat-2); Cat-3 bootstrap secrets are pure env at boot site. YAML is company-template ingestion format only, not a precedence tier. No os.environ.get outside startup; pre-init Cat-2 reads use settings.bootstrap_resolver.resolve_init_value
Use from synthorg.observability import get_logger; variable always named logger. Never import logging or use print() in app code. Event names from observability.events. constants; structured kwargs (logger.info(EVENT, key=value))
Error paths log WARNING/ERROR with context before raising; state transitions log INFO via *_STATUS_TRANSITIONED after persistence write
Never log error=str(exc) or interpolate {exc}; use error_type=type(exc).name + error=safe_error_description(exc). Never use exc_info=True. OTel: forbidden span.record_exception(exc); use span.set_attribute("exception.message", safe_error_description(exc)) + record_exception=False, set_status_on_exception=False. Enforced by check_logger_exception_str_exc.py

Files:

  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/observability/events/docker.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: Numerics live in settings/definitions/; allowlist 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal). Enforced by scripts/check_no_magic_numbers.py
Comments should explain WHY only; no reviewer citations, issue back-refs, or migration framing. Enforced by check_no_review_origin_in_code.py and check_no_migration_framing.py
Do not use from __future__ import annotations (Python 3.14 has PEP 649). Use PEP 758 except: except A, B: requires parens when binding
Type hints required on public functions; mypy strict. Google-style docstrings. Line length 88; functions <50 lines; files <800 lines
Define errors as <Domain><Condition>Error inheriting from DomainError; never inherit from Exception/RuntimeError/etc directly. Enforced by check_domain_error_hierarchy.py
Pydantic v2: use frozen + extra="forbid" on every frozen model project-wide (gate check_frozen_model_extra_forbid.py; @computed_field auto-exempt, per-line # lint-allow: frozen-extra-forbid -- for extra="allow"/"ignore" boundaries). Use @computed_field for derived fields. Use NotBlankStr for identifiers
Args models at every system boundary; parse_typed() for every external dict ingestion. Enforced by check_boundary_typed.py
Use immutability patterns: model_copy(update=...) or copy.deepcopy(); deepcopy at system boundaries
Use asyncio.TaskGroup for fan-out/fan-in async patterns; helpers must catch Exception (re-raise MemoryError/RecursionError)
Clock seam: include clock: Clock | None = None parameter; tests inject FakeClock. Lifecycle: services own _lifecycle_lock; timed-out stops mark unrestartable
For untrusted content (SEC-1): use wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML
Never use real vendor names in project code/tests; use example-provider, test-provider, example-{large,medium,small}-001. Allowed in .claude/, third-party imports, providers/presets.py, we...

Files:

  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/observability/events/docker.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/docker_sandbox.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/observability/events/docker.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
src/synthorg/observability/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Telemetry: opt-in, off by default. Every event property must be in _ALLOWED_PROPERTIES. See telemetry.md

Files:

  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/observability/events/docker.py
  • src/synthorg/observability/events/workers.py
{README.md,docs/**/*.md,data/**/*.md}

📄 CodeRabbit inference engine (CLAUDE.md)

Numerics in README and public docs must be sourced from data/runtime_stats.yaml via markers

Files:

  • docs/design/tools.md
docs/**/*.{md,d2}

📄 CodeRabbit inference engine (CLAUDE.md)

Use d2 for architecture/nested containers, mermaid for flowcharts/sequence/pipelines. Markdown tables for tabular data. D2 theme 200 (Dark Mauve), D2 CLI pinned to v0.7.1 in CI

Files:

  • docs/design/tools.md
src/synthorg/{workers,api}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Runtime services: synthorg.workers.runtime_builder.build_runtime_services selects behind ONE provider-present switch, returning RuntimeServices (AgentEngineExecutionService + coordinator or NoProviderExecutionService + None). _install_runtime_services appends FIRST after persistence/SettingsService hooks; swap_worker_execution_service / swap_coordinator / swap_provider_registry hold a lock. Empty-company rejects task creation (AgentRuntimeNotConfiguredError 4014) and /coordinate 503s

Files:

  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/workers/execution_service.py
🧠 Learnings (5)
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • tests/unit/tools/sandbox/test_protocol.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/observability/events/docker.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • tests/unit/scripts/test_run_affected_tests.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/workers/runtime_builder.py
  • scripts/run_affected_tests.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • tests/unit/workers/test_execution_service.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
📚 Learning: 2026-05-16T18:36:31.446Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/reference/conventions.md:787-789
Timestamp: 2026-05-16T18:36:31.446Z
Learning: In Aureliolo/synthorg, do not require adding `<!--RS:...-->` “Doc Numeric Claims (MANDATORY)” numeric macros for Python version numbers mentioned in documentation prose (e.g., “Python 3.14”, “Python 3.15”). The `scripts/check_doc_numeric_macros.py` gate only applies to `README.md`, `docs/index.md`, `docs/roadmap/index.md`, `docs/architecture/decisions.md`, and `docs/reference/convention-gates.md`, and it only flags digits adjacent to specific stat nouns (tests/providers/agents/stars/releases), not language version mentions like “Python 3.14”.

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo, account for the CI gate `check_doc_numeric_macros.py`: it skips fenced code blocks entirely, and it only flags digits that are adjacent to these stat nouns: `tests`, `providers`, `agents`, `stars`, `releases`. Therefore, numeric examples such as CLI flag values (e.g., `--num-workers=4` in fenced bash blocks) and prose version numbers (e.g., `3.14`/`3.15`) are not expected to trigger this check; prioritize changes only when digits appear next to one of the listed nouns (e.g., “5 tests”, “10 stars”, etc.).

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing markdown files for the "Doc Numeric Claims (MANDATORY)" RS-marker rule, only require/flag missing RS markers in the files that are actually in-scope for the rule. The scope is enforced via an identical _SCOPED_FILES allowlist in scripts/check_doc_numeric_macros.py and scripts/inject_runtime_stats.py, and currently includes: README.md; docs/index.md; docs/roadmap/index.md; docs/architecture/decisions.md; docs/reference/convention-gates.md. For any other markdown files (e.g., docs/getting_started.md, docs/guides/*), missing RS markers for numeric claims are no-ops and should NOT be flagged.

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo against the `check_doc_numeric_macros.py` gate, account for its documented behavior: it skips fenced code blocks entirely, and it only flags digits that are adjacent to specific stat nouns (`tests`, `providers`, `agents`, `stars`, `releases`). As a result, CLI-style numbers (e.g., `--num-workers=4`) inside fenced bash code blocks should never be treated as violations of this gate; only non-fenced text needs checking, and only around those specific nouns.

Applied to files:

  • docs/design/tools.md
🔇 Additional comments (18)
src/synthorg/tools/sandbox/subprocess_sandbox.py (1)

614-616: Make the intentionally unused parameter explicit.

release_owner() keeps owner_id unused without an explicit ignore marker, which can still trip lint checks.

tests/unit/tools/sandbox/test_lifecycle_protocol.py (1)

5-12: LGTM!

Also applies to: 53-83

scripts/_ghost_wiring_manifest.txt (1)

35-35: LGTM!

src/synthorg/observability/events/sandbox.py (1)

46-52: LGTM!

src/synthorg/tools/sandbox/lifecycle/per_call.py (1)

28-32: LGTM!

tests/unit/tools/sandbox/test_protocol.py (1)

38-40: LGTM!

src/synthorg/tools/sandbox/lifecycle/protocol.py (1)

46-58: LGTM!

src/synthorg/observability/events/docker.py (1)

17-18: LGTM!

src/synthorg/tools/sandbox/protocol.py (1)

66-80: LGTM!

src/synthorg/tools/sandbox/lifecycle/config.py (1)

3-3: LGTM!

Also applies to: 7-13

src/synthorg/tools/sandbox/lifecycle/per_task.py (1)

11-11: LGTM!

Also applies to: 35-39, 98-116, 132-134, 139-140

tests/unit/scripts/test_run_affected_tests.py (1)

902-902: LGTM!

Also applies to: 904-907, 942-960, 1018-1018

docs/design/tools.md (1)

179-196: LGTM!

src/synthorg/observability/events/workers.py (1)

83-88: LGTM!

src/synthorg/workers/runtime_builder.py (1)

41-42: LGTM!

Also applies to: 50-50, 57-57, 133-142, 152-160, 164-164, 167-167, 363-363, 394-395

scripts/run_affected_tests.py (1)

86-103: LGTM!

Also applies to: 114-114, 855-855

tests/unit/tools/sandbox/test_sandbox_factory.py (1)

23-23: LGTM!

src/synthorg/tools/sandbox/docker_sandbox.py (1)

16-39: LGTM!

Also applies to: 108-181, 340-360, 520-674

Comment thread src/synthorg/workers/execution_service.py
Aureliolo added 7 commits May 18, 2026 21:48
per_task.release reinstates handle via setdefault on destroy failure so a live container stays tracked for cleanup_all retry; per_task + per_agent (5 sites) re-raise MemoryError/RecursionError and log error_type + safe_error_description; docker_sandbox_exec wraps ship_container_logs so a shipping failure cannot bypass execute() finally and leak the container; protocol.release_owner owner_id str -> NotBlankStr; docker_sandbox._merged_env_list delegates to _resolve_exec_env (DRY) and drops unused build_correlation_env import; docs clarify per-agent is the configured default and per-call degradation is a per-call no-owner fallback; tests add per-task contextvar fallback and engine.run-raises release coverage. Skipped CodeRabbit subprocess noqa ARG002 finding: ruff RUF100 proves ARG002 does not fire on the docstring-only stub release_owner, so the suggested noqa is an unused directive.
run_affected_tests._git stripped the whole git-status --porcelain blob, eating the leading space of the first porcelain line; a worktree-only modification (status ' M path') then parsed via the fixed index-3 slice as 'cripts/...' for 'scripts/...', so the follow-up git restore failed on a bogus pathspec and every push with such a dirtied file aborted. _git now takes strip=False and _tracked_dirty_paths requests unstripped porcelain; added a regression test pinning strip=False. _run-hook.sh tees the full pre-commit/pre-push stream to a per-worktree log under the git dir and re-emits the failing tail + log path to stderr on failure, so a caller's output cap can never hide the real hook diagnostic again. Removed the now-obsolete git-hooks wrapper test.
_exec_returncode now re-raises MemoryError/RecursionError before swallowing inspect() failures and returning -1 (project catastrophic-error convention). _create_started_container start-failure path and _destroy_handle now untrack the container only once removal is confirmed (start path: gate on container removal; _destroy_handle: gate on BOTH sandbox and sidecar removal) so a failed Docker delete keeps the tracking anchor (which carries sidecar_id) for cleanup()'s sweep to retry instead of orphaning the pair; the 'sidecar remains tracked' warning is now truthful. Added tests: _destroy_handle keeps tracking when sidecar removal fails and untracks when both succeed; _exec_returncode re-raises MemoryError and returns -1 on ordinary errors. Skipped the literal line-357 part of the CodeRabbit finding (re-track _sidecar key): line 357 is the correct provisional->handle tracking handoff after _track_container(container_id, sidecar_id); the sidecar stays anchored via the container entry and is reaped by cleanup()'s sweep, whereas keeping the _sidecar:<id> key would leave a permanent stale entry cleanup() can never satisfy (it would call _remove_container with the literal non-id string).
Thread destroy_fn into SandboxLifecycleStrategy.acquire() (protocol + per_agent/per_task/per_call + DockerSandbox caller) so a concurrent first-acquire race tears the losing container down instead of leaking it (per_agent no longer depends on release()-only _destroy_fns; per_task now destroys the losing handle). _resume_parked() releases the sandbox lifecycle owner in a finally, matching execute_once(), so reusable containers acquired during a resumed run are not leaked. _create_started_container start-failure path now untracks the container only when BOTH it and its paired sidecar are confirmed removed (closes the residual sidecar orphan the round-3 fix left in the start path). Added: concurrent-first-acquire loser-destroy tests for per_agent and per_task; injected lifecycle_strategy forwarding test for build_sandbox_backends; updated all lifecycle acquire call sites for the new destroy_fn parameter.
@Aureliolo Aureliolo force-pushed the fix/sandbox-lifecycle branch from b29a9e8 to 05abbe3 Compare May 18, 2026 19:52
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview May 18, 2026 19:54 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/synthorg/tools/sandbox/subprocess_sandbox.py (1)

614-616: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten release_owner identifier type and mark the arg intentionally unused.

Line 614 currently uses str and leaves an unused parameter unannotated for lint intent. Keep this aligned with the backend contract and make the no-op explicit.

Proposed fix
-    async def release_owner(self, owner_id: str) -> None:
+    async def release_owner(self, owner_id: NotBlankStr) -> None:  # noqa: ARG002
         """No-op -- subprocesses hold no per-owner resources."""
As per coding guidelines: "Use NotBlankStr for identifiers".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/synthorg/tools/sandbox/subprocess_sandbox.py` around lines 614 - 616,
Change the release_owner signature to accept the stricter identifier type and
mark the argument intentionally unused: update async def release_owner(self,
owner_id: str) -> None to async def release_owner(self, _owner_id: NotBlankStr)
-> None (or keep name owner_id but annotate as NotBlankStr and prefix with
underscore) and add the appropriate import for NotBlankStr; leave the body as a
no-op but ensure the unused parameter is clearly marked to satisfy linters and
reflect the backend contract.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/tools.md`:
- Around line 179-196: The docs currently conflict: the design text around
warm-container reuse (SandboxLifecycleConfig default "per-agent") and the Docker
backend section should be reconciled so they describe the same lifecycle model;
update the Docker backend wording to match the lifecycle described (constructed
at boot in workers/runtime_builder, injected into DockerSandbox, lifecycle owner
resolution by owner_id or structlog context, reuse semantics for
per-agent/per-task, per-call fallback behavior, AgentEngineExecutionService
releasing owners at task boundary, DockerSandbox.cleanup() calling
cleanup_all(), and managed label synthorg.managed=true) so both sections
consistently document the configured reuse strategies and the per-call
degradation behavior.

---

Duplicate comments:
In `@src/synthorg/tools/sandbox/subprocess_sandbox.py`:
- Around line 614-616: Change the release_owner signature to accept the stricter
identifier type and mark the argument intentionally unused: update async def
release_owner(self, owner_id: str) -> None to async def release_owner(self,
_owner_id: NotBlankStr) -> None (or keep name owner_id but annotate as
NotBlankStr and prefix with underscore) and add the appropriate import for
NotBlankStr; leave the body as a no-op but ensure the unused parameter is
clearly marked to satisfy linters and reflect the backend contract.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: e6122d6f-9185-4a13-9a7c-6fd21c07d036

📥 Commits

Reviewing files that changed from the base of the PR and between b29a9e8 and 05abbe3.

📒 Files selected for processing (30)
  • docs/design/tools.md
  • scripts/_ghost_wiring_manifest.txt
  • scripts/git-hooks/_run-hook.sh
  • scripts/run_affected_tests.py
  • src/synthorg/observability/events/docker.py
  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/workers/runtime_builder.py
  • tests/unit/scripts/test_git_hooks_wrapper.py
  • tests/unit/scripts/test_run_affected_tests.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/tools/sandbox/test_lifecycle_per_agent.py
  • tests/unit/tools/sandbox/test_lifecycle_per_call.py
  • tests/unit/tools/sandbox/test_lifecycle_per_task.py
  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py
  • tests/unit/workers/test_execution_service.py
💤 Files with no reviewable changes (1)
  • tests/unit/scripts/test_git_hooks_wrapper.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Build Backend
  • GitHub Check: Build Web Assets (melange)
  • GitHub Check: CodSpeed Python benchmarks
  • GitHub Check: Test Unit
  • GitHub Check: Test Conformance (SQLite)
  • GitHub Check: Test E2E
  • GitHub Check: Test Integration
  • GitHub Check: Lighthouse Site
  • GitHub Check: Build Preview
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (7)
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/synthorg/**/*.py: Only src/synthorg/persistence/ may import sqlite/psycopg or emit raw SQL
Use DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default with read_only_post_init (Cat-2); Cat-3 bootstrap secrets are pure env at boot site. YAML is company-template ingestion format only, not a precedence tier. No os.environ.get outside startup; pre-init Cat-2 reads use settings.bootstrap_resolver.resolve_init_value
Use from synthorg.observability import get_logger; variable always named logger. Never import logging or use print() in app code. Event names from observability.events. constants; structured kwargs (logger.info(EVENT, key=value))
Error paths log WARNING/ERROR with context before raising; state transitions log INFO via *_STATUS_TRANSITIONED after persistence write
Never log error=str(exc) or interpolate {exc}; use error_type=type(exc).name + error=safe_error_description(exc). Never use exc_info=True. OTel: forbidden span.record_exception(exc); use span.set_attribute("exception.message", safe_error_description(exc)) + record_exception=False, set_status_on_exception=False. Enforced by check_logger_exception_str_exc.py

Files:

  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/observability/events/docker.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: Numerics live in settings/definitions/; allowlist 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal). Enforced by scripts/check_no_magic_numbers.py
Comments should explain WHY only; no reviewer citations, issue back-refs, or migration framing. Enforced by check_no_review_origin_in_code.py and check_no_migration_framing.py
Do not use from __future__ import annotations (Python 3.14 has PEP 649). Use PEP 758 except: except A, B: requires parens when binding
Type hints required on public functions; mypy strict. Google-style docstrings. Line length 88; functions <50 lines; files <800 lines
Define errors as <Domain><Condition>Error inheriting from DomainError; never inherit from Exception/RuntimeError/etc directly. Enforced by check_domain_error_hierarchy.py
Pydantic v2: use frozen + extra="forbid" on every frozen model project-wide (gate check_frozen_model_extra_forbid.py; @computed_field auto-exempt, per-line # lint-allow: frozen-extra-forbid -- for extra="allow"/"ignore" boundaries). Use @computed_field for derived fields. Use NotBlankStr for identifiers
Args models at every system boundary; parse_typed() for every external dict ingestion. Enforced by check_boundary_typed.py
Use immutability patterns: model_copy(update=...) or copy.deepcopy(); deepcopy at system boundaries
Use asyncio.TaskGroup for fan-out/fan-in async patterns; helpers must catch Exception (re-raise MemoryError/RecursionError)
Clock seam: include clock: Clock | None = None parameter; tests inject FakeClock. Lifecycle: services own _lifecycle_lock; timed-out stops mark unrestartable
For untrusted content (SEC-1): use wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML
Never use real vendor names in project code/tests; use example-provider, test-provider, example-{large,medium,small}-001. Allowed in .claude/, third-party imports, providers/presets.py, we...

Files:

  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/observability/events/docker.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

  • src/synthorg/tools/sandbox/protocol.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/observability/events/docker.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/tools/sandbox/factory.py
  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • src/synthorg/workers/execution_service.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Test markers: @pytest.mark.{unit,integration,e2e,slow}. Async auto. Timeout 30s global. Coverage 80% min
Windows: unit tests use WindowsSelectorEventLoopPolicy (3.14 IOCP teardown race). Subprocess tests override back
Test doubles: use FakeClock for Clock seam, mock_ofT for typed-boundary substitutions, SimpleNamespace for attribute-bags. Never use bare MagicMock at typed boundary (constructor/fn arg/annotated local/typed fixture return); blocked by scripts/check_mock_spec.py (zero-tolerance, no baseline). Import FakeClock and mock_of from tests._shared
Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...))
Flaky tests: NEVER skip/xfail; fix fundamentally. Use asyncio.Event().wait() not sleep(large)

Files:

  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/tools/sandbox/test_lifecycle_per_task.py
  • tests/unit/scripts/test_run_affected_tests.py
  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • tests/unit/tools/sandbox/test_lifecycle_per_call.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py
  • tests/unit/workers/test_execution_service.py
  • tests/unit/tools/sandbox/test_lifecycle_per_agent.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/tools/sandbox/test_lifecycle_per_task.py
  • tests/unit/scripts/test_run_affected_tests.py
  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • tests/unit/tools/sandbox/test_lifecycle_per_call.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py
  • tests/unit/workers/test_execution_service.py
  • tests/unit/tools/sandbox/test_lifecycle_per_agent.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
src/synthorg/observability/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Telemetry: opt-in, off by default. Every event property must be in _ALLOWED_PROPERTIES. See telemetry.md

Files:

  • src/synthorg/observability/events/workers.py
  • src/synthorg/observability/events/docker.py
  • src/synthorg/observability/events/sandbox.py
{README.md,docs/**/*.md,data/**/*.md}

📄 CodeRabbit inference engine (CLAUDE.md)

Numerics in README and public docs must be sourced from data/runtime_stats.yaml via markers

Files:

  • docs/design/tools.md
docs/**/*.{md,d2}

📄 CodeRabbit inference engine (CLAUDE.md)

Use d2 for architecture/nested containers, mermaid for flowcharts/sequence/pipelines. Markdown tables for tabular data. D2 theme 200 (Dark Mauve), D2 CLI pinned to v0.7.1 in CI

Files:

  • docs/design/tools.md
src/synthorg/{workers,api}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Runtime services: synthorg.workers.runtime_builder.build_runtime_services selects behind ONE provider-present switch, returning RuntimeServices (AgentEngineExecutionService + coordinator or NoProviderExecutionService + None). _install_runtime_services appends FIRST after persistence/SettingsService hooks; swap_worker_execution_service / swap_coordinator / swap_provider_registry hold a lock. Empty-company rejects task creation (AgentRuntimeNotConfiguredError 4014) and /coordinate 503s

Files:

  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/workers/execution_service.py
🧠 Learnings (5)
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

  • src/synthorg/tools/sandbox/protocol.py
  • tests/unit/tools/sandbox/test_protocol.py
  • src/synthorg/tools/sandbox/subprocess_sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/config.py
  • src/synthorg/observability/events/workers.py
  • src/synthorg/observability/events/docker.py
  • tests/unit/tools/sandbox/test_lifecycle_per_task.py
  • src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py
  • src/synthorg/tools/sandbox/lifecycle/per_call.py
  • tests/unit/scripts/test_run_affected_tests.py
  • tests/unit/tools/sandbox/test_lifecycle_protocol.py
  • scripts/run_affected_tests.py
  • src/synthorg/workers/runtime_builder.py
  • src/synthorg/tools/sandbox/lifecycle/protocol.py
  • src/synthorg/tools/sandbox/factory.py
  • tests/unit/tools/sandbox/test_lifecycle_per_call.py
  • tests/unit/tools/sandbox/test_sandbox_factory.py
  • src/synthorg/observability/events/sandbox.py
  • src/synthorg/tools/sandbox/lifecycle/per_task.py
  • tests/unit/workers/test_execution_service.py
  • src/synthorg/tools/sandbox/docker_sandbox.py
  • src/synthorg/workers/execution_service.py
  • tests/unit/tools/sandbox/test_lifecycle_per_agent.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • src/synthorg/tools/sandbox/docker_sandbox_exec.py
  • src/synthorg/tools/sandbox/lifecycle/per_agent.py
📚 Learning: 2026-05-16T18:36:31.446Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/reference/conventions.md:787-789
Timestamp: 2026-05-16T18:36:31.446Z
Learning: In Aureliolo/synthorg, do not require adding `<!--RS:...-->` “Doc Numeric Claims (MANDATORY)” numeric macros for Python version numbers mentioned in documentation prose (e.g., “Python 3.14”, “Python 3.15”). The `scripts/check_doc_numeric_macros.py` gate only applies to `README.md`, `docs/index.md`, `docs/roadmap/index.md`, `docs/architecture/decisions.md`, and `docs/reference/convention-gates.md`, and it only flags digits adjacent to specific stat nouns (tests/providers/agents/stars/releases), not language version mentions like “Python 3.14”.

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo, account for the CI gate `check_doc_numeric_macros.py`: it skips fenced code blocks entirely, and it only flags digits that are adjacent to these stat nouns: `tests`, `providers`, `agents`, `stars`, `releases`. Therefore, numeric examples such as CLI flag values (e.g., `--num-workers=4` in fenced bash blocks) and prose version numbers (e.g., `3.14`/`3.15`) are not expected to trigger this check; prioritize changes only when digits appear next to one of the listed nouns (e.g., “5 tests”, “10 stars”, etc.).

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing markdown files for the "Doc Numeric Claims (MANDATORY)" RS-marker rule, only require/flag missing RS markers in the files that are actually in-scope for the rule. The scope is enforced via an identical _SCOPED_FILES allowlist in scripts/check_doc_numeric_macros.py and scripts/inject_runtime_stats.py, and currently includes: README.md; docs/index.md; docs/roadmap/index.md; docs/architecture/decisions.md; docs/reference/convention-gates.md. For any other markdown files (e.g., docs/getting_started.md, docs/guides/*), missing RS markers for numeric claims are no-ops and should NOT be flagged.

Applied to files:

  • docs/design/tools.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo against the `check_doc_numeric_macros.py` gate, account for its documented behavior: it skips fenced code blocks entirely, and it only flags digits that are adjacent to specific stat nouns (`tests`, `providers`, `agents`, `stars`, `releases`). As a result, CLI-style numbers (e.g., `--num-workers=4`) inside fenced bash code blocks should never be treated as violations of this gate; only non-fenced text needs checking, and only around those specific nouns.

Applied to files:

  • docs/design/tools.md
🔇 Additional comments (54)
src/synthorg/tools/sandbox/protocol.py (1)

66-80: LGTM!

scripts/_ghost_wiring_manifest.txt (1)

35-35: LGTM!

scripts/git-hooks/_run-hook.sh (1)

42-78: LGTM!

tests/unit/tools/sandbox/test_protocol.py (1)

38-39: LGTM!

src/synthorg/tools/sandbox/lifecycle/config.py (1)

3-3: LGTM!

Also applies to: 7-13

src/synthorg/observability/events/workers.py (1)

83-88: LGTM!

src/synthorg/observability/events/docker.py (1)

17-18: LGTM!

tests/unit/tools/sandbox/test_lifecycle_per_task.py (1)

17-19: LGTM!

Also applies to: 34-35, 46-51, 63-68, 72-101, 118-120, 148-154, 175-176, 180-181, 212-213, 217-218, 237-239

src/synthorg/tools/sandbox/docker_sandbox_lifecycle.py (1)

28-31: LGTM!

Also applies to: 44-45, 49-54, 181-198

src/synthorg/tools/sandbox/lifecycle/per_call.py (1)

28-31: LGTM!

Also applies to: 38-40

tests/unit/scripts/test_run_affected_tests.py (1)

902-908: LGTM!

Also applies to: 942-960, 1018-1020

tests/unit/tools/sandbox/test_lifecycle_protocol.py (1)

5-12: LGTM!

Also applies to: 53-83

scripts/run_affected_tests.py (1)

86-115: LGTM!

Also applies to: 855-856

src/synthorg/workers/runtime_builder.py (1)

41-43: LGTM!

Also applies to: 50-51, 57-58, 133-142, 152-167, 363-366, 394-396

src/synthorg/tools/sandbox/lifecycle/protocol.py (1)

46-57: LGTM!

Also applies to: 59-81

src/synthorg/tools/sandbox/factory.py (1)

14-14: LGTM!

Also applies to: 34-36, 52-76, 79-101, 115-138, 159-166, 299-307

tests/unit/tools/sandbox/test_lifecycle_per_call.py (1)

15-17: LGTM!

Also applies to: 29-33, 44-49

tests/unit/tools/sandbox/test_sandbox_factory.py (1)

20-20: LGTM!

Also applies to: 24-25, 59-91, 146-151

src/synthorg/observability/events/sandbox.py (1)

46-52: LGTM!

src/synthorg/tools/sandbox/lifecycle/per_task.py (1)

11-11: LGTM!

Also applies to: 35-38, 40-109, 111-147, 149-178

tests/unit/workers/test_execution_service.py (1)

28-33: LGTM!

Also applies to: 443-558

src/synthorg/tools/sandbox/docker_sandbox.py (1)

1-8: LGTM!

Also applies to: 16-16, 27-30, 33-33, 39-39, 41-53, 108-112, 115-123, 125-181, 196-233, 235-251, 340-360, 362-414, 520-561, 563-675

src/synthorg/workers/execution_service.py (4)

355-407: LGTM!


444-491: LGTM!


624-639: LGTM!


283-316: LGTM!

tests/unit/tools/sandbox/test_lifecycle_per_agent.py (6)

41-43: LGTM!

Also applies to: 60-72


116-146: LGTM!


152-178: LGTM!

Also applies to: 180-213, 215-226


232-286: LGTM!


320-377: LGTM!


383-416: LGTM!

tests/unit/tools/sandbox/test_docker_sandbox.py (7)

33-104: LGTM!


236-256: LGTM!


737-805: LGTM!


1113-1154: LGTM!


1156-1260: LGTM!


1262-1328: LGTM!


1330-1370: LGTM!

src/synthorg/tools/sandbox/docker_sandbox_exec.py (10)

162-245: LGTM!


247-272: LGTM!


278-317: LGTM!


319-397: LGTM!


399-453: LGTM!


459-612: LGTM!


614-682: LGTM!


684-716: LGTM!


722-763: LGTM!


765-830: LGTM!

src/synthorg/tools/sandbox/lifecycle/per_agent.py (5)

59-62: LGTM!


64-151: LGTM!


153-215: LGTM!


217-266: LGTM!


284-335: LGTM!

Comment thread docs/design/tools.md
docs/design/tools.md: reconcile the Sandbox Backends table + docker config-example comment with the lifecycle-strategy model (DockerSandbox is now a keep-alive container reused per the configured strategy, per-agent default; ~1-2s only on first acquire, warm thereafter; auto_remove governs teardown by the lifecycle strategy, not every execution). subprocess_sandbox.release_owner owner_id str -> NotBlankStr to match the SandboxBackend protocol contract. Skipped (again) the CodeRabbit 'add # noqa: ARG002' sub-part: ruff RUF100 (round 1) proved ARG002 does not fire on the docstring-only stub release_owner, so the directive would be flagged unused; the NotBlankStr alignment is the valid part and is applied.
@Aureliolo Aureliolo merged commit 03d2587 into main May 18, 2026
67 of 70 checks passed
@Aureliolo Aureliolo deleted the fix/sandbox-lifecycle branch May 18, 2026 20:10
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview May 18, 2026 20:10 — with GitHub Actions Inactive
Aureliolo pushed a commit that referenced this pull request May 19, 2026
<!-- HIGHLIGHTS_START -->
## Highlights

> _AI-generated summary (model: `openai/gpt-4.1-mini` via GitHub
Models). Commit-based changelog below._

### What you'll notice
- Multi-agent coordination is now active immediately on startup for
smoother operation.
- Governance rules are fully enforced during use, ensuring compliance at
all times.
- Coordination metrics update live, giving real-time insights into
system activity.
- Review agents are now reliably processed, preventing silent drops in
tasks.
- Sandbox containers can be reused for agents and tasks, speeding up
execution and reducing overhead.

### What's new
- Agents support online runtime with a minimal safety framework to
improve stability.
- Recorded LLM interactions can be deterministically replayed at the
provider interface.
- Distributed path validation has been enhanced for more robust data
routing.
- A client-simulation runtime was added for end-to-end testing of the
IntakeEngine.
- A new work pipeline spine architecture has been introduced to
streamline task processing.

### Under the hood
- Infrastructure, Python, and web dependencies have all been updated to
latest versions.
- Updated apko lockfiles in the CI/CD pipeline improve build
consistency.

<!-- HIGHLIGHTS_END -->

:robot: I have created a release *beep* *boop*
---


##
[0.8.6](v0.8.5...v0.8.6)
(2026-05-19)


### Features

* agent runtime online + minimal safety spine (runtime root)
([#2003](#2003))
([e5eef1a](e5eef1a)),
closes [#1956](#1956)
* deterministic recorded-LLM cassette replay at the provider chokepoint
([#2010](#2010))
([cabf55d](cabf55d))
* distributed path validation + hardening
([#2011](#2011))
([a382e4a](a382e4a)),
closes [#1966](#1966)
* wire IntakeEngine via boot client-simulation runtime (e2e test
harness) ([#2006](#2006))
([6a9c0aa](6a9c0aa)),
closes [#1961](#1961)
* work pipeline spine
([#1960](#1960))
([#2013](#2013))
([29b64e3](29b64e3))


### Bug Fixes

* bring the multi-agent coordinator online at boot
([#2007](#2007))
([180b38a](180b38a)),
closes [#1958](#1958)
* full governance enforcement online
([#1957](#1957))
([#2005](#2005))
([4140fc5](4140fc5))
* harden anti-ghost-wiring gate and fix silently-dropped review agents
([#2000](#2000))
([89b57ce](89b57ce))
* make coordination metrics live
([#1959](#1959))
([#2012](#2012))
([c4775e2](c4775e2))
* sandbox lifecycle dispatch (per-agent / per-task container reuse)
([#2008](#2008))
([03d2587](03d2587)),
closes [#1965](#1965)


### Documentation

* add GitButler concept-only concurrency research
([#1978](#1978))
([#2009](#2009))
([9e4f5c1](9e4f5c1))
* honest-hybrid refresh of README, site, and design specs
([#2001](#2001))
([f485bea](f485bea))


### CI/CD

* update apko lockfiles
([#2004](#2004))
([e2b9eee](e2b9eee))


### Maintenance

* Update Infrastructure dependencies
([#2014](#2014))
([0b16bdf](0b16bdf))
* Update Python dependencies
([#2015](#2015))
([a7224bb](a7224bb))
* Update Web dependencies
([#2016](#2016))
([7a7fe76](7a7fe76))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: synthorg-repo-bot[bot] <279117679+synthorg-repo-bot[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sandbox lifecycle dispatch (per-agent / per-task reuse)

2 participants