Merged
4 changes: 2 additions & 2 deletions CLAUDE.md
@@ -79,8 +79,8 @@ PYTHONPATH=. uv run zensical build # docs
- Two phases: **construction** (`create_app` body) wires synchronous services; **on_startup** (`_build_lifecycle.on_startup`) wires services that need a connected persistence backend.
- Construction-phase ordering invariants: `agent_registry` must be built BEFORE `auto_wire_meetings`; `tunnel_provider` is wired unconditionally (not gated by `integrations.enabled`).
- On-startup ordering invariants: `SettingsService` auto-wire must precede `WorkflowExecutionObserver` registration (so it picks up resolver-driven `max_subworkflow_depth` instead of the seed default); `OntologyService` wires after `persistence.connect()` via `_wire_ontology_service`.
- Worker execution service: `synthorg.workers.runtime_builder.build_worker_execution_service` selects behind the provider-present switch (`AgentEngineExecutionService` with a provider, `NoProviderExecutionService` empty-company backstop) and installs via the `AppState.worker_execution_service` seam. The boot install hook is appended FIRST after the persistence/SettingsService hooks so the once-only `set_worker_execution_service` cannot lose the race with the property's lazy `LifecycleAdvancingExecutionService` default. Empty-company also rejects task creation at the controller (`AgentRuntimeNotConfiguredError`, 4014). `swap_worker_execution_service` / `swap_provider_registry` hold a lock (synchronised against lazy reads).
- Setup completion: `post_setup_reinit()` (provider reload, agent bootstrap, AND worker-execution-service rebuild + hot-swap, defined in `src/synthorg/api/controllers/setup/agent_helpers.py`) propagates failures, and `settings_svc.set("api", "setup_complete", "true")` only runs if reinit returns clean. The whole check/validate/reinit/persist sequence is serialised under `COMPLETE_LOCK` in the same module so two concurrent `/setup/complete` requests cannot race on the flag write. A half-configured runtime presenting itself as "complete" is worse than a clear error the operator can retry after fixing the underlying provider config.
- Runtime services: `synthorg.workers.runtime_builder.build_runtime_services` selects behind ONE provider-present switch and returns a `RuntimeServices` pair (worker execution service + multi-agent coordinator) built from a SINGLE shared boot `AgentEngine`: `AgentEngineExecutionService` + a `build_coordinator(...)` coordinator with a provider, `NoProviderExecutionService` + `None` coordinator as the empty-company backstop. The `_install_runtime_services` boot hook installs both via the `AppState.worker_execution_service` and `AppState.coordinator` seams; it is appended FIRST after the persistence/SettingsService hooks so the once-only `set_worker_execution_service` / `set_coordinator` cannot lose the race with the worker property's lazy `LifecycleAdvancingExecutionService` default. Empty-company rejects task creation at the controller (`AgentRuntimeNotConfiguredError`, 4014) and `/coordinate` honestly 503s (no coordinator). `swap_worker_execution_service` / `swap_coordinator` / `swap_provider_registry` hold a lock (synchronised against lazy reads).
- Setup completion: `post_setup_reinit()` (provider reload, agent bootstrap, AND runtime-services rebuild + dual hot-swap of the worker execution service and coordinator, defined in `src/synthorg/api/controllers/setup/agent_helpers.py`) propagates failures, and `settings_svc.set("api", "setup_complete", "true")` only runs if reinit returns clean. The whole check/validate/reinit/persist sequence is serialised under `COMPLETE_LOCK` in the same module so two concurrent `/setup/complete` requests cannot race on the flag write. A half-configured runtime presenting itself as "complete" is worse than a clear error the operator can retry after fixing the underlying provider config.
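The single provider-present switch in the runtime-services bullet can be pictured with a minimal sketch. Every name here (`Engine`, `EngineBackedService`, `NoProviderService`, `Coordinator`, `RuntimePair`, `build_pair`) is a hypothetical stand-in rather than the SynthOrg API; the real `build_runtime_services` is async and takes `app_state`, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Optional, Union


class Engine:
    """Hypothetical stand-in for the single shared boot engine."""


@dataclass(frozen=True)
class EngineBackedService:
    """Hypothetical worker execution service backed by an engine."""

    engine: Engine


@dataclass(frozen=True)
class NoProviderService:
    """Hypothetical empty-company backstop (no provider configured)."""


@dataclass(frozen=True)
class Coordinator:
    """Hypothetical coordinator built from the same engine."""

    engine: Engine


@dataclass(frozen=True)
class RuntimePair:
    worker_execution_service: Union[EngineBackedService, NoProviderService]
    coordinator: Optional[Coordinator]


def build_pair(*, provider_present: bool) -> RuntimePair:
    """Select both services behind one switch."""
    if not provider_present:
        # Empty company: backstop service, and no coordinator at all.
        return RuntimePair(NoProviderService(), None)
    # One engine instance is shared by both services.
    engine = Engine()
    return RuntimePair(EngineBackedService(engine), Coordinator(engine))
```

Returning both services from one function means the two can never disagree about whether a provider is present, which is presumably why the pair is built behind a single switch.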

## MCP / Telemetry / Resilience

2 changes: 1 addition & 1 deletion README.md
@@ -45,7 +45,7 @@ These are the capabilities that make SynthOrg an autonomous studio. They are des
- **Best-in-class operate tier**: a golden-company benchmark, mission control with run replay, a cost forecast/kill-switch dial, a measurable learning curve, deterministic replay, run narratives, and an adversarial red-team.
- **Agent capability layer**: a knowledge and provenance retrieval substrate, research mode, continual improvement, governed external API access, headless-browser and virtual-desktop testing, and more.

Until the agent runtime lands, multi-agent coordination, coordination metrics, autonomy/trust enforcement on a live run, and the self-improvement loop are designed and unit-tested but not exercised end to end. The design for each lives in the [Design Specification](https://synthorg.io/docs/design/).
The multi-agent coordinator runs end to end behind the provider-present switch (decompose, route, parallel execution, rollup; `/coordinate` returns a real result when a provider is configured). Coordination metrics, autonomy/trust enforcement on a live run, and the self-improvement loop are designed and unit-tested but not yet exercised end to end. The design for each lives in the [Design Specification](https://synthorg.io/docs/design/).

## Quick Start

2 changes: 1 addition & 1 deletion docs/design/coordination.md
@@ -7,7 +7,7 @@ description: Agent crash recovery, graceful shutdown protocol, concurrent worksp

!!! warning "Designed behaviour; runtime in active development"

This page is the source of truth for the **designed** behaviour of this subsystem. Multi-agent coordination is not wired into a running product yet (the `/coordinate` path is not active); this is in active development (see the [Roadmap](../roadmap/index.md)). The code described here is built and unit-tested as components but not yet run by a live agent.
This page is the source of truth for the **designed** behaviour of this subsystem. The multi-agent coordinator is wired at boot behind the provider-present switch: with a provider configured, `/coordinate` runs decompose, route, parallel execution, then rollup end to end; an empty company (no provider) still returns 503. The surrounding resilience features on this page (crash recovery with checkpoint resume, graceful shutdown, the self-improvement loop) remain in active development (see the [Roadmap](../roadmap/index.md)).

This page covers system-level features that span multiple agents and protect against failure: crash recovery with checkpoint resume, graceful shutdown strategies, concurrent workspace isolation (Git worktrees / virtual filesystem / per-branch), and multi-agent coordination topology (centralized, decentralized, context-dependent dispatchers).
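As a rough illustration of the decompose, route, parallel execution, rollup sequence named in the warning above, here is a minimal `asyncio` sketch. The functions `decompose`, `route`, `execute`, and `coordinate`, the `;`-separated goal format, and the length-based routing rule are all assumptions made for the example; they do not describe the actual `MultiAgentCoordinator`.

```python
import asyncio


def decompose(goal: str) -> list[str]:
    # Hypothetical decomposition: split the goal on ';'.
    return [part.strip() for part in goal.split(";") if part.strip()]


def route(subtask: str, agents: list[str]) -> str:
    # Hypothetical deterministic routing: pick an agent by subtask length.
    return agents[len(subtask) % len(agents)]


async def execute(agent: str, subtask: str) -> str:
    # Stand-in for real agent work; yields control once to the event loop.
    await asyncio.sleep(0)
    return f"{agent} finished {subtask!r}"


async def coordinate(goal: str, agents: list[str]) -> dict:
    subtasks = decompose(goal)
    assignments = [(route(s, agents), s) for s in subtasks]
    # Parallel execution: gather preserves the order of its inputs.
    results = await asyncio.gather(*(execute(a, s) for a, s in assignments))
    # Rollup: fold the per-subtask results into one response.
    return {"goal": goal, "completed": len(results), "results": list(results)}
```

Calling `asyncio.run(coordinate("write spec; review spec", ["alice", "bob"]))` runs the two subtasks concurrently and returns one rolled-up dictionary.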

2 changes: 1 addition & 1 deletion scripts/_ghost_wiring_manifest.txt
@@ -24,7 +24,7 @@
# instantiated/called). Bare name match on ast.Call(func=Name|Attribute).

ENFORCED AgentEngine #1956 -- runtime root; construct at boot behind the provider switch
PENDING build_coordinator #1958 -- call at boot to populate app_state.coordinator
ENFORCED build_coordinator #1958 -- called by workers.runtime_builder.build_runtime_services behind the provider switch
PENDING BaselineStore #1959 -- construct at boot (window from budget.baseline_window_size)
PENDING CoordinationMetricsCollector #1959 -- construct at boot, thread into execution
ENFORCED IntakeEngine #1961 -- wired at boot via client/runtime_builder.build_client_simulation_runtime
60 changes: 39 additions & 21 deletions src/synthorg/api/app.py
@@ -992,25 +992,31 @@ def create_app( # noqa: C901, PLR0912, PLR0913, PLR0915
effective_config=effective_config,
)

_worker_service_installed = False
_runtime_services_installed = False

async def _install_worker_execution_service() -> None:
# Installs the worker execution service behind the
# provider-present switch. Appended first (runs immediately
async def _install_runtime_services() -> None:
# Installs the worker execution service AND the multi-agent
# coordinator behind the single provider-present switch, both
# sharing one boot AgentEngine. Appended first (runs immediately
# after the core startup hooks that connect persistence and
# wire SettingsService / ConfigResolver), and before any other
# appended hook, so the once-only ``set_worker_execution_service``
# cannot lose a race with the property's lazy lifecycle-only
# default. With no provider this installs the empty-company
# backstop; a provider added later swaps in the live service via
# ``post_setup_reinit`` (no restart). The closure flag keeps the
# one-shot ``set_`` idempotent across a lifespan re-entry
# (shared-app test fixtures), mirroring ``_wire_chief_of_staff_chat``.
nonlocal _worker_service_installed
if _worker_service_installed:
# / ``set_coordinator`` cannot lose a race with the
# worker-service property's lazy lifecycle-only default. With no
# provider this installs the empty-company backstop and no
# coordinator (``/coordinate`` honestly 503s); a provider added
# later swaps both in via ``post_setup_reinit`` (no restart). The
# closure flag keeps the one-shot ``set_`` calls idempotent
# across a lifespan re-entry (shared-app test fixtures),
# mirroring ``_wire_chief_of_staff_chat``.
nonlocal _runtime_services_installed
if _runtime_services_installed:
return
from synthorg.engine.errors import ( # noqa: PLC0415
RuntimeServicesBuildError,
)
from synthorg.workers.runtime_builder import ( # noqa: PLC0415
build_worker_execution_service,
build_runtime_services,
)

# Pin the sandbox workspace onto the mounted data volume in an
@@ -1022,7 +1028,7 @@ async def _install_worker_execution_service() -> None:
app_state.set_agent_workspace_root(env_workspace_root)

try:
service = await build_worker_execution_service(
services = await build_runtime_services(
app_state,
workspace_root=app_state.agent_workspace_root,
)
@@ -1031,16 +1037,28 @@ async def _install_worker_execution_service() -> None:
except Exception as exc:
logger.error(
API_APP_STARTUP,
service="worker_execution_service",
note="failed to build the worker execution service at boot",
service="runtime_services",
note="failed to build the runtime services at boot",
provider_present=app_state.has_active_provider,
error_type=type(exc).__name__,
error=safe_error_description(exc),
)
raise
app_state.set_worker_execution_service(service)
_worker_service_installed = True

startup = [*startup, _install_worker_execution_service]
msg = "Runtime services failed to build at boot"
raise RuntimeServicesBuildError(msg) from exc
app_state.set_worker_execution_service(
services.worker_execution_service,
)
# An explicitly injected coordinator (``create_app(coordinator=)``
# in tests / custom DI) wins over the autowired one, matching the
# injection-over-autowire convention used across ``create_app``.
# ``set_coordinator_if_absent`` makes the check-and-set atomic in
# the seam (no boot-time check-then-act), so an injected
# coordinator is kept and installing the built one is a logged no-op.
if services.coordinator is not None:
app_state.set_coordinator_if_absent(services.coordinator)
_runtime_services_installed = True

startup = [*startup, _install_runtime_services]

# Project telemetry: build collector (reads SYNTHORG_TELEMETRY_ENABLED env for
# opt-in, defaults to disabled). Attach to app_state so the health
52 changes: 41 additions & 11 deletions src/synthorg/api/controllers/setup/agent_helpers.py
@@ -139,31 +139,61 @@ async def post_setup_reinit(app_state: AppState) -> None:
)
raise

# 3. Rebuild + hot-swap the worker execution service so a provider
# added after an empty-company start wakes the agent runtime
# live, with no process restart. Raise on failure so the caller
# keeps ``setup_complete=false`` rather than presenting a
# half-configured runtime as complete.
# 3. Rebuild + hot-swap BOTH runtime services so a provider added
# after an empty-company start wakes the whole runtime live.
await _rebuild_runtime_services(app_state)


async def _rebuild_runtime_services(app_state: AppState) -> None:
"""Rebuild and hot-swap both runtime services (worker execution + coordinator).

Invoked after provider configuration to bring the full agent runtime
online without a process restart. Swaps the worker execution service
and the multi-agent coordinator so ``/coordinate`` stops returning
503 and the worker-callable execute endpoint uses the new provider.

Raises on failure (either a typed ``RuntimeServicesBuildError`` or a
wrapped exception) so :func:`post_setup_reinit` can keep the setup flag
as incomplete. A half-configured runtime reporting itself as complete is
worse than a clear error the operator can retry after fixing the
underlying provider configuration.
"""
try:
from synthorg.engine.errors import ( # noqa: PLC0415
RuntimeServicesBuildError,
)
from synthorg.workers.runtime_builder import ( # noqa: PLC0415
build_worker_execution_service,
build_runtime_services,
)

service = await build_worker_execution_service(
services = await build_runtime_services(
app_state,
workspace_root=app_state.agent_workspace_root,
)
app_state.swap_worker_execution_service(service)
app_state.swap_worker_execution_service(
services.worker_execution_service,
)
if services.coordinator is not None:
app_state.swap_coordinator(services.coordinator)
except MemoryError, RecursionError:
raise
except RuntimeServicesBuildError:
# Already a typed domain error (logged at its origin); re-raise
# unchanged so post_setup_reinit keeps setup_complete=false.
raise
except Exception as exc:
logger.warning(
# Critical: a provider was configured but the runtime failed to
# wire. ERROR (not WARNING) so monitoring/operator dashboards
# alert; wrapped in a domain error so the /setup/complete
# controller can map it to an actionable status.
logger.error(
SETUP_AGENT_BOOTSTRAP_FAILED,
context="worker_execution_service_rebuild",
context="runtime_services_rebuild",
error_type=type(exc).__name__,
error=safe_error_description(exc),
)
raise
msg = "Runtime services failed to rebuild after provider config"
raise RuntimeServicesBuildError(msg) from exc


async def check_needs_admin(
89 changes: 88 additions & 1 deletion src/synthorg/api/state.py
@@ -1149,9 +1149,91 @@ def coordinator(self) -> MultiAgentCoordinator:

@property
def has_coordinator(self) -> bool:
"""Check whether the coordinator is configured."""
"""Check whether the coordinator is configured.

Unsynchronised by design: a single reference read is atomic
under CPython, so a concurrent reader sees a consistent
old-or-new value. ``swap_coordinator`` never resets the reference
to ``None``; the only flip is ``None -> set`` (at boot, or via
``post_setup_reinit`` after an empty-company start), where a
stale read merely yields one more 503. Locking this hot read (the
``/coordinate`` gate calls it per request) would add cost for a
benign snapshot.
"""
return self._coordinator is not None

def set_coordinator(self, coordinator: MultiAgentCoordinator) -> None:
"""Attach the multi-agent coordinator (once-only, boot only).

Once-only: a second set raises, matching the
``worker_execution_service`` seam. The boot runtime-services
hook uses :meth:`set_coordinator_if_absent` instead so an
explicitly injected coordinator wins; this strict variant is
retained for callers that require the once-only guarantee.
Hot-reload after setup uses :meth:`swap_coordinator`.
"""
self._set_once("_coordinator", coordinator, "Coordinator")

def set_coordinator_if_absent(
self,
coordinator: MultiAgentCoordinator,
) -> bool:
"""Attach the coordinator only if none is configured (atomic).

The boot runtime-services hook calls this unconditionally behind
the provider-present switch, so ``/coordinate`` stops returning
503 once a provider is configured. An explicitly injected
coordinator (constructor ``coordinator=``) is already set and
wins; in that case this call is a logged no-op. The check-and-set is atomic
under ``_lazy_service_lock`` so the boot install cannot race a
concurrent ``swap_coordinator`` or property read (eliminating the
former check-then-act at the call site).

Returns:
``True`` if this call installed the coordinator, ``False``
if one was already configured (injected) and kept.
"""
with self._lazy_service_lock:
if self._coordinator is not None:
logger.info(
API_APP_STARTUP,
service="coordinator",
transition="skipped_injected",
)
return False
self._coordinator = coordinator
logger.info(
API_APP_STARTUP,
service="coordinator",
transition="attached",
)
return True

def swap_coordinator(self, coordinator: MultiAgentCoordinator) -> None:
"""Replace the coordinator (hot-reload).

Distinct from :meth:`set_coordinator`, which is once-only: this
replaces an already-wired coordinator, or attaches the first one
when none is set, so a provider configured after an
empty-company start brings ``/coordinate`` online without a
restart (``post_setup_reinit``). Holds ``_lazy_service_lock`` so
the write is synchronised against concurrent property reads,
mirroring :meth:`swap_worker_execution_service`.
"""
with self._lazy_service_lock:
previous = self._coordinator
if previous is coordinator:
transition = "noop"
elif previous is None:
transition = "attached"
else:
transition = "replaced"
self._coordinator = coordinator
logger.info(
API_APP_STARTUP,
service="coordinator",
transition=transition,
)

@property
def performance_tracker(self) -> PerformanceTracker:
"""Return performance tracker or raise 503."""
@@ -1160,6 +1242,11 @@ def performance_tracker(self) -> PerformanceTracker:
"performance_tracker",
)

@property
def has_performance_tracker(self) -> bool:
"""Check whether the performance tracker is configured."""
return self._performance_tracker is not None

@property
def agent_registry(self) -> AgentRegistryService:
"""Return agent registry or raise 503."""
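
The coordinator seam added in `state.py` combines an atomic set-if-absent with a locked swap. The sketch below reproduces that shape in isolation so the transitions are easy to test. `CoordinatorSeam` and its method names are hypothetical, a plain `threading.Lock` stands in for `_lazy_service_lock`, and logging is replaced by return values.

```python
import threading
from typing import Optional


class CoordinatorSeam:
    """Hypothetical stand-in for the coordinator slot on AppState."""

    def __init__(self, injected: Optional[object] = None) -> None:
        self._lock = threading.Lock()
        self._coordinator = injected

    @property
    def has_coordinator(self) -> bool:
        # Unlocked snapshot read of a single reference.
        return self._coordinator is not None

    def set_if_absent(self, coordinator: object) -> bool:
        # Check and set under one lock, so there is no check-then-act gap.
        with self._lock:
            if self._coordinator is not None:
                return False  # an injected coordinator wins
            self._coordinator = coordinator
            return True

    def swap(self, coordinator: object) -> str:
        # Hot-reload path: always writes, and reports which transition ran.
        with self._lock:
            previous = self._coordinator
            if previous is coordinator:
                transition = "noop"
            elif previous is None:
                transition = "attached"
            else:
                transition = "replaced"
            self._coordinator = coordinator
            return transition
```

With an injected coordinator, `set_if_absent` returns `False` and keeps the injected instance; on an empty seam, the first `swap` reports `"attached"` and a later swap to a different object reports `"replaced"`.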