Merged
Changes from 28 commits
33 commits
aebae71
fix(setup): make agents step sticky-complete with revalidation glyph
Aureliolo May 16, 2026
83b4c6a
feat(web): polling discipline + WS-offline banners on five missing pages
Aureliolo May 16, 2026
bbd4ad6
refactor(web): company-store sentinel contract + fire-and-forget guards
Aureliolo May 16, 2026
cb60573
fix(web): BudgetPage breakpoints stop overflowing at tablet widths
Aureliolo May 16, 2026
feeccaa
feat(web): breadcrumbs on ProviderDetailPage
Aureliolo May 16, 2026
5344532
feat(web): differentiated 429 / 503 / 409 toast copy
Aureliolo May 16, 2026
13a620b
fix(api): set setup_complete AFTER reinit succeeds
Aureliolo May 16, 2026
7549178
test(web): align fine-tuning store test with refined 503 copy
Aureliolo May 16, 2026
c064034
fix(web): pre-PR fixes (review batch 1)
Aureliolo May 16, 2026
fd00b5f
fix(setup): reinit failure blocks setup_complete flag
Aureliolo May 16, 2026
e8a03cd
feat: WS/SSE robustness refinements (review batch 2)
Aureliolo May 16, 2026
9222222
fix(web): pre-PR review quick wins (batch 3)
Aureliolo May 16, 2026
7cb2a3b
fix(setup): close TOCTOU races on setup completion (M7-M9)
Aureliolo May 16, 2026
e77bce8
fix(web): error category exhaustiveness + per-branch refresh logging
Aureliolo May 16, 2026
fa6b3b6
feat(web): list pagination + empty-state + 422 copy (batch 6)
Aureliolo May 16, 2026
aaf15a9
feat(web): search/filter on RequestQueuePage and BudgetPage (C5)
Aureliolo May 16, 2026
791c0cd
feat(setup): onboarding-wizard reliability backbone (C17, C18, C19)
Aureliolo May 16, 2026
84f1373
feat(web): SettingsSinksPage Delete + Reset (C11)
Aureliolo May 16, 2026
3617d85
fix(web): smaller polish (Me4, Mi2)
Aureliolo May 16, 2026
ce92ce2
feat(web): PersonalitiesAdminPage CRUD (C8)
Aureliolo May 16, 2026
0001d63
feat(web): Admin audit-log viewer page (C9)
Aureliolo May 16, 2026
f259452
feat(web): bulk-selection UI on AgentsPage, ProvidersPage, ClientList…
Aureliolo May 16, 2026
39ce011
fix(setup,web): extract complete_setup helpers + widen tablet dialogs
Aureliolo May 16, 2026
3b818e0
fix(web): finishing-touch polish (Me2, Me3, Me7, Mi4, M5)
Aureliolo May 16, 2026
59b0df4
fix(web): personalities endpoint URL satisfies dead-api gate
Aureliolo May 16, 2026
52a23ac
fix(web,setup): pre-PR review fixes for WP-6 (14 findings)
Aureliolo May 16, 2026
6e6a266
fix: babysit round 1, 22 findings (21 coderabbit, 1 gemini)
Aureliolo May 16, 2026
32c97e5
fix: babysit round 2, 11 findings (11 coderabbit) + codecov/patch
Aureliolo May 16, 2026
eb5aa0c
fix: babysit round 3, 11 findings (11 coderabbit)
Aureliolo May 16, 2026
c5ef146
fix: babysit round 5, 2 findings (1 coderabbit, 1 user-directed)
Aureliolo May 16, 2026
b7ab105
fix: babysit round 7, 2 findings (1 coderabbit, 1 ci)
Aureliolo May 16, 2026
3768913
fix: babysit round 9, 1 finding (1 coderabbit)
Aureliolo May 16, 2026
03baf13
Merge branch 'main' into feat/wp6-frontend-ux-polish
Aureliolo May 16, 2026
1 change: 1 addition & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -78,6 +78,7 @@ PYTHONPATH=. uv run zensical build # docs
- Two phases: **construction** (`create_app` body) wires synchronous services; **on_startup** (`_build_lifecycle.on_startup`) wires services that need a connected persistence backend.
- Construction-phase ordering invariants: `agent_registry` must be built BEFORE `auto_wire_meetings`; `tunnel_provider` is wired unconditionally (not gated by `integrations.enabled`).
- On-startup ordering invariants: `SettingsService` auto-wire must precede `WorkflowExecutionObserver` registration (so it picks up resolver-driven `max_subworkflow_depth` instead of the seed default); `OntologyService` wires after `persistence.connect()` via `_wire_ontology_service`.
- Setup completion: `post_setup_reinit()` (provider reload + agent bootstrap, defined in `src/synthorg/api/controllers/setup/agent_helpers.py`) propagates failures, and `settings_svc.set("api", "setup_complete", "true")` only runs if reinit returns clean. The whole check/validate/reinit/persist sequence is serialised under `COMPLETE_LOCK` in the same module so two concurrent `/setup/complete` requests cannot race on the flag write. A half-configured runtime presenting itself as "complete" is worse than a clear error the operator can retry after fixing the underlying provider config.

## MCP / Telemetry / Resilience

13 changes: 12 additions & 1 deletion docs/design/page-structure.md
@@ -243,7 +243,18 @@ Full-page authentication. JWT-based. On success, redirects to `/` (Dashboard) or

#### Setup Wizard (`/setup`)

Multi-step first-run flow. After account creation (conditional), a mode selection gate asks the user to choose **Guided Setup** (recommended, full wizard) or **Quick Setup** (minimal: provider + company name, configure rest later in Settings). Guided mode steps: account (conditional), mode selection, template selection, provider setup, company creation, agent configuration, theme customization, and completion. Quick mode steps: account (conditional), mode selection, provider setup, company creation, and completion. Providers are configured before company creation and agents so model assignment is available downstream. Each step is URL-addressable (`/setup/{step}`). The mode selection step is hidden from the progress bar. Redirects to `/` if setup is already complete.
Multi-step first-run flow. After account creation (conditional), a mode selection gate asks the user to choose **Guided Setup** (recommended, full wizard) or **Quick Setup** (minimal: provider + company name, configure rest later in Settings). Providers are configured before company creation and agents so model assignment is available downstream. Each step is URL-addressable (`/setup/{step}`). The mode selection step is hidden from the progress bar. Redirects to `/` if setup is already complete.

| Step | Route | Guided | Quick |
|------|-------|--------|-------|
| Account | `/setup/account` | conditional | conditional |
| Mode selection | `/setup/mode` | yes (hidden from progress) | yes (hidden from progress) |
| Template | `/setup/template` | yes | no |
| Providers | `/setup/providers` | yes | yes |
| Company | `/setup/company` | yes | yes |
| Agents | `/setup/agents` | yes | no |
| Theme | `/setup/theme` | yes | no |
| Complete | `/setup/complete` | yes | yes |
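
The step table above can be read as data. The following is a small illustrative sketch, assuming a hypothetical `WizardStep` record and `routes_for` helper (neither appears in the PR; the actual wizard lives in the web frontend) that derives the route sequence for each mode:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WizardStep:
    slug: str                     # final segment of /setup/{step}
    guided: bool                  # shown in Guided Setup
    quick: bool                   # shown in Quick Setup
    conditional: bool = False     # only shown when account creation is needed
    in_progress_bar: bool = True  # the mode gate is hidden from the progress bar


# Rows mirror the table above, in order.
STEPS = (
    WizardStep("account", guided=True, quick=True, conditional=True),
    WizardStep("mode", guided=True, quick=True, in_progress_bar=False),
    WizardStep("template", guided=True, quick=False),
    WizardStep("providers", guided=True, quick=True),
    WizardStep("company", guided=True, quick=True),
    WizardStep("agents", guided=True, quick=False),
    WizardStep("theme", guided=True, quick=False),
    WizardStep("complete", guided=True, quick=True),
)


def routes_for(mode: str, *, needs_account: bool) -> list[str]:
    """Return the ordered step routes for 'guided' or 'quick' mode."""
    if mode not in ("guided", "quick"):
        raise ValueError(f"unknown setup mode: {mode!r}")
    return [
        f"/setup/{step.slug}"
        for step in STEPS
        if getattr(step, mode) and (needs_account or not step.conditional)
    ]


quick_routes = routes_for("quick", needs_account=False)
guided_routes = routes_for("guided", needs_account=True)
progress_bar_slugs = [s.slug for s in STEPS if s.guided and s.in_progress_bar]
```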

**Provider step layout** (`web/src/pages/setup/ProvidersStep.tsx`): a three-section picker reused on both the wizard and the Settings → Providers page. (a) **Cloud providers** -- a logo-and-name grid for hosted providers; click a card to open the credential form pre-filled with that preset. (b) **Detected on this machine** -- only renders when an auto-detect probe found a reachable local server; rows include the URL, model count, and `[Add local]` / `[Add cloud]` buttons (the cloud variant is offered when a local preset has a hosted counterpart, e.g. local Ollama → Ollama Cloud). The probe is a single batch call to `POST /providers/probe-local` issued once on mount, with a manual rescan button. (c) **Configure manually** -- opens the credential form in custom-endpoint mode. The "Detected" section is hidden entirely when nothing was detected; vLLM is intentionally omitted from auto-detect because its default port (8000) collides with the SynthOrg backend.

1 change: 1 addition & 0 deletions docs/reference/web-design-system.md
@@ -90,6 +90,7 @@ Every shared building block in `web/src/components/ui/`. Reuse before creating n
| `EmptyState` | `@/components/ui/empty-state` | No-data / no-results placeholder with icon, title, description, optional action button. |
| `ErrorBoundary` | `@/components/ui/error-boundary` | React error boundary with retry. `level` prop: `page` / `section` / `component`. |
| `ErrorBanner` | `@/components/ui/error-banner` | Error / warning / info banner for list-fetch failures, offline state, onboarding retry guidance. `severity` maps to `role=alert` (error) or `role=status` (warning / info). `variant='offline'` forces warning + WifiOff icon. Optional `onRetry`, `retryAfterSeconds` (live "Retry in Ns" countdown that disables the Retry button until the cooldown expires; pair with `ApiRequestError.retryAfter` when surfacing 429 / 503 responses), `onDismiss`, `action` slots. Use this for every page-level or form-level error surface; use toasts for mutation outcomes instead. |
| `WsConnectionBanner` | `@/components/ui/ws-connection-banner` | Page-level offline notice for surfaces that depend on live WebSocket updates. Renders only after a first successful connection (gated by an internal `everConnectedRef`) so a slow initial handshake does not flash the banner. Drop in at the top of any page that drives state from `useWebSocketStore`. |
| `ConfirmDialog` | `@/components/ui/confirm-dialog` | Confirmation modal (Base UI AlertDialog) with `default` / `destructive` variants and `loading` state. |
| `ProgressIndicator` | `@/components/ui/progress-indicator` | Long-running operation progress. Variants: `determinate` (labeled bar + percentage), `indeterminate` (shimmer), `stages` (multi-step list with done / running / pending / failed). Use for fine-tuning pipelines, setup flows, provider probes. |

57 changes: 49 additions & 8 deletions src/synthorg/api/controllers/events.py
@@ -98,10 +98,40 @@ async def _resolve_sse_keepalive_seconds(app_state: AppState | None) -> float:
# control-character session IDs reaching the hub.
_SESSION_ID_PATTERN = r"^[a-zA-Z0-9_-]{1,128}$"

# Maximum consecutive revalidation failures (transient persistence
# blips) before the SSE stream terminates so the client can reconnect
# against a healthy replica.
_SSE_REVALIDATE_MAX_FAILURES: int = 3
# Fallback for ``api.sse_revalidate_max_failures`` when the settings
# chain is unavailable (test harness, anonymous boot, resolver outage).
# Mirrors the registry default in
# ``src/synthorg/settings/definitions/api.py``.
_SSE_REVALIDATE_MAX_FAILURES_FALLBACK: int = 3


async def _resolve_sse_revalidate_max_failures(app_state: AppState | None) -> int:
"""Resolve the SSE revalidation failure tolerance through settings.

Falls back to :data:`_SSE_REVALIDATE_MAX_FAILURES_FALLBACK` when no
:class:`ConfigResolver` is wired (test harness, anonymous boot) or
when the resolver itself raises -- a transient settings outage
must not collapse the failure ceiling to zero.
"""
if app_state is None or not getattr(app_state, "has_config_resolver", False):
return _SSE_REVALIDATE_MAX_FAILURES_FALLBACK
try:
return await app_state.config_resolver.get_int(
"api", "sse_revalidate_max_failures"
)
except asyncio.CancelledError:
raise
except MemoryError, RecursionError:

Contributor

critical

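Unparenthesized multi-exception `except` clauses are only valid on Python 3.14 and later (PEP 758). On earlier Python 3 versions `except E1, E2:` is a `SyntaxError`; the `except E1 as E2` reading, which catches only `MemoryError` and binds it to the name `RecursionError`, was Python 2 behavior. If this module must import on interpreters older than 3.14, catch multiple exceptions with a parenthesized tuple.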

    except (MemoryError, RecursionError):

raise
except Exception as exc:
logger.warning(
EVENT_STREAM_PROJECTION_FAILED,
note="failed to resolve api.sse_revalidate_max_failures; using fallback",
error_type=type(exc).__name__,
error=safe_error_description(exc),
fallback=_SSE_REVALIDATE_MAX_FAILURES_FALLBACK,
)
return _SSE_REVALIDATE_MAX_FAILURES_FALLBACK
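
The fallback behavior of the resolver above can be sketched in isolation. This is a simplified illustration under stated assumptions: `FixedResolver` and `FlakyResolver` are hypothetical test doubles, the resolver is passed directly instead of through `AppState`, and logging is omitted. It uses the parenthesized tuple form so the sketch also runs on interpreters older than 3.14.

```python
import asyncio

FALLBACK_MAX_FAILURES = 3  # mirrors the fallback constant in the diff


class FixedResolver:
    """Hypothetical resolver that always returns a configured value."""

    def __init__(self, value: int) -> None:
        self._value = value

    async def get_int(self, namespace: str, key: str) -> int:
        return self._value


class FlakyResolver:
    """Hypothetical resolver that simulates a settings outage."""

    async def get_int(self, namespace: str, key: str) -> int:
        raise ConnectionError("settings backend unavailable")


async def resolve_max_failures(resolver) -> int:
    """Return the configured ceiling, or the fallback if resolution fails."""
    if resolver is None:
        return FALLBACK_MAX_FAILURES
    try:
        return await resolver.get_int("api", "sse_revalidate_max_failures")
    except asyncio.CancelledError:
        # Already outside `except Exception` on modern Python; kept explicit.
        raise
    except (MemoryError, RecursionError):
        # Resource exhaustion must never be swallowed by the fallback path.
        raise
    except Exception:
        # A transient outage must not collapse the ceiling to zero.
        return FALLBACK_MAX_FAILURES


no_resolver = asyncio.run(resolve_max_failures(None))
outage = asyncio.run(resolve_max_failures(FlakyResolver()))
configured = asyncio.run(resolve_max_failures(FixedResolver(5)))
```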


async def _user_revocation_reason(
@@ -116,7 +146,7 @@ async def _user_revocation_reason(
must kick a live SSE stream within one revalidation interval).

``ok`` is False when the persistence call itself failed (transient
backend error). Callers tolerate ``_SSE_REVALIDATE_MAX_FAILURES``
backend error). Callers tolerate ``api.sse_revalidate_max_failures``
consecutive ``ok=False`` ticks before tearing down the stream.
"""
try:
@@ -346,13 +376,18 @@ async def _run_revalidation_tick(
app_state: AppState,
user: AuthenticatedUser,
consecutive_failures: int,
max_failures: int,
) -> _RevalidationVerdict:
"""Execute one revalidation check and return what the loop should do.

Centralises the failure-counter / role-check / session-revocation
decision tree so :func:`_sse_event_stream` does not exceed the
McCabe complexity ceiling. The caller advances its
``next_revalidate_ts`` regardless of the verdict.

``max_failures`` is the resolved ``api.sse_revalidate_max_failures``
setting; the loop tolerates this many consecutive transient
persistence errors before yielding a ``revoked`` frame.
"""
reason, ok = await _user_revocation_reason(
app_state,
@@ -361,7 +396,11 @@ async def _run_revalidation_tick(
)
if not ok:
new_failures = consecutive_failures + 1
if new_failures >= _SSE_REVALIDATE_MAX_FAILURES:
# Strictly greater-than: the docstring contract is to tolerate
# ``max_failures`` consecutive transient errors and revoke only
# once that ceiling is exceeded (failure max_failures+1), not on
# the max_failures-th failure itself.
if new_failures > max_failures:
return _RevalidationVerdict(
consecutive_failures=new_failures,
revoked_event={
@@ -394,8 +433,8 @@ async def _sse_event_stream( # noqa: PLR0915, PLR0912, C901
independent revalidation deadline (``SSE_REVALIDATE_INTERVAL_SECONDS``)
and fires it even on busy streams that never hit a keepalive
timeout. On revocation, yields a final ``revoked`` event
and terminates the stream. Tolerates ``_SSE_REVALIDATE_MAX_FAILURES``
transient persistence errors before escalating.
and terminates the stream. Tolerates ``api.sse_revalidate_max_failures``
consecutive transient persistence errors before escalating.
"""
consecutive_failures = 0
# Track the disconnect reason by exit path so the
@@ -420,6 +459,7 @@ async def _sse_event_stream( # noqa: PLR0915, PLR0912, C901
)
revalidation_armed = app_state is not None and user is not None
keepalive_seconds = await _resolve_sse_keepalive_seconds(app_state)
revalidate_max_failures = await _resolve_sse_revalidate_max_failures(app_state)
# Use ``app_state.clock.monotonic()`` so tests inject FakeClock
# rather than monkey-patching ``asyncio.get_event_loop().time``.
# The bare loop timer is still acceptable for async waits below.
@@ -469,6 +509,7 @@ async def _sse_event_stream( # noqa: PLR0915, PLR0912, C901
app_state=app_state,
user=user,
consecutive_failures=consecutive_failures,
max_failures=revalidate_max_failures,
)
consecutive_failures = verdict.consecutive_failures
if verdict.revoked_event is not None:
77 changes: 58 additions & 19 deletions src/synthorg/api/controllers/setup/agent_helpers.py
@@ -47,9 +47,22 @@

logger = get_logger(__name__)

# Inverted-convention result from ``auto_select_embedder``: ``None``
# means success (a model was ranked and persisted); a ``str`` carries
# the human-readable failure reason. Aliased here so the call site
# can pass the result directly to
# ``SetupCompleteResponse.embedder_failure_reason`` without re-stating
# the inversion at every call.
type EmbedderSelectResult = str | None

# Module-level lock: serializes read-modify-write on agents settings.
AGENT_LOCK = asyncio.Lock()

# Module-level lock: serializes the entire /setup/complete flow so two
# concurrent clients cannot both pass the ``setup_complete=false`` check
# and then race on reinit + flag write.
COMPLETE_LOCK = asyncio.Lock()


def validate_agent_index(
agent_index: int,
@@ -72,9 +85,14 @@ def validate_agent_index(
async def post_setup_reinit(app_state: AppState) -> None:
"""Reload providers and bootstrap agents after setup completion.

Both operations are non-fatal: setup completion must succeed
even if re-init partially fails (the user can restart the
server to pick up changes).
Raises on failure so the caller can keep ``setup_complete=false``
when reinit cannot finish; a half-configured runtime presenting
itself as "complete" is worse than a clear error the operator can
retry after fixing the underlying provider config.

The matching call site in
:func:`SetupController.complete_setup` only persists the completion
flag when this function returns without raising.

Args:
app_state: Application state containing services.
@@ -92,11 +110,13 @@ async def post_setup_reinit(app_state: AppState) -> None:
app_state.swap_provider_registry(new_registry)
except MemoryError, RecursionError:

Contributor

critical

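Unparenthesized multi-exception `except` is a syntax error before Python 3.14 (PEP 758 allows it from 3.14 on). If older interpreters must be supported, use a parenthesized tuple to catch both `MemoryError` and `RecursionError`.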

    except (MemoryError, RecursionError):

raise
except Exception:
except Exception as exc:
logger.warning(
SETUP_PROVIDER_RELOAD_FAILED,
error="Provider reload failed after setup (non-fatal)",
error_type=type(exc).__name__,
error=safe_error_description(exc),
)
raise

# 2. Bootstrap agents into runtime registry.
if app_state.has_agent_registry:
@@ -111,11 +131,13 @@ async def post_setup_reinit(app_state: AppState) -> None:
)
except MemoryError, RecursionError:

Contributor

critical

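Unparenthesized multi-exception `except` is a syntax error before Python 3.14 (PEP 758 allows it from 3.14 on). If older interpreters must be supported, use a parenthesized tuple to catch both `MemoryError` and `RecursionError`.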

        except (MemoryError, RecursionError):

raise
except Exception:
except Exception as exc:
logger.warning(
SETUP_AGENT_BOOTSTRAP_FAILED,
error="Agent bootstrap failed (non-fatal)",
error_type=type(exc).__name__,
error=safe_error_description(exc),
)
raise


async def check_needs_admin(
@@ -340,7 +362,7 @@ async def auto_select_embedder(
available_model_ids: tuple[str, ...],
provider_preset_name: str | None = None,
has_gpu: bool | None = None,
) -> None:
) -> EmbedderSelectResult:
"""Auto-select an embedding model and persist the choice.

Best-effort: logs a warning but does not raise on failure.
@@ -351,6 +373,13 @@ async def auto_select_embedder(
available_model_ids: Model IDs discovered from providers.
provider_preset_name: Provider preset for tier inference.
has_gpu: Whether the host has a GPU.

Returns:
``None`` on success (a model was ranked and persisted), or a
short human-readable failure reason string when selection or
persistence failed. The inverted convention (None = success,
str = failure) keeps the caller free to pass the result
directly to ``SetupCompleteResponse.embedder_failure_reason``.
"""
from synthorg.memory.embedding.selector import ( # noqa: PLC0415
infer_deployment_tier,
@@ -373,20 +402,14 @@ async def auto_select_embedder(
# Try without tier filter as fallback.
ranking = select_embedding_model(available_model_ids)
if ranking is None:
reason = "no ranked embedding model available for configured providers"
logger.warning(
MEMORY_EMBEDDER_AUTO_SELECT_FAILED,
available_models=len(available_model_ids),
tier=tier.value,
reason="no LMEB-ranked model in available models",
reason=reason,
)
return
logger.info(
MEMORY_EMBEDDER_AUTO_SELECTED,
model_id=ranking.model_id,
tier=tier.value,
overall_score=ranking.overall,
dims=ranking.output_dims,
)
return reason
try:
await settings_svc.set(
"memory",
@@ -400,8 +423,24 @@ async def auto_select_embedder(
)
except MemoryError, RecursionError:

Contributor

critical

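Unparenthesized multi-exception `except` is a syntax error before Python 3.14 (PEP 758 allows it from 3.14 on). If older interpreters must be supported, use a parenthesized tuple to catch both `MemoryError` and `RecursionError`.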

    except (MemoryError, RecursionError):

raise
except Exception:
except Exception as exc:
reason = "failed to persist embedder settings"
logger.warning(
MEMORY_EMBEDDER_AUTO_SELECT_FAILED,
reason="failed to persist embedder settings",
reason=reason,
error_type=type(exc).__name__,
error=safe_error_description(exc),
)
return reason
# INFO log emitted AFTER the persistence writes succeed so the
# event accurately reflects committed state. A pre-write log
# would otherwise misleadingly claim success when the writes
# below fail and fall through to the warning branch.
logger.info(
MEMORY_EMBEDDER_AUTO_SELECTED,
model_id=ranking.model_id,
tier=tier.value,
overall_score=ranking.overall,
dims=ranking.output_dims,
)
return None
63 changes: 63 additions & 0 deletions src/synthorg/api/controllers/setup_agents.py
@@ -243,6 +243,69 @@ def validate_model_assignment(
_validate_provider_model_pair(providers, data.model_provider, data.model_id)


def validate_persisted_agents_against_providers(
    providers: Mapping[str, Any],
    agents: list[dict[str, Any]],
) -> None:
    """Verify every persisted agent points at a real provider+model pair.

    Called from the setup-complete flow so an agent whose provider /
    model was deleted between agent creation and setup completion
    cannot land on a "complete" dashboard with broken model references.

    Args:
        providers: Provider name -> config mapping resolved from
            provider_management.list_providers().
        agents: Persisted agent dicts loaded from the ``company.agents``
            setting (each entry has ``model.provider`` and
            ``model.model_id`` keys).

    Raises:
        ValidationError: If any agent references a provider that no
            longer exists OR a model the provider no longer exposes.
            The error message names the offending agent + reference so
            the wizard can highlight the right row.
    """
    for idx, agent in enumerate(agents):
        model = agent.get("model")
        if not isinstance(model, dict):
            continue
        provider_name = model.get("provider")
        model_id = model.get("model_id")
        if not isinstance(provider_name, str) or not isinstance(model_id, str):
            continue
        agent_label = agent.get("name") or f"agent {idx}"
        if provider_name not in providers:
            msg = (
                f"Agent {agent_label!r} references provider "
                f"{provider_name!r}, which is no longer configured. "
                "Re-edit the agent or restore the provider before "
                "completing setup."
            )
            logger.warning(
                SETUP_PROVIDER_NOT_FOUND,
                provider=provider_name,
                agent_index=idx,
            )
            raise ValidationError(msg)
        provider_config = providers[provider_name]
        known_ids = {m.id for m in provider_config.models}
        if model_id not in known_ids:
            msg = (
                f"Agent {agent_label!r} references model "
                f"{model_id!r} on provider {provider_name!r}, which "
                "the provider no longer exposes. Re-edit the agent's "
                "model before completing setup."
            )
            logger.warning(
                SETUP_MODEL_NOT_FOUND,
                provider=provider_name,
                model=model_id,
                agent_index=idx,
            )
            raise ValidationError(msg)


def validate_provider_and_model(
providers: Mapping[str, Any],
data: SetupAgentRequest,