test: replace test placeholders with real subsystem wiring#1845
Conversation
Make REAL_LLM_API_KEY optional so the test can target local providers (Ollama, LM Studio, vLLM) reached via REAL_LLM_BASE_URL. At least one of the two must be set; both unset still skips cleanly with a documented reason.
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
There was a problem hiding this comment.
Code Review
This pull request implements end-to-end smoke tests for real LLM providers using LiteLLMDriver and enhances integration tests for the SelfImprovementService by integrating a real ApprovalStore workflow. Feedback focuses on improving test consistency by sharing a FakeClock instance between components to ensure synchronized time-dependent logic and timestamps.
| approval_store = ApprovalStore() | ||
| svc = SelfImprovementService( | ||
| config=SelfImprovementConfig( | ||
| enabled=True, | ||
| config_tuning_enabled=True, | ||
| ), | ||
| clock=FakeClock(), | ||
| snapshot_builder=snapshot_builder, | ||
| approval_store=ApprovalStore(), | ||
| approval_store=approval_store, | ||
| ) |
There was a problem hiding this comment.
The ApprovalStore and SelfImprovementService are initialized with different clock sources. While SelfImprovementService is explicitly given a FakeClock, ApprovalStore defaults to SystemClock. For integration tests using a fake clock, it is safer and more consistent to share the same clock instance across all components that support it. This ensures that any time-dependent logic (like expiration checks in the store) remains synchronized with the test's simulated time.
| approval_store = ApprovalStore() | |
| svc = SelfImprovementService( | |
| config=SelfImprovementConfig( | |
| enabled=True, | |
| config_tuning_enabled=True, | |
| ), | |
| clock=FakeClock(), | |
| snapshot_builder=snapshot_builder, | |
| approval_store=ApprovalStore(), | |
| approval_store=approval_store, | |
| ) | |
| clock = FakeClock() | |
| approval_store = ApprovalStore(clock=clock) | |
| svc = SelfImprovementService( | |
| config=SelfImprovementConfig( | |
| enabled=True, | |
| config_tuning_enabled=True, | |
| ), | |
| clock=clock, | |
| snapshot_builder=snapshot_builder, | |
| approval_store=approval_store, | |
| ) |
| # first-writer-wins path the API and MCP approval handlers | ||
| # take, so a regression in the store's concurrency model also | ||
| # surfaces here. | ||
| decided_at = datetime.now(UTC) |
There was a problem hiding this comment.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/e2e/test_single_agent_e2e.py`:
- Around line 386-392: The test docstrings/messages currently use vendor-branded
names ("LiteLLM", "Ollama", "LM Studio", "vLLM"); replace those with
vendor-agnostic phrases such as "configured provider driver", "local provider
endpoint", or "local/self-hosted provider" in the text that explains
REAL_LLM_MODEL, REAL_LLM_PROVIDER, REAL_LLM_API_KEY, and REAL_LLM_BASE_URL and
in the other occurrences near those same docstrings (the nearby strings
referenced in the comment). Ensure the meaning stays the same (auth via api_key
and endpoint via base_url) while removing specific vendor names.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6e660204-71a0-4911-8075-1b3897b9daa9
📒 Files selected for processing (2)
tests/e2e/test_single_agent_e2e.pytests/integration/meta/test_meta_cycle.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Type Check
- GitHub Check: Test (Python 3.14)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Test markers: use@pytest.mark.{unit,integration,e2e,slow}. Async tests useauto. Global timeout 30s. Coverage 80% minimum.
xdist: use-n 8 --dist=loadfile(auto-applied via pyprojectaddopts;loadfileprevents 3.14+ Windows ProactorEventLoop leak).
Windows: unit tests useWindowsSelectorEventLoopPolicy(3.14 IOCP teardown race). Subprocess tests override back.
Mock-spec: every Mock declaresspec=ConcreteClass; baseline atscripts/mock_spec_baseline.txt.
FakeClock: import fromtests._shared.fake_clock; inject viaclock=parameter.
Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add@example(...)). Never skip/xfail for flaky tests; fix fundamentally.
Useasyncio.Event().wait()instead ofsleep(large)to avoid flaky tests.
Files:
tests/e2e/test_single_agent_e2e.pytests/integration/meta/test_meta_cycle.py
⚙️ CodeRabbit configuration file
Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare
@settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which@given() honors automatically.
Files:
tests/e2e/test_single_agent_e2e.pytests/integration/meta/test_meta_cycle.py
{src,tests}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Vendor-agnostic naming: NEVER use real vendor names in project code/tests. Use
example-provider,test-provider,example-{large,medium,small}-001. Allowed only in.claude/, third-party imports,providers/presets.py,web/public/provider-logos/.
Files:
tests/e2e/test_single_agent_e2e.pytests/integration/meta/test_meta_cycle.py
🧠 Learnings (1)
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.
Applied to files:
tests/e2e/test_single_agent_e2e.pytests/integration/meta/test_meta_cycle.py
🔇 Additional comments (1)
tests/integration/meta/test_meta_cycle.py (1)
131-195: Nice integration hardening of the approval path.This now validates the real
ApprovalStorelifecycle (register → fetch pending →save_if_pending→ rollout with persisted decision metadata), which gives strong regression signal for guard/store coupling.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1845 +/- ##
==========================================
- Coverage 84.78% 84.78% -0.01%
==========================================
Files 1800 1800
Lines 104864 104864
Branches 9202 9202
==========================================
- Hits 88910 88904 -6
- Misses 13716 13722 +6
Partials 2238 2238 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- tests/integration/meta/test_meta_cycle.py: share a single FakeClock between ApprovalStore and SelfImprovementService and source decided_at from clock.now() so time-dependent paths in the store (expiration, decision timestamps) stay in lockstep with the service's simulated time (Gemini #3213807073, #3213807075). - tests/e2e/test_single_agent_e2e.py: drop vendor-branded names (LiteLLM, Ollama, LM Studio, vLLM) from docstrings and skip messages in TestRealLLMIntegration; use vendor-agnostic phrasing (configured provider driver, local / self-hosted providers, provider routing key) per CLAUDE.md vendor-agnostic naming (CodeRabbit #3213808360). The LiteLLMDriver class identifier is retained: it is the project-internal wrapper around the third-party litellm package, which CLAUDE.md exempts under third-party imports.
## Summary `scripts/check_setting_to_startup_trace.py` froze 8 ghost-wired settings: 7 × `backup.*` and `security.timeout_check_interval_seconds`. Both owning services were either factory-gated to `None` in default config (`BackupService`) or bootstrapped with a hardcoded interval and never re-read at runtime (`ApprovalTimeoutScheduler`). The registered settings had no live consumer at boot. This branch wires both services through the canonical DB > env > YAML > default chain and regenerates the baseline to header-only. ## Changes **Phase 1 — `BackupService`** - Drop the `if not backup_config.enabled: return None` early return in `build_backup_service`. The factory always returns a real `BackupService`. `BackupService.start()` already gates internally on `self._config.enabled`, so a deployment with `backup.enabled=false` still avoids scheduler ticks. Adds an INFO log on the dormant path so operators can confirm the deliberate state from the boot log. **Phase 2 — `ApprovalTimeoutScheduler`** - New `_apply_security_timeout_interval` helper in `lifecycle_helpers.py`, called after `_apply_bridge_config`, resolves `security.timeout_check_interval_seconds` via `ConfigResolver` and calls `scheduler.reschedule(...)`. - New `SecurityTimeoutSettingsSubscriber` (mirrors `BackupSettingsSubscriber._reschedule`) handles live operator changes via the dispatcher. - `_build_settings_dispatcher` + `BuildDispatcherFn` Protocol + `auto_wire_settings` extended with the new optional `approval_timeout_scheduler` parameter; scheduler construction in `app.py` moved earlier so the dispatcher can wire the matching subscriber against the same instance. **Phase 3 — Baseline** - `scripts/setting_to_startup_trace_baseline.txt` regenerated to header-only (`uv run python scripts/check_setting_to_startup_trace.py --update-baseline`). - Smoke test `test_real_repo_violations_match_expected` updated to assert empty-violation set as the new ground truth. - `scripts/loop_bound_init_baseline.txt` line 74 → 75 to track the shifted import in `backup/service.py`. **Pre-PR review fixes (commit 4)** - Drop now-stale ghost-service examples in `docs/reference/configuration-precedence.md`; replace with a generic "fixing a ghost-wired service" guide referencing both as end-to-end examples. - Rewrite two app.py comments to explain the current architectural reason instead of framing the change relative to the prior site (CLAUDE.md no-migration-framing). - Bind exception in `build_backup_service` failure path; log via `safe_error_description` + `error_type` per SEC-1 redaction. - Bind float-parse failure in `SecurityTimeoutSettingsSubscriber`; add `error_type` so operators can distinguish a non-numeric value from a None value. - Document `BACKUP_SCHEDULER_STARTED` as the started-branch's observable signal in `BackupService.start`'s docstring. ## Test Plan - 16 new tests across 4 files: factory (3), helper (5), subscriber (8); plus updated lint smoke test. - Full unit suite: 29,463 passed (50 skipped Windows / opt-in markers). - Focused integration: 55 passed (3 skipped — Docker unavailable). - Gates clean: `check_setting_to_startup_trace.py`, `check_no_loop_bound_init.py`, `check_logger_exception_str_exc.py`, `check_no_magic_numbers.py`, `check_mock_spec.py`. - ruff + mypy strict + ruff format: clean. ## Review coverage `/pre-pr-review` ran 17 specialized agents in parallel (code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, type-design-analyzer, logging-audit, resilience-audit, conventions-enforcer, security-reviewer, test-quality-reviewer, async-concurrency-reviewer, comment-analyzer, docs-consistency, comment-quality-rot, api-contract-drift, plus 5 mini-pass agents from the codebase audit: missing-logger, missing-event-constants, missing-state-transition-log, unwired-settings, race-conditions). 6 valid findings retained and fixed; 1 false-positive cluster (PEP 758 `except A, B:` flagged as Python 2 syntax) discarded with rationale (CLAUDE.md explicitly permits this form). ## Notes - No PR-tracked GitHub issue (per /pre-pr-review prompt: user chose to proceed without). - Rebased onto `origin/main` to pull in `#1845` (unrelated test refactor) before pushing.
<!-- HIGHLIGHTS_START --> ## Highlights > _AI-generated summary (model: `openai/gpt-4.1-mini` via GitHub Models). Commit-based changelog below._ ### What you'll notice - Improved error logging and Prometheus instrumentation provide better system monitoring. - Eliminated race conditions in CI tagging for more reliable development releases. - Fixed critical configuration access and kill-switch bugs to enhance system stability. - Enhanced client experience with retry-after headers and better websocket reconnect behavior. ### What's new - Introduced composite indexes and cursor pagination for faster data queries. - Added server-sent events rate limiting and Ollama input sanitization for improved security. ### Under the hood - Centralized workflow error mappings to standardize error handling. - Refactored API lifecycle fallback to use a configuration snapshot for consistency. - Tightened startup settings baseline and reduced controller error baseline to zero. - Replaced flaky contributor-assistant GitHub action with a custom stable step. - Consolidated Renovate dependency groups to avoid update conflicts. - Upgraded in-toto-golang dependency to fix security vulnerabilities and dropped unnecessary CVE waivers. - Extensive lock file maintenance and multiple infrastructure and Python dependency updates. <!-- HIGHLIGHTS_END --> :robot: I have created a release *beep* *boop* --- ## [0.8.2](v0.8.1...v0.8.2) (2026-05-10) ### Features * close audit gaps in error logging and Prometheus instrumentation ([#1821](#1821)) ([ef00fdc](ef00fdc)) ### Bug Fixes * **ci:** eliminate dev-release tag-vs-downstream race + CI hygiene audit ([#1827](#1827)) ([b7b9a59](b7b9a59)) * **config:** close 6 settings reachability + kill-switch gaps ([#1798](#1798)) ([410cb3b](410cb3b)) * correctness / safety fixes from 2026-05-05 audit (Wave 28) ([#1823](#1823)) ([d01e624](d01e624)) ### Performance * composite indexes + cursor pagination + clock seam + SSE rate-limit + Ollama sanitization + retry-after web client + WS reconnect jitter ([#1822](#1822)) ([d1faf86](d1faf86)) ### Refactoring * **api:** move activities lifecycle-cap fallback to ApiBridgeConfig snapshot ([#1840](#1840)) ([7a56e9c](7a56e9c)) * centralise workflow error mapping and shared error codes ([#1778](#1778) sub-tasks A + E) ([#1843](#1843)) ([11132cd](11132cd)) * drive controller-error baseline to zero ([#1778](#1778) sub-task A tail) ([#1846](#1846)) ([e96ae20](e96ae20)) * slim CLAUDE.md, port pr-review-toolkit agents, sync .opencode parity ([#1833](#1833)) ([e6372b8](e6372b8)) * tighten settings → startup-trace baseline (8 → 0) ([#1847](#1847)) ([3376ee2](3376ee2)) ### Documentation * fix CLAUDE.md inaccuracies and drop drift-prone counts ([#1844](#1844)) ([371925f](371925f)) ### Tests * replace test placeholders with real subsystem wiring ([#1845](#1845)) ([ddbb666](ddbb666)) ### CI/CD * **cla:** replace flaky contributor-assistant action with custom read-path step ([#1819](#1819)) ([11aeafe](11aeafe)) * tidy dev-release notes + stagger renovate lockfile day ([#1824](#1824)) ([ec746a9](ec746a9)) ### Maintenance * cleanup roundup, sub-tasks a/c/d/g/h/j/l/m of [#1781](#1781) ([#1838](#1838)) ([099b871](099b871)) * close remaining 5 sub-tasks of [#1781](#1781) (b/e/f/i/k) ([#1852](#1852)) ([59cf0b2](59cf0b2)) * collapse Renovate dep groups into Python / Web / Infrastructure to remove cross-PR overlap ([#1813](#1813)) ([4cbd857](4cbd857)) * **deps,security:** bump in-toto-golang v0.11.0 + drop two patched CVE waivers ([#1851](#1851)) ([0b8b5bb](0b8b5bb)) * disable Renovate vulnerabilityAlerts so security flows into normal updates ([#1834](#1834)) ([6b7d15f](6b7d15f)) * Lock file maintenance ([#1820](#1820)) ([ccbad73](ccbad73)) * Lock file maintenance ([#1842](#1842)) ([13b68a5](13b68a5)) * Lock file maintenance ([#1853](#1853)) ([db6650b](db6650b)) * Update dhi.io/nats:2.14-debian13 Docker digest to eb768bf ([#1841](#1841)) ([37f84fc](37f84fc)) * Update Infrastructure dependencies ([#1815](#1815)) ([75b12fe](75b12fe)) * Update Infrastructure dependencies ([#1831](#1831)) ([3f3c50b](3f3c50b)) * Update Python dependencies ([#1817](#1817)) ([e11332f](e11332f)) * Update Python dependencies ([#1832](#1832)) ([4515c8e](4515c8e)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: synthorg-repo-bot[bot] <279117679+synthorg-repo-bot[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Closes #1837.
Two tests carried
pytest.skip/model_copyworkarounds with TODO comments. Both now exercise their real subsystems.What changed
Sub-task A — real LLM provider e2e (
tests/e2e/test_single_agent_e2e.py)TestRealLLMIntegration.test_real_provider_text_completionno longer unconditionally callspytest.skip("Real LLM provider integration not yet wired"). It now constructs a realLiteLLMDriverwith a minimalProviderConfigand runsAgentEngine.runend-to-end against the configured model.Env-var gates (skip if any missing, CI-safe by default):
REAL_LLM_TEST=1REAL_LLM_MODELllama3.2:1b,gpt-5-mini)REAL_LLM_PROVIDERopenai,anthropic,ollama, …)REAL_LLM_API_KEYREAL_LLM_BASE_URLAt least one of
REAL_LLM_API_KEYorREAL_LLM_BASE_URLmust be set; both unset still skips cleanly. Empty-string env vars are normalised toNonesoProviderConfig.NotBlankStrvalidation accepts them.Sub-task B — ApprovalStore wiring in meta-cycle test (
tests/integration/meta/test_meta_cycle.py)test_proposal_rollout_succeedsno longer mutates the proposal viaproposal.model_copy(update={"status": APPROVED, ...})in isolation. The approval round-trips through the realApprovalStore:run_cycletriggersApprovalGateGuard.evaluate, which registers anApprovalItemwith deterministic iduuid5(NAMESPACE_URL, "proposal:<id>").approval_store.get(approval_id)and assertsis not None+status == PENDING.approval_store.save_if_pending(...)— same first-writer-wins path assynthorg.api.controllers.approvals._decideandsynthorg.meta.mcp.handlers.approvals._decide.execute_rolloutcarries decision metadata (decided_at/decided_by/decision_reason) sourced from the store-residentApprovalItem, not from a free-standing test mutation.A regression in the guard (different id derivation, failure to register, store concurrency bug) now surfaces as a failed
assert item is not Nonerather than a silently-bypassed flow. Uses in-memoryApprovalStore()(no persistence wiring) — persistence parity is covered bytests/conformance/persistence/.Test plan
uv run ruff check src/ tests/ --fix— cleanuv run ruff format src/ tests/— clean (3689 unchanged)uv run mypy src/ tests/—Success: no issues found in 3689 source filesuv run python -m pytest tests/integration/meta/ tests/e2e/ tests/unit/api/ -n 0— 3253 passed, 4 skipped (one xdist scheduler crash on the full unit suite was unrelated worker churn — affected-module pre-push gate stayed clean)llama3.2:1b(REAL_LLM_TEST=1,REAL_LLM_PROVIDER=ollama,REAL_LLM_BASE_URL=http://localhost:11434, no API key) — passed in 31s on the first run, 10s on the warm re-run after a docstring tweak.REAL_LLM_TEST=1set, the test skips with the precise method-level reason instead of running.Review coverage
Commits
137ff3f— wire realLiteLLMDriverin single-agent e2e test5694937— route meta-cycle proposal approval through realApprovalStore8bcd1b5— support local LLM providers (api_key optional,REAL_LLM_BASE_URL)ca6ab9a— clarify auth-config docstring + comment in real-LLM test