refactor(memory): per-call processing_kwargs + observability for ST encode#1943
Conversation
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI (base), Organization UI (inherited) Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (7)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
🧰 Additional context used📓 Path-based instructions (4)src/**/*.py📄 CodeRabbit inference engine (CLAUDE.md)
Files:
⚙️ CodeRabbit configuration file
Files:
**/*.{md,rst}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
docs/**/*.{md,d2,mermaid}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
tests/**/*.py📄 CodeRabbit inference engine (CLAUDE.md)
Files:
⚙️ CodeRabbit configuration file
Files:
🧠 Learnings (1)📚 Learning: 2026-05-05T09:04:46.195ZApplied to files:
🔇 Additional comments (26)
WalkthroughThis pull request refactors the embedding fine-tuning pipeline (Stages 2 and 4) to enforce per-role token-length constraints and improve observability. It adds two observability events, shared encoding helpers that apply different max-length caps for queries (128 tokens) and passages (512 tokens) with truncation enabled and a truncation-likely warning, refactors hard-negative mining and evaluation to use these helpers and centralized persistence, extends unit tests to assert per-call max_length behavior and warnings, and updates docs and Gemini review configuration. 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces explicit token limits and truncation observability for the embedding fine-tuning pipeline, refactoring the encoding logic to ensure consistency between mining and evaluation. The reviewer identified critical runtime NameError risks caused by type hints for CancellationToken and EvalMetrics being evaluated before their imports. Additionally, the reviewer noted that the cancellation token was bypassed during the encoding phase of hard negative mining, which prevents the process from being interruptible.
| model_name: str, | ||
| queries: list[str], | ||
| passages: list[str], | ||
| cancellation: CancellationToken | None, |
There was a problem hiding this comment.
The type hint CancellationToken will cause a NameError at runtime because it is only imported within a TYPE_CHECKING block. Since from __future__ import annotations is not used in this file, type hints in function signatures are evaluated at module definition time. You should use a string forward reference (e.g., "CancellationToken | None") for any type hints that are not available at runtime. This applies to several other new functions in this module as well.
cancellation: "CancellationToken | None",| passages: list[str], | ||
| cancellation: CancellationToken | None, | ||
| progress_callback: ProgressCallback | None, | ||
| ) -> EvalMetrics: |
There was a problem hiding this comment.
| eval_metrics_cls: type[EvalMetrics], | ||
| ) -> EvalMetrics: |
There was a problem hiding this comment.
Similar to other occurrences in this refactor, EvalMetrics is not defined at the module level and will cause a NameError when the function signature is evaluated during module import. Use string forward references for both the argument and the return type.
eval_metrics_cls: type["EvalMetrics"],
) -> "EvalMetrics":| model_name=model_name, | ||
| queries=queries, | ||
| passages=passages, | ||
| cancellation=None, |
There was a problem hiding this comment.
The cancellation token is explicitly set to None here, which bypasses cancellation checks during the encoding phase of hard negative mining. This is typically the most time-consuming part of the process and should remain interruptible by passing the cancellation token received by the parent function.
cancellation=cancellation,
Merging this PR will not alter performance
Comparing Footnotes
|
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/memory.md`:
- Around line 268-271: Replace hardcoded numeric caps in the prose with
runtime-stats markers: substitute "max_length=512" and "max_length=128" with
<!--RS:MAX_PASSAGE_LENGTH--> and <!--RS:MAX_QUERY_LENGTH--> respectively, and
replace any explicit "token cap" numeric mentions on the later lines (275-276)
with an appropriate <!--RS:...--> marker (e.g., <!--RS:TOKEN_CAP-->); keep the
surrounding text and the warning reference
memory.fine_tune.encode_truncation_likely WARNING unchanged, and ensure the
marker names match keys present in data/runtime_stats.yaml.
In `@docs/reference/embedding-evaluation.md`:
- Around line 197-198: The documentation line "Evaluation (NDCG@10, Recall@10)
re-applies the same query (128) / passage (512) token caps with truncation
enabled..." is inconsistently placed under Stage 3 wording; update the sentence
to explicitly place evaluation under Stage 4 (Deploy) to match the pipeline
stage labels — e.g., change the wording to "Stage 4 — Evaluation (NDCG@10,
Recall@10) re-applies..." or otherwise move the evaluation description into the
Stage 4 section so that the phrase "Evaluation (NDCG@10, Recall@10)..." clearly
belongs to Stage 4 and removes any ambiguity about stage ownership.
- Around line 185-188: Replace the hardcoded numeric caps "128" and "512" in the
embedding-evaluation doc with runtime-stats markers: substitute the literal
`_QUERY_MAX_LENGTH = 128` and `_PASSAGE_MAX_LENGTH = 512` occurrences with
`<!--RS:QUERY_MAX_LENGTH-->` and `<!--RS:PASSAGE_MAX_LENGTH-->` respectively
(also update the second occurrence around lines 197-198), keeping the
surrounding wording and the `processing_kwargs={"text": {"max_length": NAME,
"truncation": True}}` example and the
`memory.fine_tune.encode_truncation_likely` log reference unchanged so the doc
pulls values from data/runtime_stats.yaml at build time.
In `@src/synthorg/memory/embedding/fine_tune.py`:
- Around line 446-472: The _mine_negatives_from_pairs function currently
disables cancellation by passing cancellation=None to
_encode_query_passage_pair; update the call to forward the received cancellation
token (use cancellation=cancellation) so encoding respects cancellation checks
like _run_eval_pipeline does; modify the _encode_query_passage_pair invocation
in _mine_negatives_from_pairs to pass through the cancellation parameter and
keep all other args unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4eca8776-9a12-47a6-96af-53b369016b97
📒 Files selected for processing (5)
docs/design/memory.mddocs/reference/embedding-evaluation.mdsrc/synthorg/memory/embedding/fine_tune.pysrc/synthorg/observability/events/memory.pytests/unit/memory/embedding/test_fine_tune.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Build Backend
- GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
- GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
- GitHub Check: Lighthouse Site
- GitHub Check: Build Web Assets (melange)
- GitHub Check: CodSpeed Python benchmarks
- GitHub Check: Test (Python 3.14)
- GitHub Check: Type Check
- GitHub Check: Build Preview
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: Configuration precedence: DB > env > code default viaSettingsService/ConfigResolver(Cat-1); env > code default (Cat-2,read_only_post_init); Cat-3 bootstrap secrets pure env at boot site per docs/reference/configuration-precedence.md
Noos.environ.getoutside startup; pre-init Cat-2 reads usesettings.bootstrap_resolver.resolve_init_value
No hardcoded numeric values; numerics live insettings/definitions/; allowlist only 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal) per check_no_magic_numbers.py
Use PEP 758 exception syntax; noexcept A, B:without parens unless binding; nofrom __future__ import annotations(Python 3.14 has PEP 649)
Type hints on public functions; mypy strict mode required
Use Google-style docstrings
Error classes follow<Domain><Condition>Errornaming fromDomainError; never inheritException/RuntimeErrordirectly per check_domain_error_hierarchy.py
Pydantic v2 frozen models withextra="forbid"on API DTOs (Request/Response/Snapshot/Result/Envelope/Status/Info/Summary suffixes); use@computed_fieldfor derived fields; useNotBlankStrfor identifiers
Args models at every system boundary; useparse_typed()for every external dict ingestion per check_boundary_typed.py
Immutability: usemodel_copy(update=...)orcopy.deepcopy(); deepcopy at system boundaries
Async: useasyncio.TaskGroupfor fan-out/fan-in; helper functions catchException(re-raiseMemoryError/RecursionError)
Clock seam: injectclock: Clock | None = None; tests injectFakeClockfor time-dependent code
Lifecycle services own_lifecycle_lock; timed-out stops mark services as unrestartable
Untrusted content (SEC-1): usewrap_untrusted()fromengine.prompt_safety; useHTMLParseGuardfor HTML per sec-prompt-safety.md
Repository CRUD: implementsave(entity),get(id),delete(id) -> bool,list_items(...), `query(......
Files:
src/synthorg/observability/events/memory.pysrc/synthorg/memory/embedding/fine_tune.py
⚙️ CodeRabbit configuration file
This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.
Files:
src/synthorg/observability/events/memory.pysrc/synthorg/memory/embedding/fine_tune.py
**/*.{md,rst}
📄 CodeRabbit inference engine (CLAUDE.md)
Numerics in README and public docs sourced from
data/runtime_stats.yamlvia<!--RS:NAME-->markers per data/README.md
Files:
docs/design/memory.mddocs/reference/embedding-evaluation.md
docs/**/*.{md,d2,mermaid}
📄 CodeRabbit inference engine (CLAUDE.md)
Use D2 (theme 200, Dark Mauve, v0.7.1) for architecture/nested containers; use mermaid for flowcharts/sequence/pipelines; Markdown tables for tabular data
Files:
docs/design/memory.mddocs/reference/embedding-evaluation.md
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Test markers: use@pytest.mark.{unit,integration,e2e,slow}; async auto; timeout 30s global; coverage 80% minimum
Windows unit tests useWindowsSelectorEventLoopPolicy(3.14 IOCP teardown race); subprocess tests override back
Test doubles: useFakeClockfor Clock seam,mock_of[T](**overrides)for typed-boundary substitutions,SimpleNamespacefor attribute-bags; bareMagicMockat typed boundary blocked by check_mock_spec.py (zero-tolerance)
FakeClock andmock_ofimport fromtests._shared; inject viaclock=and helper subscript
Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add@example(...))
Flaky tests: never skip/xfail; fix fundamentally; useasyncio.Event().wait()notsleep(large)for timeouts
Files:
tests/unit/memory/embedding/test_fine_tune.py
⚙️ CodeRabbit configuration file
Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare
@settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which@given() honors automatically.
Files:
tests/unit/memory/embedding/test_fine_tune.py
🧠 Learnings (1)
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.
Applied to files:
src/synthorg/observability/events/memory.pytests/unit/memory/embedding/test_fine_tune.pysrc/synthorg/memory/embedding/fine_tune.py
🔇 Additional comments (11)
src/synthorg/observability/events/memory.py (1)
116-119: LGTM!src/synthorg/memory/embedding/fine_tune.py (7)
30-31: LGTM!
87-93: LGTM!
95-99: LGTM!
101-134: LGTM!
137-164: LGTM!
167-190: LGTM!
475-476: LGTM!Also applies to: 479-516, 707-780
tests/unit/memory/embedding/test_fine_tune.py (3)
31-83: LGTM!
86-101: LGTM!
287-521: LGTM!
| 2. **Hard negative mining**: base model embeds all passages (max_length=512) and queries | ||
| (max_length=128) with truncation enabled; top-k semantically similar but non-matching | ||
| passages become hard negatives. Inputs that overflow the token cap surface a | ||
| `memory.fine_tune.encode_truncation_likely` WARNING so silent quality loss is visible |
There was a problem hiding this comment.
Use runtime-stats markers for newly added numeric values in public docs.
The newly added numeric caps are hardcoded in prose. Please source these numerics via <!--RS:NAME--> markers to satisfy the docs data-sourcing rule.
As per coding guidelines, "Numerics in README and public docs sourced from data/runtime_stats.yaml via <!--RS:NAME--> markers per data/README.md".
Also applies to: 275-276
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/design/memory.md` around lines 268 - 271, Replace hardcoded numeric caps
in the prose with runtime-stats markers: substitute "max_length=512" and
"max_length=128" with <!--RS:MAX_PASSAGE_LENGTH--> and
<!--RS:MAX_QUERY_LENGTH--> respectively, and replace any explicit "token cap"
numeric mentions on the later lines (275-276) with an appropriate <!--RS:...-->
marker (e.g., <!--RS:TOKEN_CAP-->); keep the surrounding text and the warning
reference memory.fine_tune.encode_truncation_likely WARNING unchanged, and
ensure the marker names match keys present in data/runtime_stats.yaml.
| - Encoding constraints: per-call `processing_kwargs={"text": {"max_length": N, "truncation": True}}` | ||
| with `_QUERY_MAX_LENGTH = 128` for queries and `_PASSAGE_MAX_LENGTH = 512` for passages. | ||
| Inputs whose word count likely exceeds the token limit emit a | ||
| `memory.fine_tune.encode_truncation_likely` WARNING log. |
There was a problem hiding this comment.
Replace hardcoded numeric caps with runtime-stats markers.
These newly introduced public-doc numerics should be injected via <!--RS:NAME--> markers instead of inline literals.
As per coding guidelines, "Numerics in README and public docs sourced from data/runtime_stats.yaml via <!--RS:NAME--> markers per data/README.md".
Also applies to: 197-198
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/reference/embedding-evaluation.md` around lines 185 - 188, Replace the
hardcoded numeric caps "128" and "512" in the embedding-evaluation doc with
runtime-stats markers: substitute the literal `_QUERY_MAX_LENGTH = 128` and
`_PASSAGE_MAX_LENGTH = 512` occurrences with `<!--RS:QUERY_MAX_LENGTH-->` and
`<!--RS:PASSAGE_MAX_LENGTH-->` respectively (also update the second occurrence
around lines 197-198), keeping the surrounding wording and the
`processing_kwargs={"text": {"max_length": NAME, "truncation": True}}` example
and the `memory.fine_tune.encode_truncation_likely` log reference unchanged so
the doc pulls values from data/runtime_stats.yaml at build time.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1943 +/- ##
==========================================
+ Coverage 85.14% 85.20% +0.06%
==========================================
Files 1846 1846
Lines 107753 107802 +49
Branches 9287 9288 +1
==========================================
+ Hits 91741 91855 +114
+ Misses 13763 13694 -69
- Partials 2249 2253 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
3db012f to
b92ec03
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/synthorg/memory/embedding/fine_tune.py`:
- Around line 173-179: The code currently assumes every JSONL record has "query"
and "positive_passage" and will raise opaque KeyError; iterate the loaded pairs
and validate each record at the ingestion boundary (use parse_typed() or a small
schema checker) to ensure required fields exist and are the right type, raising
a contextual ValueError identifying the bad record (index and missing/invalid
fields) if validation fails; after validation (or after filtering out invalid
rows if intended) enforce require_non_empty and then build queries = [p["query"]
for p in validated_pairs] and passages = [p["positive_passage"] for p in
validated_pairs]; reference _read_jsonl, pairs, require_non_empty, parse_typed()
and the returned queries/passages to locate where to add this check.
- Around line 146-154: The code currently calls _encode_with_observability to
produce q_embs before honoring a potential pre-cancellation; add an early
cancellation check (call cancellation.check()) immediately before invoking
_encode_with_observability so a cancelled run skips the expensive encode. Update
the block around q_embs = await _encode_with_observability(...) to call
cancellation.check() if cancellation is not None prior to awaiting the encode,
leaving the existing post-encode check in place.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d7b39860-2367-4847-88d8-6f0375d2306b
📒 Files selected for processing (7)
.gemini/config.yaml.gemini/styleguide.mddocs/design/memory.mddocs/reference/embedding-evaluation.mdsrc/synthorg/memory/embedding/fine_tune.pysrc/synthorg/observability/events/memory.pytests/unit/memory/embedding/test_fine_tune.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
- GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
- GitHub Check: Build Backend
- GitHub Check: CodSpeed Python benchmarks
- GitHub Check: Build Web Assets (melange)
- GitHub Check: Test (Python 3.14)
- GitHub Check: Type Check
- GitHub Check: Lighthouse Site
- GitHub Check: Build Preview
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: Configuration precedence: DB > env > code default viaSettingsService/ConfigResolver(Cat-1); env > code default (Cat-2,read_only_post_init); Cat-3 bootstrap secrets pure env at boot site per docs/reference/configuration-precedence.md
Noos.environ.getoutside startup; pre-init Cat-2 reads usesettings.bootstrap_resolver.resolve_init_value
No hardcoded numeric values; numerics live insettings/definitions/; allowlist only 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal) per check_no_magic_numbers.py
Use PEP 758 exception syntax; noexcept A, B:without parens unless binding; nofrom __future__ import annotations(Python 3.14 has PEP 649)
Type hints on public functions; mypy strict mode required
Use Google-style docstrings
Error classes follow<Domain><Condition>Errornaming fromDomainError; never inheritException/RuntimeErrordirectly per check_domain_error_hierarchy.py
Pydantic v2 frozen models withextra="forbid"on API DTOs (Request/Response/Snapshot/Result/Envelope/Status/Info/Summary suffixes); use@computed_fieldfor derived fields; useNotBlankStrfor identifiers
Args models at every system boundary; useparse_typed()for every external dict ingestion per check_boundary_typed.py
Immutability: usemodel_copy(update=...)orcopy.deepcopy(); deepcopy at system boundaries
Async: useasyncio.TaskGroupfor fan-out/fan-in; helper functions catchException(re-raiseMemoryError/RecursionError)
Clock seam: injectclock: Clock | None = None; tests injectFakeClockfor time-dependent code
Lifecycle services own_lifecycle_lock; timed-out stops mark services as unrestartable
Untrusted content (SEC-1): usewrap_untrusted()fromengine.prompt_safety; useHTMLParseGuardfor HTML per sec-prompt-safety.md
Repository CRUD: implementsave(entity),get(id),delete(id) -> bool,list_items(...), `query(......
Files:
src/synthorg/observability/events/memory.pysrc/synthorg/memory/embedding/fine_tune.py
⚙️ CodeRabbit configuration file
This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.
Files:
src/synthorg/observability/events/memory.pysrc/synthorg/memory/embedding/fine_tune.py
**/*.{md,rst}
📄 CodeRabbit inference engine (CLAUDE.md)
Numerics in README and public docs sourced from
data/runtime_stats.yamlvia<!--RS:NAME-->markers per data/README.md
Files:
docs/reference/embedding-evaluation.mddocs/design/memory.md
docs/**/*.{md,d2,mermaid}
📄 CodeRabbit inference engine (CLAUDE.md)
Use D2 (theme 200, Dark Mauve, v0.7.1) for architecture/nested containers; use mermaid for flowcharts/sequence/pipelines; Markdown tables for tabular data
Files:
docs/reference/embedding-evaluation.mddocs/design/memory.md
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Test markers: use@pytest.mark.{unit,integration,e2e,slow}; async auto; timeout 30s global; coverage 80% minimum
Windows unit tests useWindowsSelectorEventLoopPolicy(3.14 IOCP teardown race); subprocess tests override back
Test doubles: useFakeClockfor Clock seam,mock_of[T](**overrides)for typed-boundary substitutions,SimpleNamespacefor attribute-bags; bareMagicMockat typed boundary blocked by check_mock_spec.py (zero-tolerance)
FakeClock andmock_ofimport fromtests._shared; inject viaclock=and helper subscript
Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add@example(...))
Flaky tests: never skip/xfail; fix fundamentally; useasyncio.Event().wait()notsleep(large)for timeouts
Files:
tests/unit/memory/embedding/test_fine_tune.py
⚙️ CodeRabbit configuration file
Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare
@settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which@given() honors automatically.
Files:
tests/unit/memory/embedding/test_fine_tune.py
🧠 Learnings (1)
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.
Applied to files:
src/synthorg/observability/events/memory.pytests/unit/memory/embedding/test_fine_tune.pysrc/synthorg/memory/embedding/fine_tune.py
🔇 Additional comments (4)
.gemini/config.yaml (1)
1-18: LGTM!.gemini/styleguide.md (1)
1-147: LGTM!docs/design/memory.md (1)
268-271: Use runtime-stats markers instead of inline numeric caps in public docs.This still uses hardcoded numeric values in prose for token caps.
As per coding guidelines, "Numerics in README and public docs sourced from
data/runtime_stats.yamlvia<!--RS:NAME-->markers per data/README.md".Also applies to: 275-276
docs/reference/embedding-evaluation.md (1)
185-188: Replace hardcoded token-cap numerics with runtime-stats markers.Public-doc numeric values are still inline literals instead of marker-based sourced values.
As per coding guidelines, "Numerics in README and public docs sourced from
data/runtime_stats.yamlvia<!--RS:NAME-->markers per data/README.md".Also applies to: 202-203
…1 reviewer-config) - Forward cancellation token in _mine_negatives_from_pairs so the encoding phase of hard-negative mining is interruptible (CodeRabbit + Gemini, fine_tune.py:462). - Restructure embedding-evaluation.md pipeline stages: split the old Stage 3 evaluation bullet into a dedicated Stage 4, renumber Deploy to Stage 5 so the doc matches the fine_tune.py module header (5-stage pipeline) and removes stage-ownership ambiguity (CodeRabbit). - Add .gemini/styleguide.md + .gemini/config.yaml so Gemini Code Assist stops flagging PEP 649 TYPE_CHECKING annotations as NameError risks on every PR, and reads project conventions from CLAUDE.md. Skipped (factually wrong against current code): - Gemini #1/#2/#3 (NameError on CancellationToken / EvalMetrics in TYPE_CHECKING block): Python 3.14 PEP 649 evaluates annotations lazily; existing unquoted CancellationToken | None at fine_tune.py:568 has been in CI without issue. CLAUDE.md mandates this convention. - CodeRabbit RS-marker findings on docs/design/memory.md and docs/reference/embedding-evaluation.md: out of scope for scripts/check_doc_numeric_macros.py, which scopes to README + 4 docs (neither file present) and matches digits only adjacent to specific stat nouns (tests/providers/agents/stars/releases).
- Add an early cancellation.check() before the query encode in _encode_query_passage_pair so a pre-cancelled run skips the expensive first encode call (CodeRabbit, fine_tune.py:154). - Validate JSONL records at the ingestion boundary in _load_query_passage_pairs: every record must have the 'query' and 'positive_passage' string fields. Raise contextual ValueError / TypeError identifying the bad record index, and log MEMORY_FINE_TUNE_VALIDATION_FAILED so silent data-shape regressions surface during fine-tune runs (CodeRabbit, fine_tune.py:179). - Cover both new failure paths in test_fine_tune.py (missing-field + non-string-field records). Skipped (factually wrong against current code, same as round 1): - CodeRabbit RS-marker findings on docs/design/memory.md and docs/reference/embedding-evaluation.md: both files remain out of scope for scripts/check_doc_numeric_macros.py, which scans only README + 4 specific docs and only matches digits adjacent to tracked stat nouns (tests/providers/agents/stars/releases). max_length=N is not a tracked stat noun and the gate would not fail on these lines.
b92ec03 to
0c5f90f
Compare
<!-- HIGHLIGHTS_START --> ## Highlights > _AI-generated summary (model: `openai/gpt-4.1-mini` via GitHub Models). Commit-based changelog below._ ### What you'll notice - Frontend WP-6 update with UX polish improves user interface and workflow. - Dashboard and training endpoint improvements enhance observability and dispatch behavior. - Web storybook now supports change detection for more responsive UI interactions. - Git hooks now isolated per worktree for cleaner repository management. - Providers automatically detect native streaming support in Litellm models. ### What's new - Added a new pipeline to convert Pydantic DTOs to TypeScript for better front-end compatibility. ### Under the hood - Refactored settings to three precedence categories, removing YAML tier for simpler configuration. - Completed RootConfig mirror coverage for enhanced configuration consistency. - Adopted API conventions with better query performance and forbidden extra fields for stricter validation. - Improved persistence, layer discipline, and restart safety in core work packages. - CI updated with split test jobs and tightened coverage gates for better test quality. - Switched to direct Trivy binary for security scans, removing previous Trivy action dependency. - Enhanced memory management with per-call processing options and better observability during speech-to-text encoding. - Various dependency updates for Python, infrastructure, and lock files maintain security and stability. - Removed TypeScript DTO type-tightening overlays to simplify type management. - Codebase audit tightened skill sets to prevent false positivity in class detection by 2026. <!-- HIGHLIGHTS_END --> :robot: I have created a release *beep* *boop* --- ## [0.8.5](v0.8.4...v0.8.5) (2026-05-17) ### Features * **codegen:** pydantic-to-typescript DTO pipeline + parity gate (closes [#1889](#1889)) ([#1909](#1909)) ([0265ef5](0265ef5)) * **storybook:** enable changeDetection + trim web/CLAUDE.md ([#1939](#1939)) ([3b1f4c0](3b1f4c0)) * **web,setup:** WP-6 frontend + UX polish ([#1941](#1941)) ([d9ca76d](d9ca76d)) ### Bug Fixes * correct invalid git for-each-ref syntax in post-merge-cleanup skill ([#1946](#1946)) ([69a1649](69a1649)) * dashboard polish, training endpoint dispatch, and observability cleanup ([#1911](#1911)) ([b61e9e8](b61e9e8)) * per-worktree git-hook isolation + hookify gate migration + MSW drift fix ([#1949](#1949)) ([e3f8495](e3f8495)) * **providers:** read supports_native_streaming from litellm model info ([#1942](#1942)) ([60364ca](60364ca)) * security and audit coverage (closes [#1883](#1883)) ([#1904](#1904)) ([d8ebf55](d8ebf55)) ### Performance * **ci:** mypy --num-workers=4 + enable ruff TID255 ([#1944](#1944)) ([484c1d3](484c1d3)) ### Refactoring * **ci:** drop aquasecurity/trivy-action, use direct trivy binary ([#1940](#1940)) ([df1f946](df1f946)) * **memory:** per-call processing_kwargs + observability for ST encode ([#1943](#1943)) ([3aa9d20](3aa9d20)) * Phase 7 follow-up — complete RootConfig mirror coverage (closes [#1907](#1907)) ([#1914](#1914)) ([605500b](605500b)) * **settings:** collapse precedence to three categories; drop YAML tier (closes [#1890](#1890)) ([#1910](#1910)) ([efd54c9](efd54c9)) * WP-3 API conventions + query performance + project-wide extra=forbid ([#1953](#1953)) ([504d579](504d579)), closes [#1918](#1918) * WP-4 settings + cross-cutting (clock seam, contextvars, dispatch, plugin surfaces) ([#1954](#1954)) ([7207d92](7207d92)) * **wp1:** persistence + layer discipline + restart safety ([#1945](#1945)) ([57586fb](57586fb)) ### Documentation * **wp5:** public-facing truth refresh ([#1924](#1924)) ([afb5cc5](afb5cc5)) ### CI/CD * split test job by marker with airtight aggregate coverage gate ([#1948](#1948)) ([0b818d5](0b818d5)), closes [#1938](#1938) [#1937](#1937) ### Maintenance * **codebase-audit:** tighten skill to prevent 2026-05-15 FP classes ([#1923](#1923)) ([9317ed1](9317ed1)) * Lock file maintenance ([#1913](#1913)) ([c08a355](c08a355)) * Lock file maintenance ([#1950](#1950)) ([8940ab1](8940ab1)) * remove TS DTO type-tightening overlays ([#1915](#1915)) ([d296214](d296214)), closes [#1906](#1906) * Update Infrastructure dependencies ([#1928](#1928)) ([d19fae5](d19fae5)) * Update Python dependencies ([#1929](#1929)) ([75cc2c8](75cc2c8)) * **wp7:** hygiene, stubs, test/CI/tooling, doc gaps, boundary patterns doc ([#1926](#1926)) ([c29eb32](c29eb32)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: synthorg-repo-bot[bot] <279117679+synthorg-repo-bot[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Summary
Two-stage refactor of
src/synthorg/memory/embedding/fine_tune.pyto adopt the sentence-transformers 5.5 per-callprocessing_kwargsoverride.Stage 1 (first commit,
b978eadc4): the 2 encode calls inmine_hard_negativesand the 4 inevaluate_checkpointnow pass distinctprocessing_kwargs={"text": {"max_length": N, "truncation": True}}for short queries (_QUERY_MAX_LENGTH = 128) vs long passages (_PASSAGE_MAX_LENGTH = 512) instead of relying on the model tokenizer default. Two new module-levelFinal[int]constants alongside the existing_DEFAULT_*block. Initial pair of unit tests asserting kwargs forwarding via aSentenceTransformerspy.Stage 2 (second commit,
3db012f8e): pre-PR review pipeline addressed 14 findings:truncation=True(resilience-audit, Critical): added 2 event constants (memory.fine_tune.encode_invoked,memory.fine_tune.encode_truncation_likely) and an_encode_with_observabilityhelper that emits a DEBUG log per encode call (role / model / max_length / batch_size) and a WARNING when any text's word count exceedsmax_length / _TOKENS_PER_WORD(1.33). Quality regressions are no longer silent.docs/reference/embedding-evaluation.mdStage 2/3 anddocs/design/memory.md"Domain-Specific Embedding Fine-Tuning". Evaluation stage now explicitly references the same caps.kwargsdict (includingshow_progress_bar=False), not just theprocessing_kwargssub-dict. Order-dependent positional unpacking replaced with role-keyed lookup ((model_name, texts) -> call)._RecordingEncodernow raises if production switches toencode_query()/encode_document(), so the alternate sentence-transformers 3+/5+ API can't silently no-op the spy.mine_hard_negatives68 -> 49 lines;evaluate_checkpoint78 -> 43 lines. Both now under the 50-line CLAUDE.md limit. Extracted helpers:_encode_with_observability,_encode_query_passage_pair,_load_query_passage_pairs,_persist_triples,_mine_negatives_from_pairs,_run_eval_pipeline,_persist_eval_metrics,_report_progress,_select_hard_negatives.test_emits_truncation_warning_for_long_query,test_emits_truncation_warning_for_long_passage) that exercise the new WARNING path._FakeSentenceTransformersModuleProtocolfor the test fake's return type plus a 2-line comment on thenp.eye(max(len(texts), 1), ...)[: len(texts)]empty-input guard.Track B no-op (logfire
level=)The 2026-05-16 bump also added
level=to@logfire.instrument(...). A repo-wide grep finds zero such decorators insrc/synthorg/; all observability is OpenTelemetrytracer.start_as_current_span(...)which has nolevel=kwarg. The kwarg has no applicable target here, so this PR ships Track A only. Detail in the plan file (.claude/plans/polished-twirling-stonebraker.md).Test plan
uv run ruff check src/synthorg/ tests/-> clean.uv run mypy src/synthorg/ tests/-> clean (3816 source files).uv run python -m pytest tests/ -m unit -n 8-> 28957 passed, 18 skipped (logfire extra + POSIX-only paths), 146.8 s.uv run pre-commit run --files <all modified>-> all gates green; no em-dashes, no reviewer citations, doc-count floors satisfied, magic-numbers gate clean.Review coverage
Pre-reviewed by 16 agents (docs-consistency, code-reviewer, python-reviewer, type-design-analyzer, async-concurrency-reviewer, comment-quality-rot, pr-test-analyzer, comment-analyzer, logging-audit, resilience-audit, conventions-enforcer, test-quality-reviewer, plus the 4 codebase-audit mini-pass agents). Full triage table:
_audit/pre-pr-review/triage.md. 14 actionable items + 3 info; 14 fixed, 2 user-skipped (item 14 pre-existing PEP 758 comment in untouched code; item 17 sub-agent's own Bash policy note, not actionable).