fix(eval): disable ingest rate limit in M6 seeder to unblock baseline (#58 followup) #305
Merged
…#58 followup)

PR #304's first CI baseline produced overall recall 0.000 with 14/25 cases erroring. Root cause: the M6 seeder runs 25 cases back-to-back in a single process, and the LLM-08 ingest rate limiter (#216, burst=10 / refill=1.0/s) refuses cases 12+ with `_IngestRefused("rate_limit_exceeded")`. Math: 10 initial tokens + ~1 refill while seeding the first 11 cases = 11 cases through, then 14 cases (U4-U8 + all 9 T*) erred.

The rate limiter is for production agent-loop safety, not eval throughput. There's already a documented env var to disable it (see `handlers.ingest._check_rate_limit` docstring): when ``BICAMERAL_INGEST_RATE_LIMIT_DISABLE`` is truthy, the bucket check is short-circuited. Setting it in the seeder's per-case env setup (saved + restored like `REPO_PATH` and `SURREAL_URL`) is the documented path.

Symptom before this fix (post-#304 CI on dev):

    M6 preflight retrieval recall eval — 25 cases
    overall recall : 0.000
    errors: 14
    transitive_relevance : 0/9 surfaced, 9 errors  ← all rate-limited
    unbound_decision : 0/8 surfaced, 5 errors  ← last 5 rate-limited
    vocabulary_mismatch : 0/8 surfaced, 0 errors  ← first 8, ran clean

Expected after this fix:
- vocabulary_mismatch stays 0/8 surfaced (that's the honest BM25-can't-bridge-vocab baseline the eval was designed to surface).
- transitive_relevance + unbound_decision should produce non-zero recall once the seeder doesn't trip the rate limiter.

Belt-and-suspenders alternatives considered:
- clear the `_RATE_LIMIT_REGISTRY` dict between cases: works, but reaches into private state and skips the env-var contract
- sleep between cases to allow refill: works, but slow, and hides the fact that the rate limiter isn't appropriate for evals
- lower burst/refill via `.bicameral/config.yaml` in the synthetic repo: works, but requires every Phase B eval surface to re-author the same config

The env-var path is the documented API and one line.

Smoke verification
------------------
- 16/16 sociable unit tests pass on the classifier + aggregator
- ruff check + format + mypy all green on the touched file

Refs #58 (Phase A baseline). Followup to PR #304.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Tail-end fix on top of PR #304. The first M6 baseline reading on dev surfaced overall recall 0.000 with 14/25 cases erroring. The root cause is the LLM-08 ingest rate limiter (#216, burst=10 / refill=1.0/s) refusing cases 12+ during seeding. The rate limiter is for production agent-loop safety, not eval throughput; there's already a documented env var to bypass it (`BICAMERAL_INGEST_RATE_LIMIT_DISABLE`).
Math behind the brokenness
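The seeder runs 25 cases back-to-back in one process. The bucket starts at 10 tokens and refills at 1/s. Seeding is fast, so only ~1 token refills while the first 11 cases seed: 10 initial tokens + ~1 refill = 11 cases through, and cases 12+ (U4-U8 + all 9 T*, 14 in total) are refused with `_IngestRefused("rate_limit_exceeded")`.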
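Post-#304 CI on dev confirms the pattern exactly:
- vocabulary_mismatch: 0/8 surfaced, 0 errors (first 8, ran clean)
- unbound_decision: 0/8 surfaced, 5 errors (last 5 rate-limited)
- transitive_relevance: 0/9 surfaced, 9 errors (all rate-limited)
- overall: recall 0.000, 14 errors across 25 cases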
(The vocabulary_mismatch zero is the honest BM25-can't-bridge-vocab baseline — designed to surface that miss mode. The eval is working; it just can't measure the other two categories because the seeder doesn't reach them.)
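The arithmetic can be replayed with a small token-bucket simulation. This is only a sketch, not the project's limiter: the `TokenBucket` class, the `simulate_seeder` helper, and the timings (about 100 ms per successful seed, about 1 ms per refused one) are assumptions chosen to match the "~1 refill while seeding the first 11 cases" estimate. Integer milli-tokens and milliseconds are used so the boundary case is not affected by floating-point rounding.

```python
COST_MT = 1000  # one ingest costs one token, expressed as 1000 milli-tokens


class TokenBucket:
    """Toy token bucket (hypothetical; not the real LLM-08 limiter)."""

    def __init__(self, burst_mt: int = 10_000, refill_mt_per_ms: int = 1) -> None:
        # burst=10 tokens, refill=1.0 token/s == 1 milli-token per millisecond
        self.burst_mt = burst_mt
        self.refill_mt_per_ms = refill_mt_per_ms
        self.tokens_mt = burst_mt
        self.last_ms = 0

    def try_acquire(self, now_ms: int) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        elapsed = now_ms - self.last_ms
        self.tokens_mt = min(self.burst_mt, self.tokens_mt + elapsed * self.refill_mt_per_ms)
        self.last_ms = now_ms
        if self.tokens_mt >= COST_MT:
            self.tokens_mt -= COST_MT
            return True
        return False


def simulate_seeder(n_cases: int = 25, ok_ms: int = 100, refused_ms: int = 1) -> list[bool]:
    """Run n_cases back-to-back on a simulated clock; True means the case was seeded."""
    bucket = TokenBucket()
    now_ms = 0
    outcomes = []
    for _ in range(n_cases):
        ok = bucket.try_acquire(now_ms)
        outcomes.append(ok)
        # Assumed timings: a refused ingest fails fast, a successful one takes longer.
        now_ms += ok_ms if ok else refused_ms
    return outcomes


outcomes = simulate_seeder()
print(f"{sum(outcomes)} seeded, {len(outcomes) - sum(outcomes)} refused")
# → 11 seeded, 14 refused
```

Under these assumed timings the refused cases return so quickly that the bucket never refills to a full token, which is why everything after case 11 errors instead of recovering partway through.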
Fix
One-line env var in the seeder's per-case setup, saved + restored like `REPO_PATH` and `SURREAL_URL`, plus the matching restore block in the `finally:` clause. Total diff: 15 lines.
Alternatives considered + rejected
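- Clear `_RATE_LIMIT_REGISTRY` between cases: works, but reaches into private state and skips the env-var contract.
- Sleep between cases to allow refill: works, but slow, and hides the fact that the rate limiter isn't appropriate for evals.
- Lower burst/refill via `.bicameral/config.yaml` per fixture: works, but requires every Phase B eval surface to re-author the same config.
The env-var path is the documented API and one line.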
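Expected after this fix
- vocabulary_mismatch stays 0/8 surfaced (the honest BM25-can't-bridge-vocab baseline the eval was designed to surface).
- transitive_relevance + unbound_decision should produce non-zero recall once the seeder doesn't trip the rate limiter.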
Local verification
- `bicameral.link_commit` clean: 0 drift, 0 pending checks
Refs
Refs #58 (Phase A baseline). Followup to PR #304.
🤖 Generated with Claude Code