Fixes: crawl code storage issue with `<think>` tags for ollama models. #775
Conversation
Walkthrough
Adds a new public table, `archon_page_metadata`, linked from `archon_crawled_pages` via a nullable foreign key, and extends the code storage service with `<think>`-tag detection for reasoning models.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant C as Crawler
    participant DB as Database
    participant PM as archon_page_metadata
    participant CP as archon_crawled_pages
    C->>DB: INSERT INTO archon_page_metadata (url, content, section, counts, metadata)
    DB->>PM: Enforce UNIQUE(url), indexes, comments, RLS policy
    C->>DB: INSERT/UPDATE archon_crawled_pages with page_id -> PM.id
    DB->>CP: Maintain FK (ON DELETE SET NULL)
    Note over PM,CP: New nullable FK links crawled pages to page-level metadata
```

```mermaid
sequenceDiagram
    autonumber
    participant S as CodeStorageService
    participant P as ProviderSelector
    participant C as SharedLLMClient
    participant M as ModelAPI
    S->>S: _is_reasoning_text_response(text)<br/>detects "<think>" early
    S->>P: choose provider path (Grok, Ollama, OpenRouter, OpenAI, etc.)
    alt Shared client available
        S->>C: use shared client
        C->>M: request with provider-specific response_format
    else No shared client
        S->>M: create client per-call and request
    end
    M-->>S: model response (JSON/text)
    S->>S: validate/sanitize JSON, apply fallbacks, aggregate summaries
```
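The validate/sanitize step at the end of the second diagram can be sketched roughly as follows. This is an illustrative assumption, not the actual service code: the function name and regular expressions are invented for the sketch.

```python
import json
import re


def sanitize_model_json(text: str) -> dict:
    """Illustrative sketch: strip a <think> preamble and markdown fences, then parse JSON."""
    # Drop an extended-thinking block if the model emitted one
    text = re.sub(r"(?is)<think>.*?</think>", "", text)
    # Unwrap a ```json ... ``` fence if present
    fence = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.S)
    if fence:
        text = fence.group(1)
    # json.loads raises on malformed data rather than returning None,
    # matching the "never silently accept bad data" guideline
    return json.loads(text.strip())
```

A caller would wrap this in a try/except and skip the item on `json.JSONDecodeError`, per the batch-processing guidelines quoted later in this review.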
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- migration/complete_setup.sql (4 hunks)
- python/src/server/services/storage/code_storage_service.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
python/src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code
python/src/**/*.py: Fail fast and loud on startup/config/auth/database/critical dependency failures and validation errors in backend services
Preserve full stack traces with logging.exc_info=True in Python logging
Use specific exception types and avoid catching generic Exception
Use Pydantic to raise on data validation errors; never silently accept bad data
Python 3.12 code should conform to a 120-character line length
Never return None to indicate failure in backend Python; raise exceptions with details instead
Files:
python/src/server/services/storage/code_storage_service.py
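A minimal pattern satisfying the batch-processing guidelines above (continue on per-item failure, log with `exc_info=True`, report success counts and failure lists); the function and field names here are illustrative, not taken from the codebase.

```python
import logging

logger = logging.getLogger(__name__)


def store_batch(items: list[dict]) -> dict:
    """Illustrative batch loop: skip failed items, never persist bad data, report counts."""
    stored = 0
    failures: list[str] = []
    for item in items:
        try:
            if not item.get("url"):
                # Specific exception type with operation context and the item ID
                raise ValueError(f"store_batch: missing url for item id={item.get('id')!r}")
            stored += 1  # actual persistence elided in this sketch
        except ValueError as exc:
            # Preserve the full stack trace; continue with the remaining items
            logger.error("store_batch item failed: %s", exc, exc_info=True)
            failures.append(str(exc))
    return {"stored": stored, "failures": failures}
```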
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept corrupted data in backend flows; skip failed items entirely instead of storing bad data (e.g., zero embeddings, null FKs, malformed JSON)
Files:
python/src/server/services/storage/code_storage_service.py
python/src/server/services/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/services/**/*.py: For batch/background/optional operations, continue processing but log detailed per-item failures; report success counts and failure lists
Implement business logic in service modules, keeping database concerns separate
Place service logic under python/src/server/services/
Files:
python/src/server/services/storage/code_storage_service.py
🔇 Additional comments (4)
python/src/server/services/storage/code_storage_service.py (1)
85-87: LGTM! The XML-style `<think>` tag detection provides a defensive fallback for extended-thinking models. The case-insensitive check within the first 100 characters is reasonable, and false positives are handled gracefully by the downstream reasoning extraction logic.

migration/complete_setup.sql (3)
970-1027: LGTM! The `archon_page_metadata` table schema is well-designed with:
- Proper primary key, foreign keys, and constraints
- Appropriate nullable/not-null column specifications
- Comprehensive indexing strategy including a GIN index for JSONB metadata
- Nullable `page_id` foreign key on `archon_crawled_pages` that correctly handles existing data
- Thorough column documentation
786-790: Policy looks correct; ensure table creation order is fixed. The RLS policy for `archon_page_metadata` follows the correct pattern and grants appropriate public read access. However, this policy depends on the table creation order issue at line 765 being resolved.
1065-1067: Migration tracking entries look correct. The migration tracking entries are properly formatted with appropriate version numbers and descriptive names. Note that there are two migration entries numbered "009" (cascade_delete and provider_placeholders), which is mentioned in the PR description. While unusual, this is not an issue for a complete setup script where execution order is managed by the SQL structure itself.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- migration/complete_setup.sql (3 hunks)
- python/src/server/services/storage/code_storage_service.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
python/src/server/services/storage/code_storage_service.py (1)
python/src/server/services/llm_provider_service.py (4)
- get_llm_client (313-548)
- prepare_chat_completion_params (1095-1132)
- extract_message_text (885-922)
- synthesize_json_from_reasoning (979-1092)
```sql
CREATE POLICY "Allow public read access to archon_page_metadata"
  ON archon_page_metadata
  FOR SELECT
  TO public
  USING (true);
```
Move archon_page_metadata policy after the table is created
This CREATE POLICY runs before archon_page_metadata exists (the table is only created later in Section 6.5), so complete_setup.sql will fail on clean installs. Please relocate the policy definition to immediately after the table creation/RLS enablement in Section 6.5.
```diff
-CREATE POLICY "Allow public read access to archon_page_metadata"
-  ON archon_page_metadata
-  FOR SELECT
-  TO public
-  USING (true);
 ...
 -- Enable RLS on archon_page_metadata
 ALTER TABLE archon_page_metadata ENABLE ROW LEVEL SECURITY;
+
+CREATE POLICY "Allow public read access to archon_page_metadata"
+  ON archon_page_metadata
+  FOR SELECT
+  TO public
+  USING (true);
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sql
-- Enable RLS on archon_page_metadata
ALTER TABLE archon_page_metadata ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Allow public read access to archon_page_metadata"
  ON archon_page_metadata
  FOR SELECT
  TO public
  USING (true);
```
🤖 Prompt for AI Agents
In migration/complete_setup.sql around lines 785 to 789, the CREATE POLICY for
archon_page_metadata is defined before the archon_page_metadata table exists
which causes failures on clean installs; move this CREATE POLICY statement to
immediately after the archon_page_metadata table creation and its RLS (row-level
security) enablement in Section 6.5 so the table exists when the policy is
created, ensuring ordering: create table -> enable RLS -> CREATE POLICY.
This looks good @sean-eskerium! Going to merge this now and do some testing on main
…oleam00#775)

* Fixes: crawl code storage issue with `<think>` tags for ollama models.
* updates from code rabbit review
…AG nodes (#775)

* feat(core): bump Codex SDK to 0.116.0, enable structured output for DAG nodes

Bump @openai/codex-sdk from ^0.104.0 to ^0.116.0 and wire up the new TurnOptions.outputSchema support so Codex DAG nodes can use output_format for structured JSON responses; previously warn-and-skipped as Claude-only.

- Replace custom CodexThreadOptions with SDK's ThreadOptions type
- Pass outputSchema and abort signal via TurnOptions to runStreamed()
- Simplify extractUsageFromCodexEvent to use typed SDK Usage fields
- Remove output_format warn-and-skip for Codex in DAG executor
- Add outputFormat to Codex node options in resolveNodeProviderAndModel
- Suppress false-positive structured output warning for Codex nodes (Codex returns structured output inline in agent_message text)
- Update type comments to reflect both Claude and Codex support
- Add tests for outputSchema passthrough, signal passthrough, and Codex DAG node output_format with downstream condition evaluation

* fix: address review findings for Codex SDK bump

- Guard extractUsageFromCodexEvent against null usage (warn + zero fallback)
- Validate Codex structured output is JSON, warn user if not
- Use if/assign instead of conditional spreads for TurnOptions and Codex options (consistency with Claude path)
- Remove eslint-disable comments in tests, use chunks.push() pattern
- Update docs: fix 6 stale "Claude only" references in docs/authoring-workflows.md and CLAUDE.md
Pull Request
Summary
Fixes JSON parsing errors during code extraction by enabling JSON mode for Ollama provider and adds missing migration 010 (page metadata table) to complete database setup script.
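The provider gate this fix extends can be sketched as follows. The set contents and function name are assumptions for illustration, not the exact code in `code_storage_service.py`.

```python
# Providers assumed to accept OpenAI-style JSON mode; "ollama" is the addition from this PR.
JSON_MODE_PROVIDERS = {"openai", "google", "anthropic", "ollama"}


def prepare_completion_params(provider: str, messages: list[dict]) -> dict:
    """Attach response_format only for providers that support JSON mode."""
    params: dict = {"messages": messages}
    if provider.lower() in JSON_MODE_PROVIDERS:
        # Forces JSON-only output, so models like qwen3:30b skip <think> preambles
        params["response_format"] = {"type": "json_object"}
    return params
```

For providers outside the set, the request goes out without `response_format` and the `<think>`-tag fallback described below handles any reasoning-style responses.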
Changes Made
- Enabled JSON mode for the Ollama provider (`response_format: {"type": "json_object"}`) to force JSON-only responses
- Added `_is_reasoning_text_response()` to detect XML-style `<think>` tags from extended thinking models
- Added the missing migration 010 to `complete_setup.sql` (archon_page_metadata table with foreign keys, indexes, and RLS policies)
- Added the missing migration tracking entry (`009_add_provider_placeholders`) to complete_setup.sql

Type of Change
Affected Services
Testing
Test Evidence
Checklist
Breaking Changes
None. These changes are backward compatible.
For fresh installations:
For existing installations:
Run `migration/0.1.0/010_add_page_metadata_table.sql` OR re-run `complete_setup.sql`

Additional Notes
Bug Context
The Ollama provider was not included in the list of providers that support JSON mode (`response_format: {"type": "json_object"}`). This caused qwen3:30b and other Ollama models to respond with extended thinking format using `<think>` XML tags, which failed JSON parsing during code extraction.

Root Cause:
- `code_storage_service.py` only checked for OpenAI, Google, and Anthropic

Fix Strategy:
"ollama"tosupports_response_format_base(line 665)<think>tag detection as fallback (line 85-87)Migration 010 Context:
Migration 010 adds the `archon_page_metadata` table for page-based RAG retrieval, which stores complete documentation pages alongside chunks for improved agent context. This was missing from complete_setup.sql, causing schema mismatches for fresh installations.

Files Changed:
- python/src/server/services/storage/code_storage_service.py (2 changes)
- migration/complete_setup.sql (3 additions: table creation, RLS policies, migration tracking)

After Restart:
Code extraction will now receive clean JSON responses from Ollama models without parsing errors.
Summary by CodeRabbit
New Features
Improvements
Chores