Skip to content

Fixes: crawl code storage issue with <think> tags for ollama models.#775

Merged
coleam00 merged 2 commits intomainfrom
bug/code-storage
Oct 10, 2025
Merged

Fixes: crawl code storage issue with <think> tags for ollama models.#775
coleam00 merged 2 commits intomainfrom
bug/code-storage

Conversation

@sean-esk
Copy link
Copy Markdown
Collaborator

@sean-esk sean-esk commented Oct 10, 2025

Pull Request

Summary

Fixes JSON parsing errors during code extraction by enabling JSON mode for Ollama provider and adds missing migration 010 (page metadata table) to complete database setup script.

Changes Made

  • Added Ollama to the list of providers that support response_format: {"type": "json_object"} to force JSON-only responses
  • Enhanced _is_reasoning_text_response() to detect XML-style <think> tags from extended thinking models
  • Added complete migration 010 content to complete_setup.sql (archon_page_metadata table with foreign keys, indexes, and RLS policies)
  • Added migration 009b tracking entry (009_add_provider_placeholders) to complete_setup.sql
  • Added RLS policies for archon_page_metadata table in complete_setup.sql

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Affected Services

  • Frontend (React UI)
  • Server (FastAPI backend)
  • MCP Server (Model Context Protocol)
  • Agents (PydanticAI service)
  • Database (migrations/schema)
  • Docker/Infrastructure
  • Documentation site

Testing

  • All existing tests pass
  • Added new tests for new functionality
  • Manually tested affected user flows
  • Docker builds succeed for all services

Test Evidence

# Verified Ollama JSON mode fix by crawling with qwen3:30b
# Before fix: "Failed to parse JSON response from LLM (non-strict attempt). Error: Expecting property name..."
# After fix: Clean JSON responses with no <think> tags

# Verified migration 010 is present in complete_setup.sql
grep -n "archon_page_metadata" migration/complete_setup.sql
# Output shows table creation at line 970, indexes at 1001-1006, and RLS at line 765

# Verified migration tracking entries
grep -n "010_add_page_metadata_table\|009_add_provider_placeholders" migration/complete_setup.sql
# Output: 1059: ('0.1.0', '009_add_provider_placeholders')
#         1060: ('0.1.0', '010_add_page_metadata_table')

Checklist

  • My code follows the service architecture patterns
  • If using an AI coding assistant, I used the CLAUDE.md rules
  • I have added tests that prove my fix/feature works
  • All new and existing tests pass locally
  • My changes generate no new warnings
  • I have updated relevant documentation
  • I have verified no regressions in existing features

Breaking Changes

None. These changes are backward compatible.

For fresh installations:

  • No action needed - complete_setup.sql now includes migration 010

For existing installations:

  • If you've already run migration 010 individually, no action needed
  • If not, either run migration/0.1.0/010_add_page_metadata_table.sql OR re-run complete_setup.sql

Additional Notes

Bug Context

The Ollama provider was not included in the list of providers that support JSON mode (response_format: {"type": "json_object"}). This caused qwen3:30b and other Ollama models to respond with extended thinking format using <think> XML tags, which failed JSON parsing during code extraction.

Root Cause:

  • Line 664-666 in code_storage_service.py only checked for OpenAI, Google, and Anthropic
  • Ollama DOES support JSON mode but wasn't being told to use it
  • Models like qwen3:30b default to extended thinking mode when JSON is not enforced

Fix Strategy:

  1. Primary fix: Added "ollama" to supports_response_format_base (line 665)
  2. Defensive fix: Enhanced <think> tag detection as fallback (line 85-87)

Migration 010 Context:
Migration 010 adds the archon_page_metadata table for page-based RAG retrieval, which stores complete documentation pages alongside chunks for improved agent context. This was missing from complete_setup.sql, causing schema mismatches for fresh installations.

Files Changed:

  • python/src/server/services/storage/code_storage_service.py (2 changes)
  • migration/complete_setup.sql (3 additions: table creation, RLS policies, migration tracking)

After Restart:
Code extraction will now receive clean JSON responses from Ollama models without parsing errors.

Summary by CodeRabbit

  • New Features

    • Page-level metadata added (content, sections, word/char/chunk counts) with public read access.
  • Improvements

    • Crawled pages linked to page metadata for richer context and navigation.
    • Added indexes for faster metadata lookups.
    • Better detection and handling of hidden “thinking” content for cleaner outputs.
    • Broader provider compatibility and more reliable, faster batch code-example summaries.
  • Chores

    • Migration tracking updated to include the new metadata table.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Oct 10, 2025

Walkthrough

Adds a new public table archon_page_metadata with indexes, comments, RLS and a public SELECT policy; links archon_crawled_pages via a nullable page_id FK (ON DELETE SET NULL); updates migration records. Refactors code storage summarization to detect XML <think> tags and expands provider/LLM client handling (including Ollama, Grok, OpenRouter).

Changes

Cohort / File(s) Summary
Database migration: page metadata & linkage
migration/complete_setup.sql
Adds archon_page_metadata table (UUID PK, source_id, url, full_content, section fields, word/char/chunk counts, timestamps, metadata JSONB), UNIQUE(url), FK to archon_sources (ON DELETE CASCADE), indexes, comments, enables RLS and creates public SELECT policy; adds nullable page_id UUID FK to archon_crawled_pages (REFERENCES archon_page_metadata(id) ON DELETE SET NULL) and index; updates archon_migrations entries and migration sections.
Code service: reasoning detection & summary generation refactor
python/src/server/services/storage/code_storage_service.py
_is_reasoning_text_response now detects XML-style <think> tags (start or within first 100 chars). Large refactor of async code example summarization: new optional client parameter and helper _generate_summary_with_client, shared LLM client reuse in batch processing, provider-specific handling paths (Grok, Ollama, OpenRouter, OpenAI, etc.), stricter JSON extraction/validation and fallbacks, enriched logging, batching and mapping of results, and improved error handling and deduplication logic.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Crawler
  participant DB as Database
  participant PM as archon_page_metadata
  participant CP as archon_crawled_pages

  C->>DB: INSERT INTO `archon_page_metadata` (url, content, section, counts, metadata)
  DB->>PM: Enforce UNIQUE(url), indexes, comments, RLS policy
  C->>DB: INSERT/UPDATE `archon_crawled_pages` with `page_id` -> PM.id
  DB->>CP: Maintain FK (ON DELETE SET NULL)
  Note over PM,CP: New nullable FK links crawled pages to page-level metadata
Loading
sequenceDiagram
  autonumber
  participant S as CodeStorageService
  participant P as ProviderSelector
  participant C as SharedLLMClient
  participant M as ModelAPI

  S->>S: _is_reasoning_text_response(text)\n- detects "<think>" early
  S->>P: choose provider path (Grok, Ollama, OpenRouter, OpenAI, etc.)
  alt Shared client available
    S->>C: use shared client
    C->>M: request with provider-specific response_format
  else No shared client
    S->>M: create client per-call and request
  end
  M-->>S: model response (JSON/text)
  S->>S: validate/sanitize JSON, apply fallbacks, aggregate summaries
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I thump my paws on fields of bytes,
New pages nested in moonlit nights.
A "" twitches—caught in view,
Ollama hums and Grok peeks through.
FK burrows link the pile just right—hop, migrate, and write! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The pull request title is a concise, single sentence that clearly summarizes the primary bug fix regarding <think> tag handling for Ollama models without extraneous details.
Description Check ✅ Passed The pull request description follows the repository’s template with all required sections present and adequately filled in, including a summary, detailed changes made, type of change, affected services, testing steps with evidence, checklist, breaking changes, and additional notes.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch bug/code-storage

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sean-esk sean-esk requested a review from coleam00 October 10, 2025 03:33
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ad1fb0 and 81c0f49.

📒 Files selected for processing (2)
  • migration/complete_setup.sql (4 hunks)
  • python/src/server/services/storage/code_storage_service.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
python/src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

python/src/**/*.py: Fail fast and loud on startup/config/auth/database/critical dependency failures and validation errors in backend services
Preserve full stack traces with logging.exc_info=True in Python logging
Use specific exception types and avoid catching generic Exception
Use Pydantic to raise on data validation errors; never silently accept bad data
Python 3.12 code should conform to a 120-character line length
Never return None to indicate failure in backend Python; raise exceptions with details instead

Files:

  • python/src/server/services/storage/code_storage_service.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept corrupted data in backend flows; skip failed items entirely instead of storing bad data (e.g., zero embeddings, null FKs, malformed JSON)

Files:

  • python/src/server/services/storage/code_storage_service.py
python/src/server/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/services/**/*.py: For batch/background/optional operations, continue processing but log detailed per-item failures; report success counts and failure lists
Implement business logic in service modules, keeping database concerns separate
Place service logic under python/src/server/services/

Files:

  • python/src/server/services/storage/code_storage_service.py
🔇 Additional comments (4)
python/src/server/services/storage/code_storage_service.py (1)

85-87: LGTM!

The XML-style <think> tag detection provides a defensive fallback for extended-thinking models. The case-insensitive check within the first 100 characters is reasonable, and false positives are handled gracefully by the downstream reasoning extraction logic.

migration/complete_setup.sql (3)

970-1027: LGTM!

The archon_page_metadata table schema is well-designed with:

  • Proper primary key, foreign keys, and constraints
  • Appropriate nullable/not-null column specifications
  • Comprehensive indexing strategy including GIN index for JSONB metadata
  • Nullable page_id foreign key on archon_crawled_pages correctly handles existing data
  • Thorough column documentation

786-790: Policy looks correct; ensure table creation order is fixed.

The RLS policy for archon_page_metadata follows the correct pattern and grants appropriate public read access. However, this policy depends on the table creation order issue at line 765 being resolved.


1065-1067: Migration tracking entries look correct.

The migration tracking entries are properly formatted with appropriate version numbers and descriptive names. Note that there are two migration entries numbered "009" (cascade_delete and provider_placeholders), which is mentioned in the PR description. While unusual, this is not an issue for a complete setup script where execution order is managed by the SQL structure itself.

Comment thread migration/complete_setup.sql Outdated
Comment thread python/src/server/services/storage/code_storage_service.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81c0f49 and ab853b7.

📒 Files selected for processing (2)
  • migration/complete_setup.sql (3 hunks)
  • python/src/server/services/storage/code_storage_service.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
python/src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

python/src/**/*.py: Fail fast and loud on startup/config/auth/database/critical dependency failures and validation errors in backend services
Preserve full stack traces with logging.exc_info=True in Python logging
Use specific exception types and avoid catching generic Exception
Use Pydantic to raise on data validation errors; never silently accept bad data
Python 3.12 code should conform to a 120-character line length
Never return None to indicate failure in backend Python; raise exceptions with details instead

Files:

  • python/src/server/services/storage/code_storage_service.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept corrupted data in backend flows; skip failed items entirely instead of storing bad data (e.g., zero embeddings, null FKs, malformed JSON)

Files:

  • python/src/server/services/storage/code_storage_service.py
python/src/server/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/services/**/*.py: For batch/background/optional operations, continue processing but log detailed per-item failures; report success counts and failure lists
Implement business logic in service modules, keeping database concerns separate
Place service logic under python/src/server/services/

Files:

  • python/src/server/services/storage/code_storage_service.py
🧬 Code graph analysis (1)
python/src/server/services/storage/code_storage_service.py (1)
python/src/server/services/llm_provider_service.py (4)
  • get_llm_client (313-548)
  • prepare_chat_completion_params (1095-1132)
  • extract_message_text (885-922)
  • synthesize_json_from_reasoning (979-1092)

Comment on lines +785 to +789
CREATE POLICY "Allow public read access to archon_page_metadata"
ON archon_page_metadata
FOR SELECT
TO public
USING (true);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Move archon_page_metadata policy after the table is created

This CREATE POLICY runs before archon_page_metadata exists (the table is only created later in Section 6.5), so complete_setup.sql will fail on clean installs. Please relocate the policy definition to immediately after the table creation/RLS enablement in Section 6.5.

-CREATE POLICY "Allow public read access to archon_page_metadata"
-  ON archon_page_metadata
-  FOR SELECT
-  TO public
-  USING (true);
...
 -- Enable RLS on archon_page_metadata
 ALTER TABLE archon_page_metadata ENABLE ROW LEVEL SECURITY;
+
+CREATE POLICY "Allow public read access to archon_page_metadata"
+  ON archon_page_metadata
+  FOR SELECT
+  TO public
+  USING (true);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
CREATE POLICY "Allow public read access to archon_page_metadata"
ON archon_page_metadata
FOR SELECT
TO public
USING (true);
-CREATE POLICY "Allow public read access to archon_page_metadata"
- ON archon_page_metadata
- FOR SELECT
- TO public
- USING (true);
-- Enable RLS on archon_page_metadata
ALTER TABLE archon_page_metadata ENABLE ROW LEVEL SECURITY;
CREATE POLICY "Allow public read access to archon_page_metadata"
ON archon_page_metadata
FOR SELECT
TO public
USING (true);
🤖 Prompt for AI Agents
In migration/complete_setup.sql around lines 785 to 789, the CREATE POLICY for
archon_page_metadata is defined before the archon_page_metadata table exists
which causes failures on clean installs; move this CREATE POLICY statement to
immediately after the archon_page_metadata table creation and its RLS (row-level
security) enablement in Section 6.5 so the table exists when the policy is
created, ensuring ordering: create table -> enable RLS -> CREATE POLICY.

@coleam00
Copy link
Copy Markdown
Owner

This looks good @sean-eskerium! Going to merge this now and do some testing on main

@coleam00 coleam00 merged commit 7c3823e into main Oct 10, 2025
12 checks passed
leonj1 pushed a commit to leonj1/Archon that referenced this pull request Oct 13, 2025
…oleam00#775)

* Fixes: crawl code storage issue with <think> tags for ollama models.

* updates from code rabbit review
@Wirasm Wirasm deleted the bug/code-storage branch April 6, 2026 07:37
coleam00 pushed a commit that referenced this pull request Apr 7, 2026
…AG nodes (#775)

* feat(core): bump Codex SDK to 0.116.0, enable structured output for DAG nodes

Bump @openai/codex-sdk from ^0.104.0 to ^0.116.0 and wire up the new
TurnOptions.outputSchema support so Codex DAG nodes can use output_format
for structured JSON responses — previously warn-and-skipped as Claude-only.

- Replace custom CodexThreadOptions with SDK's ThreadOptions type
- Pass outputSchema and abort signal via TurnOptions to runStreamed()
- Simplify extractUsageFromCodexEvent to use typed SDK Usage fields
- Remove output_format warn-and-skip for Codex in DAG executor
- Add outputFormat to Codex node options in resolveNodeProviderAndModel
- Suppress false-positive structured output warning for Codex nodes
  (Codex returns structured output inline in agent_message text)
- Update type comments to reflect both Claude and Codex support
- Add tests for outputSchema passthrough, signal passthrough, and
  Codex DAG node output_format with downstream condition evaluation

* fix: address review findings for Codex SDK bump

- Guard extractUsageFromCodexEvent against null usage (warn + zero fallback)
- Validate Codex structured output is JSON, warn user if not
- Use if/assign instead of conditional spreads for TurnOptions and Codex
  options (consistency with Claude path)
- Remove eslint-disable comments in tests, use chunks.push() pattern
- Update docs: fix 6 stale "Claude only" references in
  docs/authoring-workflows.md and CLAUDE.md
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…AG nodes (coleam00#775)

* feat(core): bump Codex SDK to 0.116.0, enable structured output for DAG nodes

Bump @openai/codex-sdk from ^0.104.0 to ^0.116.0 and wire up the new
TurnOptions.outputSchema support so Codex DAG nodes can use output_format
for structured JSON responses — previously warn-and-skipped as Claude-only.

- Replace custom CodexThreadOptions with SDK's ThreadOptions type
- Pass outputSchema and abort signal via TurnOptions to runStreamed()
- Simplify extractUsageFromCodexEvent to use typed SDK Usage fields
- Remove output_format warn-and-skip for Codex in DAG executor
- Add outputFormat to Codex node options in resolveNodeProviderAndModel
- Suppress false-positive structured output warning for Codex nodes
  (Codex returns structured output inline in agent_message text)
- Update type comments to reflect both Claude and Codex support
- Add tests for outputSchema passthrough, signal passthrough, and
  Codex DAG node output_format with downstream condition evaluation

* fix: address review findings for Codex SDK bump

- Guard extractUsageFromCodexEvent against null usage (warn + zero fallback)
- Validate Codex structured output is JSON, warn user if not
- Use if/assign instead of conditional spreads for TurnOptions and Codex
  options (consistency with Claude path)
- Remove eslint-disable comments in tests, use chunks.push() pattern
- Update docs: fix 6 stale "Claude only" references in
  docs/authoring-workflows.md and CLAUDE.md
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…AG nodes (coleam00#775)

* feat(core): bump Codex SDK to 0.116.0, enable structured output for DAG nodes

Bump @openai/codex-sdk from ^0.104.0 to ^0.116.0 and wire up the new
TurnOptions.outputSchema support so Codex DAG nodes can use output_format
for structured JSON responses — previously warn-and-skipped as Claude-only.

- Replace custom CodexThreadOptions with SDK's ThreadOptions type
- Pass outputSchema and abort signal via TurnOptions to runStreamed()
- Simplify extractUsageFromCodexEvent to use typed SDK Usage fields
- Remove output_format warn-and-skip for Codex in DAG executor
- Add outputFormat to Codex node options in resolveNodeProviderAndModel
- Suppress false-positive structured output warning for Codex nodes
  (Codex returns structured output inline in agent_message text)
- Update type comments to reflect both Claude and Codex support
- Add tests for outputSchema passthrough, signal passthrough, and
  Codex DAG node output_format with downstream condition evaluation

* fix: address review findings for Codex SDK bump

- Guard extractUsageFromCodexEvent against null usage (warn + zero fallback)
- Validate Codex structured output is JSON, warn user if not
- Use if/assign instead of conditional spreads for TurnOptions and Codex
  options (consistency with Claude path)
- Remove eslint-disable comments in tests, use chunks.push() pattern
- Update docs: fix 6 stale "Claude only" references in
  docs/authoring-workflows.md and CLAUDE.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants