Fixes: crawl code storage issue with <think> tags for ollama models. by sean-esk · Pull Request #775 · coleam00/Archon

sean-esk · 2025-10-10T03:33:09Z

Pull Request

Summary

Fixes JSON parsing errors during code extraction by enabling JSON mode for Ollama provider and adds missing migration 010 (page metadata table) to complete database setup script.

Changes Made

Added Ollama to the list of providers that support response_format: {"type": "json_object"} to force JSON-only responses
Enhanced _is_reasoning_text_response() to detect XML-style <think> tags from extended thinking models
Added complete migration 010 content to complete_setup.sql (archon_page_metadata table with foreign keys, indexes, and RLS policies)
Added migration 009b tracking entry (009_add_provider_placeholders) to complete_setup.sql
Added RLS policies for archon_page_metadata table in complete_setup.sql

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Performance improvement
Code refactoring

Affected Services

Testing

All existing tests pass
Added new tests for new functionality
Manually tested affected user flows
Docker builds succeed for all services

Test Evidence

# Verified Ollama JSON mode fix by crawling with qwen3:30b
# Before fix: "Failed to parse JSON response from LLM (non-strict attempt). Error: Expecting property name..."
# After fix: Clean JSON responses with no <think> tags

# Verified migration 010 is present in complete_setup.sql
grep -n "archon_page_metadata" migration/complete_setup.sql
# Output shows table creation at line 970, indexes at 1001-1006, and RLS at line 765

# Verified migration tracking entries
grep -n "010_add_page_metadata_table\|009_add_provider_placeholders" migration/complete_setup.sql
# Output: 1059: ('0.1.0', '009_add_provider_placeholders')
#         1060: ('0.1.0', '010_add_page_metadata_table')

Checklist

My code follows the service architecture patterns
If using an AI coding assistant, I used the CLAUDE.md rules
I have added tests that prove my fix/feature works
All new and existing tests pass locally
My changes generate no new warnings
I have updated relevant documentation
I have verified no regressions in existing features

Breaking Changes

None. These changes are backward compatible.

For fresh installations:

No action needed - complete_setup.sql now includes migration 010

For existing installations:

If you've already run migration 010 individually, no action needed
If not, either run migration/0.1.0/010_add_page_metadata_table.sql OR re-run complete_setup.sql

Additional Notes

Bug Context

The Ollama provider was not included in the list of providers that support JSON mode (response_format: {"type": "json_object"}). This caused qwen3:30b and other Ollama models to respond with extended thinking format using <think> XML tags, which failed JSON parsing during code extraction.

Root Cause:

Line 664-666 in code_storage_service.py only checked for OpenAI, Google, and Anthropic
Ollama DOES support JSON mode but wasn't being told to use it
Models like qwen3:30b default to extended thinking mode when JSON is not enforced

Fix Strategy:

Primary fix: Added "ollama" to supports_response_format_base (line 665)
Defensive fix: Enhanced <think> tag detection as fallback (line 85-87)

Migration 010 Context:
Migration 010 adds the archon_page_metadata table for page-based RAG retrieval, which stores complete documentation pages alongside chunks for improved agent context. This was missing from complete_setup.sql, causing schema mismatches for fresh installations.

Files Changed:

python/src/server/services/storage/code_storage_service.py (2 changes)
migration/complete_setup.sql (3 additions: table creation, RLS policies, migration tracking)

After Restart:
Code extraction will now receive clean JSON responses from Ollama models without parsing errors.

Summary by CodeRabbit

New Features
- Page-level metadata added (content, sections, word/char/chunk counts) with public read access.
Improvements
- Crawled pages linked to page metadata for richer context and navigation.
- Added indexes for faster metadata lookups.
- Better detection and handling of hidden “thinking” content for cleaner outputs.
- Broader provider compatibility and more reliable, faster batch code-example summaries.
Chores
- Migration tracking updated to include the new metadata table.

coderabbitai · 2025-10-10T03:33:17Z

Walkthrough

Adds a new public table archon_page_metadata with indexes, comments, RLS and a public SELECT policy; links archon_crawled_pages via a nullable page_id FK (ON DELETE SET NULL); updates migration records. Refactors code storage summarization to detect XML <think> tags and expands provider/LLM client handling (including Ollama, Grok, OpenRouter).

Changes

Cohort / File(s)	Summary
Database migration: page metadata & linkage `migration/complete_setup.sql`	Adds `archon_page_metadata` table (UUID PK, `source_id`, `url`, `full_content`, section fields, word/char/chunk counts, timestamps, `metadata` JSONB), UNIQUE(`url`), FK to `archon_sources` (ON DELETE CASCADE), indexes, comments, enables RLS and creates public SELECT policy; adds nullable `page_id` UUID FK to `archon_crawled_pages` (REFERENCES `archon_page_metadata(id)` ON DELETE SET NULL) and index; updates `archon_migrations` entries and migration sections.
Code service: reasoning detection & summary generation refactor `python/src/server/services/storage/code_storage_service.py`	`_is_reasoning_text_response` now detects XML-style `<think>` tags (start or within first 100 chars). Large refactor of async code example summarization: new optional client parameter and helper `_generate_summary_with_client`, shared LLM client reuse in batch processing, provider-specific handling paths (Grok, Ollama, OpenRouter, OpenAI, etc.), stricter JSON extraction/validation and fallbacks, enriched logging, batching and mapping of results, and improved error handling and deduplication logic.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Crawler
  participant DB as Database
  participant PM as archon_page_metadata
  participant CP as archon_crawled_pages

  C->>DB: INSERT INTO `archon_page_metadata` (url, content, section, counts, metadata)
  DB->>PM: Enforce UNIQUE(url), indexes, comments, RLS policy
  C->>DB: INSERT/UPDATE `archon_crawled_pages` with `page_id` -> PM.id
  DB->>CP: Maintain FK (ON DELETE SET NULL)
  Note over PM,CP: New nullable FK links crawled pages to page-level metadata

sequenceDiagram
  autonumber
  participant S as CodeStorageService
  participant P as ProviderSelector
  participant C as SharedLLMClient
  participant M as ModelAPI

  S->>S: _is_reasoning_text_response(text)\n- detects "<think>" early
  S->>P: choose provider path (Grok, Ollama, OpenRouter, OpenAI, etc.)
  alt Shared client available
    S->>C: use shared client
    C->>M: request with provider-specific response_format
  else No shared client
    S->>M: create client per-call and request
  end
  M-->>S: model response (JSON/text)
  S->>S: validate/sanitize JSON, apply fallbacks, aggregate summaries

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I thump my paws on fields of bytes,
New pages nested in moonlit nights.
A "" twitches—caught in view,
Ollama hums and Grok peeks through.
FK burrows link the pile just right—hop, migrate, and write! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The pull request title is a concise, single sentence that clearly summarizes the primary bug fix regarding `<think>` tag handling for Ollama models without extraneous details.
Description Check	✅ Passed	The pull request description follows the repository’s template with all required sections present and adequately filled in, including a summary, detailed changes made, type of change, affected services, testing steps with evidence, checklist, breaking changes, and additional notes.
Docstring Coverage	✅ Passed	Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch bug/code-storage

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ad1fb0 and 81c0f49.

📒 Files selected for processing (2)

migration/complete_setup.sql (4 hunks)
python/src/server/services/storage/code_storage_service.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

python/src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

python/src/**/*.py: Fail fast and loud on startup/config/auth/database/critical dependency failures and validation errors in backend services
Preserve full stack traces with logging.exc_info=True in Python logging
Use specific exception types and avoid catching generic Exception
Use Pydantic to raise on data validation errors; never silently accept bad data
Python 3.12 code should conform to a 120-character line length
Never return None to indicate failure in backend Python; raise exceptions with details instead

Files:

python/src/server/services/storage/code_storage_service.py

python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept corrupted data in backend flows; skip failed items entirely instead of storing bad data (e.g., zero embeddings, null FKs, malformed JSON)

Files:

python/src/server/services/storage/code_storage_service.py

python/src/server/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/services/**/*.py: For batch/background/optional operations, continue processing but log detailed per-item failures; report success counts and failure lists
Implement business logic in service modules, keeping database concerns separate
Place service logic under python/src/server/services/

Files:

python/src/server/services/storage/code_storage_service.py

🔇 Additional comments (4)

python/src/server/services/storage/code_storage_service.py (1)

85-87: LGTM!

The XML-style <think> tag detection provides a defensive fallback for extended-thinking models. The case-insensitive check within the first 100 characters is reasonable, and false positives are handled gracefully by the downstream reasoning extraction logic.

migration/complete_setup.sql (3)

970-1027: LGTM!

The archon_page_metadata table schema is well-designed with:

Proper primary key, foreign keys, and constraints

Appropriate nullable/not-null column specifications

Comprehensive indexing strategy including GIN index for JSONB metadata

Nullable page_id foreign key on archon_crawled_pages correctly handles existing data

Thorough column documentation

786-790: Policy looks correct; ensure table creation order is fixed.

The RLS policy for archon_page_metadata follows the correct pattern and grants appropriate public read access. However, this policy depends on the table creation order issue at line 765 being resolved.

1065-1067: Migration tracking entries look correct.

The migration tracking entries are properly formatted with appropriate version numbers and descriptive names. Note that there are two migration entries numbered "009" (cascade_delete and provider_placeholders), which is mentioned in the PR description. While unusual, this is not an issue for a complete setup script where execution order is managed by the SQL structure itself.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81c0f49 and ab853b7.

📒 Files selected for processing (2)

migration/complete_setup.sql (3 hunks)
python/src/server/services/storage/code_storage_service.py (6 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

python/src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

python/src/**/*.py: Fail fast and loud on startup/config/auth/database/critical dependency failures and validation errors in backend services
Preserve full stack traces with logging.exc_info=True in Python logging
Use specific exception types and avoid catching generic Exception
Use Pydantic to raise on data validation errors; never silently accept bad data
Python 3.12 code should conform to a 120-character line length
Never return None to indicate failure in backend Python; raise exceptions with details instead

Files:

python/src/server/services/storage/code_storage_service.py

python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept corrupted data in backend flows; skip failed items entirely instead of storing bad data (e.g., zero embeddings, null FKs, malformed JSON)

Files:

python/src/server/services/storage/code_storage_service.py

python/src/server/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/services/**/*.py: For batch/background/optional operations, continue processing but log detailed per-item failures; report success counts and failure lists
Implement business logic in service modules, keeping database concerns separate
Place service logic under python/src/server/services/

Files:

python/src/server/services/storage/code_storage_service.py

🧬 Code graph analysis (1)

python/src/server/services/storage/code_storage_service.py (1)

python/src/server/services/llm_provider_service.py (4)

get_llm_client (313-548)

prepare_chat_completion_params (1095-1132)

extract_message_text (885-922)

synthesize_json_from_reasoning (979-1092)

coderabbitai · 2025-10-10T19:51:50Z

+CREATE POLICY "Allow public read access to archon_page_metadata"
+  ON archon_page_metadata
+  FOR SELECT
+  TO public
+  USING (true);


⚠️ Potential issue | 🔴 Critical

Move archon_page_metadata policy after the table is created

This CREATE POLICY runs before archon_page_metadata exists (the table is only created later in Section 6.5), so complete_setup.sql will fail on clean installs. Please relocate the policy definition to immediately after the table creation/RLS enablement in Section 6.5.

-CREATE POLICY "Allow public read access to archon_page_metadata" - ON archon_page_metadata - FOR SELECT - TO public - USING (true); ... -- Enable RLS on archon_page_metadata ALTER TABLE archon_page_metadata ENABLE ROW LEVEL SECURITY; + +CREATE POLICY "Allow public read access to archon_page_metadata" + ON archon_page_metadata + FOR SELECT + TO public + USING (true);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

CREATE POLICY "Allow public read access to archon_page_metadata"

ON archon_page_metadata

FOR SELECT

TO public

USING (true);

-CREATE POLICY "Allow public read access to archon_page_metadata"

- ON archon_page_metadata

- FOR SELECT

- TO public

- USING (true);

-- Enable RLS on archon_page_metadata

ALTER TABLE archon_page_metadata ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Allow public read access to archon_page_metadata"

ON archon_page_metadata

FOR SELECT

TO public

USING (true);

🤖 Prompt for AI Agents

In migration/complete_setup.sql around lines 785 to 789, the CREATE POLICY for archon_page_metadata is defined before the archon_page_metadata table exists which causes failures on clean installs; move this CREATE POLICY statement to immediately after the archon_page_metadata table creation and its RLS (row-level security) enablement in Section 6.5 so the table exists when the policy is created, ensuring ordering: create table -> enable RLS -> CREATE POLICY.

coleam00 · 2025-10-10T22:09:51Z

This looks good @sean-eskerium! Going to merge this now and do some testing on main

…oleam00#775) * Fixes: crawl code storage issue with <think> tags for ollama models. * updates from code rabbit review

…AG nodes (#775) * feat(core): bump Codex SDK to 0.116.0, enable structured output for DAG nodes Bump @openai/codex-sdk from ^0.104.0 to ^0.116.0 and wire up the new TurnOptions.outputSchema support so Codex DAG nodes can use output_format for structured JSON responses — previously warn-and-skipped as Claude-only. - Replace custom CodexThreadOptions with SDK's ThreadOptions type - Pass outputSchema and abort signal via TurnOptions to runStreamed() - Simplify extractUsageFromCodexEvent to use typed SDK Usage fields - Remove output_format warn-and-skip for Codex in DAG executor - Add outputFormat to Codex node options in resolveNodeProviderAndModel - Suppress false-positive structured output warning for Codex nodes (Codex returns structured output inline in agent_message text) - Update type comments to reflect both Claude and Codex support - Add tests for outputSchema passthrough, signal passthrough, and Codex DAG node output_format with downstream condition evaluation * fix: address review findings for Codex SDK bump - Guard extractUsageFromCodexEvent against null usage (warn + zero fallback) - Validate Codex structured output is JSON, warn user if not - Use if/assign instead of conditional spreads for TurnOptions and Codex options (consistency with Claude path) - Remove eslint-disable comments in tests, use chunks.push() pattern - Update docs: fix 6 stale "Claude only" references in docs/authoring-workflows.md and CLAUDE.md

…AG nodes (coleam00#775) * feat(core): bump Codex SDK to 0.116.0, enable structured output for DAG nodes Bump @openai/codex-sdk from ^0.104.0 to ^0.116.0 and wire up the new TurnOptions.outputSchema support so Codex DAG nodes can use output_format for structured JSON responses — previously warn-and-skipped as Claude-only. - Replace custom CodexThreadOptions with SDK's ThreadOptions type - Pass outputSchema and abort signal via TurnOptions to runStreamed() - Simplify extractUsageFromCodexEvent to use typed SDK Usage fields - Remove output_format warn-and-skip for Codex in DAG executor - Add outputFormat to Codex node options in resolveNodeProviderAndModel - Suppress false-positive structured output warning for Codex nodes (Codex returns structured output inline in agent_message text) - Update type comments to reflect both Claude and Codex support - Add tests for outputSchema passthrough, signal passthrough, and Codex DAG node output_format with downstream condition evaluation * fix: address review findings for Codex SDK bump - Guard extractUsageFromCodexEvent against null usage (warn + zero fallback) - Validate Codex structured output is JSON, warn user if not - Use if/assign instead of conditional spreads for TurnOptions and Codex options (consistency with Claude path) - Remove eslint-disable comments in tests, use chunks.push() pattern - Update docs: fix 6 stale "Claude only" references in docs/authoring-workflows.md and CLAUDE.md

Fixes: crawl code storage issue with <think> tags for ollama models.

81c0f49

sean-esk requested a review from coleam00 October 10, 2025 03:33

coderabbitai Bot reviewed Oct 10, 2025

View reviewed changes

Comment thread migration/complete_setup.sql Outdated

Comment thread python/src/server/services/storage/code_storage_service.py Outdated

updates from code rabbit review

ab853b7

coderabbitai Bot reviewed Oct 10, 2025

View reviewed changes

coleam00 merged commit 7c3823e into main Oct 10, 2025
12 checks passed

leonj1 pushed a commit to leonj1/Archon that referenced this pull request Oct 13, 2025

Fixes: crawl code storage issue with <think> tags for ollama models. (c…

799c9cb

…oleam00#775) * Fixes: crawl code storage issue with <think> tags for ollama models. * updates from code rabbit review

coderabbitai Bot mentioned this pull request Feb 22, 2026

Add crawl checkpoint/resume infrastructure #936

Closed

Wirasm deleted the bug/code-storage branch April 6, 2026 07:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes: crawl code storage issue with <think> tags for ollama models.#775

Fixes: crawl code storage issue with <think> tags for ollama models.#775
coleam00 merged 2 commits intomainfrom
bug/code-storage

sean-esk commented Oct 10, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Oct 10, 2025 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Oct 10, 2025

Uh oh!

coleam00 commented Oct 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sean-esk commented Oct 10, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request

Summary

Changes Made

Type of Change

Affected Services

Testing

Test Evidence

Checklist

Breaking Changes

Additional Notes

Bug Context

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 10, 2025

Choose a reason for hiding this comment

Uh oh!

coleam00 commented Oct 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sean-esk commented Oct 10, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Oct 10, 2025 •

edited

Loading