Fix/(Recrawl) archon crawled pages url chunk number key (409) error#378

Closed
Chillbruhhh wants to merge 8 commits into coleam00:main from
Chillbruhhh:FIX/archon_crawled_pages_url_chunk_number_key-(409)-error

Conversation

@Chillbruhhh Chillbruhhh commented Aug 20, 2025

Pull Request

Summary

Changes Made

  • Changed the Supabase insert to an upsert; I noticed this issue when recrawling documents.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Affected Services

  • Server (FastAPI backend)
  • Database (migrations/schema)

Testing

  • All existing tests pass
  • Manually tested affected user flows

Test Evidence

# Example: python -m pytest tests/
# Example: cd archon-ui-main && npm run test

# Recrawl the knowledge base, then check the server logs

Checklist

  • My code follows the service architecture patterns
  • I have added tests that prove my fix/feature works
  • All new and existing tests pass locally
  • My changes generate no new warnings
  • I have updated relevant documentation
  • I have verified no regressions in existing features

Breaking Changes

Additional Notes

Summary by CodeRabbit

  • Bug Fixes
    • Made document ingestion idempotent by upserting records based on URL and chunk number, preventing duplicate-key errors and ensuring existing entries are updated.
    • Improves stability of batch uploads and retry/partial re-runs, reducing failed jobs and user-visible errors during repeated crawls or reprocessing.
    • Minimizes unnecessary alerts from conflict errors, resulting in a smoother ingestion experience.

@Chillbruhhh Chillbruhhh changed the title Fix/archon crawled pages url chunk number key (409) error Fix/(Recrawl) archon crawled pages url chunk number key (409) error Aug 20, 2025
@coleam00
Owner

@Chillbruhhh Thanks for this! Using your PR to test CodeRabbit for the first time here. Nice since it's a simple change too.

@coderabbitai Review this PR

coderabbitai Bot commented Aug 20, 2025

@coleam00 I'll review this PR for you! It's great that you're testing CodeRabbit with this straightforward bug fix. Let me examine the changes to ensure the switch from insert to upsert properly addresses the 409 conflict error when recrawling documents.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai Bot commented Aug 20, 2025

Walkthrough

Replaced insert calls with upsert(on_conflict="url,chunk_number") in two locations inside document_storage_service: the batch write path and the per-record fallback. Batch processing, progress reporting, and error handling remain unchanged; no public signature changes.
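To illustrate why an upsert keyed on (url, chunk_number) makes a recrawl idempotent, here is a minimal sketch. It is not the project's code: an in-memory SQLite database stands in for the Supabase/Postgres table, and the table name `crawled_pages`, the `content` column, and the helper `upsert_chunks` are assumptions made for the example. It needs SQLite 3.24 or newer for `ON CONFLICT ... DO UPDATE`.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the Supabase/Postgres table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE crawled_pages (
        url TEXT NOT NULL,
        chunk_number INTEGER NOT NULL,
        content TEXT,
        UNIQUE (url, chunk_number)
    )
    """
)

UPSERT_SQL = """
    INSERT INTO crawled_pages (url, chunk_number, content)
    VALUES (:url, :chunk_number, :content)
    ON CONFLICT (url, chunk_number) DO UPDATE SET content = excluded.content
"""


def upsert_chunks(rows):
    """Insert new chunks, or update the content of chunks that already exist."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, rows)


first_crawl = [
    {"url": "https://example.com/docs", "chunk_number": 0, "content": "old intro"},
    {"url": "https://example.com/docs", "chunk_number": 1, "content": "old body"},
]
recrawl = [
    {"url": "https://example.com/docs", "chunk_number": 0, "content": "new intro"},
    {"url": "https://example.com/docs", "chunk_number": 1, "content": "new body"},
]

upsert_chunks(first_crawl)
upsert_chunks(recrawl)  # second pass hits the unique key and updates in place

stored = conn.execute(
    "SELECT chunk_number, content FROM crawled_pages ORDER BY chunk_number"
).fetchall()
print(stored)
```

Running the second pass leaves two rows, both carrying the recrawled content, rather than raising a duplicate-key error.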

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Document storage upsert handling: python/src/server/services/storage/document_storage_service.py | Replaced insert(...).execute() with upsert(..., on_conflict="url,chunk_number").execute() in the main batch insertion and the per-record fallback; conflict handling now updates existing rows by (url, chunk_number); control flow and error handling preserved. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant DocumentStorageService as Service
  participant DB

  Client->>Service: storeBatch(batch_data)
  Service->>DB: upsert(batch_data) on (url, chunk_number)
  alt Upsert succeeds
    DB-->>Service: OK
    Service-->>Client: Batch stored
  else Error during batch upsert
    DB-->>Service: Error
    loop Per-record fallback
      Service->>DB: upsert(record) on (url, chunk_number)
      DB-->>Service: OK/Error per record
    end
    Service-->>Client: Fallback completed with results
  end
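The flow in the diagram above can be sketched roughly as follows. This is an illustration under assumptions, not the service's implementation: `store_batch`, the injected `upsert` callable, and `retry_delay` are hypothetical names; `batch_data`, `record`, and `max_retries` echo identifiers that appear in the review. The retry loop is inferred from the log messages quoted later in the review.

```python
import time
from typing import Callable, Dict, List

Record = Dict[str, object]
UpsertFn = Callable[[List[Record]], None]


def store_batch(
    upsert: UpsertFn,
    batch_data: List[Record],
    max_retries: int = 3,
    retry_delay: float = 0.0,
) -> int:
    """Upsert a batch; after repeated failures, fall back to single records.

    Returns the number of records that were stored.
    """
    for retry in range(max_retries):
        try:
            upsert(batch_data)
            return len(batch_data)
        except Exception:
            if retry < max_retries - 1:
                time.sleep(retry_delay)

    # Batch path exhausted: try each record on its own.
    successful = 0
    for record in batch_data:
        try:
            upsert([record])
            successful += 1
        except Exception:
            pass  # a real service would log the failing record here
    return successful


# Hypothetical stand-in for the database: the first two calls fail.
calls = {"count": 0}
stored: List[Record] = []


def flaky_upsert(rows: List[Record]) -> None:
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("transient failure")
    stored.extend(rows)


batch = [
    {"url": "https://example.com/a", "chunk_number": 0},
    {"url": "https://example.com/a", "chunk_number": 1},
]
stored_count = store_batch(flaky_upsert, batch)
print(stored_count, calls["count"])
```

Passing the write operation in as a callable keeps the sketch independent of any particular database client, which also makes the retry and fallback paths easy to exercise in a test.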

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I thump my paws on fields of bytes,
Upserts hop where inserts might,
Two keys aligned—chunk and URL—
Conflicts settle, rows sit well.
Batch or nibble, I persist with cheer. 🥕

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between abe14a3 and f3af161.

📒 Files selected for processing (1)
  • python/src/server/services/storage/document_storage_service.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/src/server/services/storage/document_storage_service.py

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/src/server/services/storage/document_storage_service.py (1)

342-351: Bug: per-record fallback re-upserts the entire batch; should upsert the single record

Inside the per-record fallback loop, you’re calling upsert(batch_data, ...) instead of upserting the current record. This will repeat the full-batch upsert N times and misreport successful_inserts. It defeats the purpose of an item-by-item fallback and can cause unnecessary load.

Apply this diff to upsert only the current record:

-                            try:
-                                client.table("archon_crawled_pages").upsert(batch_data, on_conflict="url,chunk_number").execute()
+                            try:
+                                # Upsert the single record in the per-record fallback
+                                client.table("archon_crawled_pages").upsert([record], on_conflict="url,chunk_number").execute()
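To make the difference concrete, here is a small sketch comparing the two fallback loops. `RejectingTable`, `fallback_buggy`, and `fallback_fixed` are hypothetical names introduced for illustration; the fake table only counts rows per call and rejects rows without content, which is an assumed failure mode rather than the one seen in the PR.

```python
from typing import Dict, List

Record = Dict[str, object]


class RejectingTable:
    """Hypothetical stand-in for the database table.

    It rejects any upsert call containing a row without content and
    records how many rows each call tried to write.
    """

    def __init__(self) -> None:
        self.rows_per_call: List[int] = []

    def upsert(self, rows: List[Record]) -> None:
        self.rows_per_call.append(len(rows))
        if any(row["content"] is None for row in rows):
            raise ValueError("content must not be null")


def fallback_buggy(table: RejectingTable, batch_data: List[Record]) -> int:
    successful_inserts = 0
    for record in batch_data:
        try:
            table.upsert(batch_data)  # bug: resends the whole batch each time
            successful_inserts += 1
        except ValueError:
            pass
    return successful_inserts


def fallback_fixed(table: RejectingTable, batch_data: List[Record]) -> int:
    successful_inserts = 0
    for record in batch_data:
        try:
            table.upsert([record])  # only the current record
            successful_inserts += 1
        except ValueError:
            pass
    return successful_inserts


batch = [
    {"url": "https://example.com/a", "chunk_number": 0, "content": "ok"},
    {"url": "https://example.com/a", "chunk_number": 1, "content": None},
    {"url": "https://example.com/a", "chunk_number": 2, "content": "ok"},
]

buggy_table, fixed_table = RejectingTable(), RejectingTable()
buggy_count = fallback_buggy(buggy_table, batch)
fixed_count = fallback_fixed(fixed_table, batch)
print(buggy_count, buggy_table.rows_per_call)  # 0 [3, 3, 3]
print(fixed_count, fixed_table.rows_per_call)  # 2 [1, 1, 1]
```

With one bad row in a batch of three, the buggy loop sends nine rows in total and stores nothing, while the fixed loop sends three rows and stores the two valid ones.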
🧹 Nitpick comments (1)
python/src/server/services/storage/document_storage_service.py (1)

330-358: Nit: logging still says “insert”; update to “upsert” for clarity

Now that the operation is an upsert, the log messages should reflect that to avoid confusion during debugging and ops.

-                        search_logger.warning(
-                            f"Error inserting batch (attempt {retry + 1}/{max_retries}): {e}"
-                        )
+                        search_logger.warning(
+                            f"Error upserting batch (attempt {retry + 1}/{max_retries}): {e}"
+                        )
...
-                        search_logger.error(
-                            f"Failed to insert batch after {max_retries} attempts: {e}"
-                        )
+                        search_logger.error(
+                            f"Failed to upsert batch after {max_retries} attempts: {e}"
+                        )
...
-                                search_logger.error(
-                                    f"Failed individual insert for {record['url']}: {individual_error}"
-                                )
+                                search_logger.error(
+                                    f"Failed individual upsert for {record['url']}: {individual_error}"
+                                )
...
-                        search_logger.info(
-                            f"Individual inserts: {successful_inserts}/{len(batch_data)} successful"
-                        )
+                        search_logger.info(
+                            f"Individual upserts: {successful_inserts}/{len(batch_data)} successful"
+                        )
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 8b29d20 and abe14a3.

📒 Files selected for processing (1)
  • python/src/server/services/storage/document_storage_service.py (2 hunks)

Comment thread python/src/server/services/storage/document_storage_service.py
mistakenly added batch_data twice instead of record to line 349.

@Chillbruhhh Chillbruhhh left a comment

I mistakenly typed batch_data for the fallback instead of record, even though I had record in my codebase. 🤦‍♂️ I committed the code changes straight on GitHub; I should've committed from the CLI. Fixed now.

@coleam00
Owner

Thanks for this @Chillbruhhh! I will be closing this PR, though, because I actually prefer having insert statements here instead of upsert. There should never be a need for an upsert in this case, so it would mask an underlying issue (which in this case we have since addressed!).

I am taking a look at your other PR and testing it right now! #437
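The trade-off behind keeping insert can be sketched as follows. This is only an illustration: SQLite stands in for the real database, the table name `crawled_pages` is hypothetical, and the thread does not say what the underlying issue was, so the "leftover row" here is an assumed scenario.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the real table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE crawled_pages (
        url TEXT NOT NULL,
        chunk_number INTEGER NOT NULL,
        content TEXT,
        UNIQUE (url, chunk_number)
    )
    """
)

row = ("https://example.com/docs", 0, "intro")
conn.execute("INSERT INTO crawled_pages VALUES (?, ?, ?)", row)

# A second plain insert of the same key fails loudly, which points at
# whatever left the earlier row in place.
try:
    conn.execute("INSERT INTO crawled_pages VALUES (?, ?, ?)", row)
    duplicate_detected = False
except sqlite3.IntegrityError as exc:
    duplicate_detected = True
    print(f"plain insert surfaced the leftover row: {exc}")

row_count = conn.execute("SELECT COUNT(*) FROM crawled_pages").fetchone()[0]
```

An upsert on the same key would have succeeded silently, so the signal that stale rows were still present would have been lost.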

@coleam00 coleam00 closed this Aug 30, 2025
POWERFULMOVES pushed a commit to POWERFULMOVES/PMOVES-Archon that referenced this pull request Feb 12, 2026
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5 to 6.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v5...v6)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
coleam00 pushed a commit that referenced this pull request Apr 7, 2026
…378) (#409)

* Investigate issues #252 and #378: test infrastructure failures

Root cause for #378: github-context.test.ts globally mocks fs/promises
with readFile returning empty string, which leaks to version.test.ts
causing JSON.parse('') to fail.

Issue #252 (test hang) no longer reproduces but preventive cleanup
should be added to executor.test.ts.

* Fix test infrastructure: mock.module leak and executor cleanup (#252, #378)

Remove global mock.module('fs/promises') from github-context.test.ts that was
leaking a readFile mock across all test files, causing 3 version.test.ts failures
in CI. The mock was redundant since ensureRepoReady and autoDetectAndLoadCommands
are already mocked at the adapter level.

Add afterAll mock.restore() to executor.test.ts to prevent module mocks and
pending timers from leaking to other test files.

Changes:
- Remove mock.module('fs/promises') from github-context.test.ts
- Add afterAll cleanup to executor.test.ts

Fixes #252, Fixes #378