Fix race condition in concurrent crawling with unique source IDs #472
Merged
19 commits
- 47edbb1 Fix race condition in concurrent crawling with unique source IDs (Wirasm)
- bc11a0c Fix title generation to use source_display_name for better AI context (Wirasm)
- 40abaf9 Skip AI title generation when display name is available (Wirasm)
- 56220bd Fix critical issues from code review (Wirasm)
- b9d52fb Add safety improvements from code review (Wirasm)
- 5e603ea Fix code extraction to use hash-based source_ids and improve display … (Wirasm)
- 76bf0f0 Fix critical variable shadowing and source_type determination issues (Wirasm)
- 698e3b9 Fix URL canonicalization and document metrics calculation (Wirasm)
- f5de76d Fix synchronous extract_source_summary blocking async event loop (Wirasm)
- 353264d Fix synchronous update_source_info blocking async event loop (Wirasm)
- 52187e2 Fix race condition in source creation using upsert (Wirasm)
- fd9209c Add migration detection UI components (Wirasm)
- a7da288 Integrate migration banner into main app (Wirasm)
- 49f9280 Enhance backend startup error instructions (Wirasm)
- a8b5a65 Add database schema caching to health endpoint (Wirasm)
- 3eda01e Clean up knowledge API imports and logging (Wirasm)
- f65c4ae Remove unused instructions prop from MigrationBanner (Wirasm)
- 75958f4 Add schema_valid flag to migration_required health response (Wirasm)
- 7dca34b Merge remote-tracking branch 'origin/main' into fix/source-id-archite… (Wirasm)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| -- ===================================================== | ||
| -- Add source_url and source_display_name columns | ||
| -- ===================================================== | ||
| -- This migration adds two new columns to better identify sources: | ||
| -- - source_url: The original URL that was crawled | ||
| -- - source_display_name: Human-readable name for UI display | ||
| -- | ||
| -- This solves the race condition issue where multiple crawls | ||
| -- to the same domain would conflict by using domain as source_id | ||
| -- ===================================================== | ||
| | ||
| -- Add new columns to archon_sources table | ||
| ALTER TABLE archon_sources | ||
| ADD COLUMN IF NOT EXISTS source_url TEXT, | ||
| ADD COLUMN IF NOT EXISTS source_display_name TEXT; | ||
| | ||
| -- Add indexes for the new columns for better query performance | ||
| CREATE INDEX IF NOT EXISTS idx_archon_sources_url ON archon_sources(source_url); | ||
| CREATE INDEX IF NOT EXISTS idx_archon_sources_display_name ON archon_sources(source_display_name); | ||
| | ||
| -- Add comments to document the new columns | ||
| COMMENT ON COLUMN archon_sources.source_url IS 'The original URL that was crawled to create this source'; | ||
| COMMENT ON COLUMN archon_sources.source_display_name IS 'Human-readable name for UI display (e.g., "GitHub - microsoft/typescript")'; | ||
| | ||
| -- Backfill existing data | ||
| -- For existing sources, copy source_id to both new fields as a fallback | ||
| UPDATE archon_sources | ||
| SET | ||
| source_url = COALESCE(source_url, source_id), | ||
| source_display_name = COALESCE(source_display_name, source_id) | ||
| WHERE | ||
| source_url IS NULL | ||
| OR source_display_name IS NULL; | ||
| | ||
| -- Note: source_id will now contain a unique hash instead of domain | ||
| -- This ensures no conflicts when multiple sources from same domain are crawled | ||
🛠️ Refactor suggestion
Schema extension is correct; add an updated_at trigger for archon_sources.
You already use update_updated_at_column() for other tables; archon_sources lacks that trigger. Today, the app attempts to set updated_at to "now()" (string), which will store a literal string unless a trigger updates it.
Add this DDL (outside this hunk):
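A minimal sketch of what such a trigger could look like in PostgreSQL. It assumes that update_updated_at_column() already exists as a trigger function (the comment says other tables use it) and that archon_sources has an updated_at column; the trigger name is hypothetical and is not taken from the pull request.

```sql
-- Sketch only: the trigger name below is an assumption, not the project's actual name.
-- Assumes update_updated_at_column() is an existing trigger function that
-- sets NEW.updated_at before the row is written.

-- Drop first so the statement can be re-run safely.
DROP TRIGGER IF EXISTS update_archon_sources_updated_at ON archon_sources;

-- Refresh updated_at on every row update (EXECUTE FUNCTION needs PostgreSQL 11+).
CREATE TRIGGER update_archon_sources_updated_at
    BEFORE UPDATE ON archon_sources
    FOR EACH ROW
    EXECUTE FUNCTION update_updated_at_column();
```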
Then remove manual "updated_at": "now()" writes in app code (see review in source_management_service.py).