Add crawl checkpoint/resume infrastructure #936
Closed
Zebastjan wants to merge 25 commits into coleam00:main from Zebastjan:feature/crawl-checkpoint-resume
Changes from 8 commits
- `5b3500f` Optimize code summary prompt for small language models (Zebastjan)
- `401c12e` Add integration tests for code summary prompt validation (Zebastjan)
- `547440d` Add comprehensive testing documentation and results (Zebastjan)
- `ee980d1` Fix backend validation bug: add 'discovery' status to CrawlProgressRe… (Zebastjan)
- `827a1a0` Add crawl URL state tracking for checkpoint/resume (Zebastjan)
- `3c94561` Integrate crawl URL state tracking into crawling service (Zebastjan)
- `0aaee50` Update ADR-001 with accurate implementation state and roadmap (Zebastjan)
- `0f83aa6` Add unit tests for CrawlUrlStateService (checkpoint/resume) (Zebastjan)
- `990b8bb` Add provenance tracking to frontend and reconcile ADRs (Zebastjan)
- `70d4db7` Implement checkpoint/resume and provenance tracking backend (Zebastjan)
- `e93d42e` Add archon_migrations tracking to migration files 001-007, 012-013 (Zebastjan)
- `f4c868d` Add re-vectorize, re-summarize endpoints and code summarization agent… (Zebastjan)
- `a66e355` Add progress tracking and concurrency limiting for re-vectorize/re-su… (Zebastjan)
- `377324e` Move Code Summarization to third tab in LLM Provider Settings (Zebastjan)
- `a7306d2` Fix Code Summary tab layout and Config button (Zebastjan)
- `ca3d358` Add Summary tab to LLM Provider Settings (Zebastjan)
- `e3fef61` Fix ternary syntax and add full model selection for Summary tab (Zebastjan)
- `9a12773` feat(ingestion): add restartable RAG pipeline with separable stages (Zebastjan)
- `940d0de` feat(ingestion): add worker API and crawl service integration (Zebastjan)
- `a60fe0d` Add roadmap document outlining project direction (Zebastjan)
- `fb01b9a` Add progress persistence, pause/resume, and RAG settings fixes (Zebastjan)
- `1fd5849` feat: add pause/resume infrastructure for crawl operations (WIP) (Zebastjan)
- `5e99e72` test: add comprehensive pause/resume/cancel testing infrastructure (Zebastjan)
- `9d24967` docs: add KNOWN_ISSUES - crawls currently broken (Zebastjan)
- `24739fb` fix: use 'idle' instead of invalid 'initializing' pipeline_status (Zebastjan)
# ADR-001: Crawl & Ingestion Pipeline Improvements

**Status:** In Progress
**Date:** 2026-02-22
**Authors:** Zebastjan Johanzen

---

## Context

Archon's crawler and ingestion pipeline is the foundation everything else
depends on: MCP agent quality, RAG search accuracy, and AI coding assistant
usefulness all trace back to whether the knowledge base contains clean,
well-processed, verifiable data.

This ADR tracks remaining improvements needed for the crawl & ingestion pipeline.

---

## Completed ✅

The following have already been implemented:

| Feature | Status | Notes |
|---------|--------|-------|
| `CrawlStatus.discovery` enum | ✅ Done | Progress model includes discovery stage |
| Domain filtering | ✅ Done | Both UI controls and backend filtering |
| Priority discovery (llms.txt → sitemap → full) | ✅ Done | DiscoveryService with correct priority order |
| Per-chunk embedding metadata | ✅ Done | `embedding_model`, `embedding_dimension` on `archon_crawled_pages` |
| Chunk deduplication | ✅ Done | Unique constraint on `(url, chunk_number)` |

---

## Remaining Work

### Phase 1: Crawl Checkpoint & Resume

**Scope:** Add crawl state tracking so interrupted crawls can resume.

**Problems solved:**
- Mid-crawl crashes produce duplicate entries
- No recovery path; must clean DB and restart entire crawl

**Implementation:**
- Add a `crawl_url_state` table that records a status per URL: `pending | fetched | embedded | failed`
- Make chunk writes idempotent (upsert keyed on URL + chunk hash)
- On restart, skip URLs marked `embedded` and retry those marked `failed`
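
The three implementation points above can be sketched end to end. This is a minimal illustration, not the code from this PR: an in-memory SQLite database stands in for Archon's actual store, and the `chunks` table, the `fetch` and `embed` callables, and the `crawl` and `set_status` functions are hypothetical names. Only `crawl_url_state` and its four statuses come from the ADR.

```python
import hashlib
import sqlite3

# Hypothetical schema: only `crawl_url_state` and its statuses come from the ADR.
SCHEMA = """
CREATE TABLE crawl_url_state (
    url    TEXT PRIMARY KEY,
    status TEXT NOT NULL
           CHECK (status IN ('pending', 'fetched', 'embedded', 'failed'))
);
CREATE TABLE chunks (
    url        TEXT NOT NULL,
    chunk_hash TEXT NOT NULL,
    content    TEXT NOT NULL,
    embedding  TEXT,
    PRIMARY KEY (url, chunk_hash)
);
"""


def set_status(db, url, status):
    db.execute("UPDATE crawl_url_state SET status = ? WHERE url = ?", (status, url))


def crawl(db, urls, fetch, embed):
    """Process every URL not yet `embedded`; safe to call again after a crash.

    Returns the list of URLs that this run attempted.
    """
    # Register unseen URLs as pending; URLs already tracked keep their status.
    db.executemany(
        "INSERT OR IGNORE INTO crawl_url_state (url, status) VALUES (?, 'pending')",
        [(u,) for u in urls],
    )
    # Skip `embedded`; everything else (including `failed`) is retried.
    todo = [
        row[0]
        for row in db.execute(
            "SELECT url FROM crawl_url_state WHERE status != 'embedded' ORDER BY url"
        )
    ]
    for url in todo:
        try:
            chunks = fetch(url)
            set_status(db, url, "fetched")
            for content in chunks:
                chunk_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
                # Idempotent write: a repeated (url, chunk_hash) replaces the
                # existing row instead of creating a duplicate.
                db.execute(
                    "INSERT OR REPLACE INTO chunks (url, chunk_hash, content, embedding) "
                    "VALUES (?, ?, ?, ?)",
                    (url, chunk_hash, content, embed(content)),
                )
            set_status(db, url, "embedded")
        except Exception:
            set_status(db, url, "failed")
        db.commit()  # checkpoint after every URL
    return todo
```

Calling `crawl` a second time with the same URL list touches only the URLs that did not reach `embedded`, and the composite primary key on `chunks` keeps a retried URL from producing duplicate rows.
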

---

### Phase 2: Re-vectorization Without Re-crawl

**Scope:** Allow reprocessing existing chunks with new embedding settings.

**Problems solved:**
- Can't change embedding provider (e.g., OpenAI → Ollama) without re-crawling
- Re-crawling is slow and abusive to source sites

**Implementation:**
- Add to `archon_sources`:
  ```sql
  -- PostgreSQL syntax; adds the provenance columns to the existing table
  ALTER TABLE archon_sources
    ADD COLUMN embedding_model TEXT,
    ADD COLUMN embedding_dimensions INTEGER,
    ADD COLUMN vectorizer_flags JSONB,
    ADD COLUMN summarization_model TEXT;
  ```
- Add "Reprocess" action to re-embed without re-fetching
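
A rough sketch of what the "Reprocess" action could do is shown here. It is an illustration under stated assumptions, not the PR's implementation: plain dictionaries stand in for a row of `archon_sources` and its stored chunks, and `reprocess_source`, `embed`, and the `content` and `embedding` keys are hypothetical names. Only the column names `embedding_model`, `embedding_dimensions`, and `vectorizer_flags` come from the ADR.

```python
from typing import Callable, Dict, List, Optional


def reprocess_source(
    source: Dict,
    chunks: List[Dict],
    embed: Callable[[str], List[float]],
    model_name: str,
    vectorizer_flags: Optional[Dict] = None,
) -> int:
    """Re-embed stored chunk text with a new model; no page is fetched again.

    Returns the number of chunks that were re-embedded.
    """
    # Compute every vector first so a failure leaves existing data untouched.
    vectors = [embed(chunk["content"]) for chunk in chunks]
    dimensions = {len(vector) for vector in vectors}
    if len(dimensions) > 1:
        raise ValueError("embedding model returned inconsistent dimensions")

    for chunk, vector in zip(chunks, vectors):
        chunk["embedding"] = vector

    # Record provenance on the source only when something was re-embedded.
    if vectors:
        source["embedding_model"] = model_name
        source["embedding_dimensions"] = dimensions.pop()
        source["vectorizer_flags"] = dict(vectorizer_flags or {})
    return len(vectors)
```

Because the function reads only the stored `content` of each chunk, switching providers costs embedding calls but no requests to the source site.
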

---

### Phase 3: Per-Source Provenance UI

**Scope:** Display processing metadata for each source in the UI.

**Deliverable:**
- UI panel showing: embedding model used, dimensions, vectorizer flags, crawl timestamp

---

### Phase 4 (Optional): robots.txt Enforcement

**Scope:** Respect `Disallow:` directives in robots.txt files.

**Note:** The crawler currently reads robots.txt only for sitemap discovery and does not enforce crawl rules. This is lower priority and can be revisited later.
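
If enforcement is picked up later, Python's standard library already covers the parsing side. The sketch below is one possible shape, assuming the robots.txt body has already been downloaded; `build_robots_checker`, `filter_allowed`, and the `ArchonCrawler` user-agent string are hypothetical names, not part of Archon.

```python
from typing import Callable, Iterable, List
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str) -> Callable[[str], bool]:
    """Return a predicate telling whether `user_agent` may fetch a given URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())

    def is_allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return is_allowed


def filter_allowed(urls: Iterable[str], is_allowed: Callable[[str], bool]) -> List[str]:
    """Drop URLs that robots.txt disallows, keeping the original order."""
    return [url for url in urls if is_allowed(url)]
```

A crawler could build the checker once per domain and run the discovered URL list through `filter_allowed` before queueing fetches.
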

---

## Consequences

- Resumable crawls prevent data loss and reduce site abuse
- Re-vectorization enables switching embedding providers without re-crawling
- Provenance UI helps debug embedding issues

---

## Future: Git Integration

With a resumable, reprocessable pipeline in place, Git integration becomes the next major feature (separate ADR).
Orphaned closing fence at Line 173 breaks all downstream markdown rendering.

The Quick Workflows `` ```bash `` block originally closed at Line 173 (pre-existing `` ``` ``). The new changes insert a closing fence at Line 157, correctly ending the bash block. However, the pre-existing `` ``` `` at Line 173 remains and is now orphaned: it opens a new unlanguaged code block, causing the entire `## Architecture Overview` section and everything below it to render as raw code.

Fix: remove the pre-existing `` ``` `` at Line 173, since Line 157 already properly closes the bash block.

🔧 Proposed fix

     See `python/tests/prompts/README.md` for details on adding new prompt tests.
    -```
     ## Architecture Overview

🧰 Tools

🪛 markdownlint-cli2 (0.21.0)

[warning] 173-173: Fenced code blocks should have a language specified (MD040, fenced-code-language)