Add crawl checkpoint/resume infrastructure #936
Closed
Zebastjan wants to merge 25 commits into coleam00:main from Zebastjan:feature/crawl-checkpoint-resume
Changes from 10 commits
5b3500f Optimize code summary prompt for small language models (Zebastjan)
401c12e Add integration tests for code summary prompt validation (Zebastjan)
547440d Add comprehensive testing documentation and results (Zebastjan)
ee980d1 Fix backend validation bug: add 'discovery' status to CrawlProgressRe… (Zebastjan)
827a1a0 Add crawl URL state tracking for checkpoint/resume (Zebastjan)
3c94561 Integrate crawl URL state tracking into crawling service (Zebastjan)
0aaee50 Update ADR-001 with accurate implementation state and roadmap (Zebastjan)
0f83aa6 Add unit tests for CrawlUrlStateService (checkpoint/resume) (Zebastjan)
990b8bb Add provenance tracking to frontend and reconcile ADRs (Zebastjan)
70d4db7 Implement checkpoint/resume and provenance tracking backend (Zebastjan)
e93d42e Add archon_migrations tracking to migration files 001-007, 012-013 (Zebastjan)
f4c868d Add re-vectorize, re-summarize endpoints and code summarization agent… (Zebastjan)
a66e355 Add progress tracking and concurrency limiting for re-vectorize/re-su… (Zebastjan)
377324e Move Code Summarization to third tab in LLM Provider Settings (Zebastjan)
a7306d2 Fix Code Summary tab layout and Config button (Zebastjan)
ca3d358 Add Summary tab to LLM Provider Settings (Zebastjan)
e3fef61 Fix ternary syntax and add full model selection for Summary tab (Zebastjan)
9a12773 feat(ingestion): add restartable RAG pipeline with separable stages (Zebastjan)
940d0de feat(ingestion): add worker API and crawl service integration (Zebastjan)
a60fe0d Add roadmap document outlining project direction (Zebastjan)
fb01b9a Add progress persistence, pause/resume, and RAG settings fixes (Zebastjan)
1fd5849 feat: add pause/resume infrastructure for crawl operations (WIP) (Zebastjan)
5e99e72 test: add comprehensive pause/resume/cancel testing infrastructure (Zebastjan)
9d24967 docs: add KNOWN_ISSUES - crawls currently broken (Zebastjan)
24739fb fix: use 'idle' instead of invalid 'initializing' pipeline_status (Zebastjan)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| # ADR-001: Crawl & Ingestion Pipeline Improvements | ||
|
|
||
| **Status:** Superseded by ADR-002 | ||
| **Date:** 2026-02-22 | ||
| **Authors:** Zebastjan Johanzen | ||
|
|
||
| > ⚠️ **This ADR has been superseded by ADR-002**. All content has been merged into ADR-002 for a unified view of crawl reliability, provenance tracking, and validation tooling. | ||
|
|
||
| --- | ||
|
|
||
| ## Completed ✅ | ||
|
|
||
| *(These were already completed before ADR-001 was superseded)* | ||
|
|
||
| | Feature | Status | Notes | | ||
| |---------|--------|-------| | ||
| | `CrawlStatus.discovery` enum | ✅ Done | Progress model includes discovery stage | | ||
| | Domain filtering | ✅ Done | Both UI controls and backend filtering | | ||
| | Priority discovery (llms.txt → sitemap → full) | ✅ Done | DiscoveryService with correct priority order | | ||
| | Per-chunk embedding metadata | ✅ Done | `embedding_model`, `embedding_dimension` on `archon_crawled_pages` | | ||
| | Chunk deduplication | ✅ Done | Unique constraint on `(url, chunk_number)` | | ||
|
|
||
| --- | ||
|
|
||
| ## Remaining Work | ||
|
|
||
| *(See ADR-002 for the complete roadmap)* |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,318 @@ | ||
| # ADR-002 Implementation Status | ||
|
|
||
| ## Overview | ||
| This document tracks the implementation progress of ADR-002: Crawl Reliability, Provenance Tracking & Validation. | ||
|
|
||
| **Branch:** `feature/crawl-checkpoint-resume` | ||
| **Date:** 2026-02-22 | ||
|
|
||
| --- | ||
|
|
||
| ## Part 1: Checkpoint/Resume - ✅ COMPLETE | ||
|
|
||
| ### Backend Implementation | ||
|
|
||
| **Status:** ✅ Fully Implemented | ||
|
|
||
| **Files Modified:** | ||
| 1. ✅ `python/src/server/services/crawling/crawling_service.py` | ||
| - Added `_filter_already_processed_urls()` helper method (lines 857-889) | ||
| - Updated `_crawl_by_url_type()` signature to accept `source_id` and `has_existing_state` | ||
| - Applied resume filtering to sitemap crawling (lines 1101-1120) | ||
| - Applied resume filtering to link collection batch crawling (lines 1046-1051) | ||
| - Applied resume filtering to recursive crawling (lines 1066-1073, 1155-1162) | ||
| - Updated call sites to pass source_id and has_existing_state parameters | ||
|
|
||
| 2. ✅ `python/src/server/services/crawling/strategies/recursive.py` | ||
| - Updated `crawl_recursive_with_progress()` signature to accept `source_id` and `url_state_service` | ||
| - Pre-populated visited set with already-embedded URLs (lines 158-165) | ||
| - Prevents re-crawling of completed URLs during recursive depth traversal | ||
|
|
||
| 3. ✅ Infrastructure Already Complete (from previous work) | ||
| - `archon_crawl_url_state` table exists | ||
| - `CrawlUrlStateService` with full CRUD operations | ||
| - Integration with document storage operations | ||
|
|
||
| **How It Works:** | ||
| 1. **Detection:** When `orchestrate_crawl()` starts, it checks for existing crawl state using `url_state_service.has_existing_state()` | ||
| 2. **Logging:** If state exists with pending/failed URLs, logs resume information | ||
| 3. **Filtering:** Before crawling strategies execute: | ||
| - Sitemap: Filters URLs before batch crawl | ||
| - Link Collection: Filters extracted links before batch crawl | ||
| - Recursive: Pre-populates visited set to skip embedded URLs | ||
| 4. **Resume:** Only unprocessed URLs are crawled, preventing duplicates | ||
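The filtering step described above can be sketched roughly as follows. This is only an illustration, not the code in `crawling_service.py`: the `UrlStatus` values, the `filter_already_processed` function, and the in-memory `url_states` mapping are all hypothetical stand-ins for whatever `CrawlUrlStateService` actually returns.

```python
from enum import Enum


class UrlStatus(str, Enum):
    # Hypothetical status names; the real archon_crawl_url_state table may differ.
    PENDING = "pending"
    EMBEDDED = "embedded"
    FAILED = "failed"


def filter_already_processed(urls, url_states):
    """Split candidate URLs into (to_crawl, skipped) using state from a prior run.

    url_states maps URL -> UrlStatus. URLs with no stored state, or with a
    pending/failed state, are kept for crawling; embedded URLs are skipped.
    """
    to_crawl, skipped = [], []
    seen = set()
    for url in urls:
        if url in seen:  # ignore duplicates inside the candidate list itself
            continue
        seen.add(url)
        if url_states.get(url) == UrlStatus.EMBEDDED:
            skipped.append(url)
        else:
            to_crawl.append(url)
    return to_crawl, skipped


# Assumed example data: one URL embedded earlier, one failed, one never seen.
previous = {
    "https://example.com/a": UrlStatus.EMBEDDED,
    "https://example.com/b": UrlStatus.FAILED,
}
candidates = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]
to_crawl, skipped = filter_already_processed(candidates, previous)
print(f"Resume filtering | skipped={len(skipped)} already-embedded URLs")
```

Under this sketch, failed URLs are retried on resume, which matches the "pending/failed" wording in step 2; whether the real service retries failures is an assumption.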
|
|
||
| **Testing Verification:** | ||
| ```bash | ||
| # Test scenario: | ||
| # 1. Start crawl of sitemap with 100 URLs | ||
| # 2. Kill server after 30 URLs embedded | ||
| # 3. Check archon_crawl_url_state shows 30 embedded, 70 pending | ||
| # 4. Restart server and re-trigger crawl | ||
| # 5. Verify logs show "Resume filtering | skipped=30 already-embedded URLs" | ||
| # 6. Verify only 70 new URLs are processed | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Part 2: Provenance Tracking - ✅ BACKEND COMPLETE, ⏳ FRONTEND PENDING | ||
|
|
||
| ### Backend Implementation | ||
|
|
||
| **Status:** ✅ Fully Implemented | ||
|
|
||
| **Database Migration:** | ||
| ✅ `migration/0.1.0/013_add_provenance_tracking.sql` | ||
| - Adds 7 new columns to `archon_sources`: | ||
| - `embedding_model` (TEXT) - e.g., "text-embedding-3-small" | ||
| - `embedding_dimensions` (INTEGER) - e.g., 1536 | ||
| - `embedding_provider` (TEXT) - e.g., "openai" | ||
| - `vectorizer_settings` (JSONB) - chunk_size, use_contextual, use_hybrid | ||
| - `summarization_model` (TEXT) - e.g., "gpt-4o-mini" | ||
| - `last_crawled_at` (TIMESTAMPTZ) | ||
| - `last_vectorized_at` (TIMESTAMPTZ) | ||
| - Creates indexes on `embedding_model` and `embedding_provider` | ||
| - Adds column comments for documentation | ||
|
|
||
| **Files Modified:** | ||
| 1. ✅ `python/src/server/services/source_management_service.py` | ||
| - Updated `update_source_info()` signature to accept provenance parameters (lines 214-232) | ||
| - Added provenance fields to existing source upsert (lines 294-313) | ||
| - Added provenance fields to new source creation (lines 378-402) | ||
| - Sets `last_crawled_at` and `last_vectorized_at` timestamps | ||
|
|
||
| 2. ✅ `python/src/server/services/crawling/document_storage_operations.py` | ||
| - Captures embedding configuration from credential service (lines 376-392) | ||
| - Retrieves: embedding_provider, embedding_model, embedding_dimensions | ||
| - Retrieves summarization_model from RAG strategy settings | ||
| - Passes all provenance to `update_source_info()` during crawl | ||
|
|
||
| **How It Works:** | ||
| 1. **Capture:** During `_create_source_records()`, reads current provider configuration | ||
| 2. **Store:** Passes configuration to `update_source_info()` which upserts to database | ||
| 3. **Timestamps:** Automatically sets `last_crawled_at` and `last_vectorized_at` to current time | ||
| 4. **Persistence:** All sources now track which models/settings were used | ||
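As a rough sketch of the capture-and-store flow above, the provenance payload could be assembled like this before being handed to the upsert. The `build_provenance` helper and its `now` parameter are hypothetical; only the field names come from the migration described in this document, and the real `update_source_info()` signature may differ.

```python
from datetime import datetime, timezone


def build_provenance(embedding_provider, embedding_model, embedding_dimensions,
                     summarization_model=None, vectorizer_settings=None, now=None):
    """Assemble provenance fields for a source record (illustrative helper only)."""
    # Allow the caller to inject a timestamp so the result is reproducible in tests.
    now = now or datetime.now(timezone.utc)
    timestamp = now.isoformat()
    return {
        "embedding_provider": embedding_provider,
        "embedding_model": embedding_model,
        "embedding_dimensions": embedding_dimensions,
        "summarization_model": summarization_model,
        # Empty dict when no settings are known, rather than placeholder values.
        "vectorizer_settings": vectorizer_settings or {},
        # Both timestamps are set to the same instant, as in step 3 above.
        "last_crawled_at": timestamp,
        "last_vectorized_at": timestamp,
    }


provenance = build_provenance(
    "openai", "text-embedding-3-small", 1536, summarization_model="gpt-4o-mini"
)
print(sorted(provenance))
```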
|
|
||
| ### Frontend Implementation | ||
|
|
||
| **Status:** ⏳ PENDING | ||
|
|
||
| **Files to Modify:** | ||
| 1. ⏳ `archon-ui-main/src/features/knowledge/types/knowledge.ts` | ||
| ```typescript | ||
| export interface KnowledgeSource { | ||
| source_id: string; | ||
| // ... existing fields ... | ||
| embedding_model?: string; | ||
| embedding_dimensions?: number; | ||
| embedding_provider?: string; | ||
| vectorizer_settings?: { | ||
| use_contextual?: boolean; | ||
| use_hybrid?: boolean; | ||
| chunk_size?: number; | ||
| }; | ||
| summarization_model?: string; | ||
| last_crawled_at?: string; | ||
| last_vectorized_at?: string; | ||
| } | ||
| ``` | ||
|
|
||
| 2. ⏳ `archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx` | ||
| - Add expandable "Processing Details" section using Radix Collapsible | ||
| - Display embedding_provider/embedding_model (embedding_dimensions D) | ||
| - Display summarization_model | ||
| - Display formatted last_crawled_at timestamp | ||
| - Use Tron-inspired glassmorphism styling | ||
|
|
||
| **UI Design:** | ||
| ```tsx | ||
| <Collapsible.Root> | ||
| <Collapsible.Trigger className="flex items-center gap-2 text-sm text-gray-400 hover:text-cyan-400"> | ||
| <ChevronRight className="transition-transform" /> | ||
| Processing Details | ||
| </Collapsible.Trigger> | ||
| <Collapsible.Content className="mt-2 text-xs text-gray-400 space-y-1 pl-6"> | ||
| <div>Embeddings: {embedding_provider}/{embedding_model} ({embedding_dimensions}D)</div> | ||
| <div>Summarization: {summarization_model}</div> | ||
| <div>Last crawled: {formatDate(last_crawled_at)}</div> | ||
| </Collapsible.Content> | ||
| </Collapsible.Root> | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Part 3: Validation Tools - ❌ NOT STARTED | ||
|
|
||
| ### Backend Implementation | ||
|
|
||
| **Status:** ❌ Not Started | ||
|
|
||
| **Files to Create:** | ||
| 1. ❌ `python/src/server/api_routes/knowledge_api.py` (or modify existing) | ||
| - Add `GET /api/knowledge-items/{source_id}/validate` endpoint | ||
| - Checks: | ||
| - Missing chunks (URLs marked embedded but no chunks exist) | ||
| - Zero-vector embeddings (null or all-zero vectors) | ||
| - Dimension mismatches (mixed embedding dimensions) | ||
| - Orphaned pages (page_metadata without chunks) | ||
| - Failed URLs that never recovered | ||
| - Returns: `{ valid: bool, issues: Issue[], total_issues: int }` | ||
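Two of the checks listed above (zero-vector embeddings and dimension mismatches) can be illustrated with a small in-memory sketch. The `validate_embeddings` function and the chunk dictionaries are hypothetical; a real endpoint would presumably run these checks in the database, and the issue shape here only mirrors the `{ valid, issues, total_issues }` response described in this document.

```python
def validate_embeddings(chunks):
    """Check a list of chunk dicts ({'url': str, 'embedding': list | None}).

    Flags null or all-zero embeddings, and reports a mismatch when the
    non-empty embeddings do not all share one dimension.
    """
    issues = []
    dimensions = set()
    for chunk in chunks:
        embedding = chunk.get("embedding")
        # Null and all-zero vectors are grouped together, as in the check list.
        if embedding is None or all(value == 0 for value in embedding):
            issues.append({"type": "zero_vector", "url": chunk["url"]})
        if embedding:
            dimensions.add(len(embedding))
    if len(dimensions) > 1:
        issues.append({"type": "dimension_mismatch", "dimensions": sorted(dimensions)})
    return {"valid": not issues, "issues": issues, "total_issues": len(issues)}


# Assumed sample data: one healthy chunk, one zero vector, one odd dimension.
report = validate_embeddings([
    {"url": "https://example.com/a", "embedding": [0.1, 0.2, 0.3]},
    {"url": "https://example.com/b", "embedding": [0.0, 0.0, 0.0]},
    {"url": "https://example.com/c", "embedding": [0.5, 0.5]},
])
print(report["valid"], report["total_issues"])
```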
|
|
||
| 2. ❌ `migration/0.1.0/014_add_validation_functions.sql` | ||
| ```sql | ||
| CREATE OR REPLACE FUNCTION count_zero_vectors(src_id TEXT) | ||
| RETURNS INTEGER AS $$ | ||
| SELECT COUNT(*) | ||
| FROM archon_documents | ||
| WHERE source_id = src_id | ||
| AND embedding IS NOT NULL | ||
| AND array_length(embedding, 1) > 0 | ||
| AND embedding = array_fill(0::float, ARRAY[array_length(embedding, 1)]); | ||
| $$ LANGUAGE SQL; | ||
| ``` | ||
|
|
||
| ### MCP Tool Implementation | ||
|
|
||
| **Status:** ❌ Not Started | ||
|
|
||
| **Files to Modify:** | ||
| 1. ❌ `python/src/mcp_server/features/rag/rag_tools.py` | ||
| - Add `rag_validate_source(source_id: str)` tool | ||
| - Calls validation API endpoint | ||
| - Returns summary: valid, error_count, warning_count, issues_summary, recommendation | ||
| - Read-only (no writes, no fixes) | ||
|
|
||
| **Tool Usage Example:** | ||
| ```python | ||
| @mcp.tool() | ||
| async def rag_validate_source(source_id: str) -> dict: | ||
| """Check knowledge source health before using for RAG.""" | ||
| # Calls GET /api/knowledge-items/{source_id}/validate | ||
| # Returns summary for agent decision-making | ||
| ``` | ||
|
|
||
| ### Frontend Implementation | ||
|
|
||
| **Status:** ❌ Not Started | ||
|
|
||
| **Files to Create:** | ||
| 1. ❌ `archon-ui-main/src/features/knowledge/components/ValidationPanel.tsx` | ||
| - "Validate" button on knowledge item action menu | ||
| - Opens expandable panel or modal with validation results | ||
| - Color-coded issues (red=error, yellow=warning, blue=info) | ||
| - "Fix" buttons for fixable issues | ||
|
|
||
| 2. ❌ `archon-ui-main/src/features/knowledge/hooks/useValidateSource.ts` | ||
| - TanStack Query hook for validation endpoint | ||
| - `useValidateSource(sourceId)` → returns validation data | ||
|
|
||
| --- | ||
|
|
||
| ## Testing Checklist | ||
|
|
||
| ### Part 1: Checkpoint/Resume | ||
| - [ ] Start sitemap crawl with 100 URLs | ||
| - [ ] Kill process at 30% complete | ||
| - [ ] Verify `archon_crawl_url_state` shows mix of embedded/pending | ||
| - [ ] Restart and re-trigger crawl | ||
| - [ ] Verify only pending URLs processed | ||
| - [ ] Verify no duplicates in final data | ||
| - [ ] Check logs show "Resume filtering | skipped=X" | ||
|
|
||
| ### Part 2: Provenance Tracking | ||
| - [x] Backend: Migration created | ||
| - [x] Backend: Service layer updated | ||
| - [x] Backend: Provenance captured during crawl | ||
| - [ ] Frontend: Types updated | ||
| - [ ] Frontend: UI displays provenance | ||
| - [ ] Test: Crawl a source | ||
| - [ ] Test: Query source record | ||
| - [ ] Test: Verify provenance fields populated | ||
|
|
||
| ### Part 3: Validation Tools | ||
| - [ ] Backend: Validation endpoint created | ||
| - [ ] Backend: Database functions created | ||
| - [ ] MCP: Validation tool implemented | ||
| - [ ] Frontend: Validation UI created | ||
| - [ ] Test: Insert corrupted data (zero vector) | ||
| - [ ] Test: Validation detects issues | ||
| - [ ] Test: MCP tool returns correct summary | ||
|
|
||
| --- | ||
|
|
||
| ## Migration Deployment | ||
|
|
||
| **Required Database Migrations:** | ||
| 1. ✅ `013_add_provenance_tracking.sql` - Ready to deploy | ||
| 2. ❌ `014_add_validation_functions.sql` - Not created yet | ||
|
|
||
| **Deployment Steps:** | ||
| ```bash | ||
| # Apply provenance tracking migration | ||
| supabase db push | ||
| # Or manually run the SQL in Supabase dashboard | ||
| ``` | ||
|
|
||
| **Rollback Plan:** | ||
| ```sql | ||
| -- If needed, rollback provenance columns: | ||
| ALTER TABLE archon_sources | ||
| DROP COLUMN IF EXISTS embedding_model, | ||
| DROP COLUMN IF EXISTS embedding_dimensions, | ||
| DROP COLUMN IF EXISTS embedding_provider, | ||
| DROP COLUMN IF EXISTS vectorizer_settings, | ||
| DROP COLUMN IF EXISTS summarization_model, | ||
| DROP COLUMN IF EXISTS last_crawled_at, | ||
| DROP COLUMN IF EXISTS last_vectorized_at; | ||
|
|
||
| DROP INDEX IF EXISTS idx_archon_sources_embedding_model; | ||
| DROP INDEX IF EXISTS idx_archon_sources_embedding_provider; | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Priority for Remaining Work | ||
|
|
||
| ### High Priority (Complete Part 2) | ||
| 1. Update frontend types for provenance fields | ||
| 2. Add provenance display to KnowledgeCard component | ||
| 3. Test end-to-end provenance tracking | ||
|
|
||
| ### Medium Priority (Part 3 - Validation) | ||
| 4. Create validation API endpoint | ||
| 5. Create database validation functions | ||
| 6. Build validation UI component | ||
|
|
||
| ### Low Priority (Part 3 - MCP Tool) | ||
| 7. Add read-only MCP validation tool | ||
|
|
||
| --- | ||
|
|
||
| ## Known Issues / Notes | ||
|
|
||
| 1. **Provenance Settings:** Currently using placeholder values for `vectorizer_settings`. These should be populated from actual RAG strategy configuration when contextual embeddings or hybrid search are implemented. | ||
|
|
||
| 2. **Recursive Crawl Resume:** The current implementation pre-populates the visited set with embedded URLs. This works well but doesn't distinguish between "already visited in this session" vs "embedded in previous session". This is acceptable for now. | ||
|
|
||
| 3. **Type Safety:** Some type warnings in `source_management_service.py` related to optional parameters. These are safe to ignore as the functions handle None values correctly. | ||
|
|
||
| 4. **Migration Order:** The provenance migration (013) must be run before the validation migration (014) when it's created. | ||
|
|
||
| --- | ||
|
|
||
| ## Next Steps | ||
|
|
||
| **Immediate:** | ||
| 1. Apply database migration `013_add_provenance_tracking.sql` | ||
| 2. Test checkpoint/resume functionality end-to-end | ||
| 3. Update frontend types and UI for provenance display | ||
|
|
||
| **Short Term:** | ||
| 4. Implement validation API endpoint and database functions | ||
| 5. Build validation UI component | ||
|
|
||
| **Future Enhancements:** | ||
| - Bulk loading UI/API (separate ADR) | ||
| - Manifest import capability (separate ADR) | ||
| - Re-vectorization tooling using provenance data | ||
| - Provenance-based source filtering in UI | ||
Status markers are out of date: the frontend implementation is already present in this PR.
The "Part 2: Frontend Implementation" section is marked "⏳ PENDING", but the items it lists (types in `knowledge.ts`, provenance display in `KnowledgeCard.tsx`) are already implemented in this PR. Similarly, the checklist at Lines 274–275 should be checked.

Additionally, Part 4 (Lines 211–256) marks all reprocessing items as "❌ Not Started," but this PR includes:
- `useRevectorizeKnowledgeItem` / `useResummarizeKnowledgeItem` hooks
- `revectorizeKnowledgeItem` / `resummarizeKnowledgeItem` service methods
- `KnowledgeCardActions` / `KnowledgeCard`

Update the status markers and checklists to reflect the code that ships with this PR.