Closed
Changes from 8 commits
25 commits
5b3500f
Optimize code summary prompt for small language models
Zebastjan Feb 22, 2026
401c12e
Add integration tests for code summary prompt validation
Zebastjan Feb 22, 2026
547440d
Add comprehensive testing documentation and results
Zebastjan Feb 22, 2026
ee980d1
Fix backend validation bug: add 'discovery' status to CrawlProgressRe…
Zebastjan Feb 22, 2026
827a1a0
Add crawl URL state tracking for checkpoint/resume
Zebastjan Feb 22, 2026
3c94561
Integrate crawl URL state tracking into crawling service
Zebastjan Feb 22, 2026
0aaee50
Update ADR-001 with accurate implementation state and roadmap
Zebastjan Feb 22, 2026
0f83aa6
Add unit tests for CrawlUrlStateService (checkpoint/resume)
Zebastjan Feb 22, 2026
990b8bb
Add provenance tracking to frontend and reconcile ADRs
Zebastjan Feb 22, 2026
70d4db7
Implement checkpoint/resume and provenance tracking backend
Zebastjan Feb 22, 2026
e93d42e
Add archon_migrations tracking to migration files 001-007, 012-013
Zebastjan Feb 22, 2026
f4c868d
Add re-vectorize, re-summarize endpoints and code summarization agent…
Zebastjan Feb 22, 2026
a66e355
Add progress tracking and concurrency limiting for re-vectorize/re-su…
Zebastjan Feb 22, 2026
377324e
Move Code Summarization to third tab in LLM Provider Settings
Zebastjan Feb 22, 2026
a7306d2
Fix Code Summary tab layout and Config button
Zebastjan Feb 22, 2026
ca3d358
Add Summary tab to LLM Provider Settings
Zebastjan Feb 22, 2026
e3fef61
Fix ternary syntax and add full model selection for Summary tab
Zebastjan Feb 22, 2026
9a12773
feat(ingestion): add restartable RAG pipeline with separable stages
Zebastjan Feb 22, 2026
940d0de
feat(ingestion): add worker API and crawl service integration
Zebastjan Feb 23, 2026
a60fe0d
Add roadmap document outlining project direction
Zebastjan Feb 23, 2026
fb01b9a
Add progress persistence, pause/resume, and RAG settings fixes
Zebastjan Feb 23, 2026
1fd5849
feat: add pause/resume infrastructure for crawl operations (WIP)
Zebastjan Feb 23, 2026
5e99e72
test: add comprehensive pause/resume/cancel testing infrastructure
Zebastjan Feb 23, 2026
9d24967
docs: add KNOWN_ISSUES - crawls currently broken
Zebastjan Feb 23, 2026
24739fb
fix: use 'idle' instead of invalid 'initializing' pipeline_status
Zebastjan Feb 24, 2026
98 changes: 98 additions & 0 deletions ADR-001: Crawl & Ingestion Pipeline Improvements.md
@@ -0,0 +1,98 @@
# ADR-001: Crawl & Ingestion Pipeline Improvements

**Status:** In Progress
**Date:** 2026-02-22
**Authors:** Zebastjan Johanzen

---

## Context

Archon's crawler and ingestion pipeline is the foundation everything else
depends on — MCP agent quality, RAG search accuracy, and AI coding assistant
usefulness all trace back to whether the knowledge base contains clean,
well-processed, verifiable data.

This ADR tracks remaining improvements needed for the crawl & ingestion pipeline.

---

## Completed ✅

The following have already been implemented:

| Feature | Status | Notes |
|---------|--------|-------|
| `CrawlStatus.discovery` enum | ✅ Done | Progress model includes discovery stage |
| Domain filtering | ✅ Done | Both UI controls and backend filtering |
| Priority discovery (llms.txt → sitemap → full) | ✅ Done | DiscoveryService with correct priority order |
| Per-chunk embedding metadata | ✅ Done | `embedding_model`, `embedding_dimension` on `archon_crawled_pages` |
| Chunk deduplication | ✅ Done | Unique constraint on `(url, chunk_number)` |

---

## Remaining Work

### Phase 1: Crawl Checkpoint & Resume

**Scope:** Add crawl state tracking so interrupted crawls can resume.

**Problems solved:**
- Mid-crawl crashes produce duplicate entries
- No recovery path; must clean DB and restart entire crawl

**Implementation:**
- Add `crawl_url_state` table with a per-URL status: `pending | fetched | embedded | failed`
- Make chunk writes idempotent (upsert keyed on URL + chunk hash)
- On restart, skip `embedded`, retry `failed`
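
The checkpoint/resume steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `CrawlUrlStateService`: it assumes SQLite in place of the project's real database, and the `chunks` table, column names, and helper functions (`open_db`, `enqueue`, `set_status`, `upsert_chunk`, `urls_to_process`) are hypothetical.

```python
import hashlib
import sqlite3


def open_db(path=":memory:"):
    """Create the (hypothetical) state and chunk tables if they do not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crawl_url_state ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL"
        " CHECK (status IN ('pending', 'fetched', 'embedded', 'failed')))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        " url TEXT NOT NULL,"
        " chunk_hash TEXT NOT NULL,"
        " content TEXT NOT NULL,"
        " PRIMARY KEY (url, chunk_hash))"
    )
    return conn


def enqueue(conn, urls):
    # INSERT OR IGNORE leaves existing rows alone, so re-enqueueing the same
    # URLs after a crash does not reset progress that was already recorded.
    conn.executemany(
        "INSERT OR IGNORE INTO crawl_url_state (url, status) VALUES (?, 'pending')",
        [(u,) for u in urls],
    )
    conn.commit()


def set_status(conn, url, status):
    conn.execute(
        "UPDATE crawl_url_state SET status = ? WHERE url = ?", (status, url)
    )
    conn.commit()


def upsert_chunk(conn, url, content):
    # Keying on (url, chunk_hash) makes the write idempotent: replaying the
    # same chunk after an interrupted crawl cannot create a duplicate row.
    chunk_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT INTO chunks (url, chunk_hash, content) VALUES (?, ?, ?) "
        "ON CONFLICT (url, chunk_hash) DO UPDATE SET content = excluded.content",
        (url, chunk_hash, content),
    )
    conn.commit()


def urls_to_process(conn):
    # On restart, everything except 'embedded' is still outstanding,
    # which includes 'failed' URLs that should be retried.
    rows = conn.execute(
        "SELECT url FROM crawl_url_state WHERE status != 'embedded' ORDER BY url"
    ).fetchall()
    return [row[0] for row in rows]
```

A resumed crawl would call `enqueue` with the full URL list again and then iterate over `urls_to_process`, so finished URLs are skipped without any manual database cleanup.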

---

### Phase 2: Re-vectorization Without Re-crawl

**Scope:** Allow reprocessing existing chunks with new embedding settings.

**Problems solved:**
- Can't change embedding provider (e.g., OpenAI → Ollama) without re-crawling
- Re-crawling is slow and puts avoidable load on source sites

**Implementation:**
- Add to `archon_sources`:
```sql
embedding_model TEXT,
embedding_dimensions INTEGER,
vectorizer_flags JSONB,
summarization_model TEXT
```
- Add "Reprocess" action to re-embed without re-fetching
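
As a rough sketch of what a "Reprocess" action might do, the snippet below re-embeds chunk text that is already stored and records the new model on the source, without fetching anything from the original site. The `Chunk` and `Source` classes, the `reprocess` function, and the `toy_embed` stand-in are all hypothetical; a real implementation would call the configured embedding provider and write to the database.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Chunk:
    content: str
    embedding: List[float] = field(default_factory=list)


@dataclass
class Source:
    source_id: str
    chunks: List[Chunk]
    embedding_model: str = ""
    embedding_dimensions: int = 0


def reprocess(
    source: Source, model_name: str, embed: Callable[[str], List[float]]
) -> Source:
    """Re-embed stored chunk text in place; the source site is never contacted."""
    for chunk in source.chunks:
        chunk.embedding = embed(chunk.content)
    # Record provenance so later stages know which model produced the vectors.
    source.embedding_model = model_name
    source.embedding_dimensions = (
        len(source.chunks[0].embedding) if source.chunks else 0
    )
    return source


def toy_embed(text: str) -> List[float]:
    # Stand-in for a real embedding call (assumption: any provider can be
    # wrapped as a function from text to a list of floats).
    return [float(len(text)), float(text.count(" ")), 1.0]
```

Passing the embedder in as a callable keeps the reprocessing step independent of any one provider, which is the property this phase is after.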

---

### Phase 3: Per-Source Provenance UI

**Scope:** Display processing metadata for each source in UI.

**Deliverable:**
- UI panel showing: embedding model used, dimensions, vectorizer flags, crawl timestamp

---

### Phase 4 (Optional): robots.txt Enforcement

**Scope:** Respect `Disallow:` directives in robots.txt files.

**Note:** The crawler currently reads robots.txt only for sitemap discovery and does not enforce crawl rules. Lower priority; can be revisited later.
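
One possible way to enforce `Disallow:` rules is Python's standard-library `urllib.robotparser`. The sketch below is only an illustration of that approach, not the crawler's current behavior; the helper names and the `ExampleBot` user agent are assumptions.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: List[str]) -> List[str]:
    # Keep only the URLs that the rules permit for this user agent.
    return [url for url in urls if parser.can_fetch(user_agent, url)]


rules = "User-agent: *\nDisallow: /private/\n"
parser = build_parser(rules)
candidates = ["https://example.com/docs", "https://example.com/private/x"]
print(allowed_urls(parser, "ExampleBot", candidates))
```

Since robots.txt is already fetched for sitemap discovery, the same content could in principle be handed to a parser like this before URLs are queued.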

---

## Consequences

- Resumable crawls prevent data loss and reduce site abuse
- Re-vectorization enables switching embedding providers without re-crawling
- Provenance UI helps debug embedding issues

---

## Future: Git Integration

With a resumable, reprocessable pipeline in place, Git integration becomes the next major feature (separate ADR).
20 changes: 20 additions & 0 deletions CLAUDE.md
@@ -150,6 +150,26 @@ make lint-be # Backend only (Ruff + MyPy)
make test # Run all tests
make test-fe # Frontend tests only
make test-be # Backend tests only

# Prompt regression tests
uv run python tests/prompts/test_code_summary_prompt.py # Test code summary prompt
uv run pytest tests/prompts/ -v # Run all prompt tests with pytest
```

### Prompt Regression Tests

**Location**: `python/tests/prompts/`
**Documentation**: `@PRPs/ai_docs/CODE_SUMMARY_PROMPT.md`

Regression tests for AI prompts used in production. These ensure prompt changes don't break output structure or quality.

**When to run**:
- Before merging prompt changes
- When updating LLM providers or models
- As part of CI/CD pipeline
- When debugging summary/output quality issues

See `python/tests/prompts/README.md` for details on adding new prompt tests.
```
Comment on lines +153 to 173

⚠️ Potential issue | 🟡 Minor

Orphaned closing fence at Line 173 breaks all downstream markdown rendering.

The Quick Workflows ```bash block originally closed at Line 173 (pre-existing ```). The new changes insert a closing fence at Line 157, correctly ending the bash block. However, the pre-existing ``` at Line 173 remains and is now orphaned — it opens a new unlanguaged code block, causing the entire ## Architecture Overview section and everything below it to render as raw code.

Fix: remove the pre-existing ``` at Line 173, since Line 157 already properly closes the bash block.

🔧 Proposed fix
  See `python/tests/prompts/README.md` for details on adding new prompt tests.
-```
 
 ## Architecture Overview
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 173-173: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` around lines 153 - 173, Remove the stray closing code-fence that
follows the "Prompt Regression Tests" / Quick Workflows bash block so there's
only one closing ``` for that block; specifically, delete the orphaned ``` that
currently precedes the "## Architecture Overview" heading so the bash block
closes correctly and the rest of the document renders as normal.


## Architecture Overview