stranske · Jan 7, 2026
@@ -1,8 +1,8 @@
 # LangChain Post-Code Production Capabilities - Evaluation & Rollout Plan
 
 > **Date:** January 7, 2026  
-> **Status:** Phase 1 & 2 Deployed - Active in Production  
-> **Last Validation:** 2026-01-07 (Post-Sync Cleanup)  
+> **Status:** Phase 3 Planning - Testing Cycle Defined  
+> **Last Validation:** 2026-01-07 (Phase 3 Test Plan Added)  
 
 ---
 
@@ -75,7 +75,7 @@
    - ✅ Travel-Plan-Permission (synced 2026-01-07)
    - ✅ Portable-Alpha-Extension-Model (synced 2026-01-07)
    - ✅ Trend_Model_Project (synced 2026-01-07)
-   - ⚠️ Collab-Admin (sync PR #113 pending - has lint failures)
+   - ✅ Collab-Admin (synced 2026-01-07, PR #113 merged)
 
 2. **Format Labels** - All 7 consumer repos have `agents:format`, `agents:formatted`, `agents:optimize`, `agents:apply-suggestions`:
    - ✅ Manager-Database (tested live - issue #184, synced 2026-01-07)
@@ -84,7 +84,7 @@
    - ✅ Travel-Plan-Permission (synced 2026-01-07)
    - ✅ Portable-Alpha-Extension-Model (synced 2026-01-07)
    - ✅ Trend_Model_Project (synced 2026-01-07)
-   - ⚠️ Collab-Admin (sync PR #113 pending - has lint failures)
+   - ✅ Collab-Admin (synced 2026-01-07, PR #113 merged)
 
 3. **Updated .gitignore** - Consumer repos have old partial version, missing new entries for:
    - `verifier-diff-summary.md`
@@ -135,11 +135,29 @@
 - Issue body updated with AGENT_ISSUE_TEMPLATE format
 - `agents:formatted` label added after successful formatting
 
-### Phase 3 Target: Advanced Features (Optional)
+### Phase 3 Target: Pre-Agent Intelligence (4 Capabilities)
 
-- `capability_check.py` integrated into issue intake OR archived
-- `task_decomposer.py` integrated for large issues OR archived
-- Dedup/semantic matching for issue triage OR archived
+**3A. Capability Check (Pre-Agent Gate)**
+- `capability_check.py` runs before `agent:codex` assignment
+- Identifies issues the agent cannot complete (external dependencies, out-of-scope work, missing credentials)
+- **Supplements** `agents:optimize` workflow (quality check) with feasibility check
+- Adds `needs-human` label + explanation when agent cannot proceed
+
+**3B. Task Decomposition (Large Issue Handling)**
+- `task_decomposer.py` auto-splits issues with 5+ implied tasks
+- Creates linked sub-issues or checklist within parent issue
+- Triggers via `agents:decompose` label (new)
+
+**3C. Duplicate Detection (Issue Triage)**
+- `issue_dedup.py` checks new issues against open issues
+- Posts warning comment if duplicate detected (>85% similarity)
+- Creates link to potential duplicate for human review
+- **Testing focus:** Validate false positive rate before auto-closing
+
+**3D. Semantic Label Matching (Auto-Labeling)**
+- `label_matcher.py` suggests appropriate labels based on issue content
+- Posts comment with label suggestions or auto-applies if confidence >90%
+- Uses `semantic_matcher.py` for embedding-based similarity
 
 ---
 
@@ -159,12 +177,11 @@
 - [x] Commit any fixes to main
 
 **Step 1B: Deploy to Consumer Repos**
-1. ✅ All consumer repos have verifier labels (6/7 active, Collab-Admin pending)
+1. ✅ All consumer repos have verifier labels (7/7 - all synced)
 2. ✅ Sync workflow runs automatically on template changes
 3. ✅ **Major cleanup completed 2026-01-07:**
    - 26 superseded sync PRs closed across 5 consumer repos
-   - 5 most recent sync PRs merged successfully
-   - Collab-Admin PR #113 blocked by lint failures (Python CI / lint-ruff)
+   - 6 most recent sync PRs merged successfully (including Collab-Admin PR #113)
    - **Bot Comment Analysis:** Reviewed 40+ comments across sync PRs
      - **Finding:** Zero substantive code review comments from Copilot/Codex agent bots
      - All comments were keepalive/autofix operational noise (status updates, missing-issue warnings)
@@ -213,7 +230,7 @@
 **Step 2A: Labels & Sync**
 1. ✅ Labels created via sync workflow (`agents:format`, `agents:formatted`, `agents:optimize`, `agents:apply-suggestions`)
 2. ✅ `agents-issue-optimizer.yml` is in sync manifest
-3. ✅ Sync PRs merged (5/6 repos as of 2026-01-07, Collab-Admin pending)
+3. ✅ Sync PRs merged (7/7 repos as of 2026-01-07, all synced)
 4. ✅ **Tested on Manager-Database #184:**
    - ✅ Created unstructured test issue
    - ✅ Added `agents:optimize` label → Workflow posted valuable analysis (8.6/10 quality)
@@ -237,33 +254,150 @@
   - ✅ Labels updated correctly (`agents:formatted` added)
   - ✅ **Updated:** Now uses `use_llm=True` to populate sections from analysis - pending retest
 
-### Phase 3: Archive Unused Scripts (1 Step)
+### Phase 3: Pre-Agent Intelligence (4 Steps)
+
+**Status: Planning - Test Cycle Defined**
+
+**Step 3A: Capability Check Integration**
+
+1. **Relationship to existing workflows:**
+   - `agents:optimize` → "Is this issue well-written?" (quality check)
+   - `capability_check.py` → "Can the agent DO this?" (feasibility gate)
+   - **Relationship:** The capability check supplements the optimizer and runs BEFORE agent assignment on issues
+
+2. **Proposed workflow integration:**
+   ```
+   Issue Created → agents:optimize (quality) → agents:apply-suggestions (format)
+                                                        ↓
+   User adds agent:codex → capability_check.py runs → If NOT capable:
+                                                        → Add needs-human label
+                                                        → Post blocker explanation
+                                                      If capable:
+                                                        → Proceed with agent
+   ```
+
+3. **Implementation tasks:**
+   - [ ] Create `agents-capability-check.yml` workflow
+   - [ ] Add `needs-human` label to consumer repos via sync
+   - [ ] Trigger when the `agent:codex` label is added OR when a new dedicated workflow label is applied
+   - [ ] Post comment explaining blockers when agent cannot proceed
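The gate described in Step 3A can be sketched in Python as a pure decision function. This is a minimal sketch under assumptions: `CapabilityResult` and `plan_gate_actions` are hypothetical names, and the real output of `capability_check.py` may have a different shape. It only decides which actions to take; the workflow would still need to apply the label and post the comment.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityResult:
    """Hypothetical result shape; capability_check.py's real output may differ."""
    capable: bool
    blockers: list[str] = field(default_factory=list)


def plan_gate_actions(result: CapabilityResult) -> dict:
    """Translate a capability result into the actions from the proposed flow."""
    if result.capable:
        # Feasible: nothing to label or post, the agent proceeds.
        return {"proceed": True, "add_labels": [], "comment": None}
    # Not feasible: add needs-human and explain each blocker in a comment.
    lines = ["The agent cannot proceed with this issue:"]
    lines += [f"- {blocker}" for blocker in result.blockers]
    return {
        "proceed": False,
        "add_labels": ["needs-human"],
        "comment": "\n".join(lines),
    }
```

Keeping the decision separate from the GitHub API calls would let the gate be unit-tested without a live repository.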
+
+**Step 3B: Task Decomposition**
+
+1. **Implementation tasks:**
+   - [ ] Create `agents-decompose.yml` workflow
+   - [ ] Add `agents:decompose` label to label sync config
+   - [ ] Call `task_decomposer.py` when label applied
+   - [ ] Output: Either create sub-issues OR add checklist to parent
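The checklist output option in Step 3B could look roughly like the following. This is a sketch only: the function names and the `### Sub-tasks` heading are assumptions, the 5-task threshold comes from the Phase 3 target above, and the list of tasks is assumed to come from `task_decomposer.py` in some form.

```python
def should_decompose(tasks: list[str], min_tasks: int = 5) -> bool:
    """Apply the plan's '5+ implied tasks' threshold."""
    return len(tasks) >= min_tasks


def render_checklist(tasks: list[str]) -> str:
    """Render tasks as a Markdown task list, skipping blank entries."""
    return "\n".join(f"- [ ] {task.strip()}" for task in tasks if task.strip())


def append_checklist(issue_body: str, tasks: list[str],
                     heading: str = "### Sub-tasks") -> str:
    """Return the parent issue body with a checklist section appended."""
    checklist = render_checklist(tasks)
    if not checklist:
        # Nothing to add; leave the issue body untouched.
        return issue_body
    return f"{issue_body.rstrip()}\n\n{heading}\n{checklist}\n"
```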
+
+**Step 3C: Duplicate Detection (Testing Focus)**
+
+1. **Critical concern:** False positives - we don't want to close valid issues
+2. **Approach:** Comment-only mode first, no auto-close
+3. **Implementation tasks:**
+   - [ ] Create `agents-dedup.yml` workflow
+   - [ ] Trigger on issue opened
+   - [ ] Compare against open issues using embeddings
+   - [ ] Post comment if >85% similarity detected (link to potential duplicate)
+   - [ ] Track false positive rate over testing period
+
+4. **Testing metrics to track:**
+   - True positive rate (correctly identified duplicates)
+   - False positive rate (target: <5%)
+   - Human override rate (user keeps both issues open)
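As an illustration of the comment-only comparison in Step 3C, the sketch below flags open issues whose embedding similarity to a new issue exceeds 0.85. It assumes embeddings are already available as plain lists of floats and uses cosine similarity, which is an assumption: the plan does not say which similarity measure `issue_dedup.py` uses. The function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_potential_duplicates(new_vec: list[float],
                              open_issues: dict[int, list[float]],
                              threshold: float = 0.85) -> list[tuple[int, float]]:
    """Return (issue_number, similarity) pairs above the threshold, best first."""
    scored = [(number, cosine_similarity(new_vec, vec))
              for number, vec in open_issues.items()]
    flagged = [(number, score) for number, score in scored if score > threshold]
    return sorted(flagged, key=lambda match: match[1], reverse=True)
```

In comment-only mode the returned issue numbers would be linked in a warning comment, and nothing would be closed automatically.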
+
+**Step 3D: Semantic Label Matching**
+
+1. **Implementation tasks:**
+   - [ ] Create `agents-auto-label.yml` workflow OR integrate into an existing workflow
+   - [ ] Use `label_matcher.py` for semantic similarity
+   - [ ] Post comment with suggestions OR auto-apply at >90% confidence
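The suggest-or-apply split in Step 3D might be expressed as below. This is a hedged sketch: it assumes `label_matcher.py` can yield a confidence score per label, the function name is hypothetical, and the `suggest_above` floor of 0.50 is an illustrative value that does not appear in the plan. Only the 0.90 auto-apply threshold comes from the plan.

```python
def route_label_suggestions(scores: dict[str, float],
                            auto_apply_above: float = 0.90,
                            suggest_above: float = 0.50) -> dict[str, list[str]]:
    """Split labels into auto-apply and suggest-in-comment buckets, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Above the plan's 90% confidence threshold: apply without asking.
    auto_apply = [label for label, score in ranked if score > auto_apply_above]
    # Between the (assumed) floor and the threshold: suggest in a comment.
    suggest = [label for label, score in ranked
               if suggest_above < score <= auto_apply_above]
    return {"auto_apply": auto_apply, "suggest": suggest}
```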
+
+---
+
+## Phase 3 Testing Plan (Manager-Database)
+
+**Test Repository:** Manager-Database
+**Test Duration:** 2 weeks (7 issues minimum)
+**Start Date:** Ready to begin (all consumer repos synced)
+
+### Test Issue #1: Capability Check Validation
+
+**Purpose:** Validate capability_check.py correctly identifies agent blockers
+
+**Test Scenarios:**
+1. **Issue requiring external API** - Should flag "needs credentials/external dependency"
+2. **Issue requiring database migration** - Should flag "needs infrastructure/manual step"
+3. **Normal code-only issue** - Should pass capability check
+
+**Test Issue Ideas for Manager-Database:**
+- "Integrate with external payment API" (should fail - external dep)
+- "Add database migration for new schema" (should fail - infra)
+- "Refactor logging module" (should pass - code only)
+
+### Test Issue #2: Task Decomposition Validation
 
-**Status: Decision Deferred**
+**Purpose:** Validate task_decomposer.py produces useful sub-tasks
 
-These scripts are fully tested (145 tests passing) but not yet integrated:
-- `capability_check.py` - Pre-flight check for agent capability on tasks
-- `task_decomposer.py` - Break large tasks into smaller actionable items  
-- `issue_dedup.py` - Detect duplicate issues via embeddings
-- `label_matcher.py` - Semantic label matching
-- `semantic_matcher.py` - Shared embedding utilities
+**Test Scenario:**
+- Create large issue with 5+ implied tasks
+- Apply `agents:decompose` label
+- Verify sub-tasks are actionable and correctly scoped
 
-**Recommendation:** Keep & Document for future Phase 3+ integration
-- All scripts have full test coverage
-- Semantic matching could enhance issue triage
-- Capability check could prevent failed agent attempts
+**Test Issue Idea:**
+- "Implement comprehensive health check endpoint with retry logic, circuit breaker, metrics, and alerting integration"
+
+### Test Issue #3: Duplicate Detection Validation
+
+**Purpose:** Measure false positive rate for issue_dedup.py
+
+**Test Scenarios:**
+1. **True duplicate** - Create issue very similar to existing (should detect)
+2. **Related but different** - Create issue in same area but different ask (should NOT flag)
+3. **Unrelated** - Create issue in different area (should NOT flag)
+
+**Success Criteria:**
+- True positives detected: 100%
+- False positive rate: <5%
+- Clear explanation in comment linking to potential duplicate
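The success criteria above could be computed from hand-labeled test outcomes roughly as follows. This sketch assumes "false positive rate" means flagged non-duplicates divided by all non-duplicates; the plan does not define it, and a team could instead mean the share of flags that were wrong. The function name and input shape are hypothetical.

```python
from typing import Optional


def dedup_metrics(outcomes: list[tuple[bool, bool]]) -> dict[str, Optional[float]]:
    """Compute rates from (was_flagged, is_actual_duplicate) pairs."""
    tp = sum(1 for flagged, dup in outcomes if flagged and dup)
    fn = sum(1 for flagged, dup in outcomes if not flagged and dup)
    fp = sum(1 for flagged, dup in outcomes if flagged and not dup)
    tn = sum(1 for flagged, dup in outcomes if not flagged and not dup)
    return {
        # Share of real duplicates that were detected (target: 100%).
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else None,
        # Share of non-duplicates that were wrongly flagged (target: <5%).
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
    }
```

With only a handful of test issues these rates are coarse, so they are better read as a sanity check than as a precise estimate.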
+
+### Test Issue #4: Label Matching Validation
+
+**Purpose:** Validate label_matcher.py suggests correct labels
+
+**Test Scenario:**
+- Create unlabeled issues in different categories
+- Verify label suggestions match expected labels
+- Track suggestion accuracy
+
+### Testing Metrics Dashboard
+
+| Script | Test Issues | True Positives | False Positives | Accuracy | Status |
+|--------|-------------|----------------|-----------------|----------|--------|
+| capability_check.py | 0/3 | - | - | - | ⏳ Pending |
+| task_decomposer.py | 0/2 | - | - | - | ⏳ Pending |
+| issue_dedup.py | 0/3 | - | - | <5% target | ⏳ Pending |
+| label_matcher.py | 0/3 | - | - | - | ⏳ Pending |
+
+**Total test issues needed:** ~11 issues on Manager-Database (3 + 2 + 3 + 3, per the table above)
 
 ---
 
 ## Summary
 
 | Phase | Scope | Steps | Test Repo | Status |
 |-------|-------|-------|-----------|--------|
-| 1 | PR Verification | 2 | Manager-Database | ✅ Deployed, 5/6 repos synced |
+| 1 | PR Verification | 2 | Manager-Database | ✅ Deployed, 7/7 repos synced |
 | 2 | Issue Formatting | 1 | Manager-Database | ✅ Deployed & tested - Quality: 7.5/10 |
-| 3 | Cleanup/Archive | 1 | N/A | Deferred (scripts retained) |
+| 3 | Pre-Agent Intelligence | 4 | Manager-Database | 🔄 Planning - Testing cycle defined |
+
+**Phase 3 Components:**
+- **3A:** Capability Check - Pre-agent feasibility gate (supplements agents:optimize)
+- **3B:** Task Decomposition - Auto-split large issues
+- **3C:** Duplicate Detection - Comment-only mode, track false positives
+- **3D:** Semantic Labeling - Auto-suggest/apply labels
 
-**Total: 4 deployment actions** - All infrastructure deployed. Major sync cleanup completed 2026-01-07 (26 superseded PRs closed, 5/6 repos synced). Collab-Admin PR #113 blocked by lint failures.
+**Total: 7 deployment actions** - Phases 1-2 deployed. Phase 3 testing plan defined for Manager-Database (~11 test issues).
 
 **Substantive Quality Assessment:**
 - **agents:optimize:** 8.6/10 - Provides valuable, actionable analysis
@@ -277,8 +411,8 @@ These scripts are fully tested (145 tests passing) but not yet integrated:
 ### Immediate (Ready Now)
 1. ~~**Merge PR #633**~~ ✅ Merged - GPT-5.2 for compare mode
 2. ~~**Merge PR #643**~~ ✅ Merged - Model name in comparison reports + disable auto-issue creation
-3. ~~**Consumer repo sync cleanup**~~ ✅ Completed 2026-01-07 - 26 superseded PRs closed, 5/6 merged
-4. **Resolve Collab-Admin sync** - ⏳ PR #113 blocked by lint failures (Python CI / lint-ruff)
+3. ~~**Consumer repo sync cleanup**~~ ✅ Completed 2026-01-07 - 26 superseded PRs closed, 6 most recent sync PRs merged
+4. ~~**Resolve Collab-Admin sync**~~ ✅ PR #113 merged 2026-01-07
 5. ~~**Live test `agents:optimize`**~~ ✅ Tested on Manager-Database #184 - Quality: 8.6/10
 6. ~~**Live test `agents:apply-suggestions`**~~ ✅ Tested on Manager-Database #184 - Quality: 6/10
 
@@ -294,11 +428,18 @@ These scripts are fully tested (145 tests passing) but not yet integrated:
    - "Implement logging before health checks"
    - "Retry logic blocks enhanced error logging"
 
+### Phase 3 Implementation (Next)
+1. **Step 3A: Capability Check** - Create `agents-capability-check.yml`, integrate with issue workflow
+   - Supplements existing agents:optimize (quality) with feasibility gate
+   - Runs BEFORE agent assignment, not after
+2. **Step 3B: Task Decomposition** - Create `agents-decompose.yml` workflow
+3. **Step 3C: Duplicate Detection** - Create `agents-dedup.yml` (comment-only, track false positives)
+4. **Step 3D: Label Matching** - Create `agents-auto-label.yml` or integrate `label_matcher.py` into an existing issue workflow
+
 ### Future Enhancements
 1. **Compare mode refinement** - Currently uses gpt-4o (GitHub) vs gpt-5.2 (OpenAI)
 2. **Model auto-update** - Use `scripts/update_model_list.sh` periodically
 3. **Domain-specific guidance** - Add prompts for retry patterns, health check endpoints
-4. **Phase 3 scripts** - Decide on capability_check.py and task_decomposer.py integration
 
 ### Test Results Documentation
 Full substantive analysis available at `/tmp/substantive_test_analysis.md`: