Merged
Changes from 1 commit
201 changes: 171 additions & 30 deletions docs/plans/langchain-post-code-rollout.md
@@ -1,8 +1,8 @@
# LangChain Post-Code Production Capabilities - Evaluation & Rollout Plan

> **Date:** January 7, 2026
> **Status:** Phase 1 & 2 Deployed - Active in Production
> **Last Validation:** 2026-01-07 (Post-Sync Cleanup)
> **Status:** Phase 3 Planning - Testing Cycle Defined
> **Last Validation:** 2026-01-07 (Phase 3 Test Plan Added)

---

@@ -75,7 +75,7 @@
- ✅ Travel-Plan-Permission (synced 2026-01-07)
- ✅ Portable-Alpha-Extension-Model (synced 2026-01-07)
- ✅ Trend_Model_Project (synced 2026-01-07)
- ⚠️ Collab-Admin (sync PR #113 pending - has lint failures)
- Collab-Admin (synced 2026-01-07, PR #113 merged)

2. **Format Labels** - All 7 consumer repos have `agents:format`, `agents:formatted`, `agents:optimize`, `agents:apply-suggestions`:
- ✅ Manager-Database (tested live - issue #184, synced 2026-01-07)
@@ -84,7 +84,7 @@
- ✅ Travel-Plan-Permission (synced 2026-01-07)
- ✅ Portable-Alpha-Extension-Model (synced 2026-01-07)
- ✅ Trend_Model_Project (synced 2026-01-07)
- ⚠️ Collab-Admin (sync PR #113 pending - has lint failures)
- Collab-Admin (synced 2026-01-07, PR #113 merged)

3. **Updated .gitignore** - Consumer repos have old partial version, missing new entries for:
- `verifier-diff-summary.md`
@@ -135,11 +135,29 @@
- Issue body updated with AGENT_ISSUE_TEMPLATE format
- `agents:formatted` label added after successful formatting

### Phase 3 Target: Advanced Features (Optional)
### Phase 3 Target: Pre-Agent Intelligence (4 Capabilities)

- `capability_check.py` integrated into issue intake OR archived
- `task_decomposer.py` integrated for large issues OR archived
- Dedup/semantic matching for issue triage OR archived
**3A. Capability Check (Pre-Agent Gate)**
- `capability_check.py` runs before `agent:codex` assignment
- Identifies issues agent cannot complete (external deps, out-of-scope, credentials needed)
- **Supplements** `agents:optimize` workflow (quality check) with feasibility check
- Adds `needs-human` label + explanation when agent cannot proceed

**3B. Task Decomposition (Large Issue Handling)**
- `task_decomposer.py` auto-splits issues with 5+ implied tasks
- Creates linked sub-issues or checklist within parent issue
- Triggers via `agents:decompose` label (new)

**3C. Duplicate Detection (Issue Triage)**
- `issue_dedup.py` checks new issues against open issues
- Posts warning comment if duplicate detected (>85% similarity)
- Creates link to potential duplicate for human review
- **Testing focus:** Validate false positive rate before auto-closing

**3D. Semantic Label Matching (Auto-Labeling)**
- `label_matcher.py` suggests appropriate labels based on issue content
- Posts comment with label suggestions or auto-applies if confidence >90%
- Uses `semantic_matcher.py` for embedding-based similarity

---

@@ -159,12 +177,11 @@
- [x] Commit any fixes to main

**Step 1B: Deploy to Consumer Repos**
1. ✅ All consumer repos have verifier labels (6/7 active, Collab-Admin pending)
1. ✅ All consumer repos have verifier labels (7/7 - all synced)
2. ✅ Sync workflow runs automatically on template changes
3. ✅ **Major cleanup completed 2026-01-07:**
- 26 superseded sync PRs closed across 5 consumer repos
- 5 most recent sync PRs merged successfully
- Collab-Admin PR #113 blocked by lint failures (Python CI / lint-ruff)
- 6 most recent sync PRs merged successfully (including Collab-Admin PR #113)
- **Bot Comment Analysis:** Reviewed 40+ comments across sync PRs
- **Finding:** Zero substantive code review comments from Copilot/Codex agent bots
- All comments were keepalive/autofix operational noise (status updates, missing-issue warnings)
@@ -213,7 +230,7 @@
**Step 2A: Labels & Sync**
1. ✅ Labels created via sync workflow (`agents:format`, `agents:formatted`, `agents:optimize`, `agents:apply-suggestions`)
2. ✅ `agents-issue-optimizer.yml` is in sync manifest
3. ✅ Sync PRs merged (5/6 repos as of 2026-01-07, Collab-Admin pending)
3. ✅ Sync PRs merged (7/7 repos as of 2026-01-07, all synced)
4. ✅ **Tested on Manager-Database #184:**
- ✅ Created unstructured test issue
- ✅ Added `agents:optimize` label → Workflow posted valuable analysis (8.6/10 quality)
@@ -237,33 +254,150 @@
- ✅ Labels updated correctly (`agents:formatted` added)
- ✅ **Updated:** Now uses `use_llm=True` to populate sections from analysis - pending retest

### Phase 3: Archive Unused Scripts (1 Step)
### Phase 3: Pre-Agent Intelligence (4 Steps)

**Status: Planning - Test Cycle Defined**

**Step 3A: Capability Check Integration**

1. **Relationship to existing workflows:**
- `agents:optimize` → "Is this issue well-written?" (quality check)
- `capability_check.py` → "Can the agent DO this?" (feasibility gate)
- **Relationship:** The capability check supplements the optimizer and runs BEFORE agent assignment on issues

2. **Proposed workflow integration:**
```
Issue Created → agents:optimize (quality) → agents:apply-suggestions (format)
User adds agent:codex → capability_check.py runs
  If NOT capable:
    → Add needs-human label
    → Post blocker explanation
  If capable:
    → Proceed with agent
```

3. **Implementation tasks:**
- [ ] Create `agents-capability-check.yml` workflow
- [ ] Add `needs-human` label to consumer repos via sync
- [ ] Trigger when the `agent:codex` label is added OR when a new dedicated workflow label is added
- [ ] Post comment explaining blockers when agent cannot proceed
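
The gate described in Step 3A could be sketched roughly as follows. This is a minimal illustration, not the actual interface of `capability_check.py`: the `CapabilityResult` shape and the `plan_gate_actions` helper are hypothetical names introduced here, and only the `needs-human` label comes from the plan.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityResult:
    """Hypothetical result shape; the real capability_check.py output may differ."""
    capable: bool
    blockers: list[str] = field(default_factory=list)


def plan_gate_actions(result: CapabilityResult) -> dict:
    """Translate a capability result into the label/comment actions from the plan."""
    if result.capable:
        # Capable: no label, no comment, the agent proceeds.
        return {"add_labels": [], "comment": None, "proceed": True}

    # Not capable: add needs-human and explain every blocker in one comment.
    lines = ["Agent cannot proceed on this issue:"]
    lines += [f"- {blocker}" for blocker in result.blockers]
    return {
        "add_labels": ["needs-human"],
        "comment": "\n".join(lines),
        "proceed": False,
    }


if __name__ == "__main__":
    blocked = CapabilityResult(capable=False, blockers=["needs credentials/external dependency"])
    print(plan_gate_actions(blocked)["comment"])
```

Keeping the decision as a pure function (actions out, no API calls inside) would let the workflow apply labels and comments in a separate step and makes the gate easy to unit test.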

**Step 3B: Task Decomposition**

1. **Implementation tasks:**
- [ ] Create `agents-decompose.yml` workflow
- [ ] Add `agents:decompose` label to label sync config
- [ ] Call `task_decomposer.py` when label applied
- [ ] Output: Either create sub-issues OR add checklist to parent
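
As a rough illustration of the checklist output path in Step 3B, the sketch below assumes the decomposer has already produced a list of task titles. The helper names `should_decompose` and `render_checklist` are hypothetical and are not taken from `task_decomposer.py`; the 5-task threshold is the one stated in the Phase 3 target.

```python
MIN_IMPLIED_TASKS = 5  # from the plan: split issues with 5+ implied tasks


def should_decompose(tasks: list[str], min_tasks: int = MIN_IMPLIED_TASKS) -> bool:
    """Return True when the issue implies enough tasks to be worth splitting."""
    return len(tasks) >= min_tasks


def render_checklist(tasks: list[str]) -> str:
    """Render task titles as a GitHub-style markdown checklist, skipping blanks."""
    return "\n".join(f"- [ ] {task.strip()}" for task in tasks if task.strip())


if __name__ == "__main__":
    tasks = [
        "Health check endpoint",
        "Retry logic",
        "Circuit breaker",
        "Metrics",
        "Alerting integration",
    ]
    if should_decompose(tasks):
        print(render_checklist(tasks))
```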

**Step 3C: Duplicate Detection (Testing Focus)**

1. **Critical concern:** False positives - we don't want to close valid issues
2. **Approach:** Comment-only mode first, no auto-close
3. **Implementation tasks:**
- [ ] Create `agents-dedup.yml` workflow
- [ ] Trigger on issue opened
- [ ] Compare against open issues using embeddings
- [ ] Post comment if >85% similarity detected (link to potential duplicate)
- [ ] Track false positive rate over testing period

4. **Testing metrics to track:**
- True positive rate (correctly identified duplicates)
- False positive rate (target: <5%)
- Human override rate (user keeps both issues open)
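
The comment-only comparison in Step 3C might look something like the sketch below. It assumes embeddings are already available as plain lists of floats and uses cosine similarity, which is a common choice but is not confirmed as what `issue_dedup.py` does; `find_potential_duplicates` is a hypothetical helper name.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # from the plan: comment when similarity exceeds 85%


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_potential_duplicates(
    new_embedding: list[float],
    open_issues: dict[int, list[float]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> list[tuple[int, float]]:
    """Return (issue_number, score) pairs above the threshold, best match first."""
    scored = [
        (number, cosine_similarity(new_embedding, embedding))
        for number, embedding in open_issues.items()
    ]
    matches = [(number, score) for number, score in scored if score > threshold]
    return sorted(matches, key=lambda match: match[1], reverse=True)
```

The function only reports candidates; posting the comment and leaving the close decision to a human stays in the workflow, which matches the comment-only approach.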

**Step 3D: Semantic Label Matching**

1. **Implementation tasks:**
- [ ] Create `agents-auto-label.yml` workflow OR integrate into an existing workflow
- [ ] Use `label_matcher.py` for semantic similarity
- [ ] Post comment with suggestions OR auto-apply at >90% confidence
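
One way to picture the suggest-versus-auto-apply split in Step 3D is the sketch below. It assumes the matcher returns a confidence score per label in the range 0 to 1. The `split_label_suggestions` name and the lower `suggest_at` cutoff are assumptions made for illustration; only the 90% auto-apply threshold comes from the plan.

```python
AUTO_APPLY_CONFIDENCE = 0.90  # from the plan: auto-apply above 90% confidence


def split_label_suggestions(
    scores: dict[str, float],
    auto_apply_at: float = AUTO_APPLY_CONFIDENCE,
    suggest_at: float = 0.5,  # hypothetical lower cutoff for comment-only suggestions
) -> tuple[list[str], list[str]]:
    """Split scored labels into (auto_apply, suggest_only), each sorted by name."""
    auto_apply = sorted(label for label, score in scores.items() if score > auto_apply_at)
    suggest_only = sorted(
        label for label, score in scores.items() if suggest_at <= score <= auto_apply_at
    )
    return auto_apply, suggest_only


if __name__ == "__main__":
    auto, suggest = split_label_suggestions({"bug": 0.95, "docs": 0.7, "infra": 0.2})
    print("auto-apply:", auto)
    print("suggest:", suggest)
```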

---

## Phase 3 Testing Plan (Manager-Database)

**Test Repository:** Manager-Database
**Test Duration:** 2 weeks (7 issues minimum)
**Start Date:** Ready to begin (all consumer repos synced)

### Test Issue #1: Capability Check Validation

**Purpose:** Validate capability_check.py correctly identifies agent blockers

**Test Scenarios:**
1. **Issue requiring external API** - Should flag "needs credentials/external dependency"
2. **Issue requiring database migration** - Should flag "needs infrastructure/manual step"
3. **Normal code-only issue** - Should pass capability check

**Test Issue Ideas for Manager-Database:**
- "Integrate with external payment API" (should fail - external dep)
- "Add database migration for new schema" (should fail - infra)
- "Refactor logging module" (should pass - code only)

### Test Issue #2: Task Decomposition Validation

**Status: Decision Deferred**
**Purpose:** Validate task_decomposer.py produces useful sub-tasks

These scripts are fully tested (145 tests passing) but not yet integrated:
- `capability_check.py` - Pre-flight check for agent capability on tasks
- `task_decomposer.py` - Break large tasks into smaller actionable items
- `issue_dedup.py` - Detect duplicate issues via embeddings
- `label_matcher.py` - Semantic label matching
- `semantic_matcher.py` - Shared embedding utilities
**Test Scenario:**
- Create large issue with 5+ implied tasks
- Apply `agents:decompose` label
- Verify sub-tasks are actionable and correctly scoped

**Recommendation:** Keep & Document for future Phase 3+ integration
- All scripts have full test coverage
- Semantic matching could enhance issue triage
- Capability check could prevent failed agent attempts
**Test Issue Idea:**
- "Implement comprehensive health check endpoint with retry logic, circuit breaker, metrics, and alerting integration"

### Test Issue #3: Duplicate Detection Validation

**Purpose:** Measure false positive rate for issue_dedup.py

**Test Scenarios:**
1. **True duplicate** - Create issue very similar to existing (should detect)
2. **Related but different** - Create issue in same area but different ask (should NOT flag)
3. **Unrelated** - Create issue in different area (should NOT flag)

**Success Criteria:**
- True positives detected: 100%
- False positive rate: <5%
- Clear explanation in comment linking to potential duplicate
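
To check the success criteria above, the rates could be computed from recorded test outcomes along these lines. The sketch assumes the standard definitions (true positive rate = TP / (TP + FN), false positive rate = FP / (FP + TN)); the plan does not state which definition it intends, and `dedup_rates` is a hypothetical helper.

```python
from typing import Optional


def dedup_rates(outcomes: list[tuple[bool, bool]]) -> tuple[Optional[float], Optional[float]]:
    """Compute (true_positive_rate, false_positive_rate).

    Each outcome is (is_actual_duplicate, was_flagged). A rate is None when
    its denominator is zero, i.e. no issues of that kind were tested.
    """
    tp = sum(1 for dup, flagged in outcomes if dup and flagged)
    fn = sum(1 for dup, flagged in outcomes if dup and not flagged)
    fp = sum(1 for dup, flagged in outcomes if not dup and flagged)
    tn = sum(1 for dup, flagged in outcomes if not dup and not flagged)

    tpr = tp / (tp + fn) if (tp + fn) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    return tpr, fpr


if __name__ == "__main__":
    # The three scenarios above, all behaving as expected:
    # true duplicate flagged, related-but-different and unrelated not flagged.
    print(dedup_rates([(True, True), (False, False), (False, False)]))
```

With only three test issues the rates are very coarse (a single false flag gives a 50% false positive rate), so the <5% target is only meaningful once more non-duplicate issues have been recorded.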

### Test Issue #4: Label Matching Validation

**Purpose:** Validate label_matcher.py suggests correct labels

**Test Scenario:**
- Create unlabeled issues in different categories
- Verify label suggestions match expected labels
- Track suggestion accuracy

### Testing Metrics Dashboard

| Script | Test Issues | True Positives | False Positives | Accuracy | Status |
|--------|-------------|----------------|-----------------|----------|--------|
| capability_check.py | 0/3 | - | - | - | ⏳ Pending |
| task_decomposer.py | 0/2 | - | - | - | ⏳ Pending |
| issue_dedup.py | 0/3 | - | - | <5% target | ⏳ Pending |
| label_matcher.py | 0/3 | - | - | - | ⏳ Pending |

**Total test issues needed:** ~11 issues on Manager-Database

---

## Summary

| Phase | Scope | Steps | Test Repo | Status |
|-------|-------|-------|-----------|--------|
| 1 | PR Verification | 2 | Manager-Database | ✅ Deployed, 5/6 repos synced |
| 1 | PR Verification | 2 | Manager-Database | ✅ Deployed, 6/7 repos synced |
| 2 | Issue Formatting | 1 | Manager-Database | ✅ Deployed & tested - Quality: 7.5/10 |
| 3 | Cleanup/Archive | 1 | N/A | Deferred (scripts retained) |
| 3 | Pre-Agent Intelligence | 4 | Manager-Database | 🔄 Planning - Testing cycle defined |

**Phase 3 Components:**
- **3A:** Capability Check - Pre-agent feasibility gate (supplements agents:optimize)
- **3B:** Task Decomposition - Auto-split large issues
- **3C:** Duplicate Detection - Comment-only mode, track false positives
- **3D:** Semantic Labeling - Auto-suggest/apply labels

**Total: 4 deployment actions** - All infrastructure deployed. Major sync cleanup completed 2026-01-07 (26 superseded PRs closed, 5/6 repos synced). Collab-Admin PR #113 blocked by lint failures.
**Total: 7 deployment actions** - Phases 1-2 deployed. Phase 3 testing plan defined for Manager-Database (~11 test issues).

**Substantive Quality Assessment:**
- **agents:optimize:** 8.6/10 - Provides valuable, actionable analysis
@@ -277,8 +411,8 @@ These scripts are fully tested (145 tests passing) but not yet integrated:
### Immediate (Ready Now)
1. ~~**Merge PR #633**~~ ✅ Merged - GPT-5.2 for compare mode
2. ~~**Merge PR #643**~~ ✅ Merged - Model name in comparison reports + disable auto-issue creation
3. ~~**Consumer repo sync cleanup**~~ ✅ Completed 2026-01-07 - 26 superseded PRs closed, 5/6 merged
4. **Resolve Collab-Admin sync** - ⏳ PR #113 blocked by lint failures (Python CI / lint-ruff)
3. ~~**Consumer repo sync cleanup**~~ ✅ Completed 2026-01-07 - 26 superseded PRs closed, 6/6 merged
4. ~~**Resolve Collab-Admin sync**~~ ✅ PR #113 merged 2026-01-07
5. ~~**Live test `agents:optimize`**~~ ✅ Tested on Manager-Database #184 - Quality: 8.6/10
6. ~~**Live test `agents:apply-suggestions`**~~ ✅ Tested on Manager-Database #184 - Quality: 6/10

@@ -294,11 +428,18 @@ These scripts are fully tested (145 tests passing) but not yet integrated:
- "Implement logging before health checks"
- "Retry logic blocks enhanced error logging"

### Phase 3 Implementation (Next)
1. **Step 3A: Capability Check** - Create `agents-capability-check.yml`, integrate with issue workflow
- Supplements existing agents:optimize (quality) with feasibility gate
- Runs BEFORE agent assignment, not after
2. **Step 3B: Task Decomposition** - Create `agents-decompose.yml` workflow
3. **Step 3C: Duplicate Detection** - Create `agents-dedup.yml` (comment-only, track false positives)
4. **Step 3D: Label Matching** - Integrate into issue workflow

### Future Enhancements
1. **Compare mode refinement** - Currently uses gpt-4o (GitHub) vs gpt-5.2 (OpenAI)
2. **Model auto-update** - Use `scripts/update_model_list.sh` periodically
3. **Domain-specific guidance** - Add prompts for retry patterns, health check endpoints
4. **Phase 3 scripts** - Decide on capability_check.py and task_decomposer.py integration

### Test Results Documentation
Full substantive analysis available at `/tmp/substantive_test_analysis.md`: