Merged
Changes from 1 commit
83 changes: 60 additions & 23 deletions docs/plans/SHORT_TERM_PLAN.md
@@ -103,36 +103,50 @@

## Immediate Next Steps (Based on Test Results)

### 🔴 High Priority Fixes Needed
### ✅ Workflow-Level Fixes Applied (PR #731 - Merged)

**1. Fix Suite C: Duplicate Detection - 50% False Positive Rate**
- **Problem:** All 4 test issues got `duplicate` label, but only 2 were actual duplicates
- **Root Cause:** Similarity threshold too low or matching too aggressive
- **Action:** Review `issue_dedup.py` similarity threshold, currently flagging unrelated issues
- **Files:** `scripts/issue_dedup.py`, `.github/workflows/agents-dedup.yml`
**1. Suite C: Duplicate Detection Threshold**
- **Fix Applied:** Raised threshold 0.85→0.92, added 40% title word overlap filter
- **Status:** ✅ Merged in PR #731
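
A minimal sketch of how the two conditions in this fix could combine. This is not the code in `scripts/issue_dedup.py`. The function names are hypothetical, and the overlap definition (shared words divided by the shorter title's word count) is an assumption, since the plan only states "40% title word overlap".

```python
import re

SIMILARITY_THRESHOLD = 0.92  # raised from 0.85, per the fix above
MIN_TITLE_OVERLAP = 0.40     # 40% title word overlap filter


def title_words(title: str) -> set[str]:
    """Lowercase a title and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def title_overlap(title_a: str, title_b: str) -> float:
    """Fraction of the shorter title's words that also occur in the other title.

    Assumption: the real filter may define overlap differently.
    """
    words_a, words_b = title_words(title_a), title_words(title_b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))


def is_probable_duplicate(similarity: float, title_a: str, title_b: str) -> bool:
    """Flag a duplicate only when BOTH checks pass (hypothetical helper)."""
    return (
        similarity >= SIMILARITY_THRESHOLD
        and title_overlap(title_a, title_b) >= MIN_TITLE_OVERLAP
    )
```

Requiring both checks means a high embedding similarity alone no longer triggers the `duplicate` label when the titles share almost no words.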

**2. Fix Suite D: Auto-Label Over-Labeling**
- **Problem:** Both issues got BOTH `bug` and `enhancement` labels instead of the most appropriate one
- **Root Cause:** Applying all labels above threshold instead of best match only
- **Action:** Modify `label_matcher.py` to apply only the highest-scoring label
- **Files:** `scripts/label_matcher.py`, `.github/workflows/agents-auto-label.yml`
**2. Suite D: Auto-Label Best-Match-Only**
- **Fix Applied:** Workflow now applies ONLY highest-scoring label, others become suggestions
- **Status:** ✅ Merged in PR #731
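
The best-match-only rule can be sketched as below. This is an illustration under assumptions, not the workflow's actual logic: `pick_labels` and its default `threshold` value are hypothetical, and the plan does not state the real cutoff.

```python
from typing import Optional


def pick_labels(
    scores: dict[str, float], threshold: float = 0.75
) -> tuple[Optional[str], list[str]]:
    """Return (label_to_apply, suggested_labels).

    Only the highest-scoring label at or above the threshold is applied;
    the remaining qualifying labels are returned as suggestions.
    The threshold default is an assumed value for illustration.
    """
    candidates = sorted(
        ((label, score) for label, score in scores.items() if score >= threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    if not candidates:
        return None, []
    applied = candidates[0][0]
    suggestions = [label for label, _ in candidates[1:]]
    return applied, suggestions
```

With scores of `bug` 0.91 and `enhancement` 0.80, this applies `bug` and leaves `enhancement` as a suggestion instead of labeling the issue with both.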

**3. Investigate Suite A #237 Workflow Failure**
- **Problem:** "Add database migration for user roles" workflow failed
- **Action:** Check workflow logs, identify error cause
- **Issue:** Manager-Database #237
### 🔴 Critical Bug Found During Re-Testing (PR #733 - Open)

**Re-Test Conducted:** Created 5 new test issues (#254-258) in Manager-Database

**Result:** ALL 5 issues got 15+ labels each (bug, enhancement, duplicate, documentation, agents:*, verify:*, etc.)

**Root Cause:** `label_matcher.py` keyword matching returned 0.95 for ANY token overlap:
```python
# Lines 250-251 returned 0.95 for ANY word match
if label_tokens and label_tokens.intersection(tokens):
    return 0.95
```
Common words like "issue", "request", and "new" appear in most label descriptions and in most issue texts, so almost any label could hit the 0.95 score on almost any issue.

**Fix (PR #733):**
- Added 50+ stopwords to filter out common words
- Require label NAME tokens (not description) to match for 0.95 score
- Use label NAME only for bug/feature/docs category detection
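
A rough sketch of the first two points, assuming a tokenizer and a short stopword list invented for illustration. The real list in PR #733 has 50+ entries and is not reproduced here. The function names and the `0.0` fallback are also assumptions; in the actual script a non-match presumably falls through to semantic matching.

```python
import re

# Illustrative subset only; the real stopword list is longer.
STOPWORDS = frozenset(
    {"issue", "request", "new", "the", "a", "an", "to", "for", "and", "of"}
)


def tokenize(text: str) -> set[str]:
    """Lowercase, split into alphanumeric tokens, and drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def keyword_score(label_name: str, issue_text: str) -> float:
    """Return 0.95 only when a non-stopword token of the label NAME
    appears in the issue text. The label description is ignored.
    """
    name_tokens = tokenize(label_name)
    if name_tokens and name_tokens & tokenize(issue_text):
        return 0.95
    return 0.0  # assumed fallback; no keyword match
```

Because stopwords are removed before the intersection, a label named `feature request` no longer matches an issue merely because both contain "request".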

**Test Results After Fix:**
- Bug query → only 'bug' label (0.91) ✅
- Feature query → only 'enhancement' label (0.9) ✅
- Generic query → relies on semantic matching only ✅

### 🟡 Medium Priority

**4. Review Suite A Capability Check Accuracy**
- #236 (Stripe) should have been flagged as BLOCKED but wasn't
- #239 (Logging) got `agent:needs-attention` when it should have proceeded
- May need prompt tuning in `capability_check.py`
**3. Suite A #237: Azure Content Filter Issue (Not a Bug)**
- **Problem:** Workflow failed with "jailbreak" content filter error
- **Root Cause:** Azure OpenAI false positive on "database migration" + "admin access" text
- **Status:** Not a code bug - no fix needed

**5. Review Suite B Decomposition Quality**
**4. Review Suite B Decomposition Quality**
- PRs #249, #250, #251 were created successfully
- Need to manually review decomposition quality
- Verify sub-tasks are actionable and appropriately sized

---

@@ -579,6 +593,10 @@ PRs #696-699 created **test infrastructure** (unit tests, smoke test CLI), not t
- [x] PR #720: Handle rate limits gracefully in verifier CI wait
- [x] PR #726: Prevent duplicate follow-up issues + context builder rate limits

#### Phase 3 Accuracy Fixes
- [x] PR #731: Reduce false positives in auto-label and duplicate detection (workflow-level)
- [ ] PR #733: Fix keyword matcher false positives in label_matcher.py (script-level) ← **OPEN**

#### Test Suite Tooling (NOT Execution)
- [x] PR #699 (Issue #690): Created `run_consumer_repo_tests.py` + 60 capability check unit tests
- [x] PR #696 (Issue #691): Created 64 task decomposer unit tests
@@ -598,11 +616,30 @@ PRs #696-699 created **test infrastructure** (unit tests, smoke test CLI), not t
- [x] trip-planner: 5 workflow syncs

### Week 2 Checklist
- [ ] Day 6-8: Resolve 3 conflicted PRs
- [x] Day 6-8: Resolve 3 conflicted PRs ✅ **ALL RESOLVED**
- Manager-Database #134: MERGEABLE (conflict resolved)
- Manager-Database #135: MERGEABLE (conflict resolved)
- Portable-Alpha #1049: MERGED
- [ ] Day 9-10: Label cleanup audit
- [ ] Day 11-12: Document test results
- [x] Day 11-12: Document test results ✅ **DONE** (moved up)
- [ ] Day 13-14: Plan Phase 4 rollout

### Week 2 Progress (January 10, 2026 - Evening)

**Re-Testing Suite C & D:**
- Created 5 test issues (#254-258) in Manager-Database
- Discovered critical bug: ALL labels applied to ALL issues
- Root cause: `label_matcher.py` keyword matching too permissive
- Fix created: PR #733
- Test issues closed after revealing bug

**PRs Created Today:**
| PR | Title | Status |
|----|-------|--------|
| #731 | fix: Reduce false positives in auto-label and duplicate detection | ✅ Merged |
| #732 | docs: Update SHORT_TERM_PLAN with PR #731 and Week 2 progress | Open |
| #733 | fix: Prevent keyword matcher from applying all labels | Open |

---

## Post-Plan: Phase 4 Preview