fix: Reduce false positives in auto-label and duplicate detection#731
Auto-Label (agents-auto-label.yml):
- Now applies only the BEST matching label instead of all labels above threshold
- Prevents issues from getting multiple conflicting labels like bug+enhancement
- Other high-confidence matches moved to suggestions comment

Duplicate Detection (agents-dedup.yml):
- Raised threshold from 0.85 to 0.92 for higher precision
- Added title word overlap filter (requires 40% overlap or 95% score)
- Reduces false positives from issues in same domain that share vocabulary
- Logs filtering decisions for debugging

Test results showed:
- Suite C had 50% false positive rate (4/4 flagged, expected 2/4)
- Suite D applied both bug+enhancement to all issues

Fixes identified in Manager-Database #243-248 testing.
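The duplicate-detection rule described above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the function names, the regex tokenization, and the reading of "40% overlap or 95% score" as an either/or override are assumptions. The `max`-based denominator mirrors the snippet quoted later in the review.

```python
import re

SIMILARITY_THRESHOLD = 0.92  # raised from 0.85 per the PR description
MIN_TITLE_OVERLAP = 0.40     # required share of overlapping title words
OVERRIDE_SCORE = 0.95        # assumed: at or above this, overlap is not required


def title_words(title: str) -> set[str]:
    """Lowercased word set (hypothetical tokenization)."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def is_probable_duplicate(score: float, new_title: str, match_title: str) -> bool:
    """Return True when a candidate passes both the score and title filters."""
    if score < SIMILARITY_THRESHOLD:
        return False  # below the raised similarity threshold
    if score >= OVERRIDE_SCORE:
        return True   # very high similarity bypasses the title check
    new_words = title_words(new_title)
    match_words = title_words(match_title)
    shared = new_words & match_words
    # Guard with 1 so two empty titles do not divide by zero.
    max_words = max(len(new_words), len(match_words), 1)
    return len(shared) / max_words >= MIN_TITLE_OVERLAP
```

Under these assumptions, a candidate scoring 0.93 is flagged only when its title shares enough words with the new issue's title, which is what filters out same-domain issues that merely share vocabulary in the body.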
Automated Status Summary
Head SHA: c518bb2
🤖 Keepalive Loop Status
PR #731 | Agent: Codex | Iteration 0/5
🔍 Failure Classification
| Error type | infrastructure |
Pull request overview
This PR addresses false positives identified during Phase 3 functional testing by tuning duplicate detection and auto-labeling workflows based on test results from Manager-Database issues #243-248.
Changes:
- Modified duplicate detection to apply stricter filtering (raised threshold from 0.85 to 0.92, added 40% title word overlap requirement)
- Changed auto-label workflow to apply only the best matching label instead of all labels above threshold
- Updated SHORT_TERM_PLAN.md with comprehensive test results and accuracy metrics from Phase 3 functional testing
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| templates/consumer-repo/.github/workflows/agents-dedup.yml | Increased similarity threshold to 0.92 and added title word overlap filter to reduce false positives |
| templates/consumer-repo/.github/workflows/agents-auto-label.yml | Modified to apply only the best matching label instead of all labels above threshold |
| .github/workflows/agents-dedup.yml | Same duplicate detection improvements as template version |
| .github/workflows/agents-auto-label.yml | Same auto-label improvements as template version |
| docs/plans/SHORT_TERM_PLAN.md | Added detailed Phase 3 test results showing 50% accuracy for dedup and over-labeling issues in auto-label |
| # Require at least 40% of words to overlap for a duplicate flag | ||
| max_words = max(len(title_words_new), len(title_words_match), 1) | ||
| overlap_ratio = len(shared_words) / max_words |
The overlap ratio calculation uses the maximum of the two title lengths as the denominator, but this may not accurately represent the degree of overlap between titles. Consider using the minimum length instead, or a Jaccard similarity metric (intersection over union).
For example, if a new issue has title "Add caching" (2 words) and an existing issue has "Add PUT endpoint to update manager caching layer optimization" (9 words), and they share "Add" and "caching" (2 words), the current calculation gives 2/9 ≈ 22% overlap, which would miss this as a duplicate even if the body content is similar. Using min would give 2/2 = 100% overlap, which better reflects that all words in the shorter title are present in the longer one.
| # Require at least 40% of words to overlap for a duplicate flag | |
| max_words = max(len(title_words_new), len(title_words_match), 1) | |
| overlap_ratio = len(shared_words) / max_words | |
| # Require at least 40% of the shorter title's words to overlap for a duplicate flag | |
| min_words = max(min(len(title_words_new), len(title_words_match)), 1) | |
| overlap_ratio = len(shared_words) / min_words |
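To make the trade-off between the three denominators concrete, here is a small comparison sketch. The function name and the whitespace tokenization are assumptions for illustration; the workflow's own title processing may differ.

```python
def overlap_ratios(title_a: str, title_b: str) -> dict[str, float]:
    """Compare three ways of normalizing shared title words."""
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    shared = words_a & words_b
    union = words_a | words_b
    return {
        # Current approach: divide by the longer title's word count.
        "max": len(shared) / max(len(words_a), len(words_b), 1),
        # Suggested approach: divide by the shorter title's word count,
        # with an outer max(..., 1) so an empty title cannot divide by zero.
        "min": len(shared) / max(min(len(words_a), len(words_b)), 1),
        # Jaccard similarity: intersection over union.
        "jaccard": len(shared) / max(len(union), 1),
    }


ratios = overlap_ratios(
    "Add caching",
    "Add PUT endpoint to update manager caching layer optimization",
)
```

For the example titles the min-based ratio is 1.0, while the max-based and Jaccard ratios both fall well below the 40% cutoff. The min-based variant is therefore the most permissive of the three when a short title is a subset of a long one, which may reintroduce some of the false positives this PR targets.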
| # Require at least 40% of words to overlap for a duplicate flag | ||
| max_words = max(len(title_words_new), len(title_words_match), 1) | ||
| overlap_ratio = len(shared_words) / max_words |
The overlap ratio calculation uses the maximum of the two title lengths as the denominator, but this may not accurately represent the degree of overlap between titles. Consider using the minimum length instead, or a Jaccard similarity metric (intersection over union).
For example, if a new issue has title "Add caching" (2 words) and an existing issue has "Add PUT endpoint to update manager caching layer optimization" (9 words), and they share "Add" and "caching" (2 words), the current calculation gives 2/9 ≈ 22% overlap, which would miss this as a duplicate even if the body content is similar. Using min would give 2/2 = 100% overlap, which better reflects that all words in the shorter title are present in the longer one.
| # Require at least 40% of words to overlap for a duplicate flag | |
| max_words = max(len(title_words_new), len(title_words_match), 1) | |
| overlap_ratio = len(shared_words) / max_words | |
| # Require at least 40% of words in the shorter title to overlap for a duplicate flag | |
| min_words = max(min(len(title_words_new), len(title_words_match)), 1) | |
| overlap_ratio = len(shared_words) / min_words |
| auto_apply = [m for m in matches if m.score >= auto_threshold] | ||
| suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold] | ||
| # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold | ||
| # This prevents over-labeling issues with multiple labels like bug+enhancement | ||
| if auto_apply: | ||
| best_match = auto_apply[0] # matches are already sorted by score descending | ||
| auto_apply = [best_match] | ||
| # Move other high-confidence matches to suggestions | ||
| for m in matches[1:]: | ||
| if m.score >= auto_threshold and m not in suggestions: | ||
| suggestions.insert(0, m) |
The logic to move other high-confidence matches to suggestions has a potential bug. The condition m not in suggestions will always be true because suggestions only contains matches with scores below auto_threshold (line 132), while this loop checks matches with scores >= auto_threshold. This means duplicates could be added to the suggestions list. Additionally, iterating over matches[1:] and checking m.score >= auto_threshold is redundant since we already have the auto_apply list containing all matches above the threshold.
| auto_apply = [m for m in matches if m.score >= auto_threshold] | |
| suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold] | |
| # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold | |
| # This prevents over-labeling issues with multiple labels like bug+enhancement | |
| if auto_apply: | |
| best_match = auto_apply[0] # matches are already sorted by score descending | |
| auto_apply = [best_match] | |
| # Move other high-confidence matches to suggestions | |
| for m in matches[1:]: | |
| if m.score >= auto_threshold and m not in suggestions: | |
| suggestions.insert(0, m) | |
| all_auto_apply = [m for m in matches if m.score >= auto_threshold] | |
| suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold] | |
| # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold | |
| # This prevents over-labeling issues with multiple labels like bug+enhancement | |
| auto_apply = [] | |
| if all_auto_apply: | |
| best_match = all_auto_apply[0] # matches are already sorted by score descending | |
| auto_apply = [best_match] | |
| # Move other high-confidence matches (except the best) to suggestions | |
| for m in all_auto_apply[1:]: | |
| suggestions.insert(0, m) |
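The intent of this suggestion (apply only the top match, demote the rest) can be sketched as a standalone function. The `Match` class, the function name, and the threshold values used in the example are hypothetical; only the `score` attribute and the two threshold names come from the quoted code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    label: str    # hypothetical field name
    score: float  # similarity score, as used in the quoted workflow code


def split_matches(matches, auto_threshold, suggest_threshold):
    """Return (auto_apply, suggestions) with at most one auto-applied label."""
    # Sort explicitly rather than relying on the caller's ordering.
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    high = [m for m in ranked if m.score >= auto_threshold]
    mid = [m for m in ranked if suggest_threshold <= m.score < auto_threshold]
    auto_apply = high[:1]            # best match only (empty if none qualify)
    suggestions = high[1:] + mid     # remaining matches, still in score order
    return auto_apply, suggestions


# Example thresholds are assumptions, not the workflow's configured values.
auto, suggested = split_matches(
    [Match("enhancement", 0.91), Match("bug", 0.95), Match("docs", 0.80)],
    auto_threshold=0.90,
    suggest_threshold=0.75,
)
```

Building `suggestions` by concatenation keeps the list in descending score order without any membership check, which sidesteps both issues raised in the comment above.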
| auto_apply = [m for m in matches if m.score >= auto_threshold] | ||
| suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold] | ||
| # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold | ||
| # This prevents over-labeling issues with multiple labels like bug+enhancement | ||
| if auto_apply: | ||
| best_match = auto_apply[0] # matches are already sorted by score descending | ||
| auto_apply = [best_match] | ||
| # Move other high-confidence matches to suggestions | ||
| for m in matches[1:]: | ||
| if m.score >= auto_threshold and m not in suggestions: | ||
| suggestions.insert(0, m) |
The logic to move other high-confidence matches to suggestions has a potential bug. The condition m not in suggestions will always be true because suggestions only contains matches with scores below auto_threshold (line 132), while this loop checks matches with scores >= auto_threshold. This means duplicates could be added to the suggestions list. Additionally, iterating over matches[1:] and checking m.score >= auto_threshold is redundant since we already have the auto_apply list containing all matches above the threshold.
| auto_apply = [m for m in matches if m.score >= auto_threshold] | |
| suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold] | |
| # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold | |
| # This prevents over-labeling issues with multiple labels like bug+enhancement | |
| if auto_apply: | |
| best_match = auto_apply[0] # matches are already sorted by score descending | |
| auto_apply = [best_match] | |
| # Move other high-confidence matches to suggestions | |
| for m in matches[1:]: | |
| if m.score >= auto_threshold and m not in suggestions: | |
| suggestions.insert(0, m) | |
| auto_apply_full = [m for m in matches if m.score >= auto_threshold] | |
| auto_apply = list(auto_apply_full) | |
| suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold] | |
| # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold | |
| # This prevents over-labeling issues with multiple labels like bug+enhancement | |
| if auto_apply: | |
| best_match = auto_apply[0] # matches are already sorted by score descending | |
| # Keep only the best match for auto-apply | |
| auto_apply = [best_match] | |
| # Move other high-confidence matches (remaining auto_apply) to suggestions | |
| other_high = auto_apply_full[1:] | |
| for m in reversed(other_high): | |
| suggestions.insert(0, m) |
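The two suggested fixes differ in one detail: whether the demoted matches are prepended in forward or reversed order. The sketch below, using plain strings as stand-ins for match objects and hypothetical helper names, shows how that choice affects the final ordering.

```python
def prepend_forward(suggestions, other_high):
    """Insert each item at index 0 in order; the last item ends up first."""
    result = list(suggestions)
    for m in other_high:
        result.insert(0, m)
    return result


def prepend_reversed(suggestions, other_high):
    """Iterate in reverse so the original order of other_high is preserved."""
    result = list(suggestions)
    for m in reversed(other_high):
        result.insert(0, m)
    return result


def prepend_slice(suggestions, other_high):
    """Slice assignment prepends the whole list at once, order preserved."""
    result = list(suggestions)
    result[:0] = other_high
    return result


existing = ["docs"]
demoted = ["enhancement", "question"]  # assumed sorted by score, descending

forward = prepend_forward(existing, demoted)
preserved = prepend_reversed(existing, demoted)
sliced = prepend_slice(existing, demoted)
```

If the suggestions comment is meant to list labels from highest to lowest score, the `reversed` loop or the slice assignment keeps that order, whereas the forward loop flips the demoted matches.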
- Mark Suite C & D fixes as COMPLETED (PR #731 merged)
- Update Success Criteria to reflect fixes done
- Add Week 2 Progress section with completed work
- Add PR #731 to completed PRs list
- Update Suite A #237 to note Azure content filter issue (not code bug)
- Update remaining priorities and next steps
Automated Status Summary
Scope
Address unmet acceptance criteria from PR #235.
Head SHA: a49c6d4
Latest Runs: ❔ in progress — Gate
Required: gate: ❔ in progress