fix: Reduce false positives in auto-label and duplicate detection #731

Merged
stranske merged 1 commit into main from fix/auto-label-dedup-false-positives
Jan 10, 2026

Conversation

@stranske
Owner

@stranske stranske commented Jan 10, 2026

Source: Issue #243

Automated Status Summary

Scope

Address unmet acceptance criteria from PR #235.

Original scope:

  • Scope section missing from source issue.

Context for Agent

Related Issues/PRs

References

Tasks

  • Tasks section missing from source issue.

Acceptance criteria

  • Acceptance criteria section missing from source issue.

Head SHA: a49c6d4
Latest Runs: ❔ in progress — Gate
Required: gate: ❔ in progress

| Workflow / Job | Result | Logs |
|---|---|---|
| Agents PR meta manager | ❔ in progress | View run |
| Auto-label Dependabot PRs | ⏭️ skipped | View run |
| CI Autofix Loop | ✅ success | View run |
| Copilot code review | ❔ in progress | View run |
| Gate | ❔ in progress | View run |
| Health 40 Sweep | ✅ success | View run |
| Health 44 Gate Branch Protection | ✅ success | View run |
| Health 45 Agents Guard | ✅ success | View run |
| Health 50 Security Scan | ✅ success | View run |
| Maint 52 Validate Workflows | ✅ success | View run |
| PR 11 - Minimal invariant CI | ✅ success | View run |
| Selftest CI | ✅ success | View run |
| Validate Sync Manifest | ✅ success | View run |

Auto-Label (agents-auto-label.yml):
- Now applies only the BEST matching label instead of all labels above threshold
- Prevents issues from getting multiple conflicting labels like bug+enhancement
- Other high-confidence matches moved to suggestions comment

Duplicate Detection (agents-dedup.yml):
- Raised threshold from 0.85 to 0.92 for higher precision
- Added title word overlap filter (requires 40% overlap or 95% score)
- Reduces false positives from issues in same domain that share vocabulary
- Logs filtering decisions for debugging
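
A minimal sketch of how the duplicate decision described above could be expressed. The function and constant names, the tokenizer, and the inclusive comparisons are assumptions made for illustration; only the 0.92 threshold, the 40% title overlap, and the 95% score bypass come from this description.

```python
import re

# Values taken from the PR description; the names are hypothetical.
SIMILARITY_THRESHOLD = 0.92
MIN_TITLE_OVERLAP = 0.40
HIGH_CONFIDENCE_SCORE = 0.95


def title_words(title: str) -> set[str]:
    """Assumed tokenizer: lowercase alphanumeric runs, deduplicated."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def should_flag_duplicate(score: float, new_title: str, match_title: str) -> bool:
    """Return True when a candidate should be flagged as a duplicate."""
    if score < SIMILARITY_THRESHOLD:
        return False  # below the raised similarity threshold
    if score >= HIGH_CONFIDENCE_SCORE:
        return True  # very high score bypasses the title filter
    new_words = title_words(new_title)
    match_words = title_words(match_title)
    shared = new_words & match_words
    # Same denominator as the workflow snippet: the longer title, at least 1.
    max_words = max(len(new_words), len(match_words), 1)
    return len(shared) / max_words >= MIN_TITLE_OVERLAP


# Example calls with made-up scores and titles.
same_domain = should_flag_duplicate(0.93, "Add caching", "Fix login timeout")
near_copy = should_flag_duplicate(0.93, "Add caching layer", "Add caching layer to API")
```

With these inputs, the first pair is rejected because the titles share no words, while the second pair passes because three of the five words in the longer title overlap.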

Test results showed:
- Suite C had 50% false positive rate (4/4 flagged, expected 2/4)
- Suite D applied both bug+enhancement to all issues

Fixes identified in Manager-Database #243-248 testing.
Copilot AI review requested due to automatic review settings January 10, 2026 03:22
@stranske stranske temporarily deployed to agent-high-privilege January 10, 2026 03:22 — with GitHub Actions Inactive
@github-actions
Contributor

Automated Status Summary

Head SHA: c518bb2
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / Enforce agents workflow protections
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

Workflow / Job Result Logs
(no jobs reported) ⏳ pending

Coverage Overview

  • Coverage history entries: 1

Coverage Trend

| Metric | Value |
|---|---|
| Current | 92.21% |
| Baseline | 85.00% |
| Delta | +7.21% |
| Minimum | 70.00% |
| Status | ✅ Pass |

Top Coverage Hotspots (lowest coverage)

| File | Coverage | Missing |
|---|---|---|
| scripts/workflow_health_check.py | 62.6% | 28 |
| scripts/classify_test_failures.py | 62.9% | 37 |
| scripts/ledger_validate.py | 65.3% | 63 |
| scripts/mypy_return_autofix.py | 82.6% | 11 |
| scripts/ledger_migrate_base.py | 85.5% | 13 |
| scripts/fix_cosmetic_aggregate.py | 92.3% | 1 |
| scripts/coverage_history_append.py | 92.8% | 2 |
| scripts/workflow_validator.py | 93.3% | 4 |
| scripts/update_autofix_expectations.py | 93.9% | 1 |
| scripts/pr_metrics_tracker.py | 95.7% | 3 |
| scripts/generate_residual_trend.py | 96.6% | 1 |
| scripts/build_autofix_pr_comment.py | 97.0% | 2 |
| scripts/aggregate_agent_metrics.py | 97.2% | 0 |
| scripts/fix_numpy_asserts.py | 98.1% | 0 |
| scripts/sync_test_dependencies.py | 98.3% | 1 |

Updated automatically; will refresh on subsequent CI/Docker completions.


Keepalive checklist

Scope

No scope information available

Tasks

  • No tasks defined

Acceptance criteria

  • No acceptance criteria defined

@stranske stranske merged commit abf75a5 into main Jan 10, 2026
105 checks passed
@stranske stranske deleted the fix/auto-label-dedup-false-positives branch January 10, 2026 03:24
@github-actions
Contributor

🤖 Keepalive Loop Status

PR #731 | Agent: Codex | Iteration 0/5

Current State

| Metric | Value |
|---|---|
| Iteration progress | [----------] 0/5 |
| Action | wait (missing-agent-label) |
| Disposition | skipped (transient) |
| Gate | success |
| Tasks | 0/2 complete |
| Keepalive | ❌ disabled |
| Autofix | ❌ disabled |

🔍 Failure Classification

| Error type | infrastructure |
| Error category | resource |
| Suggested recovery | Confirm the referenced resource exists (repo, PR, branch, workflow, or file). |

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses false positives identified during Phase 3 functional testing by tuning duplicate detection and auto-labeling workflows based on test results from Manager-Database issues #243-248.

Changes:

  • Modified duplicate detection to apply stricter filtering (raised threshold from 0.85 to 0.92, added 40% title word overlap requirement)
  • Changed auto-label workflow to apply only the best matching label instead of all labels above threshold
  • Updated SHORT_TERM_PLAN.md with comprehensive test results and accuracy metrics from Phase 3 functional testing

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
|---|---|
| templates/consumer-repo/.github/workflows/agents-dedup.yml | Increased similarity threshold to 0.92 and added title word overlap filter to reduce false positives |
| templates/consumer-repo/.github/workflows/agents-auto-label.yml | Modified to apply only the best matching label instead of all labels above threshold |
| .github/workflows/agents-dedup.yml | Same duplicate detection improvements as template version |
| .github/workflows/agents-auto-label.yml | Same auto-label improvements as template version |
| docs/plans/SHORT_TERM_PLAN.md | Added detailed Phase 3 test results showing 50% accuracy for dedup and over-labeling issues in auto-label |

Comment on lines +140 to +142
# Require at least 40% of words to overlap for a duplicate flag
max_words = max(len(title_words_new), len(title_words_match), 1)
overlap_ratio = len(shared_words) / max_words
Copilot AI Jan 10, 2026

The overlap ratio calculation uses the maximum of the two title lengths as the denominator, but this may not accurately represent the degree of overlap between titles. Consider using the minimum length instead, or a Jaccard similarity metric (intersection over union).

For example, if a new issue has title "Add caching" (2 words) and an existing issue has "Add PUT endpoint to update manager caching layer optimization" (8 words), and they share "Add" and "caching" (2 words), the current calculation gives 2/8 = 25% overlap, which would miss this as a duplicate even if the body content is similar. Using min would give 2/2 = 100% overlap, which better reflects that all words in the shorter title are present in the longer one.

Suggested change
- # Require at least 40% of words to overlap for a duplicate flag
- max_words = max(len(title_words_new), len(title_words_match), 1)
- overlap_ratio = len(shared_words) / max_words
+ # Require at least 40% of the shorter title's words to overlap for a duplicate flag
+ min_words = max(min(len(title_words_new), len(title_words_match)), 1)
+ overlap_ratio = len(shared_words) / min_words
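
To compare the denominators discussed in this comment, the sketch below computes the overlap of the two example titles three ways. The tokenizer is hypothetical: it lowercases, splits on non-alphanumeric characters, and drops words shorter than three characters (so "to" is ignored), which is one way to arrive at the 8 words quoted above. The workflow's real tokenizer may differ.

```python
import re


def title_words(title: str, min_len: int = 3) -> set[str]:
    """Hypothetical tokenizer: lowercase alphanumeric runs, short words dropped."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if len(w) >= min_len}


def overlap_scores(title_a: str, title_b: str) -> dict[str, float]:
    """Compute title overlap with three different denominators."""
    a, b = title_words(title_a), title_words(title_b)
    shared = a & b
    return {
        # Longer title as denominator (the merged behaviour).
        "max": len(shared) / max(len(a), len(b), 1),
        # Shorter title as denominator, guarded against empty titles.
        "min": len(shared) / max(min(len(a), len(b)), 1),
        # Jaccard similarity: intersection over union.
        "jaccard": len(shared) / max(len(a | b), 1),
    }


scores = overlap_scores(
    "Add caching",
    "Add PUT endpoint to update manager caching layer optimization",
)
```

Under these assumptions the max-based ratio and the Jaccard ratio both come out at 0.25, and the min-based ratio at 1.0. A min-based ratio measures containment, so a short, generic title could match many longer ones; the choice trades missed duplicates against the false positives this PR targets.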

Copilot uses AI. Check for mistakes.
Comment on lines +140 to +142
# Require at least 40% of words to overlap for a duplicate flag
max_words = max(len(title_words_new), len(title_words_match), 1)
overlap_ratio = len(shared_words) / max_words
Copilot AI Jan 10, 2026

The overlap ratio calculation uses the maximum of the two title lengths as the denominator, but this may not accurately represent the degree of overlap between titles. Consider using the minimum length instead, or a Jaccard similarity metric (intersection over union).

For example, if a new issue has title "Add caching" (2 words) and an existing issue has "Add PUT endpoint to update manager caching layer optimization" (8 words), and they share "Add" and "caching" (2 words), the current calculation gives 2/8 = 25% overlap, which would miss this as a duplicate even if the body content is similar. Using min would give 2/2 = 100% overlap, which better reflects that all words in the shorter title are present in the longer one.

Suggested change
- # Require at least 40% of words to overlap for a duplicate flag
- max_words = max(len(title_words_new), len(title_words_match), 1)
- overlap_ratio = len(shared_words) / max_words
+ # Require at least 40% of words in the shorter title to overlap for a duplicate flag
+ min_words = max(min(len(title_words_new), len(title_words_match)), 1)
+ overlap_ratio = len(shared_words) / min_words

Comment on lines 131 to +142
auto_apply = [m for m in matches if m.score >= auto_threshold]
suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]

# IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
# This prevents over-labeling issues with multiple labels like bug+enhancement
if auto_apply:
    best_match = auto_apply[0]  # matches are already sorted by score descending
    auto_apply = [best_match]
    # Move other high-confidence matches to suggestions
    for m in matches[1:]:
        if m.score >= auto_threshold and m not in suggestions:
            suggestions.insert(0, m)
Copilot AI Jan 10, 2026

The logic that moves other high-confidence matches to suggestions contains a redundant check. The condition m not in suggestions is always true on the first pass, because suggestions initially contains only matches with scores below auto_threshold (line 132), while this loop only considers matches with scores >= auto_threshold. As long as matches holds each label once, the check never filters anything. Additionally, iterating over matches[1:] and re-checking m.score >= auto_threshold is redundant, since the original auto_apply list already contains every match above the threshold.

Suggested change
- auto_apply = [m for m in matches if m.score >= auto_threshold]
- suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]
- # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
- # This prevents over-labeling issues with multiple labels like bug+enhancement
- if auto_apply:
-     best_match = auto_apply[0]  # matches are already sorted by score descending
-     auto_apply = [best_match]
-     # Move other high-confidence matches to suggestions
-     for m in matches[1:]:
-         if m.score >= auto_threshold and m not in suggestions:
-             suggestions.insert(0, m)
+ all_auto_apply = [m for m in matches if m.score >= auto_threshold]
+ suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]
+ # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
+ # This prevents over-labeling issues with multiple labels like bug+enhancement
+ auto_apply = []
+ if all_auto_apply:
+     best_match = all_auto_apply[0]  # matches are already sorted by score descending
+     auto_apply = [best_match]
+     # Move other high-confidence matches (except the best) to suggestions
+     for m in all_auto_apply[1:]:
+         suggestions.insert(0, m)
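
One way to express the best-label selection without the membership check is to slice the above-threshold list, as sketched below. `Match`, `split_matches`, and the threshold values are hypothetical stand-ins for the workflow's own types and settings, and the sketch sorts its input rather than assuming it is already ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical stand-in for a label match produced by the workflow."""
    label: str
    score: float


def split_matches(
    matches: list[Match], auto_threshold: float, suggest_threshold: float
) -> tuple[list[Match], list[Match]]:
    """Return (auto_apply, suggestions) with at most one auto-applied label."""
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    above = [m for m in ranked if m.score >= auto_threshold]
    below = [m for m in ranked if suggest_threshold <= m.score < auto_threshold]
    auto_apply = above[:1]            # only the best match is applied
    suggestions = above[1:] + below   # runners-up lead the suggestions
    return auto_apply, suggestions


# Example with made-up scores and thresholds.
example = [
    Match("enhancement", 0.92),
    Match("bug", 0.95),
    Match("docs", 0.91),
    Match("question", 0.60),
    Match("chore", 0.30),
]
auto_apply, suggestions = split_matches(example, auto_threshold=0.90, suggest_threshold=0.50)
```

Prepending the runners-up as a slice keeps them in descending score order; calling `suggestions.insert(0, m)` while iterating forward would reverse their order.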

Comment on lines 131 to +142
auto_apply = [m for m in matches if m.score >= auto_threshold]
suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]

# IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
# This prevents over-labeling issues with multiple labels like bug+enhancement
if auto_apply:
    best_match = auto_apply[0]  # matches are already sorted by score descending
    auto_apply = [best_match]
    # Move other high-confidence matches to suggestions
    for m in matches[1:]:
        if m.score >= auto_threshold and m not in suggestions:
            suggestions.insert(0, m)
Copilot AI Jan 10, 2026

The logic that moves other high-confidence matches to suggestions contains a redundant check. The condition m not in suggestions is always true on the first pass, because suggestions initially contains only matches with scores below auto_threshold (line 132), while this loop only considers matches with scores >= auto_threshold. As long as matches holds each label once, the check never filters anything. Additionally, iterating over matches[1:] and re-checking m.score >= auto_threshold is redundant, since the original auto_apply list already contains every match above the threshold.

Suggested change
- auto_apply = [m for m in matches if m.score >= auto_threshold]
- suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]
- # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
- # This prevents over-labeling issues with multiple labels like bug+enhancement
- if auto_apply:
-     best_match = auto_apply[0]  # matches are already sorted by score descending
-     auto_apply = [best_match]
-     # Move other high-confidence matches to suggestions
-     for m in matches[1:]:
-         if m.score >= auto_threshold and m not in suggestions:
-             suggestions.insert(0, m)
+ auto_apply_full = [m for m in matches if m.score >= auto_threshold]
+ auto_apply = list(auto_apply_full)
+ suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]
+ # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
+ # This prevents over-labeling issues with multiple labels like bug+enhancement
+ if auto_apply:
+     best_match = auto_apply[0]  # matches are already sorted by score descending
+     # Keep only the best match for auto-apply
+     auto_apply = [best_match]
+     # Move other high-confidence matches (remaining auto_apply) to suggestions
+     other_high = auto_apply_full[1:] if 'auto_apply_full' in locals() else []
+     for m in reversed(other_high):
+         suggestions.insert(0, m)

stranske added a commit that referenced this pull request Jan 10, 2026
- Mark Suite C & D fixes as COMPLETED (PR #731 merged)
- Update Success Criteria to reflect fixes done
- Add Week 2 Progress section with completed work
- Add PR #731 to completed PRs list
- Update Suite A #237 to note Azure content filter issue (not code bug)
- Update remaining priorities and next steps
stranske added a commit that referenced this pull request Jan 10, 2026
…732)

- Mark Suite C & D fixes as COMPLETED (PR #731 merged)
- Update Success Criteria to reflect fixes done
- Add Week 2 Progress section with completed work
- Add PR #731 to completed PRs list
- Update Suite A #237 to note Azure content filter issue (not code bug)
- Update remaining priorities and next steps