fix: Reduce false positives in auto-label and duplicate detection by stranske · Pull Request #731 · stranske/Workflows

stranske · 2026-01-10T03:22:26Z

Source: Issue #243

Automated Status Summary

Scope

Address unmet acceptance criteria from PR #235.

Original scope:

Scope section missing from source issue.

Context for Agent

Related Issues/PRs

References

Tasks

Tasks section missing from source issue.

Acceptance criteria

Acceptance criteria section missing from source issue.

Head SHA: a49c6d4
Latest Runs: ❔ in progress — Gate
Required: gate: ❔ in progress

Workflow / Job	Result	Logs
Agents PR meta manager	❔ in progress	View run
Auto-label Dependabot PRs	⏭️ skipped	View run
CI Autofix Loop	✅ success	View run
Copilot code review	❔ in progress	View run
Gate	❔ in progress	View run
Health 40 Sweep	✅ success	View run
Health 44 Gate Branch Protection	✅ success	View run
Health 45 Agents Guard	✅ success	View run
Health 50 Security Scan	✅ success	View run
Maint 52 Validate Workflows	✅ success	View run
PR 11 - Minimal invariant CI	✅ success	View run
Selftest CI	✅ success	View run
Validate Sync Manifest	✅ success	View run

Auto-Label (agents-auto-label.yml):
- Now applies only the BEST matching label instead of all labels above threshold
- Prevents issues from getting multiple conflicting labels like bug+enhancement
- Other high-confidence matches moved to suggestions comment

Duplicate Detection (agents-dedup.yml):
- Raised threshold from 0.85 to 0.92 for higher precision
- Added title word overlap filter (requires 40% overlap or 95% score)
- Reduces false positives from issues in same domain that share vocabulary
- Logs filtering decisions for debugging

Test results showed:
- Suite C had 50% false positive rate (4/4 flagged, expected 2/4)
- Suite D applied both bug+enhancement to all issues

Fixes identified in Manager-Database #243-248 testing.
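A minimal sketch of the duplicate filter described above, for readers who want to see how the three numbers interact. Only the values 0.92, 40% and 95% come from the PR description; the function names, the tokenizer, and the exact order of the checks are assumptions made for illustration and may differ from what `agents-dedup.yml` actually does.

```python
import re

SIMILARITY_THRESHOLD = 0.92  # raised from 0.85 in this PR
MIN_TITLE_OVERLAP = 0.40     # required share of overlapping title words
HIGH_SCORE_OVERRIDE = 0.95   # score at which the overlap requirement is waived


def title_words(title: str) -> set[str]:
    """Lowercased word set for a title (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def is_likely_duplicate(new_title: str, match_title: str, score: float) -> bool:
    """Flag a match only if it clears the score threshold and the title filter."""
    if score < SIMILARITY_THRESHOLD:
        return False
    if score >= HIGH_SCORE_OVERRIDE:
        return True  # very high similarity bypasses the title check
    new_words = title_words(new_title)
    match_words = title_words(match_title)
    shared = new_words & match_words
    # Denominator is the longer title, as in the diff under review.
    max_words = max(len(new_words), len(match_words), 1)
    return len(shared) / max_words >= MIN_TITLE_OVERLAP
```

Under these assumptions, two issues from the same domain with a score of 0.93 but no shared title words are no longer flagged, while the same pair at 0.96 still is.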

github-actions · 2026-01-10T03:24:08Z

Automated Status Summary

Head SHA: c518bb2
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / Enforce agents workflow protections
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

Workflow / Job	Result	Logs
(no jobs reported)	⏳ pending	—

Coverage Overview

Coverage history entries: 1

Coverage Trend

Metric	Value
Current	92.21%
Baseline	85.00%
Delta	+7.21%
Minimum	70.00%
Status	✅ Pass

Top Coverage Hotspots (lowest coverage)

File	Coverage	Missing
`scripts/workflow_health_check.py`	62.6%	28
`scripts/classify_test_failures.py`	62.9%	37
`scripts/ledger_validate.py`	65.3%	63
`scripts/mypy_return_autofix.py`	82.6%	11
`scripts/ledger_migrate_base.py`	85.5%	13
`scripts/fix_cosmetic_aggregate.py`	92.3%	1
`scripts/coverage_history_append.py`	92.8%	2
`scripts/workflow_validator.py`	93.3%	4
`scripts/update_autofix_expectations.py`	93.9%	1
`scripts/pr_metrics_tracker.py`	95.7%	3
`scripts/generate_residual_trend.py`	96.6%	1
`scripts/build_autofix_pr_comment.py`	97.0%	2
`scripts/aggregate_agent_metrics.py`	97.2%	0
`scripts/fix_numpy_asserts.py`	98.1%	0
`scripts/sync_test_dependencies.py`	98.3%	1

Updated automatically; will refresh on subsequent CI/Docker completions.

Keepalive checklist

Scope

No scope information available

Tasks

No tasks defined

Acceptance criteria

No acceptance criteria defined

github-actions · 2026-01-10T03:24:33Z

🤖 Keepalive Loop Status

PR #731 | Agent: Codex | Iteration 0/5

Current State

Metric	Value
Iteration progress	[----------] 0/5
Action	wait (missing-agent-label)
Disposition	skipped (transient)
Gate	success
Tasks	0/2 complete
Keepalive	❌ disabled
Autofix	❌ disabled

🔍 Failure Classification

Copilot

Pull request overview

This PR addresses false positives identified during Phase 3 functional testing by tuning duplicate detection and auto-labeling workflows based on test results from Manager-Database issues #243-248.

Changes:

Modified duplicate detection to apply stricter filtering (raised threshold from 0.85 to 0.92, added 40% title word overlap requirement)
Changed auto-label workflow to apply only the best matching label instead of all labels above threshold
Updated SHORT_TERM_PLAN.md with comprehensive test results and accuracy metrics from Phase 3 functional testing

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.


File	Description
templates/consumer-repo/.github/workflows/agents-dedup.yml	Increased similarity threshold to 0.92 and added title word overlap filter to reduce false positives
templates/consumer-repo/.github/workflows/agents-auto-label.yml	Modified to apply only the best matching label instead of all labels above threshold
.github/workflows/agents-dedup.yml	Same duplicate detection improvements as template version
.github/workflows/agents-auto-label.yml	Same auto-label improvements as template version
docs/plans/SHORT_TERM_PLAN.md	Added detailed Phase 3 test results showing 50% accuracy for dedup and over-labeling issues in auto-label


Copilot · 2026-01-10T03:25:22Z

templates/consumer-repo/.github/workflows/agents-dedup.yml

+              # Require at least 40% of words to overlap for a duplicate flag
+              max_words = max(len(title_words_new), len(title_words_match), 1)
+              overlap_ratio = len(shared_words) / max_words


The overlap ratio calculation uses the maximum of the two title lengths as the denominator, but this may not accurately represent the degree of overlap between titles. Consider using the minimum length instead, or a Jaccard similarity metric (intersection over union).

For example, if a new issue has the title "Add caching" (2 words) and an existing issue has "Add PUT endpoint to update manager caching layer optimization" (9 words), and they share "Add" and "caching" (2 words), the current calculation gives 2/9 ≈ 22% overlap, below the 40% requirement, so the pair would not be flagged as a duplicate even if the body content is similar. Using min would give 2/2 = 100% overlap, which better reflects that all words in the shorter title are present in the longer one.

Suggested change

-              # Require at least 40% of words to overlap for a duplicate flag
-              max_words = max(len(title_words_new), len(title_words_match), 1)
-              overlap_ratio = len(shared_words) / max_words
+              # Require at least 40% of the shorter title's words to overlap for a duplicate flag
+              min_words = max(min(len(title_words_new), len(title_words_match)), 1)
+              overlap_ratio = len(shared_words) / min_words
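To make the comparison concrete, here is a small sketch that computes both denominators for the two example titles. The function name and the whitespace tokenizer are illustrative assumptions, not the workflow's code. Splitting on whitespace, the longer title yields nine words.

```python
def overlap_ratios(title_a: str, title_b: str) -> dict[str, float]:
    """Return title overlap using the longer and the shorter title as denominator."""
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    shared = words_a & words_b
    longer = max(len(words_a), len(words_b), 1)
    # Guard with max(..., 1) so an empty title cannot cause a division by zero.
    shorter = max(min(len(words_a), len(words_b)), 1)
    return {"max": len(shared) / longer, "min": len(shared) / shorter}


ratios = overlap_ratios(
    "Add caching",
    "Add PUT endpoint to update manager caching layer optimization",
)
# ratios["max"] is 2/9 (about 0.22); ratios["min"] is 2/2 = 1.0
```

Note that writing the guard as `min(len_a, len_b, 1)` would cap the denominator at 1 and return 0 for an empty title, so the `max(min(...), 1)` form is the safer one.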

Copilot · 2026-01-10T03:25:23Z

.github/workflows/agents-dedup.yml

+              # Require at least 40% of words to overlap for a duplicate flag
+              max_words = max(len(title_words_new), len(title_words_match), 1)
+              overlap_ratio = len(shared_words) / max_words


The overlap ratio calculation uses the maximum of the two title lengths as the denominator, but this may not accurately represent the degree of overlap between titles. Consider using the minimum length instead, or a Jaccard similarity metric (intersection over union).

For example, if a new issue has the title "Add caching" (2 words) and an existing issue has "Add PUT endpoint to update manager caching layer optimization" (9 words), and they share "Add" and "caching" (2 words), the current calculation gives 2/9 ≈ 22% overlap, below the 40% requirement, so the pair would not be flagged as a duplicate even if the body content is similar. Using min would give 2/2 = 100% overlap, which better reflects that all words in the shorter title are present in the longer one.

Suggested change

-              # Require at least 40% of words to overlap for a duplicate flag
-              max_words = max(len(title_words_new), len(title_words_match), 1)
-              overlap_ratio = len(shared_words) / max_words
+              # Require at least 40% of words in the shorter title to overlap for a duplicate flag
+              min_words = max(min(len(title_words_new), len(title_words_match)), 1)
+              overlap_ratio = len(shared_words) / min_words
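The review also mentions Jaccard similarity as an alternative. A minimal sketch follows, assuming titles are tokenized by whitespace; the function name is hypothetical. It divides the shared words by the union of both word sets, so it sits between the two other choices: a short generic title does not automatically score 100%, and a long title is not penalized as heavily as with the `max` denominator when the titles differ in only a few words.

```python
def jaccard_title_similarity(title_a: str, title_b: str) -> float:
    """Intersection over union of the two titles' word sets."""
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # both titles empty: treat as no overlap
    return len(words_a & words_b) / len(union)


# Two of four distinct words are shared ("fix", "timeout").
similar = jaccard_title_similarity("Fix login timeout", "Fix logout timeout")
```

When one title is a subset of the other, as in the "Add caching" example, the union equals the longer title, so Jaccard gives the same 2/9 as the current `max`-based calculation.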

Copilot · 2026-01-10T03:25:23Z

templates/consumer-repo/.github/workflows/agents-auto-label.yml

          auto_apply = [m for m in matches if m.score >= auto_threshold]
          suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]

+          # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
+          # This prevents over-labeling issues with multiple labels like bug+enhancement
+          if auto_apply:
+              best_match = auto_apply[0]  # matches are already sorted by score descending
+              auto_apply = [best_match]
+              # Move other high-confidence matches to suggestions
+              for m in matches[1:]:
+                  if m.score >= auto_threshold and m not in suggestions:
+                      suggestions.insert(0, m)


The logic to move other high-confidence matches to suggestions has a potential bug. The condition m not in suggestions will always be true because suggestions only contains matches with scores below auto_threshold (line 132), while this loop checks matches with scores >= auto_threshold. This means duplicates could be added to the suggestions list. Additionally, iterating over matches[1:] and checking m.score >= auto_threshold is redundant since we already have the auto_apply list containing all matches above the threshold.

Suggested change

-          auto_apply = [m for m in matches if m.score >= auto_threshold]
-          suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]
-          # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
-          # This prevents over-labeling issues with multiple labels like bug+enhancement
-          if auto_apply:
-              best_match = auto_apply[0]  # matches are already sorted by score descending
-              auto_apply = [best_match]
-              # Move other high-confidence matches to suggestions
-              for m in matches[1:]:
-                  if m.score >= auto_threshold and m not in suggestions:
-                      suggestions.insert(0, m)
+          all_auto_apply = [m for m in matches if m.score >= auto_threshold]
+          suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]
+          # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
+          # This prevents over-labeling issues with multiple labels like bug+enhancement
+          auto_apply = []
+          if all_auto_apply:
+              best_match = all_auto_apply[0]  # matches are already sorted by score descending
+              auto_apply = [best_match]
+              # Move other high-confidence matches (except the best) to suggestions
+              for m in all_auto_apply[1:]:
+                  suggestions.insert(0, m)
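The best-label selection discussed above can be sketched without any loop. This is an illustration under stated assumptions: the `Match` dataclass, the function name, and the threshold values 0.90 and 0.75 are hypothetical, and the real workflow's match objects may carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    label: str
    score: float


def split_matches(
    matches: list[Match], auto_threshold: float, suggest_threshold: float
) -> tuple[list[Match], list[Match]]:
    """Auto-apply only the top match; demote every other match to a suggestion."""
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    above = [m for m in ranked if m.score >= auto_threshold]
    between = [m for m in ranked if suggest_threshold <= m.score < auto_threshold]
    auto_apply = above[:1]              # empty when nothing clears the threshold
    suggestions = above[1:] + between   # stays sorted by score, descending
    return auto_apply, suggestions


auto_apply, suggestions = split_matches(
    [Match("enhancement", 0.92), Match("bug", 0.95), Match("docs", 0.80), Match("question", 0.50)],
    auto_threshold=0.90,
    suggest_threshold=0.75,
)
```

Here `auto_apply` holds only `bug`, and `suggestions` holds `enhancement` followed by `docs`, so the bug+enhancement double label reported for Suite D cannot occur.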

Copilot · 2026-01-10T03:25:23Z

.github/workflows/agents-auto-label.yml

          auto_apply = [m for m in matches if m.score >= auto_threshold]
          suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]

+          # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
+          # This prevents over-labeling issues with multiple labels like bug+enhancement
+          if auto_apply:
+              best_match = auto_apply[0]  # matches are already sorted by score descending
+              auto_apply = [best_match]
+              # Move other high-confidence matches to suggestions
+              for m in matches[1:]:
+                  if m.score >= auto_threshold and m not in suggestions:
+                      suggestions.insert(0, m)


The logic to move other high-confidence matches to suggestions has a potential bug. The condition m not in suggestions will always be true because suggestions only contains matches with scores below auto_threshold (line 132), while this loop checks matches with scores >= auto_threshold. This means duplicates could be added to the suggestions list. Additionally, iterating over matches[1:] and checking m.score >= auto_threshold is redundant since we already have the auto_apply list containing all matches above the threshold.

Suggested change

-          auto_apply = [m for m in matches if m.score >= auto_threshold]
-          suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]
-          # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
-          # This prevents over-labeling issues with multiple labels like bug+enhancement
-          if auto_apply:
-              best_match = auto_apply[0]  # matches are already sorted by score descending
-              auto_apply = [best_match]
-              # Move other high-confidence matches to suggestions
-              for m in matches[1:]:
-                  if m.score >= auto_threshold and m not in suggestions:
-                      suggestions.insert(0, m)
+          auto_apply_full = [m for m in matches if m.score >= auto_threshold]
+          auto_apply = list(auto_apply_full)
+          suggestions = [m for m in matches if suggest_threshold <= m.score < auto_threshold]
+          # IMPORTANT: Only auto-apply the BEST matching label, not all above threshold
+          # This prevents over-labeling issues with multiple labels like bug+enhancement
+          if auto_apply:
+              best_match = auto_apply[0]  # matches are already sorted by score descending
+              # Keep only the best match for auto-apply
+              auto_apply = [best_match]
+              # Move other high-confidence matches (remaining auto_apply) to suggestions
+              other_high = auto_apply_full[1:] if 'auto_apply_full' in locals() else []
+              for m in reversed(other_high):
+                  suggestions.insert(0, m)
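One detail worth spelling out is why this version iterates over `reversed(other_high)`: repeatedly calling `insert(0, m)` while walking a list front to back reverses its order. The sketch below uses plain strings and hypothetical helper names to show the difference.

```python
def prepend_forward(suggestions: list[str], extra: list[str]) -> list[str]:
    """insert(0, ...) while iterating front to back: extra ends up reversed."""
    out = list(suggestions)
    for item in extra:
        out.insert(0, item)
    return out


def prepend_reversed(suggestions: list[str], extra: list[str]) -> list[str]:
    """insert(0, ...) while iterating back to front: extra keeps its order."""
    out = list(suggestions)
    for item in reversed(extra):
        out.insert(0, item)
    return out


extra = ["enhancement", "feature"]  # already sorted by score, descending
forward = prepend_forward(["docs"], extra)
ordered = prepend_reversed(["docs"], extra)
```

With these inputs, `forward` is `["feature", "enhancement", "docs"]` while `ordered` is `["enhancement", "feature", "docs"]`, which is the same as `extra + ["docs"]`. If the suggestions comment is meant to list labels from highest to lowest score, the reversed iteration (or simple list concatenation) preserves that.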

- Mark Suite C & D fixes as COMPLETED (PR #731 merged)
- Update Success Criteria to reflect fixes done
- Add Week 2 Progress section with completed work
- Add PR #731 to completed PRs list
- Update Suite A #237 to note Azure content filter issue (not code bug)
- Update remaining priorities and next steps

Copilot AI review requested due to automatic review settings January 10, 2026 03:22

stranske temporarily deployed to agent-high-privilege January 10, 2026 03:22 — with GitHub Actions Inactive

Copilot started reviewing on behalf of stranske January 10, 2026 03:22 View session

stranske merged commit abf75a5 into main Jan 10, 2026
105 checks passed

stranske deleted the fix/auto-label-dedup-false-positives branch January 10, 2026 03:24

Copilot AI reviewed Jan 10, 2026

This was referenced Jan 10, 2026

docs: Update SHORT_TERM_PLAN with PR #731 merge and Week 2 progress #732

Merged

fix: Prevent keyword matcher from applying all labels #733

Merged

docs: Update SHORT_TERM_PLAN with re-test results #734

Merged

stranske merged 1 commit into main from fix/auto-label-dedup-false-positives
