100 changes: 97 additions & 3 deletions .github/scripts/keepalive_loop.js
@@ -822,15 +822,44 @@ async function evaluateKeepaliveLoop({ github, context, core, payload: overrideP
const maxIterations = toNumber(config.max_iterations ?? state.max_iterations, 5);
const failureThreshold = toNumber(config.failure_threshold ?? state.failure_threshold, 3);

// Productivity tracking: determine if recent iterations have been productive
// An iteration is productive if it made file changes or completed tasks
// Evidence-based productivity tracking
// Uses multiple signals to determine if work is being done:
// 1. File changes (primary signal)
// 2. Task completion progress
// 3. Historical productivity trend
const lastFilesChanged = toNumber(state.last_files_changed, 0);
const prevFilesChanged = toNumber(state.prev_files_changed, 0);
const hasRecentFailures = Boolean(state.failure?.count > 0);
const isProductive = lastFilesChanged > 0 && !hasRecentFailures;

// Track task completion trend
const previousTasks = state.tasks || {};
const prevUnchecked = toNumber(previousTasks.unchecked, checkboxCounts.unchecked);
const tasksCompletedSinceLastRound = prevUnchecked - checkboxCounts.unchecked;

// Calculate productivity score (0-100)
// This is evidence-based: higher score = more confidence work is happening
let productivityScore = 0;
if (lastFilesChanged > 0) productivityScore += Math.min(40, lastFilesChanged * 10);
if (tasksCompletedSinceLastRound > 0) productivityScore += Math.min(40, tasksCompletedSinceLastRound * 20);
if (prevFilesChanged > 0 && iteration > 1) productivityScore += 10; // Recent historical activity
if (!hasRecentFailures) productivityScore += 10; // No failures is a positive signal
Copilot AI (Jan 3, 2026):

The productivity score is documented as 0-100, but nothing in the code enforces the upper bound. The four components are individually capped (40 + 40 + 10 + 10), so the maximum total happens to be exactly 100 today; any future change to a weight, or an additional signal, could silently push the total past the documented range. Consider capping the final score with Math.min(100, productivityScore) after all additions.

Suggested change
if (!hasRecentFailures) productivityScore += 10; // No failures is a positive signal
if (!hasRecentFailures) productivityScore += 10; // No failures is a positive signal
// Enforce the documented 0-100 range for the productivity score
productivityScore = Math.min(100, productivityScore);

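The capped score can be exercised in isolation. This is a hypothetical standalone helper (the function name and parameter shape are illustrative, not part of the PR) that mirrors the diff's weights and clamps the total:

```javascript
// Illustrative sketch: same weights as the diff, with the cap applied at the end.
function productivityScore({ filesChanged = 0, tasksCompleted = 0, hadPrevActivity = false, hasFailures = false } = {}) {
  let score = 0;
  if (filesChanged > 0) score += Math.min(40, filesChanged * 10);      // primary signal
  if (tasksCompleted > 0) score += Math.min(40, tasksCompleted * 20);  // task progress
  if (hadPrevActivity) score += 10;                                    // recent historical activity
  if (!hasFailures) score += 10;                                       // no failures is a positive signal
  return Math.min(100, score); // enforce the documented 0-100 range
}
```

With the cap in place, even maximal inputs (`filesChanged: 5, tasksCompleted: 3, hadPrevActivity: true`) top out at 100.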

// An iteration is productive if it has a reasonable productivity score
const isProductive = productivityScore >= 20 && !hasRecentFailures;

// Early detection: Check for diminishing returns pattern
// If we had activity before but now have none, might be naturally completing
const diminishingReturns =
iteration >= 2 &&
prevFilesChanged > 0 &&
lastFilesChanged === 0 &&
tasksCompletedSinceLastRound === 0;

// max_iterations is a "stuck detection" threshold, not a hard cap
// Continue past max if productive work is happening
// But stop earlier if we detect diminishing returns pattern
const shouldStopForMaxIterations = iteration >= maxIterations && !isProductive;
const shouldStopEarly = diminishingReturns && iteration >= Math.ceil(maxIterations * 0.6);
Copilot AI (Jan 3, 2026):

The early stopping condition for diminishing returns uses Math.ceil(maxIterations * 0.6), which can evaluate to 1 when maxIterations is very small (e.g., Math.ceil(0.6) = 1 for maxIterations of 1). The diminishingReturns check itself already requires iteration >= 2, so in practice the loop cannot stop before the second iteration, but the 60% threshold silently collapses for small maxIterations. Consider making the floor explicit (e.g., iteration >= Math.max(2, Math.ceil(maxIterations * 0.6))) so the intent survives future changes to diminishingReturns.

Suggested change
const shouldStopEarly = diminishingReturns && iteration >= Math.ceil(maxIterations * 0.6);
const shouldStopEarly =
diminishingReturns &&
iteration >= Math.max(2, Math.ceil(maxIterations * 0.6));

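The suggested floor is easy to sanity-check in isolation; this sketch (the function name is illustrative, not from the PR) shows how the threshold behaves for small and typical maxIterations values:

```javascript
// Illustrative: earliest iteration at which a diminishing-returns pattern may stop the loop.
function earlyStopFloor(maxIterations) {
  // Math.max(2, ...) prevents the threshold collapsing to 1 for tiny maxIterations.
  return Math.max(2, Math.ceil(maxIterations * 0.6));
}
```

earlyStopFloor(1) and earlyStopFloor(2) both return 2, so a diminishing-returns pattern can never end the loop on the first iteration.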

// Build task appendix for the agent prompt (after state load for reconciliation info)
const taskAppendix = buildTaskAppendix(normalisedSections, checkboxCounts, state, { prBody: pr.body });
@@ -850,6 +879,10 @@ async function evaluateKeepaliveLoop({ github, context, core, payload: overrideP
} else if (allComplete) {
action = 'stop';
reason = 'tasks-complete';
} else if (shouldStopEarly) {
// Evidence-based early stopping: diminishing returns detected
action = 'stop';
reason = 'diminishing-returns';
} else if (shouldStopForMaxIterations) {
action = 'stop';
reason = isProductive ? 'max-iterations' : 'max-iterations-unproductive';
@@ -954,6 +987,14 @@ async function updateKeepaliveLoopSummary({ github, context, core, inputs }) {
const llmProvider = normalise(inputs.llm_provider ?? inputs.llmProvider);
const llmConfidence = toNumber(inputs.llm_confidence ?? inputs.llmConfidence, 0);
const llmAnalysisRun = toBool(inputs.llm_analysis_run ?? inputs.llmAnalysisRun, false);

// Quality metrics for BS detection and evidence-based decisions
const llmRawConfidence = toNumber(inputs.llm_raw_confidence ?? inputs.llmRawConfidence, llmConfidence);
const llmConfidenceAdjusted = toBool(inputs.llm_confidence_adjusted ?? inputs.llmConfidenceAdjusted, false);
Copilot AI (Jan 3, 2026):

The llmConfidenceAdjusted parameter is extracted from inputs here but is never passed from the workflow file. The workflow only passes quality-related outputs like llm_raw_confidence, llm_quality_warnings, etc., but doesn't include llm_confidence_adjusted or llm-confidence-adjusted. This means the confidence adjustment warning at line 1289 will never be displayed since llmConfidenceAdjusted will always be false. Either add the missing input parameter in the workflow file or remove this unused variable.

Suggested change
const llmConfidenceAdjusted = toBool(inputs.llm_confidence_adjusted ?? inputs.llmConfidenceAdjusted, false);
const llmConfidenceAdjusted =
Number.isFinite(llmRawConfidence) &&
Number.isFinite(llmConfidence) &&
llmRawConfidence !== llmConfidence;

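The reviewer's fallback can be factored into a small predicate. This is an illustrative sketch (not code from the PR) of deriving the "adjusted" flag from the two confidence values instead of relying on a workflow input that is never wired up:

```javascript
// Sketch: infer adjustment by comparing raw vs. final confidence.
// Number.isFinite guards against NaN from missing or malformed inputs.
function wasConfidenceAdjusted(rawConfidence, confidence) {
  return Number.isFinite(rawConfidence) &&
         Number.isFinite(confidence) &&
         rawConfidence !== confidence;
}
```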
const llmQualityWarnings = normalise(inputs.llm_quality_warnings ?? inputs.llmQualityWarnings);
const sessionDataQuality = normalise(inputs.session_data_quality ?? inputs.sessionDataQuality);
const sessionEffortScore = toNumber(inputs.session_effort_score ?? inputs.sessionEffortScore, 0);
const analysisTextLength = toNumber(inputs.analysis_text_length ?? inputs.analysisTextLength, 0);

const { state: previousState, commentId } = await loadKeepaliveState({
github,
@@ -1225,12 +1266,60 @@ async function updateKeepaliveLoopSummary({ github, context, core, inputs }) {
llmProvider === 'openai' ? 'OpenAI (fallback)' :
llmProvider === 'regex-fallback' ? 'Regex (fallback)' : llmProvider;
const confidencePercent = Math.round(llmConfidence * 100);

summaryLines.push(
'',
'### 🧠 Task Analysis',
`| Provider | ${providerIcon} ${providerLabel} |`,
`| Confidence | ${confidencePercent}% |`,
);

// Show quality metrics if available
if (sessionDataQuality) {
const qualityIcon = sessionDataQuality === 'high' ? '🟢' :
sessionDataQuality === 'medium' ? '🟡' :
sessionDataQuality === 'low' ? '🟠' : '🔴';
summaryLines.push(`| Data Quality | ${qualityIcon} ${sessionDataQuality} |`);
}
if (sessionEffortScore > 0) {
summaryLines.push(`| Effort Score | ${sessionEffortScore}/100 |`);
}

// Show BS detection warnings if confidence was adjusted
if (llmConfidenceAdjusted && llmRawConfidence !== llmConfidence) {
const rawPercent = Math.round(llmRawConfidence * 100);
summaryLines.push(
'',
`> ⚠️ **Confidence adjusted**: Raw confidence was ${rawPercent}%, adjusted to ${confidencePercent}% based on session quality metrics.`
);
}

// Show specific quality warnings if present
if (llmQualityWarnings) {
summaryLines.push(
'',
'#### Quality Warnings',
);
// Parse warnings (could be JSON array or comma-separated)
let warnings = [];
try {
warnings = JSON.parse(llmQualityWarnings);
} catch {
warnings = llmQualityWarnings.split(';').filter(w => w.trim());
}
for (const warning of warnings) {
summaryLines.push(`- ⚠️ ${warning.trim()}`);
Comment on lines +1303 to +1311
Copilot AI (Jan 3, 2026):

The warning parsing logic assumes warnings are semicolon-delimited when JSON parsing fails, but there's no guarantee that the warnings string format uses semicolons. If warnings come in a different format (comma-separated, newline-separated, or plain text), this will fail to parse them correctly. Consider adding more robust parsing that handles multiple common delimiters or documenting the expected format more explicitly.

Suggested change
// Parse warnings (could be JSON array or comma-separated)
let warnings = [];
try {
warnings = JSON.parse(llmQualityWarnings);
} catch {
warnings = llmQualityWarnings.split(';').filter(w => w.trim());
}
for (const warning of warnings) {
summaryLines.push(`- ⚠️ ${warning.trim()}`);
// Parse warnings (could be JSON array or delimited string)
let warnings = [];
let rawWarnings = String(llmQualityWarnings);
try {
const parsed = JSON.parse(llmQualityWarnings);
if (Array.isArray(parsed)) {
warnings = parsed;
} else if (typeof parsed === 'string') {
rawWarnings = parsed;
} else if (parsed && typeof parsed === 'object') {
rawWarnings = Object.values(parsed).join('\n');
}
} catch {
// Fall back to treating llmQualityWarnings as a raw delimited string
}
if (!warnings.length && rawWarnings) {
// Support multiple common delimiters: semicolon, comma, newline
warnings = String(rawWarnings)
.split(/[\n;,]/)
.map(w => w.trim())
.filter(Boolean);
}
for (const warning of warnings) {
summaryLines.push(`- ⚠️ ${String(warning).trim()}`);

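Extracted into a standalone function (hypothetical name, same strategy as the suggestion above), the parsing order is: try JSON first, then fall back to splitting on common delimiters:

```javascript
// Sketch: parse warnings that may arrive as a JSON array, a JSON string,
// or a plain string delimited by semicolons, commas, or newlines.
function parseWarnings(raw) {
  if (!raw) return [];
  try {
    const parsed = JSON.parse(raw);
    if (Array.isArray(parsed)) {
      return parsed.map(w => String(w).trim()).filter(Boolean);
    }
    if (typeof parsed === 'string') raw = parsed; // JSON-encoded string: unwrap and split below
  } catch {
    // Not JSON; treat as a raw delimited string.
  }
  return String(raw)
    .split(/[\n;,]/)
    .map(w => w.trim())
    .filter(Boolean);
}
```

parseWarnings('["a","b"]') yields the array as-is, while parseWarnings('a; b, c') splits on the delimiters.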
}
}

// Analysis data health check
if (analysisTextLength > 0 && analysisTextLength < 200 && agentFilesChanged > 0) {
Copilot AI (Jan 3, 2026):

The data loss alert uses a hardcoded threshold of 200 characters, but this threshold isn't documented or justified. Different types of analysis might naturally have different text lengths, and 200 characters seems arbitrary. Consider making this threshold configurable or adding a comment explaining why 200 characters is the chosen threshold for detecting data loss.

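No suggested change accompanied this comment. One way to address it, as an illustrative sketch only (the constant and the KEEPALIVE_MIN_ANALYSIS_CHARS environment variable are assumptions, not part of the PR), is to lift the magic number into a documented, overridable default:

```javascript
// Sketch: make the data-loss threshold configurable instead of hardcoding 200.
// 200 chars is carried over from the PR's current behaviour as the fallback default.
const MIN_ANALYSIS_TEXT_LENGTH = Number(process.env.KEEPALIVE_MIN_ANALYSIS_CHARS) || 200;

// True when the agent changed files but the analysis text is suspiciously short.
function looksLikeDataLoss(analysisTextLength, filesChanged, minLength = MIN_ANALYSIS_TEXT_LENGTH) {
  return analysisTextLength > 0 && analysisTextLength < minLength && filesChanged > 0;
}
```

A zero-length analysis is deliberately excluded: that case signals a missing analysis rather than a truncated one.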
summaryLines.push(
'',
`> 🔴 **Data Loss Alert**: Analysis text was only ${analysisTextLength} chars despite ${agentFilesChanged} file changes. Task detection may be inaccurate.`
);
}

if (llmProvider !== 'github-models') {
summaryLines.push(
'',
@@ -1297,7 +1386,12 @@ async function updateKeepaliveLoopSummary({ github, context, core, inputs }) {
failure_threshold: failureThreshold,
// Track task reconciliation for next iteration
needs_task_reconciliation: madeChangesButNoTasksChecked,
// Productivity tracking for evidence-based decisions
last_files_changed: agentFilesChanged,
prev_files_changed: toNumber(previousState?.last_files_changed, 0),
// Quality metrics for analysis validation
last_effort_score: sessionEffortScore,
last_data_quality: sessionDataQuality,
};

const summaryOutcome = runResult || summaryReason || action || 'unknown';
6 changes: 6 additions & 0 deletions .github/workflows/agents-keepalive-loop.yml
@@ -505,5 +505,11 @@ jobs:
llm_provider: '${{ needs.run-codex.outputs.llm-provider || '' }}',
llm_confidence: '${{ needs.run-codex.outputs.llm-confidence || '' }}',
llm_analysis_run: '${{ needs.run-codex.outputs.llm-analysis-run }}' === 'true',
// Quality metrics for BS detection and evidence-based decisions
llm_raw_confidence: '${{ needs.run-codex.outputs.llm-raw-confidence || '' }}',
llm_quality_warnings: '${{ needs.run-codex.outputs.llm-quality-warnings || '' }}',
session_data_quality: '${{ needs.run-codex.outputs.llm-data-quality || '' }}',
session_effort_score: '${{ needs.run-codex.outputs.llm-effort-score || '' }}',
analysis_text_length: '${{ needs.run-codex.outputs.llm-analysis-text-length || '' }}',
};
await updateKeepaliveLoopSummary({ github, context, core, inputs });