Skip to content

CLI: Improve self-healing scoring observability#34699

Merged
Sidnioulz merged 2 commits into
nextfrom
yann/cummulative-self-healing-score
May 4, 2026
Merged

CLI: Improve self-healing scoring observability#34699
Sidnioulz merged 2 commits into
nextfrom
yann/cummulative-self-healing-score

Conversation

@yannbf
Copy link
Copy Markdown
Member

@yannbf yannbf commented May 4, 2026

Closes #

What I did

This PR adds logic to accumulate self-healing scoring over an agentic session

Checklist for Contributors

Testing

The changes in this PR are covered in the following automated tests:

  • stories
  • unit tests
  • integration tests
  • end-to-end tests

Manual testing

Caution

This section is mandatory for all contributions. If you believe no manual test is necessary, please state so explicitly. Thanks!

Documentation

  • Add or update documentation reflecting your changes
  • If you are deprecating/removing a feature, make sure to update
    MIGRATION.MD

Checklist for Maintainers

  • When this PR is ready for testing, make sure to add ci:normal, ci:merged or ci:daily GH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found in code/lib/cli-storybook/src/sandbox-templates.ts

  • Make sure this PR contains one of the labels below:

    Available labels
    • bug: Internal changes that fixes incorrect behavior.
    • maintenance: User-facing maintenance tasks.
    • dependencies: Upgrading (sometimes downgrading) dependencies.
    • build: Internal-facing build tooling & test updates. Will not show up in release changelog.
    • cleanup: Minor cleanup style change. Will not show up in release changelog.
    • documentation: Documentation only changes. Will not show up in release changelog.
    • feature request: Introducing a new feature.
    • BREAKING CHANGE: Changes that break compatibility in some way with current major version.
    • other: Changes that don't fit in the above categories.

🦋 Canary release

This pull request has been released as version 0.0.0-pr-34699-sha-a88c0eaa. Try it out in a new sandbox by running npx storybook@0.0.0-pr-34699-sha-a88c0eaa sandbox or in an existing project with npx storybook@0.0.0-pr-34699-sha-a88c0eaa upgrade.

More information
Published version 0.0.0-pr-34699-sha-a88c0eaa
Triggered by @yannbf
Repository storybookjs/storybook
Branch yann/cummulative-self-healing-score
Commit a88c0eaa
Datetime Mon May 4 11:28:43 UTC 2026 (1777894123)
Workflow run 25316410456

To request a new release of this pull request, mention the @storybookjs/core team.

core team members can create a new canary release here or locally with gh workflow run --repo storybookjs/storybook publish.yml --field pr=34699

Summary by CodeRabbit

  • New Features

    • Story test history is now persisted across runs, enabling cumulative metrics and combined run+cumulative analysis.
    • Telemetry and reports now provide separate run- and cumulative-scoped metrics for clearer trends; telemetry no longer includes raw story identifiers.
  • Tests

    • Test suites and assertions updated to validate the new run/cumulative metrics shape and cumulative-history behavior.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Walkthrough

Adds persistent per-story test history and integrates it into telemetry: a disk-backed cache merges run results by storyId, analyzeTestResults now returns both run- and cumulative-scoped metrics, and the Vitest telemetry reporter uses the cache to emit run + cumulative analysis.

Changes

Story Test History Tracking

Layer / File(s) Summary
Type Definitions
code/core/src/shared/utils/test-result-types.ts
Adds StoryTestResultHistoryEntry (extends StoryTestResult with timestamp), StoryTestResultHistory (map of storyId → entry), CssCheckOutcome type, and refactors TestRunAnalysis to use run* and cumulative* fields.
Type Exports
code/core/src/core-server/index.ts
Expands re-exports to include StoryTestResultHistory and StoryTestResultHistoryEntry alongside StoryTestResult.
Cache Implementation
code/addons/vitest/src/vitest-plugin/agent-story-history-cache.ts
New on-disk cache module with readStoryHistory() to load persisted history and mergeAndWriteStoryHistory(results) to merge current StoryTestResult[] into persisted history (keep newest timestamp per storyId), persist merged history, and return merged results (timestamps stripped).
Reporter Integration
code/addons/vitest/src/vitest-plugin/agent-telemetry-reporter.ts
Imports and calls mergeAndWriteStoryHistory(this.testResults) on run end and passes returned cumulative results into analyzeTestResults(this.testResults, cumulativeResults).
Core Analysis Logic
code/core/src/shared/utils/analyze-test-results.ts
Introduces summarizeResults() and refactors analyzeTestResults(results, cumulativeResults?) to produce separate run* and cumulative* metrics (including runCssCheck / cumulativeCssCheck).
Tests / Consumers
code/addons/vitest/src/vitest-plugin/agent-telemetry-reporter.test.ts, code/core/src/core-server/utils/ghost-stories/parse-vitest-report.test.ts, code/core/src/shared/utils/analyze-test-results.test.ts, code/core/src/core-server/server-channel/ghost-stories-channel.test.ts, scripts/eval/*, scripts/eval/lib/story-render.ts
Mocks mergeAndWriteStoryHistory; updates assertions to run*/cumulative* field names; adds cumulative-stat tests; adjusts consumers to read run* fields (e.g., grade/eval and story-render mapping).

Sequence Diagram

sequenceDiagram
    participant Reporter as Test Reporter
    participant Cache as History Cache
    participant Disk as Disk Storage
    participant Analyzer as Analysis Engine
    participant Telemetry as Telemetry Sink

    Reporter->>Reporter: Collect test results for current run
    Reporter->>Cache: mergeAndWriteStoryHistory(runResults)
    Cache->>Disk: Read persisted history
    Disk-->>Cache: Return cached entries
    Cache->>Cache: Merge run results into history (keep newest timestamp per storyId)
    Cache->>Disk: Write updated history back
    Cache-->>Reporter: Return merged cumulative results
    Reporter->>Analyzer: analyzeTestResults(runResults, cumulativeResults)
    Analyzer->>Analyzer: Compute run* metrics from runResults and cumulative* from cumulativeResults
    Analyzer-->>Reporter: Return TestRunAnalysis with run* and cumulative* fields
    Reporter->>Telemetry: Emit telemetry with run and cumulative stats
Loading

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Review rate limit: 4/5 reviews remaining, refill in 12 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
code/addons/vitest/src/vitest-plugin/agent-telemetry-reporter.ts (1)

68-93: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard cache failures so telemetry/reporter state stays stable.

If history merge fails, onTestRunEnd can abort before telemetry emission and before resetting this.testResults. Add fallback analysis + finally reset.

💡 Proposed resiliency patch
   async onTestRunEnd(
     testModules: readonly TestModule[],
     unhandledErrors: readonly SerializedError[]
   ) {
-    // Merge the current run into the persisted per-story history (kept on
-    // disk only — storyIds never enter telemetry) and use the merged set
-    // to compute cumulative stats across runs.
-    const cumulativeResults = await mergeAndWriteStoryHistory(this.testResults);
-    const analysis = analyzeTestResults(this.testResults, cumulativeResults);
+    let analysis = analyzeTestResults(this.testResults);
+    try {
+      // Merge the current run into the persisted per-story history (kept on
+      // disk only — storyIds never enter telemetry) and use the merged set
+      // to compute cumulative stats across runs.
+      const cumulativeResults = await mergeAndWriteStoryHistory(this.testResults);
+      analysis = analyzeTestResults(this.testResults, cumulativeResults);
+    } catch {
+      // Best-effort telemetry: keep run-scoped analysis when cache is unavailable.
+    }
     const duration = Date.now() - this.startTime;
@@
-    // Reset for next run (watch mode)
-    this.testResults = [];
+    // Reset for next run (watch mode)
+    this.testResults = [];
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@code/addons/vitest/src/vitest-plugin/agent-telemetry-reporter.ts` around
lines 68 - 93, Wrap the history merge + analysis in a try/catch so a failing
merge does not abort telemetry emission or leave state dirty: call
mergeAndWriteStoryHistory(this.testResults) inside try, compute analysis =
analyzeTestResults(this.testResults, cumulativeResults) normally, but in catch
set a safe fallback (e.g., empty cumulativeResults / empty history) and run
analyzeTestResults with that fallback, and include the caught error (mergeError)
in the telemetry payload if helpful; always call telemetry afterwards and use a
finally block to reset this.testResults = [] so watch-mode state is cleared even
on errors.
🧹 Nitpick comments (1)
code/addons/vitest/src/vitest-plugin/agent-telemetry-reporter.test.ts (1)

15-17: ⚡ Quick win

Use a spy mock declaration here and keep behavior in beforeEach.

This inline factory mock conflicts with the test mocking rules and duplicates behavior already configured in beforeEach.

✅ Suggested adjustment
-vi.mock('./agent-story-history-cache.ts', () => ({
-  mergeAndWriteStoryHistory: vi.fn(async (results) => results),
-}));
+vi.mock('./agent-story-history-cache.ts', { spy: true });

As per coding guidelines, "Use vi.mock() with the spy: true option for all package and file mocks in Vitest tests" and "Implement mock behaviors in beforeEach blocks in Vitest tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@code/addons/vitest/src/vitest-plugin/agent-telemetry-reporter.test.ts` around
lines 15 - 17, Replace the inline factory mock for agent-story-history-cache.ts
with a spy-style mock and move the implementation into the test setup: call
vi.mock('./agent-story-history-cache.ts', undefined, { spy: true }) (so the real
exported names are spied) or otherwise enable { spy: true } for that module,
remove the inline async factory that returns mergeAndWriteStoryHistory:
vi.fn(async (results) => results), and in beforeEach assign the behavior via the
spy (e.g., set mergeAndWriteStoryHistory.mockResolvedValue or mockImplementation
to return results) so the mock declaration is only a spy and the actual behavior
is configured in beforeEach.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@code/addons/vitest/src/vitest-plugin/agent-story-history-cache.ts`:
- Around line 24-35: The current read/merge/write in the block around
readStoryHistory(), history and historyCache.set(CACHE_KEY, ...) is not atomic
and can lose concurrent updates; change it to an atomic update with a retry/CAS
or lock: fetch the current cache from historyCache.get(CACHE_KEY) (or use
historyCache's lock/CAS API if available), merge in results into that latest
snapshot (using result.storyId keys and timestamp logic currently in the loop),
then attempt to write back only if the snapshot hasn't changed (or hold the
lock) and retry a few times on conflict; update the code paths that call
readStoryHistory()/historyCache.set(CACHE_KEY, ...) so they use this looped CAS
or an acquired lock to guarantee no concurrent overwrite of other reporters'
updates.

---

Outside diff comments:
In `@code/addons/vitest/src/vitest-plugin/agent-telemetry-reporter.ts`:
- Around line 68-93: Wrap the history merge + analysis in a try/catch so a
failing merge does not abort telemetry emission or leave state dirty: call
mergeAndWriteStoryHistory(this.testResults) inside try, compute analysis =
analyzeTestResults(this.testResults, cumulativeResults) normally, but in catch
set a safe fallback (e.g., empty cumulativeResults / empty history) and run
analyzeTestResults with that fallback, and include the caught error (mergeError)
in the telemetry payload if helpful; always call telemetry afterwards and use a
finally block to reset this.testResults = [] so watch-mode state is cleared even
on errors.

---

Nitpick comments:
In `@code/addons/vitest/src/vitest-plugin/agent-telemetry-reporter.test.ts`:
- Around line 15-17: Replace the inline factory mock for
agent-story-history-cache.ts with a spy-style mock and move the implementation
into the test setup: call vi.mock('./agent-story-history-cache.ts', undefined, {
spy: true }) (so the real exported names are spied) or otherwise enable { spy:
true } for that module, remove the inline async factory that returns
mergeAndWriteStoryHistory: vi.fn(async (results) => results), and in beforeEach
assign the behavior via the spy (e.g., set
mergeAndWriteStoryHistory.mockResolvedValue or mockImplementation to return
results) so the mock declaration is only a spy and the actual behavior is
configured in beforeEach.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 90f9930f-19a2-40a9-8180-f6dcd24f2ea7

📥 Commits

Reviewing files that changed from the base of the PR and between 5393fa9 and 1804ced.

📒 Files selected for processing (8)
  • code/addons/vitest/src/vitest-plugin/agent-story-history-cache.ts
  • code/addons/vitest/src/vitest-plugin/agent-telemetry-reporter.test.ts
  • code/addons/vitest/src/vitest-plugin/agent-telemetry-reporter.ts
  • code/core/src/core-server/index.ts
  • code/core/src/core-server/utils/ghost-stories/parse-vitest-report.test.ts
  • code/core/src/shared/utils/analyze-test-results.test.ts
  • code/core/src/shared/utils/analyze-test-results.ts
  • code/core/src/shared/utils/test-result-types.ts

Comment thread code/addons/vitest/src/vitest-plugin/agent-story-history-cache.ts
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
code/core/src/core-server/server-channel/ghost-stories-channel.test.ts (1)

323-330: ⚡ Quick win

Also assert cumulative* metrics in the failure-path payload.

On Line 323 onward, this test only validates run* metrics. Since cumulative scoring is the core behavior change, asserting cumulative* fields here would prevent regressions in failed-run scenarios.

Proposed test expectation update
           results: expect.objectContaining({
             runTotal: 2,
             runPassed: 0,
             runSuccessRate: 0,
             // runCategorizedErrors is an object keyed by error category
             runCategorizedErrors: expect.any(Object),
             runCssCheck: 'not-run',
             runUniqueErrorCount: expect.any(Number),
             runPassedButEmptyRender: 0,
+            cumulativeTotal: 2,
+            cumulativePassed: 0,
+            cumulativeSuccessRate: 0,
+            cumulativeSuccessRateWithoutEmptyRender: 0,
+            cumulativeCategorizedErrors: expect.any(Object),
+            cumulativeCssCheck: 'not-run',
+            cumulativeUniqueErrorCount: expect.any(Number),
+            cumulativePassedButEmptyRender: 0,
           }),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@code/core/src/core-server/server-channel/ghost-stories-channel.test.ts`
around lines 323 - 330, The test in ghost-stories-channel.test.ts currently
asserts only the run* metrics (e.g., runTotal, runPassed, runSuccessRate,
runCategorizedErrors, runCssCheck, runUniqueErrorCount, runPassedButEmptyRender)
for the failure-path payload; update the expectation to also assert the
corresponding cumulative* metrics (e.g., cumulativeTotal, cumulativePassed,
cumulativeSuccessRate, cumulativeCategorizedErrors, cumulativeCssCheck,
cumulativeUniqueErrorCount, cumulativePassedButEmptyRender) using the same
matcher types (expect.any(Number) or expect.any(Object) or exact string values)
so the failure-path verifies cumulative scoring fields alongside the run
metrics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@code/core/src/core-server/server-channel/ghost-stories-channel.test.ts`:
- Around line 323-330: The test in ghost-stories-channel.test.ts currently
asserts only the run* metrics (e.g., runTotal, runPassed, runSuccessRate,
runCategorizedErrors, runCssCheck, runUniqueErrorCount, runPassedButEmptyRender)
for the failure-path payload; update the expectation to also assert the
corresponding cumulative* metrics (e.g., cumulativeTotal, cumulativePassed,
cumulativeSuccessRate, cumulativeCategorizedErrors, cumulativeCssCheck,
cumulativeUniqueErrorCount, cumulativePassedButEmptyRender) using the same
matcher types (expect.any(Number) or expect.any(Object) or exact string values)
so the failure-path verifies cumulative scoring fields alongside the run
metrics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 06481320-031d-4ada-9283-a383c2671c2a

📥 Commits

Reviewing files that changed from the base of the PR and between 1804ced and a88c0ea.

📒 Files selected for processing (3)
  • code/core/src/core-server/server-channel/ghost-stories-channel.test.ts
  • scripts/eval/lib/grade.ts
  • scripts/eval/lib/story-render.ts

Comment thread code/core/src/shared/utils/analyze-test-results.ts
Comment thread code/core/src/shared/utils/analyze-test-results.test.ts
@Sidnioulz Sidnioulz merged commit 250db3f into next May 4, 2026
124 checks passed
@Sidnioulz Sidnioulz deleted the yann/cummulative-self-healing-score branch May 4, 2026 13:49
@github-actions github-actions Bot mentioned this pull request May 4, 2026
14 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants