Added AI Insight evals #263561

Merged
yuliia-fryshko merged 30 commits into elastic:main from yuliia-fryshko:ai-insight-evals-505
May 6, 2026

Conversation

@yuliia-fryshko
Contributor

@yuliia-fryshko yuliia-fryshko commented Apr 15, 2026

Closes https://github.com/elastic/obs-ai-team/issues/533
Closes https://github.com/elastic/obs-ai-team/issues/536
Closes https://github.com/elastic/obs-ai-team/issues/534
Closes https://github.com/elastic/obs-ai-team/issues/535

This PR introduces an evaluation dataset along with corresponding tests for AI Insights across different scenarios.

Added:

  1. Error AI Insights eval tests with the productCatalogFailure feature
  2. Alert AI Insights eval tests with the paymentUnreachable scenario
  3. Logs AI Insights eval tests with productCatalog and paymentUnreachable scenarios

These tests aim to improve coverage and ensure consistent evaluation across key AI Insights use cases.

@yuliia-fryshko yuliia-fryshko self-assigned this Apr 15, 2026
@yuliia-fryshko yuliia-fryshko added the release_note:skip Skip the PR/issue when compiling release notes label Apr 15, 2026
@yuliia-fryshko yuliia-fryshko requested a review from a team as a code owner April 15, 2026 16:53
@yuliia-fryshko yuliia-fryshko added backport:version Backport to applied version labels v9.4.0 evals:observability-ai Run the observability-ai evals @kbn/evals models:judge:eis/google-gemini-3.1-pro Override LLM-as-a-judge connector for evals: eis/google-gemini-3.1-pro models:weekly-eis-models Run evals against the weekly EIS model set (see eval_pipeline.ts) labels Apr 15, 2026
@elastic elastic deleted a comment from elasticmachine Apr 21, 2026
@yuliia-fryshko yuliia-fryshko changed the title Added Error AI Insight evals for Product Catalog failure Added AI Insight evals Apr 23, 2026
Contributor

@SrdjanLL SrdjanLL left a comment


Great work!

I just left some minor comments (mainly questions and a suggestion for avoiding bespoke polling implementation).

const deadline = Date.now() + ALERT_POLL_TIMEOUT_MS;

await esClient.indices.refresh({ index: scenario.alertRule.alertsIndex });
while (Date.now() < deadline) {
Contributor


I suggest using pRetry here for polling with exponential backoff, similar to how it's done here.
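As a hedged sketch of what the suggestion amounts to: the `p-retry` package expresses exactly this pattern through its `retries`, `factor`, and `minTimeout` options. The helper below is an illustrative hand-rolled equivalent, not the PR's actual code; the name and defaults are assumptions.

```typescript
// Illustrative polling helper with exponential backoff, standing in for what
// `p-retry` provides via its `retries`, `factor`, and `minTimeout` options.
// The function name and default values are assumptions, not the PR's code.
async function pollWithBackoff<T>(
  fn: () => Promise<T>,
  { retries = 5, minTimeoutMs = 500, factor = 2 } = {}
): Promise<T> {
  let delay = minTimeoutMs;
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // remember the most recent failure
      if (attempt === retries) break; // out of attempts
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay *= factor; // back off exponentially between polls
    }
  }
  throw lastError;
}
```

With `p-retry` this collapses to `pRetry(fn, { retries, factor, minTimeout })`, and the bespoke `Date.now() < deadline` loop goes away.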

* The AI insight endpoints return SSE (Server-Sent Events) streams.
* This parses the raw SSE text into the summary and context fields.
*/
function parseSseResponse(raw: unknown): AiInsightResponse {
Contributor


So I assume this was the root cause of us not having AI Insights responses in the task-under-evaluation payloads?

Contributor Author


Yes, that was exactly it :)
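For readers outside the PR, a minimal sketch of that kind of SSE parsing follows. The `summary`/`context` event names and the exact response shape are assumptions based on the snippet above, not the actual `parseSseResponse` implementation.

```typescript
// Illustrative SSE parser: splits the raw stream into frames (separated by
// blank lines), reads each frame's "event:" and "data:" lines, and routes
// the data into the summary/context fields. Event names are assumptions.
interface AiInsightResponse {
  summary: string;
  context: string;
}

function parseSse(raw: string): AiInsightResponse {
  const result: AiInsightResponse = { summary: '', context: '' };
  for (const frame of raw.split('\n\n')) {
    let event = 'message';
    const dataLines: string[] = [];
    for (const line of frame.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
    }
    if (dataLines.length === 0) continue; // frame carried no payload
    const data = dataLines.join('\n');
    if (event === 'summary') result.summary += data;
    else if (event === 'context') result.context += data;
  }
  return result;
}
```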

- Validate that the error is properly handled and does not impact payment processing for valid tokens.
- If no further errors occur, monitor for recurrence but no urgent action is required. If errors increase, investigate token validation logic and upstream authentication flows.`;

const PAYMENT_UNREACHABLE_ALERT_EXPECTED = `- Summary: An APM error count alert fired for the frontend service because the payment service is unreachable. The checkout flow fails with a gRPC Unavailable error ("name resolver error: produced zero addresses") when attempting to charge a card via the payment service. This is a connectivity or infrastructure failure, not an application code defect.
Contributor


[Question] Just curious if you tweaked the expected responses for all insights based on our preferences/expectations or this is an actual response from the AI Insight API?

Contributor Author


Good question, @SrdjanLL! I took an answer from Claude Opus and tweaked the wording a bit.

@yuliia-fryshko yuliia-fryshko requested review from a team as code owners April 28, 2026 13:49
@macroscopeapp
Contributor

macroscopeapp Bot commented Apr 28, 2026

Catch flakiness early (recommended): run the flaky test runner against this PR before merging.

This PR unskips a previously-flaky Scout test (landing.spec.ts, ref #253824) with new retry timing, and adds a brand-new FTR integration test (search_rules.ts) loaded by both ESS and serverless configs.

Trigger a run with the Flaky Test Runner UI or post this comment on the PR:

/flaky scoutConfig:x-pack/solutions/observability/plugins/observability/test/scout/ui/parallel.playwright.config.ts:30 ftrConfig:x-pack/solutions/security/test/security_solution_api_integration/test_suites/detections_response/rules_management/rule_read/trial_license_complete_tier/configs/ess.config.ts:30 ftrConfig:x-pack/solutions/security/test/security_solution_api_integration/test_suites/detections_response/rules_management/rule_read/trial_license_complete_tier/configs/serverless.config.ts:30

Share feedback in the #appex-qa channel.

Posted via Macroscope — Flaky Test Runner nudge

@yuliia-fryshko yuliia-fryshko removed request for a team, dplumlee and rylnd April 28, 2026 14:00
@elastic elastic deleted a comment from elasticmachine Apr 28, 2026
Comment on lines +56 to +59
await kbnClient.request<void>({
method: 'POST',
path: `/internal/alerting/rule/${ruleId}/_run_soon`,
});
Contributor


_run_soon sits inside the pRetry callback, so it fires on every poll iteration.
IIRC this will queue up rule runs, while we only need the rule to trigger once and the polling should just wait for the alert to appear.

I think it's worth moving this outside of the pRetry block.
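A rough sketch of the suggested shape, with `_run_soon` fired once up front and only the alert lookup inside the retry callback. The client type and the `poll` helper here are stand-ins, not Kibana's real interfaces.

```typescript
// Sketch: trigger the rule a single time, then let the retry loop only wait
// for the alert to appear. `Client`, `findAlert`, and `poll` are illustrative
// stand-ins for the PR's kbnClient and pRetry-based polling.
type Client = { request: (opts: { method: string; path: string }) => Promise<void> };

async function triggerOnceThenPoll(
  kbnClient: Client,
  ruleId: string,
  findAlert: () => Promise<string | undefined>,
  poll: <T>(fn: () => Promise<T>) => Promise<T>
): Promise<string> {
  // Fire the rule exactly once, outside the retry callback...
  await kbnClient.request({
    method: 'POST',
    path: `/internal/alerting/rule/${ruleId}/_run_soon`,
  });
  // ...then poll only for the alert document to show up.
  return poll(async () => {
    const alertId = await findAlert();
    if (!alertId) throw new Error('Alert not yet available');
    return alertId;
  });
}
```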

Contributor


Also, I noticed the CI is failing with:


Error: Alert not yet available

  76 |           const alertDoc = alertsResponse.hits.hits[0];
  77 |           if (!alertDoc) {
> 78 |             throw new Error('Alert not yet available');
     |                   ^
  79 |           }
  80 |           return alertDoc._id as string;
  81 |         },
Do you think that's just a polling error? When you run snapshot replay manually (using CLI), are you able to see the alert?

Contributor Author


Thanks, @SrdjanLL, for the review and comments. I'm looking into why this can happen; locally it worked fine.

@elastic elastic deleted a comment from kibanamachine Apr 30, 2026
@elastic elastic deleted a comment from kibanamachine May 4, 2026
@github-actions
Contributor

github-actions Bot commented May 5, 2026

@yuliia-fryshko, it looks like you're updating the parameters for a rule type!

Please review the guidelines for making additive changes to rule type parameters and determine if your changes require an intermediate release.

@kibanamachine
Contributor

kibanamachine commented May 6, 2026

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

cc @yuliia-fryshko

Contributor

@SrdjanLL SrdjanLL left a comment


The new scenarios LGTM (as long as they are 🟢 on CI 🙂)!

For visibility, c9dc50d removes a failing scenario whose alert wasn't triggering on CI, so that some of the work lands before @yuliia-fryshko's PTO. The removed scenario is tracked separately via https://github.com/elastic/obs-ai-team/issues/537, and I've added it to the current iteration.

@yuliia-fryshko yuliia-fryshko added backport:skip This PR does not require backporting v9.5.0 and removed backport:version Backport to applied version labels v9.4.0 labels May 6, 2026
@yuliia-fryshko yuliia-fryshko merged commit 32dff45 into elastic:main May 6, 2026
65 checks passed