[Obs AI Assistant] Evaluation: Add fallback score when judge misses evaluating a criterion (#228827)
Merged
SrdjanLL merged 4 commits into elastic:main on Jul 22, 2025
Conversation
Contributor
Pinging @elastic/obs-ai-assistant (Team:Obs AI Assistant)
sorenlouv reviewed on Jul 21, 2025
...ons/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts
sorenlouv approved these changes on Jul 21, 2025
sorenlouv (Contributor) reviewed on Jul 21, 2025 and left a comment:
nit: The description for the index property is "The number of the criterion". We should clarify this to "The index number of the criterion".
Contributor (Author)
@sorenlouv - thanks for the feedback, I addressed it all :)
Contributor
💛 Build succeeded, but was flaky
maxcold pushed a commit that referenced this pull request on Jul 22, 2025
…valuating a criterion (#228827)

## Summary

Add a fallback score when the judge misses evaluating a criterion:

- The fallback score is `0`, with the reasoning: `No score returned by LLM judge, defaulting to 0.`
- While the issue of inconsistent evaluation scores was mitigated by #226983, I still found that, very rarely, the judge misses a criterion. With this change, scoring has a fallback that returns results with 100% consistency in terms of what was evaluated.

### Testing

Since this inconsistency happens rarely, it is hard to reproduce without tweaking the judge prompt to fail intentionally. You can do this by updating the system prompt of the judge ([here](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts#L534)) with something like:

```
### Scoring Contract
* You MUST call the function "scores" exactly once.
* Only and only evaluate the second criterion (reject all others).
```

Then you can see the fallback scores populating the evaluation, keeping the `total` consistent regardless of how well the `score` works. Example from intentionally failed scoring with the prompt change above:

<img width="994" height="278" alt="image" src="https://github.com/user-attachments/assets/d4bb94bc-4f7e-4982-95ca-cae2159d5ff7" />

Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>
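The fallback behavior described above can be sketched in TypeScript. Note this is an illustrative sketch, not the actual Kibana implementation: the `CriterionScore` shape and the `withFallbackScores` name are assumptions; only the fallback reasoning string is taken from the PR.

```typescript
// Hypothetical types and function names; only the fallback reasoning
// string below comes from the actual change.
interface CriterionScore {
  index: number; // the index number of the criterion
  score: number;
  reasoning: string;
}

// For every criterion the judge was asked to evaluate, keep the judge's
// score if one was returned; otherwise fill in a score of 0 so the total
// always covers every criterion.
function withFallbackScores(
  criteria: string[],
  judged: CriterionScore[]
): CriterionScore[] {
  return criteria.map((_, index) => {
    const found = judged.find((s) => s.index === index);
    return (
      found ?? {
        index,
        score: 0,
        reasoning: 'No score returned by LLM judge, defaulting to 0.',
      }
    );
  });
}
```

With this in place, a judge response that skips criteria still yields one score per criterion, so the `total` stays consistent across runs.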
kdelemme pushed a commit to kdelemme/kibana that referenced this pull request on Jul 23, 2025
kertal pushed a commit to kertal/kibana that referenced this pull request on Jul 25, 2025
crespocarlos pushed a commit to crespocarlos/kibana that referenced this pull request on Jul 25, 2025