[Obs AI Assistant] Evaluation: Add fallback score when judge misses evaluating a criterion#228827

Merged
SrdjanLL merged 4 commits into elastic:main from SrdjanLL:evaluation-fallback-score
Jul 22, 2025

Conversation

@SrdjanLL
Contributor

Summary

Add fallback score when the judge misses evaluating a criterion:

  • The score is `0` and the reasoning is `No score returned by LLM judge, defaulting to 0.`
  • While the issue of inconsistent evaluation scores was mitigated by #226983, I still found that, very rarely, the judge misses a criterion. With this change, scoring has a fallback that returns results with 100% consistency in terms of what was evaluated.

Testing

  • Since this inconsistency happens rarely, it is hard to reproduce without tweaking the judge prompt to fail intentionally. Update the system prompt of the judge (here) with something like:

    ### Scoring Contract

    * You MUST call the function "scores" exactly once.
    * Only evaluate the second criterion (reject all others).

Then you can see the fallback scores populating the evaluation and keeping the `total` consistent regardless of how well the `score` works.

Example from intentionally failed scoring with the prompt change above:
image
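The fallback described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual Kibana implementation: the `CriterionScore` shape and the `withFallbackScores` helper are assumed names, but the fallback values (`score: 0` and the default reasoning string) are the ones the PR describes.

```typescript
interface CriterionScore {
  index: number;
  score: number;
  reasoning: string;
}

// Fill in a fallback entry for any criterion the LLM judge did not
// score, so the total is always computed over the full set of criteria.
function withFallbackScores(
  criteria: string[],
  judged: CriterionScore[]
): CriterionScore[] {
  return criteria.map((_criterion, index) => {
    const existing = judged.find((s) => s.index === index);
    // Keep the judge's score if present; otherwise default to 0.
    return (
      existing ?? {
        index,
        score: 0,
        reasoning: 'No score returned by LLM judge, defaulting to 0.',
      }
    );
  });
}
```

With three criteria and a judge response that only scored the second one, this returns three entries: the judged score plus two zero-score fallbacks, so the `total` stays consistent.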

@SrdjanLL SrdjanLL added the release_note:skip Skip the PR/issue when compiling release notes label Jul 21, 2025
@SrdjanLL SrdjanLL requested a review from a team as a code owner July 21, 2025 16:32
@SrdjanLL SrdjanLL added backport:skip This PR does not require backporting Team:Obs AI Assistant Observability AI Assistant labels Jul 21, 2025
@botelastic botelastic bot added the ci:project-deploy-observability Create an Observability project label Jul 21, 2025
@elasticmachine
Contributor

Pinging @elastic/obs-ai-assistant (Team:Obs AI Assistant)

@github-actions
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

Contributor

@sorenlouv sorenlouv left a comment


nit: The description for the index property is "The number of the criterion". We should clarify this to "The index number of the criterion".

@SrdjanLL SrdjanLL enabled auto-merge (squash) July 22, 2025 08:00
@SrdjanLL
Contributor Author

@sorenlouv - thanks for the feedback, I addressed it all :)

@elasticmachine
Contributor

elasticmachine commented Jul 22, 2025

💛 Build succeeded, but was flaky

  • Buildkite Build
  • Commit: 385f062
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-228827-385f0624be75

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #16 / EQL Tab rendering pagination should load notes for current page only

Metrics [docs]

✅ unchanged

History

@SrdjanLL SrdjanLL merged commit 0e71381 into elastic:main Jul 22, 2025
12 checks passed
maxcold pushed a commit that referenced this pull request Jul 22, 2025
[Obs AI Assistant] Evaluation: Add fallback score when judge misses evaluating a criterion (#228827)

## Summary

Add fallback score when judge misses evaluating a criterion:
- The score is `0` and reasoning: `No score returned by LLM judge,
defaulting to 0.`
- While the issue of inconsistent evaluation scores was mitigated by
#226983, I still found that, very rarely,
the judge misses a criterion. With this change, scoring has a
fallback that returns results with 100% consistency in terms of
what was evaluated.

### Testing

- Since this inconsistency happens rarely, it is hard to
reproduce without tweaking the judge prompt to fail intentionally. Update
the system prompt of the judge
([here](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts#L534))
with something like:
```
   ### Scoring Contract

  * You MUST call the function "scores" exactly once.
  * Only evaluate the second criterion (reject all others).
```
Then you can see the fallback scores populating the evaluation and
keeping the `total` consistent regardless of how well the `score` works.

Example from intentionally failed scoring with the prompt change above:
<img width="994" height="278" alt="image"
src="https://github.com/user-attachments/assets/d4bb94bc-4f7e-4982-95ca-cae2159d5ff7"
/>

---------

Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>
kdelemme pushed a commit to kdelemme/kibana that referenced this pull request Jul 23, 2025
[Obs AI Assistant] Evaluation: Add fallback score when judge misses evaluating a criterion (elastic#228827)
kertal pushed a commit to kertal/kibana that referenced this pull request Jul 25, 2025
[Obs AI Assistant] Evaluation: Add fallback score when judge misses evaluating a criterion (elastic#228827)
crespocarlos pushed a commit to crespocarlos/kibana that referenced this pull request Jul 25, 2025
[Obs AI Assistant] Evaluation: Add fallback score when judge misses evaluating a criterion (elastic#228827)

Labels

  • backport:skip (This PR does not require backporting)
  • ci:project-deploy-observability (Create an Observability project)
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • Team:Obs AI Assistant (Observability AI Assistant)
  • v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants