[Obs AI Assistant] Update evaluation script to consistently score all evaluation criteria#226983

Merged
SrdjanLL merged 2 commits into elastic:main from SrdjanLL:evaluation-consistent-scoring
Jul 10, 2025

Conversation

@SrdjanLL
Contributor

@SrdjanLL SrdjanLL commented Jul 8, 2025

Summary

Closes: #223422

Update the evaluation script's system instructions and scoring tool parameters to consistently score all evaluation criteria. The solution in this PR works well, so I'm jumping ahead with the PR - but happy to hear if people think there are other, better solutions.

Problem

  • Evaluation runs were flaky: the LLM sometimes omitted scores for certain criteria, leaving gaps in the evaluation output and making manual inspection more difficult. This also affected the total number of scenarios shown in the summary.
  • Consistent per-criterion scoring is required to track failures across models and scenarios.

Solution

  • Constrain the array length in the scoring tool schema to the exact number of criteria passed in, so any response lacking items is rejected by the function-calling validator.
  • Updated system prompt to enforce the scoring process.
  • Result: LLM is forced (schema + prompt) to return a complete score list; evaluation now seems more deterministic.
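The schema constraint described above can be sketched as follows. This is an illustrative TypeScript example, not the actual Kibana implementation; `buildScoresSchema`, `Criterion`, and the field names are hypothetical:

```typescript
// Hypothetical sketch: pin the scoring tool's array length to the exact
// number of criteria so the function-calling validator rejects any
// incomplete response from the LLM judge.
interface Criterion {
  index: number;
  text: string;
}

function buildScoresSchema(criteria: Criterion[]) {
  return {
    type: 'object',
    properties: {
      scores: {
        type: 'array',
        // One entry per criterion: fewer (or more) items fail validation.
        minItems: criteria.length,
        maxItems: criteria.length,
        items: {
          type: 'object',
          properties: {
            index: { type: 'number' },
            score: { type: 'number', minimum: 0, maximum: 1 },
            reasoning: { type: 'string' },
          },
          required: ['index', 'score', 'reasoning'],
        },
      },
    },
    required: ['scores'],
  };
}
```

With `minItems` and `maxItems` both set to `criteria.length`, a tool call whose `scores` array is incomplete fails schema validation, which is what forces the model to return a complete score list.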

Testing

  • Ran the evaluation script with the APM scenarios (the most inconsistent category) and confirmed that 10/10 runs produced consistent output.
  • Re-ran the evaluation framework for all scenarios and confirmed that all scenarios were scored and captured as part of the evaluation. Tested with Gemini 2.0 Flash:
-------------------------------------------
Model gemini-2-flash scored 97 out of 123
-------------------------------------------
-------------------------------------------
Model gemini-2-flash Scores per Category
-------------------------
Category: Alerts - Scored 9.5 out of 10
-------------------------
Category: APM - Scored 8.5 out of 17
-------------------------
Category: Retrieve documentation function - Scored 12.5 out of 14
-------------------------
Category: Elasticsearch function - Scored 18 out of 19
-------------------------
Category: ES|QL query generation - Scored 33.5 out of 48
-------------------------
Category: Knowledge base - Scored 15 out of 15
-------------------------------------------

Checklist

Check that the PR satisfies the following conditions.

Reviewers should verify this PR satisfies this list as well.

  • Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
  • Documentation was added for features that require explanation or tutorials
  • Unit or functional tests were updated or added to match the most common scenarios
  • If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
  • This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The release_note:breaking label should be applied in these situations.
  • Flaky Test Runner was used on any tests changed
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines
  • Review the backport guidelines and apply applicable backport:* labels.

Identify risks

Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss.

Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.

@SrdjanLL SrdjanLL requested a review from dgieselaar July 8, 2025 11:15
@SrdjanLL SrdjanLL requested a review from a team as a code owner July 8, 2025 11:15
@botelastic botelastic bot added the ci:project-deploy-observability (Create an Observability project) label Jul 8, 2025
@github-actions
Contributor

github-actions bot commented Jul 8, 2025

🤖 GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@SrdjanLL SrdjanLL added the release_note:skip (Skip the PR/issue when compiling release notes) and backport:version (Backport to applied version labels) labels Jul 8, 2025
…eters to consistently score all evaluation criteria
@SrdjanLL SrdjanLL force-pushed the evaluation-consistent-scoring branch from 42b74c3 to e08501c Compare July 8, 2025 12:06
@elasticmachine
Contributor

elasticmachine commented Jul 8, 2025

💚 Build Succeeded

  • Buildkite Build
  • Commit: afd9211
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-226983-afd92117de75

Metrics [docs]

✅ unchanged

@SrdjanLL SrdjanLL merged commit 39a5a77 into elastic:main Jul 10, 2025
12 checks passed
@kibanamachine kibanamachine added the v9.2.0 and backport missing (Added to PRs automatically when they are determined to be missing a backport.) labels Jul 10, 2025
@kibanamachine
Contributor

Friendly reminder: it looks like this PR hasn't been backported yet.
To create backports automatically, add a backport:* label, or prevent reminders by adding the backport:skip label.
You can also create backports manually by running node scripts/backport --pr 226983 locally
cc: @SrdjanLL

@viduni94 viduni94 added the backport:skip (This PR does not require backporting) label Jul 11, 2025
@kibanamachine kibanamachine removed the backport missing (Added to PRs automatically when they are determined to be missing a backport.) label Jul 11, 2025
SrdjanLL added a commit that referenced this pull request Jul 22, 2025
…valuating a criterion (#228827)

## Summary

Add fallback score when judge misses evaluating a criterion:
- The score is `0` and reasoning: `No score returned by LLM judge,
defaulting to 0.`
- While the issue of inconsistent evaluation score was mitigated by
#226983, I still found that very
rarely, the judge misses a criterion. With this change scoring has a
fallback that will return the results with 100% consistency in terms of
what was evaluated.
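
The fallback could be sketched like this. This is an illustrative
TypeScript sketch; `withFallbackScores` and its types are hypothetical
names, not the actual Kibana code:

```typescript
interface CriterionScore {
  index: number;
  score: number;
  reasoning: string;
}

// Fill in a zero score for any criterion the judge skipped, so the
// total number of evaluated criteria stays constant across runs.
function withFallbackScores(
  criteriaCount: number,
  returned: CriterionScore[]
): CriterionScore[] {
  return Array.from({ length: criteriaCount }, (_, index) => {
    const found = returned.find((s) => s.index === index);
    return (
      found ?? {
        index,
        score: 0,
        reasoning: 'No score returned by LLM judge, defaulting to 0.',
      }
    );
  });
}
```

Because the result always has `criteriaCount` entries, the `total` stays
consistent even when the judge omits a criterion.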

### Testing

- Since this inconsistency happens rarely, it is hard to reproduce
without tweaking the judge prompt to fail intentionally. To do this,
update the system prompt of the judge
([here](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts#L534))
with something like:
```
   ### Scoring Contract

  * You MUST call the function "scores" exactly once.  
  * Only and only evaluate the second criterion (reject all others).`,
```
Then you can see the fallback scores populating in the evaluation and
keeping the `total` consistent regardless of how well the `score` works.

Example from intentionally failed scoring with the prompt change above:
<img width="994" height="278" alt="image"
src="https://github.com/user-attachments/assets/d4bb94bc-4f7e-4982-95ca-cae2159d5ff7"
/>

---------

Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>
maxcold pushed a commit that referenced this pull request Jul 22, 2025
…valuating a criterion (#228827)

(Commit message identical to #228827 above.)
kdelemme pushed a commit to kdelemme/kibana that referenced this pull request Jul 23, 2025
…valuating a criterion (elastic#228827)

(Commit message identical to #228827 above.)
kertal pushed a commit to kertal/kibana that referenced this pull request Jul 25, 2025
… evaluation criteria (elastic#226983)

(Commit message identical to this PR's description above.)
kertal pushed a commit to kertal/kibana that referenced this pull request Jul 25, 2025
…valuating a criterion (elastic#228827)

(Commit message identical to #228827 above.)
crespocarlos pushed a commit to crespocarlos/kibana that referenced this pull request Jul 25, 2025
…valuating a criterion (elastic#228827)

(Commit message identical to #228827 above.)

Labels

  • backport:skip (This PR does not require backporting)
  • backport:version (Backport to applied version labels)
  • ci:project-deploy-observability (Create an Observability project)
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Obs AI Assistant] Evaluation Framework: Final summary total inaccurate due to failed tests

4 participants