[Obs AI Assistant] Update evaluation script to consistently score all evaluation criteria #226983
Merged
SrdjanLL merged 2 commits into elastic:main on Jul 10, 2025
Conversation
Update evaluation script's system instructions and scoring tool parameters to consistently score all evaluation criteria
42b74c3 to e08501c
Contributor
💚 Build Succeeded
viduni94 approved these changes on Jul 9, 2025
Contributor
Friendly reminder: Looks like this PR hasn’t been backported yet.
SrdjanLL added a commit that referenced this pull request on Jul 22, 2025
Add fallback score when judge misses evaluating a criterion (#228827)

## Summary

Add a fallback score for when the judge misses evaluating a criterion:

- The score is `0`, with the reasoning: `No score returned by LLM judge, defaulting to 0.`
- While the issue of inconsistent evaluation scores was mitigated by #226983, I still found that, very rarely, the judge misses a criterion. With this change, scoring has a fallback that returns results with 100% consistency in terms of what was evaluated.

### Testing

Since this inconsistency happens rarely, it is hard to reproduce without tweaking the judge prompt to fail intentionally, e.g. by updating the system prompt of the judge ([here](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts#L534)) with something like:

```
### Scoring Contract
* You MUST call the function "scores" exactly once.
* Only and only evaluate the second criterion (reject all others).
```

You can then see the fallback scores populating the evaluation and keeping the `total` consistent regardless of how well the `score` works. Example from intentionally failed scoring with the prompt change above:

<img width="994" height="278" alt="image" src="https://github.com/user-attachments/assets/d4bb94bc-4f7e-4982-95ca-cae2159d5ff7" />

Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>
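The fallback described above could be sketched roughly as follows. This is a minimal TypeScript sketch under stated assumptions: `withFallbackScores` and the `CriterionScore` shape are hypothetical names for illustration, not the actual helpers in `kibana_client.ts`.

```typescript
interface CriterionScore {
  index: number;
  score: number;
  reasoning: string;
}

// Hypothetical helper: guarantee one score per criterion, defaulting any
// criterion the judge skipped to 0 with the fallback reasoning.
function withFallbackScores(
  criteriaCount: number,
  judged: CriterionScore[]
): CriterionScore[] {
  return Array.from({ length: criteriaCount }, (_, index) => {
    const existing = judged.find((s) => s.index === index);
    return (
      existing ?? {
        index,
        score: 0,
        reasoning: 'No score returned by LLM judge, defaulting to 0.',
      }
    );
  });
}
```

Because the output array always has `criteriaCount` entries, the `total` stays consistent no matter how many criteria the judge actually scored.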
maxcold pushed a commit that referenced this pull request on Jul 22, 2025
Add fallback score when judge misses evaluating a criterion (#228827)
kdelemme pushed a commit to kdelemme/kibana that referenced this pull request on Jul 23, 2025
Add fallback score when judge misses evaluating a criterion (elastic#228827)
kertal pushed a commit to kertal/kibana that referenced this pull request on Jul 25, 2025
Update evaluation script to consistently score all evaluation criteria (elastic#226983)

## Summary

Closes: elastic#223422

Update the evaluation script's system instructions and scoring tool parameters to consistently score all evaluation criteria. The solution in this PR works well, so I'm jumping ahead with the PR, but I'm happy to hear if people think there are other, better solutions.

### Problem

- Evaluation runs were flaky: the LLM sometimes skipped showing scores for certain criteria, leading to missing details in the evaluation output and making manual inspection more difficult. This also affected the total number of scenarios shown in the summary.
- Consistent per-criterion scoring is required to track failures across models and scenarios.

### Solution

- Use the number of criteria passed in to pin the length of the array in the scoring tool schema to that exact number, so any response lacking items is rejected by the function-calling validator.
- Updated the system prompt to enforce the scoring process.
- Result: the LLM is forced (schema + prompt) to return a complete score list; evaluation now seems more deterministic.

### Testing

- Ran the evaluation script with the APM scenarios (the most inconsistent ones) and confirmed that 10/10 runs produced consistent output.
- Re-ran the evaluation framework for all scenarios and confirmed that all scenarios were scored and captured as part of the evaluation.

Tested with Gemini 2.0 Flash:

```bash
-------------------------------------------
Model gemini-2-flash scored 97 out of 123
-------------------------------------------
-------------------------------------------
Model gemini-2-flash Scores per Category
-------------------------
Category: Alerts - Scored 9.5 out of 10
-------------------------
Category: APM - Scored 8.5 out of 17
-------------------------
Category: Retrieve documentation function - Scored 12.5 out of 14
-------------------------
Category: Elasticsearch function - Scored 18 out of 19
-------------------------
Category: ES|QL query generation - Scored 33.5 out of 48
-------------------------
Category: Knowledge base - Scored 15 out of 15
-------------------------------------------
```

### Checklist

Check the PR satisfies following conditions. Reviewers should verify this PR satisfies this list as well.

- [ ] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/src/platform/packages/shared/kbn-i18n/README.md)
- [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
- [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations.
- [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed
- [ ] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
- [ ] Review the [backport guidelines](https://docs.google.com/document/d/1VyN5k91e5OVumlc0Gb9RPa3h1ewuPE705nRtioPiTvY/edit?usp=sharing) and apply applicable `backport:*` labels.

### Identify risks

Does this PR introduce any risks? For example, consider risks like hard-to-test bugs, performance regression, or potential data loss. Describe the risk, its severity, and the mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.

- [ ] [See some risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx)
- [ ] ...
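The schema-side idea from the Solution section (pinning the scores array to the exact number of criteria) could look roughly like the sketch below. This is a minimal TypeScript illustration under assumptions: `buildScoresToolSchema` and the property names are hypothetical, not the actual parameter schema used by the evaluation framework.

```typescript
// Hypothetical sketch: build a JSON-Schema-style parameter definition for the
// "scores" tool. Setting minItems and maxItems to the criteria count means a
// function-calling validator rejects any response with missing (or extra) scores.
function buildScoresToolSchema(criteriaCount: number) {
  return {
    type: 'object',
    properties: {
      criteria: {
        type: 'array',
        minItems: criteriaCount, // reject responses with fewer scores...
        maxItems: criteriaCount, // ...or more scores than criteria
        items: {
          type: 'object',
          properties: {
            index: { type: 'number' },
            score: { type: 'number' },
            reasoning: { type: 'string' },
          },
          required: ['index', 'score', 'reasoning'],
        },
      },
    },
    required: ['criteria'],
  };
}
```

With the bounds derived from the input rather than hard-coded, an incomplete score list fails schema validation instead of silently producing a partial evaluation.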
kertal pushed a commit to kertal/kibana that referenced this pull request on Jul 25, 2025
Add fallback score when judge misses evaluating a criterion (elastic#228827)
crespocarlos pushed a commit to crespocarlos/kibana that referenced this pull request on Jul 25, 2025
Add fallback score when judge misses evaluating a criterion (elastic#228827)