[Obs AI Assistant] Update evaluation script to consistently score all evaluation criteria #226983
Merged
SrdjanLL merged 2 commits into elastic:main on Jul 10, 2025
Conversation
Update evaluation script's system instructions and scoring tool parameters to consistently score all evaluation criteria
42b74c3 to e08501c
Contributor
💚 Build Succeeded
viduni94 approved these changes on Jul 9, 2025
Contributor
Friendly reminder: Looks like this PR hasn’t been backported yet.
SrdjanLL added a commit that referenced this pull request on Jul 22, 2025
Add fallback score when judge misses evaluating a criterion (#228827)

## Summary

Add a fallback score for when the judge misses evaluating a criterion:

- The score is `0`, with the reasoning: `No score returned by LLM judge, defaulting to 0.`
- While the issue of inconsistent evaluation scores was mitigated by #226983, I still found that, very rarely, the judge misses a criterion. With this change, scoring has a fallback that returns results with 100% consistency in terms of what was evaluated.

### Testing

Since this inconsistency happens rarely, it is hard to reproduce without tweaking the judge prompt to fail intentionally, e.g. by updating the system prompt of the judge ([here](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts#L534)) with something like:

```
### Scoring Contract
* You MUST call the function "scores" exactly once.
* Only and only evaluate the second criterion (reject all others).
```

You can then see the fallback scores populating the evaluation and keeping the `total` consistent regardless of how well the `score` works. Example from intentionally failed scoring with the prompt change above:

<img width="994" height="278" alt="image" src="https://github.com/user-attachments/assets/d4bb94bc-4f7e-4982-95ca-cae2159d5ff7" />

Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>
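The fallback described above could be sketched roughly as follows. This is a minimal TypeScript sketch under stated assumptions: `withFallbackScores` and the `CriterionScore` shape are hypothetical names for illustration, not the actual helpers in `kibana_client.ts`.

```typescript
interface CriterionScore {
  index: number;
  score: number;
  reasoning: string;
}

// Hypothetical helper: guarantee one score per criterion, defaulting any
// criterion the judge skipped to 0 with the fallback reasoning.
function withFallbackScores(
  criteriaCount: number,
  judged: CriterionScore[]
): CriterionScore[] {
  return Array.from({ length: criteriaCount }, (_, index) => {
    const existing = judged.find((s) => s.index === index);
    return (
      existing ?? {
        index,
        score: 0,
        reasoning: 'No score returned by LLM judge, defaulting to 0.',
      }
    );
  });
}
```

Because the output array always has `criteriaCount` entries, the `total` stays consistent no matter how many criteria the judge actually scored.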
maxcold pushed a commit that referenced this pull request on Jul 22, 2025
Add fallback score when judge misses evaluating a criterion (#228827)
kdelemme pushed a commit to kdelemme/kibana that referenced this pull request on Jul 23, 2025
Add fallback score when judge misses evaluating a criterion (elastic#228827)
kertal pushed a commit to kertal/kibana that referenced this pull request on Jul 25, 2025
Update evaluation script to consistently score all evaluation criteria (elastic#226983)

## Summary

Closes: elastic#223422

Update the evaluation script's system instructions and scoring tool parameters to consistently score all evaluation criteria. The solution in this PR works well, so I'm jumping ahead with the PR, but I'm happy to hear if people think there are other, better solutions.

### Problem

- Evaluation runs were flaky: the LLM sometimes skipped showing scores for certain criteria, leading to missing details in the evaluation output and making manual inspection more difficult. This also affected the total number of scenarios shown in the summary.
- Consistent per-criterion scoring is required to track failures across models and scenarios.

### Solution

- Use the number of criteria passed in to pin the length of the array in the scoring tool schema to that exact number, so any response lacking items is rejected by the function-calling validator.
- Updated the system prompt to enforce the scoring process.
- Result: the LLM is forced (schema + prompt) to return a complete score list; evaluation now seems more deterministic.

### Testing

- Ran the evaluation script with the APM scenarios (the most inconsistent ones) and confirmed that 10/10 runs produced consistent output.
- Re-ran the evaluation framework for all scenarios and confirmed that all scenarios were scored and captured as part of the evaluation.

Tested with Gemini 2.0 Flash:

```bash
-------------------------------------------
Model gemini-2-flash scored 97 out of 123
-------------------------------------------
-------------------------------------------
Model gemini-2-flash Scores per Category
-------------------------
Category: Alerts - Scored 9.5 out of 10
-------------------------
Category: APM - Scored 8.5 out of 17
-------------------------
Category: Retrieve documentation function - Scored 12.5 out of 14
-------------------------
Category: Elasticsearch function - Scored 18 out of 19
-------------------------
Category: ES|QL query generation - Scored 33.5 out of 48
-------------------------
Category: Knowledge base - Scored 15 out of 15
-------------------------------------------
```

### Checklist

Check the PR satisfies following conditions. Reviewers should verify this PR satisfies this list as well.

- [ ] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/src/platform/packages/shared/kbn-i18n/README.md)
- [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
- [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations.
- [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed
- [ ] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
- [ ] Review the [backport guidelines](https://docs.google.com/document/d/1VyN5k91e5OVumlc0Gb9RPa3h1ewuPE705nRtioPiTvY/edit?usp=sharing) and apply applicable `backport:*` labels.

### Identify risks

Does this PR introduce any risks? For example, consider risks like hard-to-test bugs, performance regression, or potential data loss. Describe the risk, its severity, and the mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.

- [ ] [See some risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx)
- [ ] ...
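The schema-side idea from the Solution section (pinning the scores array to the exact number of criteria) could look roughly like the sketch below. This is a minimal TypeScript illustration under assumptions: `buildScoresToolSchema` and the property names are hypothetical, not the actual parameter schema used by the evaluation framework.

```typescript
// Hypothetical sketch: build a JSON-Schema-style parameter definition for the
// "scores" tool. Setting minItems and maxItems to the criteria count means a
// function-calling validator rejects any response with missing (or extra) scores.
function buildScoresToolSchema(criteriaCount: number) {
  return {
    type: 'object',
    properties: {
      criteria: {
        type: 'array',
        minItems: criteriaCount, // reject responses with fewer scores...
        maxItems: criteriaCount, // ...or more scores than criteria
        items: {
          type: 'object',
          properties: {
            index: { type: 'number' },
            score: { type: 'number' },
            reasoning: { type: 'string' },
          },
          required: ['index', 'score', 'reasoning'],
        },
      },
    },
    required: ['criteria'],
  };
}
```

With the bounds derived from the input rather than hard-coded, an incomplete score list fails schema validation instead of silently producing a partial evaluation.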
kertal pushed a commit to kertal/kibana that referenced this pull request on Jul 25, 2025
Add fallback score when judge misses evaluating a criterion (elastic#228827)
crespocarlos pushed a commit to crespocarlos/kibana that referenced this pull request on Jul 25, 2025
Add fallback score when judge misses evaluating a criterion (elastic#228827)