[nl2esql] cleanup prompts (#235179)
Conversation
/ci

/ci
Ran the Observability AI Assistant ES|QL evaluations on this PR. The results are as follows:
/ci
```ts
it('contains ESQL documentation', () => {
  const parsed = JSON.parse(last(thirdRequestBody.messages)?.content as string);
  expect(parsed.documentation.OPERATORS).to.contain('Binary Operators');
});
```
We no longer automatically include some of the doc pages, and given this test doesn't stub the responses with actual logical data, there is no way to really adapt this test.
(Also this suite doesn't really make much sense and is just testing the internals of the NL-2-ESQL task, which should be avoided.)
> this suite doesn't really make much sense and is just testing the internals of the NL-2-ESQL task
Maybe it can be adapted but I don't think it is (only) testing internals. The intention of this spec is to ensure that we don't have a bug on our end, and that we are actually returning product docs to the LLM.
Any other way we can verify that if not inspecting the outgoing messages?
You are doing assertions against internal calls performed by a task which is supposed to be a black box for your assistant. How is this not just testing the internals? E.g. I had to delete this specific test because I changed the way the task works internally.
Now don't get me wrong - I know that such testing is terribly painful and complicated. But it still feels wrong to me.
Are we talking about the same thing? This test is doing assertions against the returned product docs. That's not an implementation detail but the output I'd say.
Still, interested to hear what else we can test against, to make sure we are actually sending product docs to the LLM.
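One possible middle ground (a sketch only; every name below is hypothetical and not an actual Kibana API): instead of parsing raw outgoing request bodies, give the task a seam around the step that assembles the documentation payload, and assert on that payload's shape. The internals of message serialization can then change freely while the "are product docs actually selected?" contract stays testable.

```typescript
// Hypothetical sketch: assert product docs are part of what the task sends
// to the LLM via a seam on the doc-assembly step, rather than by parsing
// the raw LLM messages. All identifiers here are made up for illustration.

interface DocumentationPayload {
  // keys are ES|QL doc sections (e.g. OPERATORS), values are the doc text
  [section: string]: string;
}

// Tiny in-memory stand-in for the NL-2-ESQL doc-selection step.
function buildDocumentationPayload(requestedSections: string[]): DocumentationPayload {
  const docs: Record<string, string> = {
    OPERATORS: 'Binary Operators: ==, !=, <, <=, >, >= ...',
    STATS: 'STATS ... BY groups rows and computes aggregates.',
  };
  const payload: DocumentationPayload = {};
  for (const section of requestedSections) {
    if (docs[section] !== undefined) {
      payload[section] = docs[section];
    }
  }
  return payload;
}

// The assertion such a test could make, independent of message internals:
const payload = buildDocumentationPayload(['OPERATORS']);
if (!payload.OPERATORS?.includes('Binary Operators')) {
  throw new Error('expected OPERATORS documentation to be included');
}
console.log('documentation payload contains OPERATORS docs');
```

The trade-off is that the seam itself becomes part of the task's public surface, so it only works if both sides agree the documentation payload is a contract, not an implementation detail.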
💛 Build succeeded, but was flaky
Failed CI Steps
- Test Failures

Metrics [docs]
- Public APIs missing comments
- Public APIs missing exports

Unknown metric groups
- API count
- References to deprecated APIs
History
## Summary

Cleanup the prompts of the NL-2-ESQL task:

- adapt the instructions based on the [ML experiment](https://github.com/elastic/nl2esql/tree/main)
- remove parts of the doc which aren't really useful in any way (e.g. how to use ES|QL with Kibana)
- use nl->esql examples instead of examples describing what each request does (more efficient according to the experiment)

### Numbers

First call (`request_documentation`): **+700 tokens**, explained by the fact that we now provide ES|QL examples to the LLM during this step, which can increase the selection efficiency.

Second call (`generate_esql`): **-2300 tokens**

Overall, **-1600 input tokens**, which represents, depending on the rest of the input (e.g. mappings or not), **-10%** to **-20%** of the tokens, and better efficiency.

### Evals

**agent builder eval suite**

| Dataset | Filter level | Factuality | Groundedness | Relevance | Sequence Accuracy |
| :--- | :--- | :---: | :---: | :---: | :---: |
| Analytical | **Baseline** | 36.7% | 68.3% | 82.1% | 100.0% |
| Analytical | **PR** | 41.1% | 76.8% | 89.2% | 97.8% |

(Ran multiple times, quite stable.) The better scores are likely caused by one or two fewer failing queries compared to the baseline.

**inference plugin's esql eval suite**

```
Model openai-gpt4o scored 27.449999999999996 out of 31
-------------------------------------------
Model openai-gpt4o scores per category
- category: ES|QL commands and functions usage - scored 11.8 out of 14
- category: ES|QL query generation - scored 12.200000000000001 out of 13
- category: SPL to ESQL - scored 3.45 out of 4
-------------------------------------------
```

Which, compared to the last runs (done in elastic#224868), confirms there's no regression, or maybe even some slight improvements.

**o11y ES|QL eval suite**

(baseline from [this doc](https://docs.google.com/spreadsheets/d/1aHJHj8KALdTLVJxjoyI2VdqGBVzcJS7v2P7yJRmESZY/edit?gid=243046278#gid=243046278), evaluator was Gemini 2.5-pro for all candidates)

| Dataset | Model | Baseline | Score | Delta |
| :--- | :--- | :---: | :---: | :---: |
| ESQL query generation | GPT4.1 | 147.5 | 145.5 | **-2** |
| ESQL query generation | Gemini 2.5-pro | 128.7 | 132.25 | **+3.55** |
| ESQL query generation | Claude 3.7 | 150 | 158.75 | **+8.75** |
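The token arithmetic in the Numbers section can be sanity-checked in a few lines. A minimal sketch: the per-call deltas come from the description above, while the total input sizes (8k and 16k tokens) are assumed example values, not measurements.

```typescript
// Sanity-check the token deltas quoted in the Numbers section.
// Per-call deltas are from the PR description; total input sizes are
// assumed examples chosen to reproduce the -10% / -20% bounds.
const requestDocumentationDelta = 700; // first call: +700 tokens
const generateEsqlDelta = -2300; // second call: -2300 tokens

const overallDelta = requestDocumentationDelta + generateEsqlDelta;

// The relative saving depends on the rest of the input (e.g. mappings or not):
const percentFor = (totalInputTokens: number): number =>
  (overallDelta / totalInputTokens) * 100;

console.log(overallDelta); // -1600
console.log(percentFor(16000)); // -10 (a large input: mappings included)
console.log(percentFor(8000)); // -20 (a smaller input: no mappings)
```

So the -10% to -20% range simply reflects how large the rest of the prompt is relative to the fixed -1600-token saving.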