
# [nl2esql] cleanup prompts #235179

Merged: pgayvallet merged 11 commits into elastic:main from pgayvallet:kbn-xxx-nl-2-esql-prompts on Sep 18, 2025
Conversation

pgayvallet (Contributor) commented Sep 16, 2025:

## Summary

Cleanup the prompts of the NL-2-ESQL task

- adapt the instructions based on the [ML experiment](https://github.com/elastic/nl2esql/tree/main)
- remove parts of the doc which aren't really useful in any way (e.g. how to use ES|QL with Kibana)
- use nl->esql examples instead of examples describing what each request does (more efficient according to the experiment)
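As a concrete illustration of the last point, here is a minimal sketch of what an nl->esql example pair could look like once rendered into the prompt. All names and shapes here are hypothetical, for illustration only, not the actual Kibana prompt code:

```typescript
// Hypothetical shapes for illustration; not the actual Kibana prompt code.
interface NlToEsqlExample {
  question: string; // natural-language request
  query: string; // the corresponding ES|QL query
}

const examples: NlToEsqlExample[] = [
  {
    question: 'Show the 5 longest log messages',
    query: 'FROM logs-* | EVAL len = LENGTH(message) | SORT len DESC | LIMIT 5',
  },
];

// Render each example as a question/query pair, rather than a prose
// description of what each query does.
function renderExamples(list: NlToEsqlExample[]): string {
  return list.map((e) => `Q: ${e.question}\nA: ${e.query}`).join('\n\n');
}
```

Pairs like this give the model a direct input-to-output mapping, which is consistent with the experiment's finding that they are more efficient than prose descriptions of each query.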

### Numbers

First call (`request_documentation`): **+700 tokens**, explained by the fact that we now provide ES|QL examples to the LLM during this step, which improves the selection efficiency.

Second call (`generate_esql`): **-2300 tokens**

Overall, **-1600 input tokens**, which represents (depending on the rest of the input, e.g. whether mappings are included) a **-10%** to **-20%** token reduction, along with better efficiency.

### Evals

**agent builder eval suite**

| Dataset | Filter level | Factuality | Groundedness | Relevance | Sequence Accuracy |
| :--- | :--- | :---: | :---: | :---: | :---: |
| Analytical | **Baseline** | 36.7% | 68.3% | 82.1% | 100.0% |
| Analytical | **PR** | 41.1% | 76.8% | 89.2% | 97.8% |

(Ran multiple times; the results are quite stable.) The better scores are likely caused by one or two fewer failing queries compared to the baseline.

**inference plugin's esql eval suite**

```
Model openai-gpt4o scored 27.449999999999996 out of 31

-------------------------------------------
Model openai-gpt4o scores per category
- category: ES|QL commands and functions usage - scored 11.8 out of 14
- category: ES|QL query generation - scored 12.200000000000001 out of 13
- category: SPL to ESQL - scored 3.45 out of 4
-------------------------------------------
```

Compared to the last runs (done in #224868), this confirms there's no regression, and perhaps even some slight improvements.

**o11y ES|QL eval suite**

(baseline from [this doc](https://docs.google.com/spreadsheets/d/1aHJHj8KALdTLVJxjoyI2VdqGBVzcJS7v2P7yJRmESZY/edit?gid=243046278#gid=243046278), evaluator was Gemini 2.5-pro for all candidates)

| Dataset | Model | Baseline | Score | Delta |
| :--- | :--- | :---: | :---: | :---: |
| ESQL query generation | GPT4.1 | 147.5 | 145.5 | **-2** |
| ESQL query generation | Gemini 2.5-pro | 128.7 | 132.25 | **+3.55** |
| ESQL query generation | Claude 3.7 | 150 | 158.75 | **+8.75** |

pgayvallet (Contributor Author) commented:

/ci

pgayvallet (Contributor Author) commented:

/ci

pgayvallet added the release_note:skip (Skip the PR/issue when compiling release notes), backport:skip (This PR does not require backporting), and v9.2.0 labels on Sep 16, 2025
pgayvallet marked this pull request as ready for review September 16, 2025 11:46
pgayvallet requested a review from a team as a code owner September 16, 2025 11:46
viduni94 (Contributor) commented Sep 17, 2025:

Ran the Observability AI Assistant ES|QL evaluations on this PR. The results are as follows:

| Model | Before (main) | After (This PR) | Diff |
| :--- | :---: | :---: | :---: |
| Gemini 2.0 Flash | 102.25 | 100.5 | -1.75 |
| Gemini 2.5 Flash | 117.5 | 130.2 | +12.7 |
| Gemini 2.5 Pro | 128.7 | 131.5 | +2.8 |
| Claude Sonnet 3.5 | 149.2 | 157.75 | +8.55 |
| Claude Sonnet 3.7 | 150 | 160.5 | +10.5 |
| Claude Sonnet 4 | 166.25 | 162 | -4.25 |

pgayvallet (Contributor Author) commented:

/ci

pgayvallet requested a review from a team as a code owner September 18, 2025 06:02
Comment on lines -160 to -163

```typescript
it('contains ESQL documentation', () => {
  const parsed = JSON.parse(last(thirdRequestBody.messages)?.content as string);
  expect(parsed.documentation.OPERATORS).to.contain('Binary Operators');
});
```
pgayvallet (Contributor Author):
We no longer automatically include some of the doc pages, and given this test doesn't stub the responses with actual logical data, there is no way to really adapt this test.

(Also this suite doesn't really make much sense and is just testing the internals of the NL-2-ESQL task, which should be avoided.)

sorenlouv (Member) commented Sep 18, 2025:
> this suite doesn't really make much sense and is just testing the internals of the NL-2-ESQL task

Maybe it can be adapted but I don't think it is (only) testing internals. The intention of this spec is to ensure that we don't have a bug on our end, and that we are actually returning product docs to the LLM.
Is there any other way we can verify that, if not by inspecting the outgoing messages?

pgayvallet (Contributor Author):

You are doing assertions against internal calls performed by a task which is supposed to be a black box for your assistant. How is this not just testing the internals? E.g. I had to delete this specific test because I changed the way the task works internally.

Now don't get me wrong - I know that such testing is terribly painful and complicated. But it still feels wrong to me.

sorenlouv (Member) commented Sep 18, 2025:

Are we talking about the same thing? This test is doing assertions against the returned product docs. That's not an implementation detail but the output I'd say.

Still, interested to hear what else we can test against, to make sure we are actually sending product docs to the LLM.
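One looser option, sketched here purely as an illustration (the `Message` type and the captured-request shape below are hypothetical, not the actual inference plugin types), is to assert that ES|QL documentation reached the LLM in *some* outgoing request, without pinning a specific doc page to a specific call's payload:

```typescript
// Hypothetical message shape, for illustration; not the actual plugin types.
type Message = { role: string; content: string };

// Pass if any outgoing request carried documentation content, instead of
// asserting a specific page inside a specific call's payload.
function docsWereSent(requests: Message[][]): boolean {
  return requests.some((messages) =>
    messages.some((m) => m.content.includes('documentation'))
  );
}
```

A check like this would still catch the "we never sent product docs" bug while surviving internal reworkings of which call carries which page.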

elasticmachine (Contributor) commented:
💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #121 / discover/group3 discover request counts ES|QL mode should send expected requests for saved search changes
  • [job] [logs] FTR Configs #20 / input controls input control options updateFiltersOnChange is false should replace existing filter pill(s) when new item is selected

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats comments` for more detailed information.

| id | before | after | diff |
| :--- | :---: | :---: | :---: |
| inference | 46 | 47 | +1 |

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats exports` for more detailed information.

| id | before | after | diff |
| :--- | :---: | :---: | :---: |
| inference | 9 | 10 | +1 |
Unknown metric groups

API count

| id | before | after | diff |
| :--- | :---: | :---: | :---: |
| inference | 58 | 59 | +1 |

References to deprecated APIs

| id | before | after | diff |
| :--- | :---: | :---: | :---: |
| @kbn/ai-tools | 0 | 1 | +1 |

History

@pgayvallet pgayvallet merged commit 98a1296 into elastic:main Sep 18, 2025
12 checks passed
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Sep 24, 2025
niros1 pushed a commit that referenced this pull request Sep 30, 2025
rylnd pushed a commit to rylnd/kibana that referenced this pull request Oct 17, 2025