[nl2esql] cleanup prompts (#235179)
Conversation
/ci

/ci
Ran the Observability AI Assistant ES|QL evaluations on this PR. The results are as follows:
/ci
```ts
it('contains ESQL documentation', () => {
  const parsed = JSON.parse(last(thirdRequestBody.messages)?.content as string);
  expect(parsed.documentation.OPERATORS).to.contain('Binary Operators');
});
```
We no longer automatically include some of the doc pages, and given this test doesn't stub the responses with actual logical data, there is no way to really adapt this test.
(Also this suite doesn't really make much sense and is just testing the internals of the NL-2-ESQL task, which should be avoided.)
> this suite doesn't really make much sense and is just testing the internals of the NL-2-ESQL task
Maybe it can be adapted but I don't think it is (only) testing internals. The intention of this spec is to ensure that we don't have a bug on our end, and that we are actually returning product docs to the LLM.
Any other way we can verify that if not inspecting the outgoing messages?
You are doing assertions against internal calls performed by a task which is supposed to be a black box for your assistant. How is this not just testing the internals? E.g. I had to delete this specific test because I changed the way the task works internally.
Now don't get me wrong - I know that such testing is terribly painful and complicated. But it still feels wrong to me.
Are we talking about the same thing? This test is doing assertions against the returned product docs. That's not an implementation detail but the output I'd say.
Still, interested to hear what else we can test against, to make sure we are actually sending product docs to the LLM.
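One possible middle ground (a sketch only; every name below is hypothetical and not an actual Kibana API): instead of parsing raw outgoing request bodies, give the task a seam around the step that assembles the documentation payload, and assert on that payload's shape. The internals of message serialization can then change freely while the "are product docs actually selected?" contract stays testable.

```typescript
// Hypothetical sketch: assert product docs are part of what the task sends
// to the LLM via a seam on the doc-assembly step, rather than by parsing
// the raw LLM messages. All identifiers here are made up for illustration.

interface DocumentationPayload {
  // keys are ES|QL doc sections (e.g. OPERATORS), values are the doc text
  [section: string]: string;
}

// Tiny in-memory stand-in for the NL-2-ESQL doc-selection step.
function buildDocumentationPayload(requestedSections: string[]): DocumentationPayload {
  const docs: Record<string, string> = {
    OPERATORS: 'Binary Operators: ==, !=, <, <=, >, >= ...',
    STATS: 'STATS ... BY groups rows and computes aggregates.',
  };
  const payload: DocumentationPayload = {};
  for (const section of requestedSections) {
    if (docs[section] !== undefined) {
      payload[section] = docs[section];
    }
  }
  return payload;
}

// The assertion such a test could make, independent of message internals:
const payload = buildDocumentationPayload(['OPERATORS']);
if (!payload.OPERATORS?.includes('Binary Operators')) {
  throw new Error('expected OPERATORS documentation to be included');
}
console.log('documentation payload contains OPERATORS docs');
```

The trade-off is that the seam itself becomes part of the task's public surface, so it only works if both sides agree the documentation payload is a contract, not an implementation detail.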
💛 Build succeeded, but was flaky
Failed CI Steps
- Test Failures

Metrics [docs]
- Public APIs missing comments
- Public APIs missing exports

Unknown metric groups
- API count
- References to deprecated APIs
History
## Summary

Cleanup the prompts of the NL-2-ESQL task:

- adapt the instructions based on the [ML experiment](https://github.com/elastic/nl2esql/tree/main)
- remove parts of the doc which aren't really useful in any way (e.g. how to use ES|QL with Kibana)
- use nl->esql examples instead of examples describing what each request does (more efficient according to the experiment)

### Numbers

First call (`request_documentation`): **+700 tokens**, explained by the fact that we now provide ES|QL examples to the LLM during this step, which can increase the selection efficiency.

Second call (`generate_esql`): **-2300 tokens**

Overall, **-1600 input tokens**, which represents, depending on the rest of the input (e.g. mappings or not), **-10%** to **-20%** of the tokens, and better efficiency.

### Evals

**agent builder eval suite**

| Dataset | Filter level | Factuality | Groundedness | Relevance | Sequence Accuracy |
| :--- | :--- | :---: | :---: | :---: | :---: |
| Analytical | **Baseline** | 36.7% | 68.3% | 82.1% | 100.0% |
| Analytical | **PR** | 41.1% | 76.8% | 89.2% | 97.8% |

(Ran multiple times, quite stable.) The better scores are likely caused by one or two fewer failing queries compared to the baseline.

**inference plugin's esql eval suite**

```
Model openai-gpt4o scored 27.449999999999996 out of 31
-------------------------------------------
Model openai-gpt4o scores per category
- category: ES|QL commands and functions usage - scored 11.8 out of 14
- category: ES|QL query generation - scored 12.200000000000001 out of 13
- category: SPL to ESQL - scored 3.45 out of 4
-------------------------------------------
```

Which, compared to the last runs (done in elastic#224868), confirms there's no regression, or maybe even some slight improvements.

**o11y ES|QL eval suite**

(baseline from [this doc](https://docs.google.com/spreadsheets/d/1aHJHj8KALdTLVJxjoyI2VdqGBVzcJS7v2P7yJRmESZY/edit?gid=243046278#gid=243046278), evaluator was Gemini 2.5-pro for all candidates)

| Dataset | Model | Baseline | Score | Delta |
| :--- | :--- | :---: | :---: | :---: |
| ESQL query generation | GPT4.1 | 147.5 | 145.5 | **-2** |
| ESQL query generation | Gemini 2.5-pro | 128.7 | 132.25 | **+3.55** |
| ESQL query generation | Claude 3.7 | 150 | 158.75 | **+8.75** |
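The token arithmetic in the Numbers section can be sanity-checked in a few lines. A minimal sketch: the per-call deltas come from the description above, while the total input sizes (8k and 16k tokens) are assumed example values, not measurements.

```typescript
// Sanity-check the token deltas quoted in the Numbers section.
// Per-call deltas are from the PR description; total input sizes are
// assumed examples chosen to reproduce the -10% / -20% bounds.
const requestDocumentationDelta = 700; // first call: +700 tokens
const generateEsqlDelta = -2300; // second call: -2300 tokens

const overallDelta = requestDocumentationDelta + generateEsqlDelta;

// The relative saving depends on the rest of the input (e.g. mappings or not):
const percentFor = (totalInputTokens: number): number =>
  (overallDelta / totalInputTokens) * 100;

console.log(overallDelta); // -1600
console.log(percentFor(16000)); // -10 (a large input: mappings included)
console.log(percentFor(8000)); // -20 (a smaller input: no mappings)
```

So the -10% to -20% range simply reflects how large the rest of the prompt is relative to the fixed -1600-token saving.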