
[Obs AI Assistant] Add intent parameter to the query function and control downstream tool calling #228456

Closed

SrdjanLL wants to merge 4 commits into elastic:main from SrdjanLL:query-tool-calling

Conversation

@SrdjanLL
Contributor

@SrdjanLL SrdjanLL commented Jul 17, 2025

Relates to https://github.com/elastic/obs-ai-assistant-team/issues/276
Closes https://github.com/elastic/obs-ai-assistant-team/issues/324

Summary

Adds explicit queryIntent handling to the query function of the AI Assistant for more deterministic downstream tool calling:

  • Extends the query function’s JSON schema with a queryIntent parameter ('example' | 'data' | 'visual').
  • Selects or hides ES|QL helper tools based on the intent:
    • 'data' → force execute_query
    • 'visual' → force visualize_query
    • 'example' → expose no execution/visualization tools
  • Passes a toolChoice hint to naturalLanguageToEsql for deterministic tool calling.
    • Beyond the intended goal, this also gives insight into how the Obs AI Assistant interprets users' intentions by observing query tool calls:
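To make the mechanism concrete, here is a minimal sketch of how the queryIntent schema fragment and the intent-based tool selection could look. This is an illustration, not the actual Kibana code: the helper name `selectEsqlTools`, the `EsqlToolSelection` shape, and the schema fragment are assumptions; only the tool ids (`execute_query`, `visualize_query`) and the intent values come from this PR.

```typescript
type QueryIntent = 'example' | 'data' | 'visual';

// Hypothetical JSON-schema fragment extending the query function's parameters.
const queryIntentSchema = {
  type: 'string',
  enum: ['example', 'data', 'visual'],
  description:
    'Whether the user wants an example query, the resulting data, or a visualization.',
} as const;

interface EsqlToolSelection {
  tools: string[];      // ES|QL helper tools exposed to the LLM for this turn
  toolChoice?: string;  // forced tool, passed as a hint to naturalLanguageToEsql
}

// Map the declared intent to the tools the model may (or must) call.
function selectEsqlTools(intent: QueryIntent): EsqlToolSelection {
  switch (intent) {
    case 'data':
      // Force execution so the generated query deterministically runs.
      return { tools: ['execute_query'], toolChoice: 'execute_query' };
    case 'visual':
      // Force visualization of the generated query instead of raw execution.
      return { tools: ['visualize_query'], toolChoice: 'visualize_query' };
    case 'example':
      // Expose no execution/visualization tools; only the query text is returned.
      return { tools: [] };
  }
}
```

Forcing toolChoice per intent sidesteps the model's own (non-deterministic) decision of whether to call a tool, which is what the prompt-only approach struggled with.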

Why?

As part of the Gemini prompt improvements (PR), we found that models are more or less eager to execute tools and that no matter how direct (DIRECT) the system prompt is, tool execution is not deterministic. We also need to tread carefully between overly eager models (like Claude) and less eager models (like Gemini 2.0 Flash) and find a structured balance.

The change outlined above proved to be one of the strongest contributors to the score improvements in the evaluation framework.

Evaluation Benchmark

  • I found the evaluation improvements were notable when used in conjunction with the Gemini prompt improvements. On main, there were no improvements, but no regressions in the scores either.

*execute_connector evaluation scores are available, but omitted from the summary for comparison

Running on prompt improvements branch:

Gemini:

  • Improves the scores by ~17% compared to the current state of the Gemini improvements (see here), bringing the total improvement in evaluation scores to ~76% (compared to the scores prior to improvements)
-------------------------------------------
Model gemini-2-flash scored 108 out of 123
-------------------------------------------
-------------------------------------------
Model gemini-2-flash Scores per Category
-------------------------
Category: Alerts - Scored 10 out of 10
-------------------------
Category: APM - Scored 13.5 out of 17
-------------------------
Category: Retrieve documentation function - Scored 12.5 out of 14
-------------------------
Category: Elasticsearch function - Scored 19 out of 19
-------------------------
Category: ES|QL query generation - Scored 38 out of 48
-------------------------
Category: Knowledge base - Scored 15 out of 15
-------------------------------------------

Claude:

  • Improves the scores by ~10% compared to the latest evaluation from the Gemini improvements PR (see here)
-------------------------------------------
Model bedrock-claude-3_7 scored 121.25 out of 123
-------------------------------------------
-------------------------------------------
Model bedrock-claude-3_7 Scores per Category
-------------------------
Category: Alerts - Scored 10 out of 10
-------------------------
Category: APM - Scored 15.5 out of 17
-------------------------
Category: Retrieve documentation function - Scored 14 out of 14
-------------------------
Category: Elasticsearch function - Scored 19 out of 19
-------------------------
Category: ES|QL query generation - Scored 47.75 out of 48
-------------------------
Category: Knowledge base - Scored 15 out of 15
-------------------------------------------
Running on main

  • Has less impact, but that is expected since query execution control is not the main performance bottleneck there (index/dataset assumptions proved to be more impactful)

Gemini (no significant changes):

-------------------------------------------
Model gemini-2-flash scored 66 out of 123
-------------------------------------------
-------------------------------------------
Model gemini-2-flash Scores per Category
-------------------------
Category: Alerts - Scored 9 out of 10
-------------------------
Category: APM - Scored 3 out of 17
-------------------------
Category: Retrieve documentation function - Scored 13 out of 14
-------------------------
Category: Elasticsearch function - Scored 6 out of 19
-------------------------
Category: ES|QL query generation - Scored 20.5 out of 48
-------------------------
Category: Knowledge base - Scored 14.5 out of 15
-------------------------------------------

Claude (no significant changes):

-------------------------------------------
Model bedrock-claude-3_7 scored 109.5 out of 123
-------------------------------------------
-------------------------------------------
Model bedrock-claude-3_7 Scores per Category
-------------------------
Category: Alerts - Scored 9.5 out of 10
-------------------------
Category: APM - Scored 16 out of 17
-------------------------
Category: Retrieve documentation function - Scored 14 out of 14
-------------------------
Category: Elasticsearch function - Scored 16 out of 17
-------------------------
Category: ES|QL query generation - Scored 41 out of 48
-------------------------
Category: Knowledge base - Scored 13 out of 15
-------------------------------------------

Testing

  • Tested on Gemini Prompt improvements PR and on the current main.
  • Ran smoke tests described here
  • Ran evaluation framework as can be seen above
  • Tested display/visualize query buttons to ensure no regressions have happened in that flow.

Identify risks

  • Potential risks could occur in standard user workflows where the user's intent is not well defined. I have not yet found a concrete scenario, but it is possible. A likely way to overcome this is to elaborate on queryIntent in the system prompt and/or provide few-shot examples.

@SrdjanLL SrdjanLL requested a review from a team as a code owner July 17, 2025 13:17
@SrdjanLL SrdjanLL added the enhancement (New value added to drive a business result), ci:project-deploy-observability (Create an Observability project), backport:version (Backport to applied version labels), and v9.2.0 labels Jul 17, 2025
@github-actions
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@dgieselaar
Contributor

@SrdjanLL We've had this before actually. It did work better for our specific evaluations, but I don't think we should over-index on those. My concern is here that you force the LLM to make a decision before understanding the context. E.g., once it sees "execute", it might forcefully try to execute a query even though either ES|QL doesn't support it or it cannot find the data. I would prefer that we expand our evaluation examples first with more realistic scenarios, and in general, I'd like to hold off making changes here until we have expanded our evals. We also have #226616 in the waiting room.

@SrdjanLL
Contributor Author

Thanks for the context @dgieselaar. This change gave a notable bump to the performance, but I didn't want to throw this in an already crowded PR for prompt improvements and was hoping to get some quick feedback like this.

From the workflow point of view, I didn't see this change cause any errors (such as invalid/unavailable tool calls), which was a positive. Regarding the decision-making/reasoning constraints introduced by a forced tool call, did you have a way to test this, or can you recall a scenario from the past that prompted the update to this workflow?

I would prefer that we expand our evaluation examples first with more realistic scenarios, and in general, I'd like to hold off making changes here until we have expanded our evals.

I agree with this, overall (and I have some ideas that I'd happily share), but I just want to point out that even with the current evaluation scenarios, we see the pattern of hesitant tool calling with Gemini, and I was hoping to avoid yelling in the system prompt and tool descriptions - hence this PR 😅 I would like to see whether #226616 helps overcome some of this.

@elasticmachine
Contributor

elasticmachine commented Aug 8, 2025

💔 Build Failed

  • Buildkite Build
  • Commit: 6909d39
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-228456-6909d393d564

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #133 / Serverless Observability - Deployment-agnostic AI Assistant API integration tests observability AI Assistant tool: execute_query POST /internal/observability_ai_assistant/chat/complete "before all" hook for "makes 4 requests to the LLM"
  • [job] [logs] FTR Configs #58 / Stateful Observability - Deployment-agnostic AI Assistant API integration tests observability AI Assistant tool: execute_query POST /internal/observability_ai_assistant/chat/complete "before all" hook for "makes 4 requests to the LLM"

Metrics [docs]

✅ unchanged

History

@SrdjanLL
Contributor Author

No longer considering this change. Future enhancements on this will likely move us towards Agent Builder.

@SrdjanLL SrdjanLL closed this Oct 27, 2025