
[Obs AI Assistant] Add intent parameter to the query function and control downstream tool calling #228456

Closed

SrdjanLL wants to merge 4 commits into elastic:main from SrdjanLL:query-tool-calling

Conversation

@SrdjanLL
Contributor

@SrdjanLL SrdjanLL commented Jul 17, 2025

Relates to https://github.com/elastic/obs-ai-assistant-team/issues/276
Closes https://github.com/elastic/obs-ai-assistant-team/issues/324

Summary

Adds explicit queryIntent handling to the query function of the AI Assistant for more deterministic downstream tool calling:

  • Extends the query function’s JSON schema with a queryIntent parameter ('example' | 'data' | 'visual').
  • Selects or hides ES|QL helper tools based on the intent:
    • 'data' → force execute_query
    • 'visual' → force visualize_query
    • 'example' → expose no execution/visualization tools
  • Passes a toolChoice hint to naturalLanguageToEsql for deterministic tool calling.
    • Beyond the intended goal, this also gives insight into how the Obs AI Assistant interprets users' intentions by observing query tool calls:
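To make the mechanism concrete, here is a minimal sketch of how the queryIntent schema fragment and the intent-based tool selection could look. This is an illustration, not the actual Kibana code: the helper name `selectEsqlTools`, the `EsqlToolSelection` shape, and the schema fragment are assumptions; only the tool ids (`execute_query`, `visualize_query`) and the intent values come from this PR.

```typescript
type QueryIntent = 'example' | 'data' | 'visual';

// Hypothetical JSON-schema fragment extending the query function's parameters.
const queryIntentSchema = {
  type: 'string',
  enum: ['example', 'data', 'visual'],
  description:
    'Whether the user wants an example query, the resulting data, or a visualization.',
} as const;

interface EsqlToolSelection {
  tools: string[];      // ES|QL helper tools exposed to the LLM for this turn
  toolChoice?: string;  // forced tool, passed as a hint to naturalLanguageToEsql
}

// Map the declared intent to the tools the model may (or must) call.
function selectEsqlTools(intent: QueryIntent): EsqlToolSelection {
  switch (intent) {
    case 'data':
      // Force execution so the generated query deterministically runs.
      return { tools: ['execute_query'], toolChoice: 'execute_query' };
    case 'visual':
      // Force visualization of the generated query instead of raw execution.
      return { tools: ['visualize_query'], toolChoice: 'visualize_query' };
    case 'example':
      // Expose no execution/visualization tools; only the query text is returned.
      return { tools: [] };
  }
}
```

Forcing toolChoice per intent sidesteps the model's own (non-deterministic) decision of whether to call a tool, which is what the prompt-only approach struggled with.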

Why?

As part of the Gemini prompt improvements (PR), we found that models are more or less eager to execute tools and that no matter how direct (DIRECT) the system prompt is, tool execution is not deterministic. We also need to tread carefully between overly eager models (like Claude) and less eager models (like Gemini 2.0 Flash) and find a structured balance.

The change outlined above proved to be one of the strongest contributors to the score improvements in the evaluation framework.

Evaluation Benchmark

  • I found the evaluation improvements were notable when used in conjunction with the Gemini prompt improvements. On main, there were no improvements, but no regressions in the scores either.

*execute_connector evaluation scores are available, but omitted from the summary for comparison

Running on prompt improvements branch:

Gemini:

  • Improves the scores by ~17% compared to the current state of the Gemini improvements (see here), bringing the total improvement in evaluation scores to ~76% (compared to the scores prior to improvements)
-------------------------------------------
Model gemini-2-flash scored 108 out of 123
-------------------------------------------
-------------------------------------------
Model gemini-2-flash Scores per Category
-------------------------
Category: Alerts - Scored 10 out of 10
-------------------------
Category: APM - Scored 13.5 out of 17
-------------------------
Category: Retrieve documentation function - Scored 12.5 out of 14
-------------------------
Category: Elasticsearch function - Scored 19 out of 19
-------------------------
Category: ES|QL query generation - Scored 38 out of 48
-------------------------
Category: Knowledge base - Scored 15 out of 15
-------------------------------------------

Claude:

  • Improves the scores by ~10% compared to the latest evaluation from the Gemini improvements PR (see here)
-------------------------------------------
Model bedrock-claude-3_7 scored 121.25 out of 123
-------------------------------------------
-------------------------------------------
Model bedrock-claude-3_7 Scores per Category
-------------------------
Category: Alerts - Scored 10 out of 10
-------------------------
Category: APM - Scored 15.5 out of 17
-------------------------
Category: Retrieve documentation function - Scored 14 out of 14
-------------------------
Category: Elasticsearch function - Scored 19 out of 19
-------------------------
Category: ES|QL query generation - Scored 47.75 out of 48
-------------------------
Category: Knowledge base - Scored 15 out of 15
-------------------------------------------
Running on main

  • Has less impact, but that is expected since query execution control is not the main performance bottleneck there (index/dataset assumptions proved to be more impactful)

Gemini (no significant changes):

-------------------------------------------
Model gemini-2-flash scored 66 out of 123
-------------------------------------------
-------------------------------------------
Model gemini-2-flash Scores per Category
-------------------------
Category: Alerts - Scored 9 out of 10
-------------------------
Category: APM - Scored 3 out of 17
-------------------------
Category: Retrieve documentation function - Scored 13 out of 14
-------------------------
Category: Elasticsearch function - Scored 6 out of 19
-------------------------
Category: ES|QL query generation - Scored 20.5 out of 48
-------------------------
Category: Knowledge base - Scored 14.5 out of 15
-------------------------------------------

Claude (no significant changes):

-------------------------------------------
Model bedrock-claude-3_7 scored 109.5 out of 123
-------------------------------------------
-------------------------------------------
Model bedrock-claude-3_7 Scores per Category
-------------------------
Category: Alerts - Scored 9.5 out of 10
-------------------------
Category: APM - Scored 16 out of 17
-------------------------
Category: Retrieve documentation function - Scored 14 out of 14
-------------------------
Category: Elasticsearch function - Scored 16 out of 17
-------------------------
Category: ES|QL query generation - Scored 41 out of 48
-------------------------
Category: Knowledge base - Scored 13 out of 15
-------------------------------------------

Testing

  • Tested on Gemini Prompt improvements PR and on the current main.
  • Ran smoke tests described here
  • Ran evaluation framework as can be seen above
  • Tested display/visualize query buttons to ensure no regressions have happened in that flow.

Identify risks

  • Potential risks could occur in standard user workflows where the user's intent is not well defined. I have not yet found a concrete scenario, but it is possible. A likely way to overcome this is to elaborate on queryIntent in the system prompt and/or provide few-shot examples.

@SrdjanLL SrdjanLL requested a review from a team as a code owner July 17, 2025 13:17
@SrdjanLL SrdjanLL added the enhancement (New value added to drive a business result), ci:project-deploy-observability (Create an Observability project), backport:version (Backport to applied version labels), and v9.2.0 labels Jul 17, 2025
@github-actions
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@dgieselaar
Contributor

@SrdjanLL We've had this before actually. It did work better for our specific evaluations, but I don't think we should over-index on those. My concern is here that you force the LLM to make a decision before understanding the context. E.g., once it sees "execute", it might forcefully try to execute a query even though either ES|QL doesn't support it or it cannot find the data. I would prefer that we expand our evaluation examples first with more realistic scenarios, and in general, I'd like to hold off making changes here until we have expanded our evals. We also have #226616 in the waiting room.

@SrdjanLL
Contributor Author

Thanks for the context @dgieselaar. This change gave a notable bump to the performance, but I didn't want to throw this in an already crowded PR for prompt improvements and was hoping to get some quick feedback like this.

From the workflow point of view, I didn't see this change cause any errors (such as invalid/unavailable tool calls), which was a positive. Regarding the decision-making/reasoning constraints introduced by a forced tool call, did you have a way to test this, or can you recall a scenario from the past that prompted the update to this workflow?

I would prefer that we expand our evaluation examples first with more realistic scenarios, and in general, I'd like to hold off making changes here until we have expanded our evals.

I agree with this, overall (and I have some ideas that I'd happily share), but I just want to point out that even with the current evaluation scenarios, we see the pattern of hesitant tool calling with Gemini, and I was hoping to avoid yelling in the system prompt and tool descriptions - hence this PR 😅 I would like to see whether #226616 helps overcome some of this.

@elasticmachine
Contributor

elasticmachine commented Aug 8, 2025

💔 Build Failed

  • Buildkite Build
  • Commit: 6909d39
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-228456-6909d393d564

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #133 / Serverless Observability - Deployment-agnostic AI Assistant API integration tests observability AI Assistant tool: execute_query POST /internal/observability_ai_assistant/chat/complete "before all" hook for "makes 4 requests to the LLM"
  • [job] [logs] FTR Configs #58 / Stateful Observability - Deployment-agnostic AI Assistant API integration tests observability AI Assistant tool: execute_query POST /internal/observability_ai_assistant/chat/complete "before all" hook for "makes 4 requests to the LLM"

Metrics [docs]

✅ unchanged

History

@SrdjanLL
Contributor Author

No longer considering this change. Future enhancements on this will likely move us towards Agent Builder.

@SrdjanLL SrdjanLL closed this Oct 27, 2025