[Agent Builder] Fix graph recursion limit and cycle state carry-over #265538

Draft
hop-dev wants to merge 1 commit into elastic:main from hop-dev:fix/agent-builder-graph-recursion-limit



@hop-dev hop-dev commented Apr 24, 2026

Summary

We noticed that the Entity Analytics skill in Agent Builder was often hitting a GRAPH_RECURSION_LIMIT error during multi-turn conversations. The repro is straightforward: ask the agent to show the Entity Analytics dashboard, then follow up with something like "help me analyze the 3 critical risks" — the second turn would fail with the LangGraph recursion error.

Digging in, it looks like the Entity Analytics skill surfaces this because it's one of the more tool-heavy flows (search entities → get entity per result → add dashboard attachment), but the underlying issue appears to be in the platform agent builder graph, not the skill itself.

Fix 1: getRecursionLimit formula

The getRecursionLimit function assumed 2 graph steps per research cycle (researchAgent + executeTool), but the graph actually traverses 3 nodes per cycle because checkBackgroundWork sits between executeTool and researchAgent on every loop. This meant the LangGraph recursion limit was set to 38 (15 * 2 + 8) when the graph could need up to ~55 steps to complete 15 cycles plus the answer phase. In practice the graph would hit the hard limit at around 11 tool cycles instead of the intended 15.

Updated to cycleLimit * 3 + 10, which gives 55 for a cycle limit of 15.
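The arithmetic behind Fix 1 can be checked with a small sketch. Only the name getRecursionLimit and the two formulas come from this PR; oldGetRecursionLimit, maxFullCycles, and the constants are hypothetical names used for illustration, and the real function in the agent builder graph may be shaped differently.

```typescript
// Sketch only: every name other than getRecursionLimit is hypothetical.
const OLD_STEPS_PER_CYCLE = 2; // researchAgent + executeTool
const NEW_STEPS_PER_CYCLE = 3; // checkBackgroundWork + researchAgent + executeTool

// Formula before this PR: 2 steps per cycle plus 8 extra steps.
function oldGetRecursionLimit(cycleLimit: number): number {
  return cycleLimit * OLD_STEPS_PER_CYCLE + 8;
}

// Formula after this PR: 3 steps per cycle plus 10 extra steps.
function getRecursionLimit(cycleLimit: number): number {
  return cycleLimit * NEW_STEPS_PER_CYCLE + 10;
}

// Upper bound on the number of full cycles that fit in a recursion limit,
// ignoring any steps spent outside the research loop.
function maxFullCycles(recursionLimit: number, stepsPerCycle: number): number {
  return Math.floor(recursionLimit / stepsPerCycle);
}

const cycleLimit = 15;
console.log(oldGetRecursionLimit(cycleLimit)); // 38
console.log(getRecursionLimit(cycleLimit)); // 55
console.log(maxFullCycles(oldGetRecursionLimit(cycleLimit), NEW_STEPS_PER_CYCLE)); // 12
```

The last value is only an upper bound: steps taken outside the loop also count against the limit, which is consistent with the failure showing up at around 11 tool cycles.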

Fix 2: currentCycle carry-over across rounds

currentCycle and errorCount were being restored from the previous round's state unconditionally. This means if round 1 used 6 cycles, round 2 would start at cycle 6 instead of 0 — shrinking the effective tool budget for every subsequent turn in a conversation. This state restoration looks like it's intended for HITL (human-in-the-loop) resume when the round is awaitingPrompt, so the fix gates it on that status.
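A minimal sketch of the gating described in Fix 2, assuming the previous round exposes a status field alongside its counters. The PreviousRoundState and CycleCounters types and the initialCounters helper are hypothetical; only currentCycle, errorCount, and the awaitingPrompt status are taken from this PR.

```typescript
// Hypothetical shape of the state kept from the previous round.
interface PreviousRoundState {
  status: string; // 'awaitingPrompt' marks a HITL pause; other values are assumed
  currentCycle: number;
  errorCount: number;
}

interface CycleCounters {
  currentCycle: number;
  errorCount: number;
}

// Restore counters only when resuming a round that is awaiting a prompt.
// Any other case (no previous round, or a finished one) starts fresh.
function initialCounters(previous?: PreviousRoundState): CycleCounters {
  if (previous !== undefined && previous.status === 'awaitingPrompt') {
    return {
      currentCycle: previous.currentCycle,
      errorCount: previous.errorCount,
    };
  }
  return { currentCycle: 0, errorCount: 0 };
}

// Round 1 used 6 cycles and finished: round 2 starts at cycle 0.
console.log(initialCounters({ status: 'completed', currentCycle: 6, errorCount: 1 }));

// Round 1 paused for human input: the resume keeps its counters.
console.log(initialCounters({ status: 'awaitingPrompt', currentCycle: 6, errorCount: 1 }));
```

With this gating, a new user message gets the full tool budget, while a HITL resume continues from where the paused round stopped.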

The `getRecursionLimit` formula assumed 2 graph steps per research cycle
(researchAgent + executeTool), but the actual graph has 3 steps per cycle
(checkBackgroundWork + researchAgent + executeTool). This caused
LangGraph's GRAPH_RECURSION_LIMIT to fire at ~11 tool cycles instead of
the intended 15, particularly affecting tool-heavy skills like entity
analytics.

Additionally, `currentCycle` and `errorCount` were unconditionally
restored from the previous round's state, meaning each new user message
inherited the accumulated cycle count from the prior turn. This is only
correct for HITL (human-in-the-loop) resume when the round status is
`awaitingPrompt`; completed rounds should start fresh at cycle 0.

Made-with: Cursor
@hop-dev hop-dev added the release_note:skip, Team:Entity Analytics, backport:version, and v9.4.0 labels on Apr 24, 2026