
[Automatic Migrations] Tune Dashboards Migration Graph task concurrency to avoid Rate limit errors.#236535

Merged
logeekal merged 11 commits into elastic:main from logeekal:fix/rate_limit_error
Oct 1, 2025

Conversation

@logeekal
Contributor

@logeekal logeekal commented Sep 26, 2025

Problem

Resolves https://github.com/elastic/security-team/issues/14004

Tip

Enable the experimental feature below before using this feature:

xpack.securitySolution.enableExperimental:
  - automaticDashboardsMigration

This PR improves CPU performance, token usage, and the error rate for Dashboard Migrations.

Before this PR, a dashboard migration run led to many panels erroring out with rate limit errors, as can be seen in the screenshot below from main: almost all panels are failing.

Rate Limit Error Screenshot Image

This would also choke up the Kibana Task Manager, as you might observe when desk testing. This is hard to reproduce because the issue occurs intermittently, depending on which part of the graph is running, especially 2-3 minutes in when the graph is well underway.

Explanation

The cause was too many Translation graphs running at once, with each Translation graph triggering all Panel graphs in parallel. The current limit on the Dashboard graph is 10, and there is no limit on panels.

That means that if there are 10 Dashboard graphs running (our current concurrency limit), each with 5 panels, there will be 50 tasks running at the same time and making calls to the LLM. This chokes up the system and triggers more LLM rate limit errors.
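The multiplication of nested limits can be sketched with a minimal semaphore. This is an illustrative model, not the Kibana implementation; names like `migrateAll` and the task durations are hypothetical. The point is that the worst-case number of in-flight LLM calls is `dashboardLimit * panelLimit`:

```typescript
// Minimal bounded-concurrency sketch (illustrative, not Kibana code).
class Semaphore {
  private queue: Array<() => void> = [];
  private active = 0;
  constructor(private readonly limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    while (this.active >= this.limit) {
      await new Promise<void>((resolve) => this.queue.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // wake one waiter
    }
  }
}

const DASHBOARD_CONCURRENCY = 3;
const PANEL_CONCURRENCY = 4;

// Returns the peak number of simultaneously running panel tasks,
// which nested limits bound at 3 * 4 = 12 (vs. 10 * unlimited before).
async function migrateAll(dashboards: string[][]): Promise<number> {
  const dashSem = new Semaphore(DASHBOARD_CONCURRENCY);
  let inFlight = 0;
  let peakInFlight = 0;

  await Promise.all(
    dashboards.map((panels) =>
      dashSem.run(async () => {
        const panelSem = new Semaphore(PANEL_CONCURRENCY);
        await Promise.all(
          panels.map(() =>
            panelSem.run(async () => {
              inFlight++;
              peakInFlight = Math.max(peakInFlight, inFlight);
              await new Promise((r) => setTimeout(r, 5)); // stand-in for an LLM call
              inFlight--;
            })
          )
        );
      })
    )
  );
  return peakInFlight;
}
```

With the old 10-dashboard limit and unlimited panels, the same structure could put every panel of 10 dashboards in flight at once, which is what saturated the task manager and the LLM quota.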

Tip

The objective of this PR was to tune the concurrency of the Dashboard migration so as to reduce the instances of the issues mentioned above.

TL;DR

After testing multiple concurrency configurations, I think we can go with the 3x4 config (3 dashboards, each with 4 panels, concurrently). The section below details those experiments and the corresponding justifications.

Feel free to skip the section and start the testing instead.

Solution and Experiments

All changes are applicable to Dashboard migrations only.

I settled on a 3x4 concurrency setting, meaning at most 3 dashboards, each with at most 4 panels, running concurrently. I arrived at this configuration after the series of experiments below.

  • All tests were done on Elastic LLM.
  • 3 retries have been switched on for all LLM nodes, so if a rate limit error occurs we can retry. If those retries fail for a panel, the process is aborted; the Aborted errors in the traces below are the result of multiple retries after the rate limit was hit.
  • All tests were done on the same dashboard dataset, described below:
    • 7 dashboards
    • 4 fully/partially translatable
    • 2 have errors - No panels found
    • 1 (Content Overview) fails with some unknown error in the Index pattern node. Not sure of the cause; let's ignore this node for now.
| Dashboard Concurrency | Panels Concurrency | Error % | Token Usage | Time Taken | Langsmith Trace |
|---|---|---|---|---|---|
| 5 | 10 | 89% | ~95K | ~3m | https://ela.st/5x10 |
| 5 | 4 | 71% | ~148K | 3m 20s | https://ela.st/5x4 |
| 3 | 4 | 50% | 120K | ~3m | https://ela.st/3x4-concurrency |
| 1 | 10 | 40% | ~292K | ~6m | https://ela.st/1x10 |

Tip

In the 3x4 and 1x10 runs, I did not see any instance of a rate limit error.
These error rates are only for comparison; they will look better with more dashboards that are bound to succeed. Our test dataset also contained some dashboards that were guaranteed failures.

See the screenshots for all of the runs mentioned above.

5x10 image
5x4 image
3x4 image
1x10 image

Final run on the deployed Project.

With the selected 3x4 configuration, I did a final run and the results were much better. However, there were still rate limit errors here and there. I think we can merge this in, watch the performance over time, and make some more tweaks to the RetryPolicy.
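The retry behavior described above (retry on rate limit, abort the panel once retries are exhausted) can be sketched as follows. This is a hypothetical illustration, not the actual Kibana RetryPolicy; `withRateLimitRetry`, `RateLimitError`, and the backoff delays are all assumed names and values:

```typescript
// Hypothetical retry-policy sketch: up to 3 retries with exponential
// backoff on rate-limit (429) errors; non-rate-limit errors fail fast.
const MAX_RETRIES = 3;

class RateLimitError extends Error {
  readonly status = 429;
}

async function withRateLimitRetry<T>(
  invoke: () => Promise<T>,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await invoke();
    } catch (err) {
      if (!(err instanceof RateLimitError)) throw err; // only retry 429s
      if (attempt >= MAX_RETRIES) {
        // Retries exhausted: surface as an abort so the panel is marked failed.
        throw new Error(`Aborted after ${MAX_RETRIES} retries: ${err.message}`);
      }
      await sleep(baseDelayMs * 2 ** attempt); // backoff: 1s, 2s, 4s
    }
  }
}
```

A longer base delay or jitter would be natural "tweaks to the RetryPolicy" of the kind mentioned above, trading migration time for fewer aborted panels.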

There were 40 dashboards in total, and most of them had valid data.

| Dashboard Concurrency | Panels Concurrency | Error % | Token Usage | Time Taken | Langsmith Trace |
|---|---|---|---|---|---|
| 3 | 4 | 37% | 380K | ~21m | https://ela.st/3x4-project |
image

Results are available here: https://keepkibana-pr-236535-security-b9fbee.kb.eu-west-1.aws.qa.elastic.cloud/app/security/siem_migrations/dashboards/fe0ab206-2249-47b4-8bce-d55811efd214

Credentials can be found [here](https://p.elstc.co/paste/QAIlMYEH#-Zf/fUtX2xUUQ83uSLaNRf9QxJtBJVkSCuMhwracC+h)

Testing Guidelines.

Things to test

  1. First, run the given dashboards migration on main and note the following:

    • Rate limit errors in each panel (observable from the Comments section). This is easier with the small dataset given below.

    • Kibana's performance while the migration is running. Pay specific attention to the following (easier with the big dataset, also given below):

      • Time server requests are taking.

      • Time taken during a hard refresh.

      • Navigation lags.

  2. Next, repeat the same on this PR branch; the results should be much better.

Data

  • 7 Dashboards (https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link) - great for checking the rate limit error
  • 40 Dashboards (https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link) - great for checking Kibana perf

Both macros and lookups are also available in the same folder.

@logeekal logeekal added release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting Team:Threat Hunting Security Solution Threat Hunting Team ci:project-deploy-security Create a Security Serverless Project ci:project-persist-deployment Persist project deployment indefinitely ci:project-redeploy Always create a new Cloud project labels Sep 26, 2025
@kibanamachine
Contributor

Project deployed, see credentials at: https://buildkite.com/elastic/kibana-deploy-project-from-pr/builds/623

@logeekal logeekal marked this pull request as ready for review September 26, 2025 14:23
@logeekal logeekal requested review from a team as code owners September 26, 2025 14:23
@elasticmachine
Contributor

Pinging @elastic/security-threat-hunting (Team:Threat Hunting)

@logeekal logeekal changed the title [Automatic Migrations] Tune Dashboards Migration Graph task concurrency [Automatic Migrations] Tune Dashboards Migration Graph task concurrency to avoid Rate limit errors. Sep 26, 2025
@logeekal logeekal removed the ci:project-redeploy Always create a new Cloud project label Sep 26, 2025
@logeekal logeekal marked this pull request as draft September 26, 2025 14:32
@e40pud
Contributor

e40pud commented Sep 30, 2025

Tested this PR locally and here are the results:

Rate limit 🟢

It looks like this PR addressed the "429 rate limit" error. In main, I saw many error messages like the one below, which are gone in the PR branch:

[2025-09-30T14:26:44.601+02:00][WARN ][plugins.actions.gen-ai] action execution failure: .gen-ai:gpt-4-1: GPT-4.1 (Azure OpenAI): an error occurred while running the action: Status code: 429. Message: API Error: Too Many Requests - Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2025-01-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 1 second. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit. For Free Account customers, upgrade to Pay as you Go here: https://aka.ms/429TrialUpgrade.; retry: true

Performance 🔴

Performance-wise, Kibana is really slow (at least locally). Navigating to different pages while the dashboards translation is in progress is really painful, and any page takes ages to load. As discussed in the sync, I will test dashboard translations in the deployed environment, where it may be less critical thanks to the multiple (scalable) task managers.

Contributor

@e40pud e40pud left a comment


It looks like we do not have that bad a performance issue in the deployed environment, where we utilize scalable task managers. We should still keep an eye on this and monitor the performance.

@kqualters-elastic
Contributor

kqualters-elastic commented Sep 30, 2025

Agreed on everything @e40pud said, and we can probably merge this close to as is as it fixes the rate limiting issue. The performance issues we need to continue to look at though I think, feels like something is going on where the main thread is blocked more often than it should be.

@logeekal logeekal marked this pull request as ready for review September 30, 2025 16:25
@logeekal logeekal enabled auto-merge (squash) September 30, 2025 16:26
@logeekal
Contributor Author

Agreed on everything @e40pud said, and we can probably merge this close to as is as it fixes the rate limiting issue. The task manager issue we need to continue to look at though I think, feels like something is going on where the main thread is blocked more often than it should be.

@kqualters-elastic, Great. I have made it ready for review; I just saw that this also needs one Threat Hunting approval. I have put it on auto-merge as well.

Contributor

@kqualters-elastic kqualters-elastic left a comment


Looked at it closely locally; it definitely fixes the backoff issue. I think there are some other slightly off things going on in this code path, but we can address those later. LGTM 👍

@logeekal
Contributor Author

logeekal commented Oct 1, 2025

@elasticsearch merge upstream

@logeekal
Contributor Author

logeekal commented Oct 1, 2025

@elasticmachine merge upstream

@elasticmachine
Contributor

elasticmachine commented Oct 1, 2025

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #10 / CasesWebhookActionConnectorFields renders Step Validation Step 2 is properly validated

Metrics [docs]

✅ unchanged

History

@logeekal logeekal merged commit 82cd586 into elastic:main Oct 1, 2025
12 checks passed
fkanout pushed a commit to fkanout/kibana that referenced this pull request Oct 1, 2025
rylnd pushed a commit to rylnd/kibana that referenced this pull request Oct 17, 2025

Labels

backport:skip This PR does not require backporting ci:project-deploy-security Create a Security Serverless Project ci:project-persist-deployment Persist project deployment indefinitely release_note:skip Skip the PR/issue when compiling release notes Team:Threat Hunting Security Solution Threat Hunting Team v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants