
[Automatic Migrations] Tune Dashboards Migration Graph task concurrency to avoid Rate limit errors.#236535

Merged
logeekal merged 11 commits into elastic:main from logeekal:fix/rate_limit_error
Oct 1, 2025

Conversation

@logeekal
Contributor

@logeekal logeekal commented Sep 26, 2025

Problem

Resolves https://github.com/elastic/security-team/issues/14004

Tip

Enable the experimental feature below before using this feature:

xpack.securitySolution.enableExperimental:
  - automaticDashboardsMigration

This PR improves CPU performance, token usage, and the error rate for Dashboard Migrations.

Before this PR, a dashboard migration run led to many panels erroring out with rate limit errors, as can be seen in the screenshot below from main: almost all panels are failing.

Rate Limit Error Screenshot Image

This would also choke up the Kibana Task Manager, as you might observe when desk testing. This is hard to reproduce because the issue occurs intermittently, depending on which part of the graph is running, especially 2-3 minutes in when the graph is well underway.

Explanation

The cause was too many Translation graphs running at once, with each Translation graph triggering all Panel graphs in parallel. The current limit on the Dashboard graph is 10, and there is no limit on panels.

That means that if there are 10 Dashboard graphs running (our current concurrency limit), each with 5 panels, there will be 50 tasks running at the same time and making calls to the LLM. This chokes up the system and triggers more LLM rate limit errors.
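The multiplication of nested limits can be sketched with a minimal semaphore. This is an illustrative model, not the Kibana implementation; names like `migrateAll` and the task durations are hypothetical. The point is that the worst-case number of in-flight LLM calls is `dashboardLimit * panelLimit`:

```typescript
// Minimal bounded-concurrency sketch (illustrative, not Kibana code).
class Semaphore {
  private queue: Array<() => void> = [];
  private active = 0;
  constructor(private readonly limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    while (this.active >= this.limit) {
      await new Promise<void>((resolve) => this.queue.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // wake one waiter
    }
  }
}

const DASHBOARD_CONCURRENCY = 3;
const PANEL_CONCURRENCY = 4;

// Returns the peak number of simultaneously running panel tasks,
// which nested limits bound at 3 * 4 = 12 (vs. 10 * unlimited before).
async function migrateAll(dashboards: string[][]): Promise<number> {
  const dashSem = new Semaphore(DASHBOARD_CONCURRENCY);
  let inFlight = 0;
  let peakInFlight = 0;

  await Promise.all(
    dashboards.map((panels) =>
      dashSem.run(async () => {
        const panelSem = new Semaphore(PANEL_CONCURRENCY);
        await Promise.all(
          panels.map(() =>
            panelSem.run(async () => {
              inFlight++;
              peakInFlight = Math.max(peakInFlight, inFlight);
              await new Promise((r) => setTimeout(r, 5)); // stand-in for an LLM call
              inFlight--;
            })
          )
        );
      })
    )
  );
  return peakInFlight;
}
```

With the old 10-dashboard limit and unlimited panels, the same structure could put every panel of 10 dashboards in flight at once, which is what saturated the task manager and the LLM quota.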

Tip

The objective of this PR was to tune the concurrency of the Dashboard migration so as to reduce the instances of the issues mentioned above.

TL;DR

After testing multiple concurrency configurations, I think we can go with the 3x4 config (3 dashboards, each with 4 panels, concurrently). The section below details those experiments and the corresponding justifications.

Feel free to skip the section and start the testing instead.

Solution and Experiments

All changes are applicable to Dashboard migrations only.

I settled on a 3x4 concurrency setting, meaning at most 3 dashboards, each with at most 4 panels, running concurrently. I arrived at this configuration after the series of experiments below.

  • All tests were done on Elastic LLM.
  • 3 retries have been switched on for all LLM nodes, so if a rate limit error occurs we can retry. If those retries fail for a panel, the process is aborted; the Aborted errors in the traces below are the result of multiple retries after the rate limit was hit.
  • All tests were done on the same dashboard dataset, described below:
    • 7 dashboards
    • 4 fully/partially translatable
    • 2 have errors - No panels found
    • 1 (Content Overview) fails with some unknown error in the Index pattern node. Not sure of the cause; let's ignore this node for now.
| Dashboard Concurrency | Panels Concurrency | Error % | Token Usage | Time Taken | Langsmith Trace |
|---|---|---|---|---|---|
| 5 | 10 | 89% | ~95K | ~3m | https://ela.st/5x10 |
| 5 | 4 | 71% | ~148K | 3m 20s | https://ela.st/5x4 |
| 3 | 4 | 50% | 120K | ~3m | https://ela.st/3x4-concurrency |
| 1 | 10 | 40% | ~292K | ~6m | https://ela.st/1x10 |

Tip

In the 3x4 and 1x10 runs, I did not see any instance of a rate limit error.
These error rates are only for comparison; they will look better with more dashboards that are bound to succeed. Our test dataset also contained some dashboards that were guaranteed failures.

See the screenshots for all of the runs mentioned above.

5x10 image
5x4 image
3x4 image
1x10 image

Final run on the deployed Project.

With the selected 3x4 configuration, I did a final run and the results were much better. However, there were still rate limit errors here and there. I think we can merge this in, watch the performance over time, and make some more tweaks to the RetryPolicy.
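The retry behavior described above (retry on rate limit, abort the panel once retries are exhausted) can be sketched as follows. This is a hypothetical illustration, not the actual Kibana RetryPolicy; `withRateLimitRetry`, `RateLimitError`, and the backoff delays are all assumed names and values:

```typescript
// Hypothetical retry-policy sketch: up to 3 retries with exponential
// backoff on rate-limit (429) errors; non-rate-limit errors fail fast.
const MAX_RETRIES = 3;

class RateLimitError extends Error {
  readonly status = 429;
}

async function withRateLimitRetry<T>(
  invoke: () => Promise<T>,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await invoke();
    } catch (err) {
      if (!(err instanceof RateLimitError)) throw err; // only retry 429s
      if (attempt >= MAX_RETRIES) {
        // Retries exhausted: surface as an abort so the panel is marked failed.
        throw new Error(`Aborted after ${MAX_RETRIES} retries: ${err.message}`);
      }
      await sleep(baseDelayMs * 2 ** attempt); // backoff: 1s, 2s, 4s
    }
  }
}
```

A longer base delay or jitter would be natural "tweaks to the RetryPolicy" of the kind mentioned above, trading migration time for fewer aborted panels.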

There were 40 dashboards in total, and most of them had valid data.

| Dashboard Concurrency | Panels Concurrency | Error % | Token Usage | Time Taken | Langsmith Trace |
|---|---|---|---|---|---|
| 3 | 4 | 37% | 380K | ~21m | https://ela.st/3x4-project |
image

Results are available here: https://keepkibana-pr-236535-security-b9fbee.kb.eu-west-1.aws.qa.elastic.cloud/app/security/siem_migrations/dashboards/fe0ab206-2249-47b4-8bce-d55811efd214

Credentials can be found [here](https://p.elstc.co/paste/QAIlMYEH#-Zf/fUtX2xUUQ83uSLaNRf9QxJtBJVkSCuMhwracC+h)

Testing Guidelines.

Things to test

  1. First, run the given dashboards migration on main and note the following:

    • Rate limit errors in each panel (observable from the Comments section). This is easier with the small dataset given below.

    • Kibana's performance while the migration is running. Pay specific attention to the following (easier with the big dataset, also given below):

      • Time server requests are taking.

      • Time taken during a hard refresh.

      • Navigation lags.

  2. Next, repeat the same on this PR branch; the results should be much better.

Data

  • 7 Dashboards (https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link) - great for checking the rate limit error
  • 40 Dashboards (https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link) - great for checking Kibana perf

Both macros and lookups are also available in the same folder.

@logeekal logeekal added release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting Team:Threat Hunting Security Solution Threat Hunting Team ci:project-deploy-security Create a Security Serverless Project ci:project-persist-deployment Persist project deployment indefinitely ci:project-redeploy Always create a new Cloud project labels Sep 26, 2025
@kibanamachine
Contributor

Project deployed, see credentials at: https://buildkite.com/elastic/kibana-deploy-project-from-pr/builds/623

@logeekal logeekal marked this pull request as ready for review September 26, 2025 14:23
@logeekal logeekal requested review from a team as code owners September 26, 2025 14:23
@elasticmachine
Contributor

Pinging @elastic/security-threat-hunting (Team:Threat Hunting)

@logeekal logeekal changed the title [Automatic Migrations] Tune Dashboards Migration Graph task concurrency [Automatic Migrations] Tune Dashboards Migration Graph task concurrency to avoid Rate limit errors. Sep 26, 2025
@logeekal logeekal removed the ci:project-redeploy Always create a new Cloud project label Sep 26, 2025
@logeekal logeekal marked this pull request as draft September 26, 2025 14:32
@e40pud
Contributor

e40pud commented Sep 30, 2025

Tested this PR locally and here are the results:

Rate limit 🟢

It looks like this PR addressed the "429 rate limit" error. In main, I saw many error messages like the one below, which are gone in the PR branch:

[2025-09-30T14:26:44.601+02:00][WARN ][plugins.actions.gen-ai] action execution failure: .gen-ai:gpt-4-1: GPT-4.1 (Azure OpenAI): an error occurred while running the action: Status code: 429. Message: API Error: Too Many Requests - Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2025-01-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 1 second. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit. For Free Account customers, upgrade to Pay as you Go here: https://aka.ms/429TrialUpgrade.; retry: true

Performance 🔴

Performance-wise, Kibana is really slow (at least locally). Navigating to different pages while the dashboards translation is in progress is really painful, and any page takes ages to load. As discussed in the sync, I will test dashboard translations in the deployed environment, where it may be less critical thanks to the multiple (scalable) task managers.

Contributor

@e40pud e40pud left a comment


It looks like we do not have that bad a performance issue in the deployed environment, where we utilize scalable task managers. We should still keep an eye on this and monitor the performance.

@kqualters-elastic
Contributor

kqualters-elastic commented Sep 30, 2025

Agreed on everything @e40pud said, and we can probably merge this close to as is as it fixes the rate limiting issue. The performance issues we need to continue to look at though I think, feels like something is going on where the main thread is blocked more often than it should be.

@logeekal logeekal marked this pull request as ready for review September 30, 2025 16:25
@logeekal logeekal enabled auto-merge (squash) September 30, 2025 16:26
@logeekal
Contributor Author

Agreed on everything @e40pud said, and we can probably merge this close to as is as it fixes the rate limiting issue. The task manager issue we need to continue to look at though I think, feels like something is going on where the main thread is blocked more often than it should be.

@kqualters-elastic, Great. I have made it ready for review; I just saw that this also needs one Threat Hunting approval. I have put it on auto-merge as well.

Contributor

@kqualters-elastic kqualters-elastic left a comment


Looked at it closely locally; it definitely fixes the backoff issue. I think there are some other slightly off things going on in this code path, but we can address those later. LGTM 👍

@logeekal
Contributor Author

logeekal commented Oct 1, 2025

@elasticsearch merge upstream

@logeekal
Contributor Author

logeekal commented Oct 1, 2025

@elasticmachine merge upstream

@elasticmachine
Contributor

elasticmachine commented Oct 1, 2025

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #10 / CasesWebhookActionConnectorFields renders Step Validation Step 2 is properly validated

Metrics [docs]

✅ unchanged

History

@logeekal logeekal merged commit 82cd586 into elastic:main Oct 1, 2025
12 checks passed
fkanout pushed a commit to fkanout/kibana that referenced this pull request Oct 1, 2025
rylnd pushed a commit to rylnd/kibana that referenced this pull request Oct 17, 2025

Labels

backport:skip This PR does not require backporting ci:project-deploy-security Create a Security Serverless Project ci:project-persist-deployment Persist project deployment indefinitely release_note:skip Skip the PR/issue when compiling release notes Team:Threat Hunting Security Solution Threat Hunting Team v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants