[Automatic Migrations] Tune Dashboards Migration Graph task concurrency to avoid Rate limit errors.#236535
Conversation
|
Project deployed, see credentials at: https://buildkite.com/elastic/kibana-deploy-project-from-pr/builds/623 |
|
Pinging @elastic/security-threat-hunting (Team:Threat Hunting) |
|
Tested this PR locally and here are the results: Rate limit 🟢It looks like this PR addressed the "429 rate limit" error. In main, I saw many error messages as shown below which are gone in the PR branch Performance 🔴Performance-wise the kibana is really slow (at least locally). Navigating to different pages while dashboards translation is going on is really painful and takes ages to load any page. As discussed in the sync, I will test dashboards translations in the deployed environment where it can be not that critical due to multiple (scalable nature of) task managers. |
e40pud
left a comment
There was a problem hiding this comment.
It looks like we do not have that bad performance issue in the dployed environment where we utilize scalable task managers. We still should keep an eye on this one and monitor the performance.
...rity/plugins/security_solution/server/lib/siem_migrations/dashboards/task/agent/constants.ts
Show resolved
Hide resolved
|
Agreed on everything @e40pud said, and we can probably merge this close to as is as it fixes the rate limiting issue. The performance issues we need to continue to look at though I think, feels like something is going on where the main thread is blocked more often than it should be. |
@kqualters-elastic, Great. I have made it |
kqualters-elastic
left a comment
There was a problem hiding this comment.
looked at it closely locally, definitely fixes the backoff issue, I think there are some other slightly off things going on in this code path, but can address those later. LGTM 👍
|
@elasticsearch merge upstream |
|
@elasticmachine merge upstream |
💛 Build succeeded, but was flaky
Failed CI StepsTest Failures
Metrics [docs]
History
|
…cy to avoid Rate limit errors. (elastic#236535) ## Problem Resolves elastic/security-team#14004 > [!TIP] > Enable below experimental feature before using this feature. > ``` > xpack.securitySolution.enableExperimental: > - automaticDashboardsMigration > ``` This PR improves the CPU performance, Token Performance and Error rate for Dashboard Migrations. Before this PR, a dashboard migration run leads to lot of panels erroring out with `Rate Limits error` as can be seen in below screenshot from `main`. you can see that almost all panels are failing. <details> <summary>Rate Limit Error Screenshot </summary> <img width="2271" height="1202" alt="Image" src="https://github.com/user-attachments/assets/36949d7f-a029-4bfd-a834-4899a4b9a600" /> </details> This would also result in Kibana Task manager being choked up as you might observe when desk testing. This is hard to reproduce because issue occurs intermittently, depending on which part of graph you are in. Specially after 2-3 minutes when graph is well underway. ## Explanation It was because of too many Translation graphs running and each Translation graph triggering `All` Panel graphs in parallel. Current limit on Dashboard graph is 10 and on panels there is not limit. That means, if there are 10 Dashboard graphs ( out current concurrency limit ) running each with 5 panels, there will be 50 tasks running at the same time and making calls to LLM. This results in system choking up and LLM more rate limit calls. > [!TIP] > The objective of this PR was to tune the concurrency of Dashboard migration so as to reduce the instances of above mentioned issues. ## TLDR; After testing multiple concurrency configurations, I think we can go with 3x4 ( 3 dashboards, each with 4 Panels concurrently ) config. Below section details of those experiments and corresponding justifications. Feel free to skip the section and start the testing instead. <details> <summary><h2> Solution and Experiments </h2></summary> All changes are applicable to `Dashboard` migrations only. I reached to a setting of 3x4 concurrency. What this means is 3 Dashboards ( at max) each with 4 panels ( at max ) running concurrently. I arrived at this configuration after a series of experiments as you can see. - All tests were done on `Elastic LLM`. - 3 Retries has been switched on for all LLM nodes. So if Rate limit error occurs, we can retry it. If those retries fail for a panel, the process is aborted. So in below traces Aborted error are because of multi retries after rate limit was hit. - All tests were done on same dashboard data set as explained below - 7 dashboards - 4 fully/partially translatable - 2 have errors - No panels found - 1 (`Content Overview`) errors with some unknown error in `Index pattern` node. Not sure of the error. Let's ignore this for this node. | Dashboard Concurrency| Panels Concurrency | Error % | Token Usage | Time Taken |Langsmith Trace | |---|---|---|---|---|---| |5|10|89%|~95K|~3m|https://ela.st/5x10| |5|4|71%|~148K|3m 20s|https://ela.st/5x4| |3|4|50%|120K|~3m|https://ela.st/3x4-concurrency| |1|10|40%|~292K|~6m|https://ela.st/1x10| > [!TIP] > In 3x4 and 1x10, I did not see any instance of Rate limit error. > These error rates are only for comparison, they will perform better with more dashboards which are bound to be successful. Out test dataset had some dashboard which were failures also. See screenshots for all of runs mentioned above <details> <summary>5x10</summary> <img width="2279" height="769" alt="image" src="https://github.com/user-attachments/assets/cb2a3664-8bf4-497c-b2b6-3da099fc7768" /> </details> <details> <summary>5x4</summary> <img width="2281" height="727" alt="image" src="https://github.com/user-attachments/assets/af16de94-f787-4e46-9818-52e882b2190c" /> </details> <details> <summary>3x4</summary> <img width="2288" height="804" alt="image" src="https://github.com/user-attachments/assets/be0dbf2a-003c-4770-ae67-0d74902ab7a5" /> </details> <details> <summary>1x10</summary> <img width="2305" height="759" alt="image" src="https://github.com/user-attachments/assets/34b91294-973d-4506-b431-09cfbd44ecc9" /> </details> ## Final run on the deployed Project. With the select 3x4 configuration, i did a final run and results were much better. However there were still `Rate limit` errors here and there. I think we can merge this in and see the performance overtime and do some more tweaks to the `RetryPolicy`. There were total 40 Dashboards and most of them had valid data. | Dashboard Concurrency| Panels Concurrency | Error % | Token Usage | Time taken|Langsmith Trace | |---|---|---|---|---|---| |3|4|37%|380K|~21m|https://ela.st/3x4-project| <img width="1107" height="606" alt="image" src="https://github.com/user-attachments/assets/6c476571-4cf8-43fe-8c74-5fab3a414be4" /> Results are available here : https://keepkibana-pr-236535-security-b9fbee.kb.eu-west-1.aws.qa.elastic.cloud/app/security/siem_migrations/dashboards/fe0ab206-2249-47b4-8bce-d55811efd214 Credentials can be found [here](https://p.elstc.co/paste/QAIlMYEH#-Zf/fUtX2xUUQ83uSLaNRf9QxJtBJVkSCuMhwracC+h) </details> ## Testing Guidelines. ### Things to test 1. First run given dashboards migration on `main` and note the following - Rate limit Errors in each panel ( can be observed from Comments section ). Easier to do with small data set as given below - <img width="1216" height="269" alt="image" src="https://github.com/user-attachments/assets/bf4385c6-98ba-4d5b-9fae-dea7a76317aa" /> - Performance of Kibana as run when Migration is running. Pay specific attention to following. This is easier to with big dataset. Also given below. - Time server requests are taking. - Time taken during hard refresh - Navigation Lags. 2. Next, repeat the same on this PR branch and results should be much better. ### Data - [7 Dashboards](https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link) ( great for checking rate limit error) - [40 Dashboards](https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link) ( great for checking kibana perf) Both Macros and lookups are also available in the same folder.
…cy to avoid Rate limit errors. (elastic#236535) ## Problem Resolves elastic/security-team#14004 > [!TIP] > Enable below experimental feature before using this feature. > ``` > xpack.securitySolution.enableExperimental: > - automaticDashboardsMigration > ``` This PR improves the CPU performance, Token Performance and Error rate for Dashboard Migrations. Before this PR, a dashboard migration run leads to lot of panels erroring out with `Rate Limits error` as can be seen in below screenshot from `main`. you can see that almost all panels are failing. <details> <summary>Rate Limit Error Screenshot </summary> <img width="2271" height="1202" alt="Image" src="https://github.com/user-attachments/assets/36949d7f-a029-4bfd-a834-4899a4b9a600" /> </details> This would also result in Kibana Task manager being choked up as you might observe when desk testing. This is hard to reproduce because issue occurs intermittently, depending on which part of graph you are in. Specially after 2-3 minutes when graph is well underway. ## Explanation It was because of too many Translation graphs running and each Translation graph triggering `All` Panel graphs in parallel. Current limit on Dashboard graph is 10 and on panels there is not limit. That means, if there are 10 Dashboard graphs ( out current concurrency limit ) running each with 5 panels, there will be 50 tasks running at the same time and making calls to LLM. This results in system choking up and LLM more rate limit calls. > [!TIP] > The objective of this PR was to tune the concurrency of Dashboard migration so as to reduce the instances of above mentioned issues. ## TLDR; After testing multiple concurrency configurations, I think we can go with 3x4 ( 3 dashboards, each with 4 Panels concurrently ) config. Below section details of those experiments and corresponding justifications. Feel free to skip the section and start the testing instead. <details> <summary><h2> Solution and Experiments </h2></summary> All changes are applicable to `Dashboard` migrations only. I reached to a setting of 3x4 concurrency. What this means is 3 Dashboards ( at max) each with 4 panels ( at max ) running concurrently. I arrived at this configuration after a series of experiments as you can see. - All tests were done on `Elastic LLM`. - 3 Retries has been switched on for all LLM nodes. So if Rate limit error occurs, we can retry it. If those retries fail for a panel, the process is aborted. So in below traces Aborted error are because of multi retries after rate limit was hit. - All tests were done on same dashboard data set as explained below - 7 dashboards - 4 fully/partially translatable - 2 have errors - No panels found - 1 (`Content Overview`) errors with some unknown error in `Index pattern` node. Not sure of the error. Let's ignore this for this node. | Dashboard Concurrency| Panels Concurrency | Error % | Token Usage | Time Taken |Langsmith Trace | |---|---|---|---|---|---| |5|10|89%|~95K|~3m|https://ela.st/5x10| |5|4|71%|~148K|3m 20s|https://ela.st/5x4| |3|4|50%|120K|~3m|https://ela.st/3x4-concurrency| |1|10|40%|~292K|~6m|https://ela.st/1x10| > [!TIP] > In 3x4 and 1x10, I did not see any instance of Rate limit error. > These error rates are only for comparison, they will perform better with more dashboards which are bound to be successful. Out test dataset had some dashboard which were failures also. See screenshots for all of runs mentioned above <details> <summary>5x10</summary> <img width="2279" height="769" alt="image" src="https://github.com/user-attachments/assets/cb2a3664-8bf4-497c-b2b6-3da099fc7768" /> </details> <details> <summary>5x4</summary> <img width="2281" height="727" alt="image" src="https://github.com/user-attachments/assets/af16de94-f787-4e46-9818-52e882b2190c" /> </details> <details> <summary>3x4</summary> <img width="2288" height="804" alt="image" src="https://github.com/user-attachments/assets/be0dbf2a-003c-4770-ae67-0d74902ab7a5" /> </details> <details> <summary>1x10</summary> <img width="2305" height="759" alt="image" src="https://github.com/user-attachments/assets/34b91294-973d-4506-b431-09cfbd44ecc9" /> </details> ## Final run on the deployed Project. With the select 3x4 configuration, i did a final run and results were much better. However there were still `Rate limit` errors here and there. I think we can merge this in and see the performance overtime and do some more tweaks to the `RetryPolicy`. There were total 40 Dashboards and most of them had valid data. | Dashboard Concurrency| Panels Concurrency | Error % | Token Usage | Time taken|Langsmith Trace | |---|---|---|---|---|---| |3|4|37%|380K|~21m|https://ela.st/3x4-project| <img width="1107" height="606" alt="image" src="https://github.com/user-attachments/assets/6c476571-4cf8-43fe-8c74-5fab3a414be4" /> Results are available here : https://keepkibana-pr-236535-security-b9fbee.kb.eu-west-1.aws.qa.elastic.cloud/app/security/siem_migrations/dashboards/fe0ab206-2249-47b4-8bce-d55811efd214 Credentials can be found [here](https://p.elstc.co/paste/QAIlMYEH#-Zf/fUtX2xUUQ83uSLaNRf9QxJtBJVkSCuMhwracC+h) </details> ## Testing Guidelines. ### Things to test 1. First run given dashboards migration on `main` and note the following - Rate limit Errors in each panel ( can be observed from Comments section ). Easier to do with small data set as given below - <img width="1216" height="269" alt="image" src="https://github.com/user-attachments/assets/bf4385c6-98ba-4d5b-9fae-dea7a76317aa" /> - Performance of Kibana as run when Migration is running. Pay specific attention to following. This is easier to with big dataset. Also given below. - Time server requests are taking. - Time taken during hard refresh - Navigation Lags. 2. Next, repeat the same on this PR branch and results should be much better. ### Data - [7 Dashboards](https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link) ( great for checking rate limit error) - [40 Dashboards](https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link) ( great for checking kibana perf) Both Macros and lookups are also available in the same folder.
Problem
Resolves https://github.com/elastic/security-team/issues/14004
Tip
Enable below experimental feature before using this feature.
This PR improves the CPU performance, Token Performance and Error rate for Dashboard Migrations.
Before this PR, a dashboard migration run leads to lot of panels erroring out with
Rate Limits erroras can be seen in below screenshot frommain. you can see that almost all panels are failing.Rate Limit Error Screenshot
This would also result in Kibana Task manager being choked up as you might observe when desk testing. This is hard to reproduce because issue occurs intermittently, depending on which part of graph you are in. Specially after 2-3 minutes when graph is well underway.
Explanation
It was because of too many Translation graphs running and each Translation graph triggering
AllPanel graphs in parallel. Current limit on Dashboard graph is 10 and on panels there is not limit.That means, if there are 10 Dashboard graphs ( out current concurrency limit ) running each with 5 panels, there will be 50 tasks running at the same time and making calls to LLM. This results in system choking up and LLM more rate limit
calls.
Tip
The objective of this PR was to tune the concurrency of Dashboard migration so as to reduce the instances of above mentioned issues.
TLDR;
After testing multiple concurrency configurations, I think we can go with 3x4 ( 3 dashboards, each with 4 Panels concurrently ) config. Below section details of those experiments and corresponding justifications.
Feel free to skip the section and start the testing instead.
Solution and Experiments
All changes are applicable to
Dashboardmigrations only.I reached to a setting of 3x4 concurrency. What this means is 3 Dashboards ( at max) each with 4 panels ( at max ) running concurrently. I arrived at this configuration after a series of experiments as you can see.
Elastic LLM.Content Overview) errors with some unknown error inIndex patternnode. Not sure of the error. Let's ignore this for this node.See screenshots for all of runs mentioned above
5x10
5x4
3x4
1x10
Final run on the deployed Project.
With the select 3x4 configuration, i did a final run and results were much better. However there were still
Rate limiterrors here and there. I think we can merge this in and see the performance overtime and do some more tweaks to theRetryPolicy.There were total 40 Dashboards and most of them had valid data.
Results are available here : https://keepkibana-pr-236535-security-b9fbee.kb.eu-west-1.aws.qa.elastic.cloud/app/security/siem_migrations/dashboards/fe0ab206-2249-47b4-8bce-d55811efd214
Credentials can be found here
Testing Guidelines.
Things to test
First run given dashboards migration on
mainand note the followingRate limit Errors in each panel ( can be observed from Comments section ). Easier to do with small data set as given below
Performance of Kibana as run when Migration is running. Pay specific attention to following. This is easier to with big dataset. Also given below.
Time server requests are taking.
Time taken during hard refresh
Navigation Lags.
Next, repeat the same on this PR branch and results should be much better.
Data
Both Macros and lookups are also available in the same folder.