Revert "[Incident Management] Add page attachment modal (#231186)" #236869

Closed

fkanout wants to merge 7 commits into elastic:main from
Conversation
…)" This reverts commit 807b177.
Contributor
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)
…urce-definitions/scripts/fix-location-collection.ts'
…cy to avoid Rate limit errors. (elastic#236535)

## Problem

Resolves elastic/security-team#14004

> [!TIP]
> Enable the experimental feature below before using this feature.
> ```
> xpack.securitySolution.enableExperimental:
>   - automaticDashboardsMigration
> ```

This PR improves the CPU performance, token consumption, and error rate of Dashboard Migrations.

Before this PR, a dashboard migration run left many panels erroring out with `Rate Limit` errors, as can be seen in the screenshot below from `main`, where almost all panels are failing.

<details>
<summary>Rate Limit Error Screenshot</summary>
<img width="2271" height="1202" alt="Image" src="https://github.com/user-attachments/assets/36949d7f-a029-4bfd-a834-4899a4b9a600" />
</details>

This would also choke up the Kibana Task Manager, as you might observe when desk testing. It is hard to reproduce because the issue occurs intermittently, depending on which part of the graph you are in, especially after 2-3 minutes, once the graph is well underway.

## Explanation

The cause was too many Translation graphs running at once, with each Translation graph triggering *all* of its Panel graphs in parallel. The current limit on Dashboard graphs is 10, and there is no limit on panels. That means that with 10 Dashboard graphs (our current concurrency limit) running, each with 5 panels, there will be 50 tasks running at the same time, all making calls to the LLM. This chokes up the system and produces more rate-limit errors from the LLM.

> [!TIP]
> The objective of this PR is to tune the concurrency of Dashboard migrations so as to reduce the instances of the issues mentioned above.

## TL;DR

After testing multiple concurrency configurations, I think we can go with the 3x4 config (3 dashboards, each with 4 panels running concurrently). The section below details those experiments and the corresponding justifications. Feel free to skip it and start testing instead.
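The multiplication effect described in the Explanation above can be sketched with a small worker-pool helper. This is an illustrative sketch only, not the actual Kibana task-manager or migration code; `runWithConcurrency`, `migrateDashboards`, and `translatePanel` are hypothetical names.

```typescript
// Run `tasks` with at most `max` of them in flight at any time.
// Each worker pulls the next unstarted task until none remain.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  max: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // safe: synchronous in single-threaded JS
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(Array.from({ length: Math.min(max, tasks.length) }, worker));
  return results;
}

// Stub standing in for the real per-panel LLM translation call.
const translatePanel = async (panel: string) => panel;

// Nested use: at most 3 dashboards, each with at most 4 panels concurrently,
// caps in-flight LLM calls at 3 * 4 = 12, versus 10 dashboards * unbounded
// panels (e.g. 10 * 5 = 50 calls) before the change.
async function migrateDashboards(dashboards: Array<{ panels: string[] }>) {
  await runWithConcurrency(
    dashboards.map((d) => () =>
      runWithConcurrency(d.panels.map((p) => () => translatePanel(p)), 4)
    ),
    3
  );
}
```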
<details> <summary><h2>Solution and Experiments</h2></summary>

All changes apply to `Dashboard` migrations only.

I settled on a 3x4 concurrency setting: at most 3 Dashboards running concurrently, each with at most 4 panels running concurrently. I arrived at this configuration after the series of experiments below.

- All tests were done on `Elastic LLM`.
- 3 retries were enabled for all LLM nodes, so a rate-limit error can be retried. If those retries fail for a panel, the process is aborted; in the traces below, the Aborted errors are the result of repeated retries after the rate limit was hit.
- All tests were done on the same dashboard data set:
  - 7 dashboards
  - 4 fully or partially translatable
  - 2 erroring with `No panels found`
  - 1 (`Content Overview`) erroring with some unknown error in the `Index pattern` node. I am not sure of the cause; let's ignore this node for now.

| Dashboard Concurrency | Panels Concurrency | Error % | Token Usage | Time Taken | Langsmith Trace |
|---|---|---|---|---|---|
| 5 | 10 | 89% | ~95K | ~3m | https://ela.st/5x10 |
| 5 | 4 | 71% | ~148K | 3m 20s | https://ela.st/5x4 |
| 3 | 4 | 50% | 120K | ~3m | https://ela.st/3x4-concurrency |
| 1 | 10 | 40% | ~292K | ~6m | https://ela.st/1x10 |

> [!TIP]
> In the 3x4 and 1x10 runs, I did not see any instance of a rate-limit error.
> These error rates are only for comparison; the configurations will perform better with more dashboards that are bound to succeed, since our test dataset also included some dashboards that were failures.
See screenshots for all of the runs mentioned above:

<details>
<summary>5x10</summary>
<img width="2279" height="769" alt="image" src="https://github.com/user-attachments/assets/cb2a3664-8bf4-497c-b2b6-3da099fc7768" />
</details>
<details>
<summary>5x4</summary>
<img width="2281" height="727" alt="image" src="https://github.com/user-attachments/assets/af16de94-f787-4e46-9818-52e882b2190c" />
</details>
<details>
<summary>3x4</summary>
<img width="2288" height="804" alt="image" src="https://github.com/user-attachments/assets/be0dbf2a-003c-4770-ae67-0d74902ab7a5" />
</details>
<details>
<summary>1x10</summary>
<img width="2305" height="759" alt="image" src="https://github.com/user-attachments/assets/34b91294-973d-4506-b431-09cfbd44ecc9" />
</details>

## Final run on the deployed project

With the selected 3x4 configuration, I did a final run, and the results were much better. However, there were still `Rate limit` errors here and there. I think we can merge this in, watch the performance over time, and make some more tweaks to the `RetryPolicy`. There were 40 dashboards in total, and most of them had valid data.

| Dashboard Concurrency | Panels Concurrency | Error % | Token Usage | Time Taken | Langsmith Trace |
|---|---|---|---|---|---|
| 3 | 4 | 37% | 380K | ~21m | https://ela.st/3x4-project |

<img width="1107" height="606" alt="image" src="https://github.com/user-attachments/assets/6c476571-4cf8-43fe-8c74-5fab3a414be4" />

Results are available here: https://keepkibana-pr-236535-security-b9fbee.kb.eu-west-1.aws.qa.elastic.cloud/app/security/siem_migrations/dashboards/fe0ab206-2249-47b4-8bce-d55811efd214

Credentials can be found [here](https://p.elstc.co/paste/QAIlMYEH#-Zf/fUtX2xUUQ83uSLaNRf9QxJtBJVkSCuMhwracC+h)

</details>

## Testing Guidelines

### Things to test

1. First, run the given dashboards migration on `main` and note the following:
   - Rate-limit errors in each panel (observable from the Comments section). This is easier to do with the small data set given below.
     <img width="1216" height="269" alt="image" src="https://github.com/user-attachments/assets/bf4385c6-98ba-4d5b-9fae-dea7a76317aa" />
   - The performance of Kibana while the migration is running. This is easier to observe with the big dataset, also given below. Pay specific attention to:
     - The time server requests are taking.
     - The time taken during a hard refresh.
     - Navigation lags.
2. Next, repeat the same on this PR branch; the results should be much better.

### Data

- [7 Dashboards](https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link) (great for checking rate-limit errors)
- [40 Dashboards](https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link) (great for checking Kibana performance)

Both macros and lookups are also available in the same folder.
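The 3-retry behavior for rate-limited LLM nodes described in the experiments section could be sketched as below. This is an illustrative approximation, not the actual `RetryPolicy` implementation; `withRateLimitRetries`, the rate-limit detection regex, and the backoff delays are all assumptions.

```typescript
// Illustrative retry-with-backoff sketch (not the real Kibana RetryPolicy):
// retry a call up to `maxRetries` times when the failure looks like a
// rate-limit error, then give up and surface the error for that panel.
async function withRateLimitRetries<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      const isRateLimit = /rate limit|429/i.test(message); // assumed check
      if (!isRateLimit || attempt >= maxRetries) {
        // Retries exhausted (or a non-rate-limit error): abort this panel.
        // This corresponds to the "Aborted" errors seen in the traces.
        throw err;
      }
      // Exponential backoff before the next attempt.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```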
… attached page (feature flagged) (elastic#225750) (elastic#236872) This reverts commit f1eb210.
Contributor
Author
Closing this in favor of #237064
Contributor
💔 Build Failed
Failed CI Steps
cc @fkanout
This reverts commit 807b177.
Depends on #236872 (review)