
Revert "[Incident Management] Add page attachment modal (#231186)"#236869

Closed
fkanout wants to merge 7 commits into elastic:main from fkanout:revert-add-page-attachement-modal

Conversation

@fkanout
Contributor

@fkanout fkanout commented Sep 30, 2025

This reverts commit 807b177.

Depends on #236872 (review)

@fkanout fkanout self-assigned this Sep 30, 2025
@fkanout fkanout requested a review from a team as a code owner September 30, 2025 05:46
@fkanout fkanout added the release_note:skip Skip the PR/issue when compiling release notes label Sep 30, 2025
@fkanout fkanout requested a review from a team as a code owner September 30, 2025 05:46
@fkanout fkanout added backport:skip This PR does not require backporting Team:actionable-obs Formerly "obs-ux-management", responsible for SLO, o11y alerting, significant events, & synthetics. v9.2.0 labels Sep 30, 2025
@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@github-actions github-actions bot added the author:obs-ux-management PRs authored by the obs ux management team label Sep 30, 2025
@mgiota mgiota self-requested a review September 30, 2025 09:35
Contributor

@mgiota mgiota left a comment


I tested this on top of the related revert and verified that the modal that used to open from the comment textarea no longer appears.

Screen.Recording.2025-09-30.at.13.58.56.mov

fkanout and others added 5 commits October 1, 2025 06:56
…cy to avoid Rate limit errors. (elastic#236535)

## Problem

Resolves elastic/security-team#14004

> [!TIP]
> Enable the experimental feature below before using this feature:
> ```
> xpack.securitySolution.enableExperimental:
>   - automaticDashboardsMigration
> ```


This PR improves the CPU performance, token usage, and error rate of
Dashboard Migrations.

Before this PR, a dashboard migration run led to many panels erroring
out with a `Rate Limits error`, as can be seen in the screenshot below
from `main`; almost all panels are failing.

<details>
<summary>Rate Limit Error Screenshot </summary>

<img width="2271" height="1202" alt="Image"
src="https://github.com/user-attachments/assets/36949d7f-a029-4bfd-a834-4899a4b9a600"
/>

</details>


This would also choke up the Kibana Task Manager, as you might observe
when desk testing. It is hard to reproduce because the issue occurs
intermittently, depending on which part of the graph is running,
especially after 2-3 minutes when the graph is well underway.

## Explanation

The cause was too many Translation graphs running at once, with each
Translation graph triggering *all* of its Panel graphs in parallel. The
current limit on the Dashboard graph is 10, and there is no limit on
panels.

That means that with 10 Dashboard graphs running (our current
concurrency limit), each with 5 panels, there will be 50 tasks running
at the same time and making calls to the LLM. This chokes up the system
and triggers more LLM rate-limit errors.

> [!TIP]
> The objective of this PR was to tune the concurrency of Dashboard
migrations so as to reduce the above-mentioned issues.
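The nested limits described here can be sketched with a simple semaphore. This is an illustrative TypeScript sketch under assumed names (`Semaphore`, `translatePanel`, `migrateDashboard` are all hypothetical), not the actual Kibana implementation:

```typescript
// Illustrative sketch of nested concurrency limits (3 dashboards x 4 panels).
// Not the actual Kibana code; all names here are hypothetical.

class Semaphore {
  private readonly queue: Array<() => void> = [];
  private active = 0;
  constructor(private readonly limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active < this.limit) {
      this.active++;
    } else {
      // Wait for a finished task to hand its slot to us.
      await new Promise<void>((resolve) => this.queue.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = this.queue.shift();
      if (next) {
        next(); // transfer the slot directly to the next waiter
      } else {
        this.active--;
      }
    }
  }
}

const DASHBOARD_CONCURRENCY = 3;
const PANEL_CONCURRENCY = 4;
const dashboardLimit = new Semaphore(DASHBOARD_CONCURRENCY);

// Placeholder for the per-panel LLM translation call.
async function translatePanel(panel: string): Promise<string> {
  return `translated:${panel}`;
}

// At most 3 dashboards run at once; each caps its own panels at 4,
// so no more than 3 * 4 = 12 LLM calls are in flight at any time.
async function migrateDashboard(panels: string[]): Promise<string[]> {
  return dashboardLimit.run(async () => {
    const panelLimit = new Semaphore(PANEL_CONCURRENCY);
    return Promise.all(
      panels.map((p) => panelLimit.run(() => translatePanel(p)))
    );
  });
}
```

With this shape, the worst case drops from 10 × 5 = 50 concurrent LLM calls to 3 × 4 = 12.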

## TLDR;

After testing multiple concurrency configurations, I think we can go
with a 3x4 config (3 dashboards, each with 4 panels, concurrently). The
section below details those experiments and the corresponding
justification.

Feel free to skip the section and start testing instead.

<details>

<summary><h2> Solution and Experiments </h2></summary>

All changes apply to `Dashboard` migrations only.

I settled on a 3x4 concurrency setting, meaning at most 3 dashboards,
each with at most 4 panels, running concurrently. I arrived at this
configuration after the series of experiments below.

- All tests were done on `Elastic LLM`.
- 3 retries have been enabled for all LLM nodes, so if a rate-limit
error occurs, the call is retried. If those retries fail for a panel,
the process is aborted; the `Aborted` errors in the traces below are
the result of multiple retries after the rate limit was hit.
- All tests were done on the same dashboard data set, as explained below:
   - 7 dashboards
   - 4 fully/partially translatable
   - 2 have errors (no panels found)
   - 1 (`Content Overview`) errors with an unknown error in the `Index
pattern` node. Not sure of the cause; let's ignore it for this node.
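The retry behaviour described above (up to 3 retries on rate-limit errors, aborting the panel when they are exhausted) could look roughly like the following. This is a hypothetical TypeScript sketch with illustrative names, not the actual `RetryPolicy` implementation:

```typescript
// Hypothetical sketch: retry an LLM call up to 3 times on rate-limit
// errors, with exponential backoff; abort the panel once retries run out.
// Names and signatures are illustrative, not the real Kibana API.

const MAX_RETRIES = 3;

async function withRetries<T>(
  invokeLlm: () => Promise<T>,
  isRateLimitError: (err: unknown) => boolean,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await invokeLlm();
    } catch (err) {
      if (!isRateLimitError(err) || attempt >= MAX_RETRIES) {
        // Non-retryable error, or retries exhausted: abort this panel.
        throw err;
      }
      // Exponential backoff before the next attempt.
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** attempt)
      );
    }
  }
}
```

A call that hits the rate limit twice and then succeeds completes normally; one that fails all 3 retries surfaces the error, which is what shows up as `Aborted` in the traces.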


| Dashboard Concurrency | Panels Concurrency | Error % | Token Usage | Time Taken | Langsmith Trace |
|---|---|---|---|---|---|
|5|10|89%|~95K|~3m|https://ela.st/5x10|
|5|4|71%|~148K|3m 20s|https://ela.st/5x4|
|3|4|50%|120K|~3m|https://ela.st/3x4-concurrency|
|1|10|40%|~292K|~6m|https://ela.st/1x10|

> [!TIP]
> In the 3x4 and 1x10 runs, I did not see any instance of a rate-limit error.
> These error rates are only for comparison; they will improve with more
dashboards that are bound to succeed. Our test dataset also included
some dashboards that were failures.

See the screenshots for all of the runs mentioned above.

<details>
<summary>5x10</summary>
<img width="2279" height="769" alt="image"
src="https://github.com/user-attachments/assets/cb2a3664-8bf4-497c-b2b6-3da099fc7768"
/>

</details>



<details>
<summary>5x4</summary>
<img width="2281" height="727" alt="image"
src="https://github.com/user-attachments/assets/af16de94-f787-4e46-9818-52e882b2190c"
/>

</details>




<details>
<summary>3x4</summary>
<img width="2288" height="804" alt="image"
src="https://github.com/user-attachments/assets/be0dbf2a-003c-4770-ae67-0d74902ab7a5"
/>

</details>

<details>
<summary>1x10</summary>
<img width="2305" height="759" alt="image"
src="https://github.com/user-attachments/assets/34b91294-973d-4506-b431-09cfbd44ecc9"
/>

</details>


## Final run on the deployed project

With the selected 3x4 configuration, I did a final run and the results
were much better. However, there were still `Rate limit` errors here and
there. I think we can merge this in, watch the performance over time,
and make some more tweaks to the `RetryPolicy`.

There were 40 dashboards in total, and most of them had valid data.

| Dashboard Concurrency | Panels Concurrency | Error % | Token Usage | Time Taken | Langsmith Trace |
|---|---|---|---|---|---|
|3|4|37%|380K|~21m|https://ela.st/3x4-project|



<img width="1107" height="606" alt="image"
src="https://github.com/user-attachments/assets/6c476571-4cf8-43fe-8c74-5fab3a414be4"
/>

Results are available here :
https://keepkibana-pr-236535-security-b9fbee.kb.eu-west-1.aws.qa.elastic.cloud/app/security/siem_migrations/dashboards/fe0ab206-2249-47b4-8bce-d55811efd214

Credentials can be found
[here](https://p.elstc.co/paste/QAIlMYEH#-Zf/fUtX2xUUQ83uSLaNRf9QxJtBJVkSCuMhwracC+h)

</details>

## Testing Guidelines

### Things to test

1. First, run the given dashboards migration on `main` and note the following:
- Rate-limit errors in each panel (can be observed from the Comments
section). This is easier to do with the small data set given below.
- <img width="1216" height="269" alt="image"
src="https://github.com/user-attachments/assets/bf4385c6-98ba-4d5b-9fae-dea7a76317aa"
/>


- Performance of Kibana while the migration is running. Pay specific
attention to the following (this is easier with the big dataset, also
given below):
     - Time server requests are taking
     - Time taken during a hard refresh
     - Navigation lags

2. Next, repeat the same steps on this PR branch; the results should be
much better.

### Data

- [7
Dashboards](https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link)
(great for checking the rate-limit errors)
- [40
Dashboards](https://drive.google.com/drive/folders/1D3BibV4AnBmIs7En5WPFSuEbIkNucG49?usp=drive_link)
(great for checking Kibana performance)

Both macros and lookups are available in the same folder.
@fkanout fkanout requested a review from a team as a code owner October 1, 2025 07:17
@fkanout
Contributor Author

fkanout commented Oct 1, 2025

closing this in favor of #237064

@fkanout fkanout closed this Oct 1, 2025
@elasticmachine
Contributor

elasticmachine commented Oct 1, 2025

💔 Build Failed

Failed CI Steps

History

cc @fkanout

