[FTR] Improve running FTR tests by avoiding rerunning a config when agent is lost #237705
Merged
delanni
reviewed
Oct 8, 2025
delanni
reviewed
Oct 8, 2025
delanni
approved these changes
Oct 10, 2025
Contributor
delanni
left a comment
I'm happy with this as is; the warning removal is a nice addition.
Contributor
💛 Build succeeded, but was flaky
delanni
added a commit
to delanni/kibana
that referenced
this pull request
Oct 13, 2025
…g when agent is lost (elastic#237705)" This reverts commit 4618cdf.
Contributor
Reverted in #238621
delanni
added a commit
to delanni/kibana
that referenced
this pull request
Oct 13, 2025
…ig when agent is lost (elastic#237705)" (elastic#238621) This reverts commit 8a39f08.
delanni
added a commit
that referenced
this pull request
Oct 13, 2025
…ent is lost (reapplied) (#238631) ## Summary Reapply @maryam-saeidi's #237705 + Adds a fix for when the scout reporter is not enabled
baileycash-elastic
pushed a commit
to baileycash-elastic/kibana
that referenced
this pull request
Oct 14, 2025
…g when agent is lost (elastic#237705)" (elastic#238621) ## Summary This reverts commit 4618cdf (of elastic#237705) We've seen issues with the change on pipelines outside the PR / on-merge jobs, without obvious solution to why they were failing. We're rolling this back while we investigate
baileycash-elastic
pushed a commit
to baileycash-elastic/kibana
that referenced
this pull request
Oct 14, 2025
…ent is lost (reapplied) (elastic#238631) ## Summary Reapply @maryam-saeidi's elastic#237705 + Adds a fix for when the scout reporter is not enabled
mgadewoll
pushed a commit
to tkajtoch/kibana
that referenced
this pull request
Oct 17, 2025
…ent is lost (reapplied) (elastic#238631) ## Summary Reapply @maryam-saeidi's elastic#237705 + Adds a fix for when the scout reporter is not enabled
rylnd
pushed a commit
to rylnd/kibana
that referenced
this pull request
Oct 17, 2025
…gent is lost (elastic#237705)

## Summary

Since we use spot agents for our FTR tests, there are cases where some configs in a group have already run when the agent is lost, and the retry then runs every config in that group again. In this PR, we use Buildkite metadata to keep track of configs that have already been executed: if the agent is lost, we first check whether metadata exists for a config and, if it does, we skip running that config.

For this logic to work, we also need to save the Scout events for each config right after that config runs, instead of at the end of each config group, so that the related execution stats and events are kept if the agent is lost.

#### Expected improvement

|Build|Before|After|Improvement|
|---|---|---|---|
|[348415](https://buildkite.com/elastic/kibana-pull-request/builds/348415)|2h 25m (estimate)|1h 37m|saves 48 minutes (~33% faster)|
|[348229](https://buildkite.com/elastic/kibana-pull-request/builds/348229)|1h 39m|1h 10m (estimate)|saves 29 minutes (~30% faster)|
|[348223](https://buildkite.com/elastic/kibana-pull-request/builds/348223/waterfall)|2h 3m|1h 17m (estimate)|saves 46 minutes (~37% faster)|

In the last [example](https://buildkite.com/elastic/kibana-pull-request/builds/348223/waterfall), `FTR Configs #2` takes almost double the time because the agent is lost when executing the last config.

<img width="2588" height="456" alt="image" src="https://github.com/user-attachments/assets/992ffc6b-4412-47f9-9dd2-ecd5ff607358" />

Here is a video that illustrates the issue for this [build](https://buildkite.com/elastic/kibana-pull-request/builds/348229): https://github.com/user-attachments/assets/5f499f78-5841-40e7-8582-e761b885ed41

### 🧪 How to test

What I did was run a small portion of the tests in this [build](https://buildkite.com/elastic/kibana-pull-request/builds/348316), wait for one config to finish and report its stats, then cancel the build and retry it to see if the new build would skip the completed config as expected.

In this [build](https://buildkite.com/elastic/kibana-pull-request/builds/348415), it also improved `FTR Config 6`, although the previous failure was "Exited with status 10", not agent loss.

- **Before**: 2h 25m (estimate)
- **After**: 1h 37m
- **Time saved**: 48 minutes (~33% faster)

<img width="2934" height="146" alt="image" src="https://github.com/user-attachments/assets/88b7ad5a-46b1-42ad-9321-f33a81d89ee6" />
rylnd
pushed a commit
to rylnd/kibana
that referenced
this pull request
Oct 17, 2025
…g when agent is lost (elastic#237705)" (elastic#238621) ## Summary This reverts commit 67fa3e5 (of elastic#237705) We've seen issues with the change on pipelines outside the PR / on-merge jobs, without obvious solution to why they were failing. We're rolling this back while we investigate
rylnd
pushed a commit
to rylnd/kibana
that referenced
this pull request
Oct 17, 2025
…ent is lost (reapplied) (elastic#238631) ## Summary Reapply @maryam-saeidi's elastic#237705 + Adds a fix for when the scout reporter is not enabled
nickpeihl
pushed a commit
to nickpeihl/kibana
that referenced
this pull request
Oct 23, 2025
…gent is lost (elastic#237705)
nickpeihl
pushed a commit
to nickpeihl/kibana
that referenced
this pull request
Oct 23, 2025
…g when agent is lost (elastic#237705)" (elastic#238621) ## Summary This reverts commit 4618cdf (of elastic#237705) We've seen issues with the change on pipelines outside the PR / on-merge jobs, without obvious solution to why they were failing. We're rolling this back while we investigate
nickpeihl
pushed a commit
to nickpeihl/kibana
that referenced
this pull request
Oct 23, 2025
…ent is lost (reapplied) (elastic#238631) ## Summary Reapply @maryam-saeidi's elastic#237705 + Adds a fix for when the scout reporter is not enabled
maryam-saeidi
added a commit
that referenced
this pull request
Oct 27, 2025
…ig split (#240461)

### Summary

Keeping the FTR configs small for shorter retry times, and balancing and distributing test configs on BK workers more effectively. Also, smaller configs have a higher chance of not being retried in case of an agent lost ([PR](#237705)).

This PR splits the `x-pack/platform/test/functional_with_es_ssl/apps/triggers_actions_ui/config.ts` config into 2:

- `x-pack/platform/test/functional_with_es_ssl/apps/triggers_actions_ui/config.ts`
- `x-pack/platform/test/functional_with_es_ssl/apps/triggers_actions_ui/config.rules.ts`

During CI build, some FTR test configs take a long time, as you can see in the following warning message ([build](https://buildkite.com/elastic/kibana-on-merge/builds/80437)):

<img width="2998" height="290" alt="image" src="https://github.com/user-attachments/assets/1e26ba14-6d02-48af-8d8f-489d310430d7" />

#### Before

<img width="3002" height="128" alt="image" src="https://github.com/user-attachments/assets/0e8676fb-57ef-4923-9dc9-5217184da555" />

#### After

<img width="2998" height="256" alt="image" src="https://github.com/user-attachments/assets/b858796c-cd30-4356-84a6-62584af4ddcd" />
NicholasPeretti
pushed a commit
to NicholasPeretti/kibana
that referenced
this pull request
Oct 27, 2025
…gent is lost (elastic#237705)
NicholasPeretti
pushed a commit
to NicholasPeretti/kibana
that referenced
this pull request
Oct 27, 2025
…g when agent is lost (elastic#237705)" (elastic#238621) ## Summary This reverts commit 4618cdf (of elastic#237705) We've seen issues with the change on pipelines outside the PR / on-merge jobs, without obvious solution to why they were failing. We're rolling this back while we investigate
NicholasPeretti
pushed a commit
to NicholasPeretti/kibana
that referenced
this pull request
Oct 27, 2025
…ent is lost (reapplied) (elastic#238631) ## Summary Reapply @maryam-saeidi's elastic#237705 + Adds a fix for when the scout reporter is not enabled
Labels
author:obs-ux-management
PRs authored by the obs ux management team
backport:skip
This PR does not require backporting
release_note:skip
Skip the PR/issue when compiling release notes
reverted
v9.3.0
Summary
Since we use spot agents for our FTR tests, there are cases where some configs in a group have already run when the agent is lost, and the retry then runs every config in that group again. In this PR, we use Buildkite metadata to keep track of configs that have already been executed: if the agent is lost, we first check whether metadata exists for a config and, if it does, we skip running that config.
For this logic to work, we also need to save the Scout events for each config right after that config runs, instead of at the end of each config group, so that the related execution stats and events are kept if the agent is lost.
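The skip-on-retry idea described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `MetadataStore`, `InMemoryMetadataStore`, `doneKey`, `runConfigGroup`, and the key format are all hypothetical names, and an in-memory map stands in for Buildkite build metadata (which in a real pipeline would presumably be read and written through the `buildkite-agent meta-data` commands). The sketch also assumes a config is marked as done only after it runs to completion and its events are saved.

```typescript
// Hypothetical key/value store standing in for Buildkite build metadata.
interface MetadataStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

// In-memory stand-in so the sketch runs without a Buildkite agent.
class InMemoryMetadataStore implements MetadataStore {
  private readonly data = new Map<string, string>();
  get(key: string): string | undefined {
    return this.data.get(key);
  }
  set(key: string, value: string): void {
    this.data.set(key, value);
  }
}

// Hypothetical key format; the real key naming is not shown in the PR text.
const doneKey = (configPath: string): string => `ftr_config_done:${configPath}`;

interface GroupResult {
  executed: string[];
  skipped: string[];
}

function runConfigGroup(
  configs: string[],
  store: MetadataStore,
  runConfig: (config: string) => void,
  saveEvents: (config: string) => void
): GroupResult {
  const result: GroupResult = { executed: [], skipped: [] };
  for (const config of configs) {
    // A previous attempt already completed this config: skip it.
    if (store.get(doneKey(config)) !== undefined) {
      result.skipped.push(config);
      continue;
    }
    runConfig(config); // throws here if the agent is lost mid-run
    saveEvents(config); // persist events per config, not per group
    store.set(doneKey(config), 'true');
    result.executed.push(config);
  }
  return result;
}

// Simulated agent loss while running the last config, followed by a retry.
const store = new InMemoryMetadataStore();
const configs = ['a.config.ts', 'b.config.ts', 'c.config.ts'];
const savedEvents: string[] = [];
let firstAttemptError = '';

try {
  runConfigGroup(
    configs,
    store,
    (config) => {
      if (config === 'c.config.ts') throw new Error('agent lost');
    },
    (config) => savedEvents.push(config)
  );
} catch (e) {
  firstAttemptError = (e as Error).message;
}

const retry = runConfigGroup(configs, store, () => {}, (config) => savedEvents.push(config));
console.log(firstAttemptError, retry, savedEvents);
```

Saving events inside the loop is what makes the skip safe in this sketch: by the time a config is recorded as done, its events have already been persisted, so a retry that skips it loses nothing.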
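Expected improvement
|Build|Before|After|Improvement|
|---|---|---|---|
|[348415](https://buildkite.com/elastic/kibana-pull-request/builds/348415)|2h 25m (estimate)|1h 37m|saves 48 minutes (~33% faster)|
|[348229](https://buildkite.com/elastic/kibana-pull-request/builds/348229)|1h 39m|1h 10m (estimate)|saves 29 minutes (~30% faster)|
|[348223](https://buildkite.com/elastic/kibana-pull-request/builds/348223/waterfall)|2h 3m|1h 17m (estimate)|saves 46 minutes (~37% faster)|
In the last [example](https://buildkite.com/elastic/kibana-pull-request/builds/348223/waterfall), `FTR Configs #2` takes almost double the time because the agent is lost when executing the last config.
Here is a video that illustrates the issue for this [build](https://buildkite.com/elastic/kibana-pull-request/builds/348229):
Screen.Recording.2025-10-07.at.16.59.27.mov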
🧪 How to test
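What I did was run a small portion of the tests in this [build](https://buildkite.com/elastic/kibana-pull-request/builds/348316), wait for one config to finish and report its stats, then cancel the build and retry it to see if the new build would skip the completed config as expected.
In this [build](https://buildkite.com/elastic/kibana-pull-request/builds/348415), it also improved `FTR Config 6`, although the previous failure was "Exited with status 10", not agent loss.
- Before: 2h 25m (estimate)
- After: 1h 37m
- Time saved: 48 minutes (~33% faster)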
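The time savings quoted in this PR's description (for example, 2h 25m before and 1h 37m after for build 348415) can be rechecked with a small calculation. The helper names `toMinutes` and `savings` below are made up for this illustration, and the percentage is taken relative to the "before" duration, which is an assumption about how the quoted figures were derived. Several of the inputs are themselves estimates, so the results are approximate.

```typescript
// Parse a duration such as "2h 25m", "45m", or "2h" into minutes.
function toMinutes(duration: string): number {
  const match = /^(?:(\d+)h)?\s*(?:(\d+)m)?$/.exec(duration.trim());
  if (!match || (match[1] === undefined && match[2] === undefined)) {
    throw new Error(`Unrecognized duration: ${duration}`);
  }
  return Number(match[1] ?? 0) * 60 + Number(match[2] ?? 0);
}

// Minutes saved and the saving as a percentage of the "before" duration.
function savings(before: string, after: string): { minutes: number; percent: number } {
  const b = toMinutes(before);
  const a = toMinutes(after);
  return { minutes: b - a, percent: ((b - a) / b) * 100 };
}

// Before/after pairs as quoted in the PR description.
const build348415 = savings('2h 25m', '1h 37m');
const build348229 = savings('1h 39m', '1h 10m');
const build348223 = savings('2h 3m', '1h 17m');

for (const [build, s] of Object.entries({ build348415, build348229, build348223 })) {
  console.log(`${build}: saves ${s.minutes} minutes (${s.percent.toFixed(1)}%)`);
}
```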