@maryam-saeidi maryam-saeidi commented Oct 6, 2025

Summary

We run our FTR tests on spot agents, so an agent can be lost after some of the configs in a group have already run. When that happens, the retry runs every config in that group again. In this PR, we use Buildkite metadata to keep track of the configs that have already been executed: if the agent is lost, we first check whether metadata exists for a config and, if it does, we skip running that config.
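
As a rough illustration of the skip logic (not the actual code in this PR), here is a minimal sketch. The `MetadataStore` interface, the `InMemoryMetadataStore` class, the `ftr_config_done:` key prefix, and `runConfigGroup` are all hypothetical names; in CI, such a store would presumably wrap the `buildkite-agent meta-data exists` and `buildkite-agent meta-data set` commands.

```typescript
// Hypothetical sketch: none of these names are taken from the PR.

/** Minimal key/value store standing in for Buildkite build meta-data. */
interface MetadataStore {
  has(key: string): boolean;
  set(key: string, value: string): void;
}

/** In-memory stand-in so the sketch can run without a Buildkite agent. */
class InMemoryMetadataStore implements MetadataStore {
  private readonly data = new Map<string, string>();

  has(key: string): boolean {
    return this.data.has(key);
  }

  set(key: string, value: string): void {
    this.data.set(key, value);
  }
}

/** Assumed key format: one metadata entry per FTR config path. */
function metadataKey(configPath: string): string {
  return `ftr_config_done:${configPath}`;
}

/**
 * Runs every config in the group unless the store already marks it as
 * executed. Returns the configs that were actually run in this attempt.
 */
function runConfigGroup(
  configs: string[],
  store: MetadataStore,
  runConfig: (configPath: string) => void
): string[] {
  const executed: string[] = [];
  for (const configPath of configs) {
    const key = metadataKey(configPath);
    if (store.has(key)) {
      continue; // executed in an earlier attempt, before the agent was lost
    }
    runConfig(configPath);
    // Record completion right away so a later agent loss does not undo it.
    // Assumption: a config that throws is not recorded and runs again on retry.
    store.set(key, 'true');
    executed.push(configPath);
  }
  return executed;
}
```

The point of the sketch is that completion is recorded per config rather than per group, which is what lets a retry resume from the first config that has no metadata entry.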

For this logic to work, we also need to save the Scout events for each config right after that config runs, instead of at the end of each config group. This ensures that if the agent is lost, we keep the execution stats and events of the configs that already finished.
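
To make the difference between flushing per config and flushing per group concrete, here is a small sketch. The `ConfigEvent` shape, the `persist` callback, and `runGroupWithPerConfigFlush` are assumptions for illustration only and do not reflect the real Scout reporter API.

```typescript
// Hypothetical sketch: the event shape and persist callback are assumptions.

interface ConfigEvent {
  configPath: string;
  name: string;
}

type Persist = (events: ConfigEvent[]) => void;

/**
 * Runs each config and persists its events immediately afterwards,
 * instead of buffering everything until the whole group has finished.
 */
function runGroupWithPerConfigFlush(
  configs: string[],
  runConfig: (configPath: string) => ConfigEvent[],
  persist: Persist
): void {
  for (const configPath of configs) {
    const events = runConfig(configPath);
    // Flush now: if the agent is lost during a later config,
    // the events collected so far have already been saved.
    persist(events);
  }
}
```

With a single flush at the end of the group, an agent loss during the last config would discard the events of every earlier config as well, which would leave the skipped configs without stats on retry.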

Expected improvement

| Build | Before | After | Improvement |
|---|---|---|---|
| 348415 | 2h 25m (estimate) | 1h 37m | saves 48 minutes (~33% faster) |
| 348229 | 1h 39m | 1h 10m (estimate) | saves 29 minutes (~30% faster) |
| 348223 | 2h 3m | 1h 17m (estimate) | saves 46 minutes (~37% faster) |

In the last example, FTR Configs #2 takes almost double the time because the agent is lost when executing the last config.
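
For anyone who wants to double-check the savings reported for the three builds, the arithmetic can be reproduced with a short helper. `parseMinutes` and `savings` are hypothetical names written for this check, not part of the PR.

```typescript
/** Parses durations such as "2h 25m", "1h 37m" or "45m" into minutes. */
function parseMinutes(duration: string): number {
  const match = /^(?:(\d+)h)?\s*(?:(\d+)m)?$/.exec(duration.trim());
  if (match === null || (match[1] === undefined && match[2] === undefined)) {
    throw new Error(`Unrecognized duration: ${duration}`);
  }
  const hours = match[1] !== undefined ? Number(match[1]) : 0;
  const minutes = match[2] !== undefined ? Number(match[2]) : 0;
  return hours * 60 + minutes;
}

/** Minutes saved and the share of the original build time that was saved. */
function savings(
  before: string,
  after: string
): { savedMinutes: number; percentSaved: number } {
  const beforeMinutes = parseMinutes(before);
  const afterMinutes = parseMinutes(after);
  const savedMinutes = beforeMinutes - afterMinutes;
  return { savedMinutes, percentSaved: (savedMinutes / beforeMinutes) * 100 };
}

// Build 348415: 145 min -> 97 min, 48 minutes saved, about 33.1%
console.log(savings('2h 25m', '1h 37m'));
// Build 348229: 99 min -> 70 min, 29 minutes saved, about 29.3%
console.log(savings('1h 39m', '1h 10m'));
// Build 348223: 123 min -> 77 min, 46 minutes saved, about 37.4%
console.log(savings('2h 3m', '1h 17m'));
```

The saved minutes match the reported values exactly; the second build works out to roughly 29%, which is reported as ~30%.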


Here is a video that illustrates the issue for this build:

Screen.Recording.2025-10-07.at.16.59.27.mov

🧪 How to test

I ran a small portion of the tests in this build, waited for one config to finish and report its stats, then canceled the build and retried it to check that the new build skipped the completed config as expected.

In this build, the change also improved FTR Config 6, although the previous failure was "Exited with status 10" rather than an agent loss.

  • Before: 2h 25m (estimate)
  • After: 1h 37m
  • Time saved: 48 minutes (~33% faster)

@maryam-saeidi maryam-saeidi self-assigned this Oct 6, 2025
@maryam-saeidi maryam-saeidi added the release_note:skip and backport:skip labels Oct 6, 2025
@github-actions github-actions bot added the author:obs-ux-management label Oct 6, 2025
@maryam-saeidi maryam-saeidi marked this pull request as ready for review October 8, 2025 08:24
@maryam-saeidi maryam-saeidi requested review from a team as code owners October 8, 2025 08:24
@maryam-saeidi maryam-saeidi marked this pull request as draft October 9, 2025 11:21
@maryam-saeidi maryam-saeidi marked this pull request as ready for review October 9, 2025 13:35

@delanni delanni left a comment

I'm happy with this as is, the warning removal is a nice addition

@maryam-saeidi maryam-saeidi enabled auto-merge (squash) October 10, 2025 19:16
@maryam-saeidi maryam-saeidi merged commit 4618cdf into elastic:main Oct 10, 2025
13 checks passed
@elasticmachine

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

cc @maryam-saeidi

@maryam-saeidi maryam-saeidi deleted the improve-ftr-retry branch October 10, 2025 20:46
delanni added a commit to delanni/kibana that referenced this pull request Oct 13, 2025
delanni added a commit that referenced this pull request Oct 13, 2025
…g when agent is lost (#237705)" (#238621)

## Summary
This reverts commit 4618cdf (of
#237705)

We've seen issues with the change on pipelines outside the PR / on-merge
jobs, with no obvious explanation for why they were failing. We're
rolling this back while we investigate.
@delanni

delanni commented Oct 13, 2025

Reverted in #238621

delanni added a commit to delanni/kibana that referenced this pull request Oct 13, 2025
delanni added a commit that referenced this pull request Oct 13, 2025
…ent is lost (reapplied) (#238631)

## Summary
Reapply @maryam-saeidi's #237705 

+ Adds a fix for when the scout reporter is not enabled
baileycash-elastic pushed a commit to baileycash-elastic/kibana that referenced this pull request Oct 14, 2025
…g when agent is lost (elastic#237705)" (elastic#238621)
baileycash-elastic pushed a commit to baileycash-elastic/kibana that referenced this pull request Oct 14, 2025
…ent is lost (reapplied) (elastic#238631)
mgadewoll pushed a commit to tkajtoch/kibana that referenced this pull request Oct 17, 2025
…ent is lost (reapplied) (elastic#238631)
rylnd pushed a commit to rylnd/kibana that referenced this pull request Oct 17, 2025
…gent is lost (elastic#237705)

## Summary

We run our FTR tests on spot agents, so an agent can be lost after some
of the configs in a group have already run. When that happens, the retry
runs every config in that group again. In this PR, we use Buildkite
metadata to keep track of the configs that have already been executed:
if the agent is lost, we first check whether metadata exists for a
config and, if it does, we skip running that config.

For this logic to work, we also need to save the Scout events for each
config right after that config runs, instead of at the end of each
config group. This ensures that if the agent is lost, we keep the
execution stats and events of the configs that already finished.

#### Expected improvement

|Build|Before|After|Improvement|
|---|---|---|---|
|[348415](https://buildkite.com/elastic/kibana-pull-request/builds/348415)|2h 25m (estimate)|1h 37m|saves 48 minutes (~33% faster)|
|[348229](https://buildkite.com/elastic/kibana-pull-request/builds/348229)|1h 39m|1h 10m (estimate)|saves 29 minutes (~30% faster)|
|[348223](https://buildkite.com/elastic/kibana-pull-request/builds/348223/waterfall)|2h 3m|1h 17m (estimate)|saves 46 minutes (~37% faster)|

In the last
[example](https://buildkite.com/elastic/kibana-pull-request/builds/348223/waterfall),
`FTR Configs #2` takes almost double the time because the agent is lost
when executing the last config.

<img width="2588" height="456" alt="image"
src="https://github.com/user-attachments/assets/992ffc6b-4412-47f9-9dd2-ecd5ff607358"
/>

Here is a video that illustrates the issue for this
[build](https://buildkite.com/elastic/kibana-pull-request/builds/348229):


https://github.com/user-attachments/assets/5f499f78-5841-40e7-8582-e761b885ed41



### 🧪 How to test

I ran a small portion of the tests in this
[build](https://buildkite.com/elastic/kibana-pull-request/builds/348316),
waited for one config to finish and report its stats, then canceled the
build and retried it to check that the new build skipped the completed
config as expected.

In this
[build](https://buildkite.com/elastic/kibana-pull-request/builds/348415),
the change also improved `FTR Config 6`, although the previous failure
was "Exited with status 10" rather than an agent loss.

- **Before**: 2h 25m (estimate)
- **After**: 1h 37m
- **Time saved**: 48 minutes (~33% faster)

<img width="2934" height="146" alt="image"
src="https://github.com/user-attachments/assets/88b7ad5a-46b1-42ad-9321-f33a81d89ee6"
/>
rylnd pushed a commit to rylnd/kibana that referenced this pull request Oct 17, 2025
…g when agent is lost (elastic#237705)" (elastic#238621)

## Summary
This reverts commit 67fa3e5 (of
elastic#237705)

We've seen issues with the change on pipelines outside the PR / on-merge
jobs, without obvious solution to why they were failing. We're rolling
this back while we investigate
rylnd pushed a commit to rylnd/kibana that referenced this pull request Oct 17, 2025
…ent is lost (reapplied) (elastic#238631)
nickpeihl pushed a commit to nickpeihl/kibana that referenced this pull request Oct 23, 2025
…gent is lost (elastic#237705)
nickpeihl pushed a commit to nickpeihl/kibana that referenced this pull request Oct 23, 2025
…g when agent is lost (elastic#237705)" (elastic#238621)
nickpeihl pushed a commit to nickpeihl/kibana that referenced this pull request Oct 23, 2025
…ent is lost (reapplied) (elastic#238631)
maryam-saeidi added a commit that referenced this pull request Oct 27, 2025
…ig split (#240461)

### Summary

This keeps the FTR configs small for shorter retry times and helps
balance and distribute test configs across BK workers more effectively.
Smaller configs also have a higher chance of not being retried when an
agent is lost ([PR](#237705)).

This PR splits the
x-pack/platform/test/functional_with_es_ssl/apps/triggers_actions_ui/config.ts
config into two:

- x-pack/platform/test/functional_with_es_ssl/apps/triggers_actions_ui/config.ts
- x-pack/platform/test/functional_with_es_ssl/apps/triggers_actions_ui/config.rules.ts

During the CI build, some FTR test configs take a long time, as you can
see in the following warning message
([build](https://buildkite.com/elastic/kibana-on-merge/builds/80437)):

<img width="2998" height="290" alt="image"
src="https://github.com/user-attachments/assets/1e26ba14-6d02-48af-8d8f-489d310430d7"
/>

#### Before

<img width="3002" height="128" alt="image"
src="https://github.com/user-attachments/assets/0e8676fb-57ef-4923-9dc9-5217184da555"
/>

#### After

<img width="2998" height="256" alt="image"
src="https://github.com/user-attachments/assets/b858796c-cd30-4356-84a6-62584af4ddcd"
/>
NicholasPeretti pushed a commit to NicholasPeretti/kibana that referenced this pull request Oct 27, 2025
…gent is lost (elastic#237705)
NicholasPeretti pushed a commit to NicholasPeretti/kibana that referenced this pull request Oct 27, 2025
…g when agent is lost (elastic#237705)" (elastic#238621)
NicholasPeretti pushed a commit to NicholasPeretti/kibana that referenced this pull request Oct 27, 2025
…ent is lost (reapplied) (elastic#238631)
Labels

author:obs-ux-management, backport:skip, release_note:skip, reverted, v9.3.0
