
fix: race condition for pre-existing StackConfigPolicy #8928

Merged
pkoutsovasilis merged 13 commits into elastic:main from pkoutsovasilis:fix/scp_preexisting_apply_immediately
Nov 28, 2025

Conversation

@pkoutsovasilis
Contributor

@pkoutsovasilis pkoutsovasilis commented Nov 24, 2025

Summary

Reduces the time window where Elasticsearch clusters start with empty file-settings when a pre-existing
StackConfigPolicy should be applied, by coordinating reconciliation between ES and SCP controllers.

Problem

Prior to this PR:

  1. User creates a StackConfigPolicy that targets ES clusters
  2. User creates an Elasticsearch cluster matching the SCP resource selector
  3. ES controller reconciles first and creates an empty file-settings secret
  4. ES cluster starts and runs with empty file-settings
  5. SCP controller eventually reconciles and updates the file-settings secret with policy configurations
  6. ES cluster must reload/restart to apply the policy settings

This creates a non-negligible time window where the ES cluster operates without the intended policy configurations,
even though the policy existed before the cluster was created.

Solution

The Elasticsearch controller now coordinates with the StackConfigPolicy controller to ensure policy settings are
applied from the start when pre-existing policies target a new ES cluster.

New Reconciliation Flow

When enterprise features are disabled:

  • ES controller creates an empty file-settings secret immediately (no change)

When enterprise features are enabled:

  1. ES controller checks for targeting policies:

    • Lists all StackConfigPolicies in the cluster
    • Checks if any policy targets this ES cluster
  2. If a pre-existing policy targets the cluster:

    • ES controller defers file-settings secret creation to the SCP controller
    • Returns a requeue result so that reconciliation checks again later
    • Result: SCP controller creates the secret with policy settings applied from the start
  3. If no policy targets the cluster:

    • ES controller creates an empty file-settings secret as before
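The decision described in this flow can be sketched as follows. This is a minimal illustration, not the operator's actual code: the `Policy`, `Cluster`, and `Decision` types, the `targets` and `decide` functions, and the `requeueDelay` value are all hypothetical names and values chosen for the example. Policy targeting is reduced to plain label matching, and the case where the SCP controller has already created the secret is left out.

```go
package main

import (
	"fmt"
	"time"
)

// Policy is a simplified, hypothetical stand-in for a StackConfigPolicy.
// It targets every cluster whose labels contain all selector entries.
type Policy struct {
	Name     string
	Selector map[string]string
}

// Cluster is a simplified, hypothetical stand-in for an Elasticsearch resource.
type Cluster struct {
	Name   string
	Labels map[string]string
}

// Decision tells the ES controller what to do with the file-settings secret.
type Decision struct {
	CreateEmptySecret bool
	RequeueAfter      time.Duration
}

// requeueDelay is an assumed value; the PR description does not state one.
const requeueDelay = 10 * time.Second

// targets reports whether every selector entry is present on the cluster.
func targets(p Policy, c Cluster) bool {
	for key, want := range p.Selector {
		if got, ok := c.Labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// decide mirrors the three branches of the flow above.
func decide(enterprise bool, policies []Policy, c Cluster) Decision {
	if !enterprise {
		// Enterprise features disabled: behaviour is unchanged.
		return Decision{CreateEmptySecret: true}
	}
	for _, p := range policies {
		if targets(p, c) {
			// Defer secret creation to the SCP controller and check again later.
			return Decision{RequeueAfter: requeueDelay}
		}
	}
	// No policy targets the cluster: create the empty secret as before.
	return Decision{CreateEmptySecret: true}
}

func main() {
	cluster := Cluster{Name: "es-sample", Labels: map[string]string{"env": "prod"}}
	policies := []Policy{{Name: "scp-sample", Selector: map[string]string{"env": "prod"}}}

	fmt.Printf("enterprise, targeted:   %+v\n", decide(true, policies, cluster))
	fmt.Printf("enterprise, untargeted: %+v\n", decide(true, nil, cluster))
	fmt.Printf("non-enterprise:         %+v\n", decide(false, policies, cluster))
}
```

The point of the design is that only one controller writes the initial secret when a policy applies, so the cluster never observes the empty version first.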

Testing

  • Unit test coverage for policy targeting scenarios
    • Verify requeue behaviour when policies target ES clusters
    • Confirm empty secret creation when no policies exist
  • Manual testing with pre-existing StackConfigPolicies confirmed that the policy is applied at cluster creation
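The scenarios listed above lend themselves to a table-driven layout. The sketch below shows that shape only; it is not taken from the PR's test files. The `decide` function is a hypothetical stand-in for the logic under test, and the case names are illustrative. It uses a plain `main` rather than the `testing` package so that it runs on its own.

```go
package main

import "fmt"

// decision is a hypothetical summary of what the ES controller does.
type decision string

const (
	createEmpty decision = "create-empty-secret"
	deferToSCP  decision = "defer-and-requeue"
)

// decide is a stand-in for the logic under test, not the operator's function.
// targetingPolicies is the number of policies that select the cluster.
func decide(enterprise bool, targetingPolicies int) decision {
	if enterprise && targetingPolicies > 0 {
		return deferToSCP
	}
	return createEmpty
}

// testCase describes one scenario and its expected outcome.
type testCase struct {
	name              string
	enterprise        bool
	targetingPolicies int
	want              decision
}

// cases covers the scenarios named in the Testing section.
func cases() []testCase {
	return []testCase{
		{"enterprise disabled, policy exists", false, 1, createEmpty},
		{"enterprise enabled, no policies", true, 0, createEmpty},
		{"enterprise enabled, one targeting policy", true, 1, deferToSCP},
		{"enterprise enabled, several targeting policies", true, 3, deferToSCP},
	}
}

// runCases returns the names of the cases whose outcome differs from want.
func runCases(cs []testCase) []string {
	var failed []string
	for _, tc := range cs {
		if got := decide(tc.enterprise, tc.targetingPolicies); got != tc.want {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	all := cases()
	failed := runCases(all)
	fmt.Printf("%d cases run, %d failed\n", len(all), len(failed))
	for _, name := range failed {
		fmt.Println("FAILED:", name)
	}
}
```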

@prodsecmachine
Collaborator

prodsecmachine commented Nov 24, 2025

Snyk checks have passed. No issues have been found so far.

| Scanner | Critical | High | Medium | Low | Total (0) |
|---|---|---|---|---|---|
| Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| Licenses | 0 | 0 | 0 | 0 | 0 issues |


@botelastic botelastic bot added the triage label Nov 24, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the fix/scp_preexisting_apply_immediately branch 2 times, most recently from 5ed57af to 89bebd7 November 24, 2025 21:12
@pkoutsovasilis pkoutsovasilis added the >bug (Something isn't working) and v3.3.0 labels and removed the triage label Nov 24, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the fix/scp_preexisting_apply_immediately branch from 89bebd7 to 8f68eca November 24, 2025 21:25
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review November 24, 2025 21:26
@pkoutsovasilis pkoutsovasilis linked an issue Nov 25, 2025 that may be closed by this pull request
@pkoutsovasilis pkoutsovasilis self-assigned this Nov 26, 2025
Contributor

@barkbay barkbay left a comment

LGTM

@barkbay
Contributor

barkbay commented Nov 28, 2025

buildkite test this -f p=gke,t=TestStackConfigPolicy*

@pkoutsovasilis pkoutsovasilis requested a review from pebrc November 28, 2025 08:25
Collaborator

@pebrc pebrc left a comment

LGTM!

@pkoutsovasilis pkoutsovasilis merged commit 8b21497 into elastic:main Nov 28, 2025
9 checks passed
alexlebens pushed a commit to alexlebens/infrastructure that referenced this pull request Feb 3, 2026
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [eck-operator](https://github.com/elastic/cloud-on-k8s) | minor | `3.2.0` → `3.3.0` |

---

### Release Notes

<details>
<summary>elastic/cloud-on-k8s (eck-operator)</summary>

### [`v3.3.0`](https://github.com/elastic/cloud-on-k8s/releases/tag/v3.3.0)

[Compare Source](elastic/cloud-on-k8s@v3.2.0...v3.3.0)

##### Elastic Cloud on Kubernetes 3.3.0

- [Quickstart guide](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s#eck-quickstart)

##### Release Highlights

##### AutoOps Integration (Enterprise feature)

ECK now supports integration with Elastic AutoOps through a new `AutoOpsAgentPolicy` custom resource. This allows you to instrument multiple Elasticsearch clusters at once for automated health monitoring and performance recommendations. The [AutoOps documentation](https://www.elastic.co/docs/deploy-manage/monitor/autoops) provides more details.

##### Elastic Package Registry Integration

ECK now supports deploying and managing Elastic Package Registry (EPR) through a new `PackageRegistry` custom resource. This is particularly useful for air-gapped environments, enabling Kibana to reference a self-hosted registry instead of the public one. The [package registry documentation](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/package-registry) provides more details.

##### Multiple Stack Configuration Policies composition support (Enterprise feature)

ECK now includes support for multiple Stack Config Policies targeting the same Elasticsearch cluster or Kibana instance, using a weight-based priority system for deterministic policy composition. The [stack config policy documentation](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/elastic-stack-configuration-policies) provides more details.

##### Features and enhancements

- AutoOpsAgentPolicy support [#8941](elastic/cloud-on-k8s#8941) (issue: [#8789](elastic/cloud-on-k8s#8789))
- ElasticPackageRegistry support [#8800](elastic/cloud-on-k8s#8800) (issue: [#8925](elastic/cloud-on-k8s#8925))
- Stack Config Policies composition support [#8917](elastic/cloud-on-k8s#8917)
- Use standard Kibana labels and Helm labels on the ECK Operator pod [#8840](elastic/cloud-on-k8s#8840) (issue: [#8584](elastic/cloud-on-k8s#8584))
- Add service customization support for Elasticsearch remote cluster server [#8892](elastic/cloud-on-k8s#8892)
- Removal of Elasticsearch 6.x support from codebase [#8979](elastic/cloud-on-k8s#8979)

##### Fixes

- Upgrade master StatefulSets last when performing a version upgrade of Elasticsearch [#8871](elastic/cloud-on-k8s#8871) (issue: [#8429](elastic/cloud-on-k8s#8429))
- Fix race condition for pre-existing Stack Config Policy [#8928](elastic/cloud-on-k8s#8928) (issue: [#8912](elastic/cloud-on-k8s#8912))
- Do not set Kibana server.name [#8930](elastic/cloud-on-k8s#8930) (issue: [#8929](elastic/cloud-on-k8s#8929))
- Do not write `elasticsearch.k8s.elastic.co/managed-remote-clusters` when not necessary [#8932](elastic/cloud-on-k8s#8932) (issue: [#8781](elastic/cloud-on-k8s#8781))
- Cleanup orphaned secret mounts when removed from StackConfigPolicy [#8937](elastic/cloud-on-k8s#8937) (issue: [#8921](elastic/cloud-on-k8s#8921))
- Avoid duplicate error logging for generate GET operations on a GVK [#8957](elastic/cloud-on-k8s#8957)
- Remove single master at a time upscale restriction [#8940](elastic/cloud-on-k8s#8940) (issue: [#8939](elastic/cloud-on-k8s#8939))
- AutoOps: Ignore deprecated ES clusters [#9008](elastic/cloud-on-k8s#9008) (issue: [#9000](elastic/cloud-on-k8s#9000))
- AutoOps: Require 9.2.1 for AutoOps agent [#9007](elastic/cloud-on-k8s#9007) (issue: [#9000](elastic/cloud-on-k8s#9000))
- Multi-SCP: Flip weight semantics - higher weight takes precedence [#9046](elastic/cloud-on-k8s#9046)

##### Documentation improvements

- Update Google Cloud LoadBalancer recipe for new requirements [#8843](elastic/cloud-on-k8s#8843)
- Fix minUnavailable typo in PDB documentation [#8898](elastic/cloud-on-k8s#8898)
- Use GKE ComputeClass instead of DaemonSet for GKE AutoPilot [#8982](elastic/cloud-on-k8s#8982)
- Adjust `vm.max_map_count` to `1048576` in GKE AutoPilot recipes [#8986](elastic/cloud-on-k8s#8986)
- Remove support for Stack 7.17. [#9038](elastic/cloud-on-k8s#9038)

##### Dependency updates

- Go 1.25.2 => 1.25.6
- github.com/KimMachineGun/automemlimit v0.7.4 => v0.7.5
- github.com/elastic/go-ucfg v0.8.9-0.20250307075119-2a22403faaea => v0.8.9-0.20251017163010-3520930bed4f
- github.com/gkampitakis/go-snaps v0.5.15 => v0.5.19
- github.com/google/go-containerregistry v0.20.6 => v0.20.7
- github.com/googlecloudplatform/compute-class-api => v0.0.0-20251208134148-ae2e7936c1f8
- github.com/prometheus/common v0.67.1 => v0.67.5
- github.com/spf13/cobra v1.10.1 => v1.10.2
- go.elastic.co/apm/v2 v2.7.1 => v2.7.2
- go.uber.org/zap v1.27.0 => v1.27.1
- golang.org/x/crypto v0.40.0 => v0.46.0
- k8s.io/api v0.34.1 => v0.35.0
- k8s.io/apimachinery v0.34.1 => v0.35.0
- k8s.io/client-go v0.34.1 => v0.35.0
- k8s.io/utils v0.0.0-20250604170112-4c0f3b243397 => v0.0.0-20251002143259-bc988d571ff4
- sigs.k8s.io/controller-runtime v0.22.2 => v0.22.4
- sigs.k8s.io/controller-tools v0.19.0 => v0.20.0

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/3682
Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net>
Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>
alexlebens pushed a commit to alexlebens/infrastructure that referenced this pull request Feb 3, 2026
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [elastic/cloud-on-k8s](https://github.com/elastic/cloud-on-k8s) | minor | `v3.2.0` → `v3.3.0` |

---

Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/3685
Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net>
Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>

Labels

>bug (Something isn't working), v3.3.0

Development

Successfully merging this pull request may close these issues.

StackConfigPolicy might not apply immediately on new clusters

4 participants