Skip to content

Remove one master at a time upscale restriction#8940

Merged
pebrc merged 6 commits intoelastic:mainfrom
pebrc:master-upscale-state
Dec 11, 2025
Merged

Remove one master at a time upscale restriction#8940
pebrc merged 6 commits intoelastic:mainfrom
pebrc:master-upscale-state

Conversation

@pebrc
Copy link
Collaborator

@pebrc pebrc commented Dec 4, 2025

Fixes #8939

#8939 describes how requests to Elasticsearch are made during upscales. When scaling an overloaded ES cluster this can be problematic as the cluster won't be able to serve these requests in time.

This change removes the source of those requests: the on master at a time scale up restriction. This functionality stems for a time when the ECK operator had to manually reconfigure the quorum during scale up (Elasticsearch 6 and earlier). We no longer support these early versions where a quick scale up could lead to potential split brain scenarios if a partition was formed during scale up.

By removing the one-at-at-time invariant the call to the _nodes API during scale up becomes unnecessary. The call is now only made when calculating the scale downs.

@prodsecmachine
Copy link
Collaborator

prodsecmachine commented Dec 4, 2025

Snyk checks have passed. No issues have been found so far.

Status Scanner Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues
Licenses 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@botelastic botelastic bot added the triage label Dec 4, 2025
@pebrc pebrc changed the title Dedicated master upscale state Remove one master at a time upscale restriction Dec 5, 2025
@pebrc pebrc added the >bug Something isn't working label Dec 5, 2025
@botelastic botelastic bot removed the triage label Dec 5, 2025
@pebrc pebrc marked this pull request as ready for review December 5, 2025 09:54
@pebrc
Copy link
Collaborator Author

pebrc commented Dec 5, 2025

buildkite test this -f p=kind,t="TestMutation.*"

@pebrc pebrc requested a review from Copilot December 5, 2025 09:58
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes the restriction that limited master node upscaling to one node at a time, which was necessary for Elasticsearch 6 and earlier versions to prevent split-brain scenarios during quorum reconfiguration. Since ECK no longer supports these early versions, this restriction is unnecessary and causes problems when scaling overloaded clusters, as it requires calls to the Elasticsearch _nodes API that may fail.

Key Changes:

  • Removed one-at-a-time master node creation logic and associated ES API calls during upscale
  • Exported InitialMasterNodesAnnotation constant for broader use
  • Added test coverage for upscale scenarios with unavailable Elasticsearch API

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pkg/controller/elasticsearch/driver/upscale.go Removed comment about limiting master node creation to one at a time
pkg/controller/elasticsearch/driver/upscale_state.go Removed isBootstrapped, allowMasterCreation fields and related logic including isMasterNodeJoining, recordMasterNodeCreation, canCreateMasterNode, and limitMasterNodesCreation functions
pkg/controller/elasticsearch/driver/upscale_state_test.go Updated tests to remove bootstrap and master-specific logic, simplified test cases
pkg/controller/elasticsearch/driver/upscale_test.go Updated test expectations to allow multiple master nodes to be created at once, added test for upscale with unavailable ES API
pkg/controller/elasticsearch/version/zen2/initial_master_nodes.go Changed initialMasterNodesAnnotation from private to exported InitialMasterNodesAnnotation
pkg/controller/elasticsearch/version/zen2/initial_master_nodes_test.go Updated references to use exported constant name
test/e2e/test/elasticsearch/checks_budget.go Removed NewMasterChangeBudgetWatcher function and unused math import
test/e2e/test/elasticsearch/steps_mutation.go Removed NewMasterChangeBudgetWatcher from mutation test watchers
pkg/controller/common/expectations/expectations.go Removed comment about preventing creation of more than one master node at a time

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@pebrc
Copy link
Collaborator Author

pebrc commented Dec 5, 2025

buildkite test this -f p=gke,t=TestMutation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to understand in which cases we still need this. Could we just remove it?

diff --git a/pkg/controller/elasticsearch/driver/nodes.go b/pkg/controller/elasticsearch/driver/nodes.go
index 65199fc48..7beb9f0e4 100644
--- a/pkg/controller/elasticsearch/driver/nodes.go
+++ b/pkg/controller/elasticsearch/driver/nodes.go
@@ -96,13 +96,11 @@ func (d *defaultDriver) reconcileNodeSpecs(
                }
        }
 
-       esState := NewMemoizingESState(ctx, esClient)
        // Phase 1: apply expected StatefulSets resources and scale up.
        upscaleCtx := upscaleCtx{
                parentCtx:            ctx,
                k8sClient:            d.K8sClient(),
                es:                   d.ES,
-               esState:              esState,
                expectations:         d.Expectations,
                validateStorageClass: d.OperatorParameters.ValidateStorageClass,
                upscaleReporter:      reconcileState.UpscaleReporter,
@@ -188,6 +186,7 @@ func (d *defaultDriver) reconcileNodeSpecs(
                results.WithReconciliationState(defaultRequeue.WithReason("Cannot clear voting exclusions yet"))
        }
        // shutdown logic is dependent on Elasticsearch version
+       esState := NewMemoizingESState(ctx, esClient)
        nodeShutdowns, err := newShutdownInterface(ctx, d.ES, esClient, esState, reconcileState.StatusReporter)
        if err != nil {
                return results.WithError(err)
diff --git a/pkg/controller/elasticsearch/driver/upscale.go b/pkg/controller/elasticsearch/driver/upscale.go
index bf2047aaf..df7ee75cc 100644
--- a/pkg/controller/elasticsearch/driver/upscale.go
+++ b/pkg/controller/elasticsearch/driver/upscale.go
@@ -25,10 +25,10 @@ import (
 )
 
 type upscaleCtx struct {
-       parentCtx            context.Context
-       k8sClient            k8s.Client
-       es                   esv1.Elasticsearch
-       esState              ESState
+       parentCtx context.Context
+       k8sClient k8s.Client
+       es        esv1.Elasticsearch
+
        expectations         *expectations.Expectations
        validateStorageClass bool
        upscaleReporter      *reconcile.UpscaleReporter

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 42ee6eb I think we still want ESState for the predicates and the shutdown handling but it is no longer required for the upscale and spec changes part.

@barkbay barkbay self-requested a review December 11, 2025 09:13
Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@pebrc pebrc merged commit bbf5ff4 into elastic:main Dec 11, 2025
9 checks passed
@naemono naemono added the v3.3.0 label Jan 18, 2026
alexlebens pushed a commit to alexlebens/infrastructure that referenced this pull request Feb 3, 2026
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [eck-operator](https://github.com/elastic/cloud-on-k8s) | minor | `3.2.0` → `3.3.0` |

---

### Release Notes

<details>
<summary>elastic/cloud-on-k8s (eck-operator)</summary>

### [`v3.3.0`](https://github.com/elastic/cloud-on-k8s/releases/tag/v3.3.0)

[Compare Source](elastic/cloud-on-k8s@v3.2.0...v3.3.0)

##### Elastic Cloud on Kubernetes 3.3.0

- [Quickstart guide](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s#eck-quickstart)

##### Release Highlights

##### AutoOps Integration (Enterprise feature)

ECK now supports integration with Elastic AutoOps through a new `AutoOpsAgentPolicy` custom resource. This allows you to instrument multiple Elasticsearch clusters at once for automated health monitoring and performance recommendations. The [AutoOps documentation](https://www.elastic.co/docs/deploy-manage/monitor/autoops) provides more details.

##### Elastic Package Registry Integration

ECK now supports deploying and managing Elastic Package Registry (EPR) through a new `PackageRegistry` custom resource. This is particularly useful for air-gapped environments, enabling Kibana to reference a self-hosted registry instead of the public one. The [package registry documentation](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/package-registry) provides more details.

##### Multiple Stack Configuration Policies composition support (Enterprise feature)

ECK now includes support for multiple Stack Config Policies targeting the same Elasticsearch cluster or Kibana instance, using a weight-based priority system for deterministic policy composition. The [stack config policy documentation](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/elastic-stack-configuration-policies) provides more details.

##### Features and enhancements

- AutoOpsAgentPolicy support [#&#8203;8941](elastic/cloud-on-k8s#8941) (issue: [#&#8203;8789](elastic/cloud-on-k8s#8789))
- ElasticPackageRegistry support [#&#8203;8800](elastic/cloud-on-k8s#8800) (issue: [#&#8203;8925](elastic/cloud-on-k8s#8925))
- Stack Config Policies composition support [#&#8203;8917](elastic/cloud-on-k8s#8917)
- Use standard Kibana labels and Helm labels on the ECK Operator pod [#&#8203;8840](elastic/cloud-on-k8s#8840) (issue: [#&#8203;8584](elastic/cloud-on-k8s#8584))
- Add service customization support for Elasticsearch remote cluster server [#&#8203;8892](elastic/cloud-on-k8s#8892)
- Removal of Elasticsearch 6.x support from codebase [#&#8203;8979](elastic/cloud-on-k8s#8979)

##### Fixes

- Upgrade master StatefulSets last when performing a version upgrade of Elasticsearch [#&#8203;8871](elastic/cloud-on-k8s#8871) (issue: [#&#8203;8429](elastic/cloud-on-k8s#8429))
- Fix race condition for pre-existing Stack Config Policy [#&#8203;8928](elastic/cloud-on-k8s#8928) (issue: [#&#8203;8912](elastic/cloud-on-k8s#8912))
- Do not set Kibana server.name [#&#8203;8930](elastic/cloud-on-k8s#8930) (issue: [#&#8203;8929](elastic/cloud-on-k8s#8929))
- Do not write `elasticsearch.k8s.elastic.co/managed-remote-clusters` when not necessary [#&#8203;8932](elastic/cloud-on-k8s#8932) (issue: [#&#8203;8781](elastic/cloud-on-k8s#8781))
- Cleanup orphaned secret mounts when removed from StackConfigPolicy [#&#8203;8937](elastic/cloud-on-k8s#8937) (issue: [#&#8203;8921](elastic/cloud-on-k8s#8921))
- Avoid duplicate error logging for generate GET operations on a GVK [#&#8203;8957](elastic/cloud-on-k8s#8957)
- Remove single master at a time upscale restriction [#&#8203;8940](elastic/cloud-on-k8s#8940) (issue: [#&#8203;8939](elastic/cloud-on-k8s#8939))
- AutoOps: Ignore deprecated ES clusters [#&#8203;9008](elastic/cloud-on-k8s#9008) (issue: [#&#8203;9000](elastic/cloud-on-k8s#9000))
- AutoOps: Require 9.2.1 for AutoOps agent [#&#8203;9007](elastic/cloud-on-k8s#9007) (issue: [#&#8203;9000](elastic/cloud-on-k8s#9000))
- Multi-SCP: Flip weight semantics - higher weight takes precedence [#&#8203;9046](elastic/cloud-on-k8s#9046)

##### Documentation improvements

- Update Google Cloud LoadBalancer recipe for new requirements [#&#8203;8843](elastic/cloud-on-k8s#8843)
- Fix minUnavailable typo in PDB documentation [#&#8203;8898](elastic/cloud-on-k8s#8898)
- Use GKE ComputeClass instead of DaemonSet for GKE AutoPilot [#&#8203;8982](elastic/cloud-on-k8s#8982)
- Adjust `vm.max_map_count` to [`1048576`](elastic/cloud-on-k8s@1048576) in GKE AutoPilot recipes [#&#8203;8986](elastic/cloud-on-k8s#8986)
- Remove support for Stack 7.17. [#&#8203;9038](elastic/cloud-on-k8s#9038)

##### Dependency updates

- Go 1.25.2 => 1.25.6
- github.com/KimMachineGun/automemlimit v0.7.4 => v0.7.5
- github.com/elastic/go-ucfg v0.8.9-0.20250307075119-2a22403faaea => v0.8.9-0.20251017163010-3520930bed4f
- github.com/gkampitakis/go-snaps v0.5.15 => v0.5.19
- github.com/google/go-containerregistry v0.20.6 => v0.20.7
- github.com/googlecloudplatform/compute-class-api => v0.0.0-20251208134148-ae2e7936c1f8
- github.com/prometheus/common v0.67.1 => v0.67.5
- github.com/spf13/cobra v1.10.1 => v1.10.2
- go.elastic.co/apm/v2 v2.7.1 => v2.7.2
- go.uber.org/zap v1.27.0 => v1.27.1
- golang.org/x/crypto v0.40.0 => v0.46.0
- k8s.io/api v0.34.1 => v0.35.0
- k8s.io/apimachinery v0.34.1 => v0.35.0
- k8s.io/client-go v0.34.1 => v0.35.0
- k8s.io/utils v0.0.0-20250604170112-4c0f3b243397 => v0.0.0-20251002143259-bc988d571ff4
- sigs.k8s.io/controller-runtime v0.22.2 => v0.22.4
- sigs.k8s.io/controller-tools v0.19.0 => v0.20.0

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4wLjMiLCJ1cGRhdGVkSW5WZXIiOiI0My4wLjMiLCJ0YXJnZXRCcmFuY2giOiJtYWluIiwibGFiZWxzIjpbImNoYXJ0Il19-->

Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/3682
Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net>
Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>
alexlebens pushed a commit to alexlebens/infrastructure that referenced this pull request Feb 3, 2026
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [elastic/cloud-on-k8s](https://github.com/elastic/cloud-on-k8s) | minor | `v3.2.0` → `v3.3.0` |

---

### Release Notes

<details>
<summary>elastic/cloud-on-k8s (elastic/cloud-on-k8s)</summary>

### [`v3.3.0`](https://github.com/elastic/cloud-on-k8s/releases/tag/v3.3.0)

[Compare Source](elastic/cloud-on-k8s@v3.2.0...v3.3.0)

### Elastic Cloud on Kubernetes 3.3.0

- [Quickstart guide](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s#eck-quickstart)

##### Release Highlights

##### AutoOps Integration (Enterprise feature)

ECK now supports integration with Elastic AutoOps through a new `AutoOpsAgentPolicy` custom resource. This allows you to instrument multiple Elasticsearch clusters at once for automated health monitoring and performance recommendations. The [AutoOps documentation](https://www.elastic.co/docs/deploy-manage/monitor/autoops) provides more details.

##### Elastic Package Registry Integration

ECK now supports deploying and managing Elastic Package Registry (EPR) through a new `PackageRegistry` custom resource. This is particularly useful for air-gapped environments, enabling Kibana to reference a self-hosted registry instead of the public one. The [package registry documentation](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/package-registry) provides more details.

##### Multiple Stack Configuration Policies composition support (Enterprise feature)

ECK now includes support for multiple Stack Config Policies targeting the same Elasticsearch cluster or Kibana instance, using a weight-based priority system for deterministic policy composition. The [stack config policy documentation](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/elastic-stack-configuration-policies) provides more details.

##### Features and enhancements

- AutoOpsAgentPolicy support [#&#8203;8941](elastic/cloud-on-k8s#8941) (issue: [#&#8203;8789](elastic/cloud-on-k8s#8789))
- ElasticPackageRegistry support [#&#8203;8800](elastic/cloud-on-k8s#8800) (issue: [#&#8203;8925](elastic/cloud-on-k8s#8925))
- Stack Config Policies composition support [#&#8203;8917](elastic/cloud-on-k8s#8917)
- Use standard Kibana labels and Helm labels on the ECK Operator pod [#&#8203;8840](elastic/cloud-on-k8s#8840) (issue: [#&#8203;8584](elastic/cloud-on-k8s#8584))
- Add service customization support for Elasticsearch remote cluster server [#&#8203;8892](elastic/cloud-on-k8s#8892)
- Removal of Elasticsearch 6.x support from codebase [#&#8203;8979](elastic/cloud-on-k8s#8979)

##### Fixes

- Upgrade master StatefulSets last when performing a version upgrade of Elasticsearch [#&#8203;8871](elastic/cloud-on-k8s#8871) (issue: [#&#8203;8429](elastic/cloud-on-k8s#8429))
- Fix race condition for pre-existing Stack Config Policy [#&#8203;8928](elastic/cloud-on-k8s#8928) (issue: [#&#8203;8912](elastic/cloud-on-k8s#8912))
- Do not set Kibana server.name [#&#8203;8930](elastic/cloud-on-k8s#8930) (issue: [#&#8203;8929](elastic/cloud-on-k8s#8929))
- Do not write `elasticsearch.k8s.elastic.co/managed-remote-clusters` when not necessary [#&#8203;8932](elastic/cloud-on-k8s#8932) (issue: [#&#8203;8781](elastic/cloud-on-k8s#8781))
- Cleanup orphaned secret mounts when removed from StackConfigPolicy [#&#8203;8937](elastic/cloud-on-k8s#8937) (issue: [#&#8203;8921](elastic/cloud-on-k8s#8921))
- Avoid duplicate error logging for generate GET operations on a GVK [#&#8203;8957](elastic/cloud-on-k8s#8957)
- Remove single master at a time upscale restriction [#&#8203;8940](elastic/cloud-on-k8s#8940) (issue: [#&#8203;8939](elastic/cloud-on-k8s#8939))
- AutoOps: Ignore deprecated ES clusters [#&#8203;9008](elastic/cloud-on-k8s#9008) (issue: [#&#8203;9000](elastic/cloud-on-k8s#9000))
- AutoOps: Require 9.2.1 for AutoOps agent [#&#8203;9007](elastic/cloud-on-k8s#9007) (issue: [#&#8203;9000](elastic/cloud-on-k8s#9000))
- Multi-SCP: Flip weight semantics - higher weight takes precedence [#&#8203;9046](elastic/cloud-on-k8s#9046)

##### Documentation improvements

- Update Google Cloud LoadBalancer recipe for new requirements [#&#8203;8843](elastic/cloud-on-k8s#8843)
- Fix minUnavailable typo in PDB documentation [#&#8203;8898](elastic/cloud-on-k8s#8898)
- Use GKE ComputeClass instead of DaemonSet for GKE AutoPilot [#&#8203;8982](elastic/cloud-on-k8s#8982)
- Adjust `vm.max_map_count` to [`1048576`](elastic/cloud-on-k8s@1048576) in GKE AutoPilot recipes [#&#8203;8986](elastic/cloud-on-k8s#8986)
- Remove support for Stack 7.17. [#&#8203;9038](elastic/cloud-on-k8s#9038)

##### Dependency updates

- Go 1.25.2 => 1.25.6
- github.com/KimMachineGun/automemlimit v0.7.4 => v0.7.5
- github.com/elastic/go-ucfg v0.8.9-0.20250307075119-2a22403faaea => v0.8.9-0.20251017163010-3520930bed4f
- github.com/gkampitakis/go-snaps v0.5.15 => v0.5.19
- github.com/google/go-containerregistry v0.20.6 => v0.20.7
- github.com/googlecloudplatform/compute-class-api => v0.0.0-20251208134148-ae2e7936c1f8
- github.com/prometheus/common v0.67.1 => v0.67.5
- github.com/spf13/cobra v1.10.1 => v1.10.2
- go.elastic.co/apm/v2 v2.7.1 => v2.7.2
- go.uber.org/zap v1.27.0 => v1.27.1
- golang.org/x/crypto v0.40.0 => v0.46.0
- k8s.io/api v0.34.1 => v0.35.0
- k8s.io/apimachinery v0.34.1 => v0.35.0
- k8s.io/client-go v0.34.1 => v0.35.0
- k8s.io/utils v0.0.0-20250604170112-4c0f3b243397 => v0.0.0-20251002143259-bc988d571ff4
- sigs.k8s.io/controller-runtime v0.22.2 => v0.22.4
- sigs.k8s.io/controller-tools v0.19.0 => v0.20.0

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4wLjMiLCJ1cGRhdGVkSW5WZXIiOiI0My4wLjMiLCJ0YXJnZXRCcmFuY2giOiJtYWluIiwibGFiZWxzIjpbImltYWdlIl19-->

Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/3685
Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net>
Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug Something isn't working v3.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Scaling up non-master nodes should not be blocked by unresponsive Elasticsearch API

5 participants