Remove one master at a time upscale restriction#8940
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
|
buildkite test this -f p=kind,t="TestMutation.*" |
There was a problem hiding this comment.
Pull request overview
This PR removes the restriction that limited master node upscaling to one node at a time, which was necessary for Elasticsearch 6 and earlier versions to prevent split-brain scenarios during quorum reconfiguration. Since ECK no longer supports these early versions, this restriction is unnecessary and causes problems when scaling overloaded clusters, as it requires calls to the Elasticsearch _nodes API that may fail.
Key Changes:
- Removed one-at-a-time master node creation logic and associated ES API calls during upscale
- Exported
InitialMasterNodesAnnotationconstant for broader use - Added test coverage for upscale scenarios with unavailable Elasticsearch API
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/controller/elasticsearch/driver/upscale.go | Removed comment about limiting master node creation to one at a time |
| pkg/controller/elasticsearch/driver/upscale_state.go | Removed isBootstrapped, allowMasterCreation fields and related logic including isMasterNodeJoining, recordMasterNodeCreation, canCreateMasterNode, and limitMasterNodesCreation functions |
| pkg/controller/elasticsearch/driver/upscale_state_test.go | Updated tests to remove bootstrap and master-specific logic, simplified test cases |
| pkg/controller/elasticsearch/driver/upscale_test.go | Updated test expectations to allow multiple master nodes to be created at once, added test for upscale with unavailable ES API |
| pkg/controller/elasticsearch/version/zen2/initial_master_nodes.go | Changed initialMasterNodesAnnotation from private to exported InitialMasterNodesAnnotation |
| pkg/controller/elasticsearch/version/zen2/initial_master_nodes_test.go | Updated references to use exported constant name |
| test/e2e/test/elasticsearch/checks_budget.go | Removed NewMasterChangeBudgetWatcher function and unused math import |
| test/e2e/test/elasticsearch/steps_mutation.go | Removed NewMasterChangeBudgetWatcher from mutation test watchers |
| pkg/controller/common/expectations/expectations.go | Removed comment about preventing creation of more than one master node at a time |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
buildkite test this -f p=gke,t=TestMutation |
There was a problem hiding this comment.
I'm trying to understand in which cases we still need this. Could we just remove it?
diff --git a/pkg/controller/elasticsearch/driver/nodes.go b/pkg/controller/elasticsearch/driver/nodes.go
index 65199fc48..7beb9f0e4 100644
--- a/pkg/controller/elasticsearch/driver/nodes.go
+++ b/pkg/controller/elasticsearch/driver/nodes.go
@@ -96,13 +96,11 @@ func (d *defaultDriver) reconcileNodeSpecs(
}
}
- esState := NewMemoizingESState(ctx, esClient)
// Phase 1: apply expected StatefulSets resources and scale up.
upscaleCtx := upscaleCtx{
parentCtx: ctx,
k8sClient: d.K8sClient(),
es: d.ES,
- esState: esState,
expectations: d.Expectations,
validateStorageClass: d.OperatorParameters.ValidateStorageClass,
upscaleReporter: reconcileState.UpscaleReporter,
@@ -188,6 +186,7 @@ func (d *defaultDriver) reconcileNodeSpecs(
results.WithReconciliationState(defaultRequeue.WithReason("Cannot clear voting exclusions yet"))
}
// shutdown logic is dependent on Elasticsearch version
+ esState := NewMemoizingESState(ctx, esClient)
nodeShutdowns, err := newShutdownInterface(ctx, d.ES, esClient, esState, reconcileState.StatusReporter)
if err != nil {
return results.WithError(err)
diff --git a/pkg/controller/elasticsearch/driver/upscale.go b/pkg/controller/elasticsearch/driver/upscale.go
index bf2047aaf..df7ee75cc 100644
--- a/pkg/controller/elasticsearch/driver/upscale.go
+++ b/pkg/controller/elasticsearch/driver/upscale.go
@@ -25,10 +25,10 @@ import (
)
type upscaleCtx struct {
- parentCtx context.Context
- k8sClient k8s.Client
- es esv1.Elasticsearch
- esState ESState
+ parentCtx context.Context
+ k8sClient k8s.Client
+ es esv1.Elasticsearch
+
expectations *expectations.Expectations
validateStorageClass bool
upscaleReporter *reconcile.UpscaleReporterThere was a problem hiding this comment.
Addressed in 42ee6eb I think we still want ESState for the predicates and the shutdown handling but it is no longer required for the upscale and spec changes part.
This PR contains the following updates: | Package | Update | Change | |---|---|---| | [eck-operator](https://github.com/elastic/cloud-on-k8s) | minor | `3.2.0` → `3.3.0` | --- ### Release Notes <details> <summary>elastic/cloud-on-k8s (eck-operator)</summary> ### [`v3.3.0`](https://github.com/elastic/cloud-on-k8s/releases/tag/v3.3.0) [Compare Source](elastic/cloud-on-k8s@v3.2.0...v3.3.0) ##### Elastic Cloud on Kubernetes 3.3.0 - [Quickstart guide](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s#eck-quickstart) ##### Release Highlights ##### AutoOps Integration (Enterprise feature) ECK now supports integration with Elastic AutoOps through a new `AutoOpsAgentPolicy` custom resource. This allows you to instrument multiple Elasticsearch clusters at once for automated health monitoring and performance recommendations. The [AutoOps documentation](https://www.elastic.co/docs/deploy-manage/monitor/autoops) provides more details. ##### Elastic Package Registry Integration ECK now supports deploying and managing Elastic Package Registry (EPR) through a new `PackageRegistry` custom resource. This is particularly useful for air-gapped environments, enabling Kibana to reference a self-hosted registry instead of the public one. The [package registry documentation](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/package-registry) provides more details. ##### Multiple Stack Configuration Policies composition support (Enterprise feature) ECK now includes support for multiple Stack Config Policies targeting the same Elasticsearch cluster or Kibana instance, using a weight-based priority system for deterministic policy composition. The [stack config policy documentation](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/elastic-stack-configuration-policies) provides more details. ##### Features and enhancements - AutoOpsAgentPolicy support [#​8941](elastic/cloud-on-k8s#8941) (issue: [#​8789](elastic/cloud-on-k8s#8789)) - ElasticPackageRegistry support [#​8800](elastic/cloud-on-k8s#8800) (issue: [#​8925](elastic/cloud-on-k8s#8925)) - Stack Config Policies composition support [#​8917](elastic/cloud-on-k8s#8917) - Use standard Kibana labels and Helm labels on the ECK Operator pod [#​8840](elastic/cloud-on-k8s#8840) (issue: [#​8584](elastic/cloud-on-k8s#8584)) - Add service customization support for Elasticsearch remote cluster server [#​8892](elastic/cloud-on-k8s#8892) - Removal of Elasticsearch 6.x support from codebase [#​8979](elastic/cloud-on-k8s#8979) ##### Fixes - Upgrade master StatefulSets last when performing a version upgrade of Elasticsearch [#​8871](elastic/cloud-on-k8s#8871) (issue: [#​8429](elastic/cloud-on-k8s#8429)) - Fix race condition for pre-existing Stack Config Policy [#​8928](elastic/cloud-on-k8s#8928) (issue: [#​8912](elastic/cloud-on-k8s#8912)) - Do not set Kibana server.name [#​8930](elastic/cloud-on-k8s#8930) (issue: [#​8929](elastic/cloud-on-k8s#8929)) - Do not write `elasticsearch.k8s.elastic.co/managed-remote-clusters` when not necessary [#​8932](elastic/cloud-on-k8s#8932) (issue: [#​8781](elastic/cloud-on-k8s#8781)) - Cleanup orphaned secret mounts when removed from StackConfigPolicy [#​8937](elastic/cloud-on-k8s#8937) (issue: [#​8921](elastic/cloud-on-k8s#8921)) - Avoid duplicate error logging for generate GET operations on a GVK [#​8957](elastic/cloud-on-k8s#8957) - Remove single master at a time upscale restriction [#​8940](elastic/cloud-on-k8s#8940) (issue: [#​8939](elastic/cloud-on-k8s#8939)) - AutoOps: Ignore deprecated ES clusters [#​9008](elastic/cloud-on-k8s#9008) (issue: [#​9000](elastic/cloud-on-k8s#9000)) - AutoOps: Require 9.2.1 for AutoOps agent [#​9007](elastic/cloud-on-k8s#9007) (issue: [#​9000](elastic/cloud-on-k8s#9000)) - Multi-SCP: Flip weight semantics - higher weight takes precedence [#​9046](elastic/cloud-on-k8s#9046) ##### Documentation improvements - Update Google Cloud LoadBalancer recipe for new requirements [#​8843](elastic/cloud-on-k8s#8843) - Fix minUnavailable typo in PDB documentation [#​8898](elastic/cloud-on-k8s#8898) - Use GKE ComputeClass instead of DaemonSet for GKE AutoPilot [#​8982](elastic/cloud-on-k8s#8982) - Adjust `vm.max_map_count` to [`1048576`](elastic/cloud-on-k8s@1048576) in GKE AutoPilot recipes [#​8986](elastic/cloud-on-k8s#8986) - Remove support for Stack 7.17. [#​9038](elastic/cloud-on-k8s#9038) ##### Dependency updates - Go 1.25.2 => 1.25.6 - github.com/KimMachineGun/automemlimit v0.7.4 => v0.7.5 - github.com/elastic/go-ucfg v0.8.9-0.20250307075119-2a22403faaea => v0.8.9-0.20251017163010-3520930bed4f - github.com/gkampitakis/go-snaps v0.5.15 => v0.5.19 - github.com/google/go-containerregistry v0.20.6 => v0.20.7 - github.com/googlecloudplatform/compute-class-api => v0.0.0-20251208134148-ae2e7936c1f8 - github.com/prometheus/common v0.67.1 => v0.67.5 - github.com/spf13/cobra v1.10.1 => v1.10.2 - go.elastic.co/apm/v2 v2.7.1 => v2.7.2 - go.uber.org/zap v1.27.0 => v1.27.1 - golang.org/x/crypto v0.40.0 => v0.46.0 - k8s.io/api v0.34.1 => v0.35.0 - k8s.io/apimachinery v0.34.1 => v0.35.0 - k8s.io/client-go v0.34.1 => v0.35.0 - k8s.io/utils v0.0.0-20250604170112-4c0f3b243397 => v0.0.0-20251002143259-bc988d571ff4 - sigs.k8s.io/controller-runtime v0.22.2 => v0.22.4 - sigs.k8s.io/controller-tools v0.19.0 => v0.20.0 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4wLjMiLCJ1cGRhdGVkSW5WZXIiOiI0My4wLjMiLCJ0YXJnZXRCcmFuY2giOiJtYWluIiwibGFiZWxzIjpbImNoYXJ0Il19--> Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/3682 Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net> Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>
This PR contains the following updates: | Package | Update | Change | |---|---|---| | [elastic/cloud-on-k8s](https://github.com/elastic/cloud-on-k8s) | minor | `v3.2.0` → `v3.3.0` | --- ### Release Notes <details> <summary>elastic/cloud-on-k8s (elastic/cloud-on-k8s)</summary> ### [`v3.3.0`](https://github.com/elastic/cloud-on-k8s/releases/tag/v3.3.0) [Compare Source](elastic/cloud-on-k8s@v3.2.0...v3.3.0) ### Elastic Cloud on Kubernetes 3.3.0 - [Quickstart guide](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s#eck-quickstart) ##### Release Highlights ##### AutoOps Integration (Enterprise feature) ECK now supports integration with Elastic AutoOps through a new `AutoOpsAgentPolicy` custom resource. This allows you to instrument multiple Elasticsearch clusters at once for automated health monitoring and performance recommendations. The [AutoOps documentation](https://www.elastic.co/docs/deploy-manage/monitor/autoops) provides more details. ##### Elastic Package Registry Integration ECK now supports deploying and managing Elastic Package Registry (EPR) through a new `PackageRegistry` custom resource. This is particularly useful for air-gapped environments, enabling Kibana to reference a self-hosted registry instead of the public one. The [package registry documentation](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/package-registry) provides more details. ##### Multiple Stack Configuration Policies composition support (Enterprise feature) ECK now includes support for multiple Stack Config Policies targeting the same Elasticsearch cluster or Kibana instance, using a weight-based priority system for deterministic policy composition. The [stack config policy documentation](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/elastic-stack-configuration-policies) provides more details. ##### Features and enhancements - AutoOpsAgentPolicy support [#​8941](elastic/cloud-on-k8s#8941) (issue: [#​8789](elastic/cloud-on-k8s#8789)) - ElasticPackageRegistry support [#​8800](elastic/cloud-on-k8s#8800) (issue: [#​8925](elastic/cloud-on-k8s#8925)) - Stack Config Policies composition support [#​8917](elastic/cloud-on-k8s#8917) - Use standard Kibana labels and Helm labels on the ECK Operator pod [#​8840](elastic/cloud-on-k8s#8840) (issue: [#​8584](elastic/cloud-on-k8s#8584)) - Add service customization support for Elasticsearch remote cluster server [#​8892](elastic/cloud-on-k8s#8892) - Removal of Elasticsearch 6.x support from codebase [#​8979](elastic/cloud-on-k8s#8979) ##### Fixes - Upgrade master StatefulSets last when performing a version upgrade of Elasticsearch [#​8871](elastic/cloud-on-k8s#8871) (issue: [#​8429](elastic/cloud-on-k8s#8429)) - Fix race condition for pre-existing Stack Config Policy [#​8928](elastic/cloud-on-k8s#8928) (issue: [#​8912](elastic/cloud-on-k8s#8912)) - Do not set Kibana server.name [#​8930](elastic/cloud-on-k8s#8930) (issue: [#​8929](elastic/cloud-on-k8s#8929)) - Do not write `elasticsearch.k8s.elastic.co/managed-remote-clusters` when not necessary [#​8932](elastic/cloud-on-k8s#8932) (issue: [#​8781](elastic/cloud-on-k8s#8781)) - Cleanup orphaned secret mounts when removed from StackConfigPolicy [#​8937](elastic/cloud-on-k8s#8937) (issue: [#​8921](elastic/cloud-on-k8s#8921)) - Avoid duplicate error logging for generate GET operations on a GVK [#​8957](elastic/cloud-on-k8s#8957) - Remove single master at a time upscale restriction [#​8940](elastic/cloud-on-k8s#8940) (issue: [#​8939](elastic/cloud-on-k8s#8939)) - AutoOps: Ignore deprecated ES clusters [#​9008](elastic/cloud-on-k8s#9008) (issue: [#​9000](elastic/cloud-on-k8s#9000)) - AutoOps: Require 9.2.1 for AutoOps agent [#​9007](elastic/cloud-on-k8s#9007) (issue: [#​9000](elastic/cloud-on-k8s#9000)) - Multi-SCP: Flip weight semantics - higher weight takes precedence [#​9046](elastic/cloud-on-k8s#9046) ##### Documentation improvements - Update Google Cloud LoadBalancer recipe for new requirements [#​8843](elastic/cloud-on-k8s#8843) - Fix minUnavailable typo in PDB documentation [#​8898](elastic/cloud-on-k8s#8898) - Use GKE ComputeClass instead of DaemonSet for GKE AutoPilot [#​8982](elastic/cloud-on-k8s#8982) - Adjust `vm.max_map_count` to [`1048576`](elastic/cloud-on-k8s@1048576) in GKE AutoPilot recipes [#​8986](elastic/cloud-on-k8s#8986) - Remove support for Stack 7.17. [#​9038](elastic/cloud-on-k8s#9038) ##### Dependency updates - Go 1.25.2 => 1.25.6 - github.com/KimMachineGun/automemlimit v0.7.4 => v0.7.5 - github.com/elastic/go-ucfg v0.8.9-0.20250307075119-2a22403faaea => v0.8.9-0.20251017163010-3520930bed4f - github.com/gkampitakis/go-snaps v0.5.15 => v0.5.19 - github.com/google/go-containerregistry v0.20.6 => v0.20.7 - github.com/googlecloudplatform/compute-class-api => v0.0.0-20251208134148-ae2e7936c1f8 - github.com/prometheus/common v0.67.1 => v0.67.5 - github.com/spf13/cobra v1.10.1 => v1.10.2 - go.elastic.co/apm/v2 v2.7.1 => v2.7.2 - go.uber.org/zap v1.27.0 => v1.27.1 - golang.org/x/crypto v0.40.0 => v0.46.0 - k8s.io/api v0.34.1 => v0.35.0 - k8s.io/apimachinery v0.34.1 => v0.35.0 - k8s.io/client-go v0.34.1 => v0.35.0 - k8s.io/utils v0.0.0-20250604170112-4c0f3b243397 => v0.0.0-20251002143259-bc988d571ff4 - sigs.k8s.io/controller-runtime v0.22.2 => v0.22.4 - sigs.k8s.io/controller-tools v0.19.0 => v0.20.0 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4wLjMiLCJ1cGRhdGVkSW5WZXIiOiI0My4wLjMiLCJ0YXJnZXRCcmFuY2giOiJtYWluIiwibGFiZWxzIjpbImltYWdlIl19--> Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/3685 Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net> Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>
Fixes #8939
#8939 describes how requests to Elasticsearch are made during upscales. When scaling an overloaded ES cluster this can be problematic as the cluster won't be able to serve these requests in time.
This change removes the source of those requests: the on master at a time scale up restriction. This functionality stems for a time when the ECK operator had to manually reconfigure the quorum during scale up (Elasticsearch 6 and earlier). We no longer support these early versions where a quick scale up could lead to potential split brain scenarios if a partition was formed during scale up.
By removing the one-at-at-time invariant the call to the
_nodesAPI during scale up becomes unnecessary. The call is now only made when calculating the scale downs.