[Enhancement] Detect scale-in and drop CN node from FE #663
Description
Fixes: #550
Background
When a scale-in operation happens:
For BE with shared-nothing deployment, the cleanup actions include:
For CN with shared-data deployment, the cleanup actions include:
Among them, `Decommission` and `Stop pod` are time-consuming operations, and the operator cannot block while waiting for them to complete.
In summary, the cleanup actions `Stop pod` and `Drop node` are likely not performed in the same reconcile loop. Therefore, we cannot rely on whether the current operation is a scale-in to decide when to execute this logic.
How
At the end of each reconcile cycle, the operator performs the following validation steps:
1. Verify Replica Consistency
Compare the replicas field in the StarRocksCN Custom Resource Definition (CRD) with the spec.replicas field of the corresponding CN StatefulSet. These values must be identical.
2. Validate Running Pod Count
Compare the replicas field in the StarRocksCluster CRD with the number of running and ready CN pods. These values must match.
3. Confirm Revision Hash Match
Ensure that the `controller-revision-hash` label on all running CN pods exactly matches the `status.updateRevision` field of the CN StatefulSet.
4. Perform DROP COMPUTE NODE
If all three conditions are met, the operator compares the list of compute nodes registered in the Frontend (FE) cluster against the currently running CN pods, and initiates the `DROP COMPUTE NODE` operation for any nodes that are no longer present in the pod list.
Checklist
For operator, please complete the following checklist:
`make generate` to generate the code.
`golangci-lint run` to check the code style.
`make test` to run UT.
`make manifests` to update the yaml files of CRD.
For helm chart, please complete the following checklist:
file of starrocks chart.
In the `scripts` directory, run `bash create-parent-chart-values.sh` to update the values.yaml file of the parent chart (kube-starrocks chart).
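For reviewers, the four validation steps described under How can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the `Pod` and `State` types and the `stable` and `nodesToDrop` functions are hypothetical, and it assumes that a compute node registered in FE can be matched to a pod by name, which the real operator may do differently (for example by host address).

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a simplified, hypothetical view of a CN pod.
type Pod struct {
	Name         string
	Ready        bool
	RevisionHash string // value of the controller-revision-hash label
}

// State is a hypothetical snapshot of what the operator observes
// at the end of one reconcile cycle.
type State struct {
	CRReplicas          int32  // replicas declared in the custom resource
	StatefulSetReplicas int32  // spec.replicas of the CN StatefulSet
	UpdateRevision      string // status.updateRevision of the CN StatefulSet
	Pods                []Pod
}

// stable reports whether the three preconditions (steps 1 to 3) hold.
func stable(s State) bool {
	// Step 1: the replicas in the custom resource and the StatefulSet agree.
	if s.CRReplicas != s.StatefulSetReplicas {
		return false
	}
	var ready int32
	for _, p := range s.Pods {
		if !p.Ready {
			continue
		}
		// Step 3: every ready pod runs the StatefulSet's update revision.
		if p.RevisionHash != s.UpdateRevision {
			return false
		}
		ready++
	}
	// Step 2: the number of ready pods equals the desired replicas.
	return ready == s.CRReplicas
}

// nodesToDrop implements step 4: it returns the FE-registered nodes that
// have no matching ready pod, or nil when the cluster is not yet stable.
// Assumption: FE node identifiers are comparable to pod names.
func nodesToDrop(s State, feNodes []string) []string {
	if !stable(s) {
		return nil
	}
	running := make(map[string]struct{}, len(s.Pods))
	for _, p := range s.Pods {
		if p.Ready {
			running[p.Name] = struct{}{}
		}
	}
	var drop []string
	for _, n := range feNodes {
		if _, ok := running[n]; !ok {
			drop = append(drop, n)
		}
	}
	sort.Strings(drop)
	return drop
}

func main() {
	s := State{
		CRReplicas: 2, StatefulSetReplicas: 2, UpdateRevision: "rev-2",
		Pods: []Pod{{"cn-0", true, "rev-2"}, {"cn-1", true, "rev-2"}},
	}
	// FE still lists cn-2 after scaling in from 3 to 2 replicas.
	fmt.Println(nodesToDrop(s, []string{"cn-0", "cn-1", "cn-2"})) // prints [cn-2]
}
```

Because the drop decision depends only on the observed state and not on whether the current reconcile was triggered by a scale-in, it can run safely in any later cycle once the pods have actually stopped.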