[Enhancement] Detect scale-in and drop CN node from FE #663

yandongxiao · 2025-07-17T03:18:12Z

Description

Fixes: #550

Background

When a scale-in operation happens:

For BE with shared-nothing deployment, the cleanup actions include:

Decommission operation (time-consuming)
Stop pod operation (need to wait for termination of BE pod)
Drop BE from SR

For CN with shared-data deployment, the cleanup actions include:

Stop pod operation (need to wait for termination of CN pod)
Drop CN from SR

Of these, Decommission and Stop pod are time-consuming, and the Operator cannot block while waiting for them to complete.

As a result, the Stop pod and Drop node cleanup actions are unlikely to run in the same reconcile loop. Therefore, we cannot rely on whether the current reconcile is handling a scale-in operation to decide when to run this logic.

How

At the end of each reconcile cycle, the operator performs the following validation steps:

1. Verify Replica Consistency

Compare the replicas field in the StarRocksCN Custom Resource Definition (CRD) with the spec.replicas field of the corresponding CN StatefulSet. These values must be identical.

2. Validate Running Pod Count

Compare the replicas field in the StarRocksCluster CRD with the number of running and ready CN pods. These values must match.

3. Confirm Revision Hash Match

Ensure that the controller-revision-hash label on all running CN pods exactly matches the status.updateRevision field of the CN StatefulSet.
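The three checks above can be sketched as a single gate function. This is a minimal illustration, not the operator's actual code: the `cnState` struct and `safeToDropNodes` function are hypothetical names, and the sketch assumes that steps 1 and 2 refer to the same desired CN replica count.

```go
package main

import "fmt"

// cnState is a hypothetical, simplified snapshot of what the operator observes.
// The real operator reads these values from Kubernetes objects.
type cnState struct {
	DesiredReplicas     int32    // replicas declared for CN in the custom resource (assumed shared by steps 1 and 2)
	StatefulSetReplicas int32    // spec.replicas of the CN StatefulSet
	UpdateRevision      string   // status.updateRevision of the CN StatefulSet
	ReadyPodRevisions   []string // controller-revision-hash label of each running and ready CN pod
}

// safeToDropNodes returns true only when all three checks pass.
// When a check fails, it also returns a short reason.
func safeToDropNodes(s cnState) (bool, string) {
	// Step 1: the StatefulSet has picked up the desired replica count.
	if s.DesiredReplicas != s.StatefulSetReplicas {
		return false, "statefulset replicas differ from desired replicas"
	}
	// Step 2: the number of running and ready pods matches the desired count.
	if int(s.DesiredReplicas) != len(s.ReadyPodRevisions) {
		return false, "ready pod count differs from desired replicas"
	}
	// Step 3: every ready pod runs the StatefulSet's update revision.
	for _, rev := range s.ReadyPodRevisions {
		if rev != s.UpdateRevision {
			return false, "a pod is still on an older revision"
		}
	}
	return true, ""
}

func main() {
	s := cnState{
		DesiredReplicas:     2,
		StatefulSetReplicas: 2,
		UpdateRevision:      "rev-a",
		ReadyPodRevisions:   []string{"rev-a", "rev-a"},
	}
	ok, reason := safeToDropNodes(s)
	fmt.Println(ok, reason)
}
```

Because the gate depends only on observed state, it gives the same answer no matter which reconcile cycle triggered the scale-in.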

4. Perform DROP COMPUTE NODE

If all three conditions are met, the operator compares the list of compute nodes registered in the Frontend (FE) cluster against the currently running CN pods. It then initiates the DROP COMPUTE NODE operation for any registered node that is no longer present in the pod list.
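The comparison in step 4 amounts to a set difference. The sketch below is illustrative only: `staleComputeNodes` and `dropStatement` are hypothetical helpers, the host names are made up, and the heartbeat port 9050 is an assumption. It builds StarRocks `ALTER SYSTEM DROP COMPUTE NODE` statements but does not connect to an FE.

```go
package main

import (
	"fmt"
	"sort"
)

// staleComputeNodes returns the hosts registered in FE that have no
// matching running CN pod. Both inputs are plain host names here; how the
// operator derives them is outside this sketch.
func staleComputeNodes(registered, podHosts []string) []string {
	running := make(map[string]struct{}, len(podHosts))
	for _, h := range podHosts {
		running[h] = struct{}{}
	}
	var stale []string
	for _, h := range registered {
		if _, ok := running[h]; !ok {
			stale = append(stale, h)
		}
	}
	sort.Strings(stale) // deterministic order for logging and tests
	return stale
}

// dropStatement builds the SQL that removes one compute node from FE.
// heartbeatPort is an assumption of this sketch.
func dropStatement(host string, heartbeatPort int) string {
	return fmt.Sprintf(`ALTER SYSTEM DROP COMPUTE NODE "%s:%d"`, host, heartbeatPort)
}

func main() {
	// Hypothetical state after scaling CN from 3 replicas down to 2.
	registered := []string{"cn-0.cn-search", "cn-1.cn-search", "cn-2.cn-search"}
	pods := []string{"cn-0.cn-search", "cn-1.cn-search"}

	for _, host := range staleComputeNodes(registered, pods) {
		fmt.Println(dropStatement(host, 9050))
	}
}
```

Running the diff only after the three checks pass avoids dropping a node whose pod is merely restarting during a rolling update.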

Checklist

For operator, please complete the following checklist:

run make generate to generate the code.
run golangci-lint run to check the code style.
run make test to run UT.
run make manifests to update the yaml files of CRD.

For helm chart, please complete the following checklist:

make sure you have updated the values.yaml file of starrocks chart.
In scripts directory, run bash create-parent-chart-values.sh to update the values.yaml file of the parent chart (kube-starrocks chart).

Signed-off-by: yandongxiao <[email protected]>


yandongxiao force-pushed the feature/detect-scale-in branch 2 times, most recently from 53c9cc0 to c5bf1b2 Compare July 21, 2025 09:20

yandongxiao changed the title ~~[Enhancement] Remove the compute node from FE~~ [Enhancement] Detect scale-in operation and drop CN node from FE Jul 21, 2025

yandongxiao changed the title ~~[Enhancement] Detect scale-in operation and drop CN node from FE~~ [Enhancement] Detect scale-in and drop CN node from FE Jul 21, 2025

yandongxiao marked this pull request as ready for review July 21, 2025 09:35

yandongxiao requested a review from kevincai July 21, 2025 09:35

yandongxiao added 2 commits July 22, 2025 15:15

[Enhancement] Detect the scale-in operation and remove CN node from FE

0ba4f8d

Signed-off-by: yandongxiao <[email protected]>

fix lint error

82fd0a9

Signed-off-by: yandongxiao <[email protected]>

yandongxiao force-pushed the feature/detect-scale-in branch from 171b9c6 to 82fd0a9 Compare July 22, 2025 07:16

[Enhancement] Add shared-data mode check and refine loop variable nam…

f1de694

…ing in CN controller Signed-off-by: yandongxiao <[email protected]>

kevincai approved these changes Jul 23, 2025


yandongxiao merged commit 2d52369 into StarRocks:main Jul 23, 2025
4 of 5 checks passed

yandongxiao mentioned this pull request Jul 28, 2025

During the scale-down process, a large number of errors appeared in the FE log #375

Closed

yandongxiao mentioned this pull request Sep 12, 2025

Compute Node does not gracefully exit and deregister from FE Node on shutdown StarRocks/starrocks#62915

Open
