@yandongxiao yandongxiao commented Jul 17, 2025

Description

Fixes: #550

Background

When a scale-in operation happens:

For BE with shared-nothing deployment, the cleanup actions include:

  1. Decommission operation (time-consuming)
  2. Stop pod operation (must wait for the BE pod to terminate)
  3. Drop BE from SR

For CN with shared-data deployment, the cleanup actions include:

  1. Stop pod operation (must wait for the CN pod to terminate)
  2. Drop CN from SR

Decommission and Stop pod are both time-consuming, and the operator cannot block while waiting for them to complete.

As a result, Stop pod and Drop node are unlikely to run in the same reconcile loop, so the operator cannot decide whether to run this cleanup logic based on whether the current reconcile pass is a scale-in.

How

At the end of each reconcile cycle, the operator performs the following validation steps:

1. Verify Replica Consistency

Compare the replicas field in the StarRocksCN Custom Resource Definition (CRD) with the spec.replicas field of the corresponding CN StatefulSet. These values must be identical.

2. Validate Running Pod Count

Compare the replicas field in the StarRocksCluster CRD with the number of running and ready CN pods. These values must match.

3. Confirm Revision Hash Match

Ensure that the controller-revision-hash label on all running CN pods exactly matches the status.updateRevision field of the CN StatefulSet.

4. Perform DROP COMPUTE NODE

If all three checks pass, the operator compares the compute nodes registered in the Frontend (FE) against the CN pods that are currently running, and issues DROP COMPUTE NODE for every registered node that no longer has a matching pod.
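The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `cnState` and `cnPod` types, the `stable` and `nodesToDrop` functions, and the idea of matching nodes by a plain address string are all assumptions made for the example. Real code would read these values from the custom resource, the CN StatefulSet, the pod list, and the FE.

```go
package main

import "sort"

// cnPod is a hypothetical view of one running and ready CN pod.
type cnPod struct {
	Address      string // address under which the pod is registered in FE (assumed form)
	RevisionHash string // value of the controller-revision-hash label
}

// cnState is a hypothetical snapshot taken at the end of a reconcile cycle.
type cnState struct {
	DesiredReplicas     int32   // replicas declared in the custom resource
	StatefulSetReplicas int32   // spec.replicas of the CN StatefulSet
	UpdateRevision      string  // status.updateRevision of the CN StatefulSet
	ReadyPods           []cnPod // running and ready CN pods
}

// stable reports whether checks 1 to 3 all pass.
func stable(s cnState) bool {
	// 1. Replica consistency between the custom resource and the StatefulSet.
	if s.DesiredReplicas != s.StatefulSetReplicas {
		return false
	}
	// 2. The number of running and ready pods matches the desired replicas.
	if int(s.DesiredReplicas) != len(s.ReadyPods) {
		return false
	}
	// 3. Every pod carries the StatefulSet's update revision.
	for _, p := range s.ReadyPods {
		if p.RevisionHash != s.UpdateRevision {
			return false
		}
	}
	return true
}

// nodesToDrop implements step 4: it returns the FE-registered nodes that
// have no matching pod, or nil when the cluster is not yet stable.
func nodesToDrop(s cnState, registered []string) []string {
	if !stable(s) {
		return nil
	}
	alive := make(map[string]struct{}, len(s.ReadyPods))
	for _, p := range s.ReadyPods {
		alive[p.Address] = struct{}{}
	}
	var stale []string
	for _, addr := range registered {
		if _, ok := alive[addr]; !ok {
			stale = append(stale, addr)
		}
	}
	sort.Strings(stale) // deterministic order for logging and tests
	return stale
}
```

Because the decision depends only on observed state, the same check gives the same answer in whichever reconcile pass it runs, which is why it does not need to know whether a scale-in triggered that pass.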

Checklist

For operator, please complete the following checklist:

  • run make generate to generate the code.
  • run golangci-lint run to check the code style.
  • run make test to run UT.
  • run make manifests to update the yaml files of CRD.

For helm chart, please complete the following checklist:

  • make sure you have updated the values.yaml file of starrocks chart.
  • In scripts directory, run bash create-parent-chart-values.sh to update the values.yaml file of the parent chart (kube-starrocks chart).

@yandongxiao yandongxiao force-pushed the feature/detect-scale-in branch 2 times, most recently from 53c9cc0 to c5bf1b2 Compare July 21, 2025 09:20
@yandongxiao yandongxiao changed the title [Enhancement] Remove the compute node from FE [Enhancement] Detect scale-in operation and drop CN node from FE Jul 21, 2025
@yandongxiao yandongxiao changed the title [Enhancement] Detect scale-in operation and drop CN node from FE [Enhancement] Detect scale-in and drop CN node from FE Jul 21, 2025
@yandongxiao yandongxiao marked this pull request as ready for review July 21, 2025 09:35
@yandongxiao yandongxiao requested a review from kevincai July 21, 2025 09:35
@yandongxiao yandongxiao force-pushed the feature/detect-scale-in branch from 171b9c6 to 82fd0a9 Compare July 22, 2025 07:16
Successfully merging this pull request may close these issues.

The FE Leader keeps reporting an UnknownHostException exception