feat: Support Reconciliation Tracing #8143

Open
wants to merge 130 commits into
base: main

Contributor

@free6om free6om commented Sep 12, 2024

Background

"Reconciliation Tracing" tracks the reconciliation workflow of an Operator or Controller and presents it as a task flow. This makes the reconciliation process easier to observe and helps identify problems. The feature can be regarded as an observability tool for Operators and Controllers.

Goals and Motivation

Reconciliation can be tracked at two extremes. At the highest level, one simply observes whether the current state of the primary resource (usually the Status section) matches the desired state (typically the Spec section).

At the other extreme, one observes the changes to the primary resource and all of its associated secondary resources to gain a comprehensive view of the entire reconciliation process. The first approach is already covered by the design of each API or Custom Resource Definition (CRD); this proposal focuses on the second.

The goal of this proposal is to track and display the reconciliation process of KubeBlocks, so that users can understand it in detail and identify issues. Specifically, the objectives are:

  1. Track and display the ongoing reconciliation process to understand detailed steps and identify bottlenecks.
  2. Predict and display upcoming reconciliation processes to grasp the future reconciliation flow.
  3. Support a dry-run feature, which answers the question: If a specific update is made to the Cluster Spec, what will the subsequent reconciliation process and results look like?

Proposal Design

Mental Model

Object Tree

The reconciliation process revolves around the object tree and the changes within it. An object tree consists of a primary object and all its associated secondary objects. For a KB Cluster, the structure of its object tree is as follows:
[image: object tree of a KB Cluster]
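
As a rough illustration of the object tree, the following Go sketch models a node as a primary object plus the subtrees of its secondaries, mirroring the `primary`/`secondaries` nesting used by `objectTree` in the API example later in this description. The type names `ObjectRef` and `ObjectTreeNode` and the `Render` helper are assumptions made for this sketch, not the types defined in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// ObjectRef identifies one object; the fields mirror objectReference in the API example.
type ObjectRef struct {
	APIVersion string
	Kind       string
	Namespace  string
	Name       string
}

// ObjectTreeNode (hypothetical name) is a primary object plus the subtrees
// rooted at each of its secondary objects.
type ObjectTreeNode struct {
	Primary     ObjectRef
	Secondaries []*ObjectTreeNode
}

// Render returns one "Kind/Name" line per object, indented two spaces per level.
func Render(root *ObjectTreeNode) string {
	var b strings.Builder
	var walk func(n *ObjectTreeNode, depth int)
	walk = func(n *ObjectTreeNode, depth int) {
		if n == nil {
			return
		}
		fmt.Fprintf(&b, "%s%s/%s\n", strings.Repeat("  ", depth), n.Primary.Kind, n.Primary.Name)
		for _, s := range n.Secondaries {
			walk(s, depth+1)
		}
	}
	walk(root, 0)
	return b.String()
}

func main() {
	tree := &ObjectTreeNode{
		Primary: ObjectRef{APIVersion: "apps.kubeblocks.io/v1", Kind: "Cluster", Namespace: "default", Name: "redis-cluster"},
		Secondaries: []*ObjectTreeNode{{
			Primary: ObjectRef{APIVersion: "apps.kubeblocks.io/v1", Kind: "Component", Namespace: "default", Name: "redis-cluster-redis"},
			Secondaries: []*ObjectTreeNode{{
				Primary: ObjectRef{APIVersion: "workloads.kubeblocks.io/v1", Kind: "InstanceSet", Namespace: "default", Name: "redis-cluster-redis"},
			}},
		}},
	}
	fmt.Print(Render(tree))
}
```

A recursive node type keeps the depth of the tree open-ended, which fits the `--depth` option of the kbcli commands described below.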

Object Change

The reconciliation process ultimately manifests as the addition, removal, or state change of nodes within the object tree. By recording these changes and presenting them along a timeline, we obtain a clear and detailed view of the reconciliation process.

Two kinds of records are captured: changes to the objects in the object tree, and the events associated with those objects.

Reconciliation Cycle

Throughout its lifecycle, from creation to eventual destruction, a Cluster undergoes multiple reconciliation loops. At the end of each loop, the Cluster either has or has not reached a terminal state. We refer to all reconciliations that occur between two consecutive terminal states as a reconciliation cycle.

Metaphorically, the object tree can be viewed as a tree in nature, and what we aim to describe is the growth and changes of this tree over the course of a year (or one growth cycle).
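
To make the definition of a reconciliation cycle concrete, here is a small Go sketch that groups a sequence of per-loop observations into cycles, closing a cycle whenever a terminal-state predicate holds. The `Observation` type, the `SplitCycles` function, and the phase names other than "Running" are assumptions for illustration; in the proposal the terminal state is decided by `stateEvaluationExpression`.

```go
package main

import "fmt"

// Observation is a hypothetical snapshot taken at the end of one reconciliation loop.
type Observation struct {
	Revision int64
	Phase    string
}

// SplitCycles groups consecutive observations into cycles. A cycle closes at
// the first observation for which isTerminal reports true.
func SplitCycles(obs []Observation, isTerminal func(Observation) bool) [][]Observation {
	var cycles [][]Observation
	var current []Observation
	for _, o := range obs {
		current = append(current, o)
		if isTerminal(o) {
			cycles = append(cycles, current)
			current = nil
		}
	}
	if len(current) > 0 {
		// The trailing cycle has not reached a terminal state yet.
		cycles = append(cycles, current)
	}
	return cycles
}

func main() {
	running := func(o Observation) bool { return o.Phase == "Running" }
	obs := []Observation{
		{1, "Creating"}, {2, "Creating"}, {3, "Running"},
		{4, "Updating"}, {5, "Running"},
		{6, "Updating"},
	}
	for i, c := range SplitCycles(obs, running) {
		fmt.Printf("cycle %d: %d loops, last phase %q\n", i+1, len(c), c[len(c)-1].Phase)
	}
}
```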

API Design

For detailed design, please refer to the pull request (PR). Below is an example:

apiVersion: trace.kubeblocks.io/v1
kind: ReconciliationTrace
metadata:
  finalizers:
  - trace.kubeblocks.io/finalizer
  name: redis-cluster-tracing
  namespace: default
spec:
  targetObject:
    name: redis-cluster
    namespace: default
  dryRun:
    desiredSpec: |
      componentSpecs:
      - name: redis
        replicas: 3
  stateEvaluationExpression:
    celExpression:
      expression: "has(object.status.phase) && object.status.phase == \"Running\""
  locale: zh_CN
status:
  currentState:
    changes:
    - changeType: Creation
      description: Creation
      objectReference:
        apiVersion: apps.kubeblocks.io/v1alpha1
        kind: Configuration
        name: redis-cluster-redis-sentinel
        namespace: default
        resourceVersion: "1020694"
        uid: 8d522d4d-8bcd-41a7-96f3-1e71af6d70d4
      revision: 1020694
      timestamp: "2024-09-29T05:59:06Z"
    - changeType: Event
      description: no available pod to execute action postProvision
      eventAttributes:
        reason: Warning
        type: Warning
      objectReference:
        apiVersion: apps.kubeblocks.io/v1
        kind: Component
        name: redis-cluster-redis
        namespace: default
        resourceVersion: "1020816"
        uid: 225108bb-7a86-4c94-ae74-6639a1c30e8e
      revision: 1072157
      timestamp: "2024-09-29T13:46:53Z"
    objectTree:
      primary:
        apiVersion: apps.kubeblocks.io/v1
        kind: Cluster
        name: redis-cluster
        namespace: default
        resourceVersion: "1020985"
        uid: b6293a70-cd42-4a03-b4ec-95268683495d
      secondaries:
      - primary:
          apiVersion: apps.kubeblocks.io/v1
          kind: Component
          name: redis-cluster-redis
          namespace: default
          resourceVersion: "1020816"
          uid: 225108bb-7a86-4c94-ae74-6639a1c30e8e
        secondaries:
        - primary:
            apiVersion: workloads.kubeblocks.io/v1
            kind: InstanceSet
            name: redis-cluster-redis
            namespace: default
            resourceVersion: "1040780"
            uid: 95adb8be-326a-44f1-807f-cbfed6f67ec5
          secondaries:
          - primary:
              apiVersion: v1
              kind: Pod
              name: redis-cluster-redis-0
              namespace: default
              resourceVersion: "1040377"
              uid: da071f7b-04e7-423c-975e-b81c7df9f063
          - primary:
              apiVersion: v1
              kind: Pod
              name: redis-cluster-redis-1
              namespace: default
              resourceVersion: "1021064"
              uid: 6dc16aae-bfcd-45f3-b991-7f58338dd84d
          - primary:
              apiVersion: v1
              kind: Service
              name: redis-cluster-redis-headless
              namespace: default
              resourceVersion: "1020732"
              uid: 0b02a3ef-4b50-4963-9c91-69556d47555e
    summary:
      objectSummaries:
      - changeSummary: {}
        objectType:
          apiVersion: apps.kubeblocks.io/v1
          kind: Cluster
        total: 1
      - changeSummary:
          added: 2
        objectType:
          apiVersion: apps.kubeblocks.io/v1
          kind: Component
        total: 2
  dryRunResult:
    desiredSpecRevision: 557db4f88f
    message: no available pod to execute action postProvision
    observedTargetGeneration: 2
    phase: Failed
    plan:
      changes:
      - changeType: Update
        description: Update
        objectReference:
          apiVersion: apps.kubeblocks.io/v1
          kind: Cluster
          name: redis-cluster
          namespace: default
          resourceVersion: "1051770"
          uid: b6293a70-cd42-4a03-b4ec-95268683495d
        revision: 1051770
        timestamp: "2024-09-29T05:59:07Z"
      - changeType: Update
        description: Update
        objectReference:
          apiVersion: apps.kubeblocks.io/v1
          kind: Component
          name: redis-cluster-redis
          namespace: default
          resourceVersion: "1051778"
          uid: 225108bb-7a86-4c94-ae74-6639a1c30e8e
        revision: 1051778
        timestamp: "2024-09-29T05:59:09Z"
      objectTree:
        primary:
          apiVersion: apps.kubeblocks.io/v1
          kind: Cluster
          name: redis-cluster
          namespace: default
          resourceVersion: "1051771"
          uid: b6293a70-cd42-4a03-b4ec-95268683495d
        secondaries:
        - primary:
            apiVersion: apps.kubeblocks.io/v1
            kind: Component
            name: redis-cluster-redis-sentinel
            namespace: default
            resourceVersion: "1020987"
            uid: caa95b87-c2d6-428b-b0c7-c4083a20d1e1
      summary:
        objectSummaries:
        - changeSummary:
            updated: 1
          objectType:
            apiVersion: apps.kubeblocks.io/v1
            kind: Cluster
          total: 1
        - changeSummary:
            updated: 1
          objectType:
            apiVersion: apps.kubeblocks.io/v1
            kind: Component
          total: 2
    reason: ReconcileError
    specDiff: "  map[string]any{\n  \t\"componentSpecs\": []any{\n- \t\tmap[string]any{\n-
      \t\t\t\"componentDef\":    string(\"redis-7-1.0.0-alpha.0\"),\n- \t\t\t\"disableExporter\":
      bool(true),\n- \t\t\t\"name\":            string(\"redis\"),\n- \t\t\t\"replicas\":
      \       int64(2),\n- \t\t\t\"resources\": map[string]any{\n- \t\t\t\t\"limits\":
      \  map[string]any{\"cpu\": string(\"500m\"), \"memory\": string(\"512Mi\")},\n-
      \t\t\t\t\"requests\": map[string]any{\"cpu\": string(\"500m\"), \"memory\":
      string(\"512Mi\")},\n- \t\t\t},\n- \t\t\t\"serviceAccountName\":   string(\"kb-redis-cluster\"),\n-
      \t\t\t\"serviceVersion\":       string(\"7.2.4\"),\n- \t\t\t\"volumeClaimTemplates\":
      []any{map[string]any{\"name\": string(\"data\"), \"spec\": map[string]any{...}}},\n-
      \t\t},\n- \t\tmap[string]any{\n- \t\t\t\"componentDef\":    string(\"redis-sentinel-7-1.0.0-alpha.0\"),\n-
      \t\t\t\"disableExporter\": bool(false),\n- \t\t\t\"name\":            string(\"redis-sentinel\"),\n-
      \t\t\t\"replicas\":        int64(3),\n- \t\t\t\"resources\": map[string]any{\n-
      \t\t\t\t\"limits\":   map[string]any{\"cpu\": string(\"500m\"), \"memory\":
      string(\"512Mi\")},\n- \t\t\t\t\"requests\": map[string]any{\"cpu\": string(\"500m\"),
      \"memory\": string(\"512Mi\")},\n- \t\t\t},\n- \t\t\t\"serviceAccountName\":
      \  string(\"kb-redis-cluster\"),\n- \t\t\t\"serviceVersion\":       string(\"7.2.4\"),\n-
      \t\t\t\"volumeClaimTemplates\": []any{map[string]any{\"name\": string(\"data\"),
      \"spec\": map[string]any{...}}},\n- \t\t},\n+ \t\tmap[string]any{\"name\": string(\"redis\"),
      \"replicas\": float64(3)},\n  \t},\n- \t\"terminationPolicy\": string(\"Delete\"),\n
      \ }\n"
  initialObjectTree:
    primary:
      apiVersion: apps.kubeblocks.io/v1
      kind: Cluster
      name: redis-cluster
      namespace: default
      resourceVersion: "1020985"
      uid: b6293a70-cd42-4a03-b4ec-95268683495d

kbcli Interaction Design

Command Line

Command Path: kbcli trace
Subcommands:

$ kbcli trace --help
list - list all traces
create <trace-name> --cluster-name [cluster-name] --depth [depth] --locale [locale] --cel-state-evaluation-expression [expression]
update <trace-name> --depth [depth] --locale [locale] --cel-state-evaluation-expression [expression]
delete <trace-name>
watch <trace-name>

$ kbcli cluster create/update --dry-run=trace

TUI Display

Within kbcli trace, the watch subcommand tracks and displays the reconciliation process in real time, rendered as a TUI inside kbcli. The layout is as follows:
[image: TUI layout of kbcli trace watch]

Implementation of the Proposal

Reconciliation Process Tracking

Tracking the reconciliation process involves monitoring all primary and secondary resources related to a KB Cluster, as well as associated events, and meticulously recording each change. The basic flow is as follows:
[image: reconciliation process tracking flow]

Reconciliation Process Prediction

Prediction refers to forecasting all reconciliation steps that will occur from the current state until a terminal state is reached. The format and approach are similar to process tracking: the prediction is described through changes in the object tree.

The key difference is that the prediction mechanism builds a simulated environment. The current state (the Cluster and all its secondary objects) serves as the input, and the simulation runs until it reaches a terminal state. All object changes that occur during the simulation are recorded as the predicted reconciliation process. The basic flow is as follows:
[image: reconciliation process prediction flow]
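
The simulate-until-terminal idea can be sketched in a few lines of Go. This is a toy model under stated assumptions, not the PR's implementation: `State`, `PlannedChange`, `Predict`, `scaleOnePod`, and `converged` are all hypothetical names, the "reconciler" only creates or deletes one Pod per loop, and the `maxSteps` guard is an addition of this sketch to keep a non-converging simulation from looping forever.

```go
package main

import "fmt"

// State is a toy stand-in for the Cluster and its secondary objects.
type State struct {
	DesiredReplicas int
	Pods            int
}

// PlannedChange is one predicted step of the reconciliation process.
type PlannedChange struct {
	Step        int
	Description string
}

// Predict applies reconcile to a private copy of start until terminal reports
// true or maxSteps is exhausted. It returns the final simulated state, the
// recorded changes, and whether a terminal state was reached.
func Predict(start State, reconcile func(State) (State, string), terminal func(State) bool, maxSteps int) (State, []PlannedChange, bool) {
	s := start // State is a value type, so the caller's state is never mutated
	var plan []PlannedChange
	for step := 1; step <= maxSteps && !terminal(s); step++ {
		var desc string
		s, desc = reconcile(s)
		plan = append(plan, PlannedChange{Step: step, Description: desc})
	}
	return s, plan, terminal(s)
}

// scaleOnePod is a toy reconciler: it creates or deletes one Pod per loop.
func scaleOnePod(s State) (State, string) {
	if s.Pods < s.DesiredReplicas {
		s.Pods++
		return s, fmt.Sprintf("create Pod %d", s.Pods-1)
	}
	s.Pods--
	return s, fmt.Sprintf("delete Pod %d", s.Pods)
}

// converged reports whether the simulated state is terminal.
func converged(s State) bool { return s.Pods == s.DesiredReplicas }

func main() {
	current := State{DesiredReplicas: 2, Pods: 2}

	// Dry run: apply the desired spec to the simulated copy only.
	simulated := current
	simulated.DesiredReplicas = 3

	final, plan, ok := Predict(simulated, scaleOnePod, converged, 10)
	for _, c := range plan {
		fmt.Printf("step %d: %s\n", c.Step, c.Description)
	}
	fmt.Printf("terminal=%v simulatedPods=%d liveDesiredReplicas=%d\n", ok, final.Pods, current.DesiredReplicas)
}
```

Because the simulation works on a copy, the live state is left untouched, which is the property a dry run depends on.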

Predictions occur in two scenarios:

  1. When there is a change in the Cluster Spec, initiating a new reconciliation cycle, the upcoming reconciliation process is predicted and updated in trace.status.desiredState.
  2. During a dry run, the desired specification from the dry run is first applied to the current Cluster object in the simulated environment, after which the subsequent reconciliation process is predicted and updated in trace.status.dryRunResult.

kubectl Display

In kubectl, the current reconciliation progress can be viewed in real time using the command:

kubectl get trace <name> -w

For example:

$ kubectl get trace redis -w
NAME    AGE     TARGET_NS   TARGET_NAME     API_VERSION             KIND        NAMESPACE   NAME                  CHANGE
redis   5d21h   default     redis-cluster   apps.kubeblocks.io/v1   Component   default     redis-cluster-redis   no available pod to execute action postProvision

Screenshot:
[image: kubectl get trace watch output]

kbcli Display

kbcli can display more detailed information. For example, the following shows the tracking of the reconciliation process for a Redis cluster:
[image: kbcli trace of a Redis cluster]

@free6om free6om added this to the Release 1.0.0 milestone Sep 12, 2024
@free6om free6om self-assigned this Sep 12, 2024
@github-actions github-actions bot added the size/XXL Denotes a PR that changes 1000+ lines. label Sep 12, 2024
Contributor Author

free6om commented Sep 12, 2024

why use these two markers:

// +kubebuilder:pruning:PreserveUnknownFields
// +kubebuilder:validation:Schemaless


codecov bot commented Sep 12, 2024

Codecov Report

Attention: Patch coverage is 63.01887% with 980 lines in your changes missing coverage. Please review.

Project coverage is 60.63%. Comparing base (fe24a3a) to head (0ee3972).
Report is 6 commits behind head on main.

Files with missing lines                      Patch %  Lines
controllers/trace/reconciler_tree.go          63.79%   168 Missing and 46 partials ⚠️
controllers/trace/util.go                     74.16%   110 Missing and 52 partials ⚠️
controllers/trace/informer_manager.go         0.00%    119 Missing ⚠️
controllers/trace/mock_client.go              59.42%   71 Missing and 28 partials ⚠️
controllers/trace/desired_state_handler.go    53.62%   46 Missing and 18 partials ⚠️
controllers/trace/current_state_handler.go    57.25%   39 Missing and 14 partials ⚠️
controllers/trace/plan_generator.go           58.18%   33 Missing and 13 partials ⚠️
controllers/trace/dry_run_handler.go          57.94%   31 Missing and 14 partials ⚠️
controllers/trace/object_tree_root_finder.go  61.29%   28 Missing and 8 partials ⚠️
controllers/trace/type.go                     66.19%   16 Missing and 8 partials ⚠️
... and 12 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8143      +/-   ##
==========================================
+ Coverage   60.35%   60.63%   +0.28%     
==========================================
  Files         357      376      +19     
  Lines       42555    45195    +2640     
==========================================
+ Hits        25684    27404    +1720     
- Misses      14559    15241     +682     
- Partials     2312     2550     +238     
Flag Coverage Δ
unittests 60.63% <63.01%> (+0.28%) ⬆️

@free6om free6om marked this pull request as ready for review October 21, 2024 07:35
@free6om free6om requested review from JashBook, ahjing99 and a team as code owners October 21, 2024 07:35
@free6om free6om changed the title feat: Support Reconciliation Trace feat: Support Reconciliation Tracing Oct 21, 2024
Contributor

@zjx20 zjx20 left a comment

To be continued.

controllers/trace/type.go (outdated, resolved)
controllers/trace/type.go (outdated, resolved)
controllers/trace/util.go (outdated, resolved)
controllers/trace/util.go (resolved)
controllers/trace/util.go (resolved)
controllers/trace/change_capture_store.go (resolved)
controllers/trace/change_capture_store.go (resolved)
controllers/trace/object_revision_store.go (outdated, resolved)
controllers/trace/object_revision_store.go (outdated, resolved)
controllers/trace/informer_manager.go (outdated, resolved)
controllers/trace/mock_client.go (resolved)
controllers/trace/mock_client.go (outdated, resolved)
controllers/trace/mock_client.go (outdated, resolved)
controllers/trace/mock_client.go (resolved)
controllers/trace/reconciler_tree.go (outdated, resolved)
controllers/trace/reconciler_tree.go (outdated, resolved)
controllers/trace/reconciler_tree.go (outdated, resolved)
controllers/trace/desired_state_handler.go (outdated, resolved)
apis/trace/v1/reconciliationtrace_types.go (resolved)
- // Unmarshal the patched JSON back into the object
- return json.Unmarshal(patchedJSON, obj)
+ // Unmarshal the patched JSON into a new clean object
+ newObj := obj.DeepCopyObject().(client.Object)
You can use reflection to create a new zeroed object directly, which avoids this deep copy.

@@ -1865,6 +1865,8 @@ controllers:
     enabled: true
   experimental:
     enabled: false
+  trace:
+    enabled: true
How about setting it to false for now and re-enabling it when the performance issue is solved?
