Skip to content

Fix race condition in cardinality status updates#10

Closed
haarchri wants to merge 2 commits intokubernetes-sigs:mainfrom
haarchri:fix/race-cardinality-status
Closed

Fix race condition in cardinality status updates#10
haarchri wants to merge 2 commits intokubernetes-sigs:mainfrom
haarchri:fix/race-cardinality-status

Conversation

@haarchri
Copy link
Copy Markdown
Contributor

@haarchri haarchri commented Mar 4, 2026

What does this PR do?

stacked on top of #9

  • The cardinality goroutine and emitSuccess both do Get-Modify-UpdateStatus on the same RMM resource concurrently. With fake clients (no optimistic concurrency), the last writer silently overwrites the other's changes, resulting in Cardinality: nil in tests.
  • Add an emitDone channel in handleEvent that gets closed after emitSuccess completes. The cardinality goroutine waits on this channel before calling persistCardinalityStatus, serializing the two status updates.
  • Add synced atomic.Bool to StoreType, set to true after Replace() finishes. The cardinality goroutine now polls for IsSynced() on all stores instead of storeTotal > 0, so cardinality is calculated after the reflector has listed all objects rather than after just the first one shows up.
  • Rewrite validateStatusOutput in the golden rule tests to poll until the full expected status matches using cmp.Diff, instead of returning as soon as the Processed condition exists. The cardinality update is async by design so the test needs to wait for it.

Why is this change needed?

failing e2e tests

Fixes #

Checklist

  • Tests added or updated
  • Documentation updated (if applicable)
  • make verify_codegen passes (if generated code was changed)
  • make lint passes

Test

make test_e2e
=== RUN   TestCardinalityMetrics
=== PAUSE TestCardinalityMetrics
=== RUN   TestGoldenRules
=== PAUSE TestGoldenRules
=== CONT  TestGoldenRules
=== CONT  TestCardinalityMetrics
I0304 15:09:37.657849     603 collector.go:77] "Registered external collectors" collectors=[]
I0304 15:09:37.658142     603 collector.go:77] "Registered external collectors" collectors=[]
=== RUN   TestGoldenRules/unstructured
=== PAUSE TestGoldenRules/unstructured
=== RUN   TestGoldenRules/cel
=== PAUSE TestGoldenRules/cel
=== RUN   TestGoldenRules/starlark
=== PAUSE TestGoldenRules/starlark
=== CONT  TestGoldenRules/unstructured
=== CONT  TestGoldenRules/cel
=== CONT  TestGoldenRules/starlark
=== RUN   TestGoldenRules/starlark/resourcemetricsmonitor-basic
=== RUN   TestGoldenRules/unstructured/resourcemetricsmonitor-cardinality-basic
=== RUN   TestGoldenRules/cel/resourcemetricsmonitor-cardinality
=== RUN   TestCardinalityMetrics/TelemetryMetrics
=== PAUSE TestCardinalityMetrics/TelemetryMetrics
=== RUN   TestCardinalityMetrics/StatusUpdate
=== PAUSE TestCardinalityMetrics/StatusUpdate
=== CONT  TestCardinalityMetrics/TelemetryMetrics
=== CONT  TestCardinalityMetrics/StatusUpdate
E0304 15:09:42.754916     603 server.go:220] "err" err="[error gathering metrics: process metrics not supported on this platform]" source="self"
=== NAME  TestCardinalityMetrics/TelemetryMetrics
    cardinality_test.go:179: Found cardinality metric: resource_state_metrics_events_processed_total{event_type="addEvent",name="cardinality-telemetry-test",namespace="default",status="success"} 1
    cardinality_test.go:179: Found cardinality metric: resource_state_metrics_family_cardinality{family="cardinality_telemetry_test",name="cardinality-telemetry-test",namespace="default",store=""} 2
    cardinality_test.go:179: Found cardinality metric: resource_state_metrics_family_cardinality_limit{family="cardinality_telemetry_test",name="cardinality-telemetry-test",namespace="default",store=""} 500
    cardinality_test.go:179: Found cardinality metric: resource_state_metrics_global_cardinality 2
    cardinality_test.go:179: Found cardinality metric: resource_state_metrics_global_cardinality_limit 0
    cardinality_test.go:179: Found cardinality metric: resource_state_metrics_resource_cardinality{name="cardinality-telemetry-test",namespace="default"} 2
    cardinality_test.go:179: Found cardinality metric: resource_state_metrics_resource_cardinality_limit{name="cardinality-telemetry-test",namespace="default"} 100000
    cardinality_test.go:179: Found cardinality metric: resource_state_metrics_resources_monitored_info{name="cardinality-telemetry-test",namespace="default"} 1
    cardinality_test.go:179: Found cardinality metric: resource_state_metrics_store_cardinality{name="cardinality-telemetry-test",namespace="default",store="samplecontroller.k8s.io/v1beta1/Bar"} 2
    cardinality_test.go:179: Found cardinality metric: resource_state_metrics_store_cardinality_limit{name="cardinality-telemetry-test",namespace="default",store="samplecontroller.k8s.io/v1beta1/Bar"} 1000
    cardinality_test.go:131: cardinality_exceeded_total metric type declaration not found (this is expected if no thresholds were exceeded)
=== RUN   TestGoldenRules/cel/resourcemetricsmonitor-map
=== RUN   TestGoldenRules/unstructured/resourcemetricsmonitor-cardinality-family-cutoff
=== RUN   TestGoldenRules/starlark/resourcemetricsmonitor-nested
=== NAME  TestCardinalityMetrics/StatusUpdate
    cardinality_test.go:136: Cardinality status found: Total=2, ThresholdsExceeded=false
    cardinality_test.go:136: Per-store cardinality: map[samplecontroller.k8s.io/v1beta1/Bar:2]
    cardinality_test.go:136: Per-family cardinality: map[cardinality_telemetry_test:2]
    cardinality_test.go:136: Condition: Type=Processed, Status=True, Reason=EventHandlerSucceeded
    cardinality_test.go:136: Condition: Type=CardinalityCutoff, Status=False, Reason=CardinalityOK
    cardinality_test.go:136: Condition: Type=CardinalityWarning, Status=False, Reason=CardinalityOK
--- PASS: TestCardinalityMetrics (6.08s)
    --- PASS: TestCardinalityMetrics/TelemetryMetrics (0.01s)
    --- PASS: TestCardinalityMetrics/StatusUpdate (5.00s)
=== RUN   TestGoldenRules/unstructured/resourcemetricsmonitor-cardinality-high-limit
=== RUN   TestGoldenRules/starlark/resourcemetricsmonitor-quantity
=== RUN   TestGoldenRules/unstructured/resourcemetricsmonitor-cardinality-store-cutoff
=== RUN   TestGoldenRules/unstructured/resourcemetricsmonitor-cardinality-warning
=== RUN   TestGoldenRules/unstructured/resourcemetricsmonitor-counter
=== RUN   TestGoldenRules/unstructured/resourcemetricsmonitor-gauge
--- PASS: TestGoldenRules (1.08s)
    --- PASS: TestGoldenRules/cel (10.22s)
        --- PASS: TestGoldenRules/cel/resourcemetricsmonitor-cardinality (5.11s)
        --- PASS: TestGoldenRules/cel/resourcemetricsmonitor-map (5.11s)
    --- PASS: TestGoldenRules/starlark (15.74s)
        --- PASS: TestGoldenRules/starlark/resourcemetricsmonitor-basic (5.51s)
        --- PASS: TestGoldenRules/starlark/resourcemetricsmonitor-nested (5.11s)
        --- PASS: TestGoldenRules/starlark/resourcemetricsmonitor-quantity (5.11s)
    --- PASS: TestGoldenRules/unstructured (35.79s)
        --- PASS: TestGoldenRules/unstructured/resourcemetricsmonitor-cardinality-basic (5.11s)
        --- PASS: TestGoldenRules/unstructured/resourcemetricsmonitor-cardinality-family-cutoff (5.11s)
        --- PASS: TestGoldenRules/unstructured/resourcemetricsmonitor-cardinality-high-limit (5.11s)
        --- PASS: TestGoldenRules/unstructured/resourcemetricsmonitor-cardinality-store-cutoff (5.11s)
        --- PASS: TestGoldenRules/unstructured/resourcemetricsmonitor-cardinality-warning (5.12s)
        --- PASS: TestGoldenRules/unstructured/resourcemetricsmonitor-counter (5.12s)
        --- PASS: TestGoldenRules/unstructured/resourcemetricsmonitor-gauge (5.12s)
PASS
ok      github.com/kubernetes-sigs/resource-state-metrics/tests (cached)
?       github.com/kubernetes-sigs/resource-state-metrics/tests/framework       [no test files]
make lint 
✔ 0 errors, 0 warnings and 0 suggestions in 12 files.
0 issues.

haarchri added 2 commits March 4, 2026 15:46
Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
race condition in cardinality status updates

Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented Mar 4, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haarchri
Once this PR has been reviewed and has the lgtm label, please assign mrueg for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 4, 2026
@rexagod
Copy link
Copy Markdown
Member

rexagod commented Mar 9, 2026

=== NAME  TestGoldenRules/starlark/resourcemetricsmonitor-basic
    goldenrules_test.go:239: Status mismatch (-expected +actual):
          &v1alpha1.ResourceMetricsMonitorStatus{
          	Conditions: []v1.Condition{
          		{Type: "Processed", Status: "True", Reason: "EventHandlerSucceeded"},
        - 		{Type: "CardinalityCutoff", Status: "False", Reason: "CardinalityOK"},
        - 		{Type: "CardinalityWarning", Status: "False", Reason: "CardinalityOK"},
          	},
        - 	Cardinality: &v1alpha1.CardinalityStatus{Total: 2},
        + 	Cardinality: nil,
          }
      

Encountered this, thank you for raising the patch. Reviewing.

@rexagod
Copy link
Copy Markdown
Member

rexagod commented Mar 10, 2026

Done in e04a24f.

Thank you for raising this!

@rexagod rexagod closed this Mar 10, 2026
@k8s-ci-robot
Copy link
Copy Markdown

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants