feat: add optional Prometheus monitoring via OperatorConfig by bennyz · Pull Request #251 · centos-automotive-suite/automotive-dev-operator

bennyz · 2026-04-23T08:43:37Z

Add MonitoringConfig to OperatorConfigSpec that deploys a ServiceMonitor, metrics token Secret, Role, and ClusterRoleBinding when enabled

Sample Perses dashboard for reference

Summary

Related Issues

Type of Change

Testing

Unit tests pass (make test)
Linter passes (make lint)
Manifests are up to date (make manifests generate)
Tested on OpenShift cluster (if applicable)

Summary by CodeRabbit

New Features
- Configurable Prometheus monitoring (enable/disable, scrape interval with 30s default) with automatic ServiceMonitor deployment and status reporting.
- Automatic RBAC and token setup for secure metrics scraping.
- New metric exposing active in-progress builds.
Tests
- Added tests validating monitoring resource generation and active-builds gauge behavior.
Samples
- Added Perses dashboard and console UIPlugin samples for observability.

coderabbitai · 2026-04-23T08:43:50Z

📝 Walkthrough

Walkthrough

Adds optional Prometheus monitoring to OperatorConfig (new MonitoringConfig API and status flag), controller logic to deploy/cleanup ServiceMonitor, token Secret and RBAC, builder helpers and tests, sample manifests for Perses/UIPlugin, and imagebuild metrics/gauge with startup initialization and phase-bound updates.

Changes

Cohort / File(s)	Summary
API Types & Deepcopy `api/v1alpha1/operatorconfig_types.go`, `api/v1alpha1/zz_generated.deepcopy.go`	Add `MonitoringConfig` (Enabled, Interval + `GetInterval()`); add `OperatorConfigSpec.Monitoring` and `OperatorConfigStatus.MonitoringEnabled`; generated deepcopy for `MonitoringConfig`.
OperatorConfig Controller `internal/controller/operatorconfig/controller.go`	Reconcile manages monitoring lifecycle: when enabled create metrics token Secret, Role/RoleBinding/ClusterRoleBinding and a `monitoring.coreos.com/v1` ServiceMonitor; when disabled or deleting, delete ServiceMonitor (ignore NotFound/CRD-absent) and clean up Secret/RBAC; update `status.monitoringEnabled` and logs; RBAC extended for servicemonitors.
Resource Builders & Tests `internal/controller/operatorconfig/resources.go`, `internal/controller/operatorconfig/resources_test.go`	Add builders for metrics token Secret, namespace Role + RoleBinding, ClusterRoleBinding, and an unstructured ServiceMonitor (https /metrics, bearer token, tls.insecureSkipVerify, interval from MonitoringConfig with 30s default); tests validate GVK, selector, endpoint, bearer token ref, interval defaulting and RBAC rules/subjects.
Samples / Manifests `config/samples/automotive_v1_operatorconfig.yaml`, `config/samples/monitoring/perses-dashboard.yaml`, `config/samples/monitoring/uiplugin.yaml`	Add commented `spec.monitoring` sample; add Perses dashboard manifest and OpenShift UIPlugin manifest for Perses/Observability integration.
ImageBuild Metrics & Controller `internal/controller/imagebuild/controller.go`, `internal/controller/imagebuild/metrics.go`, `internal/controller/imagebuild/metrics_test.go`	Add `ActiveBuilds` Prometheus gauge, register metric, helper `adjustActiveBuildsGauge` and tests; update controller to adjust gauge on phase transitions and initialize gauge at startup by listing existing ImageBuilds.

Sequence Diagram

sequenceDiagram
    participant OC as OperatorConfig Controller
    participant K8s as Kubernetes API
    participant Prom as Prometheus

    rect rgba(100,150,200,0.5)
        Note over OC,K8s: Enable monitoring
        OC->>K8s: Create metrics token Secret
        OC->>K8s: Create Role / RoleBinding
        OC->>K8s: Create ClusterRoleBinding
        OC->>K8s: Create/Update ServiceMonitor
        OC->>K8s: Update status.monitoringEnabled = true
    end

    rect rgba(200,150,100,0.5)
        Note over Prom,K8s: Prometheus scraping
        Prom->>K8s: Discover ServiceMonitor
        Prom->>K8s: Read token Secret
        Prom->>K8s: GET /metrics (authenticated)
    end

    rect rgba(200,100,100,0.5)
        Note over OC,K8s: Disable / Cleanup monitoring
        OC->>K8s: Delete ServiceMonitor (ignore NotFound/CRD-absent)
        OC->>K8s: Delete token Secret
        OC->>K8s: Delete Role / RoleBinding / ClusterRoleBinding
        OC->>K8s: Update status.monitoringEnabled = false
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

initial jumpstarter integration #31: Modifies OperatorConfig API types and deepcopy logic similar to this PR.

Suggested reviewers

bkhizgiy

Poem

🐰 I stitched a metric, soft and neat,

Secrets tucked where tokens meet.
ServiceMonitors hum at night,
RBAC set so scrapers bite.
A happy rabbit hops in light.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: adding optional Prometheus monitoring via OperatorConfig, which is the core feature across all modified files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal/controller/operatorconfig/controller.go (1)

235-238: ⚠️ Potential issue | 🟠 Major

Clean up monitoring resources during OperatorConfig deletion.

The finalizer only calls cleanupOSBuilds, but deployServiceMonitor creates a Secret, Role, RoleBinding, ClusterRoleBinding, and ServiceMonitor without owner references. Deleting the OperatorConfig will orphan those resources.

Proposed fix

 		log.Info("Handling deletion")
 		if err := r.cleanupOSBuilds(ctx, config); err != nil {
 			log.Error(err, "Failed to cleanup OSBuilds")
 			return ctrl.Result{}, err
 		}
+		if err := r.cleanupServiceMonitor(ctx, config); err != nil {
+			log.Error(err, "Failed to cleanup ServiceMonitor")
+			return ctrl.Result{}, err
+		}
 		log.Info("Removing finalizer")

Also applies to: 1031-1054

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/controller.go` around lines 235 - 238, The
finalizer only calls cleanupOSBuilds but never removes monitoring resources
created by deployServiceMonitor, so implement and call a cleanup function (e.g.,
r.cleanupMonitoringResources(ctx, config)) alongside r.cleanupOSBuilds in the
deletion path to delete the Secret, Role, RoleBinding, ClusterRoleBinding and
ServiceMonitor that were created without ownerRefs; locate the deletion logic
near where cleanupOSBuilds is invoked (the finalizer block) and ensure
cleanupMonitoringResources uses the same client/context to find and delete the
resources by the same names/labels used in deployServiceMonitor so they are not
orphaned on OperatorConfig deletion.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/v1alpha1/operatorconfig_types.go`:
- Around line 445-448: The Interval field on MonitoringConfig currently lacks
CRD-level validation; add a kubebuilder validation pattern annotation above the
Interval field (on MonitoringConfig.Interval) using the Prometheus duration
regex provided (e.g.,
^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$)
so invalid durations are rejected by the CRD, then run make generate manifests
to regenerate the CRD artifacts.

In `@internal/controller/operatorconfig/controller.go`:
- Around line 299-314: The reconcile logic is swallowing errors from
deployServiceMonitor and cleanupServiceMonitor; change the Reconcile flow so
that when r.deployServiceMonitor(ctx, config) or r.cleanupServiceMonitor(ctx,
config) returns an error you return that error (do not just log it), and only
modify config.Status.MonitoringEnabled when the operation succeeded; keep the
existing statusChanged handling but ensure you propagate the original error
instead of continuing to success so the controller can retry. Reference the
deployServiceMonitor and cleanupServiceMonitor calls and the
config.Status.MonitoringEnabled updates to locate where to return the error and
guard status mutation.

---

Outside diff comments:
In `@internal/controller/operatorconfig/controller.go`:
- Around line 235-238: The finalizer only calls cleanupOSBuilds but never
removes monitoring resources created by deployServiceMonitor, so implement and
call a cleanup function (e.g., r.cleanupMonitoringResources(ctx, config))
alongside r.cleanupOSBuilds in the deletion path to delete the Secret, Role,
RoleBinding, ClusterRoleBinding and ServiceMonitor that were created without
ownerRefs; locate the deletion logic near where cleanupOSBuilds is invoked (the
finalizer block) and ensure cleanupMonitoringResources uses the same
client/context to find and delete the resources by the same names/labels used in
deployServiceMonitor so they are not orphaned on OperatorConfig deletion.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5c653467-8189-443a-b3bf-f33d42cb92b6

📥 Commits

Reviewing files that changed from the base of the PR and between 5ade329 and fd146d6.

⛔ Files ignored due to path filters (2)

config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yaml is excluded by !config/crd/bases/**
config/rbac/role.yaml is excluded by !config/rbac/role.yaml

📒 Files selected for processing (5)

api/v1alpha1/operatorconfig_types.go
api/v1alpha1/zz_generated.deepcopy.go
internal/controller/operatorconfig/controller.go
internal/controller/operatorconfig/resources.go
internal/controller/operatorconfig/resources_test.go

bennyz · 2026-04-23T09:15:36Z

@ambient-code please review

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/samples/monitoring/perses-dashboard.yaml`:
- Line 77: The panel query currently divides two raw sums and can yield NaN when
no samples exist; update the PromQL in the query field that uses ado_build_total
(status="success") and ado_build_total (total) to clamp the denominator and
provide a zero fallback—e.g. wrap the denominator with clamp_min(...,1) and
ensure the numerator falls back to 0 if empty—so replace the existing expression
with one that uses
clamp_min(sum(ado_build_total{namespace="automotive-dev-operator-system"}), 1)
for the denominator and a safe numerator
(sum(ado_build_total{namespace="automotive-dev-operator-system",status="success"})
or 0) to avoid NaN.

In `@internal/controller/imagebuild/controller.go`:
- Around line 2317-2323: ActiveBuilds can get out of sync because some code
paths patch status directly instead of using updateStatus; ensure the gauge is
always consistent by either (preferred) recomputing ActiveBuilds at startup and
after any status change from cluster state (query ImageBuild objects and set the
gauge to the count where phase == phaseBuilding) or by routing every phase
transition through a single helper that updates status and adjusts ActiveBuilds
atomically (create a transition helper used by updateStatus and the direct patch
sites), and seed the gauge on controller startup from current ImageBuilds to
avoid negative counts; refer to ActiveBuilds, updateStatus and phaseBuilding to
locate the places to change.

In `@internal/controller/operatorconfig/resources.go`:
- Around line 1177-1179: The current metrics scrape config sets "tlsConfig" with
"insecureSkipVerify: true" — replace this by loading the cluster/serving CA
bundle and wiring it into the "tlsConfig" so TLS is validated; specifically,
remove the insecureSkipVerify entry and instead populate tlsConfig with the CA
data (e.g., a "caData" or "caFile" field) sourced from the serving CA (the
OpenShift serving CA ConfigMap/secret or the injected service-ca bundle) before
returning the structure in the code that builds the metrics scrape config (look
for the map literal keyed "tlsConfig" in this file). Ensure the CA is correctly
base64-decoded or passed in the expected form for the downstream consumer and
handle errors if the CA cannot be retrieved.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eb20d32e-fc58-4b1f-8736-4eeaf2bfbac9

📥 Commits

Reviewing files that changed from the base of the PR and between fd146d6 and 68ceb6c.

⛔ Files ignored due to path filters (2)

config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yaml is excluded by !config/crd/bases/**
config/rbac/role.yaml is excluded by !config/rbac/role.yaml

📒 Files selected for processing (10)

api/v1alpha1/operatorconfig_types.go
api/v1alpha1/zz_generated.deepcopy.go
config/samples/automotive_v1_operatorconfig.yaml
config/samples/monitoring/perses-dashboard.yaml
config/samples/monitoring/uiplugin.yaml
internal/controller/imagebuild/controller.go
internal/controller/imagebuild/metrics.go
internal/controller/operatorconfig/controller.go
internal/controller/operatorconfig/resources.go
internal/controller/operatorconfig/resources_test.go

✅ Files skipped from review due to trivial changes (3)

config/samples/automotive_v1_operatorconfig.yaml
config/samples/monitoring/uiplugin.yaml
internal/controller/operatorconfig/resources_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

api/v1alpha1/operatorconfig_types.go

coderabbitai · 2026-04-23T10:27:42Z

+				"tlsConfig": map[string]interface{}{
+					"insecureSkipVerify": true,
+				},


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check if the file exists and examine the lines in question wc -l internal/controller/operatorconfig/resources.go

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 143

🏁 Script executed:

# Read the specific lines and surrounding context (1170-1190) to understand the usage sed -n '1170,1190p' internal/controller/operatorconfig/resources.go

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 393

🏁 Script executed:

# Let's get more context to understand what this configuration is for sed -n '1150,1200p' internal/controller/operatorconfig/resources.go

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 979

🏁 Script executed:

# Find the function containing this code to understand context better sed -n '1100,1185p' internal/controller/operatorconfig/resources.go | head -100

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 2329

🏁 Script executed:

# Search for the function name and look for any CA configurations rg "serviceMonitorTokenSecret|caBundle|caCert|servingCA" internal/controller/operatorconfig/ -A 2 -B 2

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 3229

🏁 Script executed:

# Check if there are other ServiceMonitor creations with CA configurations rg "insecureSkipVerify|ca.*Config|caFile" internal/controller/operatorconfig/ -B 3 -A 3

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 1434

Wire the serving CA into tlsConfig instead of disabling TLS verification for metrics scraping.

insecureSkipVerify: true makes bearer-token-authenticated metrics scraping vulnerable to endpoint impersonation within the cluster. In an OpenShift environment, the serving CA is available and should be properly configured in tlsConfig to verify the metrics endpoint's certificate.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/controller/operatorconfig/resources.go` around lines 1177 - 1179, The current metrics scrape config sets "tlsConfig" with "insecureSkipVerify: true" — replace this by loading the cluster/serving CA bundle and wiring it into the "tlsConfig" so TLS is validated; specifically, remove the insecureSkipVerify entry and instead populate tlsConfig with the CA data (e.g., a "caData" or "caFile" field) sourced from the serving CA (the OpenShift serving CA ConfigMap/secret or the injected service-ca bundle) before returning the structure in the code that builds the metrics scrape config (look for the map literal keyed "tlsConfig" in this file). Ensure the CA is correctly base64-decoded or passed in the expected form for the downstream consumer and handle errors if the CA cannot be retrieved.

ambient-code

Code Review: feat: add optional Prometheus monitoring via OperatorConfig

Overall this is a solid PR that follows the existing patterns in the codebase well. The unstructured approach for ServiceMonitor is the right call to avoid a hard dependency on the Prometheus Operator CRDs, and the monitoring/RBAC resources are correctly scoped. The ActiveBuilds gauge and seeder are nice additions.

I found one bug (double-decrement of ActiveBuilds gauge), a few items worth addressing, and some minor suggestions.

Summary of findings

Bug -- double gauge decrement on build completion (blocking): checkBuildProgress calls adjustActiveBuildsGauge(phaseBuilding, phaseCompleted) directly for the success path, but the Cancelled and Failed paths go through updateStatus, which also calls adjustActiveBuildsGauge. This is correctly partitioned (no overlap). However, I noticed that the Completed path in checkBuildProgress sets fresh.Status.Phase = phaseCompleted and then patches, but it does NOT go through updateStatus. On the next reconciliation triggered by the status patch, checkBuildProgress might be called again if the PipelineRun is still in the completed state. The guard at the top of checkBuildProgress should prevent re-processing, but the gauge call at line 608 happens before the event/return. Worth verifying the guard ensures this path cannot be entered twice.
Monitoring is independent of OSBuilds (design observation): Monitoring reconciliation runs regardless of whether OSBuilds is enabled. This seems intentional and correct -- the operator itself always exposes metrics. Just confirming this is the desired behavior.
ClusterRoleBinding references a pre-existing metrics-reader ClusterRole: The buildMetricsReaderClusterRoleBinding binds to a ClusterRole named metrics-reader which exists in config/rbac/metrics_reader_role.yaml and is deployed via kustomize. This means the operator creates the binding at runtime but relies on the ClusterRole having been deployed via kustomize at install time. This coupling should be documented.
Test coverage: Good unit tests for the resource builders and the gauge function. Missing a test for GetInterval() with nil receiver, and no integration test for the reconciliation flow (deploy + cleanup), though that may be acceptable for a first iteration.
Interval regex allows empty string: The pattern ^(0|(([0-9]+)y)?... starts with 0|, meaning a bare 0 is valid. This is fine per the Prometheus duration spec but worth noting.

See inline comments for specifics.

ambient-code · 2026-04-23T20:04:46Z

@@ -605,6 +606,7 @@ func (r *ImageBuildReconciler) checkBuildProgress(
 			log.Error(err, "Failed to patch status to Completed")
 			return ctrl.Result{}, err
 		}


Potential double-decrement on re-reconciliation. This adjustActiveBuildsGauge(phaseBuilding, phaseCompleted) call runs every time checkBuildProgress reaches this code path. If the status patch at line 604 triggers a re-reconciliation (which it will, since the status changed), and the PipelineRun is still in the completed state, this function will be called again, decrementing the gauge a second time.

The success path sets fresh.Status.Phase = phaseCompleted before patching, but checkBuildProgress is called when the build's phase is Building. After the first successful pass, the build's phase will be Completed, so subsequent reconciliations should not enter checkBuildProgress at all (the caller likely gates on phase == Building). However, this relies on that external guard existing and being correct.

I would recommend moving the gauge adjustment inside a guard that checks the old phase was actually Building, similar to what updateStatus does:

if imageBuild.Status.Phase == phaseBuilding { adjustActiveBuildsGauge(phaseBuilding, phaseCompleted) }

Alternatively, could you confirm the caller already guards checkBuildProgress to only be invoked when imageBuild.Status.Phase == phaseBuilding?

ambient-code · 2026-04-23T20:04:47Z

@@ -1891,6 +1897,28 @@ func (r *ImageBuildReconciler) SetupWithManager(mgr ctrl.Manager) error {
 	return builder.Complete(r)


Good use of manager.RunnableFunc to seed the gauge on startup. This ensures the gauge reflects reality after a controller restart.

One edge case: if the List call fails (e.g., RBAC issue during startup), this logs the error but returns nil, so the manager won't retry and ActiveBuilds stays at 0. Consider returning the error so the manager retries, or at minimum logging at Error level to make the silent failure obvious. Currently it logs at Error level which is good, but returning nil means the manager considers the runnable finished successfully.

ambient-code · 2026-04-23T20:04:47Z

+		},
+		Rules: []rbacv1.PolicyRule{
+			{
+				APIGroups:     []string{""},


The buildMetricsReaderClusterRoleBinding references a ClusterRole named metrics-reader which is defined in config/rbac/metrics_reader_role.yaml and deployed via kustomize at install time. If someone deploys the operator without the full kustomize stack (e.g., OLM bundle), will this ClusterRole exist?

I see that the kubebuilder scaffolding includes it in the RBAC config, so it should be bundled into OLM's CSV. Worth verifying this is included in the bundle manifests, or alternatively have the operator create this ClusterRole dynamically (like it does for other RBAC resources).

ambient-code · 2026-04-23T20:04:47Z

+			},
+		},
+	}
+}


Nit: The buildMetricsReaderRoleBinding grants access to prometheus-user-workload SA in openshift-user-workload-monitoring. This is OpenShift-specific. If someone enables monitoring on a vanilla Kubernetes cluster with a standalone Prometheus Operator, this binding won't work because the SA name and namespace will differ.

Since the PR description mentions this is for user-workload-monitoring on OpenShift, this is fine for the initial scope. Consider adding a comment here noting this is OpenShift-specific, and potentially gating behind detectOpenShift if vanilla k8s support is needed later.

ambient-code · 2026-04-23T20:04:47Z

+	}
+
+	clusterBinding := r.buildMetricsReaderClusterRoleBinding(config.Namespace)
+	if err := r.createOrUpdate(ctx, clusterBinding, config); err != nil {


Good: the cleanupServiceMonitor correctly handles IsNoMatchError for the unstructured ServiceMonitor deletion, so it won't fail on clusters where the Prometheus Operator CRD is not installed. This is the right pattern.

However, the symmetric case in deployServiceMonitor does NOT have this guard. If someone enables monitoring on a cluster without the Prometheus Operator CRD installed, createOrUpdate for the ServiceMonitor will fail with a NoMatch error, and the entire reconciliation will fail. Consider catching IsNoMatchError in deployServiceMonitor and either logging a warning and skipping (graceful degradation) or setting a condition on the status to indicate the CRD is missing.

ambient-code · 2026-04-23T20:04:47Z

+		"endpoints": []interface{}{
+			map[string]interface{}{
+				"path":     "/metrics",
+				"port":     "https",


Using insecureSkipVerify: true is consistent with the existing config/prometheus/monitor.yaml scaffold, so this is fine for now. The existing file has a TODO comment about this. You might want to add a similar TODO or comment here noting this should be addressed for production hardening.

ambient-code · 2026-04-23T20:04:47Z

@@ -0,0 +1,151 @@
+# Automotive Dev Operator - Perses Dashboard


Nice addition of a sample Perses dashboard. The PromQL queries look correct and reference the metrics defined in internal/controller/imagebuild/metrics.go (ado_build_active, ado_build_total, ado_build_duration_seconds_sum/count). The clamp_min(..., 1) to avoid division by zero is a good pattern.

One note: the namespace is hardcoded to automotive-dev-operator-system in all queries. If someone deploys to a different namespace, they would need to update all queries. Consider adding a note about this in the header comments.

Add MonitoringConfig to OperatorConfigSpec that deploys a ServiceMonitor, metrics token Secret, Role, and ClusterRoleBinding when enabled. Cleanup removes all resources when disabled. Uses unstructured types to avoid hard dependency on Prometheus Operator CRDs. Assisted-by: claude-sonnet-4-6 Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com>

Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com> Assisted-by: claude-opus-4.6

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal/controller/imagebuild/controller.go (1)
1997-2034: ⚠️ Potential issue | 🟠 Major

Add a failing test for seedActiveBuildsGauge before implementing the seeding logic, and fix the race between the seeder and reconciler.

The coding guidelines require failing tests before implementation for .go files. There is no test for seedActiveBuildsGauge in metrics_test.go.

Additionally, a race condition exists: the seeder runs as a manager.RunnableFunc registered via mgr.Add(), executing concurrently with the reconciler after cache sync. The seeder calls ActiveBuilds.Set(active) while the reconciler concurrently calls adjustActiveBuildsGauge (which calls Inc() or Dec()). Without synchronization, interleaving these operations can leave the gauge off-by-one for the controller's lifetime. For example, if the cache at T1 lists a build as Building, but the reconciler has already processed Building → Completed (calling Dec), the seeder's Set overwrites the correct count with a stale snapshot.

Options:

Run the seed before reconciliations start (compute inline in SetupWithManager or guard with sync.Once).

Or, ensure the seeder only initializes once and skips if adjustActiveBuildsGauge has already been called.

Or document the eventual-consistency semantics explicitly.

Write the failing test first, then implement a solution that eliminates the race.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/imagebuild/controller.go` around lines 1997 - 2034, The
test suite is missing a failing test for seedActiveBuildsGauge and there is a
race between the seeder and the reconciler (seedActiveBuildsGauge calling
ActiveBuilds.Set while adjustActiveBuildsGauge may concurrently Inc/Dec), so
first add a unit test in metrics_test.go that exercises seedActiveBuildsGauge
(expecting it to set the gauge from existing ImageBuildList) and fails
initially; then fix the race by ensuring seeding happens-before reconciliations
start: move the seeding call into SetupWithManager (call the seeding logic
synchronously before registering the controller) or protect the seeder with a
sync.Once/atomic flag that prevents seedActiveBuildsGauge from calling
ActiveBuilds.Set if adjustActiveBuildsGauge already ran (or vice versa),
referencing seedActiveBuildsGauge, SetupWithManager,
mgr.Add/manager.RunnableFunc, ActiveBuilds, and adjustActiveBuildsGauge so the
test validates the chosen approach.

🧹 Nitpick comments (4)

internal/controller/operatorconfig/controller.go (2)
1039-1069: Consider parameterizing the bound SA / Prometheus namespace for non-OpenShift Prometheus Operator deployments.

deployServiceMonitor unconditionally creates a RoleBinding for prometheus-user-workload in openshift-user-workload-monitoring (via buildMetricsReaderRoleBinding in resources.go). On vanilla Kubernetes with a standalone Prometheus Operator the SA name and namespace are different, so the binding is dead RBAC. Since this PR is scoped to OpenShift user-workload-monitoring, this is fine for now, but consider:

Adding a TODO and/or guarding it behind r.detectOpenShift(ctx, ...).

Allowing MonitoringConfig to optionally specify the scrape SA name/namespace for vanilla-Prometheus support later.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/controller.go` around lines 1039 - 1069,
deployServiceMonitor unconditionally creates a RoleBinding via
buildMetricsReaderRoleBinding that binds to the OpenShift SA/namespace
(prometheus-user-workload / openshift-user-workload-monitoring), which is
incorrect on vanilla Kubernetes; update deployServiceMonitor to either guard
creation of that RoleBinding with an OpenShift detection call (use
r.detectOpenShift(ctx, ...) to skip building/applying that binding when not
OpenShift) and add a TODO comment referencing this, or extend the
MonitoringConfig (used in buildServiceMonitor) to optionally accept scrape SA
name/namespace and prefer those values when present so vanilla Prometheus
deployments can provide the correct SA; touch deployServiceMonitor,
buildMetricsReaderRoleBinding and the MonitoringConfig definition to implement
the guard/optional fields and ensure createOrUpdate uses the new conditional
logic.
1122-1132: Reconciler does not own/watch the new monitoring resources.

SetupWithManager already calls Owns(&corev1.ConfigMap{}), Owns(&corev1.ServiceAccount{}), etc., so other reconciled resources trigger requeues when modified. The new metrics token Secret, Role, RoleBinding, and ClusterRoleBinding are not in the Owns() chain. If a cluster admin manually edits or deletes the metrics token Secret or the metrics-reader Role/RoleBinding, the reconciler won't notice until the next periodic resync.

Owns(&corev1.Secret{}), Owns(&rbacv1.Role{}), Owns(&rbacv1.RoleBinding{}), and Owns(&rbacv1.ClusterRoleBinding{}) (or a watch with appropriate predicates) would close that gap. The unstructured ServiceMonitor requires a custom Watches(...) with the GVK — and is reasonable to skip given the optional CRD.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/controller.go` around lines 1122 - 1132,
The SetupWithManager in OperatorConfigReconciler currently Owns
ConfigMap/ServiceAccount/etc but omits the monitoring RBAC and Secret resources,
so changes to the metrics token or roles won't trigger reconciliation; update
the SetupWithManager chain in the OperatorConfigReconciler to also
Owns(&corev1.Secret{}), Owns(&rbacv1.Role{}), Owns(&rbacv1.RoleBinding{}), and
Owns(&rbacv1.ClusterRoleBinding{}), and optionally add a custom Watches(...) for
the ServiceMonitor GVK if you want to observe that unstructured CRD (keep the
existing Complete(r) call).
internal/controller/operatorconfig/resources_test.go (1)
166-218: Optional: also assert tlsConfig content on the generated endpoint.

Tests cover the GVK, selector, bearer token reference, interval defaulting, path/port/scheme, but not tlsConfig. Given there's an open hardening discussion around insecureSkipVerify, locking down current behavior with an assertion makes future TLS changes (e.g., switching to a CA bundle) intentional rather than a silent regression.
♻️ Suggested addition
 		It("should scrape /metrics on port https", func() {
 			config := &automotivev1alpha1.MonitoringConfig{Enabled: true}
 			sm := r.buildServiceMonitor("test-ns", config)
 			endpoints := sm.Object["spec"].(map[string]any)["endpoints"].([]any)
 			ep := endpoints[0].(map[string]any)
 			Expect(ep["path"]).To(Equal("/metrics"))
 			Expect(ep["port"]).To(Equal("https"))
 			Expect(ep["scheme"]).To(Equal("https"))
+			tls := ep["tlsConfig"].(map[string]any)
+			Expect(tls["insecureSkipVerify"]).To(BeTrue())
 		})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/resources_test.go` around lines 166 - 218,
Add an assertion in the buildServiceMonitor tests to verify the generated
ServiceMonitor endpoint's tlsConfig content so TLS behavior is locked down; in
the existing Describe("buildServiceMonitor") cases (e.g., where sm :=
r.buildServiceMonitor("test-ns", config) and you extract endpoints/ep), access
ep["tlsConfig"] and assert its fields (for example that insecureSkipVerify is
the expected boolean and any ca/caFile/caSecret fields match current behavior)
to prevent silent regressions when TLS handling changes.
internal/controller/operatorconfig/resources.go (1)
1072-1141: Add standard app.kubernetes.io/* labels to the new monitoring resources for consistency.

The new metrics token Secret, Role, RoleBinding, and ClusterRoleBinding builders omit the standard label set (app.kubernetes.io/name, app.kubernetes.io/component, app.kubernetes.io/part-of, app.kubernetes.io/managed-by) that every other resource builder in this file applies. This breaks the labeling convention used for selection, ownership tracking, and operational queries (e.g., kubectl get -l app.kubernetes.io/part-of=automotive-dev-operator).
♻️ Proposed labels (illustrative)
 func (r *OperatorConfigReconciler) buildMetricsTokenSecret(namespace string) *corev1.Secret {
 	return &corev1.Secret{
 		ObjectMeta: metav1.ObjectMeta{
 			Name:      serviceMonitorTokenSecret,
 			Namespace: namespace,
+			Labels: map[string]string{
+				"app.kubernetes.io/name":      "automotive-dev-operator",
+				"app.kubernetes.io/component": "monitoring",
+				"app.kubernetes.io/part-of":   "automotive-dev-operator",
+			},
 			Annotations: map[string]string{
 				"kubernetes.io/service-account.name": "ado-operator",
 			},
 		},
 		Type: corev1.SecretTypeServiceAccountToken,
 	}
 }
Apply the same label block to buildMetricsReaderRole, buildMetricsReaderRoleBinding, and buildMetricsReaderClusterRoleBinding.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/resources.go` around lines 1072 - 1141,
Add the standard app.kubernetes.io/* label block (app.kubernetes.io/name,
app.kubernetes.io/component, app.kubernetes.io/part-of,
app.kubernetes.io/managed-by) to the new monitoring resource
builders—buildMetricsTokenSecret, buildMetricsReaderRole,
buildMetricsReaderRoleBinding, and buildMetricsReaderClusterRoleBinding—by
copying the same label map used by the other resource builders in this file (use
the identical keys and values for name/component/part-of/managed-by so these
resources follow the same selection and ownership conventions).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/operatorconfig/controller.go`:
- Around line 304-312: deployServiceMonitor currently returns nil when the
Prometheus CRD is missing which causes Reconcile to incorrectly set
config.Status.MonitoringEnabled = true; change deployServiceMonitor to return
either a sentinel error (e.g., errCRDMissing) or a (deployed bool, err error)
tuple and update the Reconcile call (where r.deployServiceMonitor(ctx, config)
is invoked) to only set config.Status.MonitoringEnabled = true when the
ServiceMonitor was actually deployed (deployed == true) and to treat
errCRDMissing as non-success (do not flip the status) while surfacing a clear
condition/message; ensure you reference the deployServiceMonitor function and
the Status.MonitoringEnabled field when making the change.

---

Outside diff comments:
In `@internal/controller/imagebuild/controller.go`:
- Around line 1997-2034: The test suite is missing a failing test for
seedActiveBuildsGauge and there is a race between the seeder and the reconciler
(seedActiveBuildsGauge calling ActiveBuilds.Set while adjustActiveBuildsGauge
may concurrently Inc/Dec), so first add a unit test in metrics_test.go that
exercises seedActiveBuildsGauge (expecting it to set the gauge from existing
ImageBuildList) and fails initially; then fix the race by ensuring seeding
happens-before reconciliations start: move the seeding call into
SetupWithManager (call the seeding logic synchronously before registering the
controller) or protect the seeder with a sync.Once/atomic flag that prevents
seedActiveBuildsGauge from calling ActiveBuilds.Set if adjustActiveBuildsGauge
already ran (or vice versa), referencing seedActiveBuildsGauge,
SetupWithManager, mgr.Add/manager.RunnableFunc, ActiveBuilds, and
adjustActiveBuildsGauge so the test validates the chosen approach.

---

Nitpick comments:
In `@internal/controller/operatorconfig/controller.go`:
- Around line 1039-1069: deployServiceMonitor unconditionally creates a
RoleBinding via buildMetricsReaderRoleBinding that binds to the OpenShift
SA/namespace (prometheus-user-workload / openshift-user-workload-monitoring),
which is incorrect on vanilla Kubernetes; update deployServiceMonitor to either
guard creation of that RoleBinding with an OpenShift detection call (use
r.detectOpenShift(ctx, ...) to skip building/applying that binding when not
OpenShift) and add a TODO comment referencing this, or extend the
MonitoringConfig (used in buildServiceMonitor) to optionally accept scrape SA
name/namespace and prefer those values when present so vanilla Prometheus
deployments can provide the correct SA; touch deployServiceMonitor,
buildMetricsReaderRoleBinding and the MonitoringConfig definition to implement
the guard/optional fields and ensure createOrUpdate uses the new conditional
logic.
- Around line 1122-1132: The SetupWithManager in OperatorConfigReconciler
currently Owns ConfigMap/ServiceAccount/etc but omits the monitoring RBAC and
Secret resources, so changes to the metrics token or roles won't trigger
reconciliation; update the SetupWithManager chain in the
OperatorConfigReconciler to also Owns(&corev1.Secret{}), Owns(&rbacv1.Role{}),
Owns(&rbacv1.RoleBinding{}), and Owns(&rbacv1.ClusterRoleBinding{}), and
optionally add a custom Watches(...) for the ServiceMonitor GVK if you want to
observe that unstructured CRD (keep the existing Complete(r) call).

In `@internal/controller/operatorconfig/resources_test.go`:
- Around line 166-218: Add an assertion in the buildServiceMonitor tests to
verify the generated ServiceMonitor endpoint's tlsConfig content so TLS behavior
is locked down; in the existing Describe("buildServiceMonitor") cases (e.g.,
where sm := r.buildServiceMonitor("test-ns", config) and you extract
endpoints/ep), access ep["tlsConfig"] and assert its fields (for example that
insecureSkipVerify is the expected boolean and any ca/caFile/caSecret fields
match current behavior) to prevent silent regressions when TLS handling changes.

In `@internal/controller/operatorconfig/resources.go`:
- Around line 1072-1141: Add the standard app.kubernetes.io/* label block
(app.kubernetes.io/name, app.kubernetes.io/component, app.kubernetes.io/part-of,
app.kubernetes.io/managed-by) to the new monitoring resource
builders—buildMetricsTokenSecret, buildMetricsReaderRole,
buildMetricsReaderRoleBinding, and buildMetricsReaderClusterRoleBinding—by
copying the same label map used by the other resource builders in this file (use
the identical keys and values for name/component/part-of/managed-by so these
resources follow the same selection and ownership conventions).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: da0003fd-4d87-4ce9-83c3-1669ffe99f49

📥 Commits

Reviewing files that changed from the base of the PR and between 68ceb6c and cab2bd8.

⛔ Files ignored due to path filters (2)

config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yaml is excluded by !config/crd/bases/**
config/rbac/role.yaml is excluded by !config/rbac/role.yaml

📒 Files selected for processing (11)

api/v1alpha1/operatorconfig_types.go
api/v1alpha1/zz_generated.deepcopy.go
config/samples/automotive_v1_operatorconfig.yaml
config/samples/monitoring/perses-dashboard.yaml
config/samples/monitoring/uiplugin.yaml
internal/controller/imagebuild/controller.go
internal/controller/imagebuild/metrics.go
internal/controller/imagebuild/metrics_test.go
internal/controller/operatorconfig/controller.go
internal/controller/operatorconfig/resources.go
internal/controller/operatorconfig/resources_test.go

✅ Files skipped from review due to trivial changes (3)

config/samples/monitoring/perses-dashboard.yaml
config/samples/automotive_v1_operatorconfig.yaml
config/samples/monitoring/uiplugin.yaml

🚧 Files skipped from review as they are similar to previous changes (1)

internal/controller/imagebuild/metrics.go

Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com> Assisted-by: claude-opus-4.6

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread api/v1alpha1/operatorconfig_types.go

Comment thread internal/controller/operatorconfig/controller.go

bennyz force-pushed the worktree-monitoring-config branch from fd146d6 to 1c0d960 Compare April 23, 2026 09:06

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

bennyz force-pushed the worktree-monitoring-config branch 2 times, most recently from 978f149 to af84282 Compare April 23, 2026 12:00

ambient-code Bot reviewed Apr 23, 2026

View reviewed changes

bennyz force-pushed the worktree-monitoring-config branch from af84282 to 3745a86 Compare April 26, 2026 06:25

bennyz added 2 commits April 26, 2026 11:41

add active builds count

e3fa7ae

Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com> Assisted-by: claude-opus-4.6

bennyz force-pushed the worktree-monitoring-config branch from 3745a86 to cab2bd8 Compare April 26, 2026 10:16

coderabbitai Bot reviewed Apr 26, 2026

View reviewed changes

Comment thread internal/controller/operatorconfig/controller.go

add sample perses dashboard

24ca715

Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com> Assisted-by: claude-opus-4.6

bennyz force-pushed the worktree-monitoring-config branch from cab2bd8 to 24ca715 Compare April 26, 2026 10:31

bennyz requested a review from bkhizgiy April 26, 2026 11:55

bkhizgiy approved these changes Apr 26, 2026

View reviewed changes

bennyz merged commit 54daf84 into centos-automotive-suite:main Apr 26, 2026
4 checks passed

bennyz deleted the worktree-monitoring-config branch April 26, 2026 14:02

This was referenced Apr 26, 2026

initial tracing and log ingestion setup #260

Merged

feat: add sample observability dashboard #273

Open

coderabbitai Bot mentioned this pull request May 7, 2026

pre-seed metrics from imagebuilds #278

Merged

10 tasks

		@@ -1891,6 +1897,28 @@ func (r *ImageBuildReconciler) SetupWithManager(mgr ctrl.Manager) error {
		return builder.Complete(r)

		@@ -0,0 +1,151 @@
		# Automotive Dev Operator - Perses Dashboard

Conversation

bennyz commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issues

Type of Change

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bennyz commented Apr 23, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot left a comment

Choose a reason for hiding this comment

Code Review: feat: add optional Prometheus monitoring via OperatorConfig

Summary of findings

Uh oh!

ambient-code Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ambient-code Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bennyz commented Apr 23, 2026 •

edited

Loading

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading