Skip to content

feat: add optional Prometheus monitoring via OperatorConfig#251

Merged
bennyz merged 3 commits into
centos-automotive-suite:mainfrom
bennyz:worktree-monitoring-config
Apr 26, 2026
Merged

feat: add optional Prometheus monitoring via OperatorConfig#251
bennyz merged 3 commits into
centos-automotive-suite:mainfrom
bennyz:worktree-monitoring-config

Conversation

@bennyz

@bennyz bennyz commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Add MonitoringConfig to OperatorConfigSpec that deploys a ServiceMonitor, metrics token Secret, Role, and ClusterRoleBinding when enabled

Sample Perses dashboard for reference

Summary

Related Issues

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • CI/CD improvement
  • Refactoring

Testing

  • Unit tests pass (make test)
  • Linter passes (make lint)
  • Manifests are up to date (make manifests generate)
  • Tested on OpenShift cluster (if applicable)

Summary by CodeRabbit

  • New Features

    • Configurable Prometheus monitoring (enable/disable, scrape interval with 30s default) with automatic ServiceMonitor deployment and status reporting.
    • Automatic RBAC and token setup for secure metrics scraping.
    • New metric exposing active in-progress builds.
  • Tests

    • Added tests validating monitoring resource generation and active-builds gauge behavior.
  • Samples

    • Added Perses dashboard and console UIPlugin samples for observability.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Adds optional Prometheus monitoring to OperatorConfig (new MonitoringConfig API and status flag), controller logic to deploy/cleanup ServiceMonitor, token Secret and RBAC, builder helpers and tests, sample manifests for Perses/UIPlugin, and imagebuild metrics/gauge with startup initialization and phase-bound updates.

Changes

Cohort / File(s) Summary
API Types & Deepcopy
api/v1alpha1/operatorconfig_types.go, api/v1alpha1/zz_generated.deepcopy.go
Add MonitoringConfig (Enabled, Interval + GetInterval()); add OperatorConfigSpec.Monitoring and OperatorConfigStatus.MonitoringEnabled; generated deepcopy for MonitoringConfig.
OperatorConfig Controller
internal/controller/operatorconfig/controller.go
Reconcile manages monitoring lifecycle: when enabled create metrics token Secret, Role/RoleBinding/ClusterRoleBinding and a monitoring.coreos.com/v1 ServiceMonitor; when disabled or deleting, delete ServiceMonitor (ignore NotFound/CRD-absent) and clean up Secret/RBAC; update status.monitoringEnabled and logs; RBAC extended for servicemonitors.
Resource Builders & Tests
internal/controller/operatorconfig/resources.go, internal/controller/operatorconfig/resources_test.go
Add builders for metrics token Secret, namespace Role + RoleBinding, ClusterRoleBinding, and an unstructured ServiceMonitor (https /metrics, bearer token, tls.insecureSkipVerify, interval from MonitoringConfig with 30s default); tests validate GVK, selector, endpoint, bearer token ref, interval defaulting and RBAC rules/subjects.
Samples / Manifests
config/samples/automotive_v1_operatorconfig.yaml, config/samples/monitoring/perses-dashboard.yaml, config/samples/monitoring/uiplugin.yaml
Add commented spec.monitoring sample; add Perses dashboard manifest and OpenShift UIPlugin manifest for Perses/Observability integration.
ImageBuild Metrics & Controller
internal/controller/imagebuild/controller.go, internal/controller/imagebuild/metrics.go, internal/controller/imagebuild/metrics_test.go
Add ActiveBuilds Prometheus gauge, register metric, helper adjustActiveBuildsGauge and tests; update controller to adjust gauge on phase transitions and initialize gauge at startup by listing existing ImageBuilds.

Sequence Diagram

sequenceDiagram
    participant OC as OperatorConfig Controller
    participant K8s as Kubernetes API
    participant Prom as Prometheus

    rect rgba(100,150,200,0.5)
        Note over OC,K8s: Enable monitoring
        OC->>K8s: Create metrics token Secret
        OC->>K8s: Create Role / RoleBinding
        OC->>K8s: Create ClusterRoleBinding
        OC->>K8s: Create/Update ServiceMonitor
        OC->>K8s: Update status.monitoringEnabled = true
    end

    rect rgba(200,150,100,0.5)
        Note over Prom,K8s: Prometheus scraping
        Prom->>K8s: Discover ServiceMonitor
        Prom->>K8s: Read token Secret
        Prom->>K8s: GET /metrics (authenticated)
    end

    rect rgba(200,100,100,0.5)
        Note over OC,K8s: Disable / Cleanup monitoring
        OC->>K8s: Delete ServiceMonitor (ignore NotFound/CRD-absent)
        OC->>K8s: Delete token Secret
        OC->>K8s: Delete Role / RoleBinding / ClusterRoleBinding
        OC->>K8s: Update status.monitoringEnabled = false
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • bkhizgiy

Poem

🐰 I stitched a metric, soft and neat,

Secrets tucked where tokens meet.
ServiceMonitors hum at night,
RBAC set so scrapers bite.
A happy rabbit hops in light.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding optional Prometheus monitoring via OperatorConfig, which is the core feature across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/operatorconfig/controller.go (1)

235-238: ⚠️ Potential issue | 🟠 Major

Clean up monitoring resources during OperatorConfig deletion.

The finalizer only calls cleanupOSBuilds, but deployServiceMonitor creates a Secret, Role, RoleBinding, ClusterRoleBinding, and ServiceMonitor without owner references. Deleting the OperatorConfig will orphan those resources.

Proposed fix
 		log.Info("Handling deletion")
 		if err := r.cleanupOSBuilds(ctx, config); err != nil {
 			log.Error(err, "Failed to cleanup OSBuilds")
 			return ctrl.Result{}, err
 		}
+		if err := r.cleanupServiceMonitor(ctx, config); err != nil {
+			log.Error(err, "Failed to cleanup ServiceMonitor")
+			return ctrl.Result{}, err
+		}
 		log.Info("Removing finalizer")

Also applies to: 1031-1054

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/controller.go` around lines 235 - 238, The
finalizer only calls cleanupOSBuilds but never removes monitoring resources
created by deployServiceMonitor, so implement and call a cleanup function (e.g.,
r.cleanupMonitoringResources(ctx, config)) alongside r.cleanupOSBuilds in the
deletion path to delete the Secret, Role, RoleBinding, ClusterRoleBinding and
ServiceMonitor that were created without ownerRefs; locate the deletion logic
near where cleanupOSBuilds is invoked (the finalizer block) and ensure
cleanupMonitoringResources uses the same client/context to find and delete the
resources by the same names/labels used in deployServiceMonitor so they are not
orphaned on OperatorConfig deletion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/v1alpha1/operatorconfig_types.go`:
- Around line 445-448: The Interval field on MonitoringConfig currently lacks
CRD-level validation; add a kubebuilder validation pattern annotation above the
Interval field (on MonitoringConfig.Interval) using the Prometheus duration
regex provided (e.g.,
^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$)
so invalid durations are rejected by the CRD, then run make generate manifests
to regenerate the CRD artifacts.

In `@internal/controller/operatorconfig/controller.go`:
- Around line 299-314: The reconcile logic is swallowing errors from
deployServiceMonitor and cleanupServiceMonitor; change the Reconcile flow so
that when r.deployServiceMonitor(ctx, config) or r.cleanupServiceMonitor(ctx,
config) returns an error you return that error (do not just log it), and only
modify config.Status.MonitoringEnabled when the operation succeeded; keep the
existing statusChanged handling but ensure you propagate the original error
instead of continuing to success so the controller can retry. Reference the
deployServiceMonitor and cleanupServiceMonitor calls and the
config.Status.MonitoringEnabled updates to locate where to return the error and
guard status mutation.

---

Outside diff comments:
In `@internal/controller/operatorconfig/controller.go`:
- Around line 235-238: The finalizer only calls cleanupOSBuilds but never
removes monitoring resources created by deployServiceMonitor, so implement and
call a cleanup function (e.g., r.cleanupMonitoringResources(ctx, config))
alongside r.cleanupOSBuilds in the deletion path to delete the Secret, Role,
RoleBinding, ClusterRoleBinding and ServiceMonitor that were created without
ownerRefs; locate the deletion logic near where cleanupOSBuilds is invoked (the
finalizer block) and ensure cleanupMonitoringResources uses the same
client/context to find and delete the resources by the same names/labels used in
deployServiceMonitor so they are not orphaned on OperatorConfig deletion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5c653467-8189-443a-b3bf-f33d42cb92b6

📥 Commits

Reviewing files that changed from the base of the PR and between 5ade329 and fd146d6.

⛔ Files ignored due to path filters (2)
  • config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yaml is excluded by !config/crd/bases/**
  • config/rbac/role.yaml is excluded by !config/rbac/role.yaml
📒 Files selected for processing (5)
  • api/v1alpha1/operatorconfig_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • internal/controller/operatorconfig/controller.go
  • internal/controller/operatorconfig/resources.go
  • internal/controller/operatorconfig/resources_test.go

Comment thread api/v1alpha1/operatorconfig_types.go
Comment thread internal/controller/operatorconfig/controller.go
@bennyz bennyz force-pushed the worktree-monitoring-config branch from fd146d6 to 1c0d960 Compare April 23, 2026 09:06
@bennyz

bennyz commented Apr 23, 2026

Copy link
Copy Markdown
Contributor Author

@ambient-code please review

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/samples/monitoring/perses-dashboard.yaml`:
- Line 77: The panel query currently divides two raw sums and can yield NaN when
no samples exist; update the PromQL in the query field that uses ado_build_total
(status="success") and ado_build_total (total) to clamp the denominator and
provide a zero fallback—e.g. wrap the denominator with clamp_min(...,1) and
ensure the numerator falls back to 0 if empty—so replace the existing expression
with one that uses
clamp_min(sum(ado_build_total{namespace="automotive-dev-operator-system"}), 1)
for the denominator and a safe numerator
(sum(ado_build_total{namespace="automotive-dev-operator-system",status="success"})
or 0) to avoid NaN.

In `@internal/controller/imagebuild/controller.go`:
- Around line 2317-2323: ActiveBuilds can get out of sync because some code
paths patch status directly instead of using updateStatus; ensure the gauge is
always consistent by either (preferred) recomputing ActiveBuilds at startup and
after any status change from cluster state (query ImageBuild objects and set the
gauge to the count where phase == phaseBuilding) or by routing every phase
transition through a single helper that updates status and adjusts ActiveBuilds
atomically (create a transition helper used by updateStatus and the direct patch
sites), and seed the gauge on controller startup from current ImageBuilds to
avoid negative counts; refer to ActiveBuilds, updateStatus and phaseBuilding to
locate the places to change.

In `@internal/controller/operatorconfig/resources.go`:
- Around line 1177-1179: The current metrics scrape config sets "tlsConfig" with
"insecureSkipVerify: true" — replace this by loading the cluster/serving CA
bundle and wiring it into the "tlsConfig" so TLS is validated; specifically,
remove the insecureSkipVerify entry and instead populate tlsConfig with the CA
data (e.g., a "caData" or "caFile" field) sourced from the serving CA (the
OpenShift serving CA ConfigMap/secret or the injected service-ca bundle) before
returning the structure in the code that builds the metrics scrape config (look
for the map literal keyed "tlsConfig" in this file). Ensure the CA is correctly
base64-decoded or passed in the expected form for the downstream consumer and
handle errors if the CA cannot be retrieved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eb20d32e-fc58-4b1f-8736-4eeaf2bfbac9

📥 Commits

Reviewing files that changed from the base of the PR and between fd146d6 and 68ceb6c.

⛔ Files ignored due to path filters (2)
  • config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yaml is excluded by !config/crd/bases/**
  • config/rbac/role.yaml is excluded by !config/rbac/role.yaml
📒 Files selected for processing (10)
  • api/v1alpha1/operatorconfig_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • config/samples/automotive_v1_operatorconfig.yaml
  • config/samples/monitoring/perses-dashboard.yaml
  • config/samples/monitoring/uiplugin.yaml
  • internal/controller/imagebuild/controller.go
  • internal/controller/imagebuild/metrics.go
  • internal/controller/operatorconfig/controller.go
  • internal/controller/operatorconfig/resources.go
  • internal/controller/operatorconfig/resources_test.go
✅ Files skipped from review due to trivial changes (3)
  • config/samples/automotive_v1_operatorconfig.yaml
  • config/samples/monitoring/uiplugin.yaml
  • internal/controller/operatorconfig/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/v1alpha1/operatorconfig_types.go

Comment thread config/samples/monitoring/perses-dashboard.yaml Outdated
Comment thread internal/controller/imagebuild/controller.go Outdated
Comment on lines +1177 to +1179
"tlsConfig": map[string]interface{}{
"insecureSkipVerify": true,
},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check if the file exists and examine the lines in question
wc -l internal/controller/operatorconfig/resources.go

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 143


🏁 Script executed:

# Read the specific lines and surrounding context (1170-1190) to understand the usage
sed -n '1170,1190p' internal/controller/operatorconfig/resources.go

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 393


🏁 Script executed:

# Let's get more context to understand what this configuration is for
sed -n '1150,1200p' internal/controller/operatorconfig/resources.go

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 979


🏁 Script executed:

# Find the function containing this code to understand context better
sed -n '1100,1185p' internal/controller/operatorconfig/resources.go | head -100

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 2329


🏁 Script executed:

# Search for the function name and look for any CA configurations
rg "serviceMonitorTokenSecret|caBundle|caCert|servingCA" internal/controller/operatorconfig/ -A 2 -B 2

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 3229


🏁 Script executed:

# Check if there are other ServiceMonitor creations with CA configurations
rg "insecureSkipVerify|ca.*Config|caFile" internal/controller/operatorconfig/ -B 3 -A 3

Repository: centos-automotive-suite/automotive-dev-operator

Length of output: 1434


Wire the serving CA into tlsConfig instead of disabling TLS verification for metrics scraping.

insecureSkipVerify: true makes bearer-token-authenticated metrics scraping vulnerable to endpoint impersonation within the cluster. In an OpenShift environment, the serving CA is available and should be properly configured in tlsConfig to verify the metrics endpoint's certificate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/resources.go` around lines 1177 - 1179,
The current metrics scrape config sets "tlsConfig" with "insecureSkipVerify:
true" — replace this by loading the cluster/serving CA bundle and wiring it into
the "tlsConfig" so TLS is validated; specifically, remove the insecureSkipVerify
entry and instead populate tlsConfig with the CA data (e.g., a "caData" or
"caFile" field) sourced from the serving CA (the OpenShift serving CA
ConfigMap/secret or the injected service-ca bundle) before returning the
structure in the code that builds the metrics scrape config (look for the map
literal keyed "tlsConfig" in this file). Ensure the CA is correctly
base64-decoded or passed in the expected form for the downstream consumer and
handle errors if the CA cannot be retrieved.

@bennyz bennyz force-pushed the worktree-monitoring-config branch 2 times, most recently from 978f149 to af84282 Compare April 23, 2026 12:00

@ambient-code ambient-code Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: feat: add optional Prometheus monitoring via OperatorConfig

Overall this is a solid PR that follows the existing patterns in the codebase well. The unstructured approach for ServiceMonitor is the right call to avoid a hard dependency on the Prometheus Operator CRDs, and the monitoring/RBAC resources are correctly scoped. The ActiveBuilds gauge and seeder are nice additions.

I found one bug (double-decrement of ActiveBuilds gauge), a few items worth addressing, and some minor suggestions.

Summary of findings

  1. Bug -- double gauge decrement on build completion (blocking): checkBuildProgress calls adjustActiveBuildsGauge(phaseBuilding, phaseCompleted) directly for the success path, but the Cancelled and Failed paths go through updateStatus, which also calls adjustActiveBuildsGauge. This is correctly partitioned (no overlap). However, I noticed that the Completed path in checkBuildProgress sets fresh.Status.Phase = phaseCompleted and then patches, but it does NOT go through updateStatus. On the next reconciliation triggered by the status patch, checkBuildProgress might be called again if the PipelineRun is still in the completed state. The guard at the top of checkBuildProgress should prevent re-processing, but the gauge call at line 608 happens before the event/return. Worth verifying the guard ensures this path cannot be entered twice.

  2. Monitoring is independent of OSBuilds (design observation): Monitoring reconciliation runs regardless of whether OSBuilds is enabled. This seems intentional and correct -- the operator itself always exposes metrics. Just confirming this is the desired behavior.

  3. ClusterRoleBinding references a pre-existing metrics-reader ClusterRole: The buildMetricsReaderClusterRoleBinding binds to a ClusterRole named metrics-reader which exists in config/rbac/metrics_reader_role.yaml and is deployed via kustomize. This means the operator creates the binding at runtime but relies on the ClusterRole having been deployed via kustomize at install time. This coupling should be documented.

  4. Test coverage: Good unit tests for the resource builders and the gauge function. Missing a test for GetInterval() with nil receiver, and no integration test for the reconciliation flow (deploy + cleanup), though that may be acceptable for a first iteration.

  5. Interval regex allows empty string: The pattern ^(0|(([0-9]+)y)?... starts with 0|, meaning a bare 0 is valid. This is fine per the Prometheus duration spec but worth noting.

See inline comments for specifics.

@@ -605,6 +606,7 @@ func (r *ImageBuildReconciler) checkBuildProgress(
log.Error(err, "Failed to patch status to Completed")
return ctrl.Result{}, err
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential double-decrement on re-reconciliation. This adjustActiveBuildsGauge(phaseBuilding, phaseCompleted) call runs every time checkBuildProgress reaches this code path. If the status patch at line 604 triggers a re-reconciliation (which it will, since the status changed), and the PipelineRun is still in the completed state, this function will be called again, decrementing the gauge a second time.

The success path sets fresh.Status.Phase = phaseCompleted before patching, but checkBuildProgress is called when the build's phase is Building. After the first successful pass, the build's phase will be Completed, so subsequent reconciliations should not enter checkBuildProgress at all (the caller likely gates on phase == Building). However, this relies on that external guard existing and being correct.

I would recommend moving the gauge adjustment inside a guard that checks the old phase was actually Building, similar to what updateStatus does:

if imageBuild.Status.Phase == phaseBuilding {
    adjustActiveBuildsGauge(phaseBuilding, phaseCompleted)
}

Alternatively, could you confirm the caller already guards checkBuildProgress to only be invoked when imageBuild.Status.Phase == phaseBuilding?

@@ -1891,6 +1897,28 @@ func (r *ImageBuildReconciler) SetupWithManager(mgr ctrl.Manager) error {
return builder.Complete(r)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good use of manager.RunnableFunc to seed the gauge on startup. This ensures the gauge reflects reality after a controller restart.

One edge case: if the List call fails (e.g., RBAC issue during startup), this logs the error but returns nil, so the manager won't retry and ActiveBuilds stays at 0. Consider returning the error so the manager retries, or at minimum logging at Error level to make the silent failure obvious. Currently it logs at Error level which is good, but returning nil means the manager considers the runnable finished successfully.

},
Rules: []rbacv1.PolicyRule{
{
APIGroups: []string{""},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buildMetricsReaderClusterRoleBinding references a ClusterRole named metrics-reader which is defined in config/rbac/metrics_reader_role.yaml and deployed via kustomize at install time. If someone deploys the operator without the full kustomize stack (e.g., OLM bundle), will this ClusterRole exist?

I see that the kubebuilder scaffolding includes it in the RBAC config, so it should be bundled into OLM's CSV. Worth verifying this is included in the bundle manifests, or alternatively have the operator create this ClusterRole dynamically (like it does for other RBAC resources).

},
},
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The buildMetricsReaderRoleBinding grants access to prometheus-user-workload SA in openshift-user-workload-monitoring. This is OpenShift-specific. If someone enables monitoring on a vanilla Kubernetes cluster with a standalone Prometheus Operator, this binding won't work because the SA name and namespace will differ.

Since the PR description mentions this is for user-workload-monitoring on OpenShift, this is fine for the initial scope. Consider adding a comment here noting this is OpenShift-specific, and potentially gating behind detectOpenShift if vanilla k8s support is needed later.

}

clusterBinding := r.buildMetricsReaderClusterRoleBinding(config.Namespace)
if err := r.createOrUpdate(ctx, clusterBinding, config); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good: the cleanupServiceMonitor correctly handles IsNoMatchError for the unstructured ServiceMonitor deletion, so it won't fail on clusters where the Prometheus Operator CRD is not installed. This is the right pattern.

However, the symmetric case in deployServiceMonitor does NOT have this guard. If someone enables monitoring on a cluster without the Prometheus Operator CRD installed, createOrUpdate for the ServiceMonitor will fail with a NoMatch error, and the entire reconciliation will fail. Consider catching IsNoMatchError in deployServiceMonitor and either logging a warning and skipping (graceful degradation) or setting a condition on the status to indicate the CRD is missing.

"endpoints": []interface{}{
map[string]interface{}{
"path": "/metrics",
"port": "https",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using insecureSkipVerify: true is consistent with the existing config/prometheus/monitor.yaml scaffold, so this is fine for now. The existing file has a TODO comment about this. You might want to add a similar TODO or comment here noting this should be addressed for production hardening.

Comment thread internal/controller/operatorconfig/controller.go
Comment thread api/v1alpha1/operatorconfig_types.go
@@ -0,0 +1,151 @@
# Automotive Dev Operator - Perses Dashboard

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition of a sample Perses dashboard. The PromQL queries look correct and reference the metrics defined in internal/controller/imagebuild/metrics.go (ado_build_active, ado_build_total, ado_build_duration_seconds_sum/count). The clamp_min(..., 1) to avoid division by zero is a good pattern.

One note: the namespace is hardcoded to automotive-dev-operator-system in all queries. If someone deploys to a different namespace, they would need to update all queries. Consider adding a note about this in the header comments.

@bennyz bennyz force-pushed the worktree-monitoring-config branch from af84282 to 3745a86 Compare April 26, 2026 06:25
bennyz added 2 commits April 26, 2026 11:41
Add MonitoringConfig to OperatorConfigSpec that deploys a ServiceMonitor,
metrics token Secret, Role, and ClusterRoleBinding when enabled. Cleanup
removes all resources when disabled. Uses unstructured types to avoid hard
dependency on Prometheus Operator CRDs.

Assisted-by: claude-sonnet-4-6
Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com>
Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com>
Assisted-by: claude-opus-4.6
@bennyz bennyz force-pushed the worktree-monitoring-config branch from 3745a86 to cab2bd8 Compare April 26, 2026 10:16

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/imagebuild/controller.go (1)

1997-2034: ⚠️ Potential issue | 🟠 Major

Add a failing test for seedActiveBuildsGauge before implementing the seeding logic, and fix the race between the seeder and reconciler.

The coding guidelines require failing tests before implementation for .go files. There is no test for seedActiveBuildsGauge in metrics_test.go.

Additionally, a race condition exists: the seeder runs as a manager.RunnableFunc registered via mgr.Add(), executing concurrently with the reconciler after cache sync. The seeder calls ActiveBuilds.Set(active) while the reconciler concurrently calls adjustActiveBuildsGauge (which calls Inc() or Dec()). Without synchronization, interleaving these operations can leave the gauge off-by-one for the controller's lifetime. For example, if the cache at T1 lists a build as Building, but the reconciler has already processed Building → Completed (calling Dec), the seeder's Set overwrites the correct count with a stale snapshot.

Options:

  • Run the seed before reconciliations start (compute inline in SetupWithManager or guard with sync.Once).
  • Or, ensure the seeder only initializes once and skips if adjustActiveBuildsGauge has already been called.
  • Or document the eventual-consistency semantics explicitly.

Write the failing test first, then implement a solution that eliminates the race.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/imagebuild/controller.go` around lines 1997 - 2034, The
test suite is missing a failing test for seedActiveBuildsGauge and there is a
race between the seeder and the reconciler (seedActiveBuildsGauge calling
ActiveBuilds.Set while adjustActiveBuildsGauge may concurrently Inc/Dec), so
first add a unit test in metrics_test.go that exercises seedActiveBuildsGauge
(expecting it to set the gauge from existing ImageBuildList) and fails
initially; then fix the race by ensuring seeding happens-before reconciliations
start: move the seeding call into SetupWithManager (call the seeding logic
synchronously before registering the controller) or protect the seeder with a
sync.Once/atomic flag that prevents seedActiveBuildsGauge from calling
ActiveBuilds.Set if adjustActiveBuildsGauge already ran (or vice versa),
referencing seedActiveBuildsGauge, SetupWithManager,
mgr.Add/manager.RunnableFunc, ActiveBuilds, and adjustActiveBuildsGauge so the
test validates the chosen approach.
🧹 Nitpick comments (4)
internal/controller/operatorconfig/controller.go (2)

1039-1069: Consider parameterizing the bound SA / Prometheus namespace for non-OpenShift Prometheus Operator deployments.

deployServiceMonitor unconditionally creates a RoleBinding for prometheus-user-workload in openshift-user-workload-monitoring (via buildMetricsReaderRoleBinding in resources.go). On vanilla Kubernetes with a standalone Prometheus Operator the SA name and namespace are different, so the binding is dead RBAC. Since this PR is scoped to OpenShift user-workload-monitoring, this is fine for now, but consider:

  • Adding a TODO and/or guarding it behind r.detectOpenShift(ctx, ...).
  • Allowing MonitoringConfig to optionally specify the scrape SA name/namespace for vanilla-Prometheus support later.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/controller.go` around lines 1039 - 1069,
deployServiceMonitor unconditionally creates a RoleBinding via
buildMetricsReaderRoleBinding that binds to the OpenShift SA/namespace
(prometheus-user-workload / openshift-user-workload-monitoring), which is
incorrect on vanilla Kubernetes; update deployServiceMonitor to either guard
creation of that RoleBinding with an OpenShift detection call (use
r.detectOpenShift(ctx, ...) to skip building/applying that binding when not
OpenShift) and add a TODO comment referencing this, or extend the
MonitoringConfig (used in buildServiceMonitor) to optionally accept scrape SA
name/namespace and prefer those values when present so vanilla Prometheus
deployments can provide the correct SA; touch deployServiceMonitor,
buildMetricsReaderRoleBinding and the MonitoringConfig definition to implement
the guard/optional fields and ensure createOrUpdate uses the new conditional
logic.

1122-1132: Reconciler does not own/watch the new monitoring resources.

SetupWithManager already calls Owns(&corev1.ConfigMap{}), Owns(&corev1.ServiceAccount{}), etc., so other reconciled resources trigger requeues when modified. The new metrics token Secret, Role, RoleBinding, and ClusterRoleBinding are not in the Owns() chain. If a cluster admin manually edits or deletes the metrics token Secret or the metrics-reader Role/RoleBinding, the reconciler won't notice until the next periodic resync.

Owns(&corev1.Secret{}), Owns(&rbacv1.Role{}), Owns(&rbacv1.RoleBinding{}), and Owns(&rbacv1.ClusterRoleBinding{}) (or a watch with appropriate predicates) would close that gap. The unstructured ServiceMonitor requires a custom Watches(...) with the GVK — and is reasonable to skip given the optional CRD.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/controller.go` around lines 1122 - 1132,
The SetupWithManager in OperatorConfigReconciler currently Owns
ConfigMap/ServiceAccount/etc but omits the monitoring RBAC and Secret resources,
so changes to the metrics token or roles won't trigger reconciliation; update
the SetupWithManager chain in the OperatorConfigReconciler to also
Owns(&corev1.Secret{}), Owns(&rbacv1.Role{}), Owns(&rbacv1.RoleBinding{}), and
Owns(&rbacv1.ClusterRoleBinding{}), and optionally add a custom Watches(...) for
the ServiceMonitor GVK if you want to observe that unstructured CRD (keep the
existing Complete(r) call).
internal/controller/operatorconfig/resources_test.go (1)

166-218: Optional: also assert tlsConfig content on the generated endpoint.

Tests cover the GVK, selector, bearer token reference, interval defaulting, path/port/scheme, but not tlsConfig. Given there's an open hardening discussion around insecureSkipVerify, locking down current behavior with an assertion makes future TLS changes (e.g., switching to a CA bundle) intentional rather than a silent regression.

♻️ Suggested addition
 		It("should scrape /metrics on port https", func() {
 			config := &automotivev1alpha1.MonitoringConfig{Enabled: true}
 			sm := r.buildServiceMonitor("test-ns", config)
 			endpoints := sm.Object["spec"].(map[string]any)["endpoints"].([]any)
 			ep := endpoints[0].(map[string]any)
 			Expect(ep["path"]).To(Equal("/metrics"))
 			Expect(ep["port"]).To(Equal("https"))
 			Expect(ep["scheme"]).To(Equal("https"))
+			tls := ep["tlsConfig"].(map[string]any)
+			Expect(tls["insecureSkipVerify"]).To(BeTrue())
 		})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/resources_test.go` around lines 166 - 218,
Add an assertion in the buildServiceMonitor tests to verify the generated
ServiceMonitor endpoint's tlsConfig content so TLS behavior is locked down; in
the existing Describe("buildServiceMonitor") cases (e.g., where sm :=
r.buildServiceMonitor("test-ns", config) and you extract endpoints/ep), access
ep["tlsConfig"] and assert its fields (for example that insecureSkipVerify is
the expected boolean and any ca/caFile/caSecret fields match current behavior)
to prevent silent regressions when TLS handling changes.
internal/controller/operatorconfig/resources.go (1)

1072-1141: Add standard app.kubernetes.io/* labels to the new monitoring resources for consistency.

The new metrics token Secret, Role, RoleBinding, and ClusterRoleBinding builders omit the standard label set (app.kubernetes.io/name, app.kubernetes.io/component, app.kubernetes.io/part-of, app.kubernetes.io/managed-by) that every other resource builder in this file applies. This breaks the labeling convention used for selection, ownership tracking, and operational queries (e.g., kubectl get -l app.kubernetes.io/part-of=automotive-dev-operator).

♻️ Proposed labels (illustrative)
 func (r *OperatorConfigReconciler) buildMetricsTokenSecret(namespace string) *corev1.Secret {
 	return &corev1.Secret{
 		ObjectMeta: metav1.ObjectMeta{
 			Name:      serviceMonitorTokenSecret,
 			Namespace: namespace,
+			Labels: map[string]string{
+				"app.kubernetes.io/name":      "automotive-dev-operator",
+				"app.kubernetes.io/component": "monitoring",
+				"app.kubernetes.io/part-of":   "automotive-dev-operator",
+			},
 			Annotations: map[string]string{
 				"kubernetes.io/service-account.name": "ado-operator",
 			},
 		},
 		Type: corev1.SecretTypeServiceAccountToken,
 	}
 }

Apply the same label block to buildMetricsReaderRole, buildMetricsReaderRoleBinding, and buildMetricsReaderClusterRoleBinding.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/resources.go` around lines 1072 - 1141,
Add the standard app.kubernetes.io/* label block (app.kubernetes.io/name,
app.kubernetes.io/component, app.kubernetes.io/part-of,
app.kubernetes.io/managed-by) to the new monitoring resource
builders—buildMetricsTokenSecret, buildMetricsReaderRole,
buildMetricsReaderRoleBinding, and buildMetricsReaderClusterRoleBinding—by
copying the same label map used by the other resource builders in this file (use
the identical keys and values for name/component/part-of/managed-by so these
resources follow the same selection and ownership conventions).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/operatorconfig/controller.go`:
- Around line 304-312: deployServiceMonitor currently returns nil when the
Prometheus CRD is missing which causes Reconcile to incorrectly set
config.Status.MonitoringEnabled = true; change deployServiceMonitor to return
either a sentinel error (e.g., errCRDMissing) or a (deployed bool, err error)
tuple and update the Reconcile call (where r.deployServiceMonitor(ctx, config)
is invoked) to only set config.Status.MonitoringEnabled = true when the
ServiceMonitor was actually deployed (deployed == true) and to treat
errCRDMissing as non-success (do not flip the status) while surfacing a clear
condition/message; ensure you reference the deployServiceMonitor function and
the Status.MonitoringEnabled field when making the change.

---

Outside diff comments:
In `@internal/controller/imagebuild/controller.go`:
- Around line 1997-2034: The test suite is missing a failing test for
seedActiveBuildsGauge and there is a race between the seeder and the reconciler
(seedActiveBuildsGauge calling ActiveBuilds.Set while adjustActiveBuildsGauge
may concurrently Inc/Dec), so first add a unit test in metrics_test.go that
exercises seedActiveBuildsGauge (expecting it to set the gauge from existing
ImageBuildList) and fails initially; then fix the race by ensuring seeding
happens-before reconciliations start: move the seeding call into
SetupWithManager (call the seeding logic synchronously before registering the
controller) or protect the seeder with a sync.Once/atomic flag that prevents
seedActiveBuildsGauge from calling ActiveBuilds.Set if adjustActiveBuildsGauge
already ran (or vice versa), referencing seedActiveBuildsGauge,
SetupWithManager, mgr.Add/manager.RunnableFunc, ActiveBuilds, and
adjustActiveBuildsGauge so the test validates the chosen approach.

---

Nitpick comments:
In `@internal/controller/operatorconfig/controller.go`:
- Around line 1039-1069: deployServiceMonitor unconditionally creates a
RoleBinding via buildMetricsReaderRoleBinding that binds to the OpenShift
SA/namespace (prometheus-user-workload / openshift-user-workload-monitoring),
which is incorrect on vanilla Kubernetes; update deployServiceMonitor to either
guard creation of that RoleBinding with an OpenShift detection call (use
r.detectOpenShift(ctx, ...) to skip building/applying that binding when not
OpenShift) and add a TODO comment referencing this, or extend the
MonitoringConfig (used in buildServiceMonitor) to optionally accept scrape SA
name/namespace and prefer those values when present so vanilla Prometheus
deployments can provide the correct SA; touch deployServiceMonitor,
buildMetricsReaderRoleBinding and the MonitoringConfig definition to implement
the guard/optional fields and ensure createOrUpdate uses the new conditional
logic.
- Around line 1122-1132: The SetupWithManager in OperatorConfigReconciler
currently Owns ConfigMap/ServiceAccount/etc but omits the monitoring RBAC and
Secret resources, so changes to the metrics token or roles won't trigger
reconciliation; update the SetupWithManager chain in the
OperatorConfigReconciler to also Owns(&corev1.Secret{}), Owns(&rbacv1.Role{}),
Owns(&rbacv1.RoleBinding{}), and Owns(&rbacv1.ClusterRoleBinding{}), and
optionally add a custom Watches(...) for the ServiceMonitor GVK if you want to
observe that unstructured CRD (keep the existing Complete(r) call).

In `@internal/controller/operatorconfig/resources_test.go`:
- Around line 166-218: Add an assertion in the buildServiceMonitor tests to
verify the generated ServiceMonitor endpoint's tlsConfig content so TLS behavior
is locked down; in the existing Describe("buildServiceMonitor") cases (e.g.,
where sm := r.buildServiceMonitor("test-ns", config) and you extract
endpoints/ep), access ep["tlsConfig"] and assert its fields (for example that
insecureSkipVerify is the expected boolean and any ca/caFile/caSecret fields
match current behavior) to prevent silent regressions when TLS handling changes.

In `@internal/controller/operatorconfig/resources.go`:
- Around line 1072-1141: Add the standard app.kubernetes.io/* label block
(app.kubernetes.io/name, app.kubernetes.io/component, app.kubernetes.io/part-of,
app.kubernetes.io/managed-by) to the new monitoring resource
builders—buildMetricsTokenSecret, buildMetricsReaderRole,
buildMetricsReaderRoleBinding, and buildMetricsReaderClusterRoleBinding—by
copying the same label map used by the other resource builders in this file (use
the identical keys and values for name/component/part-of/managed-by so these
resources follow the same selection and ownership conventions).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: da0003fd-4d87-4ce9-83c3-1669ffe99f49

📥 Commits

Reviewing files that changed from the base of the PR and between 68ceb6c and cab2bd8.

⛔ Files ignored due to path filters (2)
  • config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yaml is excluded by !config/crd/bases/**
  • config/rbac/role.yaml is excluded by !config/rbac/role.yaml
📒 Files selected for processing (11)
  • api/v1alpha1/operatorconfig_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • config/samples/automotive_v1_operatorconfig.yaml
  • config/samples/monitoring/perses-dashboard.yaml
  • config/samples/monitoring/uiplugin.yaml
  • internal/controller/imagebuild/controller.go
  • internal/controller/imagebuild/metrics.go
  • internal/controller/imagebuild/metrics_test.go
  • internal/controller/operatorconfig/controller.go
  • internal/controller/operatorconfig/resources.go
  • internal/controller/operatorconfig/resources_test.go
✅ Files skipped from review due to trivial changes (3)
  • config/samples/monitoring/perses-dashboard.yaml
  • config/samples/automotive_v1_operatorconfig.yaml
  • config/samples/monitoring/uiplugin.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/controller/imagebuild/metrics.go

Comment thread internal/controller/operatorconfig/controller.go
Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com>
Assisted-by: claude-opus-4.6
@bennyz bennyz force-pushed the worktree-monitoring-config branch from cab2bd8 to 24ca715 Compare April 26, 2026 10:31
@bennyz bennyz requested a review from bkhizgiy April 26, 2026 11:55
@bennyz bennyz merged commit 54daf84 into centos-automotive-suite:main Apr 26, 2026
4 checks passed
@bennyz bennyz deleted the worktree-monitoring-config branch April 26, 2026 14:02
@coderabbitai coderabbitai Bot mentioned this pull request May 7, 2026
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants