feat: add optional Prometheus monitoring via OperatorConfig#251
Conversation
📝 WalkthroughWalkthroughAdds optional Prometheus monitoring to OperatorConfig (new MonitoringConfig API and status flag), controller logic to deploy/cleanup ServiceMonitor, token Secret and RBAC, builder helpers and tests, sample manifests for Perses/UIPlugin, and imagebuild metrics/gauge with startup initialization and phase-bound updates. Changes
Sequence DiagramsequenceDiagram
participant OC as OperatorConfig Controller
participant K8s as Kubernetes API
participant Prom as Prometheus
rect rgba(100,150,200,0.5)
Note over OC,K8s: Enable monitoring
OC->>K8s: Create metrics token Secret
OC->>K8s: Create Role / RoleBinding
OC->>K8s: Create ClusterRoleBinding
OC->>K8s: Create/Update ServiceMonitor
OC->>K8s: Update status.monitoringEnabled = true
end
rect rgba(200,150,100,0.5)
Note over Prom,K8s: Prometheus scraping
Prom->>K8s: Discover ServiceMonitor
Prom->>K8s: Read token Secret
Prom->>K8s: GET /metrics (authenticated)
end
rect rgba(200,100,100,0.5)
Note over OC,K8s: Disable / Cleanup monitoring
OC->>K8s: Delete ServiceMonitor (ignore NotFound/CRD-absent)
OC->>K8s: Delete token Secret
OC->>K8s: Delete Role / RoleBinding / ClusterRoleBinding
OC->>K8s: Update status.monitoringEnabled = false
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/controller/operatorconfig/controller.go (1)
235-238:⚠️ Potential issue | 🟠 MajorClean up monitoring resources during OperatorConfig deletion.
The finalizer only calls
cleanupOSBuilds, butdeployServiceMonitorcreates a Secret, Role, RoleBinding, ClusterRoleBinding, and ServiceMonitor without owner references. Deleting theOperatorConfigwill orphan those resources.Proposed fix
log.Info("Handling deletion") if err := r.cleanupOSBuilds(ctx, config); err != nil { log.Error(err, "Failed to cleanup OSBuilds") return ctrl.Result{}, err } + if err := r.cleanupServiceMonitor(ctx, config); err != nil { + log.Error(err, "Failed to cleanup ServiceMonitor") + return ctrl.Result{}, err + } log.Info("Removing finalizer")Also applies to: 1031-1054
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/operatorconfig/controller.go` around lines 235 - 238, The finalizer only calls cleanupOSBuilds but never removes monitoring resources created by deployServiceMonitor, so implement and call a cleanup function (e.g., r.cleanupMonitoringResources(ctx, config)) alongside r.cleanupOSBuilds in the deletion path to delete the Secret, Role, RoleBinding, ClusterRoleBinding and ServiceMonitor that were created without ownerRefs; locate the deletion logic near where cleanupOSBuilds is invoked (the finalizer block) and ensure cleanupMonitoringResources uses the same client/context to find and delete the resources by the same names/labels used in deployServiceMonitor so they are not orphaned on OperatorConfig deletion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v1alpha1/operatorconfig_types.go`:
- Around line 445-448: The Interval field on MonitoringConfig currently lacks
CRD-level validation; add a kubebuilder validation pattern annotation above the
Interval field (on MonitoringConfig.Interval) using the Prometheus duration
regex provided (e.g.,
^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$)
so invalid durations are rejected by the CRD, then run make generate manifests
to regenerate the CRD artifacts.
In `@internal/controller/operatorconfig/controller.go`:
- Around line 299-314: The reconcile logic is swallowing errors from
deployServiceMonitor and cleanupServiceMonitor; change the Reconcile flow so
that when r.deployServiceMonitor(ctx, config) or r.cleanupServiceMonitor(ctx,
config) returns an error you return that error (do not just log it), and only
modify config.Status.MonitoringEnabled when the operation succeeded; keep the
existing statusChanged handling but ensure you propagate the original error
instead of continuing to success so the controller can retry. Reference the
deployServiceMonitor and cleanupServiceMonitor calls and the
config.Status.MonitoringEnabled updates to locate where to return the error and
guard status mutation.
---
Outside diff comments:
In `@internal/controller/operatorconfig/controller.go`:
- Around line 235-238: The finalizer only calls cleanupOSBuilds but never
removes monitoring resources created by deployServiceMonitor, so implement and
call a cleanup function (e.g., r.cleanupMonitoringResources(ctx, config))
alongside r.cleanupOSBuilds in the deletion path to delete the Secret, Role,
RoleBinding, ClusterRoleBinding and ServiceMonitor that were created without
ownerRefs; locate the deletion logic near where cleanupOSBuilds is invoked (the
finalizer block) and ensure cleanupMonitoringResources uses the same
client/context to find and delete the resources by the same names/labels used in
deployServiceMonitor so they are not orphaned on OperatorConfig deletion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5c653467-8189-443a-b3bf-f33d42cb92b6
⛔ Files ignored due to path filters (2)
config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yamlis excluded by!config/crd/bases/**config/rbac/role.yamlis excluded by!config/rbac/role.yaml
📒 Files selected for processing (5)
api/v1alpha1/operatorconfig_types.goapi/v1alpha1/zz_generated.deepcopy.gointernal/controller/operatorconfig/controller.gointernal/controller/operatorconfig/resources.gointernal/controller/operatorconfig/resources_test.go
fd146d6 to
1c0d960
Compare
|
@ambient-code please review |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config/samples/monitoring/perses-dashboard.yaml`:
- Line 77: The panel query currently divides two raw sums and can yield NaN when
no samples exist; update the PromQL in the query field that uses ado_build_total
(status="success") and ado_build_total (total) to clamp the denominator and
provide a zero fallback—e.g. wrap the denominator with clamp_min(...,1) and
ensure the numerator falls back to 0 if empty—so replace the existing expression
with one that uses
clamp_min(sum(ado_build_total{namespace="automotive-dev-operator-system"}), 1)
for the denominator and a safe numerator
(sum(ado_build_total{namespace="automotive-dev-operator-system",status="success"})
or 0) to avoid NaN.
In `@internal/controller/imagebuild/controller.go`:
- Around line 2317-2323: ActiveBuilds can get out of sync because some code
paths patch status directly instead of using updateStatus; ensure the gauge is
always consistent by either (preferred) recomputing ActiveBuilds at startup and
after any status change from cluster state (query ImageBuild objects and set the
gauge to the count where phase == phaseBuilding) or by routing every phase
transition through a single helper that updates status and adjusts ActiveBuilds
atomically (create a transition helper used by updateStatus and the direct patch
sites), and seed the gauge on controller startup from current ImageBuilds to
avoid negative counts; refer to ActiveBuilds, updateStatus and phaseBuilding to
locate the places to change.
In `@internal/controller/operatorconfig/resources.go`:
- Around line 1177-1179: The current metrics scrape config sets "tlsConfig" with
"insecureSkipVerify: true" — replace this by loading the cluster/serving CA
bundle and wiring it into the "tlsConfig" so TLS is validated; specifically,
remove the insecureSkipVerify entry and instead populate tlsConfig with the CA
data (e.g., a "caData" or "caFile" field) sourced from the serving CA (the
OpenShift serving CA ConfigMap/secret or the injected service-ca bundle) before
returning the structure in the code that builds the metrics scrape config (look
for the map literal keyed "tlsConfig" in this file). Ensure the CA is correctly
base64-decoded or passed in the expected form for the downstream consumer and
handle errors if the CA cannot be retrieved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eb20d32e-fc58-4b1f-8736-4eeaf2bfbac9
⛔ Files ignored due to path filters (2)
config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yamlis excluded by!config/crd/bases/**config/rbac/role.yamlis excluded by!config/rbac/role.yaml
📒 Files selected for processing (10)
api/v1alpha1/operatorconfig_types.goapi/v1alpha1/zz_generated.deepcopy.goconfig/samples/automotive_v1_operatorconfig.yamlconfig/samples/monitoring/perses-dashboard.yamlconfig/samples/monitoring/uiplugin.yamlinternal/controller/imagebuild/controller.gointernal/controller/imagebuild/metrics.gointernal/controller/operatorconfig/controller.gointernal/controller/operatorconfig/resources.gointernal/controller/operatorconfig/resources_test.go
✅ Files skipped from review due to trivial changes (3)
- config/samples/automotive_v1_operatorconfig.yaml
- config/samples/monitoring/uiplugin.yaml
- internal/controller/operatorconfig/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- api/v1alpha1/operatorconfig_types.go
| "tlsConfig": map[string]interface{}{ | ||
| "insecureSkipVerify": true, | ||
| }, |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check if the file exists and examine the lines in question
wc -l internal/controller/operatorconfig/resources.goRepository: centos-automotive-suite/automotive-dev-operator
Length of output: 143
🏁 Script executed:
# Read the specific lines and surrounding context (1170-1190) to understand the usage
sed -n '1170,1190p' internal/controller/operatorconfig/resources.goRepository: centos-automotive-suite/automotive-dev-operator
Length of output: 393
🏁 Script executed:
# Let's get more context to understand what this configuration is for
sed -n '1150,1200p' internal/controller/operatorconfig/resources.goRepository: centos-automotive-suite/automotive-dev-operator
Length of output: 979
🏁 Script executed:
# Find the function containing this code to understand context better
sed -n '1100,1185p' internal/controller/operatorconfig/resources.go | head -100Repository: centos-automotive-suite/automotive-dev-operator
Length of output: 2329
🏁 Script executed:
# Search for the function name and look for any CA configurations
rg "serviceMonitorTokenSecret|caBundle|caCert|servingCA" internal/controller/operatorconfig/ -A 2 -B 2Repository: centos-automotive-suite/automotive-dev-operator
Length of output: 3229
🏁 Script executed:
# Check if there are other ServiceMonitor creations with CA configurations
rg "insecureSkipVerify|ca.*Config|caFile" internal/controller/operatorconfig/ -B 3 -A 3Repository: centos-automotive-suite/automotive-dev-operator
Length of output: 1434
Wire the serving CA into tlsConfig instead of disabling TLS verification for metrics scraping.
insecureSkipVerify: true makes bearer-token-authenticated metrics scraping vulnerable to endpoint impersonation within the cluster. In an OpenShift environment, the serving CA is available and should be properly configured in tlsConfig to verify the metrics endpoint's certificate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/controller/operatorconfig/resources.go` around lines 1177 - 1179,
The current metrics scrape config sets "tlsConfig" with "insecureSkipVerify:
true" — replace this by loading the cluster/serving CA bundle and wiring it into
the "tlsConfig" so TLS is validated; specifically, remove the insecureSkipVerify
entry and instead populate tlsConfig with the CA data (e.g., a "caData" or
"caFile" field) sourced from the serving CA (the OpenShift serving CA
ConfigMap/secret or the injected service-ca bundle) before returning the
structure in the code that builds the metrics scrape config (look for the map
literal keyed "tlsConfig" in this file). Ensure the CA is correctly
base64-decoded or passed in the expected form for the downstream consumer and
handle errors if the CA cannot be retrieved.
978f149 to
af84282
Compare
There was a problem hiding this comment.
Code Review: feat: add optional Prometheus monitoring via OperatorConfig
Overall this is a solid PR that follows the existing patterns in the codebase well. The unstructured approach for ServiceMonitor is the right call to avoid a hard dependency on the Prometheus Operator CRDs, and the monitoring/RBAC resources are correctly scoped. The ActiveBuilds gauge and seeder are nice additions.
I found one bug (double-decrement of ActiveBuilds gauge), a few items worth addressing, and some minor suggestions.
Summary of findings
-
Bug -- double gauge decrement on build completion (blocking):
checkBuildProgresscallsadjustActiveBuildsGauge(phaseBuilding, phaseCompleted)directly for the success path, but the Cancelled and Failed paths go throughupdateStatus, which also callsadjustActiveBuildsGauge. This is correctly partitioned (no overlap). However, I noticed that the Completed path incheckBuildProgresssetsfresh.Status.Phase = phaseCompletedand then patches, but it does NOT go throughupdateStatus. On the next reconciliation triggered by the status patch,checkBuildProgressmight be called again if the PipelineRun is still in the completed state. The guard at the top ofcheckBuildProgressshould prevent re-processing, but the gauge call at line 608 happens before the event/return. Worth verifying the guard ensures this path cannot be entered twice. -
Monitoring is independent of OSBuilds (design observation): Monitoring reconciliation runs regardless of whether OSBuilds is enabled. This seems intentional and correct -- the operator itself always exposes metrics. Just confirming this is the desired behavior.
-
ClusterRoleBinding references a pre-existing
metrics-readerClusterRole: ThebuildMetricsReaderClusterRoleBindingbinds to a ClusterRole namedmetrics-readerwhich exists inconfig/rbac/metrics_reader_role.yamland is deployed via kustomize. This means the operator creates the binding at runtime but relies on the ClusterRole having been deployed via kustomize at install time. This coupling should be documented. -
Test coverage: Good unit tests for the resource builders and the gauge function. Missing a test for
GetInterval()with nil receiver, and no integration test for the reconciliation flow (deploy + cleanup), though that may be acceptable for a first iteration. -
Interval regex allows empty string: The pattern
^(0|(([0-9]+)y)?...starts with0|, meaning a bare0is valid. This is fine per the Prometheus duration spec but worth noting.
See inline comments for specifics.
| @@ -605,6 +606,7 @@ func (r *ImageBuildReconciler) checkBuildProgress( | |||
| log.Error(err, "Failed to patch status to Completed") | |||
| return ctrl.Result{}, err | |||
| } | |||
There was a problem hiding this comment.
Potential double-decrement on re-reconciliation. This adjustActiveBuildsGauge(phaseBuilding, phaseCompleted) call runs every time checkBuildProgress reaches this code path. If the status patch at line 604 triggers a re-reconciliation (which it will, since the status changed), and the PipelineRun is still in the completed state, this function will be called again, decrementing the gauge a second time.
The success path sets fresh.Status.Phase = phaseCompleted before patching, but checkBuildProgress is called when the build's phase is Building. After the first successful pass, the build's phase will be Completed, so subsequent reconciliations should not enter checkBuildProgress at all (the caller likely gates on phase == Building). However, this relies on that external guard existing and being correct.
I would recommend moving the gauge adjustment inside a guard that checks the old phase was actually Building, similar to what updateStatus does:
if imageBuild.Status.Phase == phaseBuilding {
adjustActiveBuildsGauge(phaseBuilding, phaseCompleted)
}Alternatively, could you confirm the caller already guards checkBuildProgress to only be invoked when imageBuild.Status.Phase == phaseBuilding?
| @@ -1891,6 +1897,28 @@ func (r *ImageBuildReconciler) SetupWithManager(mgr ctrl.Manager) error { | |||
| return builder.Complete(r) | |||
There was a problem hiding this comment.
Good use of manager.RunnableFunc to seed the gauge on startup. This ensures the gauge reflects reality after a controller restart.
One edge case: if the List call fails (e.g., RBAC issue during startup), this logs the error but returns nil, so the manager won't retry and ActiveBuilds stays at 0. Consider returning the error so the manager retries, or at minimum logging at Error level to make the silent failure obvious. Currently it logs at Error level which is good, but returning nil means the manager considers the runnable finished successfully.
| }, | ||
| Rules: []rbacv1.PolicyRule{ | ||
| { | ||
| APIGroups: []string{""}, |
There was a problem hiding this comment.
The buildMetricsReaderClusterRoleBinding references a ClusterRole named metrics-reader which is defined in config/rbac/metrics_reader_role.yaml and deployed via kustomize at install time. If someone deploys the operator without the full kustomize stack (e.g., OLM bundle), will this ClusterRole exist?
I see that the kubebuilder scaffolding includes it in the RBAC config, so it should be bundled into OLM's CSV. Worth verifying this is included in the bundle manifests, or alternatively have the operator create this ClusterRole dynamically (like it does for other RBAC resources).
| }, | ||
| }, | ||
| } | ||
| } |
There was a problem hiding this comment.
Nit: The buildMetricsReaderRoleBinding grants access to prometheus-user-workload SA in openshift-user-workload-monitoring. This is OpenShift-specific. If someone enables monitoring on a vanilla Kubernetes cluster with a standalone Prometheus Operator, this binding won't work because the SA name and namespace will differ.
Since the PR description mentions this is for user-workload-monitoring on OpenShift, this is fine for the initial scope. Consider adding a comment here noting this is OpenShift-specific, and potentially gating behind detectOpenShift if vanilla k8s support is needed later.
| } | ||
|
|
||
| clusterBinding := r.buildMetricsReaderClusterRoleBinding(config.Namespace) | ||
| if err := r.createOrUpdate(ctx, clusterBinding, config); err != nil { |
There was a problem hiding this comment.
Good: the cleanupServiceMonitor correctly handles IsNoMatchError for the unstructured ServiceMonitor deletion, so it won't fail on clusters where the Prometheus Operator CRD is not installed. This is the right pattern.
However, the symmetric case in deployServiceMonitor does NOT have this guard. If someone enables monitoring on a cluster without the Prometheus Operator CRD installed, createOrUpdate for the ServiceMonitor will fail with a NoMatch error, and the entire reconciliation will fail. Consider catching IsNoMatchError in deployServiceMonitor and either logging a warning and skipping (graceful degradation) or setting a condition on the status to indicate the CRD is missing.
| "endpoints": []interface{}{ | ||
| map[string]interface{}{ | ||
| "path": "/metrics", | ||
| "port": "https", |
There was a problem hiding this comment.
Using insecureSkipVerify: true is consistent with the existing config/prometheus/monitor.yaml scaffold, so this is fine for now. The existing file has a TODO comment about this. You might want to add a similar TODO or comment here noting this should be addressed for production hardening.
| @@ -0,0 +1,151 @@ | |||
| # Automotive Dev Operator - Perses Dashboard | |||
There was a problem hiding this comment.
Nice addition of a sample Perses dashboard. The PromQL queries look correct and reference the metrics defined in internal/controller/imagebuild/metrics.go (ado_build_active, ado_build_total, ado_build_duration_seconds_sum/count). The clamp_min(..., 1) to avoid division by zero is a good pattern.
One note: the namespace is hardcoded to automotive-dev-operator-system in all queries. If someone deploys to a different namespace, they would need to update all queries. Consider adding a note about this in the header comments.
af84282 to
3745a86
Compare
Add MonitoringConfig to OperatorConfigSpec that deploys a ServiceMonitor, metrics token Secret, Role, and ClusterRoleBinding when enabled. Cleanup removes all resources when disabled. Uses unstructured types to avoid hard dependency on Prometheus Operator CRDs. Assisted-by: claude-sonnet-4-6 Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com>
Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com> Assisted-by: claude-opus-4.6
3745a86 to
cab2bd8
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/controller/imagebuild/controller.go (1)
1997-2034:⚠️ Potential issue | 🟠 MajorAdd a failing test for
seedActiveBuildsGaugebefore implementing the seeding logic, and fix the race between the seeder and reconciler.The coding guidelines require failing tests before implementation for
.gofiles. There is no test forseedActiveBuildsGaugeinmetrics_test.go.Additionally, a race condition exists: the seeder runs as a
manager.RunnableFuncregistered viamgr.Add(), executing concurrently with the reconciler after cache sync. The seeder callsActiveBuilds.Set(active)while the reconciler concurrently callsadjustActiveBuildsGauge(which callsInc()orDec()). Without synchronization, interleaving these operations can leave the gauge off-by-one for the controller's lifetime. For example, if the cache at T1 lists a build asBuilding, but the reconciler has already processedBuilding → Completed(callingDec), the seeder'sSetoverwrites the correct count with a stale snapshot.Options:
- Run the seed before reconciliations start (compute inline in
SetupWithManageror guard withsync.Once).- Or, ensure the seeder only initializes once and skips if
adjustActiveBuildsGaugehas already been called.- Or document the eventual-consistency semantics explicitly.
Write the failing test first, then implement a solution that eliminates the race.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/imagebuild/controller.go` around lines 1997 - 2034, The test suite is missing a failing test for seedActiveBuildsGauge and there is a race between the seeder and the reconciler (seedActiveBuildsGauge calling ActiveBuilds.Set while adjustActiveBuildsGauge may concurrently Inc/Dec), so first add a unit test in metrics_test.go that exercises seedActiveBuildsGauge (expecting it to set the gauge from existing ImageBuildList) and fails initially; then fix the race by ensuring seeding happens-before reconciliations start: move the seeding call into SetupWithManager (call the seeding logic synchronously before registering the controller) or protect the seeder with a sync.Once/atomic flag that prevents seedActiveBuildsGauge from calling ActiveBuilds.Set if adjustActiveBuildsGauge already ran (or vice versa), referencing seedActiveBuildsGauge, SetupWithManager, mgr.Add/manager.RunnableFunc, ActiveBuilds, and adjustActiveBuildsGauge so the test validates the chosen approach.
🧹 Nitpick comments (4)
internal/controller/operatorconfig/controller.go (2)
1039-1069: Consider parameterizing the bound SA / Prometheus namespace for non-OpenShift Prometheus Operator deployments.
deployServiceMonitorunconditionally creates aRoleBindingforprometheus-user-workloadinopenshift-user-workload-monitoring(viabuildMetricsReaderRoleBindinginresources.go). On vanilla Kubernetes with a standalone Prometheus Operator the SA name and namespace are different, so the binding is dead RBAC. Since this PR is scoped to OpenShift user-workload-monitoring, this is fine for now, but consider:
- Adding a TODO and/or guarding it behind
r.detectOpenShift(ctx, ...).- Allowing
MonitoringConfigto optionally specify the scrape SAname/namespacefor vanilla-Prometheus support later.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/operatorconfig/controller.go` around lines 1039 - 1069, deployServiceMonitor unconditionally creates a RoleBinding via buildMetricsReaderRoleBinding that binds to the OpenShift SA/namespace (prometheus-user-workload / openshift-user-workload-monitoring), which is incorrect on vanilla Kubernetes; update deployServiceMonitor to either guard creation of that RoleBinding with an OpenShift detection call (use r.detectOpenShift(ctx, ...) to skip building/applying that binding when not OpenShift) and add a TODO comment referencing this, or extend the MonitoringConfig (used in buildServiceMonitor) to optionally accept scrape SA name/namespace and prefer those values when present so vanilla Prometheus deployments can provide the correct SA; touch deployServiceMonitor, buildMetricsReaderRoleBinding and the MonitoringConfig definition to implement the guard/optional fields and ensure createOrUpdate uses the new conditional logic.
1122-1132: Reconciler does not own/watch the new monitoring resources.
SetupWithManageralready callsOwns(&corev1.ConfigMap{}),Owns(&corev1.ServiceAccount{}), etc., so other reconciled resources trigger requeues when modified. The new metrics tokenSecret,Role,RoleBinding, andClusterRoleBindingare not in theOwns()chain. If a cluster admin manually edits or deletes the metrics token Secret or the metrics-reader Role/RoleBinding, the reconciler won't notice until the next periodic resync.
Owns(&corev1.Secret{}),Owns(&rbacv1.Role{}),Owns(&rbacv1.RoleBinding{}), andOwns(&rbacv1.ClusterRoleBinding{})(or a watch with appropriate predicates) would close that gap. The unstructuredServiceMonitorrequires a customWatches(...)with the GVK — and is reasonable to skip given the optional CRD.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/operatorconfig/controller.go` around lines 1122 - 1132, The SetupWithManager in OperatorConfigReconciler currently Owns ConfigMap/ServiceAccount/etc but omits the monitoring RBAC and Secret resources, so changes to the metrics token or roles won't trigger reconciliation; update the SetupWithManager chain in the OperatorConfigReconciler to also Owns(&corev1.Secret{}), Owns(&rbacv1.Role{}), Owns(&rbacv1.RoleBinding{}), and Owns(&rbacv1.ClusterRoleBinding{}), and optionally add a custom Watches(...) for the ServiceMonitor GVK if you want to observe that unstructured CRD (keep the existing Complete(r) call).internal/controller/operatorconfig/resources_test.go (1)
166-218: Optional: also asserttlsConfigcontent on the generated endpoint.Tests cover the GVK, selector, bearer token reference, interval defaulting, path/port/scheme, but not
tlsConfig. Given there's an open hardening discussion aroundinsecureSkipVerify, locking down current behavior with an assertion makes future TLS changes (e.g., switching to a CA bundle) intentional rather than a silent regression.♻️ Suggested addition
It("should scrape /metrics on port https", func() { config := &automotivev1alpha1.MonitoringConfig{Enabled: true} sm := r.buildServiceMonitor("test-ns", config) endpoints := sm.Object["spec"].(map[string]any)["endpoints"].([]any) ep := endpoints[0].(map[string]any) Expect(ep["path"]).To(Equal("/metrics")) Expect(ep["port"]).To(Equal("https")) Expect(ep["scheme"]).To(Equal("https")) + tls := ep["tlsConfig"].(map[string]any) + Expect(tls["insecureSkipVerify"]).To(BeTrue()) })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/operatorconfig/resources_test.go` around lines 166 - 218, Add an assertion in the buildServiceMonitor tests to verify the generated ServiceMonitor endpoint's tlsConfig content so TLS behavior is locked down; in the existing Describe("buildServiceMonitor") cases (e.g., where sm := r.buildServiceMonitor("test-ns", config) and you extract endpoints/ep), access ep["tlsConfig"] and assert its fields (for example that insecureSkipVerify is the expected boolean and any ca/caFile/caSecret fields match current behavior) to prevent silent regressions when TLS handling changes.internal/controller/operatorconfig/resources.go (1)
1072-1141: Add standardapp.kubernetes.io/*labels to the new monitoring resources for consistency.The new metrics token Secret, Role, RoleBinding, and ClusterRoleBinding builders omit the standard label set (
app.kubernetes.io/name,app.kubernetes.io/component,app.kubernetes.io/part-of,app.kubernetes.io/managed-by) that every other resource builder in this file applies. This breaks the labeling convention used for selection, ownership tracking, and operational queries (e.g.,kubectl get -l app.kubernetes.io/part-of=automotive-dev-operator).♻️ Proposed labels (illustrative)
func (r *OperatorConfigReconciler) buildMetricsTokenSecret(namespace string) *corev1.Secret { return &corev1.Secret{ ObjectMeta: metav1.ObjectMeta{ Name: serviceMonitorTokenSecret, Namespace: namespace, + Labels: map[string]string{ + "app.kubernetes.io/name": "automotive-dev-operator", + "app.kubernetes.io/component": "monitoring", + "app.kubernetes.io/part-of": "automotive-dev-operator", + }, Annotations: map[string]string{ "kubernetes.io/service-account.name": "ado-operator", }, }, Type: corev1.SecretTypeServiceAccountToken, } }Apply the same label block to
buildMetricsReaderRole,buildMetricsReaderRoleBinding, andbuildMetricsReaderClusterRoleBinding.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/operatorconfig/resources.go` around lines 1072 - 1141, Add the standard app.kubernetes.io/* label block (app.kubernetes.io/name, app.kubernetes.io/component, app.kubernetes.io/part-of, app.kubernetes.io/managed-by) to the new monitoring resource builders—buildMetricsTokenSecret, buildMetricsReaderRole, buildMetricsReaderRoleBinding, and buildMetricsReaderClusterRoleBinding—by copying the same label map used by the other resource builders in this file (use the identical keys and values for name/component/part-of/managed-by so these resources follow the same selection and ownership conventions).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/operatorconfig/controller.go`:
- Around line 304-312: deployServiceMonitor currently returns nil when the
Prometheus CRD is missing which causes Reconcile to incorrectly set
config.Status.MonitoringEnabled = true; change deployServiceMonitor to return
either a sentinel error (e.g., errCRDMissing) or a (deployed bool, err error)
tuple and update the Reconcile call (where r.deployServiceMonitor(ctx, config)
is invoked) to only set config.Status.MonitoringEnabled = true when the
ServiceMonitor was actually deployed (deployed == true) and to treat
errCRDMissing as non-success (do not flip the status) while surfacing a clear
condition/message; ensure you reference the deployServiceMonitor function and
the Status.MonitoringEnabled field when making the change.
---
Outside diff comments:
In `@internal/controller/imagebuild/controller.go`:
- Around line 1997-2034: The test suite is missing a failing test for
seedActiveBuildsGauge and there is a race between the seeder and the reconciler
(seedActiveBuildsGauge calling ActiveBuilds.Set while adjustActiveBuildsGauge
may concurrently Inc/Dec), so first add a unit test in metrics_test.go that
exercises seedActiveBuildsGauge (expecting it to set the gauge from existing
ImageBuildList) and fails initially; then fix the race by ensuring seeding
happens-before reconciliations start: move the seeding call into
SetupWithManager (call the seeding logic synchronously before registering the
controller) or protect the seeder with a sync.Once/atomic flag that prevents
seedActiveBuildsGauge from calling ActiveBuilds.Set if adjustActiveBuildsGauge
already ran (or vice versa), referencing seedActiveBuildsGauge,
SetupWithManager, mgr.Add/manager.RunnableFunc, ActiveBuilds, and
adjustActiveBuildsGauge so the test validates the chosen approach.
---
Nitpick comments:
In `@internal/controller/operatorconfig/controller.go`:
- Around line 1039-1069: deployServiceMonitor unconditionally creates a
RoleBinding via buildMetricsReaderRoleBinding that binds to the OpenShift
SA/namespace (prometheus-user-workload / openshift-user-workload-monitoring),
which is incorrect on vanilla Kubernetes; update deployServiceMonitor to either
guard creation of that RoleBinding with an OpenShift detection call (use
r.detectOpenShift(ctx, ...) to skip building/applying that binding when not
OpenShift) and add a TODO comment referencing this, or extend the
MonitoringConfig (used in buildServiceMonitor) to optionally accept scrape SA
name/namespace and prefer those values when present so vanilla Prometheus
deployments can provide the correct SA; touch deployServiceMonitor,
buildMetricsReaderRoleBinding and the MonitoringConfig definition to implement
the guard/optional fields and ensure createOrUpdate uses the new conditional
logic.
- Around line 1122-1132: The SetupWithManager in OperatorConfigReconciler
currently Owns ConfigMap/ServiceAccount/etc but omits the monitoring RBAC and
Secret resources, so changes to the metrics token or roles won't trigger
reconciliation; update the SetupWithManager chain in the
OperatorConfigReconciler to also Owns(&corev1.Secret{}), Owns(&rbacv1.Role{}),
Owns(&rbacv1.RoleBinding{}), and Owns(&rbacv1.ClusterRoleBinding{}), and
optionally add a custom Watches(...) for the ServiceMonitor GVK if you want to
observe that unstructured CRD (keep the existing Complete(r) call).
In `@internal/controller/operatorconfig/resources_test.go`:
- Around line 166-218: Add an assertion in the buildServiceMonitor tests to
verify the generated ServiceMonitor endpoint's tlsConfig content so TLS behavior
is locked down; in the existing Describe("buildServiceMonitor") cases (e.g.,
where sm := r.buildServiceMonitor("test-ns", config) and you extract
endpoints/ep), access ep["tlsConfig"] and assert its fields (for example that
insecureSkipVerify is the expected boolean and any ca/caFile/caSecret fields
match current behavior) to prevent silent regressions when TLS handling changes.
In `@internal/controller/operatorconfig/resources.go`:
- Around line 1072-1141: Add the standard app.kubernetes.io/* label block
(app.kubernetes.io/name, app.kubernetes.io/component, app.kubernetes.io/part-of,
app.kubernetes.io/managed-by) to the new monitoring resource
builders—buildMetricsTokenSecret, buildMetricsReaderRole,
buildMetricsReaderRoleBinding, and buildMetricsReaderClusterRoleBinding—by
copying the same label map used by the other resource builders in this file (use
the identical keys and values for name/component/part-of/managed-by so these
resources follow the same selection and ownership conventions).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: da0003fd-4d87-4ce9-83c3-1669ffe99f49
⛔ Files ignored due to path filters (2)
config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yamlis excluded by!config/crd/bases/**config/rbac/role.yamlis excluded by!config/rbac/role.yaml
📒 Files selected for processing (11)
api/v1alpha1/operatorconfig_types.goapi/v1alpha1/zz_generated.deepcopy.goconfig/samples/automotive_v1_operatorconfig.yamlconfig/samples/monitoring/perses-dashboard.yamlconfig/samples/monitoring/uiplugin.yamlinternal/controller/imagebuild/controller.gointernal/controller/imagebuild/metrics.gointernal/controller/imagebuild/metrics_test.gointernal/controller/operatorconfig/controller.gointernal/controller/operatorconfig/resources.gointernal/controller/operatorconfig/resources_test.go
✅ Files skipped from review due to trivial changes (3)
- config/samples/monitoring/perses-dashboard.yaml
- config/samples/automotive_v1_operatorconfig.yaml
- config/samples/monitoring/uiplugin.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/controller/imagebuild/metrics.go
Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com> Assisted-by: claude-opus-4.6
cab2bd8 to
24ca715
Compare
Add MonitoringConfig to OperatorConfigSpec that deploys a ServiceMonitor, metrics token Secret, Role, and ClusterRoleBinding when enabled
Sample Perses dashboard for reference
Summary
Related Issues
Type of Change
Testing
make test)make lint)make manifests generate)Summary by CodeRabbit
New Features
Tests
Samples