optimizations and improvements to build process#185
Conversation
📝 WalkthroughWalkthroughEnables the manager metrics endpoint by default, adds Prometheus collectors and controller logic to record build and phase timings from Tekton results, emits build timing from the build script (with async metadata capture), and adds tests and task/pipeline results for build timing. Changes
Sequence Diagram(s)sequenceDiagram
participant Script as Build Script
participant Tekton as Tekton Task
participant PR as PipelineRun
participant Controller as ImageBuild Controller
participant Metrics as Prometheus Registry
Script->>Script: record BUILD_START_TIME
Script->>Script: start async metadata resolution (aib/skopeo)
Script->>Tekton: write `/tekton/results/build-timing` (JSON)
Tekton->>PR: persist results in PipelineRun.status.results
Controller->>PR: detect ImageBuild completion/failure
Controller->>PR: read `build-timing` result (if present)
Controller->>Controller: compute durations and labels
Controller->>Metrics: increment `BuildTotal` and observe `BuildDuration`
Controller->>Metrics: observe `BuildPhaseDuration` for setup/build/post_build
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/controller/imagebuild/controller.go (1)
382-390:⚠️ Potential issue | 🟠 MajorRecord failure metrics after persisting
phaseFailed.At this point
CompletionTimeis still nil, so failed builds never contribute toBuildDuration. It also incrementsBuildTotalbefore the status transition is stored, which means a transientupdateStatusfailure will double-count on the retry. MoverecordBuildMetricsafter a successful failure status update and use the refreshed object.Suggested ordering
// Build failed - cleanup transient secrets r.cleanupTransientSecrets(ctx, imageBuild, r.Log) - recordBuildMetrics(imageBuild, pipelineRun, buildStatusFailure) if err := r.updateStatus(ctx, imageBuild, phaseFailed, r.pipelineRunFailureDetail(ctx, pipelineRun)); err != nil { log.Error(err, "Failed to update status to Failed") return ctrl.Result{}, err } + + failed := &automotivev1alpha1.ImageBuild{} + if err := r.Get(ctx, types.NamespacedName{Name: imageBuild.Name, Namespace: imageBuild.Namespace}, failed); err == nil { + recordBuildMetrics(failed, pipelineRun, buildStatusFailure) + } return ctrl.Result{}, nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/imagebuild/controller.go` around lines 382 - 390, The metrics are recorded before the failure status is persisted which causes BuildDuration to be nil and risks double-counting on update retries; move the call to recordBuildMetrics(imageBuild, pipelineRun, buildStatusFailure) to after the successful r.updateStatus(...) call (use the refreshed imageBuild returned/loaded after update), keep the r.cleanupTransientSecrets(ctx, imageBuild, r.Log) and error handling for r.updateStatus as-is, and ensure recordBuildMetrics uses the up-to-date CompletionTime on the imageBuild and only runs after r.updateStatus returns no error.internal/controller/operatorconfig/resources.go (1)
503-529:⚠️ Potential issue | 🟠 MajorThis build-controller endpoint still isn't discoverable by Prometheus.
This only turns on
:8443inside the pod. The generated build-controller resources here still don't expose a metrics Service/port, and the monitor shown inconfig/prometheus/monitor.yamlscrapes a different target (port: httpson the existing operator monitor path). As written, the new image-build metrics never become scrapeable from the generated build-controller deployment.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/operatorconfig/resources.go` around lines 503 - 529, The build-controller container ("manager" in the deployment spec created using getOperatorImage) currently only exposes the health port (8081) so Prometheus can't scrape :8443; add a ContainerPort entry to the "manager" container with Name "https" (or the exact port name your Prometheus monitor expects), ContainerPort 8443 and Protocol TCP, and then ensure you create a Service for the generated build-controller Deployment that exposes port "https" mapping to targetPort 8443 with the same selector/labels as the Deployment so the existing ServiceMonitor/prometheus monitor can discover and scrape the metrics endpoint; update any monitor targetPort names if necessary to match the port name chosen.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/common/tasks/scripts/build_image.sh`:
- Around line 566-572: The script no longer preserves a fallback artifact for
regular-file outputs: when compress_file fails the later search only looks in
$WORKSPACE_PATH and can miss /output/${exportFile}; fix by either copying
regular files into the workspace like the directory case (restore the cp -rv
"/output/${exportFile}" "$WORKSPACE_PATH/" and set
DISK_IMAGE_SOURCE="$WORKSPACE_PATH" for non-directory files) or make
compress_file failures fatal (exit non-zero) so the task fails instead of
continuing without an artifact; update the logic around exportFile,
DISK_IMAGE_SOURCE and compress_file to implement one of these fixes.
- Around line 267-270: The current logic assumes AIB_IMAGE_REF is tag-based and
unconditionally strips the tag into AIB_BASE, which corrupts
already-digest-pinned refs (repo@sha256:...) and can produce
repo@sha256@sha256:...; change the write logic so if AIB_IMAGE_REF already
contains an '@' (digest form) you preserve and write AIB_IMAGE_REF (or use
AIB_IMAGE_REF directly) instead of computing AIB_BASE, otherwise compute
AIB_BASE from AIB_IMAGE_REF by removing a trailing :tag (as done now) and write
"${AIB_BASE}@${AIB_DIGEST}"; implement the check using AIB_IMAGE_REF, AIB_DIGEST
and AIB_BASE and write to /tmp/aib-pinned.txt accordingly to avoid double-@
outputs.
- Around line 399-402: The subshell cannot reliably wait on AIB_METADATA_PID (a
sibling process) so replace the wait $AIB_METADATA_PID call with a short polling
loop that waits for the metadata files to be written (e.g., /tmp/aib-version.txt
and /tmp/aib-pinned.txt) with a sensible timeout; then read those files into
_AIB_VERSION and _AIB_IMAGE_PINNED as currently done. Update the block that
currently uses wait $AIB_METADATA_PID and cat /tmp/aib-*.txt to instead poll for
file existence (or non-empty content), time out on failure, and then set
_AIB_VERSION and _AIB_IMAGE_PINNED from those files.
In `@internal/common/tasks/scripts/common.sh`:
- Around line 9-17: The background PATCH to the
"automotive.sdv.cloud.redhat.com/progress" annotation (used by emit_progress)
can arrive out-of-order; fix by serializing updates: in emit_progress, remove
the backgrounding (&) and run the curl synchronously or implement a simple
sequence guard (e.g., a small per-pod counter file locked with flock that you
increment and include as a prefix/field in the annotation payload) so each PATCH
carries a monotonically increasing sequence/timestamp; ensure the curl uses that
sequence value so the consumer can ignore stale lower-numbered updates.
In `@internal/controller/imagebuild/controller.go`:
- Around line 1618-1648: recordBuildMetrics is using free-form spec fields
(distro, target, format, arch) as Prometheus label values which allows unbounded
cardinality; change the metric labeling to use only bounded/enum-safe labels or
normalized buckets: replace direct uses of imageBuild.Spec.GetDistro(),
GetTarget(), GetExportFormat(), and Architecture with either (a)
validated/whitelisted values mapped to a small fixed set (e.g.,
normalizeDistro(distro) that returns known names or "other"), or (b) drop them
from labels entirely and only keep safe labels (e.g., mode and status). Update
all calls to BuildTotal.WithLabelValues, BuildDuration.WithLabelValues and
BuildPhaseDuration.WithLabelValues in recordBuildMetrics to use the
normalized/whitelisted values (or the reduced label list) and ensure any helper
normalize* functions are used consistently before labeling.
---
Outside diff comments:
In `@internal/controller/imagebuild/controller.go`:
- Around line 382-390: The metrics are recorded before the failure status is
persisted which causes BuildDuration to be nil and risks double-counting on
update retries; move the call to recordBuildMetrics(imageBuild, pipelineRun,
buildStatusFailure) to after the successful r.updateStatus(...) call (use the
refreshed imageBuild returned/loaded after update), keep the
r.cleanupTransientSecrets(ctx, imageBuild, r.Log) and error handling for
r.updateStatus as-is, and ensure recordBuildMetrics uses the up-to-date
CompletionTime on the imageBuild and only runs after r.updateStatus returns no
error.
In `@internal/controller/operatorconfig/resources.go`:
- Around line 503-529: The build-controller container ("manager" in the
deployment spec created using getOperatorImage) currently only exposes the
health port (8081) so Prometheus can't scrape :8443; add a ContainerPort entry
to the "manager" container with Name "https" (or the exact port name your
Prometheus monitor expects), ContainerPort 8443 and Protocol TCP, and then
ensure you create a Service for the generated build-controller Deployment that
exposes port "https" mapping to targetPort 8443 with the same selector/labels as
the Deployment so the existing ServiceMonitor/prometheus monitor can discover
and scrape the metrics endpoint; update any monitor targetPort names if
necessary to match the port name chosen.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 52c1cf6e-ab3e-42e9-8049-addf534512f1
📒 Files selected for processing (9)
cmd/main.goconfig/manager/manager.yamlinternal/common/tasks/scripts/build_image.shinternal/common/tasks/scripts/common.shinternal/common/tasks/tasks.gointernal/controller/imagebuild/controller.gointernal/controller/imagebuild/metrics.gointernal/controller/imagebuild/metrics_test.gointernal/controller/operatorconfig/resources.go
5434ab1 to
3674d31
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/common/tasks/scripts/build_image.sh (1)
561-562: Handlepushdfailure.Per Shellcheck (SC2164),
pushdcan fail silently. Add an error guard to avoid operating in the wrong directory.💡 Suggested fix
-pushd /output +pushd /output || { echo "Error: Failed to change to /output directory"; exit 1; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/common/tasks/scripts/build_image.sh` around lines 561 - 562, The script currently calls pushd /output without checking for failure which can leave subsequent commands running in the wrong directory; after the pushd /output invocation (the line with "pushd /output") add an explicit error guard that tests pushd's exit status and aborts (or prints an error and exits non‑zero) if it fails before running mkdir -p "$WORKSPACE_PATH" so the script never continues when changing directory failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/common/tasks/scripts/build_image.sh`:
- Around line 561-562: The script currently calls pushd /output without checking
for failure which can leave subsequent commands running in the wrong directory;
after the pushd /output invocation (the line with "pushd /output") add an
explicit error guard that tests pushd's exit status and aborts (or prints an
error and exits non‑zero) if it fails before running mkdir -p "$WORKSPACE_PATH"
so the script never continues when changing directory failed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a38d19c4-39e4-4ab8-a232-a8752b778f83
📒 Files selected for processing (9)
cmd/main.goconfig/manager/manager.yamlinternal/common/tasks/scripts/build_image.shinternal/common/tasks/scripts/common.shinternal/common/tasks/tasks.gointernal/controller/imagebuild/controller.gointernal/controller/imagebuild/metrics.gointernal/controller/imagebuild/metrics_test.gointernal/controller/operatorconfig/resources.go
✅ Files skipped from review due to trivial changes (1)
- cmd/main.go
🚧 Files skipped from review as they are similar to previous changes (5)
- config/manager/manager.yaml
- internal/common/tasks/scripts/common.sh
- internal/controller/operatorconfig/resources.go
- internal/common/tasks/tasks.go
- internal/controller/imagebuild/metrics.go
- cleanups and small performance improvements Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com> Assited-by: claude-opus-4.6
3674d31 to
23f5776
Compare
Summary by CodeRabbit
New Features
Configuration Changes
Improvements
Tests