optimizations and improvements to build process by bennyz · Pull Request #185 · centos-automotive-suite/automotive-dev-operator

bennyz · 2026-03-29T07:33:16Z

cleanups and small performance improvements
measure time for simpler comparisons

Summary by CodeRabbit

New Features
- Prometheus build metrics: total counts, overall duration, and per-phase timing; build timing is exported for pipeline visibility.
Configuration Changes
- Metrics endpoint enabled by default and now listens on port 8443.
Improvements
- Build captures timing/metadata asynchronously and overlaps container push with disk image creation.
- Progress updates are non-blocking, output verbosity reduced, and in-memory volume handling for build outputs improved.
Tests
- Unit tests added to validate metrics and timing behavior.

coderabbitai · 2026-03-29T07:33:29Z


Walkthrough

Enables the manager metrics endpoint by default, adds Prometheus collectors and controller logic to record build and phase timings from Tekton results, emits build timing from the build script (with async metadata capture), and adds tests and task/pipeline results for build timing.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Metrics Endpoint Config: `cmd/main.go`, `config/manager/manager.yaml`, `internal/controller/operatorconfig/resources.go` | Default `--metrics-bind-address` changed from `0` to `:8443`; manager container now exposes port `8443` so metrics are enabled by default. |
| Build Script & Progress: `internal/common/tasks/scripts/build_image.sh`, `internal/common/tasks/scripts/common.sh` | Capture build start time and write structured JSON timing result to `/tekton/results/build-timing`; move AIB metadata resolution to background, run some push/annotation steps asynchronously, and make `emit_progress` non-blocking. |
| Tekton Task/Pipeline Results: `internal/common/tasks/tasks.go` | Added Tekton `TaskResult` `build-timing` and corresponding Pipeline `Result`; when using memory volumes, `output-dir` is converted to an in-memory EmptyDir alongside other memory-backed volumes. |
| Controller Metrics Implementation: `internal/controller/imagebuild/metrics.go`, `internal/controller/imagebuild/controller.go` | New Prometheus collectors (`BuildDuration`, `BuildPhaseDuration`, `BuildTotal`) registered; added `recordBuildMetrics()` and parsing of `build-timing` from PipelineRun to increment counters and observe histograms for total and phase durations. |
| Metrics Tests: `internal/controller/imagebuild/metrics_test.go` | New unit tests validating `recordBuildMetrics` behavior for success/failure, presence/absence/malformed `build-timing`, and histogram/counter observations. |

Sequence Diagram(s)

sequenceDiagram
    participant Script as Build Script
    participant Tekton as Tekton Task
    participant PR as PipelineRun
    participant Controller as ImageBuild Controller
    participant Metrics as Prometheus Registry

    Script->>Script: record BUILD_START_TIME
    Script->>Script: start async metadata resolution (aib/skopeo)
    Script->>Tekton: write `/tekton/results/build-timing` (JSON)
    Tekton->>PR: persist results in PipelineRun.status.results

    Controller->>PR: detect ImageBuild completion/failure
    Controller->>PR: read `build-timing` result (if present)
    Controller->>Controller: compute durations and labels
    Controller->>Metrics: increment `BuildTotal` and observe `BuildDuration`
    Controller->>Metrics: observe `BuildPhaseDuration` for setup/build/post_build
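
The script side of this flow (record a start time, then write a JSON timing result) could look roughly like the sketch below. It is not taken from the PR: the function name `write_build_timing`, the JSON field names, and the idea of passing phase boundaries as epoch seconds are all assumptions for illustration. Only the result path `/tekton/results/build-timing` and the phase names `setup`, `build`, and `post_build` come from the walkthrough above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): writes per-phase durations as JSON.
# Arguments are epoch seconds marking phase boundaries:
#   $1 build start, $2 end of setup, $3 end of build, $4 end of post-build
#   $5 optional output path (defaults to the Tekton result file)
# The JSON field names below are assumptions, not the PR's actual schema.
write_build_timing() {
  local start="$1" setup_end="$2" build_end="$3" end="$4"
  local out="${5:-/tekton/results/build-timing}"
  # No trailing newline, so the Tekton result value is exactly the JSON text.
  printf '{"total_seconds":%d,"phases":{"setup":%d,"build":%d,"post_build":%d}}' \
    "$((end - start))" \
    "$((setup_end - start))" \
    "$((build_end - setup_end))" \
    "$((end - build_end))" > "$out"
}

# Example call site (commented out so sourcing this file has no side effects):
#   BUILD_START_TIME=$(date +%s)
#   ... setup ...;      SETUP_END=$(date +%s)
#   ... build ...;      BUILD_END=$(date +%s)
#   ... post-build ...; write_build_timing "$BUILD_START_TIME" "$SETUP_END" "$BUILD_END" "$(date +%s)"
```

Keeping the payload small matters here because Tekton results have a limited size, so a flat object with a handful of integers is a reasonable shape for the controller to parse.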

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

bkhizgiy

Poem

🐰
I hopped through logs at morning light,
Timed the build, then ran from sight.
Async whispers, metadata spun,
Metrics bloom when work is done.
Hop—collect! —the pipeline's fun.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.82% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main purpose of the PR across multiple domains (metrics enablement, build timing, async operations, and performance optimizations). |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

internal/controller/imagebuild/controller.go (1)
382-390: ⚠️ Potential issue | 🟠 Major

Record failure metrics after persisting phaseFailed.

At this point CompletionTime is still nil, so failed builds never contribute to BuildDuration. It also increments BuildTotal before the status transition is stored, which means a transient updateStatus failure will double-count on the retry. Move recordBuildMetrics after a successful failure status update and use the refreshed object.
Suggested ordering
 	// Build failed - cleanup transient secrets
 	r.cleanupTransientSecrets(ctx, imageBuild, r.Log)
-	recordBuildMetrics(imageBuild, pipelineRun, buildStatusFailure)

 	if err := r.updateStatus(ctx, imageBuild, phaseFailed, r.pipelineRunFailureDetail(ctx, pipelineRun)); err != nil {
 		log.Error(err, "Failed to update status to Failed")
 		return ctrl.Result{}, err
 	}
+
+	failed := &automotivev1alpha1.ImageBuild{}
+	if err := r.Get(ctx, types.NamespacedName{Name: imageBuild.Name, Namespace: imageBuild.Namespace}, failed); err == nil {
+		recordBuildMetrics(failed, pipelineRun, buildStatusFailure)
+	}
 	return ctrl.Result{}, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/imagebuild/controller.go` around lines 382 - 390, The
metrics are recorded before the failure status is persisted, so CompletionTime
is still nil (failed builds never contribute to BuildDuration) and BuildTotal
can be double-counted when an update is retried; move the call to
recordBuildMetrics(imageBuild, pipelineRun, buildStatusFailure) to after the
successful r.updateStatus(...) call (use the refreshed imageBuild
returned/loaded after update), keep the r.cleanupTransientSecrets(ctx,
imageBuild, r.Log) and error handling for r.updateStatus as-is, and ensure
recordBuildMetrics uses the up-to-date CompletionTime on the imageBuild and only
runs after r.updateStatus returns no error.
internal/controller/operatorconfig/resources.go (1)
503-529: ⚠️ Potential issue | 🟠 Major

This build-controller endpoint still isn't discoverable by Prometheus.

This only turns on :8443 inside the pod. The generated build-controller resources here still don't expose a metrics Service/port, and the monitor shown in config/prometheus/monitor.yaml scrapes a different target (port: https on the existing operator monitor path). As written, the new image-build metrics never become scrapeable from the generated build-controller deployment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/resources.go` around lines 503 - 529, The
build-controller container ("manager" in the deployment spec created using
getOperatorImage) currently only exposes the health port (8081) so Prometheus
can't scrape :8443; add a ContainerPort entry to the "manager" container with
Name "https" (or the exact port name your Prometheus monitor expects),
ContainerPort 8443 and Protocol TCP, and then ensure you create a Service for
the generated build-controller Deployment that exposes port "https" mapping to
targetPort 8443 with the same selector/labels as the Deployment so the existing
ServiceMonitor/prometheus monitor can discover and scrape the metrics endpoint;
update any monitor targetPort names if necessary to match the port name chosen.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/common/tasks/scripts/build_image.sh`:
- Around line 566-572: The script no longer preserves a fallback artifact for
regular-file outputs: when compress_file fails the later search only looks in
$WORKSPACE_PATH and can miss /output/${exportFile}; fix by either copying
regular files into the workspace like the directory case (restore the cp -rv
"/output/${exportFile}" "$WORKSPACE_PATH/" and set
DISK_IMAGE_SOURCE="$WORKSPACE_PATH" for non-directory files) or make
compress_file failures fatal (exit non-zero) so the task fails instead of
continuing without an artifact; update the logic around exportFile,
DISK_IMAGE_SOURCE and compress_file to implement one of these fixes.
- Around line 267-270: The current logic assumes AIB_IMAGE_REF is tag-based and
unconditionally strips the tag into AIB_BASE, which corrupts
already-digest-pinned refs (repo@sha256:...) and can produce
repo@sha256@sha256:...; change the write logic so if AIB_IMAGE_REF already
contains an '@' (digest form) you preserve and write AIB_IMAGE_REF (or use
AIB_IMAGE_REF directly) instead of computing AIB_BASE, otherwise compute
AIB_BASE from AIB_IMAGE_REF by removing a trailing :tag (as done now) and write
"${AIB_BASE}@${AIB_DIGEST}"; implement the check using AIB_IMAGE_REF, AIB_DIGEST
and AIB_BASE and write to /tmp/aib-pinned.txt accordingly to avoid double-@
outputs.
- Around line 399-402: The subshell cannot reliably wait on AIB_METADATA_PID (a
sibling process) so replace the wait $AIB_METADATA_PID call with a short polling
loop that waits for the metadata files to be written (e.g., /tmp/aib-version.txt
and /tmp/aib-pinned.txt) with a sensible timeout; then read those files into
_AIB_VERSION and _AIB_IMAGE_PINNED as currently done. Update the block that
currently uses wait $AIB_METADATA_PID and cat /tmp/aib-*.txt to instead poll for
file existence (or non-empty content), time out on failure, and then set
_AIB_VERSION and _AIB_IMAGE_PINNED from those files.
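
The digest-pinning and metadata-wait fixes requested above can be sketched as two small helpers. This is a minimal illustration, not the code that landed in the PR: `pin_image_ref` and `wait_for_files` are hypothetical names, and the one-second poll interval is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are not from the PR).

# pin_image_ref REF DIGEST
# Prints a digest-pinned image reference without ever producing a double '@'.
pin_image_ref() {
  local ref="$1" digest="$2" base last
  case "$ref" in
    *@*) printf '%s\n' "$ref"; return 0 ;;  # already digest-pinned: keep as-is
  esac
  base="$ref"
  last="${ref##*/}"            # final path component, e.g. "aib:latest"
  case "$last" in
    *:*) base="${ref%:*}" ;;   # strip a trailing :tag, but not a registry port
  esac
  printf '%s@%s\n' "$base" "$digest"
}

# wait_for_files TIMEOUT_SECONDS FILE...
# Polls until every file exists and is non-empty; returns 1 on timeout.
# Useful where `wait PID` cannot work because PID is not a child of this shell.
wait_for_files() {
  local timeout="$1" waited=0 f missing
  shift
  while :; do
    missing=0
    for f in "$@"; do
      [ -s "$f" ] || missing=1
    done
    [ "$missing" -eq 0 ] && return 0
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}
```

A call site might then run `wait_for_files 60 /tmp/aib-version.txt /tmp/aib-pinned.txt` before reading the two files. Note that a non-empty check only shows that writing has started, so the background writer should produce each file in a single write.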

In `@internal/common/tasks/scripts/common.sh`:
- Around line 9-17: The background PATCH to the
"automotive.sdv.cloud.redhat.com/progress" annotation (used by emit_progress)
can arrive out-of-order; fix by serializing updates: in emit_progress, remove
the backgrounding (&) and run the curl synchronously or implement a simple
sequence guard (e.g., a small per-pod counter file locked with flock that you
increment and include as a prefix/field in the annotation payload) so each PATCH
carries a monotonically increasing sequence/timestamp; ensure the curl uses that
sequence value so the consumer can ignore stale lower-numbered updates.
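
The sequence-guard option could be sketched as follows, assuming the `flock` utility from util-linux is available in the task image. The function name `next_progress_seq` and the counter-file layout are hypothetical, and how the number is embedded in the annotation payload is left to `emit_progress`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): returns a monotonically increasing
# sequence number backed by a counter file, serialized with flock(1).

# next_progress_seq COUNTER_FILE
next_progress_seq() {
  local counter="$1" seq
  (
    flock -x 9                            # exclusive lock, released when fd 9 closes
    seq=$(cat "$counter" 2>/dev/null || true)
    seq=$(( ${seq:-0} + 1 ))              # a missing or empty file starts at 1
    printf '%s\n' "$seq" > "$counter"
    printf '%s\n' "$seq"
  ) 9>"${counter}.lock"
}
```

Each background PATCH would call `next_progress_seq` before it is launched and carry the result in its payload, so a consumer can drop any update whose sequence number is lower than the last one it saw.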

In `@internal/controller/imagebuild/controller.go`:
- Around line 1618-1648: recordBuildMetrics is using free-form spec fields
(distro, target, format, arch) as Prometheus label values which allows unbounded
cardinality; change the metric labeling to use only bounded/enum-safe labels or
normalized buckets: replace direct uses of imageBuild.Spec.GetDistro(),
GetTarget(), GetExportFormat(), and Architecture with either (a)
validated/whitelisted values mapped to a small fixed set (e.g.,
normalizeDistro(distro) that returns known names or "other"), or (b) drop them
from labels entirely and only keep safe labels (e.g., mode and status). Update
all calls to BuildTotal.WithLabelValues, BuildDuration.WithLabelValues and
BuildPhaseDuration.WithLabelValues in recordBuildMetrics to use the
normalized/whitelisted values (or the reduced label list) and ensure any helper
normalize* functions are used consistently before labeling.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 52c1cf6e-ab3e-42e9-8049-addf534512f1

📥 Commits

Reviewing files that changed from the base of the PR and between a8beb42 and 5434ab1.

📒 Files selected for processing (9)

cmd/main.go
config/manager/manager.yaml
internal/common/tasks/scripts/build_image.sh
internal/common/tasks/scripts/common.sh
internal/common/tasks/tasks.go
internal/controller/imagebuild/controller.go
internal/controller/imagebuild/metrics.go
internal/controller/imagebuild/metrics_test.go
internal/controller/operatorconfig/resources.go

coderabbitai

🧹 Nitpick comments (1)

internal/common/tasks/scripts/build_image.sh (1)

561-562: Handle pushd failure.

Per Shellcheck (SC2164), pushd can fail silently. Add an error guard to avoid operating in the wrong directory.

💡 Suggested fix

-pushd /output
+pushd /output || { echo "Error: Failed to change to /output directory"; exit 1; }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/common/tasks/scripts/build_image.sh` around lines 561 - 562, The
script currently calls pushd /output without checking for failure which can
leave subsequent commands running in the wrong directory; after the pushd
/output invocation (the line with "pushd /output") add an explicit error guard
that tests pushd's exit status and aborts (or prints an error and exits
non‑zero) if it fails before running mkdir -p "$WORKSPACE_PATH" so the script
never continues when changing directory failed.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a38d19c4-39e4-4ab8-a232-a8752b778f83

📥 Commits

Reviewing files that changed from the base of the PR and between 5434ab1 and 3674d31.

📒 Files selected for processing (9)

cmd/main.go
config/manager/manager.yaml
internal/common/tasks/scripts/build_image.sh
internal/common/tasks/scripts/common.sh
internal/common/tasks/tasks.go
internal/controller/imagebuild/controller.go
internal/controller/imagebuild/metrics.go
internal/controller/imagebuild/metrics_test.go
internal/controller/operatorconfig/resources.go

✅ Files skipped from review due to trivial changes (1)

cmd/main.go

🚧 Files skipped from review as they are similar to previous changes (5)

config/manager/manager.yaml
internal/common/tasks/scripts/common.sh
internal/controller/operatorconfig/resources.go
internal/common/tasks/tasks.go
internal/controller/imagebuild/metrics.go

- cleanups and small performance improvements Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com> Assisted-by: claude-opus-4.6

coderabbitai Bot reviewed Mar 29, 2026


bennyz force-pushed the bootc-build-optimization branch from 5434ab1 to 3674d31 Compare March 29, 2026 08:34

coderabbitai Bot reviewed Mar 29, 2026


optimizations and improvements to build process

23f5776

- cleanups and small performance improvements Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com> Assisted-by: claude-opus-4.6

bennyz force-pushed the bootc-build-optimization branch from 3674d31 to 23f5776 Compare March 29, 2026 08:43

bennyz requested a review from bkhizgiy March 29, 2026 09:50

bkhizgiy reviewed Mar 29, 2026


Comment thread internal/common/tasks/scripts/build_image.sh

bkhizgiy approved these changes Mar 29, 2026


bennyz merged commit c8a1025 into centos-automotive-suite:main Mar 29, 2026
4 checks passed

This was referenced Mar 29, 2026

add toggle for PVC usage in builds #187

Merged

remove tar based upload in favor of cat based #189

Merged

This was referenced Apr 12, 2026

add Tekton Chains type hints and osbuild manifest attestation #197

Merged

feat: add SoftwareBuild CRD for multi-OS software builds #199

Open

This was referenced Apr 23, 2026

feat: add optional Prometheus monitoring via OperatorConfig #251

Merged

initial tracing and log ingestion setup #260

Merged


bennyz merged 1 commit into centos-automotive-suite:main from bennyz:bootc-build-optimization
