Skip to content

optimizations and improvements to build process#185

Merged
bennyz merged 1 commit into
centos-automotive-suite:mainfrom
bennyz:bootc-build-optimization
Mar 29, 2026
Merged

optimizations and improvements to build process#185
bennyz merged 1 commit into
centos-automotive-suite:mainfrom
bennyz:bootc-build-optimization

Conversation

@bennyz

@bennyz bennyz commented Mar 29, 2026

Copy link
Copy Markdown
Contributor
  • cleanups and small performance improvements
  • measure time for simpler comparisons

Summary by CodeRabbit

  • New Features

    • Prometheus build metrics: total counts, overall duration, and per-phase timing; build timing is exported for pipeline visibility.
  • Configuration Changes

    • Metrics endpoint enabled by default and now listens on port 8443.
  • Improvements

    • Build captures timing/metadata asynchronously and overlaps container push with disk image creation.
    • Progress updates are non-blocking, output verbosity reduced, and in-memory volume handling for build outputs improved.
  • Tests

    • Unit tests added to validate metrics and timing behavior.

@coderabbitai

coderabbitai Bot commented Mar 29, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Enables the manager metrics endpoint by default, adds Prometheus collectors and controller logic to record build and phase timings from Tekton results, emits build timing from the build script (with async metadata capture), and adds tests and task/pipeline results for build timing.

Changes

Cohort / File(s) Summary
Metrics Endpoint Config
cmd/main.go, config/manager/manager.yaml, internal/controller/operatorconfig/resources.go
Default --metrics-bind-address changed from 0 to :8443; manager container now exposes port 8443 so metrics are enabled by default.
Build Script & Progress
internal/common/tasks/scripts/build_image.sh, internal/common/tasks/scripts/common.sh
Capture build start time and write structured JSON timing result to /tekton/results/build-timing; move AIB metadata resolution to background, run some push/annotation steps asynchronously, and make emit_progress non-blocking.
Tekton Task/Pipeline Results
internal/common/tasks/tasks.go
Added Tekton TaskResult build-timing and corresponding Pipeline Result; when using memory volumes, output-dir is converted to an in-memory EmptyDir alongside other memory-backed volumes.
Controller Metrics Implementation
internal/controller/imagebuild/metrics.go, internal/controller/imagebuild/controller.go
New Prometheus collectors (BuildDuration, BuildPhaseDuration, BuildTotal) registered; added recordBuildMetrics() and parsing of build-timing from PipelineRun to increment counters and observe histograms for total and phase durations.
Metrics Tests
internal/controller/imagebuild/metrics_test.go
New unit tests validating recordBuildMetrics behavior for success/failure, presence/absence/malformed build-timing, and histogram/counter observations.

Sequence Diagram(s)

sequenceDiagram
    participant Script as Build Script
    participant Tekton as Tekton Task
    participant PR as PipelineRun
    participant Controller as ImageBuild Controller
    participant Metrics as Prometheus Registry

    Script->>Script: record BUILD_START_TIME
    Script->>Script: start async metadata resolution (aib/skopeo)
    Script->>Tekton: write `/tekton/results/build-timing` (JSON)
    Tekton->>PR: persist results in PipelineRun.status.results

    Controller->>PR: detect ImageBuild completion/failure
    Controller->>PR: read `build-timing` result (if present)
    Controller->>Controller: compute durations and labels
    Controller->>Metrics: increment `BuildTotal` and observe `BuildDuration`
    Controller->>Metrics: observe `BuildPhaseDuration` for setup/build/post_build
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • bkhizgiy

Poem

🐰
I hopped through logs at morning light,
Timed the build, then ran from sight.
Async whispers, metadata spun,
Metrics bloom when work is done.
Hop—collect! —the pipeline's fun.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.82% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main purpose of the PR across multiple domains (metrics enablement, build timing, async operations, and performance optimizations).
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/controller/imagebuild/controller.go (1)

382-390: ⚠️ Potential issue | 🟠 Major

Record failure metrics after persisting phaseFailed.

At this point CompletionTime is still nil, so failed builds never contribute to BuildDuration. It also increments BuildTotal before the status transition is stored, which means a transient updateStatus failure will double-count on the retry. Move recordBuildMetrics after a successful failure status update and use the refreshed object.

Suggested ordering
 	// Build failed - cleanup transient secrets
 	r.cleanupTransientSecrets(ctx, imageBuild, r.Log)
-	recordBuildMetrics(imageBuild, pipelineRun, buildStatusFailure)

 	if err := r.updateStatus(ctx, imageBuild, phaseFailed, r.pipelineRunFailureDetail(ctx, pipelineRun)); err != nil {
 		log.Error(err, "Failed to update status to Failed")
 		return ctrl.Result{}, err
 	}
+
+	failed := &automotivev1alpha1.ImageBuild{}
+	if err := r.Get(ctx, types.NamespacedName{Name: imageBuild.Name, Namespace: imageBuild.Namespace}, failed); err == nil {
+		recordBuildMetrics(failed, pipelineRun, buildStatusFailure)
+	}
 	return ctrl.Result{}, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/imagebuild/controller.go` around lines 382 - 390, The
metrics are recorded before the failure status is persisted which causes
BuildDuration to be nil and risks double-counting on update retries; move the
call to recordBuildMetrics(imageBuild, pipelineRun, buildStatusFailure) to after
the successful r.updateStatus(...) call (use the refreshed imageBuild
returned/loaded after update), keep the r.cleanupTransientSecrets(ctx,
imageBuild, r.Log) and error handling for r.updateStatus as-is, and ensure
recordBuildMetrics uses the up-to-date CompletionTime on the imageBuild and only
runs after r.updateStatus returns no error.
internal/controller/operatorconfig/resources.go (1)

503-529: ⚠️ Potential issue | 🟠 Major

This build-controller endpoint still isn't discoverable by Prometheus.

This only turns on :8443 inside the pod. The generated build-controller resources here still don't expose a metrics Service/port, and the monitor shown in config/prometheus/monitor.yaml scrapes a different target (port: https on the existing operator monitor path). As written, the new image-build metrics never become scrapeable from the generated build-controller deployment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/operatorconfig/resources.go` around lines 503 - 529, The
build-controller container ("manager" in the deployment spec created using
getOperatorImage) currently only exposes the health port (8081) so Prometheus
can't scrape :8443; add a ContainerPort entry to the "manager" container with
Name "https" (or the exact port name your Prometheus monitor expects),
ContainerPort 8443 and Protocol TCP, and then ensure you create a Service for
the generated build-controller Deployment that exposes port "https" mapping to
targetPort 8443 with the same selector/labels as the Deployment so the existing
ServiceMonitor/prometheus monitor can discover and scrape the metrics endpoint;
update any monitor targetPort names if necessary to match the port name chosen.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/common/tasks/scripts/build_image.sh`:
- Around line 566-572: The script no longer preserves a fallback artifact for
regular-file outputs: when compress_file fails the later search only looks in
$WORKSPACE_PATH and can miss /output/${exportFile}; fix by either copying
regular files into the workspace like the directory case (restore the cp -rv
"/output/${exportFile}" "$WORKSPACE_PATH/" and set
DISK_IMAGE_SOURCE="$WORKSPACE_PATH" for non-directory files) or make
compress_file failures fatal (exit non-zero) so the task fails instead of
continuing without an artifact; update the logic around exportFile,
DISK_IMAGE_SOURCE and compress_file to implement one of these fixes.
- Around line 267-270: The current logic assumes AIB_IMAGE_REF is tag-based and
unconditionally strips the tag into AIB_BASE, which corrupts
already-digest-pinned refs (repo@sha256:...) and can produce
repo@sha256@sha256:...; change the write logic so if AIB_IMAGE_REF already
contains an '@' (digest form) you preserve and write AIB_IMAGE_REF (or use
AIB_IMAGE_REF directly) instead of computing AIB_BASE, otherwise compute
AIB_BASE from AIB_IMAGE_REF by removing a trailing :tag (as done now) and write
"${AIB_BASE}@${AIB_DIGEST}"; implement the check using AIB_IMAGE_REF, AIB_DIGEST
and AIB_BASE and write to /tmp/aib-pinned.txt accordingly to avoid double-@
outputs.
- Around line 399-402: The subshell cannot reliably wait on AIB_METADATA_PID (a
sibling process) so replace the wait $AIB_METADATA_PID call with a short polling
loop that waits for the metadata files to be written (e.g., /tmp/aib-version.txt
and /tmp/aib-pinned.txt) with a sensible timeout; then read those files into
_AIB_VERSION and _AIB_IMAGE_PINNED as currently done. Update the block that
currently uses wait $AIB_METADATA_PID and cat /tmp/aib-*.txt to instead poll for
file existence (or non-empty content), time out on failure, and then set
_AIB_VERSION and _AIB_IMAGE_PINNED from those files.

In `@internal/common/tasks/scripts/common.sh`:
- Around line 9-17: The background PATCH to the
"automotive.sdv.cloud.redhat.com/progress" annotation (used by emit_progress)
can arrive out-of-order; fix by serializing updates: in emit_progress, remove
the backgrounding (&) and run the curl synchronously or implement a simple
sequence guard (e.g., a small per-pod counter file locked with flock that you
increment and include as a prefix/field in the annotation payload) so each PATCH
carries a monotonically increasing sequence/timestamp; ensure the curl uses that
sequence value so the consumer can ignore stale lower-numbered updates.

In `@internal/controller/imagebuild/controller.go`:
- Around line 1618-1648: recordBuildMetrics is using free-form spec fields
(distro, target, format, arch) as Prometheus label values which allows unbounded
cardinality; change the metric labeling to use only bounded/enum-safe labels or
normalized buckets: replace direct uses of imageBuild.Spec.GetDistro(),
GetTarget(), GetExportFormat(), and Architecture with either (a)
validated/whitelisted values mapped to a small fixed set (e.g.,
normalizeDistro(distro) that returns known names or "other"), or (b) drop them
from labels entirely and only keep safe labels (e.g., mode and status). Update
all calls to BuildTotal.WithLabelValues, BuildDuration.WithLabelValues and
BuildPhaseDuration.WithLabelValues in recordBuildMetrics to use the
normalized/whitelisted values (or the reduced label list) and ensure any helper
normalize* functions are used consistently before labeling.

---

Outside diff comments:
In `@internal/controller/imagebuild/controller.go`:
- Around line 382-390: The metrics are recorded before the failure status is
persisted which causes BuildDuration to be nil and risks double-counting on
update retries; move the call to recordBuildMetrics(imageBuild, pipelineRun,
buildStatusFailure) to after the successful r.updateStatus(...) call (use the
refreshed imageBuild returned/loaded after update), keep the
r.cleanupTransientSecrets(ctx, imageBuild, r.Log) and error handling for
r.updateStatus as-is, and ensure recordBuildMetrics uses the up-to-date
CompletionTime on the imageBuild and only runs after r.updateStatus returns no
error.

In `@internal/controller/operatorconfig/resources.go`:
- Around line 503-529: The build-controller container ("manager" in the
deployment spec created using getOperatorImage) currently only exposes the
health port (8081) so Prometheus can't scrape :8443; add a ContainerPort entry
to the "manager" container with Name "https" (or the exact port name your
Prometheus monitor expects), ContainerPort 8443 and Protocol TCP, and then
ensure you create a Service for the generated build-controller Deployment that
exposes port "https" mapping to targetPort 8443 with the same selector/labels as
the Deployment so the existing ServiceMonitor/prometheus monitor can discover
and scrape the metrics endpoint; update any monitor targetPort names if
necessary to match the port name chosen.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 52c1cf6e-ab3e-42e9-8049-addf534512f1

📥 Commits

Reviewing files that changed from the base of the PR and between a8beb42 and 5434ab1.

📒 Files selected for processing (9)
  • cmd/main.go
  • config/manager/manager.yaml
  • internal/common/tasks/scripts/build_image.sh
  • internal/common/tasks/scripts/common.sh
  • internal/common/tasks/tasks.go
  • internal/controller/imagebuild/controller.go
  • internal/controller/imagebuild/metrics.go
  • internal/controller/imagebuild/metrics_test.go
  • internal/controller/operatorconfig/resources.go

Comment thread internal/common/tasks/scripts/build_image.sh Outdated
Comment thread internal/common/tasks/scripts/build_image.sh
Comment thread internal/common/tasks/scripts/build_image.sh
Comment thread internal/common/tasks/scripts/common.sh
Comment thread internal/controller/imagebuild/controller.go
@bennyz bennyz force-pushed the bootc-build-optimization branch from 5434ab1 to 3674d31 Compare March 29, 2026 08:34

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/common/tasks/scripts/build_image.sh (1)

561-562: Handle pushd failure.

Per Shellcheck (SC2164), pushd can fail silently. Add an error guard to avoid operating in the wrong directory.

💡 Suggested fix
-pushd /output
+pushd /output || { echo "Error: Failed to change to /output directory"; exit 1; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/common/tasks/scripts/build_image.sh` around lines 561 - 562, The
script currently calls pushd /output without checking for failure which can
leave subsequent commands running in the wrong directory; after the pushd
/output invocation (the line with "pushd /output") add an explicit error guard
that tests pushd's exit status and aborts (or prints an error and exits
non‑zero) if it fails before running mkdir -p "$WORKSPACE_PATH" so the script
never continues when changing directory failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/common/tasks/scripts/build_image.sh`:
- Around line 561-562: The script currently calls pushd /output without checking
for failure which can leave subsequent commands running in the wrong directory;
after the pushd /output invocation (the line with "pushd /output") add an
explicit error guard that tests pushd's exit status and aborts (or prints an
error and exits non‑zero) if it fails before running mkdir -p "$WORKSPACE_PATH"
so the script never continues when changing directory failed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a38d19c4-39e4-4ab8-a232-a8752b778f83

📥 Commits

Reviewing files that changed from the base of the PR and between 5434ab1 and 3674d31.

📒 Files selected for processing (9)
  • cmd/main.go
  • config/manager/manager.yaml
  • internal/common/tasks/scripts/build_image.sh
  • internal/common/tasks/scripts/common.sh
  • internal/common/tasks/tasks.go
  • internal/controller/imagebuild/controller.go
  • internal/controller/imagebuild/metrics.go
  • internal/controller/imagebuild/metrics_test.go
  • internal/controller/operatorconfig/resources.go
✅ Files skipped from review due to trivial changes (1)
  • cmd/main.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • config/manager/manager.yaml
  • internal/common/tasks/scripts/common.sh
  • internal/controller/operatorconfig/resources.go
  • internal/common/tasks/tasks.go
  • internal/controller/imagebuild/metrics.go

- cleanups and small performance improvements

Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com>
Assited-by: claude-opus-4.6
@bennyz bennyz force-pushed the bootc-build-optimization branch from 3674d31 to 23f5776 Compare March 29, 2026 08:43
@bennyz bennyz requested a review from bkhizgiy March 29, 2026 09:50
Comment thread internal/common/tasks/scripts/build_image.sh
@bennyz bennyz merged commit c8a1025 into centos-automotive-suite:main Mar 29, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants