Skip to content

Add HostLease controller for bare metal management via Ironic#1

Open
DanNiESh wants to merge 2 commits intoosac-project:mainfrom
DanNiESh:main
Open

Add HostLease controller for bare metal management via Ironic#1
DanNiESh wants to merge 2 commits intoosac-project:mainfrom
DanNiESh:main

Conversation

@DanNiESh
Copy link
Copy Markdown

@DanNiESh DanNiESh commented Apr 3, 2026

Summary

Summary by CodeRabbit

  • New Features

    • Kubernetes controller for managing OpenStack baremetal nodes through Ironic API
    • Kubernetes deployment manifests with RBAC, metrics exposure, and pod security policies
    • Makefile with build, test, deploy, and Docker commands
  • Tests

    • End-to-end test suite validating controller deployment and metrics functionality
  • Documentation

    • Comprehensive README with local testing and deployment instructions
  • Chores

    • Project setup: Go modules, GitHub Actions workflows, Docker containerization, linting configuration

Sets up the foundation for a Kubernetes operator using controller-runtime v0.21.0, including RBAC configurations, metrics server, webhooks support, CI/CD workflows, and e2e testing infrastructure.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This pull request introduces a complete Kubernetes controller project for managing OpenStack bare metal node power states through Ironic. It includes application infrastructure (Dockerfile, Makefile, go.mod), a controller implementation reconciling HostLease custom resources with Ironic nodes, comprehensive Kubernetes deployment manifests (RBAC, metrics exposure, network policies), GitHub Actions workflows for CI/CD automation, and end-to-end test suites. Configuration files for linting, containerization, and version control standards complete the foundational setup.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The pull request title directly and clearly describes the main change: introducing a HostLease controller for managing bare metal nodes via Ironic.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

🧹 Nitpick comments (16)
.devcontainer/post-install.sh (1)

4-15: Unpinned tool versions and missing checksum verification.

Downloading latest for kind/kubebuilder (and dynamic stable for kubectl) makes the DevContainer non-reproducible:

  • Builds may break unexpectedly when upstream releases a new version.
  • No integrity check means a compromised download could go unnoticed.

Consider pinning versions (e.g., via environment variables or constants) and verifying SHA256 checksums. The Makefile already lacks version pinning (see KIND_CLUSTER in Makefile:68-86), so this is an opportunity to centralize version management.

Example with pinned versions and checksum
 #!/bin/bash
-set -x
+set -ex
 
-curl -Lo ./kind https://kind.sigs.k8s.io/dl/latest/kind-linux-amd64
+KIND_VERSION="${KIND_VERSION:-v0.23.0}"
+curl -Lo ./kind "https://kind.sigs.k8s.io/dl/${KIND_VERSION}/kind-linux-amd64"
+echo "<expected_sha256>  ./kind" | sha256sum -c -
 chmod +x ./kind
 mv ./kind /usr/local/bin/kind
 
-curl -L -o kubebuilder https://go.kubebuilder.io/dl/latest/linux/amd64
+KUBEBUILDER_VERSION="${KUBEBUILDER_VERSION:-v4.0.0}"
+curl -L -o kubebuilder "https://go.kubebuilder.io/dl/${KUBEBUILDER_VERSION}/linux/amd64"
 chmod +x kubebuilder
 mv kubebuilder /usr/local/bin/
 
-KUBECTL_VERSION=$(curl -L -s https://dl.k8s.io/release/stable.txt)
+KUBECTL_VERSION="${KUBECTL_VERSION:-$(curl -L -s https://dl.k8s.io/release/stable.txt)}"
 curl -LO "https://dl.k8s.io/release/$KUBECTL_VERSION/bin/linux/amd64/kubectl"
+curl -LO "https://dl.k8s.io/release/$KUBECTL_VERSION/bin/linux/amd64/kubectl.sha256"
+echo "$(cat kubectl.sha256)  kubectl" | sha256sum -c -
 chmod +x kubectl
 mv kubectl /usr/local/bin/kubectl
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.devcontainer/post-install.sh around lines 4 - 15, The script downloads
unpinned "latest" artifacts (kind, kubebuilder) and dynamically resolves
KUBECTL_VERSION without checksum verification, making builds non-reproducible
and insecure; update .devcontainer/post-install.sh to read pinned version
variables (e.g., KIND_VERSION, KUBEBUILDER_VERSION, KUBECTL_VERSION) from a
single source (or export defaults at top), use those exact versioned URLs
instead of "latest" or dynamic stable.txt, and for each artifact fetch the
corresponding SHA256 checksum file and verify it (compare computed sha256sum to
the expected value) before chmod/moving; reference the commands/variables in
this file (kind, kubebuilder, KUBECTL_VERSION) and also centralize these
versions into the Makefile or a shared env so Makefile's KIND_CLUSTER uses the
same pinned versions.
.devcontainer/devcontainer.json (1)

13-15: Deprecated VSCode setting.

terminal.integrated.shell.linux has been deprecated in favor of terminal profiles. Consider updating to avoid warnings in newer VSCode versions.

Proposed fix
       "settings": {
-        "terminal.integrated.shell.linux": "/bin/bash"
+        "terminal.integrated.defaultProfile.linux": "bash"
       },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.devcontainer/devcontainer.json around lines 13 - 15, Replace the deprecated
"terminal.integrated.shell.linux" setting with the newer terminal profile keys:
set "terminal.integrated.defaultProfile.linux" to "bash" (or your desired
profile name) and, if needed, define "terminal.integrated.profiles.linux" with a
"bash" entry pointing to "/bin/bash" so VSCode uses the profile instead of the
deprecated key; update the .devcontainer JSON to remove
"terminal.integrated.shell.linux" and add the two new keys
("terminal.integrated.defaultProfile.linux" and optional
"terminal.integrated.profiles.linux") accordingly.
.github/workflows/test.yml (2)

3-5: Consider scoping trigger branches to reduce CI noise.

The empty push: and pull_request: triggers run on all branches. For a typical workflow, you might want to limit push to specific branches (e.g., main) while keeping pull_request broad.

Example scoped triggers
 on:
-  push:
-  pull_request:
+  push:
+    branches: [main]
+  pull_request:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yml around lines 3 - 5, The workflow's top-level
triggers are unscoped because the on: push: and pull_request: entries run on all
branches; update the on: block to restrict push to the main (or other release)
branches and optionally narrow pull_request if desired by adding branches under
push (e.g., branches: - main) and/or under pull_request (e.g., branches: - main
- develop) so CI only runs on the intended branches; edit the YAML's on/push and
on/pull_request entries (symbols: on, push, pull_request) to include the
branches lists.

20-23: Running go mod tidy in CI may cause unexpected failures.

If go mod tidy modifies go.mod or go.sum, it indicates the committed files are out of sync. Consider adding a check to fail explicitly if files change, rather than silently fixing them during the test run.

Add a tidy check step
       - name: Running Tests
         run: |
           go mod tidy
+          git diff --exit-code go.mod go.sum || (echo "go.mod/go.sum not tidy" && exit 1)
           make test
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yml around lines 20 - 23, Replace the current single
"go mod tidy" execution in the "Running Tests" step with a two-step approach:
first run "go mod tidy" into a temporary diff check (e.g., run "git status
--porcelain" or "git diff --exit-code" after running "go mod tidy") and fail the
job if any changes are detected, or alternatively run "go mod tidy -e && git
diff --exit-code -- go.mod go.sum" to explicitly error when go.mod/go.sum are
modified; update the "Running Tests" step (the commands shown where you
currently run "go mod tidy" and "make test") to perform this check before
running "make test" so CI fails when module files are out of sync rather than
silently mutating them.
test/utils/utils.go (2)

194-202: Brittle path assumption in GetProjectDir.

The hardcoded replacement of /test/e2e assumes tests always run from that specific path. This will break if the test directory structure changes or tests are executed from a different location.

Consider a more robust approach like walking up the directory tree to find go.mod, though this is acceptable for typical Kubebuilder e2e test conventions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/utils/utils.go` around lines 194 - 202, GetProjectDir currently
hard-codes strings.ReplaceAll(wd, "/test/e2e", "") which is brittle; change it
to walk up parent directories from os.Getwd() (using filepath.Dir) until you
find a project root indicator such as a go.mod (check with os.Stat) or reach
filesystem root, then return that directory. Update GetProjectDir to perform
this upward traversal and fallback to returning an error if no go.mod is found;
remove the hardcoded ReplaceAll usage and keep error wrapping for os.Getwd() and
os.Stat failures.

44-50: Consider handling the GetProjectDir error instead of ignoring it.

Line 45 discards the error from GetProjectDir(). If the working directory cannot be determined, subsequent operations may fail silently or behave unexpectedly. At minimum, log the error similarly to line 49.

Proposed improvement
 func Run(cmd *exec.Cmd) (string, error) {
-	dir, _ := GetProjectDir()
+	dir, err := GetProjectDir()
+	if err != nil {
+		_, _ = fmt.Fprintf(GinkgoWriter, "warning: failed to get project dir: %v\n", err)
+	}
 	cmd.Dir = dir
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/utils/utils.go` around lines 44 - 50, In Run, don't ignore the error
from GetProjectDir(); capture the returned err from GetProjectDir() and if
non-nil log it via GinkgoWriter (similar to the existing os.Chdir error
handling), and return that error (or wrap it) so callers of Run and subsequent
os.Chdir/command execution don't proceed with an invalid cmd.Dir; update the
GetProjectDir() call in Run, and adjust error handling for cmd.Dir, os.Chdir and
the function return path accordingly.
.gitignore (1)

19-20: Orphaned negation pattern has no effect.

The negation !vendor/**/zz_generated.* only works if there's a preceding rule that ignores these files (e.g., **/zz_generated.*). Without such a rule, this line does nothing. Either add the missing ignore rule or remove this line if it's not needed.

Option A: Add the missing ignore rule
 # Kubernetes Generated files - skip generated files, except for vendored files
+**/zz_generated.*
 !vendor/**/zz_generated.*
Option B: Remove the orphaned negation
-# Kubernetes Generated files - skip generated files, except for vendored files
-!vendor/**/zz_generated.*
-
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 19 - 20, The negation pattern
"!vendor/**/zz_generated.*" is orphaned because there is no preceding ignore
rule like "**/zz_generated.*" to negate; either add a rule that ignores
generated files (e.g., "**/zz_generated.*" or "vendor/**/zz_generated.*") so the
negation has effect, or remove the orphaned negation line entirely; look for the
string "!vendor/**/zz_generated.*" in the .gitignore and either add the
corresponding ignore pattern or delete that negation line.
.github/workflows/lint.yml (1)

1-23: Add explicit workflow token permissions (least privilege).

Please set permissions to read-only for this lint workflow to reduce CI token blast radius.

🔐 Proposed hardening
 name: Lint
 
 on:
   push:
   pull_request:
+
+permissions:
+  contents: read
 
 jobs:
   lint:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/lint.yml around lines 1 - 23, The workflow "name: Lint" /
job "lint" currently runs with the default GITHUB_TOKEN permissions; add an
explicit least-privilege permissions block to the workflow root (not per-step)
e.g. set permissions: contents: read (and any other minimal read-only scopes you
need such as pull-requests: read) so the lint job only has read access; update
the top of the workflow immediately after the name: Lint header to include this
permissions block and ensure it applies to the "lint" job rather than leaving
default full-write token permissions.
config/default/kustomization.yaml (1)

12-34: Fix YAML comment formatting to pass linting.

Static analysis flags missing spaces after # in commented lines (e.g., #- should be # -). While these are style issues in inactive code, fixing them will satisfy the pre-commit checks.

🧹 Example fixes for some lines
-#- includeSelectors: true
+# - includeSelectors: true
-#  pairs:
+#   pairs:
-#- ../crd
+# - ../crd
-#- ../webhook
+# - ../webhook
-#- ../certmanager
+# - ../certmanager
-#- ../prometheus
+# - ../prometheus
-#- ../network-policy
+# - ../network-policy
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/default/kustomization.yaml` around lines 12 - 34, Fix the YAML comment
formatting by ensuring there is a space after every '#' in this
kustomization.yaml (e.g., change "#labels:" to "# labels:" and all occurrences
of "#-" to "# -"), including commented resource entries like "# - ../rbac", "# -
../manager", "# - ../webhook", "# - ../certmanager", "# - ../prometheus", "# -
../network-policy", and comment lines such as "# [WEBHOOK]" and "#
[CERTMANAGER]"; apply the same "# - ..." spacing for the list item comments and
any other commented lines (e.g., the metrics_service.yaml comment block) so the
linter no longer flags missing space-after-# issues.
test/e2e/e2e_suite_test.go (1)

40-43: Consider parameterizing the project image.

The hardcoded projectImage differs from the Makefile default (controller:latest). While this is intentional for e2e isolation, consider making it configurable via environment variable to allow CI customization.

💡 Optional: Make image configurable
-	// projectImage is the name of the image which will be build and loaded
-	// with the code source changes to be tested.
-	projectImage = "example.com/host-management-openstack:v0.0.1"
+	// projectImage is the name of the image which will be built and loaded
+	// with the code source changes to be tested.
+	projectImage = getEnvOrDefault("E2E_PROJECT_IMAGE", "example.com/host-management-openstack:v0.0.1")
 )
+
+func getEnvOrDefault(key, defaultValue string) string {
+	if v := os.Getenv(key); v != "" {
+		return v
+	}
+	return defaultValue
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/e2e_suite_test.go` around lines 40 - 43, The hardcoded package-level
variable projectImage should be made configurable via an environment variable:
change initialization of projectImage to read from an env var (e.g.,
PROJECT_IMAGE or E2E_PROJECT_IMAGE) and fall back to the current literal
"example.com/host-management-openstack:v0.0.1" when the env var is not set;
update any test setup that references projectImage to use the new value so CI or
local runs can override the image without changing code.
go.mod (1)

107-107: Replace directive points to a personal fork.

The dependency on github.com/DanNiESh/osac-operator is a personal fork referenced throughout the project documentation and codebase. While this is an intentional architectural choice, it creates supply chain concerns for production. The upstream osac-operator must be updated to include required functionality and this replace directive removed before any production release.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go.mod` at line 107, The go.mod replace directive currently points to a
personal fork ("replace github.com/osac-project/osac-operator =>
github.com/DanNiESh/osac-operator v0.0.0-..."), which is unsafe for production;
update the upstream osac-operator repository to include the required changes and
then remove or change this replace directive in go.mod so it references the
official module (or a tagged upstream version) instead of the personal fork, and
ensure any references to the forked module in project docs/config are updated
accordingly.
test/e2e/e2e_test.go (1)

275-311: Error handling in serviceAccountToken is confusing.

At line 310, err refers to the error from os.WriteFile (line 284), which would have caused an early return if non-nil. By the time we reach line 310, err is always nil, making the return value misleading.

Proposed fix
 func serviceAccountToken() (string, error) {
 	const tokenRequestRawString = `{
 		"apiVersion": "authentication.k8s.io/v1",
 		"kind": "TokenRequest"
 	}`
 
 	// Temporary file to store the token request
 	secretName := fmt.Sprintf("%s-token-request", serviceAccountName)
 	tokenRequestFile := filepath.Join("/tmp", secretName)
-	err := os.WriteFile(tokenRequestFile, []byte(tokenRequestRawString), os.FileMode(0o644))
-	if err != nil {
+	if err := os.WriteFile(tokenRequestFile, []byte(tokenRequestRawString), os.FileMode(0o644)); err != nil {
 		return "", err
 	}
 
 	var out string
+	var tokenErr error
 	verifyTokenCreation := func(g Gomega) {
 		// Execute kubectl command to create the token
 		cmd := exec.Command("kubectl", "create", "--raw", fmt.Sprintf(
 			"/api/v1/namespaces/%s/serviceaccounts/%s/token",
 			namespace,
 			serviceAccountName,
 		), "-f", tokenRequestFile)
 
 		output, err := cmd.CombinedOutput()
 		g.Expect(err).NotTo(HaveOccurred())
 
 		// Parse the JSON output to extract the token
 		var token tokenRequest
 		err = json.Unmarshal(output, &token)
 		g.Expect(err).NotTo(HaveOccurred())
 
 		out = token.Status.Token
 	}
 	Eventually(verifyTokenCreation).Should(Succeed())
 
-	return out, err
+	return out, tokenErr
 }

Or simply return nil directly since errors inside Eventually cause test failure:

-	return out, err
+	return out, nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/e2e_test.go` around lines 275 - 311, The function
serviceAccountToken returns the named variable err which is only set by
os.WriteFile and will always be nil when reaching the final return; change the
return to explicitly return the token string and a nil error (i.e., return out,
nil) or remove the named err and return values directly; ensure you keep the
existing verifyTokenCreation/Eventually logic and tokenRequestFile handling so
errors inside Eventually still fail the test.
Makefile (2)

22-24: Consider adding a clean target.

A clean target to remove build artifacts (bin/, dist/, cover.out, etc.) would improve developer experience.

Proposed addition
.PHONY: clean
clean: ## Remove build artifacts
	rm -rf $(LOCALBIN) dist/ cover.out Dockerfile.cross
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 22 - 24, Add a new .PHONY target named clean to the
Makefile that removes build artifacts; implement a target `clean` (referenced
alongside existing targets `all` and `build`) that runs `rm -rf` on the common
outputs (e.g., bin/ or $(LOCALBIN), dist/, cover.out, Dockerfile.cross) and add
`clean` to the .PHONY list so `make clean` reliably deletes generated artifacts
for a clean workspace.

70-76: Consider making cluster setup idempotent.

kind create cluster will fail if the cluster already exists (e.g., from a previous failed run). Consider checking for existence first or ignoring the error when the cluster already exists.

Proposed fix
 .PHONY: setup-test-e2e
 setup-test-e2e: ## Set up a Kind cluster for e2e tests if it does not exist
 	`@command` -v $(KIND) >/dev/null 2>&1 || { \
 		echo "Kind is not installed. Please install Kind manually."; \
 		exit 1; \
 	}
-	$(KIND) create cluster --name $(KIND_CLUSTER)
+	`@if` ! $(KIND) get clusters 2>/dev/null | grep -q "^$(KIND_CLUSTER)$$"; then \
+		$(KIND) create cluster --name $(KIND_CLUSTER); \
+	else \
+		echo "Kind cluster $(KIND_CLUSTER) already exists, skipping creation"; \
+	fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 70 - 76, The setup-test-e2e Makefile target currently
calls `$(KIND) create cluster --name $(KIND_CLUSTER)` which fails if the cluster
already exists; modify the `setup-test-e2e` target to first check for an
existing cluster (e.g., run `$(KIND) get clusters` or `$(KIND) get cluster
--name $(KIND_CLUSTER)` and detect presence) and skip creation when present, or
run the create command and ignore the specific "already exists" error (exit
code) so repeated invocations are idempotent; update only the `setup-test-e2e`
target and use the existing `$(KIND)` and `$(KIND_CLUSTER)` variables to
implement the check or error-ignore logic.
pkg/ironic/client.go (1)

86-97: Consider setting MinVersion for defense-in-depth.

While InsecureSkipVerify intentionally disables certificate validation, setting MinVersion: tls.VersionTLS12 (or higher) would prevent downgrade attacks. This is a defense-in-depth measure even for explicitly insecure connections.

Proposed fix
 	if opts.InsecureSkipVerify {
 		if t, ok := baseTransport.(*http.Transport); ok {
 			clone := t.Clone()
-			clone.TLSClientConfig = &tls.Config{InsecureSkipVerify: true} //nolint:gosec
+			clone.TLSClientConfig = &tls.Config{
+				InsecureSkipVerify: true, //nolint:gosec
+				MinVersion:         tls.VersionTLS12,
+			}
 			baseTransport = clone
 		} else {
-			baseTransport = &http.Transport{
-				TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec
+			baseTransport = &http.Transport{
+				TLSClientConfig: &tls.Config{
+					InsecureSkipVerify: true, //nolint:gosec
+					MinVersion:         tls.VersionTLS12,
+				},
 			}
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ironic/client.go` around lines 86 - 97, The TLS config created when
opts.InsecureSkipVerify is true sets InsecureSkipVerify but omits MinVersion;
update the TLSClientConfig instances in the baseTransport adjustment (the clone
assignment and the new http.Transport branch) to include MinVersion:
tls.VersionTLS12 (or higher) alongside InsecureSkipVerify to provide
defense-in-depth against protocol downgrades while still skipping cert
verification; modify the TLSClientConfig used by clone and the one passed to
&http.Transport accordingly.
internal/controller/host_controller.go (1)

62-68: Consider reusing the node from checkHostManaged when power state already matches.

The second GetNode call (line 64) is needed to refresh state after power operations, but when reconcilePower takes the "already matches" path (line 121), the original node could be reused to avoid an extra Ironic API call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/host_controller.go` around lines 62 - 68, reconcilePower
currently triggers an unconditional IronicClient.GetNode refresh after being
called; modify reconcilePower (or its caller) to return whether it performed any
power action and/or an updated node (e.g., return (*Node, bool changed, error)
or similar) and then only call IronicClient.GetNode when reconcilePower reports
it changed state; when reconcilePower reports "already matches" reuse the
original node passed in (the node from checkHostManaged) to avoid the extra
IronicClient.GetNode API call. Ensure the updated signature is used by the
caller and handle errors consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.devcontainer/post-install.sh:
- Around line 1-2: The post-install shell script currently enables tracing with
set -x but doesn't fail fast; add set -e alongside set -x at the top of the
script so the script exits immediately on any command failure (e.g., curl, mv)
to avoid continuing in a broken state—update the header where set -x is declared
to include set -e (or use set -ex) to ensure failures are not ignored.
- Line 17: Make the docker network creation idempotent: replace the
unconditional use of the docker network create -d=bridge --subnet=172.19.0.0/24
kind invocation with a guard that first checks whether the "kind" network
already exists (e.g., via docker network inspect or similar) and only creates it
if missing, or append a no-op fallback (|| true) if you accept failure; update
the script around the line containing the docker network create command
accordingly so the command no longer fails when the network already exists.

In `@cmd/main.go`:
- Line 172: Update the setupLog.Error call so the message string is corrected to
"Failed to initialize metrics certificate watcher" and remove the redundant
key-value pair "error", err from the argument list (the call to setupLog.Error
itself already accepts the error as the first parameter); locate the call to
setupLog.Error that currently reads setupLog.Error(err, "to initialize metrics
certificate watcher", "error", err) and modify it accordingly.

In `@config/manager/manager.yaml`:
- Around line 84-88: Add readOnlyRootFilesystem: true to the Pod/container
securityContext to enforce a read-only root filesystem; specifically update the
securityContext block that currently contains allowPrivilegeEscalation and
capabilities (the securityContext object alongside allowPrivilegeEscalation:
false and capabilities.drop: ["ALL"]) and add readOnlyRootFilesystem: true at
the same level to harden the container.

In `@config/prometheus/kustomization.yaml`:
- Line 8: The commented key "#patches:" on line shown violates the linter
expecting a space after the hash; update the comment to "# patches:" (i.e., add
a single space after '#') so the commented kustomize key 'patches' follows the
expected comment style and clears the pre-commit warning.

In `@config/prometheus/monitor.yaml`:
- Around line 21-22: Update the guidance comment that references the wrong patch
filename: replace the incorrect reference "servicemonitor_tls_patch.yaml" with
the correct "monitor_tls_patch.yaml" so the comment in the Prometheus monitor
config points to the actual patch file (look for the comment string mentioning
cert-manager and the patch file name in the monitor YAML).

In `@config/rbac/role.yaml`:
- Around line 9-11: The Role currently grants permissions for the "pods"
resource which is unused; remove the resources entry for "pods" from the Role in
role.yaml so only the required osac.openshift.io resources remain (hosts and
hosts/status as referenced by the kubebuilder:rbac markers and
internal/controller/host_controller.go), ensuring the Role matches actual
controller needs and follows least-privilege.

In `@Dockerfile`:
- Around line 15-18: The Dockerfile’s COPY steps currently include COPY
cmd/main.go, COPY api/, and COPY internal/ but omit the new pkg/ directory, so
builds that import pkg/ironic will fail; update the Dockerfile to also copy the
pkg directory (e.g., add a COPY for pkg/ similar to the existing COPY api/ and
COPY internal/) so that imports under pkg/ (like pkg/ironic) are available
during docker build.

In `@go.mod`:
- Line 72: The go.opentelemetry.io/otel/sdk dependency is pinned to v1.36.0
which is vulnerable; update the module version reference for
go.opentelemetry.io/otel/sdk in go.mod to v1.40.0 or later (v1.42.0
recommended), then run dependency resolution (e.g., go get/update and go mod
tidy) to refresh go.sum and rebuild/tests to ensure compatibility; verify any
API changes where your code imports go.opentelemetry.io/otel/sdk are handled
(search for imports of go.opentelemetry.io/otel/sdk in the codebase).
- Line 89: Update the grpc dependency to a safe version by replacing the current
google.golang.org/grpc v1.72.2 entry with v1.79.3 (or later) in the module
requirements and then run go get google.golang.org/grpc@v1.79.3 and go mod tidy
to update go.sum; target the module name "google.golang.org/grpc" in go.mod and
ensure the new version is reflected and the dependency tree is re-resolved.

In `@hack/boilerplate.go.txt`:
- Line 15: The file ends with the block comment terminator "*/" but lacks a
trailing newline; add a single newline character at EOF of
hack/boilerplate.go.txt (i.e., ensure the file ends with a '\n' after the "*/")
so the pre-commit hook/CI no longer fails.

In `@internal/controller/host_controller.go`:
- Around line 102-122: reconcilePower currently swallows errors from
IronicClient.SetPowerState (in HostReconciler.reconcilePower) which can lead to
incorrect status updates and no requeue; change reconcilePower to return an
error and on any SetPowerState failure return that error (instead of only
logging), update its callers (Reconcile) to handle the error by
requeueing/backing off (propagate the error up or requeue with
ctrl.Result{RequeueAfter: ...}) so transient Ironic failures trigger retries and
status won't be incorrectly refreshed.

In `@pkg/ironic/client.go`:
- Around line 104-114: When opts.HTTPClient is supplied the current code assigns
it directly (client.HTTPClient = *opts.HTTPClient) and therefore never applies
the ClientOptions.InsecureSkipVerify setting; update the client initialization
in pkg/ironic/client.go so that after copying opts.HTTPClient you inspect or
wrap its Transport (or replace with a transport derived from the local variable
transport) to respect opts.InsecureSkipVerify (e.g., adjust
TLSClientConfig.InsecureSkipVerify on opts.HTTPClient.Transport or construct a
new transport that merges transport and the provided HTTP client's settings),
and still apply the tokenRoundTripper when authToken is present so
tokenRoundTripper and InsecureSkipVerify are both honored.

In `@README.md`:
- Line 63: The README's Go version line is out of sync with go.mod: go.mod
declares "go 1.25.0" but README shows "go version v1.24.0+"; update the README
entry to match go.mod (e.g., change the Go prerequisite to "go version v1.25.0+"
or to a matching semantic text reflecting "go 1.25.0+") so the documented
requirement aligns with the go.mod "go 1.25.0" declaration.

---

Nitpick comments:
In @.devcontainer/devcontainer.json:
- Around line 13-15: Replace the deprecated "terminal.integrated.shell.linux"
setting with the newer terminal profile keys: set
"terminal.integrated.defaultProfile.linux" to "bash" (or your desired profile
name) and, if needed, define "terminal.integrated.profiles.linux" with a "bash"
entry pointing to "/bin/bash" so VSCode uses the profile instead of the
deprecated key; update the .devcontainer JSON to remove
"terminal.integrated.shell.linux" and add the two new keys
("terminal.integrated.defaultProfile.linux" and optional
"terminal.integrated.profiles.linux") accordingly.

In @.devcontainer/post-install.sh:
- Around line 4-15: The script downloads unpinned "latest" artifacts (kind,
kubebuilder) and dynamically resolves KUBECTL_VERSION without checksum
verification, making builds non-reproducible and insecure; update
.devcontainer/post-install.sh to read pinned version variables (e.g.,
KIND_VERSION, KUBEBUILDER_VERSION, KUBECTL_VERSION) from a single source (or
export defaults at top), use those exact versioned URLs instead of "latest" or
dynamic stable.txt, and for each artifact fetch the corresponding SHA256
checksum file and verify it (compare computed sha256sum to the expected value)
before chmod/moving; reference the commands/variables in this file (kind,
kubebuilder, KUBECTL_VERSION) and also centralize these versions into the
Makefile or a shared env so Makefile's KIND_CLUSTER uses the same pinned
versions.

In @.github/workflows/lint.yml:
- Around line 1-23: The workflow "name: Lint" / job "lint" currently runs with
the default GITHUB_TOKEN permissions; add an explicit least-privilege
permissions block to the workflow root (not per-step) e.g. set permissions:
contents: read (and any other minimal read-only scopes you need such as
pull-requests: read) so the lint job only has read access; update the top of the
workflow immediately after the name: Lint header to include this permissions
block and ensure it applies to the "lint" job rather than leaving default
full-write token permissions.

In @.github/workflows/test.yml:
- Around line 3-5: The workflow's top-level triggers are unscoped because the
on: push: and pull_request: entries run on all branches; update the on: block to
restrict push to the main (or other release) branches and optionally narrow
pull_request if desired by adding branches under push (e.g., branches: - main)
and/or under pull_request (e.g., branches: - main - develop) so CI only runs on
the intended branches; edit the YAML's on/push and on/pull_request entries
(symbols: on, push, pull_request) to include the branches lists.
- Around line 20-23: Replace the current single "go mod tidy" execution in the
"Running Tests" step with a two-step approach: first run "go mod tidy" into a
temporary diff check (e.g., run "git status --porcelain" or "git diff
--exit-code" after running "go mod tidy") and fail the job if any changes are
detected, or alternatively run "go mod tidy -e && git diff --exit-code -- go.mod
go.sum" to explicitly error when go.mod/go.sum are modified; update the "Running
Tests" step (the commands shown where you currently run "go mod tidy" and "make
test") to perform this check before running "make test" so CI fails when module
files are out of sync rather than silently mutating them.

In @.gitignore:
- Around line 19-20: The negation pattern "!vendor/**/zz_generated.*" is
orphaned because there is no preceding ignore rule like "**/zz_generated.*" to
negate; either add a rule that ignores generated files (e.g.,
"**/zz_generated.*" or "vendor/**/zz_generated.*") so the negation has effect,
or remove the orphaned negation line entirely; look for the string
"!vendor/**/zz_generated.*" in the .gitignore and either add the corresponding
ignore pattern or delete that negation line.

In `@config/default/kustomization.yaml`:
- Around line 12-34: Fix the YAML comment formatting by ensuring there is a
space after every '#' in this kustomization.yaml (e.g., change "#labels:" to "#
labels:" and all occurrences of "#-" to "# -"), including commented resource
entries like "# - ../rbac", "# - ../manager", "# - ../webhook", "# -
../certmanager", "# - ../prometheus", "# - ../network-policy", and comment lines
such as "# [WEBHOOK]" and "# [CERTMANAGER]"; apply the same "# - ..." spacing
for the list item comments and any other commented lines (e.g., the
metrics_service.yaml comment block) so the linter no longer flags missing
space-after-# issues.

In `@go.mod`:
- Line 107: The go.mod replace directive currently points to a personal fork
("replace github.com/osac-project/osac-operator =>
github.com/DanNiESh/osac-operator v0.0.0-..."), which is unsafe for production;
update the upstream osac-operator repository to include the required changes and
then remove or change this replace directive in go.mod so it references the
official module (or a tagged upstream version) instead of the personal fork, and
ensure any references to the forked module in project docs/config are updated
accordingly.

In `@internal/controller/host_controller.go`:
- Around line 62-68: reconcilePower currently triggers an unconditional
IronicClient.GetNode refresh after being called; modify reconcilePower (or its
caller) to return whether it performed any power action and/or an updated node
(e.g., return (*Node, bool changed, error) or similar) and then only call
IronicClient.GetNode when reconcilePower reports it changed state; when
reconcilePower reports "already matches" reuse the original node passed in (the
node from checkHostManaged) to avoid the extra IronicClient.GetNode API call.
Ensure the updated signature is used by the caller and handle errors
consistently.

In `@Makefile`:
- Around line 22-24: Add a new .PHONY target named clean to the Makefile that
removes build artifacts; implement a target `clean` (referenced alongside
existing targets `all` and `build`) that runs `rm -rf` on the common outputs
(e.g., bin/ or $(LOCALBIN), dist/, cover.out, Dockerfile.cross) and add `clean`
to the .PHONY list so `make clean` reliably deletes generated artifacts for a
clean workspace.
- Around line 70-76: The setup-test-e2e Makefile target currently calls `$(KIND)
create cluster --name $(KIND_CLUSTER)` which fails if the cluster already
exists; modify the `setup-test-e2e` target to first check for an existing
cluster (e.g., run `$(KIND) get clusters` or `$(KIND) get cluster --name
$(KIND_CLUSTER)` and detect presence) and skip creation when present, or run the
create command and ignore the specific "already exists" error (exit code) so
repeated invocations are idempotent; update only the `setup-test-e2e` target and
use the existing `$(KIND)` and `$(KIND_CLUSTER)` variables to implement the
check or error-ignore logic.

In `@pkg/ironic/client.go`:
- Around line 86-97: The TLS config created when opts.InsecureSkipVerify is true
sets InsecureSkipVerify but omits MinVersion; update the TLSClientConfig
instances in the baseTransport adjustment (the clone assignment and the new
http.Transport branch) to include MinVersion: tls.VersionTLS12 (or higher)
alongside InsecureSkipVerify to provide defense-in-depth against protocol
downgrades while still skipping cert verification; modify the TLSClientConfig
used by clone and the one passed to &http.Transport accordingly.

In `@test/e2e/e2e_suite_test.go`:
- Around line 40-43: The hardcoded package-level variable projectImage should be
made configurable via an environment variable: change initialization of
projectImage to read from an env var (e.g., PROJECT_IMAGE or E2E_PROJECT_IMAGE)
and fall back to the current literal
"example.com/host-management-openstack:v0.0.1" when the env var is not set;
update any test setup that references projectImage to use the new value so CI or
local runs can override the image without changing code.

In `@test/e2e/e2e_test.go`:
- Around line 275-311: The function serviceAccountToken returns the named
variable err which is only set by os.WriteFile and will always be nil when
reaching the final return; change the return to explicitly return the token
string and a nil error (i.e., return out, nil) or remove the named err and
return values directly; ensure you keep the existing
verifyTokenCreation/Eventually logic and tokenRequestFile handling so errors
inside Eventually still fail the test.

In `@test/utils/utils.go`:
- Around line 194-202: GetProjectDir currently hard-codes strings.ReplaceAll(wd,
"/test/e2e", "") which is brittle; change it to walk up parent directories from
os.Getwd() (using filepath.Dir) until you find a project root indicator such as
a go.mod (check with os.Stat) or reach filesystem root, then return that
directory. Update GetProjectDir to perform this upward traversal and fallback to
returning an error if no go.mod is found; remove the hardcoded ReplaceAll usage
and keep error wrapping for os.Getwd() and os.Stat failures.
- Around line 44-50: In Run, don't ignore the error from GetProjectDir();
capture the returned err from GetProjectDir() and if non-nil log it via
GinkgoWriter (similar to the existing os.Chdir error handling), and return that
error (or wrap it) so callers of Run and subsequent os.Chdir/command execution
don't proceed with an invalid cmd.Dir; update the GetProjectDir() call in Run,
and adjust error handling for cmd.Dir, os.Chdir and the function return path
accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bb85fc47-8146-43d7-9a95-1fe9d25c5799

📥 Commits

Reviewing files that changed from the base of the PR and between 8619c72 and ff9c1f2.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (39)
  • .devcontainer/devcontainer.json
  • .devcontainer/post-install.sh
  • .dockerignore
  • .github/workflows/lint.yml
  • .github/workflows/test.yml
  • .gitignore
  • .golangci.yml
  • Dockerfile
  • Makefile
  • README.md
  • cmd/main.go
  • config/default/cert_metrics_manager_patch.yaml
  • config/default/kustomization.yaml
  • config/default/manager_metrics_patch.yaml
  • config/default/metrics_service.yaml
  • config/manager/kustomization.yaml
  • config/manager/manager.yaml
  • config/network-policy/allow-metrics-traffic.yaml
  • config/network-policy/kustomization.yaml
  • config/prometheus/kustomization.yaml
  • config/prometheus/monitor.yaml
  • config/prometheus/monitor_tls_patch.yaml
  • config/rbac/kustomization.yaml
  • config/rbac/leader_election_role.yaml
  • config/rbac/leader_election_role_binding.yaml
  • config/rbac/metrics_auth_role.yaml
  • config/rbac/metrics_auth_role_binding.yaml
  • config/rbac/metrics_reader_role.yaml
  • config/rbac/role.yaml
  • config/rbac/role_binding.yaml
  • config/rbac/service_account.yaml
  • config/samples/v1alpha1_host.yaml
  • go.mod
  • hack/boilerplate.go.txt
  • internal/controller/host_controller.go
  • pkg/ironic/client.go
  • test/e2e/e2e_suite_test.go
  • test/e2e/e2e_test.go
  • test/utils/utils.go

Comment thread .devcontainer/post-install.sh Outdated
Comment thread .devcontainer/post-install.sh Outdated
Comment thread cmd/main.go Outdated
Comment thread config/manager/manager.yaml
Comment thread config/prometheus/kustomization.yaml Outdated
Comment thread go.mod Outdated
Comment thread hack/boilerplate.go.txt Outdated
Comment thread internal/controller/host_controller.go Outdated
Comment thread pkg/ironic/client.go Outdated
Comment thread README.md Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
config/manager/manager.yaml (1)

84-88: ⚠️ Potential issue | 🟠 Major

Harden container filesystem by setting readOnlyRootFilesystem: true.

manager still allows a writable root filesystem; this weakens restricted-profile hardening and is flagged by Trivy (KSV-0014).

🔒 Proposed fix
         securityContext:
           allowPrivilegeEscalation: false
           capabilities:
             drop:
             - "ALL"
+          readOnlyRootFilesystem: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/manager/manager.yaml` around lines 84 - 88, The pod spec under the
manager manifest currently sets securityContext with allowPrivilegeEscalation
and capabilities.drop but lacks readOnlyRootFilesystem; update the
securityContext for the manager container (the block containing
allowPrivilegeEscalation and capabilities drop) to add readOnlyRootFilesystem:
true so the root filesystem is mounted read-only and satisfies the
restricted-profile/Trivy check.
go.mod (2)

89-89: ⚠️ Potential issue | 🔴 Critical

Previously flagged: Upgrade google.golang.org/grpc to fix critical authorization bypass.

This critical vulnerability (GHSA-p77j-4mvh-x3m3, CVE-2026-33186) was already flagged in a previous review. Please upgrade to v1.79.3 or later as suggested in the earlier comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go.mod` at line 89, Update the google.golang.org/grpc dependency in go.mod
from v1.72.2 to v1.79.3 (or later) to address CVE-2026-33186; after changing the
version for the module entry "google.golang.org/grpc" run your Go dependency
update/tidy (e.g., go get/go mod tidy) and rebuild/tests to ensure no
regressions and that transitive dependencies are resolved.

72-72: ⚠️ Potential issue | 🟠 Major

Previously flagged: Update go.opentelemetry.io/otel/sdk to fix vulnerability.

This vulnerability (GHSA-9h8m-3fm2-qjrq) was already flagged in a previous review. Please update to v1.40.0 or later (v1.42.0 recommended) as suggested in the earlier comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go.mod` at line 72, Update the vulnerable dependency
go.opentelemetry.io/otel/sdk in go.mod from v1.36.0 to at least v1.40.0
(preferably v1.42.0); open the go.mod entry for the module path
"go.opentelemetry.io/otel/sdk" and change the version to v1.42.0, then run go
mod tidy (or go get go.opentelemetry.io/otel/sdk@v1.42.0) to update go.sum and
ensure the upgrade is applied.
🧹 Nitpick comments (6)
Makefile (2)

137-141: Consider adding .gitignore entry for dist/ directory.

The build-installer target creates dist/install.yaml. Ensure this directory is gitignored to avoid accidentally committing generated artifacts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 137 - 141, Add a `.gitignore` entry to ignore the
generated `dist/` directory so artifacts produced by the Makefile target
`build-installer` (which writes `dist/install.yaml`) are not accidentally
committed; update the repository’s `.gitignore` to include a line for `dist/`
(and optionally `dist/*`) and commit that change alongside or before changes
that introduce the `build-installer` target.

78-85: E2e test cleanup runs unconditionally, even on test failure.

The test-e2e target runs cleanup-test-e2e after go test, but if the test fails, Make will stop and cleanup won't run. Consider using a trap or ensuring cleanup happens regardless of test outcome.

Proposed fix
 .PHONY: test-e2e
 test-e2e: setup-test-e2e manifests generate fmt vet ## Run the e2e tests. Expected an isolated environment using Kind.
-	KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v
-	$(MAKE) cleanup-test-e2e
+	KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v; \
+	test_exit=$$?; \
+	$(MAKE) cleanup-test-e2e; \
+	exit $$test_exit
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 78 - 85, The test-e2e Make target currently invokes
cleanup-test-e2e directly after running go test which won't run if go test
fails; change test-e2e so cleanup-test-e2e always runs regardless of test exit
status by running the tests in a shell/subshell that captures the test exit
code, then calls $(MAKE) cleanup-test-e2e, and finally exits with the original
test exit code (or alternatively set a shell trap to invoke cleanup-test-e2e on
EXIT). Update the test-e2e recipe (where test-e2e and cleanup-test-e2e are
defined) to implement this pattern so KIND_CLUSTER/KIND cleanup always executes
even on failures.
internal/controller/host_controller.go (1)

130-135: Consider adding watch predicates to filter early.

The controller watches all Host resources but filters in Reconcile(). For clusters with many Host CRs where only a subset have hostManagementClass == "openstack", adding a predicate would reduce unnecessary reconcile invocations.

Example predicate
func (r *HostReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&osacopenshiftiov1alpha1.Host{}).
        WithEventFilter(predicate.NewPredicateFuncs(func(obj client.Object) bool {
            host, ok := obj.(*osacopenshiftiov1alpha1.Host)
            if !ok {
                return false
            }
            return host.Spec.HostManagementClass == hostManagementClass
        })).
        Named("host").
        Complete(r)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/host_controller.go` around lines 130 - 135, The
SetupWithManager currently registers HostReconciler for all Host resources
causing extra reconciles; update SetupWithManager to add a WithEventFilter
predicate that uses predicate.NewPredicateFuncs to type-assert the client.Object
to *osacopenshiftiov1alpha1.Host and return only when
host.Spec.HostManagementClass == hostManagementClass (keep reference to the
existing hostManagementClass constant/variable); ensure you import
sigs.k8s.io/controller-runtime/pkg/predicate and wire the predicate into the
controller builder before Named("host") so only relevant Host CRs trigger
reconciliation.
pkg/ironic/client.go (1)

86-97: Set MinVersion in TLS config even when InsecureSkipVerify is true.

When disabling certificate verification, you should still enforce a minimum TLS version to prevent protocol downgrade attacks. The static analysis correctly flags this.

Proposed fix
 	if opts.InsecureSkipVerify {
 		if t, ok := baseTransport.(*http.Transport); ok {
 			clone := t.Clone()
-			clone.TLSClientConfig = &tls.Config{InsecureSkipVerify: true} //nolint:gosec
+			clone.TLSClientConfig = &tls.Config{
+				InsecureSkipVerify: true, //nolint:gosec
+				MinVersion:         tls.VersionTLS12,
+			}
 			baseTransport = clone
 		} else {
 			baseTransport = &http.Transport{
-				TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec
+				TLSClientConfig: &tls.Config{
+					InsecureSkipVerify: true, //nolint:gosec
+					MinVersion:         tls.VersionTLS12,
+				},
 			}
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ironic/client.go` around lines 86 - 97, The TLS config created when
opts.InsecureSkipVerify is true currently sets InsecureSkipVerify but omits
MinVersion; update the TLSConfig instances in the InsecureSkipVerify branch (the
clone TLSClientConfig assignment and the new http.Transport TLSClientConfig) to
also set a safe minimum TLS version (e.g., tls.VersionTLS12) so that even with
verification disabled the code enforces a minimum protocol version; locate the
InsecureSkipVerify handling around baseTransport in client.go where
baseTransport is cloned or recreated and add MinVersion to those
&tls.Config{...} initializations.
README.md (1)

7-7: Documentation could clarify filtering happens at reconcile time, not watch level.

The README states the controller "filters Host CRs where spec.hostManagementClass == "openstack"", which is correct but may imply watch-level filtering. Per the controller code (internal/controller/host_controller.go:83-86), filtering occurs inside Reconcile(), so every Host CR event triggers a reconcile call. This is fine for small deployments but worth noting for operators with many Host CRs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 7, Clarify in the README that the controller does not
register a watch filter but instead checks spec.hostManagementClass inside the
Reconcile loop: update the sentence about filtering to state that every Host CR
event triggers Reconcile and the controller immediately returns for hosts whose
spec.hostManagementClass != "openstack" (referencing Reconcile in
internal/controller/host_controller.go and the host filtering logic around lines
handling spec.hostManagementClass / spec.id); mention this may impact operator
load in large clusters and suggest configuring event rate controls or
predicate-based watches if needed.
cmd/main.go (1)

221-223: Consider logging at WARNING level when IRONIC_URL is not set.

When IRONIC_URL is not configured, the controller starts but silently skips all Host reconciliation (per checkHostManaged in host_controller.go:88-90). Using setupLog.Info here may cause the missing configuration to go unnoticed in production logs.

Proposed fix
 	} else {
-		setupLog.Info("IRONIC_URL not set")
+		setupLog.Info("IRONIC_URL not set; Host reconciliation will be skipped until Ironic is configured")
 	}

Alternatively, consider failing fast if IRONIC_URL is required for the controller to be useful.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/main.go` around lines 221 - 223, The current info-level log call
setupLog.Info("IRONIC_URL not set") should be changed to a higher-severity
action so missing IRONIC_URL is visible: either replace it with a warning log
(e.g., setupLog.Warn or setupLog.Info with level "Warning") to surface the
misconfiguration, or fail fast by logging an error and exiting when IRONIC_URL
is required by the controller; reference the IRONIC_URL check in cmd/main.go and
consider its impact on Host reconciliation (see checkHostManaged in
host_controller.go) when choosing between warning vs. fatal behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/host_controller.go`:
- Around line 93-97: checkHostManaged currently swallows Ironic client errors
(from IronicClient.GetNode) by returning nil, which prevents transient failures
from being retried; change checkHostManaged to return (nil, error) on transient
Ironic failures (propagate the GetNode error instead of logging+returning nil)
and update the caller Reconcile to treat a non-nil error as a requeueable
failure (e.g., return ctrl.Result{RequeueAfter: 15 * time.Second}, nil) while
still returning (ctrl.Result{}, nil) when checkHostManaged returns (nil, nil) to
represent an intentionally unmanaged host; update signatures and callers
accordingly (referencing checkHostManaged, Reconcile, and IronicClient.GetNode).

---

Duplicate comments:
In `@config/manager/manager.yaml`:
- Around line 84-88: The pod spec under the manager manifest currently sets
securityContext with allowPrivilegeEscalation and capabilities.drop but lacks
readOnlyRootFilesystem; update the securityContext for the manager container
(the block containing allowPrivilegeEscalation and capabilities drop) to add
readOnlyRootFilesystem: true so the root filesystem is mounted read-only and
satisfies the restricted-profile/Trivy check.

In `@go.mod`:
- Line 89: Update the google.golang.org/grpc dependency in go.mod from v1.72.2
to v1.79.3 (or later) to address CVE-2026-33186; after changing the version for
the module entry "google.golang.org/grpc" run your Go dependency update/tidy
(e.g., go get/go mod tidy) and rebuild/tests to ensure no regressions and that
transitive dependencies are resolved.
- Line 72: Update the vulnerable dependency go.opentelemetry.io/otel/sdk in
go.mod from v1.36.0 to at least v1.40.0 (preferably v1.42.0); open the go.mod
entry for the module path "go.opentelemetry.io/otel/sdk" and change the version
to v1.42.0, then run go mod tidy (or go get
go.opentelemetry.io/otel/sdk@v1.42.0) to update go.sum and ensure the upgrade is
applied.

---

Nitpick comments:
In `@cmd/main.go`:
- Around line 221-223: The current info-level log call setupLog.Info("IRONIC_URL
not set") should be changed to a higher-severity action so missing IRONIC_URL is
visible: either replace it with a warning log (e.g., setupLog.Warn or
setupLog.Info with level "Warning") to surface the misconfiguration, or fail
fast by logging an error and exiting when IRONIC_URL is required by the
controller; reference the IRONIC_URL check in cmd/main.go and consider its
impact on Host reconciliation (see checkHostManaged in host_controller.go) when
choosing between warning vs. fatal behavior.

In `@internal/controller/host_controller.go`:
- Around line 130-135: The SetupWithManager currently registers HostReconciler
for all Host resources causing extra reconciles; update SetupWithManager to add
a WithEventFilter predicate that uses predicate.NewPredicateFuncs to type-assert
the client.Object to *osacopenshiftiov1alpha1.Host and return only when
host.Spec.HostManagementClass == hostManagementClass (keep reference to the
existing hostManagementClass constant/variable); ensure you import
sigs.k8s.io/controller-runtime/pkg/predicate and wire the predicate into the
controller builder before Named("host") so only relevant Host CRs trigger
reconciliation.

In `@Makefile`:
- Around line 137-141: Add a `.gitignore` entry to ignore the generated `dist/`
directory so artifacts produced by the Makefile target `build-installer` (which
writes `dist/install.yaml`) are not accidentally committed; update the
repository’s `.gitignore` to include a line for `dist/` (and optionally
`dist/*`) and commit that change alongside or before changes that introduce the
`build-installer` target.
- Around line 78-85: The test-e2e Make target currently invokes cleanup-test-e2e
directly after running go test which won't run if go test fails; change test-e2e
so cleanup-test-e2e always runs regardless of test exit status by running the
tests in a shell/subshell that captures the test exit code, then calls $(MAKE)
cleanup-test-e2e, and finally exits with the original test exit code (or
alternatively set a shell trap to invoke cleanup-test-e2e on EXIT). Update the
test-e2e recipe (where test-e2e and cleanup-test-e2e are defined) to implement
this pattern so KIND_CLUSTER/KIND cleanup always executes even on failures.

In `@pkg/ironic/client.go`:
- Around line 86-97: The TLS config created when opts.InsecureSkipVerify is true
currently sets InsecureSkipVerify but omits MinVersion; update the TLSConfig
instances in the InsecureSkipVerify branch (the clone TLSClientConfig assignment
and the new http.Transport TLSClientConfig) to also set a safe minimum TLS
version (e.g., tls.VersionTLS12) so that even with verification disabled the
code enforces a minimum protocol version; locate the InsecureSkipVerify handling
around baseTransport in client.go where baseTransport is cloned or recreated and
add MinVersion to those &tls.Config{...} initializations.

In `@README.md`:
- Line 7: Clarify in the README that the controller does not register a watch
filter but instead checks spec.hostManagementClass inside the Reconcile loop:
update the sentence about filtering to state that every Host CR event triggers
Reconcile and the controller immediately returns for hosts whose
spec.hostManagementClass != "openstack" (referencing Reconcile in
internal/controller/host_controller.go and the host filtering logic around lines
handling spec.hostManagementClass / spec.id); mention this may impact operator
load in large clusters and suggest configuring event rate controls or
predicate-based watches if needed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ed769a84-2d42-47f1-89ca-036b4df6c39e

📥 Commits

Reviewing files that changed from the base of the PR and between ff9c1f2 and 8019d2f.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (12)
  • .devcontainer/devcontainer.json
  • .github/workflows/lint.yml
  • Dockerfile
  • Makefile
  • README.md
  • cmd/main.go
  • config/manager/manager.yaml
  • config/rbac/role.yaml
  • config/samples/v1alpha1_host.yaml
  • go.mod
  • internal/controller/host_controller.go
  • pkg/ironic/client.go
✅ Files skipped from review due to trivial changes (4)
  • .github/workflows/lint.yml
  • Dockerfile
  • config/samples/v1alpha1_host.yaml
  • config/rbac/role.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .devcontainer/devcontainer.json

Comment thread internal/controller/host_controller.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
pkg/ironic/client.go (1)

86-97: Consider setting TLS MinVersion for security hardening.

The TLS configurations created when InsecureSkipVerify is true don't specify MinVersion. While Go defaults to TLS 1.2 for clients, explicitly setting MinVersion: tls.VersionTLS12 (or TLS 1.3 if Ironic supports it) documents the security posture and prevents accidental downgrades.

💡 Proposed fix
 	if opts.InsecureSkipVerify {
 		if t, ok := baseTransport.(*http.Transport); ok {
 			clone := t.Clone()
-			clone.TLSClientConfig = &tls.Config{InsecureSkipVerify: true} //nolint:gosec
+			clone.TLSClientConfig = &tls.Config{
+				InsecureSkipVerify: true, //nolint:gosec
+				MinVersion:         tls.VersionTLS12,
+			}
 			baseTransport = clone
 		} else {
 			baseTransport = &http.Transport{
-				TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec
+				TLSClientConfig: &tls.Config{
+					InsecureSkipVerify: true, //nolint:gosec
+					MinVersion:         tls.VersionTLS12,
+				},
 			}
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ironic/client.go` around lines 86 - 97, The TLS configs created in the
InsecureSkipVerify branch (where baseTransport is cloned or replaced) should
explicitly set MinVersion to tls.VersionTLS12 (or tls.VersionTLS13 if supported)
to harden TLS; update the TLSClientConfig assignments in both the t.Clone()
branch and the new http.Transport branch to include MinVersion: tls.VersionTLS12
while keeping InsecureSkipVerify: true so the configuration is explicit and
consistent.
Makefile (2)

99-136: Consider adding a clean target for build artifacts.

The Makefile lacks a clean target to remove build artifacts (bin/, cover.out, Dockerfile.cross). This is a common convention that helps maintain a clean development environment.

💡 Proposed addition
.PHONY: clean
clean: ## Remove build artifacts.
	rm -rf bin/ cover.out Dockerfile.cross dist/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 99 - 136, Add a new .PHONY target named clean to the
Makefile that removes build artifacts (e.g., bin/, cover.out, Dockerfile.cross,
dist/) so developers can reset the workspace; implement it alongside existing
targets like build and docker-buildx and ensure it uses rm -rf to delete those
files/directories and is documented with a short comment (e.g., "Remove build
artifacts.").

78-81: E2E cleanup may not run on test failure.

The cleanup-test-e2e target is called after go test, but if the test fails, Make will abort and the cleanup won't execute. This leaves the Kind cluster running after failed tests.

💡 Proposed fix using trap or separate cleanup
 .PHONY: test-e2e
-test-e2e: setup-test-e2e manifests generate fmt vet ## Run the e2e tests. Expected an isolated environment using Kind.
-	KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v
-	$(MAKE) cleanup-test-e2e
+test-e2e: setup-test-e2e manifests generate fmt vet ## Run the e2e tests. Expected an isolated environment using Kind.
+	KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v; \
+	exit_code=$$?; \
+	$(MAKE) cleanup-test-e2e; \
+	exit $$exit_code
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 78 - 81, The test-e2e target currently invokes go test
then calls cleanup-test-e2e directly, which is skipped if go test fails; change
test-e2e so the test command runs inside a shell that always invokes the
cleanup-test-e2e target regardless of test exit status (use a shell trap or
capture the test exit code, call cleanup-test-e2e, then re-exit with the
original code), ensuring the test-e2e flow always runs cleanup-test-e2e even on
failures; update references to the test-e2e target and cleanup-test-e2e target
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@go.mod`:
- Line 107: The go.mod replace directive pointing to
github.com/DanNiESh/osac-operator should be clarified or updated: decide whether
the personal fork (github.com/DanNiESh/osac-operator) is the intended long‑term
dependency for this project or if it must be switched to the official
osac-project/osac-operator; then either (A) update go.mod to use the official
module path/version instead of the replace line, or (B) add an explicit comment
in the repository README and a clear code comment near the replace directive
stating that the DanNiESh fork is intentionally required (with reason and
upgrade/migration plan), and remove the replace before any production release.
Ensure you reference the go.mod replace directive and the module path
github.com/DanNiESh/osac-operator when making the change.

---

Nitpick comments:
In `@Makefile`:
- Around line 99-136: Add a new .PHONY target named clean to the Makefile that
removes build artifacts (e.g., bin/, cover.out, Dockerfile.cross, dist/) so
developers can reset the workspace; implement it alongside existing targets like
build and docker-buildx and ensure it uses rm -rf to delete those
files/directories and is documented with a short comment (e.g., "Remove build
artifacts.").
- Around line 78-81: The test-e2e target currently invokes go test then calls
cleanup-test-e2e directly, which is skipped if go test fails; change test-e2e so
the test command runs inside a shell that always invokes the cleanup-test-e2e
target regardless of test exit status (use a shell trap or capture the test exit
code, call cleanup-test-e2e, then re-exit with the original code), ensuring the
test-e2e flow always runs cleanup-test-e2e even on failures; update references
to the test-e2e target and cleanup-test-e2e target accordingly.

In `@pkg/ironic/client.go`:
- Around line 86-97: The TLS configs created in the InsecureSkipVerify branch
(where baseTransport is cloned or replaced) should explicitly set MinVersion to
tls.VersionTLS12 (or tls.VersionTLS13 if supported) to harden TLS; update the
TLSClientConfig assignments in both the t.Clone() branch and the new
http.Transport branch to include MinVersion: tls.VersionTLS12 while keeping
InsecureSkipVerify: true so the configuration is explicit and consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8100c1ab-92d3-47ac-ab77-551b326d0ec7

📥 Commits

Reviewing files that changed from the base of the PR and between 8019d2f and b7afa2d.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (13)
  • .devcontainer/devcontainer.json
  • .github/workflows/lint.yml
  • Dockerfile
  • Makefile
  • README.md
  • cmd/main.go
  • config/manager/manager.yaml
  • config/rbac/role.yaml
  • config/samples/v1alpha1_host.yaml
  • go.mod
  • hack/boilerplate.go.txt
  • internal/controller/host_controller.go
  • pkg/ironic/client.go
✅ Files skipped from review due to trivial changes (6)
  • hack/boilerplate.go.txt
  • .github/workflows/lint.yml
  • .devcontainer/devcontainer.json
  • config/rbac/role.yaml
  • Dockerfile
  • config/samples/v1alpha1_host.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/controller/host_controller.go

Comment thread go.mod Outdated
sigs.k8s.io/yaml v1.6.0 // indirect
)

replace github.com/osac-project/osac-operator => github.com/DanNiESh/osac-operator v0.0.0-20260331184032-a30a8ba2ed0e
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if the osac-operator PR has been merged to the official repo
gh pr view 152 --repo osac-project/osac-operator --json state,mergedAt 2>/dev/null || echo "Unable to check PR status"

Repository: osac-project/host-management-openstack

Length of output: 107


🏁 Script executed:

cat -n go.mod | sed -n '100,115p'

Repository: osac-project/host-management-openstack

Length of output: 539


🏁 Script executed:

rg "osac-operator" go.mod -A 2 -B 2

Repository: osac-project/host-management-openstack

Length of output: 520


🏁 Script executed:

rg "DanNiESh" . --type-list

Repository: osac-project/host-management-openstack

Length of output: 471


🏁 Script executed:

rg "osac-operator" . --type md -A 2 -B 2

Repository: osac-project/host-management-openstack

Length of output: 1035


🏁 Script executed:

rg "DanNiESh" . -A 2 -B 2

Repository: osac-project/host-management-openstack

Length of output: 1405


🏁 Script executed:

git log --oneline -n 20 | head -20

Repository: osac-project/host-management-openstack

Length of output: 138


🏁 Script executed:

git log -p --all -- go.mod | grep -A 10 -B 10 "DanNiESh" | head -50

Repository: osac-project/host-management-openstack

Length of output: 64


Clarify the intentionality of the personal fork dependency.

The osac-operator replace directive uses github.com/DanNiESh/osac-operator, which is already documented in the README as the canonical source for the Host CR. Confirm this fork represents the intended long-term dependency structure for this project, or clarify whether it should be replaced with an official osac-project repository before production release.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go.mod` at line 107, The go.mod replace directive pointing to
github.com/DanNiESh/osac-operator should be clarified or updated: decide whether
the personal fork (github.com/DanNiESh/osac-operator) is the intended long‑term
dependency for this project or if it must be switched to the official
osac-project/osac-operator; then either (A) update go.mod to use the official
module path/version instead of the replace line, or (B) add an explicit comment
in the repository README and a clear code comment near the replace directive
stating that the DanNiESh fork is intentionally required (with reason and
upgrade/migration plan), and remove the replace before any production release.
Ensure you reference the go.mod replace directive and the module path
github.com/DanNiESh/osac-operator when making the change.

@DanNiESh DanNiESh marked this pull request as draft April 6, 2026 16:08
@DanNiESh DanNiESh force-pushed the main branch 2 times, most recently from 0008aec to ded9bec Compare April 6, 2026 18:36
@DanNiESh DanNiESh marked this pull request as ready for review April 6, 2026 18:40
@DanNiESh DanNiESh requested review from ajamias, larsks and tzumainn April 6, 2026 18:40
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
cmd/main.go (1)

172-172: ⚠️ Potential issue | 🟡 Minor

Remove redundant "error", err key-value pair.

The setupLog.Error method already accepts the error as its first argument, so the "error", err key-value pair is redundant and will log the error twice.

Proposed fix
-		setupLog.Error(err, "failed to initialize metrics certificate watcher", "error", err)
+		setupLog.Error(err, "failed to initialize metrics certificate watcher")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/main.go` at line 172, The log call to setupLog.Error is passing the error
twice; in the call to setupLog.Error (the invocation logging "failed to
initialize metrics certificate watcher") remove the redundant key-value pair
("error", err) so the error is only passed once (as the first argument to
setupLog.Error) and not duplicated in the structured key-value list.
🧹 Nitpick comments (3)
Makefile (2)

99-107: Consider adding a clean target.

Static analysis notes the absence of a clean target. Adding one improves developer experience and CI reproducibility.

Suggested clean target
.PHONY: clean
clean: ## Remove build artifacts and downloaded tools.
	rm -rf $(LOCALBIN) bin/ dist/ cover.out

Add this after the ##@ Build section.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 99 - 107, Add a new .PHONY target named "clean" to the
Makefile (after the "##@ Build" section) that removes build artifacts and
downloaded tools; implement it to delete LOCALBIN, bin/, dist/, and cover.out
(reference symbols: target name "clean", variables/paths "$(LOCALBIN)", "bin/",
"dist/", "cover.out") so developers and CI can reliably reset the workspace.

78-81: Test failure won't prevent cleanup, but consider explicit error handling.

If go test fails on line 80, make will still run cleanup-test-e2e on line 81. However, the original test failure exit code will be lost. Consider using a trap or separate CI steps to preserve the test result.

Alternative approach preserving exit code
 .PHONY: test-e2e
 test-e2e: setup-test-e2e manifests generate fmt vet ## Run the e2e tests. Expected an isolated environment using Kind.
-	KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v
-	$(MAKE) cleanup-test-e2e
+	KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v; \
+	test_exit=$$?; \
+	$(MAKE) cleanup-test-e2e; \
+	exit $$test_exit
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 78 - 81, The test-e2e Make target currently always
invokes cleanup-test-e2e even if the go test command fails, losing the original
exit code; update the test-e2e recipe (the target named test-e2e and the go test
invocation "KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v") to
capture the go test exit status, run cleanup-test-e2e, and then exit with the
captured status (or use a shell trap to always run cleanup-test-e2e on exit
while preserving and re-exiting with the original test exit code) so failures
are propagated to CI.
pkg/ironic/client.go (1)

87-97: Consider setting MinVersion on TLS config for defense in depth.

Static analysis flags missing MinVersion on the TLS configurations. While TLS 1.2 is the default minimum for Go clients (which is acceptable for infrastructure tooling), explicitly setting MinVersion: tls.VersionTLS12 documents intent and guards against future Go default changes.

🔒 Optional: Explicit TLS minimum version
 	if opts.InsecureSkipVerify {
 		if t, ok := baseTransport.(*http.Transport); ok {
 			clone := t.Clone()
-			clone.TLSClientConfig = &tls.Config{InsecureSkipVerify: true} //nolint:gosec
+			clone.TLSClientConfig = &tls.Config{
+				InsecureSkipVerify: true, //nolint:gosec
+				MinVersion:         tls.VersionTLS12,
+			}
 			baseTransport = clone
 		} else {
 			baseTransport = &http.Transport{
-				TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec
+				TLSClientConfig: &tls.Config{
+					InsecureSkipVerify: true, //nolint:gosec
+					MinVersion:         tls.VersionTLS12,
+				},
 			}
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ironic/client.go` around lines 87 - 97, The TLS config created when
opts.InsecureSkipVerify is true should explicitly set a minimum TLS version for
defense-in-depth: update the tls.Config assigned in both the cloned transport
(where baseTransport is cast to *http.Transport and clone.TLSClientConfig is
set) and the new &http.Transport branch to include MinVersion: tls.VersionTLS12
(preserve the existing InsecureSkipVerify and //nolint:gosec comment); target
the code around opts.InsecureSkipVerify, baseTransport, the *http.Transport
clone and the new http.Transport construction to make this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/manager/manager.yaml`:
- Around line 69-81: The env vars IRONIC_URL and OSAC_AUTH_TOKEN have mismatched
optional semantics causing startup failure when IRONIC_URL exists without
OSAC_AUTH_TOKEN; update either the manifest or code so their optional flags
match and/or add validation in cmd/main.go: when IRONIC_URL is provided (env
IRONIC_URL) require OSAC_AUTH_TOKEN as well (or make both optional:false), and
add a clear error message describing the dependency; alternatively document the
coupling in the YAML comment and ensure secretKeyRef optional settings for
IRONIC_URL and OSAC_AUTH_TOKEN are consistent.

---

Duplicate comments:
In `@cmd/main.go`:
- Line 172: The log call to setupLog.Error is passing the error twice; in the
call to setupLog.Error (the invocation logging "failed to initialize metrics
certificate watcher") remove the redundant key-value pair ("error", err) so the
error is only passed once (as the first argument to setupLog.Error) and not
duplicated in the structured key-value list.

---

Nitpick comments:
In `@Makefile`:
- Around line 99-107: Add a new .PHONY target named "clean" to the Makefile
(after the "##@ Build" section) that removes build artifacts and downloaded
tools; implement it to delete LOCALBIN, bin/, dist/, and cover.out (reference
symbols: target name "clean", variables/paths "$(LOCALBIN)", "bin/", "dist/",
"cover.out") so developers and CI can reliably reset the workspace.
- Around line 78-81: The test-e2e Make target currently always invokes
cleanup-test-e2e even if the go test command fails, losing the original exit
code; update the test-e2e recipe (the target named test-e2e and the go test
invocation "KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v") to
capture the go test exit status, run cleanup-test-e2e, and then exit with the
captured status (or use a shell trap to always run cleanup-test-e2e on exit
while preserving and re-exiting with the original test exit code) so failures
are propagated to CI.

In `@pkg/ironic/client.go`:
- Around line 87-97: The TLS config created when opts.InsecureSkipVerify is true
should explicitly set a minimum TLS version for defense-in-depth: update the
tls.Config assigned in both the cloned transport (where baseTransport is cast to
*http.Transport and clone.TLSClientConfig is set) and the new &http.Transport
branch to include MinVersion: tls.VersionTLS12 (preserve the existing
InsecureSkipVerify and //nolint:gosec comment); target the code around
opts.InsecureSkipVerify, baseTransport, the *http.Transport clone and the new
http.Transport construction to make this change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2d52286b-5eb9-41fe-a12e-90a413140b95

📥 Commits

Reviewing files that changed from the base of the PR and between b7afa2d and ded9bec.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (15)
  • .github/workflows/lint.yml
  • Dockerfile
  • Makefile
  • README.md
  • cmd/main.go
  • config/default/kustomization.yaml
  • config/manager/manager.yaml
  • config/prometheus/kustomization.yaml
  • config/prometheus/monitor.yaml
  • config/rbac/role.yaml
  • config/samples/v1alpha1_host.yaml
  • go.mod
  • hack/boilerplate.go.txt
  • internal/controller/host_controller.go
  • pkg/ironic/client.go
✅ Files skipped from review due to trivial changes (7)
  • hack/boilerplate.go.txt
  • .github/workflows/lint.yml
  • config/prometheus/kustomization.yaml
  • config/rbac/role.yaml
  • go.mod
  • config/prometheus/monitor.yaml
  • config/default/kustomization.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • Dockerfile
  • config/samples/v1alpha1_host.yaml
  • internal/controller/host_controller.go

Comment thread config/manager/manager.yaml Outdated
Copy link
Copy Markdown

@ajamias ajamias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some comments! Most are about code style, but overall looks good

Comment thread config/manager/manager.yaml Outdated
Comment thread internal/controller/host_controller.go Outdated
Comment thread internal/controller/host_controller.go Outdated
Comment thread internal/controller/host_controller.go Outdated
Comment thread internal/controller/host_controller.go Outdated
Comment thread pkg/ironic/client.go Outdated
Comment thread cmd/main.go Outdated
Comment thread config/manager/manager.yaml Outdated
Comment thread config/manager/manager.yaml Outdated
Comment thread config/samples/v1alpha1_host.yaml Outdated
Comment thread internal/controller/host_controller.go Outdated
Comment thread internal/controller/host_controller.go Outdated
Comment thread internal/controller/host_controller.go Outdated
Comment thread pkg/ironic/client.go Outdated
Comment thread pkg/ironic/client.go Outdated
Comment thread pkg/ironic/client.go Outdated
@larsks
Copy link
Copy Markdown
Member

larsks commented Apr 7, 2026

@DanNiESh take a look at https://github.com/larsks/host-management-openstack/tree/lars-example; this represents a rewrite of the ironic client implementing some of my comments from this pr.


Separately:

While testing things out locally, it became clear that we need to make sure to report errors in the cr, not just in the logs. We were having some issues with a couple of ironic nodes, and there was no way to tell from the cr why the power state wasn't changing.

Given a failure like this:

2026-04-07T12:51:40-04:00 ERROR failed to power on node {"controller": "host", "controllerGroup": "osac.openshift.io", "controllerKind": "Host", "Host": {"name":"host-sample","namespace":"default"}, "namespace": "default", "name": "host-sample", "reconcileID": "cc2f8c5d-3003-4f26-9df9-f9f2bfccb25b", "nodeID": "4e545bd1-904b-424c-b42e-129398f3790c", "error": "failed to set power state on node 4e545bd1-904b-424c-b42e-129398f3790c: Expected HTTP response code [202] when accessing [PUT https://esi.massopen.cloud:13385/v1/nodes/4e545bd1-904b-424c-b42e-129398f3790c/states/power], but got 409 instead: {"error_message": "{\"faultcode\":\"Client\",\"faultstring\":\"Node 4e545bd1-904b-424c-b42e-129398f3790c is locked by host overcloud-controller-2.localdomain, please retry after the current operation is completed.\",\"debuginfo\":null}"}"}

We should expose that API error ("Node 4e545bd1-904b-424c-b42e-129398f3790c is locked by host overcloud-controller-2.localdomain, please retry after the current operation is completed") in the status field. Maybe as a condition?

conditions:
- type: ReachedPowerState
  status: "False"
  reason: IronicApiError
  message: "Node 4e545bd1-904b-424c-b42e-129398f3790c is locked by host overcloud-controller-2.localdomain, please retry after the current operation is completed"

@DanNiESh DanNiESh force-pushed the main branch 2 times, most recently from 0e2625b to cb1580e Compare April 8, 2026 02:42
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

♻️ Duplicate comments (1)
cmd/main.go (1)

171-173: ⚠️ Potential issue | 🟡 Minor

Drop the duplicated error field here.

setupLog.Error(err, ...) already records err; adding "error", err repeats the same value in the log payload.

Proposed fix
-			setupLog.Error(err, "failed to initialize metrics certificate watcher", "error", err)
+			setupLog.Error(err, "failed to initialize metrics certificate watcher")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/main.go` around lines 171 - 173, The log call is adding the same error
twice; locate the error handling block that checks err after initializing the
metrics certificate watcher (the setupLog.Error(...) followed by os.Exit(1)) and
remove the redundant key/value pair "error", err from the setupLog.Error
invocation so the call simply passes the error and the message (e.g.,
setupLog.Error(err, "failed to initialize metrics certificate watcher")), then
keep the os.Exit(1) behavior unchanged.
🧹 Nitpick comments (1)
internal/controller/hostlease_controller.go (1)

113-115: Use gophercloud power-state constants instead of raw strings.

The comparison against raw "power on" duplicates an external API value. Use string(nodes.PowerOn) and string(nodes.PowerOff) from gophercloud to keep the controller aligned with the canonical constants already used elsewhere in the codebase for Ironic power state handling.

Proposed fix
-	currentlyOn := node.PowerState == "power on"
+	currentlyOn := node.PowerState == string(nodes.PowerOn)
@@
-	poweredOn := node.PowerState == "power on"
+	poweredOn := node.PowerState == string(nodes.PowerOn)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/hostlease_controller.go` around lines 113 - 115, In
reconcilePower, avoid comparing node.PowerState to the raw string "power on";
instead use the gophercloud constants by comparing to string(nodes.PowerOn) (and
where applicable string(nodes.PowerOff)). Update the currentlyOn assignment and
any other power-state comparisons in reconcilePower to use
string(nodes.PowerOn)/string(nodes.PowerOff) so the controller uses the
canonical gophercloud power-state constants (refer to reconcilePower,
node.PowerState).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/hostlease_controller.go`:
- Around line 63-66: The IronicClient.GetNode errors returned by
checkHostManaged and the post-update refresh path are only logged and not
persisted to HostLease.status; modify the reconcile paths that call
checkHostManaged and the refresh (the blocks around checkHostManaged, the
post-update node refresh using r.ironic.GetNode, and the refresh after status
update) to call the existing condition-update helper (the same helper used
elsewhere for setting HostLease conditions) when GetNode returns an error or nil
node so the error is written into HostLease.status instead of only being logged;
ensure you use the same condition type/message format and return the same
ctrl.Result/err behavior after updating the status.
- Line 40: The current recheckInterval assignment uses
helpers.GetEnvWithDefault("HOST_RECHECK_INTERVAL", 60*time.Second) but doesn't
validate the parsed duration; update the call to use the helper's validator
parameter to reject non-positive durations (require d > 0) so that
HOST_RECHECK_INTERVAL cannot be 0s or negative—i.e., call
helpers.GetEnvWithDefault with a validator function that returns an error when
the duration is <= 0 and keep the default 60*time.Second to ensure RequeueAfter
remains functional.

In `@internal/ironic/client.go`:
- Around line 47-56: NewClient currently returns a created service client
without validating that the resolved Ironic endpoint is non-empty; after calling
clientconfig.NewServiceClient in NewClient, verify the returned service client's
Endpoint (serviceClient.Endpoint) is not empty and return a clear error (e.g.
"empty baremetal endpoint from service discovery") if it is, so startup fails
fast rather than later in reconciles; update NewClient to perform this check
before returning &Client{serviceClient: client}.

In `@Makefile`:
- Around line 78-81: The test-e2e target currently runs `go test` on its own
line so Make stops on failure and `cleanup-test-e2e` is never invoked; change
the test-e2e recipe so `go test` runs in the same shell and you always call
`$(MAKE) cleanup-test-e2e` afterward while preserving `go test`'s exit code
(e.g., capture the exit status of `go test` into a variable, run `$(MAKE)
cleanup-test-e2e`, then `exit` with the captured status). Update the `test-e2e`
recipe (referencing the test-e2e target and cleanup-test-e2e) accordingly so
Kind is always torn down even when tests fail.
- Around line 70-76: The Makefile target setup-test-e2e currently always runs
$(KIND) create cluster --name $(KIND_CLUSTER); change it to first check for Kind
installation and then check whether the cluster named $(KIND_CLUSTER) already
exists (use the Kind command that lists clusters, e.g., kind get clusters) and
only call $(KIND) create cluster --name $(KIND_CLUSTER) when the cluster is
absent; keep the existing installation check that errors if $(KIND) is missing
and print a message when the cluster already exists instead of trying to
recreate it.
- Around line 128-135: In the docker-buildx Makefile target the build step is
prefixed with a leading '-' which suppresses failures; remove the leading '-'
from the "$(CONTAINER_TOOL) buildx build --push --platform=$(PLATFORMS) --tag
${IMG} -f Dockerfile.cross ." invocation so a failed build causes the make
target to fail, and likewise remove any leading '-' from the related buildx
commands ("$(CONTAINER_TOOL) buildx create --name
host-management-openstack-builder" and "$(CONTAINER_TOOL) buildx rm
host-management-openstack-builder") so their failures are not silently ignored.

In `@README.md`:
- Around line 3-8: The README currently documents the wrong CRD and fields;
update all sections that reference Host and spec.id to instead describe the
v1alpha1.HostLease resource and the reconciler's expected identifier field
spec.externalID (including filter criteria used by
internal/controller/hostlease_controller.go and any examples). Replace
instructions that say to apply a Host CR with steps to create/apply a HostLease
CR, update example manifests and CLI snippets to include spec.externalID, and
ensure any explanatory text (sections around lines 19-25, 43-55, 76-80)
consistently reflects the HostLease API and its mapping to Ironic node UUIDs.

---

Duplicate comments:
In `@cmd/main.go`:
- Around line 171-173: The log call is adding the same error twice; locate the
error handling block that checks err after initializing the metrics certificate
watcher (the setupLog.Error(...) followed by os.Exit(1)) and remove the
redundant key/value pair "error", err from the setupLog.Error invocation so the
call simply passes the error and the message (e.g., setupLog.Error(err, "failed
to initialize metrics certificate watcher")), then keep the os.Exit(1) behavior
unchanged.

---

Nitpick comments:
In `@internal/controller/hostlease_controller.go`:
- Around line 113-115: In reconcilePower, avoid comparing node.PowerState to the
raw string "power on"; instead use the gophercloud constants by comparing to
string(nodes.PowerOn) (and where applicable string(nodes.PowerOff)). Update the
currentlyOn assignment and any other power-state comparisons in reconcilePower
to use string(nodes.PowerOn)/string(nodes.PowerOff) so the controller uses the
canonical gophercloud power-state constants (refer to reconcilePower,
node.PowerState).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 239ba52d-09cc-458c-94ab-8844c1c936c2

📥 Commits

Reviewing files that changed from the base of the PR and between ded9bec and 0e2625b.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (16)
  • .github/workflows/lint.yml
  • Dockerfile
  • Makefile
  • README.md
  • cmd/main.go
  • config/default/kustomization.yaml
  • config/prometheus/kustomization.yaml
  • config/prometheus/monitor.yaml
  • config/rbac/role.yaml
  • config/samples/v1alpha1_hostlease.yaml
  • go.mod
  • hack/boilerplate.go.txt
  • internal/controller/hostlease_controller.go
  • internal/helpers/getenvwithdefault.go
  • internal/helpers/helpers.go
  • internal/ironic/client.go
✅ Files skipped from review due to trivial changes (7)
  • hack/boilerplate.go.txt
  • .github/workflows/lint.yml
  • config/prometheus/kustomization.yaml
  • config/samples/v1alpha1_hostlease.yaml
  • go.mod
  • Dockerfile
  • config/rbac/role.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • config/default/kustomization.yaml
  • config/prometheus/monitor.yaml

Comment thread internal/controller/hostlease_controller.go Outdated
Comment on lines +63 to +66
node, err := r.checkHostManaged(ctx, hostLease, log)
if err != nil || node == nil {
return ctrl.Result{}, err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

GetNode failures still never reach HostLease.status.

Both the initial lookup and the post-update refresh only log IronicClient.GetNode errors. The reconcile then exits or requeues without persisting a condition, so users still have to inspect controller logs to see the API error. That leaves the exact debuggability gap called out in the PR discussion.

Suggested direction
 	node, err := r.checkHostManaged(ctx, hostLease, log)
 	if err != nil || node == nil {
+		if err != nil {
+			hostLease.SetStatusCondition(
+				v1alpha1.HostConditionPoweredOn,
+				metav1.ConditionFalse,
+				v1alpha1.HostConditionReasonPowerFailure,
+				err.Error(),
+			)
+			if statusErr := r.Status().Update(ctx, hostLease); statusErr != nil {
+				log.Error(statusErr, "failed to update HostLease status after Ironic lookup failure")
+			}
+		}
 		return ctrl.Result{}, err
 	}

Apply the same condition-update helper to the refresh failure path on Lines 75-79 as well.

Also applies to: 75-79, 104-107

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/hostlease_controller.go` around lines 63 - 66, The
IronicClient.GetNode errors returned by checkHostManaged and the post-update
refresh path are only logged and not persisted to HostLease.status; modify the
reconcile paths that call checkHostManaged and the refresh (the blocks around
checkHostManaged, the post-update node refresh using r.ironic.GetNode, and the
refresh after status update) to call the existing condition-update helper (the
same helper used elsewhere for setting HostLease conditions) when GetNode
returns an error or nil node so the error is written into HostLease.status
instead of only being logged; ensure you use the same condition type/message
format and return the same ctrl.Result/err behavior after updating the status.

Comment thread internal/ironic/client.go
Comment thread Makefile Outdated
Comment thread Makefile
Comment on lines +78 to +81
.PHONY: test-e2e
test-e2e: setup-test-e2e manifests generate fmt vet ## Run the e2e tests. Expected an isolated environment using Kind.
KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v
$(MAKE) cleanup-test-e2e
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Always tear down the Kind cluster, even when e2e tests fail.

Make stops on the failing go test line, so cleanup-test-e2e is skipped and the cluster is leaked. That also makes the next run much more likely to fail at setup.

Proposed fix
 .PHONY: test-e2e
 test-e2e: setup-test-e2e manifests generate fmt vet ## Run the e2e tests. Expected an isolated environment using Kind.
-	KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v
-	$(MAKE) cleanup-test-e2e
+	`@set` -e; \
+	trap '$(MAKE) cleanup-test-e2e' EXIT; \
+	KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
.PHONY: test-e2e
test-e2e: setup-test-e2e manifests generate fmt vet ## Run the e2e tests. Expected an isolated environment using Kind.
KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v
$(MAKE) cleanup-test-e2e
.PHONY: test-e2e
test-e2e: setup-test-e2e manifests generate fmt vet ## Run the e2e tests. Expected an isolated environment using Kind.
`@set` -e; \
trap '$(MAKE) cleanup-test-e2e' EXIT; \
KIND_CLUSTER=$(KIND_CLUSTER) go test ./test/e2e/ -v -ginkgo.v
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 78 - 81, The test-e2e target currently runs `go test`
on its own line so Make stops on failure and `cleanup-test-e2e` is never
invoked; change the test-e2e recipe so `go test` runs in the same shell and you
always call `$(MAKE) cleanup-test-e2e` afterward while preserving `go test`'s
exit code (e.g., capture the exit status of `go test` into a variable, run
`$(MAKE) cleanup-test-e2e`, then `exit` with the captured status). Update the
`test-e2e` recipe (referencing the test-e2e target and cleanup-test-e2e)
accordingly so Kind is always torn down even when tests fail.

Comment thread Makefile Outdated
Comment on lines +128 to +135
docker-buildx: ## Build and push docker image for the manager for cross-platform support
# copy existing Dockerfile and insert --platform=${BUILDPLATFORM} into Dockerfile.cross, and preserve the original Dockerfile
sed -e '1 s/\(^FROM\)/FROM --platform=\$$\{BUILDPLATFORM\}/; t' -e ' 1,// s//FROM --platform=\$$\{BUILDPLATFORM\}/' Dockerfile > Dockerfile.cross
- $(CONTAINER_TOOL) buildx create --name host-management-openstack-builder
$(CONTAINER_TOOL) buildx use host-management-openstack-builder
- $(CONTAINER_TOOL) buildx build --push --platform=$(PLATFORMS) --tag ${IMG} -f Dockerfile.cross .
- $(CONTAINER_TOOL) buildx rm host-management-openstack-builder
rm Dockerfile.cross
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't suppress docker buildx build failures.

The leading - on the actual build/push step turns a failed image build into a successful make target, which can silently green-light broken release artifacts.

Proposed fix
-	- $(CONTAINER_TOOL) buildx build --push --platform=$(PLATFORMS) --tag ${IMG} -f Dockerfile.cross .
+	$(CONTAINER_TOOL) buildx build --push --platform=$(PLATFORMS) --tag ${IMG} -f Dockerfile.cross .
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
docker-buildx: ## Build and push docker image for the manager for cross-platform support
# copy existing Dockerfile and insert --platform=${BUILDPLATFORM} into Dockerfile.cross, and preserve the original Dockerfile
sed -e '1 s/\(^FROM\)/FROM --platform=\$$\{BUILDPLATFORM\}/; t' -e ' 1,// s//FROM --platform=\$$\{BUILDPLATFORM\}/' Dockerfile > Dockerfile.cross
- $(CONTAINER_TOOL) buildx create --name host-management-openstack-builder
$(CONTAINER_TOOL) buildx use host-management-openstack-builder
- $(CONTAINER_TOOL) buildx build --push --platform=$(PLATFORMS) --tag ${IMG} -f Dockerfile.cross .
- $(CONTAINER_TOOL) buildx rm host-management-openstack-builder
rm Dockerfile.cross
docker-buildx: ## Build and push docker image for the manager for cross-platform support
# copy existing Dockerfile and insert --platform=${BUILDPLATFORM} into Dockerfile.cross, and preserve the original Dockerfile
sed -e '1 s/\(^FROM\)/FROM --platform=\$$\{BUILDPLATFORM\}/; t' -e ' 1,// s//FROM --platform=\$$\{BUILDPLATFORM\}/' Dockerfile > Dockerfile.cross
- $(CONTAINER_TOOL) buildx create --name host-management-openstack-builder
$(CONTAINER_TOOL) buildx use host-management-openstack-builder
$(CONTAINER_TOOL) buildx build --push --platform=$(PLATFORMS) --tag ${IMG} -f Dockerfile.cross .
- $(CONTAINER_TOOL) buildx rm host-management-openstack-builder
rm Dockerfile.cross
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 128 - 135, In the docker-buildx Makefile target the
build step is prefixed with a leading '-' which suppresses failures; remove the
leading '-' from the "$(CONTAINER_TOOL) buildx build --push
--platform=$(PLATFORMS) --tag ${IMG} -f Dockerfile.cross ." invocation so a
failed build causes the make target to fail, and likewise remove any leading '-'
from the related buildx commands ("$(CONTAINER_TOOL) buildx create --name
host-management-openstack-builder" and "$(CONTAINER_TOOL) buildx rm
host-management-openstack-builder") so their failures are not silently ignored.

Comment thread README.md Outdated
@DanNiESh DanNiESh force-pushed the main branch 3 times, most recently from a5e8ffd to f570d43 Compare April 8, 2026 18:12
@DanNiESh
Copy link
Copy Markdown
Author

DanNiESh commented Apr 8, 2026

@DanNiESh take a look at https://github.com/larsks/host-management-openstack/tree/lars-example; this represents a rewrite of the ironic client implementing some of my comments from this pr.

Separately:

While testing things out locally, it became clear that we need to make sure to report errors in the cr, not just in the logs. We were having some issues with a couple of ironic nodes, and there was no way to tell from the cr why the power state wasn't changing.

Given a failure like this:

2026-04-07T12:51:40-04:00 ERROR failed to power on node {"controller": "host", "controllerGroup": "osac.openshift.io", "controllerKind": "Host", "Host": {"name":"host-sample","namespace":"default"}, "namespace": "default", "name": "host-sample", "reconcileID": "cc2f8c5d-3003-4f26-9df9-f9f2bfccb25b", "nodeID": "4e545bd1-904b-424c-b42e-129398f3790c", "error": "failed to set power state on node 4e545bd1-904b-424c-b42e-129398f3790c: Expected HTTP response code [202] when accessing [PUT https://esi.massopen.cloud:13385/v1/nodes/4e545bd1-904b-424c-b42e-129398f3790c/states/power], but got 409 instead: {"error_message": "{"faultcode":"Client","faultstring":"Node 4e545bd1-904b-424c-b42e-129398f3790c is locked by host overcloud-controller-2.localdomain, please retry after the current operation is completed.","debuginfo":null}"}"}

We should expose that API error ("Node 4e545bd1-904b-424c-b42e-129398f3790c is locked by host overcloud-controller-2.localdomain, please retry after the current operation is completed") in the status field. Maybe as a condition?

conditions:
- type: ReachedPowerState
  status: "False"
  reason: IronicApiError
  message: "Node 4e545bd1-904b-424c-b42e-129398f3790c is locked by host overcloud-controller-2.localdomain, please retry after the current operation is completed"

Thanks! I've updated the ironic client with your example code. Also added powerSynced condition to store power failure message.

@DanNiESh DanNiESh changed the title Add Host controller for bare metal management via Ironic Add HostLease controller for bare metal management via Ironic Apr 9, 2026
Copy link
Copy Markdown

@tzumainn tzumainn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments:

  • osac-operator has different contents at the top-level directory and within .github/; is there a reason not to standardize on that?
  • can you add unit tests for the go code?

Comment thread README.md Outdated
@DanNiESh
Copy link
Copy Markdown
Author

  • different

It is the scaffolding generated by operator-sdk, I'll standardize on osac-operator.

Comment thread README.md Outdated

node, err := r.IronicClient.GetNode(ctx, hostLease.Spec.ExternalID)
if err != nil {
log.Error(err, "failed to get Ironic node", "nodeID", hostLease.Spec.ExternalID)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually want to requeue in cases like this? There could be a blip in connectivity to OpenStack, in which case "trying again" would make sense.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does requeue, because it returns an error. Kubernetes will automatically retry the operation with exponential backoff.

Copy link
Copy Markdown
Author

@DanNiESh DanNiESh Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The controller explicitly requeues to poll external Ironic state. I think it will also be used for polling node provisioning state. In error cases, just like Lars said, it triggers exponential backoff.

client.Client
Scheme *runtime.Scheme
IronicClient *ironic.Client
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

osac-operator uses a func New*Reconciler pattern (for example, https://github.com/osac-project/osac-operator/blob/main/internal/controller/clusterorder_controller.go#L78) and uses that function in cmd/main.go and in unit tests. Would it make sense to copy that here? It might also be a good place to do defaults for things like recheckInterval.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a constructor for custom types is in general a good idea.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@tzumainn
Copy link
Copy Markdown

@adriengentil just to check - is there any reason for this operator to use multicluster-runtime? It seems like it's needed in osac-operator for VMs, but I'm not sure it's needed here.

Comment thread internal/ironic/client.go
Comment thread Containerfile
Comment thread Containerfile
# was called. For example, if we call make docker-build in a local env which has the Apple Silicon M1 SO
# the docker BUILDPLATFORM arg will be linux/arm64 when for Apple x86 it will be linux/amd64. Therefore,
# by leaving it empty we can ensure that the container and binary shipped on it will have the same platform.
RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} go build -a -o manager cmd/main.go
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why provide a default for GOOS? Shouldn't the same logic apply to GOOS that applies to GOARCH?

Copy link
Copy Markdown
Author

@DanNiESh DanNiESh Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Containerfile is generated by operator-sdk init command. By reading the comment above this command, i think GOARCH is left without an explicit fallback so it follows the build platform/host architecture. OS is constrained by runtime image, arch is meant to float with build target.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than copying this from osac-operator, we should move it to a common location and then import it (so that if we make changes or fix bugs in one place we don't need to remember to make the same changes in another place).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough! I saw there is an archived repo: fulfillment-common. Do we want to setup a similar repo to store these common files?

client.Client
Scheme *runtime.Scheme
IronicClient *ironic.Client
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a constructor for custom types is in general a good idea.

Comment on lines +63 to +65
if !hostLease.DeletionTimestamp.IsZero() {
return ctrl.Result{}, nil
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there no actions this operator needs to perform when deleting a host?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Baremetalpool controller handles HostLease finalizer? (cc: @ajamias) Baremetalpool controls HostLease lifecycle and triggers the cleanup workflow. HostLeaseController just need to perform poweroff, deprovisioning, etc.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend this controller have its own finalizer for each HostLease too. When a HostLease gets deleted, both the host inventory and management controllers need to perform their appropriate actions. For the inventory controller, its making sure the inventory host has no more references to a BareMetalPool or HostLease. For the management controller, I assume it would be something to do with a HostTeardownWorkflow. Once both controllers perform their action, each removes their finalizer on the HostLease and the resource can get deleted safely.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - think of it this way: the HostLease CR lifecycle starts with the BareMetalPool operator, then the Host Inventory operator, then the Host Management operator, with each updating the HostLease in its own way. When the HostLease CR is deleted, the operators will need to perform cleanup actions in reverse - the management operator running a teardown template, the inventory operator unmarking the host in the inventory, etc. That means each operator will want its own finalizer to ensure those operations are actually completed.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Er, I just realized Lars had an actual specific question. For now, we're ignoring the tear down workflow - however, I think the host management operator needs to unset the host class. That will allow the Host Inventory Operator to reconcile the HostLease CR, see that it is in the deleting state, and unmark the host in the inventory. @ajamias does this match what you were thinking?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the host inventory operator specifically only matches those Host Leases without a host class set though? That means the host management operator has to unset the host class - unless there's some other matching mechanism in play?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The host inventory controller reconciles only if !hostLease.DeletionTimestamp.IsZero() or hostLease.Spec.HostClass == "". When DeletionTimestamp is nonzero, if the HostLease still has the host management finalizer, don't clean up in the host inventory controller yet.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to have that first condition? To me, it seems much cleaner to have a simple strict rule regarding which operator will reconcile a HostLease, with that being the value in spec.HostClass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also seems to me that if you include that first rule, you could run into issues where a HostLease is deleted, and the host management operator will run the teardown workflow while the host inventory operator also tries to return the backend host.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update:

  • added a finalizer for hostLease controller
  • during HostLease deletion, the controller unsets spec.hostClass and then removes its finalizer, allowing deletion to transition to host inventory operator


node, err := r.IronicClient.GetNode(ctx, hostLease.Spec.ExternalID)
if err != nil {
log.Error(err, "failed to get Ironic node", "nodeID", hostLease.Spec.ExternalID)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does requeue, because it returns an error. Kubernetes will automatically retry the operation with exponential backoff.


node, err = r.IronicClient.GetNode(ctx, hostLease.Spec.ExternalID)
if err != nil {
log.Error(err, "failed to refresh node after power reconciliation")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log message should print out the error received from ironic. We should probably expose this failure in the status somewhere as well.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exposed this failure to power condition and update status.

return ctrl.Result{RequeueAfter: recheckInterval}, nil
}

return ctrl.Result{}, nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously you were setting an explicit requeue interval here in order to discover changes on the ironic side of things (which otherwise won't trigger a new reconciliation loop). How are we handling that now?

Copy link
Copy Markdown
Author

@DanNiESh DanNiESh Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the explicit requeue logic just above this line. It requeues to poll Ironic until power state matches desired.
This line: return ctrl.Result{}, nil is executed when desired state matches actual state and all prior steps succeeded. So resource is removed from the work queue

}

// Sync the HostLease status and update corresponding conditions
func (r *HostLeaseReconciler) syncHostLeaseStatus(hostLease *v1alpha1.HostLease, node *nodes.Node) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear how exactly this relates to reconcilePower. They both set v1alpha1.HostConditionPowerSynced, which seems odd to have that logic spread across two different methods.

Copy link
Copy Markdown
Author

@DanNiESh DanNiESh Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: consolidated HostConditionPowerSynced condition sync into one method -syncHostLeaseStatus. reconcilePower only sets power and passes the errors (if any) to syncHostLeaseStatus

@DanNiESh DanNiESh force-pushed the main branch 3 times, most recently from 9b4d475 to bb2bccd Compare April 17, 2026 02:10
@DanNiESh
Copy link
Copy Markdown
Author

Some initial comments:

  • osac-operator has different contents at the top-level directory and within .github/; is there a reason not to standardize on that?
  • can you add unit tests for the go code?

Update:

  • standardized the top-level directory and .github/ files
  • added unit tests for internal/ironic/client.go and internal/controller/hostlease_controller.go

Implement a controller that watches HostLease CR (from osac-operator) and
reconciles spec.poweredOn to Ironic power on/off. Filters on
spec.hostClass == "openstack"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Initial Host Management Operator (OpenStack) implementation

4 participants