
feat(deploy): redesign metald network architecture #3818

Merged

chronark merged 2 commits into main from push-nuymlkoqsrmv on Sep 3, 2025

Conversation


@imeyer imeyer commented Aug 20, 2025

What does this PR do?

Fixes # (issue)

If there is not an issue for this, please create one first. This is used for tracking purposes and also helps us understand why this PR exists

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Chore (refactoring code, technical debt, workflow improvements)
  • Enhancement (small improvements)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How should this be tested?

  • Test A
  • Test B

Checklist

Required

  • Filled out the "How to test" section in this PR
  • Read Contributing Guide
  • Self-reviewed my own code
  • Commented on my code in hard-to-understand areas
  • Ran pnpm build
  • Ran pnpm fmt
  • Checked for warnings, there are none
  • Removed all console.logs
  • Merged the latest changes from main onto my branch with git pull origin main
  • My changes don't cause any responsiveness issues

Appreciated

  • If a UI change was made: Added a screen recording or screenshots to this PR
  • Updated the Unkey Docs if changes were necessary

Summary by CodeRabbit

  • New Features
    • Optional step-based Docker build pipeline in builderd (feature flag).
    • Firecracker backend with integrated jailer and VM metrics for metald.
    • New cleanup script to remove all Deploy components.
  • Improvements
    • More robust startup/shutdown and retryable base-asset initialization.
    • Systemd units now enforce dependencies (spire-agent, assetmanagerd, metald).
  • Breaking Changes
    • metald client/CLI now use ProjectID and EnvironmentID (UserID removed); config templates removed.
  • Bug Fixes
    • Safer rootfs optimization with path validation; adjusted telemetry timeouts.
  • Documentation
    • Removed legacy guides, examples, and environment templates.
  • Chores
    • Dependency updates across modules.


changeset-bot bot commented Aug 20, 2025

⚠️ No Changeset found

Latest commit: c2cb27f

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types



vercel bot commented Aug 20, 2025

The latest updates on your projects.

2 Skipped Deployments

  • dashboard — Ignored (Preview) — updated Sep 3, 2025 7:23pm (UTC)
  • engineering — Ignored (Preview) — updated Sep 3, 2025 7:23pm (UTC)


coderabbitai bot commented Aug 22, 2025

📝 Walkthrough

Removes multiple repository docs; adjusts Makefiles and systemd units; adds a Makefile target for proto generation; modifies DeployWorkflow to insert a build record; introduces builderd asset init with retry, a step-based Docker pipeline executor, shutdown tests, and metrics changes; overhauls metald: client/config/service, database (sqlc), networking model, Firecracker backend, Docker backend, health tests; adds a cleanup script.

Changes

Cohort / File(s) Summary
Repository docs cleanup
CLAUDE.md, go/deploy/CLAUDE.md, go/deploy/metald/contrib/systemd/README.md, go/deploy/metald/contrib/systemd/environment.example, go/deploy/metald/contrib/systemd/fedora-installation.md, go/deploy/metald/client/examples/*, go/deploy/metald/internal/jailer/README.md, go/deploy/metald/internal/service/CLEANUP_BENCHMARKS.md
Remove various guidance and example configuration/documentation files.
Control-plane DeployWorkflow
go/apps/ctrl/services/deployment/deploy_workflow.go
Remove partitionDB field; add Step 3 to insert build into DB; minor imports update.
Top-level Go Makefile
go/Makefile
Add target generate-builder to run buf codegen for proto/deploy/builderd.
Assetmanagerd service changes
go/deploy/assetmanagerd/Makefile, .../contrib/systemd/assetmanagerd.service, .../go.mod, .../internal/service/service.go
Simplify Makefile (remove create-user/env-example/generate, adjust all/build/install); systemd now Requires/After spire-agent; bump protobuf indirect; remove tenant context usage in queries.
Builderd build/run config
go/deploy/builderd/Makefile, .../contrib/systemd/builderd.service, .../environment.example, .../go.mod
Trim Makefile (remove create-user, lint/proto targets, quick-build, generate prereq); systemd adds Requires (spire-agent, assetmanagerd) and env flag to disable pipeline executor; delete env example; bump protobuf.
Builderd startup and shutdown
go/deploy/builderd/cmd/builderd/main.go, .../cmd/builderd/shutdown_test.go
Switch to assets.BaseAssetManager with retry/metrics; add robust shutdown with mutex/timeout/promo server handling; add concurrent/sequence/timeout shutdown tests.
Builderd assets and metrics
go/deploy/builderd/internal/assets/base.go, .../internal/observability/metrics.go, .../internal/observability/otel.go
Add MetricsRecorder, WithMetrics, InitializeBaseAssetsWithRetry; replace tenant metrics with base-asset-init metrics, adjust method signatures; reduce OTLP export timeouts to 10s.
Builderd executor (pipeline feature flag)
go/deploy/builderd/internal/config/config.go, .../internal/executor/registry.go, .../internal/executor/docker_pipeline.go, .../internal/executor/steps.go, .../internal/executor/docker_steps.go
Add UsePipelineExecutor config; register DockerPipelineExecutor when enabled; add pipeline types/steps and orchestration for Docker builds.
Builderd executor (monolithic edits)
go/deploy/builderd/internal/executor/docker.go
Remove tenant context in logs/metrics; drop TenantID from BuildResult literal.
Builderd minor imports
go/deploy/builderd/client/client.go, .../internal/assetmanager/client.go, .../internal/service/builder.go, .../internal/tenant/manager.go
Import reordering and metrics call arg updates; no behavior change otherwise.
Metald CLI and client
go/deploy/metald/client/client.go, .../client/cmd/metald-cli/main.go, .../client/config.go, .../client/vmconfig.go, .../client/example_test.go, .../client/examples/*, .../client/go.mod
Client adds ProjectID/EnvironmentID; headers updated; CLI adds flags; remove built-in config templates and file-save helpers; drop template constructor; remove examples; update toolchain and replaces.
Metald service entrypoints
go/deploy/metald/cmd/metald/main.go, .../cmd/metald-init/main.go
Replace reconciler/network manager flow; use new database handler and new proto handler; raise log level; update octal literals and cmdline parsing in metald-init.
Metald systemd and Makefile
go/deploy/metald/contrib/systemd/metald.service, .../Makefile, .../README.md, .../.gitignore
Systemd adds Requires/After deps; remove mkdir netns prestart; Makefile adds bridge targets, changes build/test flags, removes dev/metrics; doc wording updates; ignore networks-seed.sql.
Metald backend (Docker)
go/deploy/metald/internal/backend/docker/*
Use crypto/rand for ports and safe conversions; adjust proto imports; port mapping uses 32-bit; minor logging and network info omissions.
Metald backend (Firecracker new)
go/deploy/metald/internal/backend/firecracker/*
Add new Client/types; implement CreateVM/BootVM/DeleteVM/Shutdown/Pause/Resume/Reboot/GetVMInfo/GetVMMetrics; asset management helpers; kernel args and metadata helpers; config builder; validation. Remove old SDK v4 and tests.
Metald backend removals (Cloud Hypervisor)
go/deploy/metald/internal/backend/cloudhypervisor/*
Remove Cloud Hypervisor client and types.
Metald backend types
go/deploy/metald/internal/backend/types/backend.go
Update proto import path; remove VMInfo.NetworkInfo.
Metald networking (overhaul)
go/deploy/metald/internal/network/*
Remove old allocator/manager/metrics/protection/impl/tests; add simplified Manager/Config, bridge ensure, tap placeholder; introduce subnet calculation utilities and expanded VMNetwork fields; change device name generation (Bridge field).
Metald database (migration to sqlc)
go/deploy/metald/internal/database/*
Remove old Database and VMRepository; add sqlc-generated DBTX, Queries, models, querier, and queries; add Database handler constructors.
Metald service layer
go/deploy/metald/internal/service/vm.go, .../internal/service/auth.go
VMService now uses database.Querier; simplifies VM ops; CreateVm returns HostPortPair; remove ListVms and ownership checks. Auth context shifts to User/Tenant/Project/Environment with new extractors; update dev tokens and baggage keys.
Metald health tests
go/deploy/metald/internal/health/handler_test.go, .../internal/health/system_test.go
Add comprehensive health/system info test suites with a mocked backend.
Cleanup tooling
go/deploy/cleanup-unkey-deploy.sh
Add root-only teardown script cleaning services, networks, binaries, configs, data, users/groups, iptables, and cgroups with confirmations.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Operator
  participant Builderd as builderd main
  participant Assets as BaseAssetManager
  participant Metrics as BuildMetrics

  Operator->>Builderd: start
  Builderd->>Assets: NewBaseAssetManager().WithMetrics(Metrics)
  loop retry until success/cancel
    Builderd->>Assets: InitializeBaseAssetsWithRetry(ctx)
    alt retryable error
      Assets-->>Metrics: RecordBaseAssetInitRetry(attempt, reason)
    end
  end
  alt failed after retries
    Assets-->>Metrics: RecordBaseAssetInitFailure(total_attempts, error)
  end
  Builderd-->>Operator: ready
sequenceDiagram
  autonumber
  actor API
  participant ExecReg as ExecutorRegistry
  participant DPExec as DockerPipelineExecutor
  participant Pipe as BuildPipeline
  participant Steps as StepExecutors

  API->>ExecReg: select executor (docker)
  alt UsePipelineExecutor=true
    ExecReg-->>API: DockerPipelineExecutor
    API->>DPExec: Execute(request)
    DPExec->>Pipe: Execute(StepInput)
    Pipe->>Steps: PullImage → CreateContainer → ExtractMetadata → ExtractFilesystem → OptimizeRootfs → Cleanup
    Steps-->>Pipe: StepOutput propagation
    Pipe-->>DPExec: BuildResult(rootfs, metadata)
    DPExec-->>API: BuildResult
  else
    ExecReg-->>API: DockerExecutor (monolithic)
  end
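The executor selection in the diagram above boils down to a feature-flag check at registry time. A minimal sketch, assuming illustrative type names rather than builderd's actual registry API:

```go
package main

import "fmt"

// Executor is a stand-in for builderd's build executor interface.
type Executor interface{ Name() string }

type monolithicDocker struct{}

func (monolithicDocker) Name() string { return "docker" }

type dockerPipeline struct{}

func (dockerPipeline) Name() string { return "docker-pipeline" }

// Config mirrors the UsePipelineExecutor flag described in the walkthrough.
type Config struct{ UsePipelineExecutor bool }

// NewDockerExecutor returns the step-based pipeline executor only when the
// feature flag is enabled, otherwise the existing monolithic executor.
func NewDockerExecutor(cfg Config) Executor {
	if cfg.UsePipelineExecutor {
		return dockerPipeline{}
	}
	return monolithicDocker{}
}

func main() {
	fmt.Println(NewDockerExecutor(Config{UsePipelineExecutor: true}).Name()) // docker-pipeline
	fmt.Println(NewDockerExecutor(Config{}).Name())                         // docker
}
```

Gating the new pipeline behind a flag like this lets the monolithic path remain the default while the step-based path is rolled out.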
sequenceDiagram
  autonumber
  actor Client
  participant Service as metald VMService
  participant DB as database.Querier
  participant Backend as Backend(Firecracker/Docker)

  Client->>Service: CreateVm(request)
  Service->>DB: AllocateNetwork()
  DB-->>Service: Network allocation
  Service->>Backend: CreateVM(config)
  Backend-->>Service: vm_id
  Service-->>Client: CreateVmResponse(state=CREATED, host_port_pair)
sequenceDiagram
  autonumber
  participant Ctrl as ctrl DeployWorkflow
  participant DB as db.Queries

  Ctrl->>Ctrl: log-deployment-pending
  Ctrl->>DB: InsertBuild(build_id, workspace_id, project_id, deployment_id, created_at)
  alt error
    DB-->>Ctrl: error
    Ctrl-->>Ctrl: abort
  else
    DB-->>Ctrl: ok
    Ctrl->>Ctrl: proceed to build version
  end
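Separately, the Docker backend cohort above notes a switch to crypto/rand for host-port selection with safe 32-bit conversions. A minimal sketch of that pattern; the ephemeral range below and the absence of a collision-retry loop are assumptions for illustration:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

const (
	portMin int32 = 32768
	portMax int32 = 60999
)

// randomHostPort picks a uniformly random port in [portMin, portMax] using a
// cryptographically secure source, avoiding math/rand's predictability.
func randomHostPort() (int32, error) {
	n, err := rand.Int(rand.Reader, big.NewInt(int64(portMax-portMin)+1))
	if err != nil {
		return 0, fmt.Errorf("read random: %w", err)
	}
	// Safe conversion: portMin + n is at most portMax, which fits in int32.
	return portMin + int32(n.Int64()), nil
}

func main() {
	p, err := randomHostPort()
	if err != nil {
		panic(err)
	}
	fmt.Println(p >= portMin && p <= portMax) // true
}
```

A real allocator would still need to handle the chosen port already being in use, typically by retrying or by asking the kernel to bind port 0.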

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~180 minutes

fix(metald): stuff

fix(metald): clean up network package

feat: overhaul metald
@imeyer imeyer marked this pull request as ready for review September 3, 2025 19:04
@imeyer imeyer enabled auto-merge September 3, 2025 19:04
@vercel vercel bot temporarily deployed to Preview – dashboard September 3, 2025 19:05 Inactive
@vercel vercel bot temporarily deployed to Preview – engineering September 3, 2025 19:05 Inactive

chronark commented Sep 3, 2025

[Screenshot: CleanShot 2025-09-03 at 21.04.52@2x]

I declare bankruptcy


imeyer commented Sep 3, 2025

broke.. like my spirit


chronark commented Sep 3, 2025

[Image: image.png]

Collaborator:

lgtm except this file, can you revert this file, or use the one that is in main, cause we need the partition db and we don't need the insert-build step.
I'd be surprised if this actually builds successfully right now

Collaborator:

happy to merge this as is after that

Contributor Author:

ah that's weird.. my LSP didn't see that as an issue. on it.

Contributor Author:

🤔 Hm, I thought the side I chose was main but I think I chose the wrong side oops. Well that makes me worried for the rest of the rebase action I had yesterday but.. the changes shouldn't impact anything in production so I think we just fix as necessary and do better in the future lol.

Contributor Author:

can you look at this and resolve the convo if it's correct @chronark


github-actions bot commented Sep 3, 2025

Thank you for following the naming conventions for pull request titles! 🙏


imeyer commented Sep 3, 2025

lmao i forgot to commit the changes i made while we were on the hangout


chronark commented Sep 3, 2025

works for me


chronark commented Sep 3, 2025

you sure about those uints? you had pretty strong opinions about keeping it as int earlier this week 😛


chronark commented Sep 3, 2025

ok cool. if CI is green, I’ll merge


imeyer commented Sep 3, 2025

i'm one of those "i'll address it in the next iteration" kind of people... the rabbit just tries to hold me back!!!


chronark commented Sep 3, 2025

I should’ve led with a sarcasm warning

@chronark chronark disabled auto-merge September 3, 2025 19:31
@chronark chronark merged commit f4183e2 into main Sep 3, 2025
16 checks passed
@chronark chronark deleted the push-nuymlkoqsrmv branch September 3, 2025 19:31

imeyer commented Sep 3, 2025

biggest yeet this side of 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 181

 google.golang.org/genproto/googleapis/rpc v0.0.0-20250707201910-8d1bb00bc6a7 // indirect
 google.golang.org/grpc v1.73.0 // indirect
-google.golang.org/protobuf v1.36.6 // indirect
+google.golang.org/protobuf v1.36.8 // indirect

💡 Verification agent

🧩 Analysis chain

Proto bump looks fine; verify repo-wide alignment and regen if needed.

Confirm all modules now resolve google.golang.org/protobuf v1.36.8 and regenerate any generated protobuf code if you vendor stubs or pin protoc-gen-go. Run go mod tidy afterwards.


🏁 Script executed:

#!/bin/bash
# Check protobuf version across all modules
fd go.mod | xargs -I{} bash -lc 'echo "== {} =="; rg -n "google.golang.org/protobuf" {} || true'
# Inspect effective deps in this module
( cd go/deploy/assetmanagerd && go list -m all | rg "^google.golang.org/(grpc|protobuf|genproto)" )
# Optional: locate protoc-gen-go usage to decide if regen is needed
rg -nP "(protoc|buf).*protoc-gen-go" -g '!**/node_modules/**' -C2

Length of output: 3027


Align google.golang.org/protobuf to v1.36.8 repository-wide
Bump protobuf to v1.36.8 in the following go.mod files (currently on older versions):

  • apps/agent/go.mod (v1.36.5)
  • apps/chproxy/go.mod (v1.36.6)
  • go/deploy/billaged/go.mod (v1.36.6)
  • go/deploy/metald/go.mod (v1.36.6)
  • go/deploy/pkg/observability/interceptors/go.mod (v1.36.6)
  • go/deploy/pkg/spiffe/go.mod (v1.36.5)
  • go/deploy/pkg/telemetry/go.mod (v1.36.6)
  • go/deploy/pkg/tls/go.mod (v1.36.5)

After aligning versions, regenerate any vendored protobuf stubs or run your protoc-gen-go pipeline (none found via rg search) and run go mod tidy in each module.


 # Targets (alphabetically ordered)
-.PHONY: all build check clean create-user deps dev env-example fmt generate help install lint lint-proto run service-logs service-logs-tail service-restart service-start service-status service-stop setup test test-coverage uninstall version vet
+.PHONY: all build check clean deps dev env-example fmt generate help install lint lint-proto run service-logs service-logs-tail service-restart service-start service-status service-stop setup test test-coverage uninstall version vet

⚠️ Potential issue

PHONY lists non-existent targets (generate, env-example)

They were removed but still declared; this breaks expectations and confuses help.

-.PHONY: all build check clean deps dev env-example fmt generate help install lint lint-proto run service-logs service-logs-tail service-restart service-start service-status service-stop setup test test-coverage uninstall version vet
+.PHONY: all build check clean deps dev fmt help install lint lint-proto run service-logs service-logs-tail service-restart service-start service-status service-stop setup test test-coverage uninstall version vet
🤖 Prompt for AI Agents
In go/deploy/assetmanagerd/Makefile around line 39, the .PHONY declaration still
lists targets that no longer exist (generate, env-example); remove those entries
from the .PHONY line or restore the corresponding targets so the declaration
matches actual Makefile targets, and then run make help to verify output and
update any documentation referencing them.

 .PHONY: all build check clean deps dev env-example fmt generate help install lint lint-proto run service-logs service-logs-tail service-restart service-start service-status service-stop setup test test-coverage uninstall version vet

-all: clean generate build ## Clean, generate, and build
+all: clean build ## Clean, generate, and build

🧹 Nitpick (assertive)

Update help text for all target

It no longer generates; adjust description.

-all: clean build ## Clean, generate, and build
+all: clean build ## Clean and build
🤖 Prompt for AI Agents
In go/deploy/assetmanagerd/Makefile around line 41, the `all` target help text
still says "Clean, generate, and build" but the target no longer performs
generation; update the description to accurately reflect behavior (e.g., "Clean
and build" or similar). Edit the comment after the target to replace "Clean,
generate, and build" with the corrected short description and ensure formatting
matches the existing help comment style.

Comment on lines 57 to 58
dev: generate ## Run in development mode
@go run ./cmd/$(SERVICE_NAME)

⚠️ Potential issue

Broken deps: dev/setup depend on missing “generate”

These targets will fail with “No rule to make target ‘generate’”.

-dev: generate ## Run in development mode
+dev: ## Run in development mode
 	@go run ./cmd/$(SERVICE_NAME)
@@
-setup: deps generate ## Complete development setup
+setup: deps ## Complete development setup

Also applies to: 111-112

🤖 Prompt for AI Agents
go/deploy/assetmanagerd/Makefile around lines 57-58 (and also 111-112): the dev
and dev/setup targets declare a dependency on a non-existent "generate" target
which causes "No rule to make target 'generate'"; add a concrete "generate"
target before those lines that runs the project’s code-generation steps (for
example run go generate ./... and any protoc/Swagger/gen commands your repo
requires) or, if generation is not needed for development, remove the "generate"
dependency from the dev and dev/setup targets; ensure the new target is marked
.PHONY and invokes the exact generation commands used elsewhere in the repo so
the make targets succeed.

Comment on lines +14 to +15
assetv1 "github.com/unkeyed/unkey/go/gen/proto/deploy/assetmanagerd/v1"
"github.com/unkeyed/unkey/go/gen/proto/deploy/assetmanagerd/v1/assetmanagerdv1connect"

💡 Verification agent

🧩 Analysis chain

Import reorder is a no-op — OK

No behavioral change. Proceed.

Given significant builderd changes in this PR, remember the learning to bump the patch-level version for the service. If applicable here, please confirm the version bump.


🏁 Script executed:

#!/bin/bash
# Search for builderd version declarations/bakes to confirm a patch bump happened.
rg -nP -C2 '(?i)\b(version|appVersion|buildVersion)\b' go/deploy/builderd | sed -n '1,200p'

Length of output: 9672


Bump all version references to 0.5.1
Makefile is at 0.5.1, but README.md still shows 0.1.0 (go/deploy/builderd/README.md:278) and the CLI help default OTEL version shows “default: 0.1.0” (go/deploy/builderd/cmd/builderd/main.go:478). Update both to 0.5.1.

🤖 Prompt for AI Agents
In go/deploy/builderd/internal/assetmanager/client.go around lines 14-15 and
also update go/deploy/builderd/README.md (around line ~278) and
go/deploy/builderd/cmd/builderd/main.go (around line ~478), bump all version
references from 0.1.0 to 0.5.1; specifically replace the README’s displayed
version and the CLI help default OTEL version string with "0.5.1" and ensure any
other occurrences in these files or nearby lines are updated to match 0.5.1 so
Makefile, docs, and CLI defaults are consistent.

Comment on lines 198 to 201
// TODO: In production, implement proper tenant-user relationship checks
// This should query a tenant membership service or database
// This should query a tenant membership service or database using ctx for timeouts/tracing
_ = ctx // Will be used for database queries in production implementation


🧹 Nitpick (assertive)

Upgrade TODO to AIDEV-TODO per repo guidelines and emphasize context usage.

-	// TODO: In production, implement proper tenant-user relationship checks
-	// This should query a tenant membership service or database using ctx for timeouts/tracing
+	// AIDEV-TODO: In production, implement proper tenant-user relationship checks.
+	// Use ctx for deadlines/tracing; query an authz service or DB for membership/roles.
 	_ = ctx // Will be used for database queries in production implementation
🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/auth.go around lines 198 to 201, replace
the generic TODO comment and the placeholder "_ = ctx" with an AIDEV-TODO per
repo guidelines and update the note to explicitly require using the passed ctx
for timeouts, cancellation and tracing when querying the tenant membership
service or DB; in practice change the comment to start with "AIDEV-TODO:" and
state that the implementation must query a tenant membership store using ctx
(context-aware DB calls, propagate tracing/span IDs, respect
deadlines/cancellations) so future implementers know to use ctx rather than
ignore it.

Comment on lines +209 to 211
if requestedTenantID == "restricted-tenant" && customerCtx.UserID != "admin-user" {
return fmt.Errorf("access denied: user %s cannot access restricted tenant", customerCtx.UserID)
}

🛠️ Refactor suggestion

Gate the demo “restricted-tenant” rule to dev mode.

Prevents accidental prod behavior differences.

-	if requestedTenantID == "restricted-tenant" && customerCtx.UserID != "admin-user" {
+	if os.Getenv("UNKEY_METALD_AUTH_MODE") == "dev" &&
+		requestedTenantID == "restricted-tenant" && customerCtx.UserID != "admin-user" {
 		return fmt.Errorf("access denied: user %s cannot access restricted tenant", customerCtx.UserID)
 	}

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/auth.go around lines 209-211, the hardcoded
rule allowing only admin-user to access "restricted-tenant" must be limited to
development mode; modify the conditional to check a dev-mode flag (preferably an
injected config like cfg.IsDev or, if not available, os.Getenv("APP_ENV") ==
"development") in addition to requestedTenantID == "restricted-tenant" and
customerCtx.UserID != "admin-user", so the access-deny logic only runs when
running in dev mode.

 GOOS ?= $(shell go env GOOS)
 GOARCH ?= $(shell go env GOARCH)
-LDFLAGS := -ldflags "-s -w -X main.version=$(VERSION)"
+LDFLAGS := -ldflags "-s=false -w=false -X main.version=$(VERSION)"

⚠️ Potential issue

Debug flags inappropriately set in production build.

The LDFLAGS have been modified to disable stripping (-s=false -w=false), which will significantly increase binary size and potentially expose sensitive debug information in production builds. These flags should only be used in debug builds.

Apply this diff to fix the production build flags:

-LDFLAGS := -ldflags "-s=false -w=false -X main.version=$(VERSION)"
+LDFLAGS := -ldflags "-s -w -X main.version=$(VERSION)"

For debug builds, you should use separate debug flags:

+DEBUG_LDFLAGS := -ldflags "-X main.version=$(VERSION)"
🤖 Prompt for AI Agents
In go/deploy/metald/Makefile around line 11, LDFLAGS currently disable symbol
and DWARF stripping (-s=false -w=false); change production LDFLAGS to enable
stripping (-s -w) while preserving the version injection, and implement a
separate debug path (e.g., a DEBUG flag or distinct DEBUG_LDFLAGS) that sets
unstripped flags for local/dev builds; ensure the default Makefile target uses
the stripped production LDFLAGS and only uses the debug flags when an explicit
DEBUG=true (or similar) is provided.

Comment on lines +80 to 83
Metald runs as root to manage network namespaces, interfaces, and iptables operations. This is acceptable as metald is designed to be the sole application on dedicated VM hosts. The integrated jailer still drops privileges to specified UID/GID for individual VM processes, ensuring proper isolation.

The `make install` command configures these automatically.
The `make install` command configures the service with appropriate permissions automatically.


🧹 Nitpick (assertive)

Fix double-spacing and add an AIDEV note on privileged operation.

Fixes two grammar nits and adds an anchor reminding maintainers to revisit hardening.

-Metald runs as root to manage network namespaces, interfaces, and iptables operations. This is acceptable as metald is designed to be the sole application on dedicated VM hosts. The integrated jailer still drops privileges to specified UID/GID for individual VM processes, ensuring proper isolation.
+Metald runs as root to manage network namespaces, interfaces, and iptables operations. This is acceptable as metald is designed to be the sole application on dedicated VM hosts. The integrated jailer still drops privileges to specified UID/GID for individual VM processes, ensuring proper isolation.
+
+<!-- AIDEV-NOTE: Running as root -->
+When feasible, apply systemd hardening (ProtectSystem, ProtectControlGroups, PrivateUsers, CapabilityBoundingSet) and network namespace isolation to reduce blast radius.
 
-The `make install` command configures the service with appropriate permissions automatically.
+The `make install` command configures the service with appropriate permissions automatically.
🧰 Tools
🪛 LanguageTool

[grammar] ~80-~80: Use correct spacing
Context: ...VM processes, ensuring proper isolation. The make install command configures th...

(QB_NEW_EN_OTHER_ERROR_IDS_5)


[grammar] ~82-~82: Use correct spacing
Context: ...h appropriate permissions automatically. ## Contributing See [Development Setup](./...

(QB_NEW_EN_OTHER_ERROR_IDS_5)

🤖 Prompt for AI Agents
In go/deploy/metald/README.md around lines 80 to 83, fix the double-spacing in
the paragraph about metald running as root (remove any extra spaces between
sentences) and append a short AIDEV note/anchor reminding maintainers to revisit
hardening of privileged operations (e.g., add a single-line anchor or NOTE
(AIDEV): comment stating "revisit privileged operation hardening and service
isolation" and linkable anchor name). Ensure the sentence flow remains
grammatical and the new AIDEV note is concise and clearly identifiable for
future reference.
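For reference, the hardening directives named above might appear in the metald unit file roughly like this (directive values are illustrative assumptions, not what `make install` currently ships; `PrivateUsers` is omitted because metald must run as root):

```ini
[Service]
# Illustrative sketch; every directive must be tested against metald's need
# for CAP_NET_ADMIN (network namespaces, interfaces, iptables).
ProtectSystem=full
ProtectControlGroups=yes
ProtectHome=yes
CapabilityBoundingSet=CAP_NET_ADMIN CAP_NET_RAW CAP_SYS_ADMIN CAP_SETUID CAP_SETGID
```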

Comment on lines +55 to +57
generate-builder:
buf generate --path proto/deploy/builderd


🧹 Nitpick (assertive)

Add generate-builder target — add PHONY and tools dependency

Looks good. Make it phony and depend on tools so buf is present.

Apply:

- .PHONY: install tools fmt test-unit test-integration test-integration-long test-stress test build generate pull up clean
+ .PHONY: install tools fmt test-unit test-integration test-integration-long test-stress test build generate generate-builder pull up clean
@@
-generate-builder:
-	buf generate --path proto/deploy/builderd
+generate-builder: tools
+	buf generate --path proto/deploy/builderd

Note: Per prior learning, running buf from go/ with out: gen is correct.

🤖 Prompt for AI Agents
In go/Makefile around lines 55 to 57, the generate-builder target should be
declared phony and made to depend on the tools target so buf is available;
update the Makefile to add generate-builder to the .PHONY list and change the
target to depend on tools (e.g., "generate-builder: tools") while keeping the
existing command "buf generate --path proto/deploy/builderd".

@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

Comment on lines +7 to +8
"io"
"net/http"

⚠️ Potential issue

Remove unused imports (io, net/http) — currently cause build failure.

These are only referenced in a large block comment. Go rejects unused imports.

Apply:

-	"io"
-	"net/http"
🤖 Prompt for AI Agents
In go/apps/ctrl/services/deployment/deploy_workflow.go around lines 7-8, the
imports "io" and "net/http" are unused (only mentioned in a block comment) and
cause a build failure; remove "io" and "net/http" from the import block (or only
re-add them when actual code uses them) so the file compiles cleanly.

Comment on lines +90 to +109
// Step 3: Insert build into database
err = hydra.StepVoid(ctx, "insert-build", func(stepCtx context.Context) error {
w.logger.Info("inserting build into database", "build_id", buildID)
insertErr := db.Query.InsertBuild(stepCtx, w.db.RW(), db.InsertBuildParams{
ID: buildID,
WorkspaceID: req.WorkspaceID,
ProjectID: req.ProjectID,
DeploymentID: req.DeploymentID,
CreatedAt: time.Now().UnixMilli(),
})
if insertErr != nil {
return fmt.Errorf("failed to create build record: %w", insertErr)
}
w.logger.Info("build record created successfully", "build_id", buildID)
return nil
})
if err != nil {
w.logger.Error("failed to insert build", "error", err, "build_id", buildID)
return err
}

⚠️ Potential issue

buildID is undefined — initialize before InsertBuild.

Compilation fails without a value.

Apply:

 	}
 
-	// Step 3: Insert build into database
+	// Generate build ID
+	buildID := uid.New("build")
+
+	// Step 3: Insert build into database
 	err = hydra.StepVoid(ctx, "insert-build", func(stepCtx context.Context) error {
🤖 Prompt for AI Agents
In go/apps/ctrl/services/deployment/deploy_workflow.go around lines 90 to 109,
buildID is used but not defined; initialize a buildID before the InsertBuild
step (for example: buildID := uuid.NewString() or via your project's ID
generator) and use that variable in the db.InsertBuild call and subsequent logs,
and add the necessary import for the ID generator you choose.

Comment on lines +4 to 7
After=network.target spire-agent.service
Wants=network.target
Requires=spire-agent.service


💡 Verification agent

🧩 Analysis chain

Good: spire-agent ordering and dependency; suggest tighter coupling and graceful stop

  • Consider BindsTo for tighter lifetime coupling with spire-agent.
  • Add TimeoutStopSec for graceful shutdowns.
@@
-After=network.target spire-agent.service
+After=network.target spire-agent.service
 Wants=network.target
-Requires=spire-agent.service
+Requires=spire-agent.service
+BindsTo=spire-agent.service

Additionally, consider:

# outside this hunk
TimeoutStopSec=30s

Also ensure the referenced local storage path exists at boot:

# outside this hunk
ExecStartPre=+/usr/bin/mkdir -p /opt/builderd/rootfs

🏁 Script executed:

#!/bin/bash
# Verify builderd path reference and pre-create directive presence
rg -n 'UNKEY_ASSETMANAGERD_LOCAL_STORAGE_PATH|/opt/builderd/rootfs' go/deploy/assetmanagerd/contrib/systemd -C2

Length of output: 730


Add tighter service coupling and graceful shutdown steps

  • Under the existing Requires=spire-agent.service, add
     After=network.target spire-agent.service
     Wants=network.target
     Requires=spire-agent.service
    +BindsTo=spire-agent.service
  • Outside that hunk, add for graceful stop:
    TimeoutStopSec=30s
  • And pre-create the local storage directory at boot:
    ExecStartPre=+/usr/bin/mkdir -p /opt/builderd/rootfs
🤖 Prompt for AI Agents
In go/deploy/assetmanagerd/contrib/systemd/assetmanagerd.service around lines 4
to 7, add tighter coupling and graceful shutdown/start steps: add a
Wants=spire-agent.service line beneath the existing
Requires=spire-agent.service, and add TimeoutStopSec=30s (to the Service
section) and ExecStartPre=+/usr/bin/mkdir -p /opt/builderd/rootfs (to the
Service section before ExecStart) so the unit pre-creates the local storage
directory and allows a 30s graceful stop.
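Putting the suggestions together, the affected parts of assetmanagerd.service would look roughly like this (the `ExecStart` path is illustrative; the coupling, timeout, and pre-create lines are the values suggested above):

```ini
[Unit]
After=network.target spire-agent.service
Wants=network.target
Requires=spire-agent.service
BindsTo=spire-agent.service

[Service]
# '+' runs the command with full privileges even if the unit later drops them
ExecStartPre=+/usr/bin/mkdir -p /opt/builderd/rootfs
ExecStart=/usr/local/bin/assetmanagerd
TimeoutStopSec=30s
```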

Comment on lines +59 to +60
atomic.StoreInt64(&shutdownStarted, 0)


🛠️ Refactor suggestion

Don’t reset the shared shutdown flag inside each goroutine.

This defeats the purpose of racing multiple shutdown attempts against the same guard.

-      // Reset shutdown state before each attempt
-      atomic.StoreInt64(&shutdownStarted, 0)
+      // Intentionally share shutdownStarted across goroutines to exercise the guard.
🤖 Prompt for AI Agents
In go/deploy/builderd/cmd/builderd/shutdown_test.go around lines 59-60, do not
reset the shared shutdown flag inside each goroutine (remove the
atomic.StoreInt64(&shutdownStarted, 0) call from within the goroutine); instead
initialize or reset the shared atomic shutdownStarted once before spawning the
concurrent goroutines so all goroutines race against the same guard, ensuring
the test exercises multiple concurrent shutdown attempts correctly.

Comment on lines +131 to +146
// Start servers in background
go func() {
if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
t.Logf("HTTP server error: %v", err)
}
}()

go func() {
if err := promServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
t.Logf("Prometheus server error: %v", err)
}
}()

// Give servers time to start
time.Sleep(100 * time.Millisecond)


🧹 Nitpick (assertive)

Optional: use Server.Serve with a listener in TestShutdownSequence for symmetry.

This mirrors the timeout test, avoids spurious log noise on shutdown, and makes startup deterministic.

If flaky behavior emerges, switch to Serve on an explicit net.Listener and wait until it’s bound before proceeding.

🤖 Prompt for AI Agents
In go/deploy/builderd/cmd/builderd/shutdown_test.go around lines 131 to 146,
replace the goroutines calling server.ListenAndServe and
promServer.ListenAndServe with starting each server via server.Serve(listener)
and promServer.Serve(promListener) where the listeners are created with
net.Listen; create the listeners before the goroutines, wait until each listener
is bound (e.g., immediately after net.Listen returns) to make startup
deterministic, and use the Serve form to avoid spurious "use of closed network
connection" logs during shutdown and to mirror the timeout test’s symmetry.

Comment on lines +30 to +38
// Config holds network configuration
type Config struct {
BaseNetwork *net.IPNet
BridgeName string
DNSServers []string // Default: ["8.8.8.8", "8.8.4.4"]
EnableIPv6 bool
EnableRateLimit bool
RateLimitMbps int // Per VM rate limit in Mbps
}

🧹 Nitpick (assertive)

Document field constraints in Config.

Add comments for invariants (non-empty BridgeName, non-nil BaseNetwork, RateLimitMbps > 0 when enabled) to guide callers.

 type Config struct {
-  BaseNetwork     *net.IPNet
-  BridgeName      string
-  DNSServers      []string // Default: ["8.8.8.8", "8.8.4.4"]
-  EnableIPv6      bool
-  EnableRateLimit bool
-  RateLimitMbps   int // Per VM rate limit in Mbps
+  BaseNetwork     *net.IPNet // required
+  BridgeName      string     // required, linux bridge device name (e.g., "br0")
+  DNSServers      []string   // optional; IP literals. Default: ["8.8.8.8", "8.8.4.4"]
+  EnableIPv6      bool
+  EnableRateLimit bool
+  RateLimitMbps   int        // required if EnableRateLimit; per-VM Mbps
 }

Add an AIDEV-NOTE near NewManager stating the validation policy.

🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/manager.go around lines 30 to 38, the
Config struct lacks comments documenting field invariants; add inline comments
for BridgeName (must be non-empty), BaseNetwork (must be non-nil), and
RateLimitMbps (must be >0 when EnableRateLimit is true) and clarify default
DNSServers; also add an AIDEV-NOTE comment immediately above NewManager
describing the validation policy (which fields are validated by NewManager vs.
callers' responsibility and whether NewManager returns an error on invalid
config) so callers know expectations and invariants.

Comment on lines +40 to +45
type Manager struct {
logger *slog.Logger
config *Config
mu sync.RWMutex
bridgeMu sync.RWMutex
}

🧹 Nitpick (assertive)

Add GoDoc for exported Manager and document locking.

Public types must be documented; also record which mutex protects which fields.

- type Manager struct {
+// Manager orchestrates bridge/tap network lifecycle.
+// AIDEV-NOTE: mu guards Manager-wide state; bridgeMu guards bridge creation/configuration.
+type Manager struct {
   logger   *slog.Logger
   config   *Config
   mu       sync.RWMutex
   bridgeMu sync.RWMutex
 }
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/manager.go around lines 40 to 45, the
exported Manager type lacks GoDoc and documentation of its locking strategy; add
a GoDoc comment immediately above the type declaration that briefly describes
Manager's purpose and explicitly documents which mutex protects which fields
(e.g., mu guards config and other mutable state, bridgeMu guards bridge-related
fields), and mention that mu is an RWMutex for concurrent read access and
bridgeMu is a separate RWMutex for bridge-specific state and why both are
needed.

Comment on lines +1 to +29
package network

// setupVMNetworking creates and configures TAP and veth devices for a VM
func (m *Manager) setupTunTap(ip string) error {
// // tap := &netlink.Tuntap{
// // LinkAttrs: netlink.LinkAttrs{
// // Name: deviceNames.TAP,
// // },
// // Mode: netlink.TUNTAP_MODE_TAP,
// // }

// // if err := netlink.LinkAdd(tap); err != nil {
// // return fmt.Errorf("failed to create TAP device %s: %w", deviceNames.TAP, err)
// // }

// // tapLink, err := netlink.LinkByName(deviceNames.TAP)
// // if err != nil {
// // return fmt.Errorf("failed to get TAP device: %w", err)
// // }

// // Set TAP device up
// if err := netlink.LinkSetUp(tap); err != nil {
// return fmt.Errorf("failed to bring TAP device up: %w", err)
// }

// m.logger.Info("TAP device created and up", slog.String("tap", tapLink.Attrs().Name))

return nil
}

⚠️ Potential issue

Stub silently returns success; add linux build tag and fail fast until implemented.

Silent no-op can mislead callers. Gate to linux, fix doc, add TODO, and return NotImplemented.

-package network
+//go:build linux
+
+package network
+
+import "fmt"
 
-// setupVMNetworking creates and configures TAP and veth devices for a VM
+// setupTunTap creates and configures TAP and veth devices for a VM.
+// AIDEV-TODO: Implement using netlink (requires CAP_NET_ADMIN). Ensure bridge attach, MTU, and ownership.
 func (m *Manager) setupTunTap(ip string) error {
@@
-	return nil
+	// AIDEV-TODO: temporary NotImplemented to avoid silent success paths.
+	return fmt.Errorf("setupTunTap: not implemented for ip=%s", ip)
 }
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/tuntap.go lines 1-29: add a linux-only
build tag at the top of the file (//go:build linux and the legacy +build line)
so this stub is only compiled on Linux, update the function comment to note it’s
unimplemented and include a TODO, and replace the silent no-op with an immediate
failing return (e.g. return fmt.Errorf("setupTunTap: not implemented")) so
callers fail fast; also add the required fmt import if missing.

build: deps ## Build the binary
@mkdir -p $(BUILD_DIR)
@go build $(LDFLAGS) -o $(BUILD_DIR)/$(BINARY_NAME) ./cmd/metald
go build $(LDFLAGS) -gcflags="all=-N -l" -o $(BUILD_DIR)/$(BINARY_NAME) ./cmd/metald

⚠️ Potential issue

Debug flags included in regular build target.

The build target now includes debug flags (-gcflags="all=-N -l"), which disable optimizations and inlining. This will significantly impact performance and should only be used in debug builds.

Apply this diff to separate production and debug builds:

 build: deps ## Build the binary
 	@mkdir -p $(BUILD_DIR)
-	go build $(LDFLAGS) -gcflags="all=-N -l" -o $(BUILD_DIR)/$(BINARY_NAME) ./cmd/metald
+	go build $(LDFLAGS) -o $(BUILD_DIR)/$(BINARY_NAME) ./cmd/metald
+
+debug-build: deps ## Build the binary with debug symbols
+	@mkdir -p $(BUILD_DIR)
+	go build $(DEBUG_LDFLAGS) -gcflags="all=-N -l" -o $(BUILD_DIR)/$(BINARY_NAME)-debug ./cmd/metald


test: ## Run all tests
@go test ./... -v
@go test -json -failfast -v ./... | tparse

💡 Verification agent

🧩 Analysis chain

Test runner changed to use tparse for better output formatting.

The change from go test ./... -v to go test -json -failfast -v ./... | tparse provides better test output formatting. However, ensure that tparse is installed as a dependency.

Run the following script to verify tparse availability:


🏁 Script executed:

#!/bin/bash
# Check if tparse is available in go.mod or as a tool dependency
echo "Checking for tparse in go.mod dependencies..."
rg -n "tparse" go/deploy/metald/go.mod

echo "Checking for tparse installation in Makefile or scripts..."
rg -n "go install.*tparse" go/deploy/metald/

echo "Checking if tparse is mentioned in documentation..."
fd README -t f | xargs rg -l "tparse"

Length of output: 464


Add tparse installation in Makefile and track as a Go tool dependency
No references to tparse were found in go.mod, scripts, or docs, so invoking go test … | tparse will break in a clean environment.

  • In go/deploy/metald/Makefile, add a target or prerequisite to install tparse, for example:
    install-tools:
	go install github.com/jstemmer/tparse/cmd/tparse@latest
    and make test depend on install-tools.
  • Track tparse in version control by adding a tools.go under go/deploy/metald/:
    // +build tools
    
    package tools
    
    import _ "github.com/jstemmer/tparse/cmd/tparse"
  • Update README or developer setup docs to include the above steps.
🤖 Prompt for AI Agents
In go/deploy/metald/Makefile around line 88 the test target pipes output to
tparse which isn't installed or tracked; add a Makefile target (e.g.,
install-tools) that runs `go install
github.com/jstemmer/tparse/cmd/tparse@latest` and make the test target depend on
that target so tparse is available in a clean environment; add a new
go/deploy/metald/tools.go file with the tools build tag that imports `_
"github.com/jstemmer/tparse/cmd/tparse"` to pin tparse as a Go tool dependency
in version control; and update the repository README/developer setup docs to
mention the tools installation or the Makefile target.

@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

Comment on lines +411 to +413
// AIDEV-NOTE: Signal handling continues during graceful shutdown to prevent SIGABRT panics
shutdownSignalReceived := make(chan struct{})


🛠️ Refactor suggestion

⚠️ Potential issue

Signal ignore loop never runs due to closed-channel select; fix to actually drain signals.

After closing shutdownSignalReceived, the select always takes that branch and returns immediately, so additional signals are not ignored/logged.

-	// Continue handling signals during graceful shutdown to prevent SIGABRT panics
-	go func() {
-		for {
-			select {
-			case <-shutdownSignalReceived:
-				// Already shutting down, ignore
-				return
-			case sig := <-sigChan:
-				logger.Warn("received additional signal during shutdown, ignoring",
-					slog.String("signal", sig.String()),
-				)
-				// Continue listening for more signals
-			}
-		}
-	}()
+	// Continue handling (and ignoring) subsequent signals during shutdown
+	go func() {
+		// Wait until shutdown starts
+		<-shutdownSignalReceived
+		for sig := range sigChan {
+			logger.Warn("received additional signal during shutdown, ignoring",
+				slog.String("signal", sig.String()),
+			)
+		}
+	}()

Optionally call signal.Stop(sigChan) just before process exit to release resources.

Also applies to: 435-450

Comment on lines 295 to 311
// Register via assetmanager client
assetID, err := m.assetClient.RegisterBuildArtifact(ctx, "base-assets", localPath, asset.Type, labels)
if err != nil {
return fmt.Errorf("failed to register asset: %w", err)
// Already exists errors are fine, connection errors should cause retry
errStr := strings.ToLower(err.Error())
if strings.Contains(errStr, "already exists") || strings.Contains(errStr, "duplicate") ||
strings.Contains(errStr, "conflict") {
m.logger.InfoContext(ctx, "base asset already registered, skipping",
"asset", asset.Name,
"error", err,
)
return nil // Success - asset already registered
} else {
// This is likely a connection/service unavailable error - should trigger retry
return fmt.Errorf("failed to register base asset %s (service may not be ready): %w", asset.Name, err)
}
}

🛠️ Refactor suggestion

Avoid substring matching for “already exists” errors; use typed/status errors

String contains checks are brittle and locale-dependent. Prefer:

  • gRPC: status.Code(err) == codes.AlreadyExists
  • Sentinel/typed errors: errors.Is(err, assetmanager.ErrAlreadyExists)

If neither is available, introduce an error classifying helper with unit tests.

Comment on lines +16 to +44
// copyMetadataForRootDevice copies metadata and creates container.cmd for a root device
func (c *Client) copyMetadataForRootDevice(ctx context.Context, vmID string, disk *metaldv1.StorageDevice, jailerRoot string, diskDst string) error {
baseName := strings.TrimSuffix(filepath.Base(disk.GetPath()), filepath.Ext(disk.GetPath()))
metadataSrc := filepath.Join(filepath.Dir(disk.GetPath()), baseName+".metadata.json")

// Check if metadata file exists
if _, err := os.Stat(metadataSrc); err != nil {
if os.IsNotExist(err) {
return nil // No metadata file is OK
}
return fmt.Errorf("failed to stat metadata file: %w", err)
}

// Copy metadata file
metadataDst := filepath.Join(jailerRoot, filepath.Base(metadataSrc))
if err := copyFileWithOwnership(metadataSrc, metadataDst, int(c.jailerConfig.UID), int(c.jailerConfig.GID)); err != nil {
return fmt.Errorf("failed to copy metadata file: %w", err)
}

c.logger.LogAttrs(ctx, slog.LevelInfo, "copied metadata file to jailer root",
slog.String("src", metadataSrc),
slog.String("dst", metadataDst),
)

// Load and process metadata to create container.cmd
metadata, err := c.loadContainerMetadata(ctx, disk.GetPath())
if err != nil || metadata == nil {
return nil // Can't create container.cmd without metadata
}

🧹 Nitpick (assertive)

Log metadata load failures at debug; add an anchor for future behavior.

Currently returns nil on errors, which can hide issues.

Apply this diff:

 	metadata, err := c.loadContainerMetadata(ctx, disk.GetPath())
 	if err != nil || metadata == nil {
-		return nil // Can't create container.cmd without metadata
+		// AIDEV-NOTE: Metadata missing/unreadable; skipping container.cmd generation is intentional.
+		// Consider promoting to debug with context to aid troubleshooting.
+		c.logger.DebugContext(ctx, "skipping container.cmd (no metadata)", slog.String("disk", disk.GetPath()), slog.Any("error", err))
+		return nil
 	}

Comment on lines +61 to +67
// Create temporary mount directory
mountDir := filepath.Join("/tmp", fmt.Sprintf("mount-%s", vmID))
if err := os.MkdirAll(mountDir, 0o755); err != nil {
return fmt.Errorf("failed to create mount directory: %w", err)
}
defer os.RemoveAll(mountDir)

🛠️ Refactor suggestion

⚠️ Potential issue

Avoid RemoveAll on an active mount; use MkdirTemp to prevent collisions.

Deferring RemoveAll means it can run while the image is still mounted if umount fails, deleting guest files; the fixed /tmp path also risks collisions between concurrent VMs.

Apply this diff:

-	// Create temporary mount directory
-	mountDir := filepath.Join("/tmp", fmt.Sprintf("mount-%s", vmID))
-	if err := os.MkdirAll(mountDir, 0o755); err != nil {
-		return fmt.Errorf("failed to create mount directory: %w", err)
-	}
-	defer os.RemoveAll(mountDir)
+	// Create a unique temporary mount directory to avoid collisions
+	mountDir, err := os.MkdirTemp("", fmt.Sprintf("metald-mnt-%s-", vmID))
+	if err != nil {
+		return fmt.Errorf("failed to create mount directory: %w", err)
+	}


Comment on lines +68 to +83
// Mount the rootfs ext4 image
mountCmd := exec.CommandContext(ctx, "mount", "-o", "loop", diskDst, mountDir)
if err := mountCmd.Run(); err != nil {
return fmt.Errorf("failed to mount rootfs: %w", err)
}
defer func() {
// Always unmount
umountCmd := exec.CommandContext(ctx, "umount", mountDir)
if err := umountCmd.Run(); err != nil {
c.logger.WarnContext(ctx, "failed to unmount rootfs",
"error", err,
"mountDir", mountDir,
)
}
}()

🛠️ Refactor suggestion

⚠️ Potential issue

Harden mount options and only remove mountpoint after successful unmount.

Use nosuid,nodev,noexec and explicit fs type. Don’t RemoveAll if unmount fails.

Apply this diff:

-	// Mount the rootfs ext4 image
-	mountCmd := exec.CommandContext(ctx, "mount", "-o", "loop", diskDst, mountDir)
+	// AIDEV-NOTE: Requires CAP_SYS_ADMIN/root. Mount rw to write container.cmd but restrict capabilities.
+	// Mount the rootfs ext4 image
+	mountCmd := exec.CommandContext(ctx, "mount", "-t", "ext4", "-o", "loop,rw,nosuid,nodev,noexec", diskDst, mountDir)
 	if err := mountCmd.Run(); err != nil {
 		return fmt.Errorf("failed to mount rootfs: %w", err)
 	}
 	defer func() {
 		// Always unmount
 		umountCmd := exec.CommandContext(ctx, "umount", mountDir)
 		if err := umountCmd.Run(); err != nil {
 			c.logger.WarnContext(ctx, "failed to unmount rootfs",
 				"error", err,
 				"mountDir", mountDir,
 			)
-		}
+			// Do not remove mountDir if unmount failed to avoid deleting guest files from the image.
+			return
+		}
+		// Safe to remove the mount directory after a successful unmount
+		_ = os.RemoveAll(mountDir)
 	}()

Comment on lines +10 to 16
// PortMapping represents a mapping from VM port to host port
type PortMapping struct {
-	ContainerPort int    `json:"container_port"`
-	HostPort      int    `json:"host_port"`
-	Protocol      string `json:"protocol"` // tcp or udp
-	VMID          string `json:"vm_id"`
+	VMPort   int    `json:"vm_port"`
+	HostPort int    `json:"host_port"`
+	Protocol string `json:"protocol"` // tcp or udp
+	VMID     string `json:"vm_id"`
}
🧹 Nitpick (assertive)

Backward-compat risk: JSON field rename to vm_port.

If persisted state or external APIs still send container_port, the value is silently dropped during unmarshalling and VMPort stays zero. Consider a custom UnmarshalJSON that accepts both keys.

// Optional: accept both "vm_port" and "container_port" during decode.
func (p *PortMapping) UnmarshalJSON(b []byte) error {
	type Alias PortMapping
	aux := struct {
		Alias
		ContainerPort *int `json:"container_port"`
	}{}
	if err := json.Unmarshal(b, &aux); err != nil {
		return err
	}
	*p = PortMapping(aux.Alias)
	if p.VMPort == 0 && aux.ContainerPort != nil {
		p.VMPort = *aux.ContainerPort
	}
	return nil
}

Comment on lines +10 to +19
// BridgeManager manages workspace allocation across multiple bridges
type BridgeManager struct {
bridgeCount int // 8 or 32 bridges
bridgePrefix string // "br-vms" -> br-vms-0, br-vms-1, etc.
workspaces map[string]*WorkspaceAllocation // workspace_id -> allocation
bridgeUsage map[int]map[string]bool // bridge_num -> workspace_id -> exists
mu sync.RWMutex
statePath string // Path to state persistence file
logger *slog.Logger // Structured logger for state operations
}
🧹 Nitpick (assertive)

Mutex/logger fields: document concurrency and logging contracts.

Since these types carry mu and logger fields, add AIDEV-NOTE comments outlining which methods take read/write locks and how logger context is propagated.

Also applies to: 29-39


Comment on lines +29 to +39
type MultiBridgeManager struct {
bridgeCount int // 8 or 32 bridges
bridgePrefix string // "br-vms" -> br-vms-0, br-vms-1, etc.
workspaces map[string]*WorkspaceAllocation // workspace_id -> allocation
bridgeUsage map[int]map[string]bool // bridge_num -> workspace_id -> exists
mu sync.RWMutex
vlanRangeStart int // Starting VLAN ID (100)
vlanRangeEnd int // Ending VLAN ID (4000)
statePath string // Path to state persistence file
logger *slog.Logger // Structured logger for state operations
}
🛠️ Refactor suggestion

Duplicated manager types (BridgeManager vs MultiBridgeManager).

Both carry near-identical fields; this invites drift. Consolidate into a single manager with config options (bridge/vlan ranges) to reduce complexity.

Add an AIDEV-NOTE explaining the design choice if you keep both.

Comment on lines +43 to +48
WorkspaceID string `json:"workspace_id"`
BridgeNumber int `json:"bridge_number"` // 0-31
BridgeName string `json:"bridge_name"` // br-vms-N
CreatedAt string `json:"created_at"`
VMCount int `json:"vm_count"` // Track VM count for IP allocation
}
🛠️ Refactor suggestion

CreatedAt should be time.Time for consistency and safer JSON.

WorkspaceAllocation.CreatedAt is string while VMNetwork.CreatedAt is time.Time. Use time.Time and let encoding/json handle RFC3339; note that omitempty has no effect on a zero time.Time, so use *time.Time if the field must be optional.

-	CreatedAt    string `json:"created_at"`
+	CreatedAt    time.Time `json:"created_at"`

Update serialization/deserialization sites accordingly.


Comment on lines +52 to 67
VMID string `json:"vm_id"`
NetworkID string `json:"network_id"` // AIDEV-NOTE: Internal 8-char ID for network device naming
WorkspaceID string `json:"workspace_id"` // AIDEV-NOTE: Track workspace for proper IP release
Namespace string `json:"namespace"`
TapDevice string `json:"tap_device"`
IPAddress net.IP `json:"ip_address"`
Netmask net.IPMask `json:"netmask"`
Gateway net.IP `json:"gateway"`
MacAddress string `json:"mac_address"`
DNSServers []string `json:"dns_servers"`
CreatedAt time.Time `json:"created_at"`

// Optional fields for advanced configurations
VLANID int `json:"vlan_id,omitempty"`
// Optional fields for advanced configuration
IPv6Address net.IP `json:"ipv6_address,omitempty"`
Routes []Route `json:"routes,omitempty"`
}
🧹 Nitpick (assertive)

JSON of net.IPMask is awkward; prefer CIDR prefix or IPNet.

net.IP marshals nicely, but net.IPMask becomes base64-encoded bytes in JSON. Replace Netmask with a CIDR prefix int or use an IPNet for the interface address.

-	Netmask     net.IPMask `json:"netmask"`
+	Prefix      int        `json:"prefix"` // e.g., 24

Add an AIDEV-NOTE on why this change improves API ergonomics.


@coderabbitai coderabbitai bot left a comment
Review continued from previous batch...

Comment on lines +1 to +24
package executor

import (
"context"
"fmt"
"log/slog"
"os"
"path/filepath"
"time"

"github.com/unkeyed/unkey/go/deploy/builderd/internal/config"
"github.com/unkeyed/unkey/go/deploy/builderd/internal/observability"
"github.com/unkeyed/unkey/go/deploy/pkg/observability/interceptors"
builderv1 "github.com/unkeyed/unkey/go/gen/proto/deploy/builderd/v1"
)

// DockerPipelineExecutor wraps the existing DockerExecutor with step-based execution
type DockerPipelineExecutor struct {
dockerExecutor *DockerExecutor
pipeline *BuildPipeline
logger *slog.Logger
config *config.Config
buildMetrics *observability.BuildMetrics
}
🧹 Nitpick (assertive)

Add high-level AIDEV anchor for pipeline semantics.

Given the orchestration role here, add a brief AIDEV-NOTE explaining idempotency, failure cleanup expectations, and resume assumptions.

Example:

// AIDEV-NOTE: Pipeline executor orchestrates idempotent steps. On failure, we still expect
// best-effort cleanup (container/image) and consistent metrics. Resume assumes step outputs
// can be reconstructed from workspaceDir/rootfsDir.

Comment on lines +44 to +55
// Get tenant context for logging and metrics
tenantID := "unknown"
if auth, ok := interceptors.TenantFromContext(ctx); ok {
tenantID = auth.TenantID
}

logger := d.logger.With(
slog.String("tenant_id", tenantID),
slog.String("build_id", buildID),
slog.String("image_uri", request.GetConfig().GetSource().GetDockerImage().GetImageUri()),
)

⚠️ Potential issue

Nil-pointer risk: image URI accessed before validation.

logger construction reaches through request.Config.Source.DockerImage before any validation; with nil-unsafe accessors this panics on malformed input, and even with nil-safe protobuf getters it silently logs an empty image_uri. Validate and extract imageURI first, then build the logger.

Apply:

-	// Get tenant context for logging and metrics
+	// Validate request and extract image info
 	tenantID := "unknown"
 	if auth, ok := interceptors.TenantFromContext(ctx); ok {
 		tenantID = auth.TenantID
 	}
 
-	logger := d.logger.With(
-		slog.String("tenant_id", tenantID),
-		slog.String("build_id", buildID),
-		slog.String("image_uri", request.GetConfig().GetSource().GetDockerImage().GetImageUri()),
-	)
+	if request == nil || request.GetConfig() == nil || request.GetConfig().GetSource() == nil ||
+		request.GetConfig().GetSource().GetDockerImage() == nil ||
+		request.GetConfig().GetSource().GetDockerImage().GetImageUri() == "" {
+		return nil, fmt.Errorf("invalid request: docker image source is required")
+	}
+	imageURI := request.GetConfig().GetSource().GetDockerImage().GetImageUri()
+
+	logger := d.logger.With(
+		slog.String("tenant_id", tenantID),
+		slog.String("build_id", buildID),
+		slog.String("image_uri", imageURI),
+	)
@@
-	dockerSource := request.GetConfig().GetSource().GetDockerImage()
-	if dockerSource == nil {
-		return nil, fmt.Errorf("docker image source is required")
-	}
+	// dockerSource validated above

AIDEV-NOTE: Validate nested request fields before logging to avoid panics on malformed input.

Also applies to: 68-72

Comment on lines +82 to +91
// Create directories
if err := os.MkdirAll(workspaceDir, 0755); err != nil {
logger.ErrorContext(ctx, "failed to create workspace directory", slog.String("error", err.Error()))
return nil, fmt.Errorf("failed to create workspace directory: %w", err)
}

if err := os.MkdirAll(rootfsDir, 0755); err != nil {
logger.ErrorContext(ctx, "failed to create rootfs directory", slog.String("error", err.Error()))
return nil, fmt.Errorf("failed to create rootfs directory: %w", err)
}
🧹 Nitpick (assertive)

Workspace/rootfs perms: consider 0750/0700 for build isolation.

0755 allows world-read/exec. If builds can contain sensitive artifacts, default to least privilege.

Apply:

-	if err := os.MkdirAll(workspaceDir, 0755); err != nil {
+	if err := os.MkdirAll(workspaceDir, 0750); err != nil {
@@
-	if err := os.MkdirAll(rootfsDir, 0755); err != nil {
+	if err := os.MkdirAll(rootfsDir, 0750); err != nil {

Comment on lines +102 to +111
// Execute the pipeline
result, err := d.pipeline.Execute(ctx, input)
if err != nil {
logger.ErrorContext(ctx, "pipeline execution failed", slog.String("error", err.Error()))
if d.buildMetrics != nil {
d.buildMetrics.RecordBuildComplete(ctx, "docker", "docker", time.Since(start), false)
}
return nil, err
}

🧹 Nitpick (assertive)

Unify metrics completion via defer to avoid duplication.

You record completion in both branches. Use a single deferred recorder keyed on err for consistency.

Example:

start := time.Now()
defer func() {
	success := err == nil
	if d.buildMetrics != nil {
		d.buildMetrics.RecordBuildComplete(ctx, "docker", "docker", time.Since(start), success)
	}
}()

Apply within function scope.

Also applies to: 112-116


Comment on lines +120 to +136
// ResumeBuild resumes a build from a specific step
func (d *DockerPipelineExecutor) ResumeBuild(ctx context.Context, request *builderv1.CreateBuildRequest, buildID string, lastCompletedStep int) (*BuildResult, error) {
start := time.Now()

// Get tenant context for logging and metrics
tenantID := "unknown"
if auth, ok := interceptors.TenantFromContext(ctx); ok {
tenantID = auth.TenantID
}

logger := d.logger.With(
slog.String("tenant_id", tenantID),
slog.String("build_id", buildID),
slog.String("image_uri", request.GetConfig().GetSource().GetDockerImage().GetImageUri()),
slog.Int("resume_from_step", lastCompletedStep+1),
)

⚠️ Potential issue

ResumeBuild has the same nil-pointer hazard.

It dereferences the image URI for logging without validation. Mirror the guard used in ExtractDockerImageWithID.

Apply:

-	// Get tenant context for logging and metrics
+	// Get tenant context and validate request
 	tenantID := "unknown"
 	if auth, ok := interceptors.TenantFromContext(ctx); ok {
 		tenantID = auth.TenantID
 	}
 
-	logger := d.logger.With(
+	if request == nil || request.GetConfig() == nil || request.GetConfig().GetSource() == nil ||
+		request.GetConfig().GetSource().GetDockerImage() == nil ||
+		request.GetConfig().GetSource().GetDockerImage().GetImageUri() == "" {
+		return nil, fmt.Errorf("invalid request: docker image source is required")
+	}
+	imageURI := request.GetConfig().GetSource().GetDockerImage().GetImageUri()
+
+	logger := d.logger.With(
 		slog.String("tenant_id", tenantID),
 		slog.String("build_id", buildID),
-		slog.String("image_uri", request.GetConfig().GetSource().GetDockerImage().GetImageUri()),
+		slog.String("image_uri", imageURI),
 		slog.Int("resume_from_step", lastCompletedStep+1),
 	)

Comment on lines 68 to 74
// DEBUG: Log full request config for debugging
if config != nil {
configJSON, _ := json.Marshal(config)
-		s.logger.LogAttrs(ctx, slog.LevelInfo, "DEBUG: Full VM config received",
+		s.logger.LogAttrs(ctx, slog.LevelDebug, "full VM config received",
slog.String("config_json", string(configJSON)),
)
}
🛠️ Refactor suggestion

Don’t log full VM config; redact or drop sensitive fields

Config may contain paths, keys, or credentials. Keep logs safe-by-default.

- configJSON, _ := json.Marshal(config)
- s.logger.LogAttrs(ctx, slog.LevelDebug, "full VM config received",
-   slog.String("config_json", string(configJSON)),
- )
+ // AIDEV-BUSINESS_RULE: Avoid logging full VM config; include only non-sensitive summary.
+ s.logger.LogAttrs(ctx, slog.LevelDebug, "vm config received (summary)",
+   slog.Any("cpu", config.GetCpu()),
+   slog.Any("memory", config.GetMemory()),
+   // omit boot/storage/net details that may be sensitive
+ )

Comment on lines 85 to +95
// Validate required fields
if validateErr := s.validateVMConfig(config); validateErr != nil {
s.logger.LogAttrs(ctx, slog.LevelError, "invalid vm config",
slog.String("error", validateErr.Error()),
// if validateErr := s.validateVMConfig(config); validateErr != nil {
// s.logger.LogAttrs(ctx, slog.LevelError, "invalid vm config",
// slog.String("error", validateErr.Error()),
// )
// if s.vmMetrics != nil {
// s.vmMetrics.RecordVMCreateFailure(ctx, s.getBackendType(), "invalid_config")
// }
// return nil, connect.NewError(connect.CodeInvalidArgument, validateErr)
// }


🛠️ Refactor suggestion

Re-enable minimal config validation

Basic guardrails prevent backend errors and undefined behavior.

- // if validateErr := s.validateVMConfig(config); validateErr != nil {
- //   ...
- // }
+ if validateErr := s.validateVMConfig(config); validateErr != nil {
+   s.logger.LogAttrs(ctx, slog.LevelError, "invalid vm config", slog.String("error", validateErr.Error()))
+   if s.vmMetrics != nil {
+     s.vmMetrics.RecordVMCreateFailure(ctx, s.getBackendType(), "invalid_config")
+   }
+   return nil, connect.NewError(connect.CodeInvalidArgument, validateErr)
+ }

If you prefer staged rollout, gate with an env flag and add an AIDEV-TODO to remove the flag later.

Also applies to: 486-543

🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/vm.go around lines 85-95 (and similarly
apply to lines 486-543), re-enable the minimal VM config validation that was
commented out: call s.validateVMConfig(config), log the validation error with
s.logger.LogAttrs at slog.LevelError including validateErr.Error(), record a VM
create failure via s.vmMetrics.RecordVMCreateFailure with reason
"invalid_config" if vmMetrics is non-nil, and return a
connect.NewError(connect.CodeInvalidArgument, validateErr). If you want a staged
rollout, wrap the validation call in an environment-flag check (e.g., if
os.Getenv("ENABLE_MINIMAL_VM_VALIDATION") == "1") and add an AIDEV-TODO comment
to remove the flag later.

Comment on lines +96 to 106
network, netErr := s.queries.AllocateNetwork(ctx)
if netErr != nil {
s.logger.Info("failed to allocate network",
slog.String("error", netErr.Error()),
)
if s.vmMetrics != nil {
s.vmMetrics.RecordVMCreateFailure(ctx, s.getBackendType(), "invalid_config")
}
return nil, connect.NewError(connect.CodeInvalidArgument, validateErr)
return nil, connect.NewError(connect.CodeInternal, netErr)
}

// Add tenant context to logs for audit trail
// AIDEV-NOTE: In multi-tenant systems, all VM operations should be logged with tenant context
s.logWithTenantContext(ctx, slog.LevelInfo, "creating vm",
slog.Int("vcpus", int(config.GetCpu().GetVcpuCount())),
slog.Int64("memory_bytes", config.GetMemory().GetSizeBytes()),
s.logger.Info("network allocated",
slog.String("network_cidr", network.BaseNetwork),
)

🧹 Nitpick (assertive)

🛠️ Refactor suggestion

Allocate-and-forget network; add cleanup on failure or defer allocation

If backend CreateVM fails, the allocated network is leaked.

Proposed pattern:

 network, netErr := s.queries.AllocateNetwork(ctx)
 if netErr != nil { ... }
+// AIDEV-TODO: release allocated network on CreateVM failure
 vmID, err := s.backend.CreateVM(ctx, config)
 if err != nil {
-  ...
+  // _ = s.queries.ReleaseNetwork(ctx, network.Id) // implement this
+  ...
 }

Alternatively, allocate the network after VM creation if the backend doesn’t require it during create.

Also applies to: 114-126

Comment on lines 145 to 149
return connect.NewResponse(&metaldv1.CreateVmResponse{
VmId: vmID,
State: metaldv1.VmState_VM_STATE_CREATED,
HostPortPair: "0.0.0.0:9999",
State: metaldv1.VmState_VM_STATE_CREATED,
}), nil
}

🧹 Nitpick (assertive)

🛠️ Refactor suggestion

⚠️ Potential issue

CreateVm response should surface a handle to manage the VM; avoid hard-coded host:port

Returning a fixed "0.0.0.0:9999" is a placeholder and clients have no durable identifier for follow-up ops.

Option A (no proto change): return vmID in a response header.

- return connect.NewResponse(&metaldv1.CreateVmResponse{
-   HostPortPair: "0.0.0.0:9999",
-   State:        metaldv1.VmState_VM_STATE_CREATED,
- }), nil
+ // AIDEV-TODO: surface the actual endpoint from networking/backends once available
+ resp := connect.NewResponse(&metaldv1.CreateVmResponse{
+   State: metaldv1.VmState_VM_STATE_CREATED,
+ })
+ resp.Header().Set("x-unkey-vm-id", vmID)
+ return resp, nil

Option B (preferred): add vm_id to CreateVmResponse in the proto.

📝 Committable suggestion


Suggested change
- return connect.NewResponse(&metaldv1.CreateVmResponse{
-   VmId:         vmID,
-   HostPortPair: "0.0.0.0:9999",
-   State:        metaldv1.VmState_VM_STATE_CREATED,
- }), nil
+ // AIDEV-TODO: surface the actual endpoint from networking/backends once available
+ resp := connect.NewResponse(&metaldv1.CreateVmResponse{
+   State: metaldv1.VmState_VM_STATE_CREATED,
+ })
+ resp.Header().Set("x-unkey-vm-id", vmID)
+ return resp, nil
 }

Comment on lines 545 to 547
func (s *VMService) getBackendType() string {
// Try to determine backend type from the backend implementation
switch s.backend.(type) {
case interface{ GetProcessInfo() map[string]interface{} }:
return "firecracker"
default:
return "cloudhypervisor"
}
return "firecracker"
}

🛠️ Refactor suggestion

Hard-coded backend type breaks Docker-dev metrics; pass backendType in ctor

Return the actual backend type instead of constant “firecracker”.

Apply both diffs.

  1. Service:
 type VMService struct {
   backend          types.Backend
   ...
-  tracer           trace.Tracer
-  queries          database.Querier
+  tracer           trace.Tracer
+  queries          database.Querier
+  backendType      string
   metaldv1connect.UnimplementedVmServiceHandler
 }
 
-func NewVMService(backend types.Backend, logger *slog.Logger, metricsCollector *billing.MetricsCollector, vmMetrics *observability.VMMetrics, queries database.Querier) *VMService {
+func NewVMService(backend types.Backend, logger *slog.Logger, metricsCollector *billing.MetricsCollector, vmMetrics *observability.VMMetrics, queries database.Querier, backendType string) *VMService {
   tracer := otel.Tracer("metald.service.vm")
   return &VMService{
     backend:          backend,
     ...
-    tracer:           tracer,
+    tracer:           tracer,
+    backendType:      backendType,
   }
 }
 
 func (s *VMService) getBackendType() string {
-  return "firecracker"
+  return s.backendType
 }
  2. main.go (wire actual type):
- vmService := service.NewVMService(backend, logger, metricsCollector, vmMetrics, db.Queries)
+ vmService := service.NewVMService(backend, logger, metricsCollector, vmMetrics, db.Queries, string(cfg.Backend.Type))

Also applies to: 35-45

🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/vm.go around lines 545-547 (and also check
lines 35-45), the method getBackendType currently returns the hard-coded string
"firecracker"; change the VMService to store a backendType field set from the
constructor and have getBackendType return that field. Update the VMService
constructor signature to accept backendType (string) and assign it to the new
field, and then update main.go where VMService is instantiated to pass the
actual backend type value wired from configuration or the caller so Docker-dev
metrics receive the correct backend type.

@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

Comment on lines +4 to +6
After=network.target spire-agent.service assetmanagerd.service metald.service
Wants=network.target
Requires=spire-agent.service assetmanagerd.service metald.service

💡 Verification agent

🧩 Analysis chain

Avoid cascading failures; use Wants(+After) for metald and wait for network-online.

Billaged likely doesn’t need a hard dependency on metald. Requires=metald means billaged is stopped if metald stops. Prefer Wants=metald with ordering, and wait for network-online to avoid TLS/mTLS races on boot.

Apply:

-After=network.target spire-agent.service assetmanagerd.service metald.service
-Wants=network.target
-Requires=spire-agent.service assetmanagerd.service metald.service
+After=network-online.target spire-agent.service assetmanagerd.service metald.service
+Wants=network-online.target metald.service
+Requires=spire-agent.service assetmanagerd.service

🏁 Script executed:

#!/bin/bash
set -e

echo "Locating billaged.service file:"
rg -l 'billaged.service'

echo
echo "Printing first 20 lines of billaged.service:"
sed -n '1,20p' go/deploy/billaged/contrib/systemd/billaged.service

echo
echo "Searching for network-online.target usage in all .service files:"
rg -n 'network-online.target' -g '*.service'

Length of output: 2049


🏁 Script executed:

#!/bin/bash
set -e

echo "Locating assetmanagerd.service"
find go -type f -name 'assetmanagerd.service'

echo
echo "Printing first 20 lines of metald.service"
sed -n '1,20p' go/deploy/docker/systemd/metald.service

echo
echo "Printing first 20 lines of spire-agent.service"
sed -n '1,20p' go/deploy/spire/contrib/systemd/spire-agent.service

Length of output: 1733


go/deploy/billaged/contrib/systemd/billaged.service – use Wants=metald.service (+After) and network-online.target

-After=network.target spire-agent.service assetmanagerd.service metald.service
-Wants=network.target
-Requires=spire-agent.service assetmanagerd.service metald.service
+After=network-online.target spire-agent.service assetmanagerd.service metald.service
+Wants=network-online.target metald.service
+Requires=spire-agent.service assetmanagerd.service
📝 Committable suggestion


Suggested change
-After=network.target spire-agent.service assetmanagerd.service metald.service
-Wants=network.target
-Requires=spire-agent.service assetmanagerd.service metald.service
+After=network-online.target spire-agent.service assetmanagerd.service metald.service
+Wants=network-online.target metald.service
+Requires=spire-agent.service assetmanagerd.service
🤖 Prompt for AI Agents
In go/deploy/billaged/contrib/systemd/billaged.service around lines 4 to 6,
change the unit dependencies so metald is a soft dependency and the service
waits for the network to be online: replace After=network.target with
After=network-online.target spire-agent.service assetmanagerd.service
metald.service; add Wants=network-online.target metald.service (so metald is
pulled in but not a hard require); and remove metald.service from the Requires=
line leaving Requires=spire-agent.service assetmanagerd.service.

Environment=UNKEY_BUILDERD_SCRATCH_DIR=/opt/builderd/scratch
Environment=UNKEY_BUILDERD_ROOTFS_OUTPUT_DIR=/opt/builderd/rootfs
Environment=UNKEY_BUILDERD_WORKSPACE_DIR=/opt/builderd/workspace
Environment=UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR=false

🧹 Nitpick (assertive)

Feature flag default set to false — confirm rollout plan and configurability.

If staged rollout is intended, consider sourcing an EnvironmentFile to flip this per-host without editing the unit.

 Environment=UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR=false
+EnvironmentFile=-/etc/unkey/builderd.env
+# AIDEV-NOTE: Toggle UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR=true via /etc/unkey/builderd.env during staged rollout
📝 Committable suggestion


Suggested change
 Environment=UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR=false
+EnvironmentFile=-/etc/unkey/builderd.env
+# AIDEV-NOTE: Toggle UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR=true via /etc/unkey/builderd.env during staged rollout
🤖 Prompt for AI Agents
In go/deploy/builderd/contrib/systemd/builderd.service around line 38, the
feature flag is hard-coded to false which prevents per-host toggling and staged
rollout; update the unit to source an EnvironmentFile (e.g. add
EnvironmentFile=/etc/default/builderd and move
Environment=UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR to that file) so operators can
flip the flag per host without editing the unit, keep a safe default in the
EnvironmentFile or override in /etc/systemd/system/builderd.service.d/ for
controlled rollout, and add a short note in deploy docs describing the rollout
plan and where to configure the flag.

Comment on lines +17 to +31
func (s *PullImageStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
imageName := input.Config.GetSource().GetDockerImage().GetImageUri()

input.Logger.InfoContext(ctx, "pulling Docker image", slog.String("image", imageName))

err := s.executor.pullDockerImage(ctx, input.Logger, imageName)
if err != nil {
return StepOutput{Success: false, Error: err}, err
}

return StepOutput{
ImageName: imageName,
Success: true,
}, nil
}

⚠️ Potential issue

Validate image source before use.

Guard against nil Config/Source/DockerImage and empty URI to avoid panics.

Apply:

 func (s *PullImageStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
-	imageName := input.Config.GetSource().GetDockerImage().GetImageUri()
+	if input.Config == nil || input.Config.GetSource() == nil || input.Config.GetSource().GetDockerImage() == nil {
+		err := fmt.Errorf("docker image source is required")
+		return StepOutput{Success: false, Error: err}, err
+	}
+	imageName := input.Config.GetSource().GetDockerImage().GetImageUri()
+	if imageName == "" {
+		err := fmt.Errorf("docker image URI is empty")
+		return StepOutput{Success: false, Error: err}, err
+	}

Comment on lines +42 to +55
func (s *CreateContainerStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
input.Logger.InfoContext(ctx, "creating container", slog.String("image", input.ImageName))

containerID, err := s.executor.createContainer(ctx, input.Logger, input.ImageName)
if err != nil {
return StepOutput{Success: false, Error: err}, err
}

return StepOutput{
ImageName: input.ImageName,
ContainerID: containerID,
Success: true,
}, nil
}

🛠️ Refactor suggestion

Require image name before creating container.

If a previous step failed to set ImageName, this will fail later. Validate early.

Apply:

 func (s *CreateContainerStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
+	if input.ImageName == "" {
+		err := fmt.Errorf("image name is required to create container")
+		return StepOutput{Success: false, Error: err}, err
+	}
 	input.Logger.InfoContext(ctx, "creating container", slog.String("image", input.ImageName))
📝 Committable suggestion


Suggested change
 func (s *CreateContainerStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
+	if input.ImageName == "" {
+		err := fmt.Errorf("image name is required to create container")
+		return StepOutput{Success: false, Error: err}, err
+	}
 	input.Logger.InfoContext(ctx, "creating container", slog.String("image", input.ImageName))

 	containerID, err := s.executor.createContainer(ctx, input.Logger, input.ImageName)
 	if err != nil {
 		return StepOutput{Success: false, Error: err}, err
 	}

 	return StepOutput{
 		ImageName:   input.ImageName,
 		ContainerID: containerID,
 		Success:     true,
 	}, nil
 }
🤖 Prompt for AI Agents
In go/deploy/builderd/internal/executor/docker_steps.go around lines 42 to 55,
the Execute function does not validate that input.ImageName is set before
attempting to create a container; add an early check for a non-empty ImageName,
and if empty log an error with input.Logger (using ErrorContext) and return a
StepOutput with Success:false and a descriptive error (e.g., fmt.Errorf("missing
image name")) without calling s.executor.createContainer to avoid downstream
failures. Ensure the function returns the error and does not proceed when
ImageName is empty.

Comment on lines +91 to +99
func (s *ExtractFilesystemStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
input.Logger.InfoContext(ctx, "extracting container filesystem",
slog.String("container_id", input.ContainerID),
slog.String("rootfs_dir", input.RootfsDir))

err := s.executor.extractFilesystem(ctx, input.Logger, input.ContainerID, input.RootfsDir, input.Metadata)
if err != nil {
return StepOutput{Success: false, Error: err}, err
}

⚠️ Potential issue

Validate container ID and metadata before filesystem extraction.

Prevent ambiguous errors from the executor.

Apply:

 func (s *ExtractFilesystemStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
+	if input.ContainerID == "" {
+		err := fmt.Errorf("container ID is required to extract filesystem")
+		return StepOutput{Success: false, Error: err}, err
+	}
+	if input.Metadata == nil {
+		err := fmt.Errorf("image metadata is required to extract filesystem")
+		return StepOutput{Success: false, Error: err}, err
+	}
📝 Committable suggestion


Suggested change
 func (s *ExtractFilesystemStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
+	if input.ContainerID == "" {
+		err := fmt.Errorf("container ID is required to extract filesystem")
+		return StepOutput{Success: false, Error: err}, err
+	}
+	if input.Metadata == nil {
+		err := fmt.Errorf("image metadata is required to extract filesystem")
+		return StepOutput{Success: false, Error: err}, err
+	}
 	input.Logger.InfoContext(ctx, "extracting container filesystem",
 		slog.String("container_id", input.ContainerID),
 		slog.String("rootfs_dir", input.RootfsDir))

 	err := s.executor.extractFilesystem(ctx, input.Logger, input.ContainerID, input.RootfsDir, input.Metadata)
 	if err != nil {
 		return StepOutput{Success: false, Error: err}, err
 	}
🤖 Prompt for AI Agents
In go/deploy/builderd/internal/executor/docker_steps.go around lines 91 to 99,
the Execute method calls s.executor.extractFilesystem without validating
input.ContainerID and input.Metadata, which can cause ambiguous errors; add
precondition checks that return a clear StepOutput and error if ContainerID is
empty (or only whitespace) and if Metadata is nil or missing required fields,
logging a descriptive message via input.Logger before returning so the executor
is never invoked with invalid inputs.

Comment on lines +231 to +245
const returnIPJSON = `-- name: ReturnIPJSON :exec
UPDATE network_allocations
SET available_ips = json_insert(available_ips, '$[#]', ?)
WHERE deployment_id = ?
`

type ReturnIPJSONParams struct {
JsonInsert interface{} `db:"json_insert" json:"json_insert"`
DeploymentID string `db:"deployment_id" json:"deployment_id"`
}

func (q *Queries) ReturnIPJSON(ctx context.Context, arg ReturnIPJSONParams) error {
_, err := q.db.ExecContext(ctx, returnIPJSON, arg.JsonInsert, arg.DeploymentID)
return err
}

🛠️ Refactor suggestion

Tighten JSON insert typing

Use a concrete string and CAST to TEXT to avoid type ambiguity across drivers.

Apply this diff (reflect in queries.sql and regenerate):

-SET available_ips = json_insert(available_ips, '$[#]', ?)
+SET available_ips = json_insert(available_ips, '$[#]', CAST(? AS TEXT))
@@
-type ReturnIPJSONParams struct {
-	JsonInsert   interface{} `db:"json_insert" json:"json_insert"`
+type ReturnIPJSONParams struct {
+	JsonInsert   string `db:"json_insert" json:"json_insert"`
 	DeploymentID string      `db:"deployment_id" json:"deployment_id"`
 }
🤖 Prompt for AI Agents
In go/deploy/metald/internal/database/queries.sql.go around lines 231 to 245,
the JSON insert parameter is currently typed as interface{} and passed directly
to json_insert causing driver type ambiguity; change the SQL to CAST the
placeholder to TEXT (e.g. json_insert(available_ips, '$[#]', CAST(? AS TEXT)))
and update the ReturnIPJSONParams.JsonInsert field to string (and ensure the
queries.sql entry is updated accordingly), then regenerate the queries to
reflect the concrete string typing.

Comment on lines +12 to +19
if link, err := netlink.LinkByName(m.config.BridgeName); err == nil {
m.logger.Debug("bridge exists",
slog.String("bridge", m.config.BridgeName),
slog.String("type", link.Type()),
slog.String("state", link.Attrs().OperState.String()),
)
return nil
}

🛠️ Refactor suggestion

⚠️ Potential issue

Do not early-return on existing bridge; ensure address and admin-up state.

Current early return skips reconciling IP and link state. Make ensureBridge idempotent: when the bridge exists, still ensure it’s up and has the configured address.

Example approach (conceptual; see addr/Replace comment below):

  • Keep a reference to br whether created or found.
  • Always set link up.
  • Replace or ensure the address on br.
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/bridge.go around lines 12-19, the function
currently early-returns if the bridge exists which skips reconciling its IP
address and admin-up state; change it to keep a reference to the found link (br)
instead of returning, then always ensure the link is set UP (netlink.LinkSetUp)
and ensure/replace the configured IP address on that link (use addr.Replace or
the appropriate netlink addr add/replace flow) so the method is idempotent; log
actions and propagate errors from the LinkSetUp and address reconciliation steps
rather than returning immediately when the bridge exists.

Comment on lines +27 to +33
if err := netlink.LinkAdd(bridge); err != nil {
m.logger.Error("failed to create bridge",
slog.String("bridge", m.config.BridgeName),
slog.String("error", err.Error()),
)
return fmt.Errorf("failed to create bridge: %w", err)
}

🛠️ Refactor suggestion

Handle EEXIST race on LinkAdd.

If another process creates the bridge between the check and add, LinkAdd may return EEXIST. Treat as success.

-	if err := netlink.LinkAdd(bridge); err != nil {
+	if err := netlink.LinkAdd(bridge); err != nil {
+		// treat already-exists as success for idempotency
+		if !os.IsExist(err) {
 			m.logger.Error("failed to create bridge",
 				slog.String("bridge", m.config.BridgeName),
 				slog.String("error", err.Error()),
 			)
 			return fmt.Errorf("failed to create bridge: %w", err)
-	}
+		}
+	}

Add import:

 import (
 	"fmt"
 	"log/slog"
+	"os"
📝 Committable suggestion


Suggested change
 // In the import block, add "os":
 import (
 	"fmt"
 	"log/slog"
+	"os"
 )

 // ... later, around the call to netlink.LinkAdd:
 if err := netlink.LinkAdd(bridge); err != nil {
+	// treat already-exists as success for idempotency
+	if !os.IsExist(err) {
 		m.logger.Error("failed to create bridge",
 			slog.String("bridge", m.config.BridgeName),
 			slog.String("error", err.Error()),
 		)
 		return fmt.Errorf("failed to create bridge: %w", err)
+	}
 }
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/bridge.go around lines 27 to 33, the call
to netlink.LinkAdd returns an error if the bridge was created by another process
(EEXIST) and should be treated as success; update the error handling to check
errors.Is(err, syscall.EEXIST) (or os.IsExist-equivalent) and ignore that case
(do not return an error), while keeping logging for other errors, and add the
syscall import to the file.

Comment on lines +40 to +46
addr, err := netlink.ParseAddr(m.config.BaseNetwork.String())
if err != nil {
return fmt.Errorf("failed to parse bridge IP: %w", err)
}

if err := netlink.AddrAdd(br, addr); err != nil {
m.logger.Error("failed to add IP to bridge",

🛠️ Refactor suggestion

⚠️ Potential issue

Wrong address assigned: using the network address as the bridge IP.

Parsing BaseNetwork (e.g., 172.16.0.0/12) yields the network address, not a host IP. Assign a host address (e.g., first usable: 172.16.0.1/12) or take a dedicated BridgeCIDR from config/env.

Minimal fix options:

  • Prefer a new config field (BridgeCIDR string) and parse that here.
  • Or derive first host from BaseNetwork before calling ParseAddr.

I can provide a patch once the desired source of truth (env vs code) is confirmed.

🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/bridge.go around lines 40 to 46, the code
parses m.config.BaseNetwork which yields the network address (e.g.,
172.16.0.0/12) and assigns that as the bridge IP; instead, either parse a
dedicated BridgeCIDR config field or derive and use the first usable host
address from BaseNetwork (e.g., .1) before calling netlink.AddrAdd. Update the
config to include BridgeCIDR (preferred) and parse that here, or implement logic
to compute the first host IP from the parsed network prefix and construct an
address with the same mask, then pass that address to netlink.AddrAdd; ensure
validation and clear error messages on parse/failure.

Comment on lines +45 to +51
if err := netlink.AddrAdd(br, addr); err != nil {
m.logger.Error("failed to add IP to bridge",
slog.String("bridge", m.config.BridgeName),
slog.String("error", err.Error()),
)
return fmt.Errorf("failed to add IP to bridge: %w", err)
}

🛠️ Refactor suggestion

Prefer AddrReplace and ignore EEXIST for idempotency.

AddrAdd fails if address already present. Use AddrReplace so re-runs converge.

-	if err := netlink.AddrAdd(br, addr); err != nil {
+	if err := netlink.AddrReplace(br, addr); err != nil {
 		m.logger.Error("failed to add IP to bridge",
 			slog.String("bridge", m.config.BridgeName),
 			slog.String("error", err.Error()),
 		)
 		return fmt.Errorf("failed to add IP to bridge: %w", err)
 	}
📝 Committable suggestion


Suggested change
-	if err := netlink.AddrAdd(br, addr); err != nil {
+	if err := netlink.AddrReplace(br, addr); err != nil {
 		m.logger.Error("failed to add IP to bridge",
 			slog.String("bridge", m.config.BridgeName),
 			slog.String("error", err.Error()),
 		)
 		return fmt.Errorf("failed to add IP to bridge: %w", err)
 	}
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/bridge.go around lines 45-51, replace
netlink.AddrAdd with netlink.AddrReplace to make adding the IP idempotent; call
netlink.AddrReplace(br, addr) and update the error handling to treat EEXIST as
non-fatal (i.e., if errors.Is(err, syscall.EEXIST) return nil), otherwise log
and return the wrapped error as before; ensure syscall is imported and the log
still includes bridge name and error details.
