feat(deploy): redesign metald network architecture #3818
Conversation
📝 Walkthrough

Removes multiple repository docs; adjusts Makefiles and systemd units; adds a Makefile target for proto generation; modifies DeployWorkflow to insert a build record; introduces builderd asset init with retry, a step-based Docker pipeline executor, shutdown tests, and metrics changes; overhauls metald: client/config/service, database (sqlc), networking model, Firecracker backend, Docker backend, health tests; adds a cleanup script.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Operator
    participant Builderd as builderd main
    participant Assets as BaseAssetManager
    participant Metrics as BuildMetrics
    Operator->>Builderd: start
    Builderd->>Assets: NewBaseAssetManager().WithMetrics(Metrics)
    loop retry until success/cancel
        Builderd->>Assets: InitializeBaseAssetsWithRetry(ctx)
        alt retryable error
            Assets-->>Metrics: RecordBaseAssetInitRetry(attempt, reason)
        end
    end
    alt failed after retries
        Assets-->>Metrics: RecordBaseAssetInitFailure(total_attempts, error)
    end
    Builderd-->>Operator: ready
```
```mermaid
sequenceDiagram
    autonumber
    actor API
    participant ExecReg as ExecutorRegistry
    participant DPExec as DockerPipelineExecutor
    participant Pipe as BuildPipeline
    participant Steps as StepExecutors
    API->>ExecReg: select executor (docker)
    alt UsePipelineExecutor=true
        ExecReg-->>API: DockerPipelineExecutor
        API->>DPExec: Execute(request)
        DPExec->>Pipe: Execute(StepInput)
        Pipe->>Steps: PullImage → CreateContainer → ExtractMetadata → ExtractFilesystem → OptimizeRootfs → Cleanup
        Steps-->>Pipe: StepOutput propagation
        Pipe-->>DPExec: BuildResult(rootfs, metadata)
        DPExec-->>API: BuildResult
    else
        ExecReg-->>API: DockerExecutor (monolithic)
    end
```
```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant Service as metald VMService
    participant DB as database.Querier
    participant Backend as Backend(Firecracker/Docker)
    Client->>Service: CreateVm(request)
    Service->>DB: AllocateNetwork()
    DB-->>Service: Network allocation
    Service->>Backend: CreateVM(config)
    Backend-->>Service: vm_id
    Service-->>Client: CreateVmResponse(state=CREATED, host_port_pair)
```
```mermaid
sequenceDiagram
    autonumber
    participant Ctrl as ctrl DeployWorkflow
    participant DB as db.Queries
    Ctrl->>Ctrl: log-deployment-pending
    Ctrl->>DB: InsertBuild(build_id, workspace_id, project_id, deployment_id, created_at)
    alt error
        DB-->>Ctrl: error
        Ctrl-->>Ctrl: abort
    else
        DB-->>Ctrl: ok
        Ctrl->>Ctrl: proceed to build version
    end
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~180 minutes
Force-pushed from df3a129 to 4bf86d8 (fix(metald): stuff · fix(metald): clean up network package · feat: overhaul metald)

broke.. like my spirit
lgtm except this file — can you revert it, or use the one that's in main? We need the partition db and we don't need the insert-build step.

I'd be surprised if this actually builds successfully right now
happy to merge this as is after that

ah that's weird.. my LSP didn't see that as an issue. on it.

🤔 Hm, I thought the side I chose was main, but I think I chose the wrong side, oops. Well, that makes me worried for the rest of the rebase action I had yesterday, but the changes shouldn't impact anything in production, so I think we just fix as necessary and do better in the future lol.

can you look at this and resolve the convo if it's correct @chronark
Thank you for following the naming conventions for pull request titles! 🙏

lmao i forgot to commit the changes i made while we were on the hangout

works for me

you sure about those uints? you had pretty strong opinions about keeping it as int earlier this week 😛

ok cool. if CI is green, I'll merge

i'm one of those "i'll address it in the next iteration" kind of people... the rabbit just tries to hold me back!!!

I should've led with a sarcasm warning

biggest yeet this side of 2025
```diff
 google.golang.org/genproto/googleapis/rpc v0.0.0-20250707201910-8d1bb00bc6a7 // indirect
 google.golang.org/grpc v1.73.0 // indirect
-google.golang.org/protobuf v1.36.6 // indirect
+google.golang.org/protobuf v1.36.8 // indirect
```
💡 Verification agent
🧩 Analysis chain
Proto bump looks fine; verify repo-wide alignment and regen if needed.
Confirm all modules now resolve google.golang.org/protobuf v1.36.8 and regenerate any generated protobuf code if you vendor stubs or pin protoc-gen-go. Run go mod tidy afterwards.
🏁 Script executed:

```shell
#!/bin/bash
# Check protobuf version across all modules
fd go.mod | xargs -I{} bash -lc 'echo "== {} =="; rg -n "google.golang.org/protobuf" {} || true'
# Inspect effective deps in this module
( cd go/deploy/assetmanagerd && go list -m all | rg "^google.golang.org/(grpc|protobuf|genproto)" )
# Optional: locate protoc-gen-go usage to decide if regen is needed
rg -nP "(protoc|buf).*protoc-gen-go" -g '!**/node_modules/**' -C2
```

Length of output: 3027
Align google.golang.org/protobuf to v1.36.8 repository-wide
Bump protobuf to v1.36.8 in the following go.mod files (currently on older versions):
- apps/agent/go.mod (v1.36.5)
- apps/chproxy/go.mod (v1.36.6)
- go/deploy/billaged/go.mod (v1.36.6)
- go/deploy/metald/go.mod (v1.36.6)
- go/deploy/pkg/observability/interceptors/go.mod (v1.36.6)
- go/deploy/pkg/spiffe/go.mod (v1.36.5)
- go/deploy/pkg/telemetry/go.mod (v1.36.6)
- go/deploy/pkg/tls/go.mod (v1.36.5)
After aligning versions, regenerate any vendored protobuf stubs or run your protoc-gen-go pipeline (none found via rg search) and run go mod tidy in each module.
```diff
 # Targets (alphabetically ordered)
-.PHONY: all build check clean create-user deps dev env-example fmt generate help install lint lint-proto run service-logs service-logs-tail service-restart service-start service-status service-stop setup test test-coverage uninstall version vet
+.PHONY: all build check clean deps dev env-example fmt generate help install lint lint-proto run service-logs service-logs-tail service-restart service-start service-status service-stop setup test test-coverage uninstall version vet
```
PHONY lists non-existent targets (generate, env-example)
They were removed but still declared; this breaks expectations and confuses help.
```diff
-.PHONY: all build check clean deps dev env-example fmt generate help install lint lint-proto run service-logs service-logs-tail service-restart service-start service-status service-stop setup test test-coverage uninstall version vet
+.PHONY: all build check clean deps dev fmt help install lint lint-proto run service-logs service-logs-tail service-restart service-start service-status service-stop setup test test-coverage uninstall version vet
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In go/deploy/assetmanagerd/Makefile around line 39, the .PHONY declaration still
lists targets that no longer exist (generate, env-example); remove those entries
from the .PHONY line or restore the corresponding targets so the declaration
matches actual Makefile targets, and then run make help to verify output and
update any documentation referencing them.
```diff
 .PHONY: all build check clean deps dev env-example fmt generate help install lint lint-proto run service-logs service-logs-tail service-restart service-start service-status service-stop setup test test-coverage uninstall version vet

-all: clean generate build ## Clean, generate, and build
+all: clean build ## Clean, generate, and build
```
🧹 Nitpick (assertive)
Update help text for all target
It no longer generates; adjust description.
```diff
-all: clean build ## Clean, generate, and build
+all: clean build ## Clean and build
```

🤖 Prompt for AI Agents
In go/deploy/assetmanagerd/Makefile around line 41, the `all` target help text
still says "Clean, generate, and build" but the target no longer performs
generation; update the description to accurately reflect behavior (e.g., "Clean
and build" or similar). Edit the comment after the target to replace "Clean,
generate, and build" with the corrected short description and ensure formatting
matches the existing help comment style.
```makefile
dev: generate ## Run in development mode
	@go run ./cmd/$(SERVICE_NAME)
```
Broken deps: dev/setup depend on missing “generate”
These targets will fail with “No rule to make target ‘generate’”.
```diff
-dev: generate ## Run in development mode
+dev: ## Run in development mode
 	@go run ./cmd/$(SERVICE_NAME)
@@
-setup: deps generate ## Complete development setup
+setup: deps ## Complete development setup
```

Also applies to: 111-112
🤖 Prompt for AI Agents
go/deploy/assetmanagerd/Makefile around lines 57-58 (and also 111-112): the dev
and dev/setup targets declare a dependency on a non-existent "generate" target
which causes "No rule to make target 'generate'"; add a concrete "generate"
target before those lines that runs the project’s code-generation steps (for
example run go generate ./... and any protoc/Swagger/gen commands your repo
requires) or, if generation is not needed for development, remove the "generate"
dependency from the dev and dev/setup targets; ensure the new target is marked
.PHONY and invokes the exact generation commands used elsewhere in the repo so
the make targets succeed.
```go
assetv1 "github.com/unkeyed/unkey/go/gen/proto/deploy/assetmanagerd/v1"
"github.com/unkeyed/unkey/go/gen/proto/deploy/assetmanagerd/v1/assetmanagerdv1connect"
```
💡 Verification agent
🧩 Analysis chain
Import reorder is a no-op — OK
No behavioral change. Proceed.
Given significant builderd changes in this PR, remember the learning to bump the patch-level version for the service. If applicable here, please confirm the version bump.
🏁 Script executed:

```shell
#!/bin/bash
# Search for builderd version declarations/bakes to confirm a patch bump happened.
rg -nP -C2 '(?i)\b(version|appVersion|buildVersion)\b' go/deploy/builderd | sed -n '1,200p'
```

Length of output: 9672
Bump all version references to 0.5.1
Makefile is at 0.5.1, but README.md still shows 0.1.0 (go/deploy/builderd/README.md:278) and the CLI help default OTEL version shows “default: 0.1.0” (go/deploy/builderd/cmd/builderd/main.go:478). Update both to 0.5.1.
🤖 Prompt for AI Agents
In go/deploy/builderd/internal/assetmanager/client.go around lines 14-15 and
also update go/deploy/builderd/README.md (around line ~278) and
go/deploy/builderd/cmd/builderd/main.go (around line ~478), bump all version
references from 0.1.0 to 0.5.1; specifically replace the README’s displayed
version and the CLI help default OTEL version string with "0.5.1" and ensure any
other occurrences in these files or nearby lines are updated to match 0.5.1 so
Makefile, docs, and CLI defaults are consistent.
```diff
 // TODO: In production, implement proper tenant-user relationship checks
-// This should query a tenant membership service or database
+// This should query a tenant membership service or database using ctx for timeouts/tracing
 _ = ctx // Will be used for database queries in production implementation
```
🧹 Nitpick (assertive)
Upgrade TODO to AIDEV-TODO per repo guidelines and emphasize context usage.
- // TODO: In production, implement proper tenant-user relationship checks
- // This should query a tenant membership service or database using ctx for timeouts/tracing
+ // AIDEV-TODO: In production, implement proper tenant-user relationship checks.
+ // Use ctx for deadlines/tracing; query an authz service or DB for membership/roles.
 _ = ctx // Will be used for database queries in production implementation
🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/auth.go around lines 198 to 201, replace
the generic TODO comment and the placeholder "_ = ctx" with an AIDEV-TODO per
repo guidelines and update the note to explicitly require using the passed ctx
for timeouts, cancellation and tracing when querying the tenant membership
service or DB; in practice change the comment to start with "AIDEV-TODO:" and
state that the implementation must query a tenant membership store using ctx
(context-aware DB calls, propagate tracing/span IDs, respect
deadlines/cancellations) so future implementers know to use ctx rather than
ignore it.
```go
if requestedTenantID == "restricted-tenant" && customerCtx.UserID != "admin-user" {
	return fmt.Errorf("access denied: user %s cannot access restricted tenant", customerCtx.UserID)
}
```
🛠️ Refactor suggestion
Gate the demo “restricted-tenant” rule to dev mode.
Prevents accidental prod behavior differences.
```diff
-	if requestedTenantID == "restricted-tenant" && customerCtx.UserID != "admin-user" {
+	if os.Getenv("UNKEY_METALD_AUTH_MODE") == "dev" &&
+		requestedTenantID == "restricted-tenant" && customerCtx.UserID != "admin-user" {
 		return fmt.Errorf("access denied: user %s cannot access restricted tenant", customerCtx.UserID)
 	}
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/auth.go around lines 209-211, the hardcoded
rule allowing only admin-user to access "restricted-tenant" must be limited to
development mode; modify the conditional to check a dev-mode flag (preferably an
injected config like cfg.IsDev or, if not available, os.Getenv("APP_ENV") ==
"development") in addition to requestedTenantID == "restricted-tenant" and
customerCtx.UserID != "admin-user", so the access-deny logic only runs when
running in dev mode.
```diff
 GOOS ?= $(shell go env GOOS)
 GOARCH ?= $(shell go env GOARCH)
-LDFLAGS := -ldflags "-s -w -X main.version=$(VERSION)"
+LDFLAGS := -ldflags "-s=false -w=false -X main.version=$(VERSION)"
```
Debug flags inappropriately set in production build.
The LDFLAGS have been modified to disable stripping (-s=false -w=false), which will significantly increase binary size and potentially expose sensitive debug information in production builds. These flags should only be used in debug builds.
Apply this diff to fix the production build flags:

```diff
-LDFLAGS := -ldflags "-s=false -w=false -X main.version=$(VERSION)"
+LDFLAGS := -ldflags "-s -w -X main.version=$(VERSION)"
```

For debug builds, you should use separate debug flags:

```diff
+DEBUG_LDFLAGS := -ldflags "-X main.version=$(VERSION)"
```

🤖 Prompt for AI Agents
In go/deploy/metald/Makefile around line 11, LDFLAGS currently disable symbol
and DWARF stripping (-s=false -w=false); change production LDFLAGS to enable
stripping (-s -w) while preserving the version injection, and implement a
separate debug path (e.g., a DEBUG flag or distinct DEBUG_LDFLAGS) that sets
unstripped flags for local/dev builds; ensure the default Makefile target uses
the stripped production LDFLAGS and only uses the debug flags when an explicit
DEBUG=true (or similar) is provided.
```diff
 Metald runs as root to manage network namespaces, interfaces, and iptables operations. This is acceptable as metald is designed to be the sole application on dedicated VM hosts. The integrated jailer still drops privileges to specified UID/GID for individual VM processes, ensuring proper isolation.

-The `make install` command configures these automatically.
+The `make install` command configures the service with appropriate permissions automatically.
```
🧹 Nitpick (assertive)
Fix double-spacing and add an AIDEV note on privileged operation.
Two grammar nits and add an anchor reminding maintainers to revisit hardening.
```diff
-Metald runs as root to manage network namespaces, interfaces, and iptables operations. This is acceptable as metald is designed to be the sole application on dedicated VM hosts. The integrated jailer still drops privileges to specified UID/GID for individual VM processes, ensuring proper isolation.
+Metald runs as root to manage network namespaces, interfaces, and iptables operations. This is acceptable as metald is designed to be the sole application on dedicated VM hosts. The integrated jailer still drops privileges to specified UID/GID for individual VM processes, ensuring proper isolation.
+
+<!-- AIDEV-NOTE: Running as root -->
+When feasible, apply systemd hardening (ProtectSystem, ProtectControlGroups, PrivateUsers, CapabilityBoundingSet) and network namespace isolation to reduce blast radius.
-The `make install` command configures the service with appropriate permissions automatically.
+The `make install` command configures the service with appropriate permissions automatically.
```
🧰 Tools
🪛 LanguageTool
[grammar] ~80-~80: Use correct spacing
Context: ...VM processes, ensuring proper isolation. The make install command configures th...
(QB_NEW_EN_OTHER_ERROR_IDS_5)
[grammar] ~82-~82: Use correct spacing
Context: ...h appropriate permissions automatically. ## Contributing See [Development Setup](./...
(QB_NEW_EN_OTHER_ERROR_IDS_5)
🤖 Prompt for AI Agents
In go/deploy/metald/README.md around lines 80 to 83, fix the double-spacing in
the paragraph about metald running as root (remove any extra spaces between
sentences) and append a short AIDEV note/anchor reminding maintainers to revisit
hardening of privileged operations (e.g., add a single-line anchor or NOTE
(AIDEV): comment stating "revisit privileged operation hardening and service
isolation" and linkable anchor name). Ensure the sentence flow remains
grammatical and the new AIDEV note is concise and clearly identifiable for
future reference.
```makefile
generate-builder:
	buf generate --path proto/deploy/builderd
```
🧹 Nitpick (assertive)
Add generate-builder target — add PHONY and tools dependency
Looks good. Make it phony and depend on tools so buf is present.
Apply:
```diff
-.PHONY: install tools fmt test-unit test-integration test-integration-long test-stress test build generate pull up clean
+.PHONY: install tools fmt test-unit test-integration test-integration-long test-stress test build generate generate-builder pull up clean
@@
-generate-builder:
-	buf generate --path proto/deploy/builderd
+generate-builder: tools
+	buf generate --path proto/deploy/builderd
```

Note: Per prior learning, running buf from go/ with out: gen is correct.
📝 Committable suggestion
🤖 Prompt for AI Agents
In go/Makefile around lines 55 to 57, the generate-builder target should be
declared phony and made to depend on the tools target so buf is available;
update the Makefile to add generate-builder to the .PHONY list and change the
target to depend on tools (e.g., "generate-builder: tools") while keeping the
existing command "buf generate --path proto/deploy/builderd".
```go
"io"
"net/http"
```
Remove unused imports (io, net/http) — currently cause build failure.
These are only referenced in a large block comment. Go rejects unused imports.
Apply:

```diff
-	"io"
-	"net/http"
```
🤖 Prompt for AI Agents
In go/apps/ctrl/services/deployment/deploy_workflow.go around lines 7-8, the
imports "io" and "net/http" are unused (only mentioned in a block comment) and
cause a build failure; remove "io" and "net/http" from the import block (or only
re-add them when actual code uses them) so the file compiles cleanly.
```go
// Step 3: Insert build into database
err = hydra.StepVoid(ctx, "insert-build", func(stepCtx context.Context) error {
	w.logger.Info("inserting build into database", "build_id", buildID)
	insertErr := db.Query.InsertBuild(stepCtx, w.db.RW(), db.InsertBuildParams{
		ID:           buildID,
		WorkspaceID:  req.WorkspaceID,
		ProjectID:    req.ProjectID,
		DeploymentID: req.DeploymentID,
		CreatedAt:    time.Now().UnixMilli(),
	})
	if insertErr != nil {
		return fmt.Errorf("failed to create build record: %w", insertErr)
	}
	w.logger.Info("build record created successfully", "build_id", buildID)
	return nil
})
if err != nil {
	w.logger.Error("failed to insert build", "error", err, "build_id", buildID)
	return err
}
```
buildID is undefined — initialize before InsertBuild.
Compilation fails without a value.
Apply:

```diff
 	}

-	// Step 3: Insert build into database
+	// Generate build ID
+	buildID := uid.New("build")
+
+	// Step 3: Insert build into database
 	err = hydra.StepVoid(ctx, "insert-build", func(stepCtx context.Context) error {
```
🤖 Prompt for AI Agents
In go/apps/ctrl/services/deployment/deploy_workflow.go around lines 90 to 109,
buildID is used but not defined; initialize a buildID before the InsertBuild
step (for example: buildID := uuid.NewString() or via your project's ID
generator) and use that variable in the db.InsertBuild call and subsequent logs,
and add the necessary import for the ID generator you choose.
```ini
After=network.target spire-agent.service
Wants=network.target
Requires=spire-agent.service
```
💡 Verification agent
🧩 Analysis chain
Good: spire-agent ordering and dependency; suggest tighter coupling and graceful stop
- Consider BindsTo for tighter lifetime coupling with spire-agent.
- Add TimeoutStopSec for graceful shutdowns.

```diff
 After=network.target spire-agent.service
 Wants=network.target
-Requires=spire-agent.service
+Requires=spire-agent.service
+BindsTo=spire-agent.service
```

Additionally, consider:

```ini
# outside this hunk
TimeoutStopSec=30s
```

Also ensure the referenced local storage path exists at boot:

```ini
# outside this hunk
ExecStartPre=+/usr/bin/mkdir -p /opt/builderd/rootfs
```

🏁 Script executed:

```shell
#!/bin/bash
# Verify builderd path reference and pre-create directive presence
rg -n 'UNKEY_ASSETMANAGERD_LOCAL_STORAGE_PATH|/opt/builderd/rootfs' go/deploy/assetmanagerd/contrib/systemd -C2
```

Length of output: 730
Add tighter service coupling and graceful shutdown steps

- Under the existing `Requires=spire-agent.service`, add `BindsTo=spire-agent.service`.
- Outside that hunk, add for graceful stop: `TimeoutStopSec=30s`
- And pre-create the local storage directory at boot: `ExecStartPre=+/usr/bin/mkdir -p /opt/builderd/rootfs`
🤖 Prompt for AI Agents
In go/deploy/assetmanagerd/contrib/systemd/assetmanagerd.service around lines 4
to 7, add tighter coupling and graceful shutdown/start steps: add a
Wants=spire-agent.service line beneath the existing
Requires=spire-agent.service, and add TimeoutStopSec=30s (to the Service
section) and ExecStartPre=+/usr/bin/mkdir -p /opt/builderd/rootfs (to the
Service section before ExecStart) so the unit pre-creates the local storage
directory and allows a 30s graceful stop.
```go
atomic.StoreInt64(&shutdownStarted, 0)
```
🛠️ Refactor suggestion
Don’t reset the shared shutdown flag inside each goroutine.
This defeats the purpose of racing multiple shutdown attempts against the same guard.
```diff
-	// Reset shutdown state before each attempt
-	atomic.StoreInt64(&shutdownStarted, 0)
+	// Intentionally share shutdownStarted across goroutines to exercise the guard.
```
🤖 Prompt for AI Agents
In go/deploy/builderd/cmd/builderd/shutdown_test.go around lines 59-60, do not
reset the shared shutdown flag inside each goroutine (remove the
atomic.StoreInt64(&shutdownStarted, 0) call from within the goroutine); instead
initialize or reset the shared atomic shutdownStarted once before spawning the
concurrent goroutines so all goroutines race against the same guard, ensuring
the test exercises multiple concurrent shutdown attempts correctly.
```go
// Start servers in background
go func() {
	if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		t.Logf("HTTP server error: %v", err)
	}
}()

go func() {
	if err := promServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		t.Logf("Prometheus server error: %v", err)
	}
}()

// Give servers time to start
time.Sleep(100 * time.Millisecond)
```
🧹 Nitpick (assertive)
Optional: use Server.Serve with a listener in TestShutdownSequence for symmetry.
This mirrors the timeout test, avoids spurious log noise on shutdown, and makes startup deterministic.
If flaky behavior emerges, switch to Serve on an explicit net.Listener and wait until it’s bound before proceeding.
🤖 Prompt for AI Agents
In go/deploy/builderd/cmd/builderd/shutdown_test.go around lines 131 to 146,
replace the goroutines calling server.ListenAndServe and
promServer.ListenAndServe with starting each server via server.Serve(listener)
and promServer.Serve(promListener) where the listeners are created with
net.Listen; create the listeners before the goroutines, wait until each listener
is bound (e.g., immediately after net.Listen returns) to make startup
deterministic, and use the Serve form to avoid spurious "use of closed network
connection" logs during shutdown and to mirror the timeout test’s symmetry.
| // Config holds network configuration | ||
| type Config struct { | ||
| BaseNetwork *net.IPNet | ||
| BridgeName string | ||
| DNSServers []string // Default: ["8.8.8.8", "8.8.4.4"] | ||
| EnableIPv6 bool | ||
| EnableRateLimit bool | ||
| RateLimitMbps int // Per VM rate limit in Mbps | ||
| } |
🧹 Nitpick (assertive)
Document field constraints in Config.
Add comments for invariants (non-empty BridgeName, non-nil BaseNetwork, RateLimitMbps > 0 when enabled) to guide callers.
type Config struct {
- BaseNetwork *net.IPNet
- BridgeName string
- DNSServers []string // Default: ["8.8.8.8", "8.8.4.4"]
- EnableIPv6 bool
- EnableRateLimit bool
- RateLimitMbps int // Per VM rate limit in Mbps
+ BaseNetwork *net.IPNet // required
+ BridgeName string // required, linux bridge device name (e.g., "br0")
+ DNSServers []string // optional; IP literals. Default: ["8.8.8.8", "8.8.4.4"]
+ EnableIPv6 bool
+ EnableRateLimit bool
+ RateLimitMbps int // required if EnableRateLimit; per-VM Mbps
}

Add an AIDEV-NOTE near NewManager stating the validation policy.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Config holds network configuration | |
| type Config struct { | |
| BaseNetwork *net.IPNet | |
| BridgeName string | |
| DNSServers []string // Default: ["8.8.8.8", "8.8.4.4"] | |
| EnableIPv6 bool | |
| EnableRateLimit bool | |
| RateLimitMbps int // Per VM rate limit in Mbps | |
| } | |
| // Config holds network configuration | |
| type Config struct { | |
| BaseNetwork *net.IPNet // required | |
| BridgeName string // required, linux bridge device name (e.g., "br0") | |
| DNSServers []string // optional; IP literals. Default: ["8.8.8.8", "8.8.4.4"] | |
| EnableIPv6 bool | |
| EnableRateLimit bool | |
| RateLimitMbps int // required if EnableRateLimit; per-VM Mbps | |
| } |
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/manager.go around lines 30 to 38, the
Config struct lacks comments documenting field invariants; add inline comments
for BridgeName (must be non-empty), BaseNetwork (must be non-nil), and
RateLimitMbps (must be >0 when EnableRateLimit is true) and clarify default
DNSServers; also add an AIDEV-NOTE comment immediately above NewManager
describing the validation policy (which fields are validated by NewManager vs.
callers' responsibility and whether NewManager returns an error on invalid
config) so callers know expectations and invariants.
| type Manager struct { | ||
| logger *slog.Logger | ||
| config *Config | ||
| mu sync.RWMutex | ||
| bridgeMu sync.RWMutex | ||
| } |
🧹 Nitpick (assertive)
Add GoDoc for exported Manager and document locking.
Public types must be documented; also record which mutex protects which fields.
- type Manager struct {
+// Manager orchestrates bridge/tap network lifecycle.
+// AIDEV-NOTE: mu guards Manager-wide state; bridgeMu guards bridge creation/configuration.
+type Manager struct {
logger *slog.Logger
config *Config
mu sync.RWMutex
bridgeMu sync.RWMutex
}

🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/manager.go around lines 40 to 45, the
exported Manager type lacks GoDoc and documentation of its locking strategy; add
a GoDoc comment immediately above the type declaration that briefly describes
Manager's purpose and explicitly documents which mutex protects which fields
(e.g., mu guards config and other mutable state, bridgeMu guards bridge-related
fields), and mention that mu is an RWMutex for concurrent read access and
bridgeMu is a separate RWMutex for bridge-specific state and why both are
needed.
| package network | ||
|
|
||
| // setupVMNetworking creates and configures TAP and veth devices for a VM | ||
| func (m *Manager) setupTunTap(ip string) error { | ||
| // // tap := &netlink.Tuntap{ | ||
| // // LinkAttrs: netlink.LinkAttrs{ | ||
| // // Name: deviceNames.TAP, | ||
| // // }, | ||
| // // Mode: netlink.TUNTAP_MODE_TAP, | ||
| // // } | ||
|
|
||
| // // if err := netlink.LinkAdd(tap); err != nil { | ||
| // // return fmt.Errorf("failed to create TAP device %s: %w", deviceNames.TAP, err) | ||
| // // } | ||
|
|
||
| // // tapLink, err := netlink.LinkByName(deviceNames.TAP) | ||
| // // if err != nil { | ||
| // // return fmt.Errorf("failed to get TAP device: %w", err) | ||
| // // } | ||
|
|
||
| // // Set TAP device up | ||
| // if err := netlink.LinkSetUp(tap); err != nil { | ||
| // return fmt.Errorf("failed to bring TAP device up: %w", err) | ||
| // } | ||
|
|
||
| // m.logger.Info("TAP device created and up", slog.String("tap", tapLink.Attrs().Name)) | ||
|
|
||
| return nil | ||
| } |
Stub silently returns success; add linux build tag and fail fast until implemented.
Silent no-op can mislead callers. Gate to linux, fix doc, add TODO, and return NotImplemented.
-package network
+//go:build linux
+
+package network
+
+import "fmt"
-// setupVMNetworking creates and configures TAP and veth devices for a VM
+// setupTunTap creates and configures TAP and veth devices for a VM.
+// AIDEV-TODO: Implement using netlink (requires CAP_NET_ADMIN). Ensure bridge attach, MTU, and ownership.
func (m *Manager) setupTunTap(ip string) error {
@@
- return nil
+ // AIDEV-TODO: temporary NotImplemented to avoid silent success paths.
+ return fmt.Errorf("setupTunTap: not implemented for ip=%s", ip)
}

📝 Committable suggestion
| package network | |
| // setupVMNetworking creates and configures TAP and veth devices for a VM | |
| func (m *Manager) setupTunTap(ip string) error { | |
| // // tap := &netlink.Tuntap{ | |
| // // LinkAttrs: netlink.LinkAttrs{ | |
| // // Name: deviceNames.TAP, | |
| // // }, | |
| // // Mode: netlink.TUNTAP_MODE_TAP, | |
| // // } | |
| // // if err := netlink.LinkAdd(tap); err != nil { | |
| // // return fmt.Errorf("failed to create TAP device %s: %w", deviceNames.TAP, err) | |
| // // } | |
| // // tapLink, err := netlink.LinkByName(deviceNames.TAP) | |
| // // if err != nil { | |
| // // return fmt.Errorf("failed to get TAP device: %w", err) | |
| // // } | |
| // // Set TAP device up | |
| // if err := netlink.LinkSetUp(tap); err != nil { | |
| // return fmt.Errorf("failed to bring TAP device up: %w", err) | |
| // } | |
| // m.logger.Info("TAP device created and up", slog.String("tap", tapLink.Attrs().Name)) | |
| return nil | |
| } | |
| //go:build linux | |
| package network | |
| import "fmt" | |
| // setupTunTap creates and configures TAP and veth devices for a VM. | |
| // AIDEV-TODO: Implement using netlink (requires CAP_NET_ADMIN). Ensure bridge attach, MTU, and ownership. | |
| func (m *Manager) setupTunTap(ip string) error { | |
| // // tap := &netlink.Tuntap{ | |
| // // LinkAttrs: netlink.LinkAttrs{ | |
| // // Name: deviceNames.TAP, | |
| // // }, | |
| // // Mode: netlink.TUNTAP_MODE_TAP, | |
| // // } | |
| // // if err := netlink.LinkAdd(tap); err != nil { | |
| // // return fmt.Errorf("failed to create TAP device %s: %w", deviceNames.TAP, err) | |
| // // } | |
| // // tapLink, err := netlink.LinkByName(deviceNames.TAP) | |
| // // if err != nil { | |
| // // return fmt.Errorf("failed to get TAP device: %w", err) | |
| // // } | |
| // // Set TAP device up | |
| // if err := netlink.LinkSetUp(tap); err != nil { | |
| // return fmt.Errorf("failed to bring TAP device up: %w", err) | |
| // } | |
| // m.logger.Info("TAP device created and up", slog.String("tap", tapLink.Attrs().Name)) | |
| // AIDEV-TODO: temporary NotImplemented to avoid silent success paths. | |
| return fmt.Errorf("setupTunTap: not implemented for ip=%s", ip) | |
| } |
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/tuntap.go lines 1-29: add a linux-only
build tag at the top of the file (//go:build linux and the legacy +build line)
so this stub is only compiled on Linux, update the function comment to note it’s
unimplemented and include a TODO, and replace the silent no-op with an immediate
failing return (e.g. return fmt.Errorf("setupTunTap: not implemented")) so
callers fail fast; also add the required fmt import if missing.
| build: deps ## Build the binary | ||
| @mkdir -p $(BUILD_DIR) | ||
| @go build $(LDFLAGS) -o $(BUILD_DIR)/$(BINARY_NAME) ./cmd/metald | ||
| go build $(LDFLAGS) -gcflags="all=-N -l" -o $(BUILD_DIR)/$(BINARY_NAME) ./cmd/metald |
Debug flags included in regular build target.
The build target now includes debug flags (-gcflags="all=-N -l") which disable optimizations and inlining. This will significantly impact performance and should only be used in debug builds.
Apply this diff to separate production and debug builds:
build: deps ## Build the binary
@mkdir -p $(BUILD_DIR)
- go build $(LDFLAGS) -gcflags="all=-N -l" -o $(BUILD_DIR)/$(BINARY_NAME) ./cmd/metald
+ go build $(LDFLAGS) -o $(BUILD_DIR)/$(BINARY_NAME) ./cmd/metald
+
+debug-build: deps ## Build the binary with debug symbols
+ @mkdir -p $(BUILD_DIR)
+ go build $(DEBUG_LDFLAGS) -gcflags="all=-N -l" -o $(BUILD_DIR)/$(BINARY_NAME)-debug ./cmd/metald
|
||
| test: ## Run all tests | ||
| @go test ./... -v | ||
| @go test -json -failfast -v ./... | tparse |
💡 Verification agent
🧩 Analysis chain
Test runner changed to use tparse for better output formatting.
The change from go test ./... -v to go test -json -failfast -v ./... | tparse provides better test output formatting. However, ensure that tparse is installed as a dependency.
Run the following script to verify tparse availability:
🏁 Script executed:
#!/bin/bash
# Check if tparse is available in go.mod or as a tool dependency
echo "Checking for tparse in go.mod dependencies..."
rg -n "tparse" go/deploy/metald/go.mod
echo "Checking for tparse installation in Makefile or scripts..."
rg -n "go install.*tparse" go/deploy/metald/
echo "Checking if tparse is mentioned in documentation..."
fd README -t f | xargs rg -l "tparse"

Length of output: 464
Add tparse installation in Makefile and track as a Go tool dependency
No references to tparse were found in go.mod, scripts, or docs, so invoking go test … | tparse will break in a clean environment.
- In go/deploy/metald/Makefile, add a target or prerequisite to install tparse, for example:

  install-tools:
  	go install github.com/jstemmer/tparse/cmd/tparse@latest

  and make test depend on install-tools.
- Track tparse in version control by adding a tools.go under go/deploy/metald/:

  // +build tools

  package tools

  import _ "github.com/jstemmer/tparse/cmd/tparse"
- Update README or developer setup docs to include the above steps.
🤖 Prompt for AI Agents
In go/deploy/metald/Makefile around line 88 the test target pipes output to
tparse which isn't installed or tracked; add a Makefile target (e.g.,
install-tools) that runs `go install
github.com/jstemmer/tparse/cmd/tparse@latest` and make the test target depend on
that target so tparse is available in a clean environment; add a new
go/deploy/metald/tools.go file with the tools build tag that imports `_
"github.com/jstemmer/tparse/cmd/tparse"` to pin tparse as a Go tool dependency
in version control; and update the repository README/developer setup docs to
mention the tools installation or the Makefile target.
| // AIDEV-NOTE: Signal handling continues during graceful shutdown to prevent SIGABRT panics | ||
| shutdownSignalReceived := make(chan struct{}) | ||
|
|
🛠️ Refactor suggestion
Signal ignore loop never runs due to closed-channel select; fix to actually drain signals.
Once shutdownSignalReceived is closed, that select case is permanently ready, so the goroutine returns almost immediately and additional signals are no longer ignored/logged.
- // Continue handling signals during graceful shutdown to prevent SIGABRT panics
- go func() {
- for {
- select {
- case <-shutdownSignalReceived:
- // Already shutting down, ignore
- return
- case sig := <-sigChan:
- logger.Warn("received additional signal during shutdown, ignoring",
- slog.String("signal", sig.String()),
- )
- // Continue listening for more signals
- }
- }
- }()
+ // Continue handling (and ignoring) subsequent signals during shutdown
+ go func() {
+ // Wait until shutdown starts
+ <-shutdownSignalReceived
+ for sig := range sigChan {
+ logger.Warn("received additional signal during shutdown, ignoring",
+ slog.String("signal", sig.String()),
+ )
+ }
+ }()

Optionally call signal.Stop(sigChan) just before process exit to release resources.
Also applies to: 435-450
| // Register via assetmanager client | ||
| assetID, err := m.assetClient.RegisterBuildArtifact(ctx, "base-assets", localPath, asset.Type, labels) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to register asset: %w", err) | ||
| // Already exists errors are fine, connection errors should cause retry | ||
| errStr := strings.ToLower(err.Error()) | ||
| if strings.Contains(errStr, "already exists") || strings.Contains(errStr, "duplicate") || | ||
| strings.Contains(errStr, "conflict") { | ||
| m.logger.InfoContext(ctx, "base asset already registered, skipping", | ||
| "asset", asset.Name, | ||
| "error", err, | ||
| ) | ||
| return nil // Success - asset already registered | ||
| } else { | ||
| // This is likely a connection/service unavailable error - should trigger retry | ||
| return fmt.Errorf("failed to register base asset %s (service may not be ready): %w", asset.Name, err) | ||
| } | ||
| } |
🛠️ Refactor suggestion
Avoid substring matching for “already exists” errors; use typed/status errors
String contains checks are brittle and locale-dependent. Prefer:
- gRPC: status.Code(err) == codes.AlreadyExists
- Sentinel/typed errors: errors.Is(err, assetmanager.ErrAlreadyExists)
If neither is available, introduce an error classifying helper with unit tests.
| // copyMetadataForRootDevice copies metadata and creates container.cmd for a root device | ||
| func (c *Client) copyMetadataForRootDevice(ctx context.Context, vmID string, disk *metaldv1.StorageDevice, jailerRoot string, diskDst string) error { | ||
| baseName := strings.TrimSuffix(filepath.Base(disk.GetPath()), filepath.Ext(disk.GetPath())) | ||
| metadataSrc := filepath.Join(filepath.Dir(disk.GetPath()), baseName+".metadata.json") | ||
|
|
||
| // Check if metadata file exists | ||
| if _, err := os.Stat(metadataSrc); err != nil { | ||
| if os.IsNotExist(err) { | ||
| return nil // No metadata file is OK | ||
| } | ||
| return fmt.Errorf("failed to stat metadata file: %w", err) | ||
| } | ||
|
|
||
| // Copy metadata file | ||
| metadataDst := filepath.Join(jailerRoot, filepath.Base(metadataSrc)) | ||
| if err := copyFileWithOwnership(metadataSrc, metadataDst, int(c.jailerConfig.UID), int(c.jailerConfig.GID)); err != nil { | ||
| return fmt.Errorf("failed to copy metadata file: %w", err) | ||
| } | ||
|
|
||
| c.logger.LogAttrs(ctx, slog.LevelInfo, "copied metadata file to jailer root", | ||
| slog.String("src", metadataSrc), | ||
| slog.String("dst", metadataDst), | ||
| ) | ||
|
|
||
| // Load and process metadata to create container.cmd | ||
| metadata, err := c.loadContainerMetadata(ctx, disk.GetPath()) | ||
| if err != nil || metadata == nil { | ||
| return nil // Can't create container.cmd without metadata | ||
| } |
🧹 Nitpick (assertive)
Log metadata load failures at debug; add an anchor for future behavior.
Currently returns nil on errors, which can hide issues.
Apply this diff:
metadata, err := c.loadContainerMetadata(ctx, disk.GetPath())
if err != nil || metadata == nil {
- return nil // Can't create container.cmd without metadata
+ // AIDEV-NOTE: Metadata missing/unreadable; skipping container.cmd generation is intentional.
+ // Consider promoting to debug with context to aid troubleshooting.
+ c.logger.DebugContext(ctx, "skipping container.cmd (no metadata)", slog.String("disk", disk.GetPath()), slog.Any("error", err))
+ return nil
}

📝 Committable suggestion
| // copyMetadataForRootDevice copies metadata and creates container.cmd for a root device | |
| func (c *Client) copyMetadataForRootDevice(ctx context.Context, vmID string, disk *metaldv1.StorageDevice, jailerRoot string, diskDst string) error { | |
| baseName := strings.TrimSuffix(filepath.Base(disk.GetPath()), filepath.Ext(disk.GetPath())) | |
| metadataSrc := filepath.Join(filepath.Dir(disk.GetPath()), baseName+".metadata.json") | |
| // Check if metadata file exists | |
| if _, err := os.Stat(metadataSrc); err != nil { | |
| if os.IsNotExist(err) { | |
| return nil // No metadata file is OK | |
| } | |
| return fmt.Errorf("failed to stat metadata file: %w", err) | |
| } | |
| // Copy metadata file | |
| metadataDst := filepath.Join(jailerRoot, filepath.Base(metadataSrc)) | |
| if err := copyFileWithOwnership(metadataSrc, metadataDst, int(c.jailerConfig.UID), int(c.jailerConfig.GID)); err != nil { | |
| return fmt.Errorf("failed to copy metadata file: %w", err) | |
| } | |
| c.logger.LogAttrs(ctx, slog.LevelInfo, "copied metadata file to jailer root", | |
| slog.String("src", metadataSrc), | |
| slog.String("dst", metadataDst), | |
| ) | |
| // Load and process metadata to create container.cmd | |
| metadata, err := c.loadContainerMetadata(ctx, disk.GetPath()) | |
| if err != nil || metadata == nil { | |
| return nil // Can't create container.cmd without metadata | |
| } | |
| // Load and process metadata to create container.cmd | |
| metadata, err := c.loadContainerMetadata(ctx, disk.GetPath()) | |
| if err != nil || metadata == nil { | |
| // AIDEV-NOTE: Metadata missing/unreadable; skipping container.cmd generation is intentional. | |
| // Consider promoting to debug with context to aid troubleshooting. | |
| c.logger.DebugContext( | |
| ctx, | |
| "skipping container.cmd (no metadata)", | |
| slog.String("disk", disk.GetPath()), | |
| slog.Any("error", err), | |
| ) | |
| return nil | |
| } |
🤖 Prompt for AI Agents
In go/deploy/metald/internal/backend/firecracker/asset_static.go around lines 16
to 44, the call to loadContainerMetadata swallows errors and returns nil
silently; log failures at debug and add an anchor comment about future behavior.
Modify the error path to call c.logger.LogAttrs(ctx, slog.LevelDebug, "failed to
load container metadata", slog.String("vmID", vmID), slog.String("diskPath",
disk.GetPath()), slog.Any("error", err)) (or equivalent logger method) when err
!= nil, and include a TODO/TICKET comment noting that we currently return nil
but should surface or handle metadata load errors in the future; keep returning
nil for now to preserve existing behavior.
| // Create temporary mount directory | ||
| mountDir := filepath.Join("/tmp", fmt.Sprintf("mount-%s", vmID)) | ||
| if err := os.MkdirAll(mountDir, 0o755); err != nil { | ||
| return fmt.Errorf("failed to create mount directory: %w", err) | ||
| } | ||
| defer os.RemoveAll(mountDir) | ||
|
|
🛠️ Refactor suggestion
Avoid RemoveAll on an active mount; use MkdirTemp to prevent collisions.
The deferred RemoveAll runs even when umount fails, deleting guest files through the still-active mount. Also make the mountpoint unique.
Apply this diff:
- // Create temporary mount directory
- mountDir := filepath.Join("/tmp", fmt.Sprintf("mount-%s", vmID))
- if err := os.MkdirAll(mountDir, 0o755); err != nil {
- return fmt.Errorf("failed to create mount directory: %w", err)
- }
- defer os.RemoveAll(mountDir)
+ // Create a unique temporary mount directory to avoid collisions
+ mountDir, err := os.MkdirTemp("", fmt.Sprintf("metald-mnt-%s-", vmID))
+ if err != nil {
+ return fmt.Errorf("failed to create mount directory: %w", err)
+ }

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In go/deploy/metald/internal/backend/firecracker/asset_static.go around lines
61-67, the code creates a non-unique /tmp mount directory and defers
os.RemoveAll immediately, which risks deleting guest files if unmount fails;
replace filepath.Join("/tmp", ...) + MkdirAll with os.MkdirTemp("/tmp",
"mount-*") to create a unique temporary directory with correct permissions,
remove the deferred os.RemoveAll, and instead perform cleanup (os.RemoveAll)
only after a successful unmount (or in a cleanup path that runs when the mount
is confirmed unmounted), ensuring errors from MkdirTemp are handled and that
RemoveAll is not executed while the mount point may still be active.
| // Mount the rootfs ext4 image | ||
| mountCmd := exec.CommandContext(ctx, "mount", "-o", "loop", diskDst, mountDir) | ||
| if err := mountCmd.Run(); err != nil { | ||
| return fmt.Errorf("failed to mount rootfs: %w", err) | ||
| } | ||
| defer func() { | ||
| // Always unmount | ||
| umountCmd := exec.CommandContext(ctx, "umount", mountDir) | ||
| if err := umountCmd.Run(); err != nil { | ||
| c.logger.WarnContext(ctx, "failed to unmount rootfs", | ||
| "error", err, | ||
| "mountDir", mountDir, | ||
| ) | ||
| } | ||
| }() | ||
|
|
🛠️ Refactor suggestion
Harden mount options and only remove mountpoint after successful unmount.
Use nosuid,nodev,noexec and explicit fs type. Don’t RemoveAll if unmount fails.
Apply this diff:
- // Mount the rootfs ext4 image
- mountCmd := exec.CommandContext(ctx, "mount", "-o", "loop", diskDst, mountDir)
+ // AIDEV-NOTE: Requires CAP_SYS_ADMIN/root. Mount rw to write container.cmd but restrict capabilities.
+ // Mount the rootfs ext4 image
+ mountCmd := exec.CommandContext(ctx, "mount", "-t", "ext4", "-o", "loop,rw,nosuid,nodev,noexec", diskDst, mountDir)
if err := mountCmd.Run(); err != nil {
return fmt.Errorf("failed to mount rootfs: %w", err)
}
defer func() {
// Always unmount
umountCmd := exec.CommandContext(ctx, "umount", mountDir)
if err := umountCmd.Run(); err != nil {
c.logger.WarnContext(ctx, "failed to unmount rootfs",
"error", err,
"mountDir", mountDir,
)
- }
+ // Do not remove mountDir if unmount failed to avoid deleting guest files from the image.
+ return
+ }
+ // Safe to remove the mount directory after a successful unmount
+ _ = os.RemoveAll(mountDir)
}()

📝 Committable suggestion
| // Mount the rootfs ext4 image | |
| mountCmd := exec.CommandContext(ctx, "mount", "-o", "loop", diskDst, mountDir) | |
| if err := mountCmd.Run(); err != nil { | |
| return fmt.Errorf("failed to mount rootfs: %w", err) | |
| } | |
| defer func() { | |
| // Always unmount | |
| umountCmd := exec.CommandContext(ctx, "umount", mountDir) | |
| if err := umountCmd.Run(); err != nil { | |
| c.logger.WarnContext(ctx, "failed to unmount rootfs", | |
| "error", err, | |
| "mountDir", mountDir, | |
| ) | |
| } | |
| }() | |
| // AIDEV-NOTE: Requires CAP_SYS_ADMIN/root. Mount rw to write container.cmd but restrict capabilities. | |
| // Mount the rootfs ext4 image | |
| mountCmd := exec.CommandContext(ctx, "mount", "-t", "ext4", "-o", "loop,rw,nosuid,nodev,noexec", diskDst, mountDir) | |
| if err := mountCmd.Run(); err != nil { | |
| return fmt.Errorf("failed to mount rootfs: %w", err) | |
| } | |
| defer func() { | |
| // Always unmount | |
| umountCmd := exec.CommandContext(ctx, "umount", mountDir) | |
| if err := umountCmd.Run(); err != nil { | |
| c.logger.WarnContext(ctx, "failed to unmount rootfs", | |
| "error", err, | |
| "mountDir", mountDir, | |
| ) | |
| // Do not remove mountDir if unmount failed to avoid deleting guest files from the image. | |
| return | |
| } | |
| // Safe to remove the mount directory after a successful unmount | |
| _ = os.RemoveAll(mountDir) | |
| }() |
🤖 Prompt for AI Agents
In go/deploy/metald/internal/backend/firecracker/asset_static.go around lines 68
to 83, the mount call currently uses insecure defaults and the defer always
removes the mountpoint even if unmount fails; change the mount command to
include explicit filesystem type (ext4) and hardened options ("-o",
"loop,nosuid,nodev,noexec") and update the deferred cleanup so it first attempts
to unmount and only removes the mountDir (e.g. os.RemoveAll) if unmount
succeeds; if umount fails log the warning but do not delete the directory or its
contents.
| // PortMapping represents a mapping from VM port to host port | ||
| type PortMapping struct { | ||
| ContainerPort int `json:"container_port"` | ||
| HostPort int `json:"host_port"` | ||
| Protocol string `json:"protocol"` // tcp or udp | ||
| VMID string `json:"vm_id"` | ||
| VMPort int `json:"vm_port"` | ||
| HostPort int `json:"host_port"` | ||
| Protocol string `json:"protocol"` // tcp or udp | ||
| VMID string `json:"vm_id"` | ||
| } |
🧹 Nitpick (assertive)
Backward-compat risk: JSON field rename to vm_port.
If persisted state or external APIs still send container_port, json.Unmarshal will silently drop the value and leave VMPort at zero. Consider a custom UnmarshalJSON that accepts both.
// Optional: accept both "vm_port" and "container_port" during decode.
func (p *PortMapping) UnmarshalJSON(b []byte) error {
type Alias PortMapping
aux := struct {
Alias
ContainerPort *int `json:"container_port"`
}{}
if err := json.Unmarshal(b, &aux); err != nil {
return err
}
*p = PortMapping(aux.Alias)
if p.VMPort == 0 && aux.ContainerPort != nil {
p.VMPort = *aux.ContainerPort
}
return nil
}

🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/port_allocator.go around lines 10 to 16,
the JSON field was renamed to "vm_port" which breaks unmarshalling of persisted
or external data still using "container_port"; implement a custom UnmarshalJSON
on PortMapping that defines an alias type, decodes into a struct that includes
both a ContainerPort *int and the alias, assign the decoded alias back to the
receiver, and if VMPort is zero and ContainerPort is non-nil set VMPort from
ContainerPort; ensure the method returns any json.Unmarshal error and that the
package imports encoding/json.
| // BridgeManager manages workspace allocation across multiple bridges | ||
| type BridgeManager struct { | ||
| bridgeCount int // 8 or 32 bridges | ||
| bridgePrefix string // "br-vms" -> br-vms-0, br-vms-1, etc. | ||
| workspaces map[string]*WorkspaceAllocation // workspace_id -> allocation | ||
| bridgeUsage map[int]map[string]bool // bridge_num -> workspace_id -> exists | ||
| mu sync.RWMutex | ||
| statePath string // Path to state persistence file | ||
| logger *slog.Logger // Structured logger for state operations | ||
| } |
🧹 Nitpick (assertive)
Mutex/logger fields: document concurrency and logging contracts.
Since these types embed mu and logger, add AIDEV-NOTE comments outlining which methods take read/write locks and how logger context is propagated.
Also applies to: 29-39
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/types.go around lines 10 to 19 (and
similarly for lines 29 to 39), add AIDEV-NOTE comments on the mutex and logger
fields: document which public/internal methods acquire the read lock (RLock) and
which acquire the write lock (Lock), include the expected lock ordering if any,
and state whether callsites must hold locks when calling those methods; also
document how the logger context is propagated (e.g., whether methods add
workspace/bridge fields, whether the logger is safe for concurrent use, and
whether callers should clone/With on the logger before adding fields). Keep the
notes brief and specific (method names and lock type), place them immediately
above the mu and logger field declarations, and update the mirrored note for the
other struct at lines 29–39.
| type MultiBridgeManager struct { | ||
| bridgeCount int // 8 or 32 bridges | ||
| bridgePrefix string // "br-vms" -> br-vms-0, br-vms-1, etc. | ||
| workspaces map[string]*WorkspaceAllocation // workspace_id -> allocation | ||
| bridgeUsage map[int]map[string]bool // bridge_num -> workspace_id -> exists | ||
| mu sync.RWMutex | ||
| vlanRangeStart int // Starting VLAN ID (100) | ||
| vlanRangeEnd int // Ending VLAN ID (4000) | ||
| statePath string // Path to state persistence file | ||
| logger *slog.Logger // Structured logger for state operations | ||
| } |
🛠️ Refactor suggestion
Duplicated manager types (BridgeManager vs MultiBridgeManager).
Both carry near-identical fields; this invites drift. Consolidate into a single manager with config options (bridge/vlan ranges) to reduce complexity.
Add an AIDEV-NOTE explaining the design choice if you keep both.
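One way to consolidate the two managers is a single type configured by functional options; every name below is illustrative, not the metald API:

```go
package main

import "sync"

// Option configures the consolidated bridge manager.
type Option func(*ConsolidatedManager)

// WithBridges sets the bridge count and device-name prefix.
func WithBridges(count int, prefix string) Option {
	return func(m *ConsolidatedManager) {
		m.bridgeCount, m.bridgePrefix = count, prefix
	}
}

// WithVLANRange enables VLAN allocation; zero values mean "disabled".
func WithVLANRange(start, end int) Option {
	return func(m *ConsolidatedManager) {
		m.vlanStart, m.vlanEnd = start, end
	}
}

// ConsolidatedManager replaces both BridgeManager and
// MultiBridgeManager, so the shared fields cannot drift.
type ConsolidatedManager struct {
	mu           sync.RWMutex
	bridgeCount  int
	bridgePrefix string
	vlanStart    int
	vlanEnd      int
	workspaces   map[string]struct{}
}

// NewConsolidatedManager applies defaults, then options.
func NewConsolidatedManager(opts ...Option) *ConsolidatedManager {
	m := &ConsolidatedManager{
		bridgeCount:  8,
		bridgePrefix: "br-vms",
		workspaces:   map[string]struct{}{},
	}
	for _, o := range opts {
		o(m)
	}
	return m
}

// VLANsEnabled reports whether this manager allocates VLANs.
func (m *ConsolidatedManager) VLANsEnabled() bool {
	return m.vlanStart > 0 && m.vlanEnd >= m.vlanStart
}
```

The VLAN-aware variant then becomes `NewConsolidatedManager(WithBridges(32, "br-vms"), WithVLANRange(100, 4000))` instead of a second struct.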
| WorkspaceID string `json:"workspace_id"` | ||
| BridgeNumber int `json:"bridge_number"` // 0-31 | ||
| BridgeName string `json:"bridge_name"` // br-vms-N | ||
| CreatedAt string `json:"created_at"` | ||
| VMCount int `json:"vm_count"` // Track VM count for IP allocation | ||
| } |
🛠️ Refactor suggestion
CreatedAt should be time.Time for consistency and safer JSON.
WorkspaceAllocation.CreatedAt is string while VMNetwork.CreatedAt is time.Time. Use time.Time and let encoding/json handle RFC3339; add omitempty if needed.
- CreatedAt string `json:"created_at"`
+ CreatedAt time.Time `json:"created_at"`

Update serialization/deserialization sites accordingly.
📝 Committable suggestion
| WorkspaceID string `json:"workspace_id"` | |
| BridgeNumber int `json:"bridge_number"` // 0-31 | |
| BridgeName string `json:"bridge_name"` // br-vms-N | |
| CreatedAt string `json:"created_at"` | |
| VMCount int `json:"vm_count"` // Track VM count for IP allocation | |
| } | |
| WorkspaceID string `json:"workspace_id"` | |
| BridgeNumber int `json:"bridge_number"` // 0-31 | |
| BridgeName string `json:"bridge_name"` // br-vms-N | |
| CreatedAt time.Time `json:"created_at"` | |
| VMCount int `json:"vm_count"` // Track VM count for IP allocation | |
| } |
| VMID string `json:"vm_id"` | ||
| NetworkID string `json:"network_id"` // AIDEV-NOTE: Internal 8-char ID for network device naming | ||
| WorkspaceID string `json:"workspace_id"` // AIDEV-NOTE: Track workspace for proper IP release | ||
| Namespace string `json:"namespace"` | ||
| TapDevice string `json:"tap_device"` | ||
| IPAddress net.IP `json:"ip_address"` | ||
| Netmask net.IPMask `json:"netmask"` | ||
| Gateway net.IP `json:"gateway"` | ||
| MacAddress string `json:"mac_address"` | ||
| DNSServers []string `json:"dns_servers"` | ||
| CreatedAt time.Time `json:"created_at"` | ||
|
|
||
| // Optional fields for advanced configurations | ||
| VLANID int `json:"vlan_id,omitempty"` | ||
| // Optional fields for advanced configuration | ||
| IPv6Address net.IP `json:"ipv6_address,omitempty"` | ||
| Routes []Route `json:"routes,omitempty"` | ||
| } |
🧹 Nitpick (assertive)
JSON of net.IPMask is awkward; prefer CIDR prefix or IPNet.
net.IP marshals nicely, but net.IPMask becomes base64-encoded bytes in JSON. Replace Netmask with a CIDR prefix int or use an IPNet for the interface address.
- Netmask net.IPMask `json:"netmask"`
+ Prefix int `json:"prefix"` // e.g., 24
Add an AIDEV-NOTE on why this change improves API ergonomics.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/types.go around lines 52 to 67, replace the
Netmask field (net.IPMask) which JSON-encodes awkwardly with either a CIDR
prefix int (e.g., PrefixLen int `json:"prefix_len"`) or an IPNet (e.g., IPNet
*net.IPNet `json:"ip_net,omitempty"`), update any uses to parse/format
accordingly, and add an AIDEV-NOTE comment explaining that using a prefix length
or IPNet improves API ergonomics because it avoids base64-encoding of mask bytes
and is human-readable/portable in JSON. Ensure tags use omitempty where
appropriate and keep semantic compatibility by converting old netmask byte
values at marshal/unmarshal points if needed.
| package executor | ||
| import ( | ||
| "context" | ||
| "fmt" | ||
| "log/slog" | ||
| "os" | ||
| "path/filepath" | ||
| "time" | ||
| "github.com/unkeyed/unkey/go/deploy/builderd/internal/config" | ||
| "github.com/unkeyed/unkey/go/deploy/builderd/internal/observability" | ||
| "github.com/unkeyed/unkey/go/deploy/pkg/observability/interceptors" | ||
| builderv1 "github.com/unkeyed/unkey/go/gen/proto/deploy/builderd/v1" | ||
| ) | ||
| // DockerPipelineExecutor wraps the existing DockerExecutor with step-based execution | ||
| type DockerPipelineExecutor struct { | ||
| dockerExecutor *DockerExecutor | ||
| pipeline *BuildPipeline | ||
| logger *slog.Logger | ||
| config *config.Config | ||
| buildMetrics *observability.BuildMetrics | ||
| } |
🧹 Nitpick (assertive)
Add high-level AIDEV anchor for pipeline semantics.
Given the orchestration role here, add a brief AIDEV-NOTE explaining idempotency, failure cleanup expectations, and resume assumptions.
Example:
// AIDEV-NOTE: Pipeline executor orchestrates idempotent steps. On failure, we still expect
// best-effort cleanup (container/image) and consistent metrics. Resume assumes step outputs
// can be reconstructed from workspaceDir/rootfsDir.
🤖 Prompt for AI Agents
In go/deploy/builderd/internal/executor/docker_pipeline.go around lines 1 to 24,
add a high-level AIDEV-NOTE comment above the DockerPipelineExecutor type
describing pipeline semantics: state that the pipeline executor orchestrates
idempotent steps, that on failures best-effort cleanup of containers/images is
expected, and that resume logic assumes step outputs can be reconstructed from
workspaceDir/rootfsDir; place this single-line AIDEV-NOTE as a brief comment
block to guide future maintainers.
| // Get tenant context for logging and metrics | ||
| tenantID := "unknown" | ||
| if auth, ok := interceptors.TenantFromContext(ctx); ok { | ||
| tenantID = auth.TenantID | ||
| } | ||
| logger := d.logger.With( | ||
| slog.String("tenant_id", tenantID), | ||
| slog.String("build_id", buildID), | ||
| slog.String("image_uri", request.GetConfig().GetSource().GetDockerImage().GetImageUri()), | ||
| ) | ||
Nil-pointer risk: image URI accessed before validation.
logger construction dereferences request.Config.Source.DockerImage before checking for nil, leading to panic on bad input. Validate and extract imageURI first, then build the logger.
Apply:
- // Get tenant context for logging and metrics
+ // Validate request and extract image info
tenantID := "unknown"
if auth, ok := interceptors.TenantFromContext(ctx); ok {
tenantID = auth.TenantID
}
- logger := d.logger.With(
- slog.String("tenant_id", tenantID),
- slog.String("build_id", buildID),
- slog.String("image_uri", request.GetConfig().GetSource().GetDockerImage().GetImageUri()),
- )
+ if request == nil || request.GetConfig() == nil || request.GetConfig().GetSource() == nil ||
+ request.GetConfig().GetSource().GetDockerImage() == nil ||
+ request.GetConfig().GetSource().GetDockerImage().GetImageUri() == "" {
+ return nil, fmt.Errorf("invalid request: docker image source is required")
+ }
+ imageURI := request.GetConfig().GetSource().GetDockerImage().GetImageUri()
+
+ logger := d.logger.With(
+ slog.String("tenant_id", tenantID),
+ slog.String("build_id", buildID),
+ slog.String("image_uri", imageURI),
+ )
@@
- dockerSource := request.GetConfig().GetSource().GetDockerImage()
- if dockerSource == nil {
- return nil, fmt.Errorf("docker image source is required")
- }
+ // dockerSource validated above
AIDEV-NOTE: Validate nested request fields before logging to avoid panics on malformed input.
Also applies to: 68-72
| // Create directories | ||
| if err := os.MkdirAll(workspaceDir, 0755); err != nil { | ||
| logger.ErrorContext(ctx, "failed to create workspace directory", slog.String("error", err.Error())) | ||
| return nil, fmt.Errorf("failed to create workspace directory: %w", err) | ||
| } | ||
| if err := os.MkdirAll(rootfsDir, 0755); err != nil { | ||
| logger.ErrorContext(ctx, "failed to create rootfs directory", slog.String("error", err.Error())) | ||
| return nil, fmt.Errorf("failed to create rootfs directory: %w", err) | ||
| } |
🧹 Nitpick (assertive)
Workspace/rootfs perms: consider 0750/0700 for build isolation.
0755 allows world-read/exec. If builds can contain sensitive artifacts, default to least privilege.
Apply:
- if err := os.MkdirAll(workspaceDir, 0755); err != nil {
+ if err := os.MkdirAll(workspaceDir, 0750); err != nil {
@@
- if err := os.MkdirAll(rootfsDir, 0755); err != nil {
+ if err := os.MkdirAll(rootfsDir, 0750); err != nil {📝 Committable suggestion
| // Create directories | |
| if err := os.MkdirAll(workspaceDir, 0755); err != nil { | |
| logger.ErrorContext(ctx, "failed to create workspace directory", slog.String("error", err.Error())) | |
| return nil, fmt.Errorf("failed to create workspace directory: %w", err) | |
| } | |
| if err := os.MkdirAll(rootfsDir, 0755); err != nil { | |
| logger.ErrorContext(ctx, "failed to create rootfs directory", slog.String("error", err.Error())) | |
| return nil, fmt.Errorf("failed to create rootfs directory: %w", err) | |
| } | |
| // Create directories | |
| if err := os.MkdirAll(workspaceDir, 0750); err != nil { | |
| logger.ErrorContext(ctx, "failed to create workspace directory", slog.String("error", err.Error())) | |
| return nil, fmt.Errorf("failed to create workspace directory: %w", err) | |
| } | |
| if err := os.MkdirAll(rootfsDir, 0750); err != nil { | |
| logger.ErrorContext(ctx, "failed to create rootfs directory", slog.String("error", err.Error())) | |
| return nil, fmt.Errorf("failed to create rootfs directory: %w", err) | |
| } |
🤖 Prompt for AI Agents
In go/deploy/builderd/internal/executor/docker_pipeline.go around lines 82 to
91, the directories are created with permissive 0755; change to least-privilege
modes by creating workspaceDir with 0750 and rootfsDir with 0700 to reduce world
access. Update the os.MkdirAll calls to use those permission bits (keep existing
error handling/log messages unchanged).
| // Execute the pipeline | ||
| result, err := d.pipeline.Execute(ctx, input) | ||
| if err != nil { | ||
| logger.ErrorContext(ctx, "pipeline execution failed", slog.String("error", err.Error())) | ||
| if d.buildMetrics != nil { | ||
| d.buildMetrics.RecordBuildComplete(ctx, "docker", "docker", time.Since(start), false) | ||
| } | ||
| return nil, err | ||
| } | ||
🧹 Nitpick (assertive)
Unify metrics completion via defer to avoid duplication.
You record completion in both branches. Use a single deferred recorder keyed on err for consistency.
Example:
start := time.Now()
defer func() {
success := err == nil
if d.buildMetrics != nil {
d.buildMetrics.RecordBuildComplete(ctx, "docker", "docker", time.Since(start), success)
}
}()
Apply within function scope.
Also applies to: 112-116
🤖 Prompt for AI Agents
In go/deploy/builderd/internal/executor/docker_pipeline.go around lines 102-111
(and also apply the same change for the 112-116 region), the build completion
metric is recorded in multiple branches; instead, declare a start := time.Now()
and use a single deferred closure that reads the function's returned err to
compute success and calls d.buildMetrics.RecordBuildComplete once. To do this,
make the function use a named error return (e.g., err error) or ensure the
deferred closure captures the correct err variable, move the metrics call out of
the error and success branches, and keep the existing slog logging in-place;
remove the duplicated RecordBuildComplete calls in the branches.
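A runnable sketch of the deferred-recorder pattern with a named error return, using a stand-in metrics type rather than the real observability.BuildMetrics:

```go
package main

import (
	"fmt"
	"time"
)

// metrics is a hypothetical stand-in for BuildMetrics; it just records
// the success flag of each completion call.
type metrics struct{ calls []bool }

func (m *metrics) RecordBuildComplete(d time.Duration, success bool) {
	m.calls = append(m.calls, success)
}

// build uses a named error return so the deferred closure observes the
// final err and records completion exactly once, on every exit path.
func build(m *metrics, fail bool) (err error) {
	start := time.Now()
	defer func() {
		m.RecordBuildComplete(time.Since(start), err == nil)
	}()
	if fail {
		return fmt.Errorf("pipeline failed")
	}
	return nil
}

func main() {
	m := &metrics{}
	_ = build(m, true)
	_ = build(m, false)
	fmt.Println(m.calls) // [false true] — one failure, one success, no duplicates
}
```

The named return is the key detail: a closure capturing an unnamed local `err` declared before `defer` would work too, but shadowed `err :=` assignments inside the function would silently break it.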
| // ResumeBuild resumes a build from a specific step | ||
| func (d *DockerPipelineExecutor) ResumeBuild(ctx context.Context, request *builderv1.CreateBuildRequest, buildID string, lastCompletedStep int) (*BuildResult, error) { | ||
| start := time.Now() | ||
| // Get tenant context for logging and metrics | ||
| tenantID := "unknown" | ||
| if auth, ok := interceptors.TenantFromContext(ctx); ok { | ||
| tenantID = auth.TenantID | ||
| } | ||
| logger := d.logger.With( | ||
| slog.String("tenant_id", tenantID), | ||
| slog.String("build_id", buildID), | ||
| slog.String("image_uri", request.GetConfig().GetSource().GetDockerImage().GetImageUri()), | ||
| slog.Int("resume_from_step", lastCompletedStep+1), | ||
| ) | ||
ResumeBuild has the same nil-pointer hazard.
It dereferences the image URI for logging without validation. Mirror the guard used in ExtractDockerImageWithID.
Apply:
- // Get tenant context for logging and metrics
+ // Get tenant context and validate request
tenantID := "unknown"
if auth, ok := interceptors.TenantFromContext(ctx); ok {
tenantID = auth.TenantID
}
- logger := d.logger.With(
+ if request == nil || request.GetConfig() == nil || request.GetConfig().GetSource() == nil ||
+ request.GetConfig().GetSource().GetDockerImage() == nil ||
+ request.GetConfig().GetSource().GetDockerImage().GetImageUri() == "" {
+ return nil, fmt.Errorf("invalid request: docker image source is required")
+ }
+ imageURI := request.GetConfig().GetSource().GetDockerImage().GetImageUri()
+
+ logger := d.logger.With(
slog.String("tenant_id", tenantID),
slog.String("build_id", buildID),
- slog.String("image_uri", request.GetConfig().GetSource().GetDockerImage().GetImageUri()),
+ slog.String("image_uri", imageURI),
slog.Int("resume_from_step", lastCompletedStep+1),
)
📝 Committable suggestion
| // ResumeBuild resumes a build from a specific step | |
| func (d *DockerPipelineExecutor) ResumeBuild(ctx context.Context, request *builderv1.CreateBuildRequest, buildID string, lastCompletedStep int) (*BuildResult, error) { | |
| start := time.Now() | |
| // Get tenant context for logging and metrics | |
| tenantID := "unknown" | |
| if auth, ok := interceptors.TenantFromContext(ctx); ok { | |
| tenantID = auth.TenantID | |
| } | |
| logger := d.logger.With( | |
| slog.String("tenant_id", tenantID), | |
| slog.String("build_id", buildID), | |
| slog.String("image_uri", request.GetConfig().GetSource().GetDockerImage().GetImageUri()), | |
| slog.Int("resume_from_step", lastCompletedStep+1), | |
| ) | |
| // ResumeBuild resumes a build from a specific step | |
| func (d *DockerPipelineExecutor) ResumeBuild(ctx context.Context, request *builderv1.CreateBuildRequest, buildID string, lastCompletedStep int) (*BuildResult, error) { | |
| start := time.Now() | |
| // Get tenant context and validate request | |
| tenantID := "unknown" | |
| if auth, ok := interceptors.TenantFromContext(ctx); ok { | |
| tenantID = auth.TenantID | |
| } | |
| if request == nil || request.GetConfig() == nil || request.GetConfig().GetSource() == nil || | |
| request.GetConfig().GetSource().GetDockerImage() == nil || | |
| request.GetConfig().GetSource().GetDockerImage().GetImageUri() == "" { | |
| return nil, fmt.Errorf("invalid request: docker image source is required") | |
| } | |
| imageURI := request.GetConfig().GetSource().GetDockerImage().GetImageUri() | |
| logger := d.logger.With( | |
| slog.String("tenant_id", tenantID), | |
| slog.String("build_id", buildID), | |
| slog.String("image_uri", imageURI), | |
| slog.Int("resume_from_step", lastCompletedStep+1), | |
| ) | |
| // ... rest of ResumeBuild ... |
🤖 Prompt for AI Agents
In go/deploy/builderd/internal/executor/docker_pipeline.go around lines 120 to
136, the logger currently dereferences
request.GetConfig().GetSource().GetDockerImage().GetImageUri() directly which
can cause a nil-pointer panic; mirror the guard in ExtractDockerImageWithID by
first safely checking that request.GetConfig(), Config.GetSource(),
Source.GetDockerImage() are non-nil (and optionally that GetImageUri() is
non-empty), assign the image URI to a local string variable only if those checks
pass (otherwise set it to an empty string or "unknown"), and then use that safe
variable in logger.With instead of direct chained dereferences.
| // DEBUG: Log full request config for debugging | ||
| if config != nil { | ||
| configJSON, _ := json.Marshal(config) | ||
| s.logger.LogAttrs(ctx, slog.LevelInfo, "DEBUG: Full VM config received", | ||
| s.logger.LogAttrs(ctx, slog.LevelDebug, "full VM config received", | ||
| slog.String("config_json", string(configJSON)), | ||
| ) | ||
| } |
🛠️ Refactor suggestion
Don’t log full VM config; redact or drop sensitive fields
Config may contain paths, keys, or credentials. Keep logs safe-by-default.
- configJSON, _ := json.Marshal(config)
- s.logger.LogAttrs(ctx, slog.LevelDebug, "full VM config received",
- slog.String("config_json", string(configJSON)),
- )
+ // AIDEV-BUSINESS_RULE: Avoid logging full VM config; include only non-sensitive summary.
+ s.logger.LogAttrs(ctx, slog.LevelDebug, "vm config received (summary)",
+ slog.Any("cpu", config.GetCpu()),
+ slog.Any("memory", config.GetMemory()),
+ // omit boot/storage/net details that may be sensitive
+ )
📝 Committable suggestion
| // DEBUG: Log full request config for debugging | |
| if config != nil { | |
| configJSON, _ := json.Marshal(config) | |
| s.logger.LogAttrs(ctx, slog.LevelInfo, "DEBUG: Full VM config received", | |
| s.logger.LogAttrs(ctx, slog.LevelDebug, "full VM config received", | |
| slog.String("config_json", string(configJSON)), | |
| ) | |
| } | |
| // DEBUG: Log full request config for debugging | |
| if config != nil { | |
| // AIDEV-BUSINESS_RULE: Avoid logging full VM config; include only non-sensitive summary. | |
| s.logger.LogAttrs(ctx, slog.LevelDebug, "vm config received (summary)", | |
| slog.Any("cpu", config.GetCpu()), | |
| slog.Any("memory", config.GetMemory()), | |
| // omit boot/storage/net details that may be sensitive | |
| ) | |
| } |
🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/vm.go around lines 68-74, the code
currently logs the entire VM config which may contain sensitive data; replace
that with a safe/redacted representation before logging: identify and remove or
replace sensitive keys (e.g. "private_key", "ssh_key", "password", "token",
"secret", "cert", "key", "credentials", any file paths) by copying the config
into a map/struct and setting those fields to "[REDACTED]" or omitting them, or
else only log a small allowlist of non-sensitive fields (e.g. VM name, cpu,
memory, region); then marshal that safe representation and log it at debug level
instead of the raw config.
| // Validate required fields | ||
| if validateErr := s.validateVMConfig(config); validateErr != nil { | ||
| s.logger.LogAttrs(ctx, slog.LevelError, "invalid vm config", | ||
| slog.String("error", validateErr.Error()), | ||
| // if validateErr := s.validateVMConfig(config); validateErr != nil { | ||
| // s.logger.LogAttrs(ctx, slog.LevelError, "invalid vm config", | ||
| // slog.String("error", validateErr.Error()), | ||
| // ) | ||
| // if s.vmMetrics != nil { | ||
| // s.vmMetrics.RecordVMCreateFailure(ctx, s.getBackendType(), "invalid_config") | ||
| // } | ||
| // return nil, connect.NewError(connect.CodeInvalidArgument, validateErr) | ||
| // } | ||
🛠️ Refactor suggestion
Re-enable minimal config validation
Basic guardrails prevent backend errors and undefined behavior.
- // if validateErr := s.validateVMConfig(config); validateErr != nil {
- // ...
- // }
+ if validateErr := s.validateVMConfig(config); validateErr != nil {
+ s.logger.LogAttrs(ctx, slog.LevelError, "invalid vm config", slog.String("error", validateErr.Error()))
+ if s.vmMetrics != nil {
+ s.vmMetrics.RecordVMCreateFailure(ctx, s.getBackendType(), "invalid_config")
+ }
+ return nil, connect.NewError(connect.CodeInvalidArgument, validateErr)
+ }
If you prefer staged rollout, gate with an env flag and add an AIDEV-TODO to remove the flag later.
Also applies to: 486-543
🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/vm.go around lines 85-95 (and similarly
apply to lines 486-543), re-enable the minimal VM config validation that was
commented out: call s.validateVMConfig(config), log the validation error with
s.logger.LogAttrs at slog.LevelError including validateErr.Error(), record a VM
create failure via s.vmMetrics.RecordVMCreateFailure with reason
"invalid_config" if vmMetrics is non-nil, and return a
connect.NewError(connect.CodeInvalidArgument, validateErr). If you want a staged
rollout, wrap the validation call in an environment-flag check (e.g., if
os.Getenv("ENABLE_MINIMAL_VM_VALIDATION") == "1") and add an AIDEV-TODO comment
to remove the flag later.
| network, netErr := s.queries.AllocateNetwork(ctx) | ||
| if netErr != nil { | ||
| s.logger.Info("failed to allocate network", | ||
| slog.String("error", netErr.Error()), | ||
| ) | ||
| if s.vmMetrics != nil { | ||
| s.vmMetrics.RecordVMCreateFailure(ctx, s.getBackendType(), "invalid_config") | ||
| } | ||
| return nil, connect.NewError(connect.CodeInvalidArgument, validateErr) | ||
| return nil, connect.NewError(connect.CodeInternal, netErr) | ||
| } | ||
| // Add tenant context to logs for audit trail | ||
| // AIDEV-NOTE: In multi-tenant systems, all VM operations should be logged with tenant context | ||
| s.logWithTenantContext(ctx, slog.LevelInfo, "creating vm", | ||
| slog.Int("vcpus", int(config.GetCpu().GetVcpuCount())), | ||
| slog.Int64("memory_bytes", config.GetMemory().GetSizeBytes()), | ||
| s.logger.Info("network allocated", | ||
| slog.String("network_cidr", network.BaseNetwork), | ||
| ) |
🧹 Nitpick (assertive)
🛠️ Refactor suggestion
Allocate-and-forget network; add cleanup on failure or defer allocation
If backend CreateVM fails, the allocated network is leaked.
Proposed pattern:
network, netErr := s.queries.AllocateNetwork(ctx)
if netErr != nil { ... }
+// AIDEV-TODO: release allocated network on CreateVM failure
vmID, err := s.backend.CreateVM(ctx, config)
if err != nil {
- ...
+ // _ = s.queries.ReleaseNetwork(ctx, network.Id) // implement this
+ ...
}
Alternatively, allocate network after VM creation if backend doesn’t require it during create.
Also applies to: 114-126
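A compact sketch of the release-on-failure pattern, with a caller-supplied release func standing in for the proposed queries.ReleaseNetwork:

```go
package main

import "fmt"

// createVM sketches the allocate-then-create flow: release is invoked only
// when a later step fails, so a successful create keeps the allocation.
func createVM(fail bool, release func()) (string, error) {
	ok := false
	defer func() {
		if !ok {
			release() // undo the network allocation on any failure path
		}
	}()
	if fail {
		return "", fmt.Errorf("backend CreateVM failed")
	}
	ok = true
	return "vm-1", nil
}

func main() {
	released := false
	_, err := createVM(true, func() { released = true })
	fmt.Println(err != nil, released) // true true

	released = false
	id, _ := createVM(false, func() { released = true })
	fmt.Println(id, released) // vm-1 false
}
```

The success flag plus `defer` covers every early return, including ones added later, which is why it is preferable to releasing inline in each error branch.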
| return connect.NewResponse(&metaldv1.CreateVmResponse{ | ||
| VmId: vmID, | ||
| State: metaldv1.VmState_VM_STATE_CREATED, | ||
| HostPortPair: "0.0.0.0:9999", | ||
| State: metaldv1.VmState_VM_STATE_CREATED, | ||
| }), nil | ||
| } |
🧹 Nitpick (assertive)
🛠️ Refactor suggestion
CreateVm response should surface a handle to manage the VM; avoid hard-coded host:port
Returning a fixed "0.0.0.0:9999" is a placeholder and clients have no durable identifier for follow-up ops.
Option A (no proto change): return vmID in a response header.
- return connect.NewResponse(&metaldv1.CreateVmResponse{
- HostPortPair: "0.0.0.0:9999",
- State: metaldv1.VmState_VM_STATE_CREATED,
- }), nil
+ // AIDEV-TODO: surface the actual endpoint from networking/backends once available
+ resp := connect.NewResponse(&metaldv1.CreateVmResponse{
+ State: metaldv1.VmState_VM_STATE_CREATED,
+ })
+ resp.Header().Set("x-unkey-vm-id", vmID)
+ return resp, nil
Option B (preferred): add vm_id to CreateVmResponse in the proto.
📝 Committable suggestion
| return connect.NewResponse(&metaldv1.CreateVmResponse{ | |
| VmId: vmID, | |
| State: metaldv1.VmState_VM_STATE_CREATED, | |
| HostPortPair: "0.0.0.0:9999", | |
| State: metaldv1.VmState_VM_STATE_CREATED, | |
| }), nil | |
| } | |
| // AIDEV-TODO: surface the actual endpoint from networking/backends once available | |
| resp := connect.NewResponse(&metaldv1.CreateVmResponse{ | |
| State: metaldv1.VmState_VM_STATE_CREATED, | |
| }) | |
| resp.Header().Set("x-unkey-vm-id", vmID) | |
| return resp, nil | |
| } |
| func (s *VMService) getBackendType() string { | ||
| // Try to determine backend type from the backend implementation | ||
| switch s.backend.(type) { | ||
| case interface{ GetProcessInfo() map[string]interface{} }: | ||
| return "firecracker" | ||
| default: | ||
| return "cloudhypervisor" | ||
| } | ||
| return "firecracker" | ||
| } |
🛠️ Refactor suggestion
Hard-coded backend type breaks Docker-dev metrics; pass backendType in ctor
Return the actual backend type instead of constant “firecracker”.
Apply both diffs.
- Service:
type VMService struct {
backend types.Backend
...
- tracer trace.Tracer
- queries database.Querier
+ tracer trace.Tracer
+ queries database.Querier
+ backendType string
metaldv1connect.UnimplementedVmServiceHandler
}
-func NewVMService(backend types.Backend, logger *slog.Logger, metricsCollector *billing.MetricsCollector, vmMetrics *observability.VMMetrics, queries database.Querier) *VMService {
+func NewVMService(backend types.Backend, logger *slog.Logger, metricsCollector *billing.MetricsCollector, vmMetrics *observability.VMMetrics, queries database.Querier, backendType string) *VMService {
tracer := otel.Tracer("metald.service.vm")
return &VMService{
backend: backend,
...
- tracer: tracer,
+ tracer: tracer,
+ backendType: backendType,
}
}
func (s *VMService) getBackendType() string {
- return "firecracker"
+ return s.backendType
}
- main.go (wire actual type):
- vmService := service.NewVMService(backend, logger, metricsCollector, vmMetrics, db.Queries)
+ vmService := service.NewVMService(backend, logger, metricsCollector, vmMetrics, db.Queries, string(cfg.Backend.Type))Also applies to: 35-45
🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/vm.go around lines 545-547 (and also check
lines 35-45), the method getBackendType currently returns the hard-coded string
"firecracker"; change the VMService to store a backendType field set from the
constructor and have getBackendType return that field. Update the VMService
constructor signature to accept backendType (string) and assign it to the new
field, and then update main.go where VMService is instantiated to pass the
actual backend type value wired from configuration or the caller so Docker-dev
metrics receive the correct backend type.
| After=network.target spire-agent.service assetmanagerd.service metald.service | ||
| Wants=network.target | ||
| Requires=spire-agent.service assetmanagerd.service metald.service |
💡 Verification agent
🧩 Analysis chain
Avoid cascading failures; use Wants(+After) for metald and wait for network-online.
Billaged likely doesn’t need a hard dependency on metald. Requires=metald means billaged is stopped if metald stops. Prefer Wants=metald with ordering, and wait for network-online to avoid TLS/mTLS races on boot.
Apply:
-After=network.target spire-agent.service assetmanagerd.service metald.service
-Wants=network.target
-Requires=spire-agent.service assetmanagerd.service metald.service
+After=network-online.target spire-agent.service assetmanagerd.service metald.service
+Wants=network-online.target metald.service
+Requires=spire-agent.service assetmanagerd.service
🏁 Script executed:
#!/bin/bash
set -e
echo "Locating billaged.service file:"
rg -l 'billaged.service'
echo
echo "Printing first 20 lines of billaged.service:"
sed -n '1,20p' go/deploy/billaged/contrib/systemd/billaged.service
echo
echo "Searching for network-online.target usage in all .service files:"
rg -n 'network-online.target' -g '*.service'
Length of output: 2049
🏁 Script executed:
#!/bin/bash
set -e
echo "Locating assetmanagerd.service"
find go -type f -name 'assetmanagerd.service'
echo
echo "Printing first 20 lines of metald.service"
sed -n '1,20p' go/deploy/docker/systemd/metald.service
echo
echo "Printing first 20 lines of spire-agent.service"
sed -n '1,20p' go/deploy/spire/contrib/systemd/spire-agent.service
Length of output: 1733
go/deploy/billaged/contrib/systemd/billaged.service – use Wants=metald.service (+After) and network-online.target
-After=network.target spire-agent.service assetmanagerd.service metald.service
-Wants=network.target
-Requires=spire-agent.service assetmanagerd.service metald.service
+After=network-online.target spire-agent.service assetmanagerd.service metald.service
+Wants=network-online.target metald.service
+Requires=spire-agent.service assetmanagerd.service
📝 Committable suggestion
| After=network.target spire-agent.service assetmanagerd.service metald.service | |
| Wants=network.target | |
| Requires=spire-agent.service assetmanagerd.service metald.service | |
| After=network-online.target spire-agent.service assetmanagerd.service metald.service | |
| Wants=network-online.target metald.service | |
| Requires=spire-agent.service assetmanagerd.service |
🤖 Prompt for AI Agents
In go/deploy/billaged/contrib/systemd/billaged.service around lines 4 to 6,
change the unit dependencies so metald is a soft dependency and the service
waits for the network to be online: replace After=network.target with
After=network-online.target spire-agent.service assetmanagerd.service
metald.service; add Wants=network-online.target metald.service (so metald is
pulled in but not a hard require); and remove metald.service from the Requires=
line leaving Requires=spire-agent.service assetmanagerd.service.
| Environment=UNKEY_BUILDERD_SCRATCH_DIR=/opt/builderd/scratch | ||
| Environment=UNKEY_BUILDERD_ROOTFS_OUTPUT_DIR=/opt/builderd/rootfs | ||
| Environment=UNKEY_BUILDERD_WORKSPACE_DIR=/opt/builderd/workspace | ||
| Environment=UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR=false |
🧹 Nitpick (assertive)
Feature flag default set to false — confirm rollout plan and configurability.
If staged rollout is intended, consider sourcing an EnvironmentFile to flip this per-host without editing the unit.
Environment=UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR=false
+EnvironmentFile=-/etc/unkey/builderd.env
+# AIDEV-NOTE: Toggle UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR=true via /etc/unkey/builderd.env during staged rollout
📝 Committable suggestion
| Environment=UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR=false | |
| # ─── go/deploy/builderd/contrib/systemd/builderd.service ─── | |
| Environment=UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR=false | |
| EnvironmentFile=-/etc/unkey/builderd.env | |
| # AIDEV-NOTE: Toggle UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR=true via /etc/unkey/builderd.env during staged rollout |
🤖 Prompt for AI Agents
In go/deploy/builderd/contrib/systemd/builderd.service around line 38, the
feature flag is hard-coded to false which prevents per-host toggling and staged
rollout; update the unit to source an EnvironmentFile (e.g. add
EnvironmentFile=/etc/default/builderd and move
Environment=UNKEY_BUILDERD_USE_PIPELINE_EXECUTOR to that file) so operators can
flip the flag per host without editing the unit, keep a safe default in the
EnvironmentFile or override in /etc/systemd/system/builderd.service.d/ for
controlled rollout, and add a short note in deploy docs describing the rollout
plan and where to configure the flag.
| func (s *PullImageStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) { | ||
| imageName := input.Config.GetSource().GetDockerImage().GetImageUri() | ||
| input.Logger.InfoContext(ctx, "pulling Docker image", slog.String("image", imageName)) | ||
| err := s.executor.pullDockerImage(ctx, input.Logger, imageName) | ||
| if err != nil { | ||
| return StepOutput{Success: false, Error: err}, err | ||
| } | ||
| return StepOutput{ | ||
| ImageName: imageName, | ||
| Success: true, | ||
| }, nil | ||
| } |
Validate image source before use.
Guard against nil Config/Source/DockerImage and empty URI to avoid panics.
Apply:
func (s *PullImageStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
- imageName := input.Config.GetSource().GetDockerImage().GetImageUri()
+ if input.Config == nil || input.Config.GetSource() == nil || input.Config.GetSource().GetDockerImage() == nil {
+ err := fmt.Errorf("docker image source is required")
+ return StepOutput{Success: false, Error: err}, err
+ }
+ imageName := input.Config.GetSource().GetDockerImage().GetImageUri()
+ if imageName == "" {
+ err := fmt.Errorf("docker image URI is empty")
+ return StepOutput{Success: false, Error: err}, err
+ }
| func (s *CreateContainerStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) { | ||
| input.Logger.InfoContext(ctx, "creating container", slog.String("image", input.ImageName)) | ||
| containerID, err := s.executor.createContainer(ctx, input.Logger, input.ImageName) | ||
| if err != nil { | ||
| return StepOutput{Success: false, Error: err}, err | ||
| } | ||
| return StepOutput{ | ||
| ImageName: input.ImageName, | ||
| ContainerID: containerID, | ||
| Success: true, | ||
| }, nil | ||
| } |
🛠️ Refactor suggestion
Require image name before creating container.
If a previous step failed to set ImageName, this will fail later. Validate early.
Apply:
func (s *CreateContainerStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
+ if input.ImageName == "" {
+ err := fmt.Errorf("image name is required to create container")
+ return StepOutput{Success: false, Error: err}, err
+ }
input.Logger.InfoContext(ctx, "creating container", slog.String("image", input.ImageName))📝 Committable suggestion
func (s *CreateContainerStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
	if input.ImageName == "" {
		err := fmt.Errorf("image name is required to create container")
		return StepOutput{Success: false, Error: err}, err
	}
	input.Logger.InfoContext(ctx, "creating container", slog.String("image", input.ImageName))

	containerID, err := s.executor.createContainer(ctx, input.Logger, input.ImageName)
	if err != nil {
		return StepOutput{Success: false, Error: err}, err
	}

	return StepOutput{
		ImageName:   input.ImageName,
		ContainerID: containerID,
		Success:     true,
	}, nil
}
🤖 Prompt for AI Agents
In go/deploy/builderd/internal/executor/docker_steps.go around lines 42 to 55,
the Execute function does not validate that input.ImageName is set before
attempting to create a container; add an early check for a non-empty ImageName,
and if empty log an error with input.Logger (using ErrorContext) and return a
StepOutput with Success:false and a descriptive error (e.g., fmt.Errorf("missing
image name")) without calling s.executor.createContainer to avoid downstream
failures. Ensure the function returns the error and does not proceed when
ImageName is empty.
func (s *ExtractFilesystemStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
	input.Logger.InfoContext(ctx, "extracting container filesystem",
		slog.String("container_id", input.ContainerID),
		slog.String("rootfs_dir", input.RootfsDir))

	err := s.executor.extractFilesystem(ctx, input.Logger, input.ContainerID, input.RootfsDir, input.Metadata)
	if err != nil {
		return StepOutput{Success: false, Error: err}, err
	}
Validate container ID and metadata before filesystem extraction.
Prevent ambiguous errors from the executor.
Apply:
func (s *ExtractFilesystemStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
+ if input.ContainerID == "" {
+ err := fmt.Errorf("container ID is required to extract filesystem")
+ return StepOutput{Success: false, Error: err}, err
+ }
+ if input.Metadata == nil {
+ err := fmt.Errorf("image metadata is required to extract filesystem")
+ return StepOutput{Success: false, Error: err}, err
+	}

📝 Committable suggestion
func (s *ExtractFilesystemStep) Execute(ctx context.Context, input StepInput) (StepOutput, error) {
	if input.ContainerID == "" {
		err := fmt.Errorf("container ID is required to extract filesystem")
		return StepOutput{Success: false, Error: err}, err
	}
	if input.Metadata == nil {
		err := fmt.Errorf("image metadata is required to extract filesystem")
		return StepOutput{Success: false, Error: err}, err
	}
	input.Logger.InfoContext(ctx, "extracting container filesystem",
		slog.String("container_id", input.ContainerID),
		slog.String("rootfs_dir", input.RootfsDir))

	err := s.executor.extractFilesystem(ctx, input.Logger, input.ContainerID, input.RootfsDir, input.Metadata)
	if err != nil {
		return StepOutput{Success: false, Error: err}, err
	}
	// ... remainder of method ...
}
🤖 Prompt for AI Agents
In go/deploy/builderd/internal/executor/docker_steps.go around lines 91 to 99,
the Execute method calls s.executor.extractFilesystem without validating
input.ContainerID and input.Metadata, which can cause ambiguous errors; add
precondition checks that return a clear StepOutput and error if ContainerID is
empty (or only whitespace) and if Metadata is nil or missing required fields,
logging a descriptive message via input.Logger before returning so the executor
is never invoked with invalid inputs.
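The same fail-fast pattern is easy to unit-test in isolation. A minimal sketch, with `StepInput` reduced to a stand-in struct rather than builderd's real type:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// StepInput is a stand-in mirroring only the fields the review asks to
// validate; the real type lives in builderd's executor package.
type StepInput struct {
	ContainerID string
	Metadata    map[string]string
}

// validateExtractInputs applies the suggested guard clauses: fail fast with
// a descriptive error instead of letting the executor produce an ambiguous
// one downstream.
func validateExtractInputs(in StepInput) error {
	if strings.TrimSpace(in.ContainerID) == "" {
		return errors.New("container ID is required to extract filesystem")
	}
	if in.Metadata == nil {
		return errors.New("image metadata is required to extract filesystem")
	}
	return nil
}

func main() {
	fmt.Println(validateExtractInputs(StepInput{}))
	fmt.Println(validateExtractInputs(StepInput{ContainerID: "abc123", Metadata: map[string]string{}}))
}
```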
const returnIPJSON = `-- name: ReturnIPJSON :exec
UPDATE network_allocations
SET available_ips = json_insert(available_ips, '$[#]', ?)
WHERE deployment_id = ?
`

type ReturnIPJSONParams struct {
	JsonInsert   interface{} `db:"json_insert" json:"json_insert"`
	DeploymentID string      `db:"deployment_id" json:"deployment_id"`
}

func (q *Queries) ReturnIPJSON(ctx context.Context, arg ReturnIPJSONParams) error {
	_, err := q.db.ExecContext(ctx, returnIPJSON, arg.JsonInsert, arg.DeploymentID)
	return err
}
🛠️ Refactor suggestion
Tighten JSON insert typing
Use a concrete string and CAST to TEXT to avoid type ambiguity across drivers.
Apply this diff (reflect in queries.sql and regenerate):
-SET available_ips = json_insert(available_ips, '$[#]', ?)
+SET available_ips = json_insert(available_ips, '$[#]', CAST(? AS TEXT))
@@
-type ReturnIPJSONParams struct {
- JsonInsert interface{} `db:"json_insert" json:"json_insert"`
+type ReturnIPJSONParams struct {
+ JsonInsert string `db:"json_insert" json:"json_insert"`
DeploymentID string `db:"deployment_id" json:"deployment_id"`
 }

🤖 Prompt for AI Agents
In go/deploy/metald/internal/database/queries.sql.go around lines 231 to 245,
the JSON insert parameter is currently typed as interface{} and passed directly
to json_insert causing driver type ambiguity; change the SQL to CAST the
placeholder to TEXT (e.g. json_insert(available_ips, '$[#]', CAST(? AS TEXT)))
and update the ReturnIPJSONParams.JsonInsert field to string (and ensure the
queries.sql entry is updated accordingly), then regenerate the queries to
reflect the concrete string typing.
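For reference, the corresponding `queries.sql` entry would then look roughly like this before regenerating with sqlc (table and column names taken from the generated code above; the exact file layout is assumed):

```sql
-- name: ReturnIPJSON :exec
UPDATE network_allocations
SET available_ips = json_insert(available_ips, '$[#]', CAST(? AS TEXT))
WHERE deployment_id = ?;
```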
if link, err := netlink.LinkByName(m.config.BridgeName); err == nil {
	m.logger.Debug("bridge exists",
		slog.String("bridge", m.config.BridgeName),
		slog.String("type", link.Type()),
		slog.String("state", link.Attrs().OperState.String()),
	)
	return nil
}
🛠️ Refactor suggestion
Do not early-return on existing bridge; ensure address and admin-up state.
Current early return skips reconciling IP and link state. Make ensureBridge idempotent: when the bridge exists, still ensure it’s up and has the configured address.
Example approach (conceptual; see addr/Replace comment below):
- Keep a reference to br whether created or found.
- Always set link up.
- Replace or ensure the address on br.
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/bridge.go around lines 12-19, the function
currently early-returns if the bridge exists which skips reconciling its IP
address and admin-up state; change it to keep a reference to the found link (br)
instead of returning, then always ensure the link is set UP (netlink.LinkSetUp)
and ensure/replace the configured IP address on that link (use addr.Replace or
the appropriate netlink addr add/replace flow) so the method is idempotent; log
actions and propagate errors from the LinkSetUp and address reconciliation steps
rather than returning immediately when the bridge exists.
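A rough shape for the reconciling version, with the netlink calls hidden behind a hypothetical interface so the idempotency property can be checked without root or a real network namespace (the interface, fake, and names are illustrative, not from metald):

```go
package main

import "fmt"

// linkOps abstracts the handful of netlink calls ensureBridge needs.
type linkOps interface {
	LinkExists(name string) bool
	LinkAdd(name string) error
	LinkSetUp(name string) error
	AddrReplace(name, cidr string) error
}

// ensureBridge sketches the review's suggestion: create the bridge only if
// missing, then always set it admin-up and converge the address.
func ensureBridge(ops linkOps, name, cidr string) error {
	if !ops.LinkExists(name) {
		if err := ops.LinkAdd(name); err != nil {
			return fmt.Errorf("failed to create bridge: %w", err)
		}
	}
	if err := ops.LinkSetUp(name); err != nil {
		return fmt.Errorf("failed to set bridge up: %w", err)
	}
	if err := ops.AddrReplace(name, cidr); err != nil {
		return fmt.Errorf("failed to set bridge address: %w", err)
	}
	return nil
}

// fakeOps records calls so the idempotency can be verified.
type fakeOps struct {
	exists            bool
	adds, ups, addrs  int
}

func (f *fakeOps) LinkExists(string) bool          { return f.exists }
func (f *fakeOps) LinkAdd(string) error            { f.adds++; f.exists = true; return nil }
func (f *fakeOps) LinkSetUp(string) error          { f.ups++; return nil }
func (f *fakeOps) AddrReplace(string, string) error { f.addrs++; return nil }

func main() {
	ops := &fakeOps{}
	// Run twice: the second pass must not re-create the bridge, but must
	// still reconcile link state and address.
	for i := 0; i < 2; i++ {
		if err := ensureBridge(ops, "br-metald", "172.16.0.1/12"); err != nil {
			panic(err)
		}
	}
	fmt.Println(ops.adds, ops.ups, ops.addrs) // prints: 1 2 2
}
```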
if err := netlink.LinkAdd(bridge); err != nil {
	m.logger.Error("failed to create bridge",
		slog.String("bridge", m.config.BridgeName),
		slog.String("error", err.Error()),
	)
	return fmt.Errorf("failed to create bridge: %w", err)
}
🛠️ Refactor suggestion
Handle EEXIST race on LinkAdd.
If another process creates the bridge between the check and add, LinkAdd may return EEXIST. Treat as success.
- if err := netlink.LinkAdd(bridge); err != nil {
+ if err := netlink.LinkAdd(bridge); err != nil {
+ // treat already-exists as success for idempotency
+ if !os.IsExist(err) {
m.logger.Error("failed to create bridge",
slog.String("bridge", m.config.BridgeName),
slog.String("error", err.Error()),
)
return fmt.Errorf("failed to create bridge: %w", err)
- }
+ }
+ }
+ }

Add import:
import (
"fmt"
"log/slog"
+ "os"

📝 Committable suggestion
// In the import block, add "os":
import (
	"fmt"
	"log/slog"
	"os"
)

// ... later, around the call to netlink.LinkAdd:
if err := netlink.LinkAdd(bridge); err != nil {
	// treat already-exists as success for idempotency
	if !os.IsExist(err) {
		m.logger.Error("failed to create bridge",
			slog.String("bridge", m.config.BridgeName),
			slog.String("error", err.Error()),
		)
		return fmt.Errorf("failed to create bridge: %w", err)
	}
}
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/bridge.go around lines 27 to 33, the call
to netlink.LinkAdd returns an error if the bridge was created by another process
(EEXIST) and should be treated as success; update the error handling to check
errors.Is(err, syscall.EEXIST) (or os.IsExist-equivalent) and ignore that case
(do not return an error), while keeping logging for other errors, and add the
syscall import to the file.
addr, err := netlink.ParseAddr(m.config.BaseNetwork.String())
if err != nil {
	return fmt.Errorf("failed to parse bridge IP: %w", err)
}

if err := netlink.AddrAdd(br, addr); err != nil {
	m.logger.Error("failed to add IP to bridge",
🛠️ Refactor suggestion
Wrong address assigned: using the network address as the bridge IP.
Parsing BaseNetwork (e.g., 172.16.0.0/12) yields the network address, not a host IP. Assign a host address (e.g., first usable: 172.16.0.1/12) or take a dedicated BridgeCIDR from config/env.
Minimal fix options:
- Prefer a new config field (BridgeCIDR string) and parse that here.
- Or derive first host from BaseNetwork before calling ParseAddr.
I can provide a patch once the desired source of truth (env vs code) is confirmed.
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/bridge.go around lines 40 to 46, the code
parses m.config.BaseNetwork which yields the network address (e.g.,
172.16.0.0/12) and assigns that as the bridge IP; instead, either parse a
dedicated BridgeCIDR config field or derive and use the first usable host
address from BaseNetwork (e.g., .1) before calling netlink.AddrAdd. Update the
config to include BridgeCIDR (preferred) and parse that here, or implement logic
to compute the first host IP from the parsed network prefix and construct an
address with the same mask, then pass that address to netlink.AddrAdd; ensure
validation and clear error messages on parse/failure.
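A sketch of the derive-first-host option (the function name is illustrative, not from metald; it assumes an IPv4 prefix shorter than /31 so that network address + 1 is a valid host):

```go
package main

import (
	"fmt"
	"net"
)

// firstHostAddr returns the first usable host address in an IPv4 CIDR
// (network address + 1), keeping the original prefix length,
// e.g. 172.16.0.0/12 -> 172.16.0.1/12.
func firstHostAddr(cidr string) (string, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", fmt.Errorf("failed to parse CIDR %q: %w", cidr, err)
	}
	ip := ipnet.IP.To4()
	if ip == nil {
		return "", fmt.Errorf("only IPv4 networks are supported: %s", cidr)
	}
	host := make(net.IP, len(ip))
	copy(host, ip)
	host[3]++ // network address + 1; assumes prefix shorter than /31
	ones, _ := ipnet.Mask.Size()
	return fmt.Sprintf("%s/%d", host, ones), nil
}

func main() {
	addr, err := firstHostAddr("172.16.0.0/12")
	if err != nil {
		panic(err)
	}
	fmt.Println(addr) // 172.16.0.1/12
}
```

The resulting string can be fed to `netlink.ParseAddr` in place of `m.config.BaseNetwork.String()`, so the bridge gets a host address rather than the network address.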
if err := netlink.AddrAdd(br, addr); err != nil {
	m.logger.Error("failed to add IP to bridge",
		slog.String("bridge", m.config.BridgeName),
		slog.String("error", err.Error()),
	)
	return fmt.Errorf("failed to add IP to bridge: %w", err)
}
🛠️ Refactor suggestion
Prefer AddrReplace and ignore EEXIST for idempotency.
AddrAdd fails if address already present. Use AddrReplace so re-runs converge.
- if err := netlink.AddrAdd(br, addr); err != nil {
+ if err := netlink.AddrReplace(br, addr); err != nil {
m.logger.Error("failed to add IP to bridge",
slog.String("bridge", m.config.BridgeName),
slog.String("error", err.Error()),
)
return fmt.Errorf("failed to add IP to bridge: %w", err)
 }

📝 Committable suggestion
if err := netlink.AddrReplace(br, addr); err != nil {
	m.logger.Error("failed to add IP to bridge",
		slog.String("bridge", m.config.BridgeName),
		slog.String("error", err.Error()),
	)
	return fmt.Errorf("failed to add IP to bridge: %w", err)
}
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/bridge.go around lines 45-51, replace
netlink.AddrAdd with netlink.AddrReplace to make adding the IP idempotent; call
netlink.AddrReplace(br, addr) and update the error handling to treat EEXIST as
non-fatal (i.e., if errors.Is(err, syscall.EEXIST) return nil), otherwise log
and return the wrapped error as before; ensure syscall is imported and the log
still includes bridge name and error details.

What does this PR do?
Fixes # (issue)
If there is not an issue for this, please create one first. This is used for tracking purposes and also helps us understand why this PR exists.
Type of change
How should this be tested?
Checklist
Required
- `pnpm build`
- `pnpm fmt`
- No stray `console.log`s
- `git pull origin main`

Appreciated
Summary by CodeRabbit