Conversation
|
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
📝 WalkthroughWalkthroughRemoved tenant-auth interceptor options; added bridge_name to DB schema and models; refactored bridge creation to a standalone CreateBridge function; replaced stateful ID generator with stateless GenerateID; added GenerateAvailableIPs; large parts of network package commented out; CreateDeployment and VM flows updated with tracing and network/bridge allocation changes. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant API as metald.API
participant DB as Database
participant NET as network package
participant BR as Bridge subsystem
participant BACK as Backend (VM)
Note over API: CreateDeployment request
Client ->> API: CreateDeployment(ctx, req)
API ->> API: start span "metald.vm.create"
API ->> DB: CreateNetworkAllocation(deploymentID, networkID, availableIPs, bridgeName)
API ->> NET: GenerateAvailableIPs(baseCIDR)
NET -->> API: availableIPs JSON
API ->> BR: CreateBridge(logger, cfg{BridgeName, BaseNetwork,...})
BR -->> API: bridgeName
loop per VM
API ->> DB: allocate/pop IP for VM
DB -->> API: assigned IP
API ->> BACK: create VM (placeholder / RPC)
BACK -->> API: VM created (or placeholder)
end
API -->> Client: CreateDeploymentResponse (vm ids)
Note over API,DB: span ended
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks (1 passed, 1 warning, 1 inconclusive)❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Pre-merge checks (2 passed, 1 warning)❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).Please share your feedback with us on this Discord post. ✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Thank you for following the naming conventions for pull request titles! 🙏 |
There was a problem hiding this comment.
Actionable comments posted: 21
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (24)
go/deploy/spire/Makefile (2)
21-25: Fix production trust domain typo.
production.unkey.cliudshould beproduction.unkey.cloud. Current value will produce incorrect SPIFFE IDs.-TRUST_DOMAIN = production.unkey.cliud +TRUST_DOMAIN = production.unkey.cloud
127-129: Add missingcreate-userstarget and run SPIRE services as non-root
- Define an idempotent
create-userstarget before theuninstalltarget ingo/deploy/spire/Makefilesinceinstall-serverdepends on it.- Update
go/deploy/spire/contrib/systemd/spire-server.service(line 9) toUser=spire-serverandGroup=spire-server.- Update
go/deploy/spire/contrib/systemd/spire-agent.service(line 10) toUser=spire-agentandGroup=spire-agent.go/deploy/metald/internal/network/network.go (1)
3-118: No callers for deprecated APIs
Ripgrep across the repo found zero references to the now‐commented exports (GetSubnetInfo,CalculateIndex,ValidateIdentifier, etc.), so builds won’t break. You can safely deletego/deploy/metald/internal/network/network.goand add adoc.gooutlining the new API surface to avoid confusion.go/deploy/metald/sqlc/schema.sql (4)
13-17: Enforce Linux bridge name length (<=15) to prevent runtime netlink errors.Add a CHECK constraint now so invalid names never persist.
- bridge_name TEXT NOT NULL UNIQUE, + bridge_name TEXT NOT NULL UNIQUE CHECK(length(bridge_name) <= 15),
9-17: Clarify FK delete behavior and enforce via explicit actions.Decide whether to CASCADE or RESTRICT deletes to avoid orphans; encode it explicitly.
- FOREIGN KEY (network_id) REFERENCES networks(id) + FOREIGN KEY (network_id) REFERENCES networks(id) ON DELETE RESTRICT- FOREIGN KEY (network_allocation_id) REFERENCES network_allocations(id), + FOREIGN KEY (network_allocation_id) REFERENCES network_allocations(id) ON DELETE RESTRICT,
37-39: Redundant index: deployment_id already has a UNIQUE constraint.SQLite creates an index for UNIQUE(deployment_id); the extra index duplicates it. Remove to save writes and space.
-CREATE INDEX idx_network_allocations_deployment - ON network_allocations(deployment_id);
34-35: Likely-needed index on network_allocations(network_id).Joins/filtering by network_id will benefit from an index.
+CREATE INDEX idx_network_allocations_network + ON network_allocations(network_id);go/deploy/metald/cmd/metald/main.go (5)
318-336: Auth removed; ensure mTLS-only access or refuse to start when TLS is disabled.With tenant auth interceptor gone, running with TLS.Mode="disabled" exposes unauthenticated RPCs. Fail fast or add request-layer auth.
Minimal guard (outside this hunk, after config is loaded):
// hard-fail if no transport security AND no request auth if cfg.TLS.Mode == "disabled" { logger.Error("TLS disabled and no auth interceptors configured; refusing to start") os.Exit(1) }If you rely on mTLS identity, add a lightweight Connect interceptor to enforce client SVIDs:
type mtlsAuth struct{} func (mtlsAuth) WrapUnary(next connect.UnaryFunc) connect.UnaryFunc { return func(ctx context.Context, req connect.AnyRequest) (connect.AnyResponse, error) { info := connect.RequestMetadataFromContext(ctx) if info == nil || len(info.PeerCertificates) == 0 { return nil, connect.NewError(connect.CodeUnauthenticated, errors.New("client cert required")) } // TODO: verify SPIFFE IDs / allowed subjects here return next(ctx, req) } }
348-355: Duplicate /metrics endpoints (main mux and dedicated Prom server). Pick one.Serving metrics twice can double scrape cost and confuse operators.
- if cfg.OpenTelemetry.Enabled && cfg.OpenTelemetry.PrometheusEnabled { - mux.Handle("/metrics", otelProviders.PrometheusHTTP) - logger.Info("Prometheus metrics endpoint enabled", - slog.String("path", "/metrics"), - ) - } + // metrics are exposed via the dedicated Prometheus server belowAlso applies to: 408-436
55-58: Comment/code mismatch: says “First 8 chars” but returns 7.Align the comment with behavior.
- return "dev-" + setting.Value[:7] // First 8 chars of commit hash + return "dev-" + setting.Value[:7] // First 7 chars of commit hash
359-371: H2C allowed when TLS is off. Combined with no auth, this is risky.If keeping H2C for local dev, restrict bind to loopback when TLS disabled.
Example (outside hunk):
if cfg.TLS.Mode == "disabled" && (cfg.Server.Address != "127.0.0.1" && cfg.Server.Address != "localhost") { logger.Error("insecure H2C requires loopback bind; set UNKEY_METALD_ADDRESS=127.0.0.1") os.Exit(1) }Also applies to: 373-380
481-491: Docs: default OTEL service name in help text doesn’t match runtime (“metald”).Update help to avoid confusion.
- UNKEY_METALD_OTEL_SERVICE_NAME Service name (default: vmm-controlplane) + UNKEY_METALD_OTEL_SERVICE_NAME Service name (default: metald)go/deploy/metald/internal/database/models.go (1)
25-32: Explicitly order columns in CreateNetworkAllocation and GetNetworkAllocation queries
RETURNING * andSELECT na.*emitavailable_ipsbeforebridge_name(table order), but the Go model declaresBridgeNamebeforeAvailableIps, swapping those fields at scan time. Change both SQL statements to explicitly list columns in struct order—e.g.:-- CreateNetworkAllocation INSERT INTO network_allocations ( id, deployment_id, network_id, bridge_name, available_ips, allocated_at ) VALUES (DEFAULT, ?, ?, ?, ?, DEFAULT) RETURNING id, deployment_id, network_id, bridge_name, available_ips, allocated_at; -- GetNetworkAllocation SELECT na.id, na.deployment_id, na.network_id, na.bridge_name, na.available_ips, na.allocated_at, n.base_network FROM network_allocations na JOIN networks n ON na.network_id = n.id WHERE na.deployment_id = ?;go/deploy/metald/internal/service/vm.go (7)
93-104: CreateVm allocates a network but never creates a network_allocation or bridge.This leaves the DB in a partial state and leaks the network. Either:
- Move single-VM provisioning under the new CreateDeployment flow, or
- Create a network_allocation row (with available_ips and bridge_name) and clean up on failure.
I can draft a transactional flow if you confirm CreateVm should remain supported.
105-111: Compile-time break: AllocateIP called without required fields.database.AllocateIPParams requires IpAddr and NetworkAllocationID. Also, IP must be popped first via PopAvailableIPJSON.
Proposed fix (assuming CreateVmRequest has deployment_id; adjust if different):
- ip, ipErr := s.queries.AllocateIP(ctx, database.AllocateIPParams{ - VmID: req.Msg.GetVmId(), - }) - if ipErr != nil { - return nil, connect.NewError(connect.CodeInternal, fmt.Errorf("failed to allocate IP for vm: %w", ipErr)) - } + // Pop an IP and allocate it to this VM + ipRow, popErr := s.queries.PopAvailableIPJSON(ctx, req.Msg.GetDeploymentId()) + if popErr != nil { + return nil, connect.NewError(connect.CodeInternal, fmt.Errorf("failed to pop available IP: %w", popErr)) + } + ipAlloc, ipErr := s.queries.AllocateIP(ctx, database.AllocateIPParams{ + VmID: req.Msg.GetVmId(), // or use backend vmID below + IpAddr: ipRow.AllocatedIp, // ensure alias matches query + NetworkAllocationID: ipRow.ID, + }) + if ipErr != nil { + return nil, connect.NewError(connect.CodeInternal, fmt.Errorf("failed to allocate IP for vm: %w", ipErr)) + }Follow-up: Use the backend’s returned vmID for IP allocation to avoid mismatch if req.Msg.VmId is empty.
112-151: ID/IP flow mismatch and potential leakage.
- You allocate IP before knowing the actual vmID returned by backend.CreateVM, risking orphaned IPs if creation fails.
- Consider: pop IP → attempt VM creation with vmID generated by service → on backend failure, return IP to pool; on success, persist ip_allocations with the final vmID.
I can provide a transaction-aware sequence and compensating actions.
76-82: Full config logging may leak secrets.Dumping the entire VM config at debug can expose credentials/paths. Redact sensitive fields or gate behind an explicit debug flag.
154-212: DeleteVm should release IPs and possibly free network resources.Currently only calls backend.DeleteVM and stops metrics. Add DB cleanup:
- ReleaseIP(vm_id) and, if last VM in deployment, consider ReleaseNetwork + DeleteNetworkAllocation.
I can add the DB teardown sequence if you confirm the lifecycle.
484-541: Validation is commented out; minimal checks recommended.Re-enable at least CPU/memory/boot sanity checks to fail fast before backend.
I can reintroduce a trimmed validator aligned with current proto fields.
105-111: Remove incorrect GetDeploymentId usage CreateVmRequest has no deployment_id field—only vm_id; replace any GetDeploymentId calls with req.Msg.GetVmId().go/deploy/metald/sqlc/queries.sql (2)
25-31: PopAvailableIPJSON returns the wrong IP.After the UPDATE, json_extract reads the new 0th element—not the popped one—causing allocation of a still-available IP. Use a CTE to capture the original element.
Apply:
--- a/go/deploy/metald/sqlc/queries.sql +++ b/go/deploy/metald/sqlc/queries.sql @@ --- name: PopAvailableIPJSON :one -UPDATE network_allocations -SET available_ips = json_remove(available_ips, '$[0]') -WHERE deployment_id = ? -AND json_array_length(available_ips) > 0 -RETURNING json_extract(available_ips, '$[0]') as allocated_ip, id; +WITH orig AS ( + SELECT id, available_ips + FROM network_allocations + WHERE deployment_id = ? + AND json_array_length(available_ips) > 0 + LIMIT 1 +), +upd AS ( + UPDATE network_allocations + SET available_ips = json_remove(available_ips, '$[0]') + WHERE id = (SELECT id FROM orig) + RETURNING id +) +SELECT json_extract(orig.available_ips, '$[0]') AS allocated_ip, orig.id +FROM orig;
49-72: Add supporting indexes for hot paths.Queries frequently filter by deployment_id and network_allocation_id. Add indexes:
- network_allocations(deployment_id)
- ip_allocations(vm_id)
- ip_allocations(network_allocation_id)
I can add a migration snippet if desired.
go/deploy/metald/internal/service/deployment.go (1)
140-143: Great place to include metrics.Record deployment success/failure and durations similar to VM paths.
go/deploy/metald/internal/database/queries.sql.go (1)
200-206: Bug: PopAvailableIPJSON returns the wrong IP (post-update value).
RETURNING json_extract(available_ips, '$[0]')runs after the update, so it returns the new first IP, not the popped one. Make it atomic and return the pre-update value using a CTE.Apply this diff to fix the query logic:
-UPDATE network_allocations -SET available_ips = json_remove(available_ips, '$[0]') -WHERE deployment_id = ? -AND json_array_length(available_ips) > 0 -RETURNING json_extract(available_ips, '$[0]') as allocated_ip, id +WITH popped AS ( + SELECT + deployment_id, + json_extract(available_ips, '$[0]') AS allocated_ip + FROM network_allocations + WHERE deployment_id = ? + AND json_array_length(available_ips) > 0 +) +UPDATE network_allocations +SET available_ips = json_remove(available_ips, '$[0]') +WHERE deployment_id = (SELECT deployment_id FROM popped) +RETURNING (SELECT allocated_ip FROM popped) AS allocated_ip, id
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (15)
go/deploy/metald/cmd/metald/main.go(1 hunks)go/deploy/metald/internal/backend/firecracker/asset_manager.go(0 hunks)go/deploy/metald/internal/database/models.go(1 hunks)go/deploy/metald/internal/database/queries.sql.go(4 hunks)go/deploy/metald/internal/network/bridge.go(1 hunks)go/deploy/metald/internal/network/idgen.go(1 hunks)go/deploy/metald/internal/network/iputil.go(1 hunks)go/deploy/metald/internal/network/network.go(1 hunks)go/deploy/metald/internal/service/auth.go(0 hunks)go/deploy/metald/internal/service/deployment.go(1 hunks)go/deploy/metald/internal/service/vm.go(3 hunks)go/deploy/metald/sqlc/queries.sql(1 hunks)go/deploy/metald/sqlc/schema.sql(1 hunks)go/deploy/pkg/observability/interceptors/options.go(0 hunks)go/deploy/spire/Makefile(1 hunks)
💤 Files with no reviewable changes (3)
- go/deploy/metald/internal/backend/firecracker/asset_manager.go
- go/deploy/pkg/observability/interceptors/options.go
- go/deploy/metald/internal/service/auth.go
🧰 Additional context used
🧬 Code graph analysis (2)
go/deploy/metald/internal/database/queries.sql.go (1)
go/deploy/metald/internal/database/models.go (1)
NetworkAllocation(25-32)
go/deploy/metald/internal/service/deployment.go (5)
go/deploy/metald/internal/service/vm.go (1)
VMService(26-34)go/deploy/metald/internal/network/idgen.go (1)
GenerateID(13-19)go/deploy/metald/internal/network/bridge.go (1)
CreateBridge(11-68)go/deploy/metald/internal/network/iputil.go (1)
GenerateAvailableIPs(11-48)go/deploy/metald/internal/database/queries.sql.go (2)
CreateNetworkAllocationParams(65-70)AllocateIPParams(19-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test API / API Test Local
- GitHub Check: Test Go API Local / Test
- GitHub Check: Build / Build
🔇 Additional comments (11)
go/deploy/metald/sqlc/schema.sql (1)
13-13: Ignore migration warning for bridge_name constraint
Thisschema.sqlrepresents the initial (greenfield) DDL—there’s no existing data or migration pipeline ingo/deploy/metald—so addingNOT NULL UNIQUEhere does not pose a migration risk.Likely an incorrect or invalid review comment.
go/deploy/metald/internal/database/models.go (1)
25-32: BridgeName mapping looks correct and aligns with schema.DB/json tags are consistent with schema and API expectations.
go/deploy/metald/internal/service/vm.go (5)
145-151: Returned endpoint uses ip from earlier broken AllocateIP call.After fixing IP allocation, ensure Host uses the allocated IP (ipAlloc.IpAddr). Also, avoid hardcoding the port if backend dictates it.
Would you like me to wire the backend port or make it configurable?
36-47: Service label change to "metald" is fine.Unifies logging tags across services.
65-73: Good: metrics hooks on VM create path.Create/failure metrics are recorded consistently.
176-184: Good: stop metrics collection on delete/shutdown before backend calls.Avoids dangling collectors.
214-482: Boot/Shutdown/Pause/Resume/GetVmInfo flows look consistent.Error handling and tracing are reasonable.
go/deploy/metald/sqlc/queries.sql (1)
14-17: No action needed: parameter order in CreateNetworkAllocationParams matches the SQL definition.go/deploy/metald/internal/service/deployment.go (1)
70-85: Good: precomputing available IPs and persisting bridge_name.This aligns with the new schema and flow.
go/deploy/metald/internal/database/queries.sql.go (2)
60-63: CreateNetworkAllocation insert/return scan order looks correct.Columns and scan order for
bridge_namealign with the struct; placeholders prevent injection. No issues.Also applies to: 73-78, 84-84
142-142: GetNetworkAllocation bridge_name propagation is consistent.Selection, struct field addition, and scan order are coherent. Good.
Also applies to: 152-152, 165-165
| INSERT INTO network_allocations (deployment_id, network_id, available_ips, bridge_name) | ||
| VALUES (?, ?, ?, ?) | ||
| RETURNING id, deployment_id, network_id, bridge_name, available_ips, allocated_at | ||
| ` |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add a DB-level safeguard: enforce Linux IFNAMSIZ (15 chars) on bridge_name.
Linux interface names are limited to 15 characters. The schema appears to define bridge_name TEXT NOT NULL UNIQUE, but without a length constraint you can persist invalid names that will later fail in netlink. Add a CHECK constraint in schema (and validate in application) to cap length at 15.
Example (schema.sql):
ALTER TABLE network_allocations
ADD CONSTRAINT chk_bridge_name_len CHECK (length(bridge_name) <= 15);Also applies to: 69-69, 73-78, 84-84
🤖 Prompt for AI Agents
In go/deploy/metald/internal/database/queries.sql.go around lines 60-63 (and
also apply to lines 69, 73-78, 84), add a DB-level CHECK constraint enforcing
Linux IFNAMSIZ by ensuring bridge_name length <= 15 and update the related
INSERT/UPDATE DDL or ALTER TABLE statements to include that constraint; also add
complementary application-side validation to reject or truncate bridge names
longer than 15 characters before DB write so netlink calls never receive invalid
names.
| // CreateBridge creates the bridge if it doesn't exist | ||
| func CreateBridge(logger *slog.Logger, cfg Config) (string, error) { | ||
| link, linkErr := netlink.LinkByName(cfg.BridgeName) | ||
| if linkErr == nil { | ||
| logger.Debug("bridge exists", | ||
| slog.String("bridge", cfg.BridgeName), | ||
| slog.Int("mtu", link.Attrs().MTU), | ||
| ) | ||
| return nil | ||
| return link.Attrs().Name, nil | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Make CreateBridge idempotent: ensure address and link-up even when bridge exists.
Early-return on existing bridge skips address reconciliation and LinkSetUp. Convert to “ensure” semantics: reuse the existing link, add/replace the IP, and bring it up.
Apply this diff:
func CreateBridge(logger *slog.Logger, cfg Config) (string, error) {
- link, linkErr := netlink.LinkByName(cfg.BridgeName)
- if linkErr == nil {
- logger.Debug("bridge exists",
- slog.String("bridge", cfg.BridgeName),
- slog.Int("mtu", link.Attrs().MTU),
- )
- return link.Attrs().Name, nil
- }
+ link, linkErr := netlink.LinkByName(cfg.BridgeName)
+ var br netlink.Link
- bridge := &netlink.Bridge{ //nolint:exhaustruct
- LinkAttrs: netlink.LinkAttrs{
- Name: cfg.BridgeName, //nolint:exhaustruct
- },
- }
-
- if err := netlink.LinkAdd(bridge); err != nil {
- logger.Error("failed to create bridge",
- slog.String("bridge", cfg.BridgeName),
- slog.String("error", err.Error()),
- )
- return "", fmt.Errorf("failed to create bridge: %w", err)
- }
-
- br, brErr := netlink.LinkByName(cfg.BridgeName)
- if brErr != nil {
- return "", fmt.Errorf("failed to get bridge: %w", brErr)
- }
+ if linkErr == nil {
+ logger.Debug("bridge exists",
+ slog.String("bridge", cfg.BridgeName),
+ slog.Int("mtu", link.Attrs().MTU),
+ )
+ br = link
+ } else {
+ bridge := &netlink.Bridge{ //nolint:exhaustruct
+ LinkAttrs: netlink.LinkAttrs{Name: cfg.BridgeName}, //nolint:exhaustruct
+ }
+ if err := netlink.LinkAdd(bridge); err != nil {
+ logger.Error("failed to create bridge",
+ slog.String("bridge", cfg.BridgeName),
+ slog.String("error", err.Error()),
+ )
+ return "", fmt.Errorf("failed to create bridge: %w", err)
+ }
+ var brErr error
+ br, brErr = netlink.LinkByName(cfg.BridgeName)
+ if brErr != nil {
+ return "", fmt.Errorf("failed to get bridge: %w", brErr)
+ }
+ }
@@
- if err := netlink.AddrAdd(br, addr); err != nil {
+ if err := netlink.AddrReplace(br, addr); err != nil {
logger.Error("failed to add IP to bridge",
slog.String("bridge", cfg.BridgeName),
slog.String("error", err.Error()),
)
- return "", fmt.Errorf("failed to add IP to bridge: %w", err)
+ return "", fmt.Errorf("failed to configure IP on bridge: %w", err)
}
@@
- logger.Debug("bridge created",
+ logger.Debug("bridge configured",
slog.String("bridge", cfg.BridgeName),
slog.Int("mtu", br.Attrs().MTU),
)
return cfg.BridgeName, nilNote: AddrReplace makes the operation idempotent and safe on re-runs.
Also applies to: 35-38, 53-60, 62-67
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/bridge.go around lines 10-19 (and also
adjust the similar logic at 35-38, 53-60, 62-67), the function currently
early-returns when the bridge exists which skips address reconciliation and
LinkSetUp; change it to “ensure” semantics by reusing the existing link instead
of returning: after obtaining link := netlink.LinkByName(cfg.BridgeName) when no
error, parse cfg.BridgeIP into an *net.IPNet, call netlink.AddrReplace(link,
&netlink.Addr{IPNet: parsedIPNet}) to idempotently add/replace the IP address,
then call netlink.LinkSetUp(link) to bring the interface up, handling and
returning any errors from AddrReplace/LinkSetUp instead of exiting early.
| addr, err := netlink.ParseAddr(cfg.BaseNetwork.String()) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to parse bridge IP: %w", err) | ||
| return "", fmt.Errorf("failed to parse bridge IP: %w", err) | ||
| } |
There was a problem hiding this comment.
Wrong IP assigned to bridge: using network address instead of a host address.
netlink.ParseAddr(cfg.BaseNetwork.String()) yields the network address (e.g., 192.168.0.0/24). Assign a valid host (e.g., first host .1) within the prefix.
Apply this diff (minimal fix) and add a small helper:
- addr, err := netlink.ParseAddr(cfg.BaseNetwork.String())
+ prefixSize, _ := cfg.BaseNetwork.Mask.Size()
+ gwIP, err := firstHostIPv4(cfg.BaseNetwork.IP)
+ if err != nil {
+ return "", fmt.Errorf("failed to derive bridge IP: %w", err)
+ }
+ addr, err := netlink.ParseAddr(fmt.Sprintf("%s/%d", gwIP.String(), prefixSize))
if err != nil {
return "", fmt.Errorf("failed to parse bridge IP: %w", err)
}Add this helper in the same package:
// firstHostIPv4 returns the first usable IPv4 address in the subnet (network IP + 1).
func firstHostIPv4(ip net.IP) (net.IP, error) {
ip4 := ip.To4()
if ip4 == nil {
return nil, fmt.Errorf("only IPv4 is supported for bridge IP assignment")
}
res := make(net.IP, 4)
copy(res, ip4)
for i := 3; i >= 0; i-- {
res[i]++
if res[i] != 0 {
break
}
}
return res, nil
}Also add net to imports.
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/bridge.go around lines 40 to 43, the code
parses the network CIDR as the bridge address which yields the network address
(e.g., 192.168.0.0/24) instead of a usable host; replace the netlink.ParseAddr
call with logic that parses the CIDR to get the network IP, compute the first
usable host (network IP + 1) using a new helper, and then create the
netlink.Addr with that host IP; add the provided firstHostIPv4 helper in the
same package (which validates IPv4 and increments the IP bytes), and add the net
import to the file imports.
| // IDGenerator generates short, unique IDs for network devices | ||
| // Network interface names in Linux are limited to 15 characters, | ||
| // so we generate 8-character IDs to leave room for prefixes like "tap-", "vh-", etc. | ||
| type IDGenerator struct { | ||
| mu sync.Mutex | ||
| generated map[string]struct{} // Track generated IDs to ensure uniqueness | ||
| } | ||
|
|
||
| // NewIDGenerator creates a new ID generator | ||
| func NewIDGenerator() *IDGenerator { | ||
| //exhaustruct:ignore | ||
| return &IDGenerator{ | ||
| generated: make(map[string]struct{}), | ||
| } | ||
| } | ||
|
|
||
| // GenerateNetworkID generates a unique 8-character ID for network devices | ||
| // The ID is guaranteed to be unique within this generator instance | ||
| func (g *IDGenerator) GenerateNetworkID() (string, error) { | ||
| g.mu.Lock() | ||
| defer g.mu.Unlock() | ||
|
|
||
| // Try up to 10 times to generate a unique ID | ||
| for range 10 { | ||
| // Generate 4 random bytes (8 hex characters) | ||
| bytes := make([]byte, 4) | ||
| if _, err := rand.Read(bytes); err != nil { | ||
| return "", fmt.Errorf("failed to generate random bytes: %w", err) | ||
| } | ||
|
|
||
| id := hex.EncodeToString(bytes) | ||
|
|
||
| // Check if this ID already exists | ||
| if _, exists := g.generated[id]; !exists { | ||
| g.generated[id] = struct{}{} | ||
| return id, nil | ||
| } | ||
| } | ||
|
|
||
| return "", fmt.Errorf("failed to generate unique ID after 10 attempts") | ||
| } | ||
|
|
||
| // ReleaseID removes an ID from the tracking set, allowing it to be reused | ||
| func (g *IDGenerator) ReleaseID(id string) { | ||
| g.mu.Lock() | ||
| defer g.mu.Unlock() | ||
| delete(g.generated, id) | ||
| } | ||
|
|
||
| // NetworkDeviceNames holds all network device names for a VM | ||
| // AIDEV-NOTE: All names follow consistent patterns and stay within 15-char limit | ||
| type NetworkDeviceNames struct { | ||
| ID string // 8-character internal ID | ||
| Namespace string // Network namespace name (no length limit) | ||
| TAP string // TAP device name (15 char limit) | ||
| Bridge string // Bridge name (15 char limit) | ||
| } | ||
| // Generates an for the bridge, it's not guaranteed unique until it's saved to the DB |
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Stale/incomplete comments.
The file-level comments still refer to the removed IDGenerator and have a sentence fragment. Update or remove to avoid confusion.
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/idgen.go around lines 9 to 12, the
top-of-file comments are stale and contain a sentence fragment referencing a
removed IDGenerator and an incomplete sentence about the bridge; update these
comments to accurately describe the current purpose of the file (or remove them
entirely). Replace the outdated references with a concise, complete file comment
that explains what the package/file now provides (e.g., short unique ID
generation for network resources or whatever the implementation actually does),
ensuring no sentence fragments remain and Linux interface name length notes are
only present if relevant.
| // Generates an for the bridge, it's not guaranteed unique until it's saved to the DB | ||
| func GenerateID() string { | ||
| b := make([]byte, 8) | ||
| rand.Read(b) | ||
| id := base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(b)[:12] | ||
|
|
||
| // GenerateDeviceNames creates a consistent set of network device names | ||
| func GenerateDeviceNames(networkID string) *NetworkDeviceNames { | ||
| return &NetworkDeviceNames{ | ||
| ID: networkID, | ||
| Namespace: fmt.Sprintf("ns_vm_%s", networkID), | ||
| TAP: fmt.Sprintf("tap_%s", networkID), // 12 chars | ||
| Bridge: fmt.Sprintf("br-%s", networkID), | ||
| } | ||
| return fmt.Sprintf("br-%s", id) | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Persistence-level retry on UNIQUE constraint recommended.
Since IDs are stateless now, implement a small retry loop (e.g., 3–5 attempts) around DB insert on UNIQUE(bridge_name) violation to keep call sites simple and robust.
🛠️ Refactor suggestion
Don’t ignore crypto/rand errors and normalize to lowercase for interface-name safety.
Handle rand.Reader errors and ensure the name is lowercase to avoid surprises on systems that expect lowercased netdev names.
import (
"crypto/rand"
"encoding/base32"
"fmt"
+ "io"
+ "strings"
)
@@
-// Generates an for the bridge, it's not guaranteed unique until it's saved to the DB
+// GenerateID returns a 15-char bridge name: "br-" + 12 base32 chars (~60 bits).
+// Not guaranteed unique; enforce a UNIQUE constraint in DB and retry on conflict.
func GenerateID() string {
- b := make([]byte, 8)
- rand.Read(b)
- id := base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(b)[:12]
-
- return fmt.Sprintf("br-%s", id)
+ b := make([]byte, 8)
+ if _, err := io.ReadFull(rand.Reader, b); err != nil {
+ panic(fmt.Errorf("crypto/rand failed: %w", err))
+ }
+ enc := base32.StdEncoding.WithPadding(base32.NoPadding)
+ id := enc.EncodeToString(b)
+ id = strings.ToLower(id[:12])
+ return "br-" + id
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Generates an for the bridge, it's not guaranteed unique until it's saved to the DB | |
| func GenerateID() string { | |
| b := make([]byte, 8) | |
| rand.Read(b) | |
| id := base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(b)[:12] | |
| // GenerateDeviceNames creates a consistent set of network device names | |
| func GenerateDeviceNames(networkID string) *NetworkDeviceNames { | |
| return &NetworkDeviceNames{ | |
| ID: networkID, | |
| Namespace: fmt.Sprintf("ns_vm_%s", networkID), | |
| TAP: fmt.Sprintf("tap_%s", networkID), // 12 chars | |
| Bridge: fmt.Sprintf("br-%s", networkID), | |
| } | |
| return fmt.Sprintf("br-%s", id) | |
| } | |
| import ( | |
| "crypto/rand" | |
| "encoding/base32" | |
| "fmt" | |
| "io" | |
| "strings" | |
| ) | |
| // GenerateID returns a 15-char bridge name: "br-" + 12 base32 chars (~60 bits). | |
| // Not guaranteed unique; enforce a UNIQUE constraint in DB and retry on conflict. | |
| func GenerateID() string { | |
| b := make([]byte, 8) | |
| if _, err := io.ReadFull(rand.Reader, b); err != nil { | |
| panic(fmt.Errorf("crypto/rand failed: %w", err)) | |
| } | |
| enc := base32.StdEncoding.WithPadding(base32.NoPadding) | |
| id := enc.EncodeToString(b) | |
| id = strings.ToLower(id[:12]) | |
| return "br-" + id | |
| } |
🤖 Prompt for AI Agents
In go/deploy/metald/internal/network/idgen.go around lines 12-19, the code
ignores crypto/rand errors and does not normalize the interface name to
lowercase; change GenerateID to return (string, error), call rand.Read correctly
and check its error, return a wrapped error on failure, then base32-encode the
bytes, trim to 12 chars and run strings.ToLower on the result before formatting
the final "br-%s" string; update imports (crypto/rand, strings) and adjust all
callers to handle the returned error.
| logger.Debug("network allocated", | ||
| slog.Any("network_cidr", netConfig.BaseNetwork.String()), | ||
| slog.Any("network_id", na.ID), | ||
| ) | ||
|
|
||
| logger.Debug("bridge allocated", | ||
| slog.Any("network_cidr", netConfig.BaseNetwork.String()), | ||
| slog.String("bridge_id", br), | ||
| ) | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Use concrete slog types.
Prefer slog.String for CIDR and slog.Int64 for IDs.
- logger.Debug("network allocated",
- slog.Any("network_cidr", netConfig.BaseNetwork.String()),
- slog.Any("network_id", na.ID),
+ logger.Debug("network allocated",
+ slog.String("network_cidr", netConfig.BaseNetwork.String()),
+ slog.Int64("network_id", na.ID),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| logger.Debug("network allocated", | |
| slog.Any("network_cidr", netConfig.BaseNetwork.String()), | |
| slog.Any("network_id", na.ID), | |
| ) | |
| logger.Debug("bridge allocated", | |
| slog.Any("network_cidr", netConfig.BaseNetwork.String()), | |
| slog.String("bridge_id", br), | |
| ) | |
| logger.Debug("network allocated", | |
| slog.String("network_cidr", netConfig.BaseNetwork.String()), | |
| slog.Int64("network_id", na.ID), | |
| ) | |
| logger.Debug("bridge allocated", | |
| slog.Any("network_cidr", netConfig.BaseNetwork.String()), | |
| slog.String("bridge_id", br), | |
| ) |
🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/deployment.go around lines 92 to 101, the
logs use slog.Any for the CIDR and network ID; change those fields to use
concrete slog types: use slog.String("network_cidr",
netConfig.BaseNetwork.String()) for the CIDR and slog.Int64("network_id", na.ID)
for the network ID (keep slog.String for the bridge_id if it is a string).
Update the two logger.Debug calls accordingly so they pass the proper typed slog
attributes.
| logger.DebugContext(ctx, "creating vms", slog.Int64("count", int64(vmCount))) | ||
| for vm := range vmCount { | ||
| id := s.generateVMID(ctx) | ||
|
|
||
| vmid := fmt.Sprintf("ud-%s", id) | ||
|
|
||
| // Pop an available IP from the pool | ||
| ipRow, popErr := s.queries.PopAvailableIPJSON(ctx, req.Msg.Deployment.GetDeploymentId()) | ||
| if popErr != nil { | ||
| logger.ErrorContext(ctx, "failed to pop available IP", | ||
| slog.String("error", popErr.Error()), | ||
| ) | ||
| return nil, connect.NewError(connect.CodeInternal, popErr) | ||
| } | ||
|
|
||
| // Now allocate the IP with all required fields | ||
| ipAlloc, allocErr := s.queries.AllocateIP(ctx, database.AllocateIPParams{ | ||
| VmID: vmid, | ||
| IpAddr: ipRow.JsonExtract.(string), | ||
| NetworkAllocationID: ipRow.ID, | ||
| }) | ||
| if allocErr != nil { | ||
| logger.ErrorContext(ctx, "failed to pop available IP", | ||
| slog.String("error", allocErr.Error()), | ||
| ) | ||
| return nil, connect.NewError(connect.CodeInternal, allocErr) | ||
| } | ||
|
|
||
| // CREATION call here | ||
|
|
||
| logger.Debug("created vm", | ||
| slog.String("id", ipAlloc.VmID), | ||
| slog.String("ip", ipAlloc.IpAddr), | ||
| slog.Any("requested", vmCount), | ||
| slog.Any("fulfilled", vm), | ||
| ) | ||
| } |
There was a problem hiding this comment.
Loop is invalid; field name mismatch; not returning real IDs.
- for vm := range vmCount is not legal Go.
- PopAvailableIPJSON row field should use the alias (AllocatedIp), not JsonExtract.
- Collect vmIDs and return them; don’t return static placeholders.
Apply:
- logger.DebugContext(ctx, "creating vms", slog.Int64("count", int64(vmCount)))
- for vm := range vmCount {
+ logger.DebugContext(ctx, "creating vms", slog.Int("count", int(vmCount)))
+ vmIDs := make([]string, 0, int(vmCount))
+ for i := 0; i < int(vmCount); i++ {
id := s.generateVMID(ctx)
vmid := fmt.Sprintf("ud-%s", id)
// Pop an available IP from the pool
ipRow, popErr := s.queries.PopAvailableIPJSON(ctx, req.Msg.Deployment.GetDeploymentId())
if popErr != nil {
logger.ErrorContext(ctx, "failed to pop available IP",
slog.String("error", popErr.Error()),
)
return nil, connect.NewError(connect.CodeInternal, popErr)
}
// Now allocate the IP with all required fields
ipAlloc, allocErr := s.queries.AllocateIP(ctx, database.AllocateIPParams{
VmID: vmid,
- IpAddr: ipRow.JsonExtract.(string),
+ IpAddr: ipRow.AllocatedIp,
NetworkAllocationID: ipRow.ID,
})
if allocErr != nil {
logger.ErrorContext(ctx, "failed to pop available IP",
slog.String("error", allocErr.Error()),
)
return nil, connect.NewError(connect.CodeInternal, allocErr)
}
// CREATION call here
+ // TODO: invoke s.backend.CreateVM with config and vmid; handle rollback on failure.
logger.Debug("created vm",
slog.String("id", ipAlloc.VmID),
slog.String("ip", ipAlloc.IpAddr),
- slog.Any("requested", vmCount),
- slog.Any("fulfilled", vm),
+ slog.Int("requested", int(vmCount)),
+ slog.Int("fulfilled", i+1),
)
+ vmIDs = append(vmIDs, vmid)
}And update the response:
- return connect.NewResponse(&metaldv1.CreateDeploymentResponse{
- VmIds: []string{"ud-001", "ud-002", "ud-003"},
- }), nil
+ return connect.NewResponse(&metaldv1.CreateDeploymentResponse{
+ VmIds: vmIDs,
+ }), nil🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/deployment.go around lines 102–138, the
loop and IP fields are wrong and we never return real VM IDs: replace the
invalid "for vm := range vmCount" with a conventional index loop (e.g. for i :=
0; i < vmCount; i++) so we iterate vmCount times; read the popped IP from the
correct field (use ipRow.AllocatedIp or the DB alias instead of JsonExtract);
call AllocateIP and on success append ipAlloc.VmID to a vmIDs slice you build;
fix the error log for allocation to say "failed to allocate IP" when
appropriate; and when returning, return the collected vmIDs (and update the
debug log to use the requested count and fulfilled count from len(vmIDs) or
i+1). Ensure error returns still wrap the original DB errors with
connect.NewError(connect.CodeInternal, err).
| func (s *VMService) generateVMID(ctx context.Context) string { | ||
| b := make([]byte, 15) | ||
| rand.Read(b) | ||
| return base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(b)[:24] | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Consider making generateVMID return (string, error).
Propagating errors avoids silent fallbacks for identifiers.
🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/vm.go around lines 49 to 53, change
generateVMID to return (string, error) instead of just string, use
crypto/rand.Read (or check the current rand.Read) and capture its error, return
"", err if the read fails, otherwise encode the bytes and return the ID with nil
error; update all callers to handle the returned error and propagate it
appropriately.
crypto/rand.Read error is ignored; add handling.
If rand.Read fails, you’ll silently encode zeroed bytes. Log and handle the error.
Apply:
func (s *VMService) generateVMID(ctx context.Context) string {
- b := make([]byte, 15)
- rand.Read(b)
- return base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(b)[:24]
+ b := make([]byte, 15)
+ if _, err := rand.Read(b); err != nil {
+ s.logger.LogAttrs(ctx, slog.LevelError, "generateVMID: crypto/rand failed", slog.String("error", err.Error()))
+ // Best-effort fallback: still return a value; consider surfacing the error instead.
+ }
+ return base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(b)[:24]
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (s *VMService) generateVMID(ctx context.Context) string { | |
| b := make([]byte, 15) | |
| rand.Read(b) | |
| return base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(b)[:24] | |
| } | |
| func (s *VMService) generateVMID(ctx context.Context) string { | |
| b := make([]byte, 15) | |
| if _, err := rand.Read(b); err != nil { | |
| s.logger.LogAttrs(ctx, slog.LevelError, "generateVMID: crypto/rand failed", slog.String("error", err.Error())) | |
| // Best-effort fallback: still return a value; consider surfacing the error instead. | |
| } | |
| return base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(b)[:24] | |
| } |
🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/vm.go around lines 49–53, rand.Read error
is ignored; change generateVMID to return (string, error), call rand.Read using
b := make([]byte, 15) and check err := rand.Read(b); if err != nil return "",
fmt.Errorf("generateVMID: rand.Read failed: %w", err), then continue to encode
and return the ID; update all callers to handle the error (propagate or log) and
ensure you import fmt (and adjust any tests).
| s.logger.Info("network allocated", | ||
| slog.String("network_cidr", network.BaseNetwork), | ||
| slog.Any("network_cidr", network.BaseNetwork), | ||
| ) |
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Use slog.String for CIDR.
BaseNetwork appears to be a string; slog.Any is noisy here.
- slog.Any("network_cidr", network.BaseNetwork),
+ slog.String("network_cidr", network.BaseNetwork),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| s.logger.Info("network allocated", | |
| slog.String("network_cidr", network.BaseNetwork), | |
| slog.Any("network_cidr", network.BaseNetwork), | |
| ) | |
| s.logger.Info("network allocated", | |
| slog.String("network_cidr", network.BaseNetwork), | |
| ) |
🤖 Prompt for AI Agents
In go/deploy/metald/internal/service/vm.go around lines 101 to 103, the logger
uses slog.Any for network.BaseNetwork which is a string and produces noisy
output; replace slog.Any("network_cidr", network.BaseNetwork) with
slog.String("network_cidr", network.BaseNetwork) so the CIDR is logged cleanly
as a string (no other changes needed).
- Reimplement network management/creation - Rabbit is gonna ruin this commit message
11cf7bd to
df5c3f0
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (14)
go/deploy/spire/Makefile (1)
23-25: Fix production trust domain typo.Use cloud, not cliud.
-TRUST_DOMAIN = production.unkey.cliud +TRUST_DOMAIN = production.unkey.cloudgo/deploy/metald/internal/network/network.go (1)
3-118: Remove dead commented API from internal/network/network.go
- Delete the large commented-out subnet API in go/deploy/metald/internal/network/network.go (or gate it behind a build tag); leaving it commented invites drift.
- Add a short doc.go in the package pointing callers to the maintained APIs: network.GenerateAvailableIPs (iputil.go) and CreateBridge (bridge.go).
- Repo scan found no live references to the old symbols (CalculateIndex, CalculateIndexOneBased, GetSubnetInfo, GetSubnetInfoByIndex, GetNetwork, GetGateway, ValidateIdentifier, GetAllSubnetsInRange, IsValidIndex); callers import package network but use CreateBridge/GenerateAvailableIPs — removal is safe.
go/deploy/metald/sqlc/schema.sql (3)
11-17: Enforce 1:1 mapping and add basic validations.Prevent double-allocation of the same network, validate JSON, and add cascade to keep FK integrity.
CREATE TABLE network_allocations ( id INTEGER PRIMARY KEY, deployment_id TEXT NOT NULL UNIQUE, network_id INTEGER NOT NULL, bridge_name TEXT NOT NULL UNIQUE, - available_ips TEXT NOT NULL, -- JSON array: ["10.0.0.18","10.0.0.19",...] + available_ips TEXT NOT NULL CHECK (json_valid(available_ips)), -- JSON array: ["10.0.0.18","10.0.0.19",...] allocated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - FOREIGN KEY (network_id) REFERENCES networks(id) + FOREIGN KEY (network_id) REFERENCES networks(id) ON DELETE CASCADE, + UNIQUE(network_id) );
3-7: Constrain boolean column.Make intent explicit for is_allocated.
CREATE TABLE networks ( id INTEGER PRIMARY KEY, base_network TEXT UNIQUE NOT NULL, -- CIDR notation: "10.0.0.16/28" - is_allocated INTEGER NOT NULL DEFAULT 0 + is_allocated INTEGER NOT NULL DEFAULT 0 CHECK (is_allocated IN (0,1)) );
19-27: Cascade child rows on parent delete.Avoid orphaned ip_allocations when a network allocation is removed.
FOREIGN KEY (network_allocation_id) REFERENCES network_allocations(id), + -- Keep child rows in sync with parent lifecycle + FOREIGN KEY (network_allocation_id) REFERENCES network_allocations(id) ON DELETE CASCADE, UNIQUE(network_allocation_id, ip_addr)go/deploy/metald/cmd/metald/main.go (2)
312-347: Expose /health on the main server.You instantiate a health handler but never register it on the primary mux; only the Prometheus server serves /health. Most LB/ingress health checks will probe the main port. Add it here.
- // Create unified health handler - healthHandler := healthpkg.Handler("metald", getVersion(), startTime) + // Create unified health handler + healthHandler := healthpkg.Handler("metald", getVersion(), startTime) @@ - path, handler := metaldv1connect.NewVmServiceHandler(vmService, + path, handler := metaldv1connect.NewVmServiceHandler(vmService, connect.WithInterceptors(interceptorList...), ) mux.Handle(path, handler) + // Health on main server + mux.HandleFunc("/health", healthHandler)
318-336: Auth/interceptor change may block health checks; verify behavior.You removed tenant-auth options; ensure default interceptors don’t inadvertently require auth for /health or RPCs you expect to be public. If auth is required, add an explicit bypass in the interceptor package or mount /health before auth.
Would you like a patch in the interceptors package to whitelist /health?
go/deploy/metald/internal/service/vm.go (3)
105-110: Compile-time mismatch with sqlc AllocateIP signature.AllocateIP requires VmID, IpAddr, NetworkAllocationID. Here only VmID is provided. Also, CreateVm lacks a network context to select a pool.
- Fix by requiring deployment_id (or network_allocation_id) in CreateVmRequest and passing all three params, or route single-VM creation through CreateDeployment.
- Until then, remove this call to avoid broken DB writes.
Example:
- ip, ipErr := s.queries.AllocateIP(ctx, database.AllocateIPParams{ - VmID: req.Msg.GetVmId(), - }) + // TODO: require deployment_id and use PopAvailableIPJSON + AllocateIP with IpAddr and NetworkAllocationID.
112-151: VM ID/IP mapping inconsistency.You allocate IPs (incorrectly) for req.Msg.VmId but backend returns vmID; response logs/use vmID while IP is tied to a different ID.
- Generate the VM ID first, use it consistently for AllocateIP and backend.CreateVM, and return that same ID.
- vmID, err := s.backend.CreateVM(ctx, config) + genID, genErr := s.generateVMID(ctx) + if genErr != nil { + return nil, connect.NewError(connect.CodeInternal, genErr) + } + vmID, err := s.backend.CreateVM(ctx, config /* pass genID if backend supports */)Also update any AllocateIP to use vmID.
145-151: Hard-coded port.Port 35428 should come from backend or config; avoid magic numbers in API responses.
- Port: 35428, + Port: info.Port, // or from backend response/configIf backend doesn’t expose it yet, extend Backend.CreateVM to return endpoint details.
go/deploy/metald/sqlc/queries.sql (1)
25-31: Name the JSON column alias to produce a typed field from sqlc.Use an alias like allocated_ip to avoid JsonExtract interface{} and driver-dependent types.
-RETURNING json_extract(available_ips, '$[0]') as allocated_ip, id; +RETURNING json_extract(available_ips, '$[0]') AS allocated_ip, id;Then re-run sqlc so PopAvailableIPJSONRow has AllocatedIp string.
go/deploy/metald/internal/database/queries.sql.go (2)
117-122: Typed count preferred.sqlc generated GetAvailableIPCount returns interface{}; cast issues likely. Make SQL cast to INTEGER so sqlc infers int64.
-SELECT json_array_length(available_ips) as count +SELECT CAST(json_array_length(available_ips) AS INTEGER) AS countRe-run sqlc to get
GetAvailableIPCount(ctx string) (int64, error).
200-216: Rename json_extract alias to allocated_ip and regenerate.Avoid interface{} + type assertions at call sites.
-RETURNING json_extract(available_ips, '$[0]') as allocated_ip, id +RETURNING json_extract(available_ips, '$[0]') AS allocated_ip, idThen re-run sqlc so:
type PopAvailableIPJSONRow struct { AllocatedIp string `db:"allocated_ip" json:"allocated_ip"` ID int64 `db:"id" json:"id"` } ```<!-- review_comment_end --> </blockquote></details> <details> <summary>go/deploy/metald/internal/service/deployment.go (1)</summary><blockquote> `140-143`: **Return actual IDs instead of placeholders.** ```diff - return connect.NewResponse(&metaldv1.CreateDeploymentResponse{ - VmIds: []string{"ud-001", "ud-002", "ud-003"}, - }), nil + return connect.NewResponse(&metaldv1.CreateDeploymentResponse{ + VmIds: vmIDs, + }), nil
♻️ Duplicate comments (18)
go/deploy/spire/Makefile (1)
128-128: Remove stale create-users prerequisite from install-server.This target no longer exists; current Makefile will fail to resolve it. (Repeat from prior review.)
-install-server: download-spire create-users setup-directories ## Install SPIRE server +install-server: download-spire setup-directories ## Install SPIRE servergo/deploy/metald/internal/service/vm.go (2)
101-103: Use slog.String for CIDR.BaseNetwork is string; avoid slog.Any.
- slog.Any("network_cidr", network.BaseNetwork), + slog.String("network_cidr", network.BaseNetwork),
49-53: Handle crypto/rand errors; return (string, error).Ignoring rand.Read can yield zeroed IDs. Return an error and propagate to callers.
-func (s *VMService) generateVMID(ctx context.Context) string { - b := make([]byte, 15) - rand.Read(b) - return base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(b)[:24] -} +func (s *VMService) generateVMID(ctx context.Context) (string, error) { + b := make([]byte, 15) + if _, err := rand.Read(b); err != nil { + return "", fmt.Errorf("generateVMID: rand.Read failed: %w", err) + } + return base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(b)[:24], nil +}go/deploy/metald/internal/network/iputil.go (3)
9-24: Return Go slice (or add a helper) and validate IPv4.Returning JSON couples callers; prefer []string (or add a GenerateAvailableIPsJSON wrapper). Also validate IPv4 and handle small subnets.
-func GenerateAvailableIPs(cidr string) (string, error) { +func GenerateAvailableIPs(cidr string) ([]string, error) { _, ipnet, err := net.ParseCIDR(cidr) if err != nil { - return "", fmt.Errorf("invalid CIDR: %w", err) + return nil, fmt.Errorf("invalid CIDR: %w", err) } - // Calculate the number of hosts - ones, bits := ipnet.Mask.Size() - hostBits := bits - ones - numHosts := (1 << hostBits) - 2 // Exclude network and broadcast + base := ipnet.IP.To4() + if base == nil { + return nil, fmt.Errorf("only IPv4 CIDRs are supported: %s", cidr) + } + ones, bits := ipnet.Mask.Size() // expect 32 + hostBits := uint(bits - ones) + total := uint64(1) << hostBits + usable := uint64(0) + if total > 3 { + usable = total - 3 // exclude network, gateway, broadcast + } - if numHosts < 1 { - return "[]", nil // No usable IPs in this network + if usable == 0 { + return []string{}, nil }
26-39: Cap output size to avoid memory blowups on broad CIDRs and guard broadcast calc.- ips := make([]string, 0, numHosts-1) // -1 to also exclude gateway + const maxIPs uint64 = 65536 + count := usable + if count > maxIPs { + count = maxIPs + } + ips := make([]string, 0, int(count)) @@ - ip := ipnet.IP.Mask(ipnet.Mask) + ip := base.Mask(ipnet.Mask) @@ - broadcast := getBroadcast(ipnet) - for i := 0; i < numHosts-1 && !ip.Equal(broadcast); i++ { + broadcast := getBroadcast(ipnet) + if broadcast == nil { + return nil, fmt.Errorf("failed to compute broadcast for %s", cidr) + } + for i := uint64(0); i < count && !ip.Equal(broadcast); i++ { ips = append(ips, ip.String()) incIP(ip) } - - // Convert to JSON array - jsonBytes, err := json.Marshal(ips) - if err != nil { - return "", fmt.Errorf("failed to marshal IPs to JSON: %w", err) - } - - return string(jsonBytes), nil + return ips, nilOptionally add:
func GenerateAvailableIPsJSON(cidr string) (string, error) { ips, err := GenerateAvailableIPs(cidr); if err != nil { return "", err }; b, err := json.Marshal(ips); if err != nil { return "", err }; return string(b), nil } ```<!-- review_comment_end --> --- `60-71`: **Handle non-IPv4 input in getBroadcast.** To4() may be nil; return nil to let caller fail fast. ```diff func getBroadcast(ipnet *net.IPNet) net.IP { - ip := ipnet.IP.To4() + ip := ipnet.IP.To4() + if ip == nil { + return nil + } mask := ipnet.Mask broadcast := make(net.IP, len(ip))go/deploy/metald/internal/database/queries.sql.go (1)
59-70: BridgeName: enforce IFNAMSIZ in schema and validate in app.Add a CHECK (length <= 15) in schema. This prevents persisting invalid names that netlink rejects.
go/deploy/metald/internal/network/bridge.go (3)
10-19: Make CreateBridge idempotent (don’t early-return).If the bridge exists, still ensure IP address and link-up; use AddrReplace.
- link, linkErr := netlink.LinkByName(cfg.BridgeName) - if linkErr == nil { - logger.Debug("bridge exists", - slog.String("bridge", cfg.BridgeName), - slog.Int("mtu", link.Attrs().MTU), - ) - return link.Attrs().Name, nil - } + link, linkErr := netlink.LinkByName(cfg.BridgeName) + var br netlink.Link + if linkErr == nil { + logger.Debug("bridge exists", + slog.String("bridge", cfg.BridgeName), + slog.Int("mtu", link.Attrs().MTU), + ) + br = link + } else { + bridge := &netlink.Bridge{LinkAttrs: netlink.LinkAttrs{Name: cfg.BridgeName}} //nolint:exhaustruct + if err := netlink.LinkAdd(bridge); err != nil { + logger.Error("failed to create bridge", slog.String("bridge", cfg.BridgeName), slog.String("error", err.Error())) + return "", fmt.Errorf("failed to create bridge: %w", err) + } + var brErr error + br, brErr = netlink.LinkByName(cfg.BridgeName) + if brErr != nil { + return "", fmt.Errorf("failed to get bridge: %w", brErr) + } + }
40-51: Wrong address: using network address; assign first host with correct mask.Assign a usable host IP (e.g., .1) and replace existing addr.
- addr, err := netlink.ParseAddr(cfg.BaseNetwork.String()) + prefixSize, _ := cfg.BaseNetwork.Mask.Size() + gwIP, err := firstHostIPv4(cfg.BaseNetwork.IP) + if err != nil { + return "", fmt.Errorf("failed to derive bridge IP: %w", err) + } + addr, err := netlink.ParseAddr(fmt.Sprintf("%s/%d", gwIP.String(), prefixSize)) if err != nil { return "", fmt.Errorf("failed to parse bridge IP: %w", err) } - - if err := netlink.AddrAdd(br, addr); err != nil { + if err := netlink.AddrReplace(br, addr); err != nil { logger.Error("failed to add IP to bridge", slog.String("bridge", cfg.BridgeName), slog.String("error", err.Error()), ) - return "", fmt.Errorf("failed to add IP to bridge: %w", err) + return "", fmt.Errorf("failed to configure IP on bridge: %w", err) }Add helper:
// firstHostIPv4 returns the first usable IPv4 in the subnet (network IP + 1). func firstHostIPv4(ip net.IP) (net.IP, error) { ip4 := ip.To4() if ip4 == nil { return nil, fmt.Errorf("only IPv4 is supported") } out := make(net.IP, 4); copy(out, ip4) for i := 3; i >= 0; i-- { out[i]++; if out[i] != 0 { break } } return out, nil }(Remember to import net.)
62-67: Message nit: “bridge configured” is clearer for idempotent flow.- logger.Debug("bridge created", + logger.Debug("bridge configured",go/deploy/metald/internal/service/deployment.go (6)
19-25: Span name: use deployment-specific name.Improves trace readability.
- ctx, span := s.tracer.Start(ctx, "metald.vm.create", + ctx, span := s.tracer.Start(ctx, "metald.deployment.create",
39-45: Return the correct error variable.bnErr is the parse error; you’re returning netErr.
- return nil, connect.NewError(connect.CodeInternal, netErr) + return nil, connect.NewError(connect.CodeInternal, bnErr)
47-54: Avoid hard-coded DNS resolvers.Make DNSServers configurable or use infra defaults.
- DNSServers: []string{"4.2.2.2", "4.2.2.1"}, + DNSServers: s.resolveDNSServersFromConfigOrDefault(),Add a helper to read from config/env.
56-67: Cleanup message and rollback path.You release the network, not a bridge; fix log. Consider wrapping AllocateNetwork→CreateBridge→CreateNetworkAllocation in a tx and releasing/rolling back on any failure.
- logger.Debug("cleaned up bridge") + logger.Debug("released network after bridge creation failure")I can propose a tx-scoped flow with s.queries if desired.
92-100: Use concrete slog types.- slog.Any("network_cidr", netConfig.BaseNetwork.String()), - slog.Any("network_id", na.ID), + slog.String("network_cidr", netConfig.BaseNetwork.String()), + slog.Int64("network_id", na.ID),
102-138: Invalid loop; unsafe type assertion; return real VM IDs.
- for vm := range vmCount is invalid.
- ipRow.JsonExtract may be []byte; assertion to string can panic.
- Collect and return actual IDs instead of static placeholders.
- logger.DebugContext(ctx, "creating vms", slog.Int64("count", int64(vmCount))) - for vm := range vmCount { - id := s.generateVMID(ctx) + logger.DebugContext(ctx, "creating vms", slog.Int("count", int(vmCount))) + vmIDs := make([]string, 0, int(vmCount)) + for i := 0; i < int(vmCount); i++ { + id, idErr := s.generateVMID(ctx) + if idErr != nil { return nil, connect.NewError(connect.CodeInternal, idErr) } vmid := fmt.Sprintf("ud-%s", id) @@ - ipAlloc, allocErr := s.queries.AllocateIP(ctx, database.AllocateIPParams{ + // Safely coerce allocated IP from sqlc row until we alias it + var ipStr string + switch v := ipRow.JsonExtract.(type) { + case string: ipStr = v + case []byte: ipStr = string(v) + default: + return nil, connect.NewError(connect.CodeInternal, fmt.Errorf("unexpected IP type %T", v)) + } + ipAlloc, allocErr := s.queries.AllocateIP(ctx, database.AllocateIPParams{ VmID: vmid, - IpAddr: ipRow.JsonExtract.(string), + IpAddr: ipStr, NetworkAllocationID: ipRow.ID, }) @@ - logger.Debug("created vm", + logger.Debug("created vm", slog.String("id", ipAlloc.VmID), slog.String("ip", ipAlloc.IpAddr), - slog.Any("requested", vmCount), - slog.Any("fulfilled", vm), + slog.Int("requested", int(vmCount)), + slog.Int("fulfilled", i+1), ) + vmIDs = append(vmIDs, vmid) }And update the response (below).
go/deploy/metald/internal/network/idgen.go (2)
9-12: Fix stale/incomplete comments; convert to proper doc comment.These lines reference the removed IDGenerator and contain a fragment. Replace with a concise doc comment for GenerateID and note the DB uniqueness contract.
-// IDGenerator generates short, unique IDs for network devices -// Network interface names in Linux are limited to 15 characters, - -// Generates an for the bridge, it's not guaranteed unique until it's saved to the DB +// GenerateID returns a Linux-safe 15-char bridge name: "br-" + 12 base32 chars. +// Not guaranteed unique; enforce UNIQUE(bridge_name) at persistence and retry on conflict. +// Linux netdev names are limited to 15 characters.
13-19: Don’t ignore crypto/rand errors; normalize to lowercase; minor string concat nit.rand.Read error must be handled; lowercase avoids surprises with netdev naming; concatenation avoids an unnecessary Sprintf.
-func GenerateID() string { - b := make([]byte, 8) - rand.Read(b) - id := base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(b)[:12] - - return fmt.Sprintf("br-%s", id) -} +func GenerateID() string { + b := make([]byte, 8) + if _, err := io.ReadFull(rand.Reader, b); err != nil { + panic(fmt.Errorf("idgen: crypto/rand failure: %w", err)) + } + enc := base32.StdEncoding.WithPadding(base32.NoPadding) + id := strings.ToLower(enc.EncodeToString(b)[:12]) + return "br-" + id +}Add required imports:
import ( "crypto/rand" "encoding/base32" "fmt" + "io" + "strings" )If panicking here is undesirable, consider changing the signature to
func GenerateID() (string, error)and propagate errors to callers. I can help refactor call sites if desired.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (15)
go/deploy/metald/cmd/metald/main.go(1 hunks)go/deploy/metald/internal/backend/firecracker/asset_manager.go(0 hunks)go/deploy/metald/internal/database/models.go(1 hunks)go/deploy/metald/internal/database/queries.sql.go(4 hunks)go/deploy/metald/internal/network/bridge.go(1 hunks)go/deploy/metald/internal/network/idgen.go(1 hunks)go/deploy/metald/internal/network/iputil.go(1 hunks)go/deploy/metald/internal/network/network.go(1 hunks)go/deploy/metald/internal/service/auth.go(0 hunks)go/deploy/metald/internal/service/deployment.go(1 hunks)go/deploy/metald/internal/service/vm.go(3 hunks)go/deploy/metald/sqlc/queries.sql(1 hunks)go/deploy/metald/sqlc/schema.sql(1 hunks)go/deploy/pkg/observability/interceptors/options.go(0 hunks)go/deploy/spire/Makefile(1 hunks)
💤 Files with no reviewable changes (3)
- go/deploy/pkg/observability/interceptors/options.go
- go/deploy/metald/internal/service/auth.go
- go/deploy/metald/internal/backend/firecracker/asset_manager.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-10T04:47:17.076Z
Learnt from: imeyer
PR: unkeyed/unkey#3941
File: go/deploy/spire/Makefile:110-110
Timestamp: 2025-09-10T04:47:17.076Z
Learning: In the SPIRE Makefile (go/deploy/spire/Makefile), the create-users target has been removed, so any references to create-users in target dependencies should also be removed. Both install-agent and install-server targets had create-users dependencies that needed to be cleaned up.
Applied to files:
go/deploy/spire/Makefile
🧬 Code graph analysis (2)
go/deploy/metald/internal/database/queries.sql.go (1)
go/deploy/metald/internal/database/models.go (1)
NetworkAllocation(25-32)
go/deploy/metald/internal/service/deployment.go (6)
go/deploy/metald/internal/service/vm.go (1)
VMService(26-34)go/pkg/otel/tracing/trace.go (1)
Start(59-62)go/deploy/metald/internal/network/idgen.go (1)
GenerateID(13-19)go/deploy/metald/internal/network/bridge.go (1)
CreateBridge(11-68)go/deploy/metald/internal/network/iputil.go (1)
GenerateAvailableIPs(11-48)go/deploy/metald/internal/database/queries.sql.go (2)
CreateNetworkAllocationParams(65-70)AllocateIPParams(19-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Test Go API Local / Test
- GitHub Check: Build / Build
- GitHub Check: Test API / API Test Local
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (4)
go/deploy/spire/Makefile (2)
110-110: LGTM: removed nonexistent dependency.Dropping create-users from install-agent is correct per recent cleanup.
111-121: Incorrect — systemd units run as root; no user creation requiredBoth go/deploy/spire/contrib/systemd/spire-agent.service and go/deploy/spire/contrib/systemd/spire-server.service set User=root, so the install targets do not need to create dedicated spire users; remove/resolve the original suggestion.
Likely an incorrect or invalid review comment.
go/deploy/metald/sqlc/schema.sql (1)
11-15: Migration warning for NOT NULL bridge_name.If this alters an existing DB, adding a NOT NULL column without a default in SQLite will fail unless the table is empty. Provide a migration that backfills bridge_name (e.g., br-) and then adds the NOT NULL + UNIQUE constraint.
go/deploy/metald/internal/network/idgen.go (1)
5-5: LGTM on base32 choice.Using base32 without padding is appropriate for fixed-length, interface-safe IDs.
Graphite Automations"Post a GIF when PR approved" took an action on this PR • (09/11/25)1 gif was posted to this PR based on Andreas Thomas's automation. |

What does this PR do?
Fixes # (issue)
If there is not an issue for this, please create one first. This is used to tracking purposes and also helps use understand why this PR exists
Type of change
How should this be tested?
Checklist
Required
pnpm buildpnpm fmtconsole.logsgit pull origin mainAppreciated
Summary by CodeRabbit
New Features
Refactor