Skip to content

feat: add lets encrypt challenges#4471

Merged
chronark merged 10 commits intomainfrom
feat/ctrl-lets-encrypt-challenges
Dec 9, 2025
Merged

feat: add lets encrypt challenges#4471
chronark merged 10 commits intomainfrom
feat/ctrl-lets-encrypt-challenges

Conversation

@Flo4604
Copy link
Member

@Flo4604 Flo4604 commented Dec 4, 2025

What does this PR do?

Fixes #4054

We got automated certificates again!

Adds route53 as certificate provider (aws)
Uses self scheduling restate job to automatically refresh soon to be expiring certs or ones that we weren't able to issue yet.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Chore (refactoring code, technical debt, workflow improvements)
  • Enhancement (small improvements)
  • New/Old feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How should this be tested?

Checklist

Required

  • Filled out the "How to test" section in this PR
  • Read Contributing Guide
  • Self-reviewed my own code
  • Commented on my code in hard-to-understand areas
  • Ran pnpm build
  • Ran pnpm fmt
  • Ran make fmt on /go directory
  • Checked for warnings, there are none
  • Removed all console.logs
  • Merged the latest changes from main onto my branch with git pull origin main
  • My changes don't cause any responsiveness issues

Appreciated

  • If a UI change was made: Added a screen recording or screenshots to this PR
  • Updated the Unkey Docs if changes were necessary

@vercel
Copy link

vercel bot commented Dec 4, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
dashboard Error Error Dec 9, 2025 5:22pm
engineering Ready Ready Preview Comment Dec 9, 2025 5:22pm

@changeset-bot
Copy link

changeset-bot bot commented Dec 4, 2025

⚠️ No Changeset found

Latest commit: 01b0c96

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 4, 2025

📝 Walkthrough

Walkthrough

This change re-enables ACME certificate management functionality as a Restate cron job. Route53 is introduced as an alternative DNS-01 provider alongside Cloudflare. A centralized in-memory caching layer is added for domains and challenges. Provider implementations are refactored to use a generic wrapper pattern that handles DNS operations and database persistence.

Changes

Cohort / File(s) Summary
Configuration & CLI
go/apps/ctrl/config.go, go/cmd/ctrl/main.go
Added Route53Config struct with AWS credentials and region. Extended AcmeConfig with Route53 field and EmailDomain. Added CLI flags for ACME email domain and Route53 settings (access key, secret, region, hosted zone ID).
Caching Infrastructure
go/apps/ctrl/internal/caches/caches.go, go/apps/ctrl/services/acme/service.go, go/apps/ctrl/services/acme/certificate_verification.go
Introduced centralized Caches component with Domains (5m fresh, 10m stale) and Challenges (10s fresh, 30s stale) caches. Service and config extended with DomainCache and ChallengeCache fields. Certificate verification refactored to use SWR-cached lookups instead of direct DB queries.
Provider Abstraction & Refactoring
go/apps/ctrl/services/acme/providers/provider.go, go/apps/ctrl/services/acme/providers/cloudflare_provider.go, go/apps/ctrl/services/acme/providers/http_provider.go
Created generic Provider wrapper with DNSProvider interface for DNS-01/HTTP-01 challenges. Implements Present (with DB challenge recording), CleanUp, and Timeout methods. CloudflareProvider and HTTPProvider refactored to return generic Provider, removing direct DB interaction; now accept DomainCache in config.
Route53 Provider
go/apps/ctrl/services/acme/providers/route53_provider.go
Added Route53Config struct with credentials, region, and optional HostedZoneID. NewRoute53Provider builds Route53 DNS provider and wraps it in generic Provider with DB and DomainCache integration.
ACME User & Utilities
go/apps/ctrl/services/acme/user.go, go/apps/ctrl/services/acme/pem.go
Added EmailDomain field to AcmeUser and UserConfig, propagated into email generation. GetOrCreateUser now auto-completes ACME registration and persists Registration URI. Added GetCertificateExpiry function to extract certificate expiration as Unix milliseconds.
Certificate Workflows
go/apps/ctrl/workflows/certificate/service.go, go/apps/ctrl/workflows/certificate/process_challenge_handler.go, go/apps/ctrl/workflows/certificate/renew_handler.go
Service extended with EmailDomain, DefaultDomain, DNSProvider, and HTTPProvider fields. ProcessChallenge updated with named return form and CertificateID field in EncryptedCertificate. New RenewExpiringCertificates handler implements self-scheduling cron job: queries expiring challenges, triggers renewals per domain, schedules next run with 24h delay and idempotency key.
ACME Bootstrapping
go/apps/ctrl/run.go
Added environment variable LEGO_DISABLE_CNAME_SUPPORT. Wired Caches container, clock dependency, and HTTP-01/DNS-01 providers (Cloudflare or Route53 per config). Bootstrap helper ensures wildcard domain and initial ACME challenge exist. Certificate renewal cron scheduled on startup via Restate client.
Database Operations
go/pkg/db/custom_domain_upsert.sql_generated.go, go/pkg/db/bulk_custom_domain_upsert.sql_generated.go, go/pkg/db/custom_domain_find_by_domain_or_wildcard.sql_generated.go, go/pkg/db/querier_generated.go, go/pkg/db/querier_bulk_generated.go
Added UpsertCustomDomain and UpsertCustomDomain (bulk) operations with ON DUPLICATE KEY UPDATE. Added FindCustomDomainByDomainOrWildcard query preferring exact domain matches. Querier interface extended with new methods.
SQL Query Definitions
go/pkg/db/queries/custom_domain_upsert.sql, go/pkg/db/queries/custom_domain_find_by_domain_or_wildcard.sql
New SQL files defining upsert and domain lookup operations for custom_domains table.
Protocol & Utilities
go/proto/hydra/v1/certificate.proto, go/pkg/db/handle_err_no_rows.go, go/go.mod
Added RenewExpiringCertificates RPC with request/response messages. Added IsNotFound utility for sql.ErrNoRows comparison. Added aws/aws-sdk-go-v2/service/route53 indirect dependency.

Sequence Diagram

sequenceDiagram
    actor Restate as Restate Client
    participant CertSvc as Certificate Service
    participant DB as Database
    participant Provider as DNS/HTTP Provider
    participant ACME as ACME Client
    
    Note over Restate,ACME: Certificate Renewal Cron
    
    Restate->>CertSvc: RenewExpiringCertificates(ctx, req)
    activate CertSvc
    
    CertSvc->>DB: Query expiring/waiting challenges<br/>(DNS01, HTTP01)
    DB-->>CertSvc: List of challenges
    
    loop For each challenge domain
        CertSvc->>Provider: Resolve domain from cache/DB<br/>(exact or wildcard match)
        Provider-->>CertSvc: CustomDomain
        
        CertSvc->>ACME: Create client with user/registration
        ACME-->>CertSvc: Client ready
        
        CertSvc->>Provider: ProcessChallenge (fire-and-forget)
        activate Provider
        Provider->>Provider: Present DNS/HTTP record
        Provider->>DB: Record ACME challenge state<br/>(DomainID, Status, Token, etc.)
        DB-->>Provider: ✓ Persisted
        Provider-->>CertSvc: Done
        deactivate Provider
        
        CertSvc->>CertSvc: Throttle (brief sleep)
    end
    
    Note over CertSvc: Collect metrics<br/>(checked, triggered, failed)
    
    CertSvc->>Restate: Schedule next run<br/>with 24h delay & idempotency key
    Restate-->>CertSvc: ✓ Scheduled
    
    deactivate CertSvc
    CertSvc-->>Restate: RenewExpiringCertificatesResponse<br/>(counts, failed_domains)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45–75 minutes

  • Provider wrapper logic (provider.go): Complex DNS challenge recording with DB state persistence, error wrapping, and domain resolution logic (lines dense with context-aware error handling).
  • Certificate renewal cron (renew_handler.go): New self-scheduling job with fire-and-forget semantics, Restate idempotency key generation, and multi-domain batch processing.
  • Caching integration across service boundaries (verification, service initialization, provider config): Requires understanding cache TTL values, SWR patterns, and invalidation points.
  • Provider refactoring (Cloudflare, HTTP): Type alias introduction, method relocation, generic wrapper instantiation—subtle changes affecting public API surface.
  • Database schema & operations: New upsert/find queries with ON DUPLICATE KEY UPDATE and domain preference logic (exact vs. wildcard).
  • Run.go wiring: Large refactor with conditional provider setup, bootstrap logic, and multi-component initialization order dependencies.
  • High heterogeneity: Changes span providers, workflows, caching, configuration, and database layers without strong repetitive patterns.

Areas requiring extra attention:

  • Provider.Present() domain resolution and DB recording logic—ensure error boundaries and cache coherency
  • RenewExpiringCertificates idempotency key derivation and Restate scheduling semantics
  • Cache initialization TTL values and their impact on certificate/challenge staleness
  • Route53 credentials validation in config and runtime error handling
  • Custom domain upsert ON DUPLICATE KEY UPDATE correctness with challenge_type field

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 56.25% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'feat: add lets encrypt challenges' directly summarizes the primary change, clearly indicating that Let's Encrypt challenge support has been added to the system.
Description check ✅ Passed The PR description includes issue reference (#4054), feature summary, type of change selection, completed checklist items, and addresses the core objective of re-enabling ACME functionality.
Linked Issues check ✅ Passed The PR successfully implements all core requirements from issue #4054: re-enables ACME certificate management functionality using a Restate cron job, adds Route53 as a certificate provider, and includes automated certificate renewal logic.
Out of Scope Changes check ✅ Passed All code changes are directly related to implementing ACME certificate functionality and Route53 support as specified in issue #4054. No unrelated or extraneous changes were detected.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/ctrl-lets-encrypt-challenges

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Member Author

Flo4604 commented Dec 4, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@vercel vercel bot temporarily deployed to Preview – engineering December 4, 2025 12:59 Inactive
@vercel vercel bot temporarily deployed to Preview – dashboard December 4, 2025 12:59 Inactive
@Flo4604
Copy link
Member Author

Flo4604 commented Dec 4, 2025

@coderabbitai review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 4, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
go/apps/ctrl/workflows/certificate/service.go (1)

54-64: Validate required configuration fields.

The new fields DNSProvider, EmailDomain, and DefaultDomain (lines 60-62) are assigned without validation. If these are required for the service to function correctly (e.g., DNSProvider for DNS-01 challenges), the constructor should validate them before creating the service instance.

Consider adding validation:

 func New(cfg Config) *Service {
+	// Validate required fields
+	if cfg.DNSProvider == nil {
+		panic("DNSProvider is required")
+	}
+	if cfg.EmailDomain == "" {
+		panic("EmailDomain is required")
+	}
+	if cfg.DefaultDomain == "" {
+		panic("DefaultDomain is required")
+	}
+	
 	return &Service{
 		UnimplementedCertificateServiceServer: hydrav1.UnimplementedCertificateServiceServer{},
 		db:                                    cfg.DB,
 		vault:                                 cfg.Vault,
 		logger:                                cfg.Logger,
 		emailDomain:                           cfg.EmailDomain,
 		defaultDomain:                         cfg.DefaultDomain,
 		dnsProvider:                           cfg.DNSProvider,
 	}
 }

Alternatively, if any of these fields are optional, document this in the Config struct comments and handle nil/empty cases appropriately in the methods that use them.

🧹 Nitpick comments (10)
go/apps/ctrl/config.go (2)

53-65: AcmeConfig additions (EmailDomain & Route53) are coherently modeled.

EmailDomain, Cloudflare, and Route53 are grouped cleanly under AcmeConfig, which will make it easier to extend to additional providers later. Very minor nit: the Config struct comment at Line 155 still reads ACME/Cloudflare Configuration even though Route53 is now supported—renaming that to just ACME Configuration would avoid future confusion, but it can be done opportunistically.


219-236: Route53 validation mirrors Cloudflare; consider also validating EmailDomain when ACME is enabled.

The new Route53 block correctly enforces that AccessKeyID, SecretAccessKey, and Region are non-empty when ACME + Route53 are enabled, mirroring the existing Cloudflare API token check from earlier work. This will fail fast on misconfigurations, which is good. Based on learnings, this keeps Route53 in line with the prior Cloudflare validation behavior.

You might additionally want to reject an empty EmailDomain whenever ACME is enabled, so misconfigured registration email domains are caught at startup instead of at issuance time. One way to do that:

 func (c Config) Validate() error {
-	// Validate Cloudflare configuration if enabled
-	if c.Acme.Enabled && c.Acme.Cloudflare.Enabled {
+	if c.Acme.Enabled {
+		if err := assert.NotEmpty(c.Acme.EmailDomain, "acme email domain is required when acme is enabled"); err != nil {
+			return err
+		}
+	}
+
+	// Validate Cloudflare configuration if enabled
+	if c.Acme.Enabled && c.Acme.Cloudflare.Enabled {
 		if err := assert.NotEmpty(c.Acme.Cloudflare.ApiToken, "cloudflare API token is required when cloudflare is enabled"); err != nil {
 			return err
 		}
 	}
 
 	// Validate Route53 configuration if enabled
 	if c.Acme.Enabled && c.Acme.Route53.Enabled {
 		if err := assert.All(
 			assert.NotEmpty(c.Acme.Route53.AccessKeyID, "route53 access key ID is required when route53 is enabled"),
 			assert.NotEmpty(c.Acme.Route53.SecretAccessKey, "route53 secret access key is required when route53 is enabled"),
 			assert.NotEmpty(c.Acme.Route53.Region, "route53 region is required when route53 is enabled"),
 		); err != nil {
 			return err
 		}
 	}
go/cmd/ctrl/main.go (1)

103-116: ACME CLI flags cover email domain and Route53 options in a sensible way.

The new acme-email-domain and acme-route53-* flags (with matching UNKEY_ACME_* env vars and a reasonable default region) round out ACME configuration nicely. The default acme-email-domain="unkey.com" is convenient for your hosted environment; if you expect many self‑hosted users, you might eventually consider dropping the default to force an explicit choice, but that’s more of a product call than a correctness issue.

go/apps/ctrl/internal/caches/caches.go (1)

22-43: Consider validating or defaulting the Logger field.

The Clock field is defaulted if nil (lines 23-26), but the Logger field at line 32 is passed directly to cache.New without validation. If a nil Logger is passed and the cache implementation requires a valid logger, this could cause issues.

Consider adding validation or a default logger:

 func New(cfg Config) (*Caches, error) {
 	clk := cfg.Clock
 	if clk == nil {
 		clk = clock.New()
 	}
+	
+	if cfg.Logger == nil {
+		return nil, fmt.Errorf("logger is required")
+	}
 
 	domains, err := cache.New(cache.Config[string, db.CustomDomain]{

Alternatively, if cache.New already validates the logger internally, document this requirement in the Config struct comments.

go/apps/ctrl/workflows/certificate/renew_handler.go (1)

49-76: Consider adding pagination or limits to prevent overwhelming the system.

The renewal loop processes all executable challenges without any limit (line 49). If thousands of certificates need renewal, this could overwhelm the system with renewal requests, despite the 100ms throttle delay.

Consider one of these approaches:

  1. Process in batches: Limit the number of renewals per run (e.g., max 100), and let subsequent cron runs handle the rest.
  2. Add a configurable limit: Allow operators to tune the batch size based on system capacity.
  3. Track progress: Store progress state and resume from where the last run left off.

Example with a limit:

+const maxRenewalsPerRun = 100
+
 func (s *Service) RenewExpiringCertificates(
 	ctx restate.ObjectContext,
 	req *hydrav1.RenewExpiringCertificatesRequest,
 ) (*hydrav1.RenewExpiringCertificatesResponse, error) {
 	// ... existing code ...
 
 	var failedDomains []string
 	renewalsTriggered := int32(0)
 
-	for _, challenge := range challenges {
+	for i, challenge := range challenges {
+		if i >= maxRenewalsPerRun {
+			s.logger.Info("reached max renewals per run", "limit", maxRenewalsPerRun)
+			break
+		}
+		
 		s.logger.Info("triggering certificate renewal",
go/apps/ctrl/services/acme/user.go (1)

71-109: Consider logging success after registration URI persistence.

The registration completion logic is well-structured with proper conditional handling. However, you only log a warning on failure at line 107 but don't log success. For consistency and observability, consider adding an info log when the registration URI is successfully persisted.

 		if updateErr := db.Query.UpdateAcmeUserRegistrationURI(ctx, cfg.DB.RW(), db.UpdateAcmeUserRegistrationURIParams{
 			ID:              foundUser.ID,
 			RegistrationUri: sql.NullString{Valid: true, String: reg.URI},
 		}); updateErr != nil {
 			cfg.Logger.Warn("failed to persist registration URI", "error", updateErr)
+		} else {
+			cfg.Logger.Info("persisted registration URI", "workspace_id", cfg.WorkspaceID)
 		}
go/apps/ctrl/services/acme/providers/provider.go (2)

94-124: Present() creates a detached context, losing trace propagation.

Using context.Background() at line 96 discards any tracing/logging context that may have been passed through the ACME challenge flow. Since the lego challenge.Provider interface doesn't accept a context, consider at minimum logging a trace ID if available from the caller, or documenting this limitation.

If trace context is important for debugging ACME challenges, consider storing a context reference in the Provider struct during construction or using a context-aware wrapper pattern. For now, this is acceptable given lego's API constraints, but worth noting for observability.


126-136: CleanUp intentionally swallows errors — consider adding a metric.

The decision to log warnings but not fail on cleanup errors is reasonable (DNS records can often be cleaned up manually, and failing would block certificate issuance). However, for operational visibility, consider emitting a metric or structured log field that can be easily alerted on if cleanup failures become frequent.

go/apps/ctrl/workflows/certificate/process_challenge_handler.go (2)

85-85: Step comment numbering is off — says "Step 5" but should be "Step 4".

Minor documentation inconsistency: the comment says "Step 5" but this is actually the 4th step in the flow (after resolve domain, claim challenge, obtain certificate).

-	// Step 5: Persist certificate to DB
+	// Step 4: Persist certificate to DB

118-141: The workspaceID parameter in getOrCreateAcmeClient is unused.

The function signature accepts a workspaceID parameter (named _ on line 121), but globalAcmeUserID is always used instead. Either remove the parameter or document why it's reserved for future use.

If the intention is to always use a global ACME user, simplify the signature:

-func (s *Service) getOrCreateAcmeClient(ctx context.Context, _ string) (*lego.Client, error) {
+func (s *Service) getOrCreateAcmeClient(ctx context.Context) (*lego.Client, error) {

And update the call site at line 147:

-	client, err := s.getOrCreateAcmeClient(ctx, workspaceID)
+	client, err := s.getOrCreateAcmeClient(ctx)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fa48d4c and 4292117.

⛔ Files ignored due to path filters (3)
  • go/gen/proto/hydra/v1/certificate.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • go/gen/proto/hydra/v1/certificate_restate.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • go/go.sum is excluded by !**/*.sum
📒 Files selected for processing (21)
  • go/apps/ctrl/config.go (2 hunks)
  • go/apps/ctrl/internal/caches/caches.go (1 hunks)
  • go/apps/ctrl/run.go (5 hunks)
  • go/apps/ctrl/services/acme/pem.go (1 hunks)
  • go/apps/ctrl/services/acme/providers/cloudflare_provider.go (1 hunks)
  • go/apps/ctrl/services/acme/providers/provider.go (1 hunks)
  • go/apps/ctrl/services/acme/providers/route53_provider.go (1 hunks)
  • go/apps/ctrl/services/acme/user.go (5 hunks)
  • go/apps/ctrl/workflows/certificate/process_challenge_handler.go (6 hunks)
  • go/apps/ctrl/workflows/certificate/renew_handler.go (1 hunks)
  • go/apps/ctrl/workflows/certificate/service.go (4 hunks)
  • go/cmd/ctrl/main.go (2 hunks)
  • go/go.mod (1 hunks)
  • go/pkg/db/bulk_custom_domain_upsert.sql_generated.go (1 hunks)
  • go/pkg/db/custom_domain_find_by_domain_or_wildcard.sql_generated.go (1 hunks)
  • go/pkg/db/custom_domain_upsert.sql_generated.go (1 hunks)
  • go/pkg/db/querier_bulk_generated.go (1 hunks)
  • go/pkg/db/querier_generated.go (2 hunks)
  • go/pkg/db/queries/custom_domain_find_by_domain_or_wildcard.sql (1 hunks)
  • go/pkg/db/queries/custom_domain_upsert.sql (1 hunks)
  • go/proto/hydra/v1/certificate.proto (2 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-09-11T14:24:40.988Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/services/deployment/deploy_workflow.go:326-334
Timestamp: 2025-09-11T14:24:40.988Z
Learning: The InsertDomains method in the bulk queries uses ON DUPLICATE KEY UPDATE, making it an upsert operation that is idempotent and safe for retries, despite the "Insert" naming convention.

Applied to files:

  • go/pkg/db/querier_generated.go
  • go/pkg/db/custom_domain_upsert.sql_generated.go
  • go/pkg/db/queries/custom_domain_upsert.sql
  • go/pkg/db/bulk_custom_domain_upsert.sql_generated.go
  • go/pkg/db/querier_bulk_generated.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token when ACME and Cloudflare are both enabled, checking that cfg.Acme.Cloudflare.ApiToken is not empty.

Applied to files:

  • go/apps/ctrl/config.go
  • go/cmd/ctrl/main.go
  • go/apps/ctrl/services/acme/providers/cloudflare_provider.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token using assert.NotEmpty() when both c.Acme.Enabled and c.Acme.Cloudflare.Enabled are true, with the error message "cloudflare API token is required when cloudflare is enabled".

Applied to files:

  • go/apps/ctrl/config.go
  • go/cmd/ctrl/main.go
  • go/apps/ctrl/services/acme/providers/cloudflare_provider.go
📚 Learning: 2025-07-17T14:24:20.403Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3631
File: go/pkg/db/bulk_keyring_insert.sql.go:23-25
Timestamp: 2025-07-17T14:24:20.403Z
Learning: In go/pkg/db/bulk_keyring_insert.sql.go and similar bulk insert generated files, hardcoded zero values for fields like size_approx and size_last_updated_at are intentional and reflect the original SQL query structure, not missing parameters.

Applied to files:

  • go/pkg/db/bulk_custom_domain_upsert.sql_generated.go
  • go/pkg/db/querier_bulk_generated.go
📚 Learning: 2025-09-12T08:01:20.792Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/pkg/db/acme_challenge_update_verified_with_expiry.sql_generated.go:31-39
Timestamp: 2025-09-12T08:01:20.792Z
Learning: Do not review or suggest changes to files with sql_generated.go suffix or other files marked as auto-generated (containing "Code generated by" comments), as these are generated by tools like sqlc and changes would be overwritten on regeneration.

Applied to files:

  • go/pkg/db/bulk_custom_domain_upsert.sql_generated.go
  • go/pkg/db/custom_domain_find_by_domain_or_wildcard.sql_generated.go
📚 Learning: 2025-09-25T18:49:28.948Z
Learnt from: mcstepp
Repo: unkeyed/unkey PR: 4010
File: go/apps/ctrl/run.go:224-233
Timestamp: 2025-09-25T18:49:28.948Z
Learning: In go/apps/ctrl/run.go, mcstepp prefers to keep ACME service handlers wrapped with auth interceptor during the demo phase for simplicity, even though typically ACME routes would be exempt from authentication. This is part of the temporary demo authentication system.

Applied to files:

  • go/cmd/ctrl/main.go
📚 Learning: 2025-09-11T14:13:12.124Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/services/acme/providers/cloudflare_provider.go:56-69
Timestamp: 2025-09-11T14:13:12.124Z
Learning: For the default domain in the Unkey system (go/apps/ctrl/services/acme/providers/cloudflare_provider.go), only wildcard certificates are issued. When domain == p.defaultDomain, the code should always convert to the wildcard form ("*." + domain) and not attempt fallback lookups to the base domain form.

Applied to files:

  • go/apps/ctrl/run.go
🧬 Code graph analysis (13)
go/pkg/db/querier_generated.go (4)
go/pkg/db/interface.go (1)
  • DBTX (29-34)
go/pkg/db/custom_domain_find_by_domain_or_wildcard.sql_generated.go (1)
  • FindCustomDomainByDomainOrWildcardParams (20-24)
go/pkg/db/models_generated.go (1)
  • CustomDomain (727-734)
go/pkg/db/custom_domain_upsert.sql_generated.go (1)
  • UpsertCustomDomainParams (22-29)
go/apps/ctrl/workflows/certificate/renew_handler.go (3)
go/gen/proto/hydra/v1/certificate.pb.go (6)
  • RenewExpiringCertificatesRequest (129-135)
  • RenewExpiringCertificatesRequest (148-148)
  • RenewExpiringCertificatesRequest (163-165)
  • RenewExpiringCertificatesResponse (174-181)
  • RenewExpiringCertificatesResponse (194-194)
  • RenewExpiringCertificatesResponse (209-211)
go/pkg/db/models_generated.go (3)
  • AcmeChallengesChallengeType (16-16)
  • AcmeChallengesChallengeTypeDNS01 (20-20)
  • AcmeChallengesChallengeTypeHTTP01 (19-19)
go/pkg/db/acme_challenge_list_executable.sql_generated.go (1)
  • ListExecutableChallengesRow (21-25)
go/pkg/db/custom_domain_upsert.sql_generated.go (3)
go/pkg/db/models_generated.go (1)
  • CustomDomainsChallengeType (144-144)
go/pkg/db/queries.go (1)
  • Queries (3-3)
go/pkg/db/interface.go (1)
  • DBTX (29-34)
go/apps/ctrl/services/acme/user.go (3)
go/pkg/db/models_generated.go (1)
  • AcmeUser (639-646)
go/pkg/db/acme_user_update_registration_uri.sql_generated.go (1)
  • UpdateAcmeUserRegistrationURIParams (17-20)
go/pkg/db/types/null_string.go (1)
  • NullString (10-10)
go/apps/ctrl/services/acme/providers/route53_provider.go (3)
go/pkg/cache/interface.go (1)
  • Cache (7-49)
go/apps/ctrl/config.go (1)
  • Route53Config (35-51)
go/apps/ctrl/services/acme/providers/provider.go (3)
  • Provider (34-39)
  • NewProvider (49-66)
  • ProviderConfig (41-46)
go/apps/ctrl/internal/caches/caches.go (2)
go/pkg/cache/interface.go (1)
  • Cache (7-49)
go/pkg/clock/interface.go (1)
  • Clock (12-18)
go/apps/ctrl/config.go (2)
go/apps/ctrl/services/acme/providers/cloudflare_provider.go (1)
  • CloudflareConfig (13-18)
go/apps/ctrl/services/acme/providers/route53_provider.go (1)
  • Route53Config (14-24)
go/pkg/db/bulk_custom_domain_upsert.sql_generated.go (3)
go/pkg/db/queries.go (1)
  • BulkQueries (4-4)
go/pkg/db/interface.go (1)
  • DBTX (29-34)
go/pkg/db/custom_domain_upsert.sql_generated.go (1)
  • UpsertCustomDomainParams (22-29)
go/cmd/ctrl/main.go (4)
go/pkg/cli/flag.go (4)
  • String (419-453)
  • Default (364-416)
  • EnvVar (320-339)
  • Bool (497-535)
go/apps/ctrl/services/acme/providers/cloudflare_provider.go (1)
  • CloudflareConfig (13-18)
go/apps/ctrl/config.go (2)
  • CloudflareConfig (27-33)
  • Route53Config (35-51)
go/apps/ctrl/services/acme/providers/route53_provider.go (1)
  • Route53Config (14-24)
go/apps/ctrl/workflows/certificate/service.go (3)
go/apps/ctrl/services/acme/service.go (1)
  • Service (9-13)
go/apps/ctrl/workflows/routing/service.go (1)
  • Service (18-23)
go/apps/ctrl/services/acme/providers/provider.go (1)
  • DNSProvider (26-30)
go/pkg/db/custom_domain_find_by_domain_or_wildcard.sql_generated.go (3)
go/pkg/db/queries.go (1)
  • Queries (3-3)
go/pkg/db/interface.go (1)
  • DBTX (29-34)
go/pkg/db/models_generated.go (1)
  • CustomDomain (727-734)
go/pkg/db/querier_bulk_generated.go (1)
go/pkg/db/interface.go (1)
  • DBTX (29-34)
go/apps/ctrl/workflows/certificate/process_challenge_handler.go (5)
go/gen/proto/hydra/v1/certificate.pb.go (6)
  • ProcessChallengeRequest (25-31)
  • ProcessChallengeRequest (44-44)
  • ProcessChallengeRequest (59-61)
  • ProcessChallengeResponse (77-83)
  • ProcessChallengeResponse (96-96)
  • ProcessChallengeResponse (111-113)
go/apps/ctrl/services/acme/user.go (2)
  • GetOrCreateUser (49-112)
  • UserConfig (41-47)
go/apps/ctrl/services/acme/pem.go (1)
  • GetCertificateExpiry (11-23)
go/pkg/uid/uid.go (1)
  • CertificatePrefix (53-53)
go/pkg/db/certificate_insert.sql_generated.go (1)
  • InsertCertificateParams (23-31)
🔇 Additional comments (23)
go/apps/ctrl/config.go (1)

35-51: Route53 ACME config struct matches provider needs and is clear.

Field names and comments line up with the runtime Route53Config in go/apps/ctrl/services/acme/providers/route53_provider.go (access key, secret, region, hosted zone override), so wiring from CLI → config → provider should be straightforward. No changes needed here.

go/apps/ctrl/services/acme/pem.go (1)

10-23: PEM certificate expiry helper is correct and straightforward.

Decoding the first PEM block, guarding against nil, then using x509.ParseCertificate and NotAfter.UnixMilli() is exactly what’s needed here. Error messages are clear and wrapped, and this should work fine with typical leaf-first certificate bundles.

go/pkg/db/querier_generated.go (2)

168-187: Wildcard-aware custom domain lookup behavior looks correct.

The new FindCustomDomainByDomainOrWildcard method and its SQL comment implement the expected behavior: check both the exact domain and its wildcard variant, order so that an exact match wins (CASE WHEN domain = ? THEN 0 ELSE 1 END), and LIMIT 1 to avoid ambiguity. This lines up with the existing FindCustomDomainByDomain and the CustomDomain model shape.


2114-2122: UpsertCustomDomain interface matches the SQL upsert semantics and is idempotent.

The UpsertCustomDomain entry mirrors the new SQL (INSERT ... ON DUPLICATE KEY UPDATE workspace_id, challenge_type, updated_at) and the generated UpsertCustomDomainParams struct, giving you a clean, idempotent upsert for custom_domains. This is consistent with prior “insert but really upsert” patterns described in earlier learnings for domain-related queries.

go/pkg/db/querier_bulk_generated.go (1)

16-16: Bulk upsert for custom domains integrates cleanly with BulkQuerier.

UpsertCustomDomain(ctx, db, []UpsertCustomDomainParams) follows the same pattern as other bulk insert/upsert methods and matches the generated params struct, giving you an efficient batched path for domain upserts.

go/pkg/db/queries/custom_domain_upsert.sql (1)

1-7: Custom domain upsert SQL is idempotent and matches the generated Go API.

The UpsertCustomDomain query inserts (id, workspace_id, domain, challenge_type, created_at) and on duplicate key updates workspace_id, challenge_type, and updated_at. This gives you a clean, retry‑safe upsert semantics for custom_domains, similar to the existing “insert but upsert” patterns called out in previous learnings.

go/pkg/db/queries/custom_domain_find_by_domain_or_wildcard.sql (1)

1-6: Domain-or-wildcard lookup query correctly prefers exact matches.

WHERE domain IN (?, ?) combined with ORDER BY CASE WHEN domain = ? THEN 0 ELSE 1 END LIMIT 1 gives you the intended behavior: try the exact domain and its wildcard variant and return at most one row, preferring the exact match when both exist. This fits the new FindCustomDomainByDomainOrWildcard method in the generated Go code.

go/cmd/ctrl/main.go (1)

207-220: Ctrl config wiring for ACME EmailDomain and Route53 matches CLI and config structs.

Acme.Enabled, Acme.EmailDomain, Cloudflare, and Route53 are all populated from their respective flags/env vars, and the Route53 fields (Enabled, AccessKeyID, SecretAccessKey, Region, HostedZoneID) align with ctrl.Route53Config and the provider’s expectations. This should work smoothly with the new validation logic in ctrl.Config.Validate().

go/go.mod (1)

79-79: Route53 service module addition supports the lego-based Route53 ACME provider.

Adding github.com/aws/aws-sdk-go-v2/service/route53 v1.53.1 as an indirect dependency is justified: it's pulled in by lego's route53 DNS provider (github.com/go-acme/lego/v4/providers/dns/route53), used in route53_provider.go. This fits naturally with the existing aws-sdk-go-v2 stack (config, credentials, s3) already in use.

go/proto/hydra/v1/certificate.proto (1)

17-20: LGTM! Well-designed API for certificate renewal.

The RenewExpiringCertificates RPC is well-documented with clear guidance on its singleton key pattern and periodic execution model. The request/response structure provides appropriate configurability and observability metrics.

go/apps/ctrl/workflows/certificate/renew_handler.go (1)

87-93: Verify idempotency key behavior for same-day reruns.

The idempotency key at line 92 uses only the date (formatted as "2006-01-02"), which prevents duplicate schedules on the same day. However, if the job is manually triggered multiple times in a day, each run will attempt to schedule the next run for the same nextRunDate, and the idempotency key will prevent all but the first from succeeding.

This is likely the intended behavior, but verify that:

  1. Manual reruns during the same day won't cause issues with scheduling.
  2. The 24-hour delay is always calculated from the current time, not a fixed time of day.

If you want the job to run at a consistent time each day (e.g., always at midnight), consider using a fixed timestamp rather than adding 24 hours to the current time.

go/apps/ctrl/services/acme/user.go (3)

22-31: LGTM on the EmailDomain addition and GetEmail refactor.

The new EmailDomain field properly decouples the email domain from hardcoded values, and GetEmail() correctly constructs the email using the workspace ID and configurable domain.


49-56: Good defensive error handling added.

The added error wrapping distinguishes between "not found" (which triggers registration) and other errors (which now propagate with context). This is a solid improvement.


114-172: LGTM on the register function updates.

The register function correctly:

  1. Includes EmailDomain in the new user creation
  2. Persists the registration URI after successful ACME registration
  3. Uses proper error wrapping throughout

The flow is consistent with the GetOrCreateUser function's handling.

go/apps/ctrl/services/acme/providers/provider.go (3)

48-66: LGTM on NewProvider validation.

The validation using assert.All ensures all required dependencies are provided before constructing the provider. This is a clean pattern for constructor validation.


68-92: LGTM on resolveDomain with SWR caching.

The domain resolution logic correctly:

  1. Builds a cache key covering both exact and wildcard lookups
  2. Uses stale-while-revalidate caching to reduce DB load
  3. Returns a specific sentinel error for not-found cases

11-11: The import path is correct and no issue exists.

The import github.com/unkeyed/unkey/go/internal/services/caches at line 11 points to go/internal/services/caches/caches.go, which exists in the repository. While go/apps/ctrl/internal/caches/ also exists, the file correctly imports from the shared internal services package rather than the app-specific one. This is a valid import path and will compile without errors.

Likely an incorrect or invalid review comment.

go/apps/ctrl/run.go (2)

357-371: Idempotency key is static — restarts will skip the initial send after first run.

The idempotency key "cert-renewal-cron-startup" is constant. After the first successful send, subsequent restarts will be deduplicated by Restate and won't trigger a new renewal check. While this prevents duplicate crons on rapid restarts, it may also prevent the cron from being re-triggered if it fails or is canceled.

Consider whether this is the intended behavior, or if you want to use a time-based idempotency key (e.g., including the date) similar to how the renewal handler schedules its next run.


492-527: LGTM on bootstrapWildcardDomain logic.

The function correctly:

  1. Checks for existing wildcard domain before creating
  2. Uses an internal workspace ID for platform-managed resources
  3. Creates both the custom domain and an initial ACME challenge with "waiting" status
  4. Logs appropriately at each step

The error handling is defensive (logs and returns rather than propagating), which is appropriate for a bootstrap operation that shouldn't block startup.

go/apps/ctrl/workflows/certificate/process_challenge_handler.go (4)

65-70: Defer compensation logic is correct but consider clarifying the condition.

The defer correctly marks the challenge as failed when an error occurs. The resp.Status == "failed" check handles the case at lines 79-83 where obtainCertificate fails but returns a response with "failed" status instead of an error. This is intentional but could benefit from a comment explaining why both conditions are checked.


181-186: Good fallback for certificate expiry parsing failure.

Using a 90-day default (matching Let's Encrypt's certificate validity) when parsing fails is a reasonable fallback that ensures the renewal cron can still function.


205-223: LGTM on persistCertificate.

The function correctly uses the pre-generated certificate ID from EncryptedCertificate and leverages the upsert pattern for handling both new certificates and renewals.


225-236: Compensation runs inside Restate context correctly.

Using restate.Run for the compensation ensures the failure status update is durable and will complete even if the process crashes during cleanup.

@Flo4604
Copy link
Member Author

Flo4604 commented Dec 4, 2025

@coderabbitai review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 4, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
go/apps/ctrl/workflows/certificate/process_challenge_handler.go (1)

72-101: Fix step numbering: jumps from Step 3 to Step 5.

The step comments skip Step 4: Line 72 says "Step 3", Line 85 says "Step 5", and Line 93 says "Step 6". Either renumber sequentially or add a comment explaining what Step 4 was (if removed).

-	// Step 5: Persist certificate to DB
+	// Step 4: Persist certificate to DB
 	certID, err := restate.Run(ctx, func(stepCtx restate.RunContext) (string, error) {
 		return s.persistCertificate(stepCtx, dom, req.GetDomain(), cert)
 	}, restate.WithName("persist certificate"))
 	if err != nil {
 		return nil, err
 	}

-	// Step 6: Mark challenge as verified
+	// Step 5: Mark challenge as verified
 	_, err = restate.Run(ctx, func(stepCtx restate.RunContext) (restate.Void, error) {
🧹 Nitpick comments (2)
go/apps/ctrl/run.go (1)

525-542: Consider extracting the hardcoded workspace ID to a constant.

The hardcoded "unkey_internal" workspace ID is used for platform-managed resources. Consider defining this as a package-level constant for maintainability and to ensure consistency if it's used elsewhere.

+const internalWorkspaceID = "unkey_internal"
+
 func bootstrapWildcardDomain(ctx context.Context, database db.Database, logger logging.Logger, defaultDomain string) {
 	wildcardDomain := "*." + defaultDomain
 	// ...
-	workspaceID := "unkey_internal"
+	workspaceID := internalWorkspaceID
go/apps/ctrl/services/acme/providers/http_provider.go (1)

17-20: Consider removing unused db field from httpDNS.

The db field is assigned in NewHTTPProvider (line 59) but never referenced in Present or CleanUp methods. Since the comment states "The actual DB update is handled by the generic Provider wrapper," this field appears to be dead code.

 type httpDNS struct {
-	db     db.Database
 	logger logging.Logger
 }

Also update the constructor:

 		DNS: &httpDNS{
-			db:     cfg.DB,
 			logger: cfg.Logger,
 		},
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4292117 and fe3c721.

📒 Files selected for processing (7)
  • go/apps/api/routes/v2_identities_list_identities/handler.go (1 hunks)
  • go/apps/ctrl/run.go (6 hunks)
  • go/apps/ctrl/services/acme/providers/http_provider.go (1 hunks)
  • go/apps/ctrl/services/acme/providers/route53_provider.go (1 hunks)
  • go/apps/ctrl/workflows/certificate/process_challenge_handler.go (6 hunks)
  • go/apps/ctrl/workflows/certificate/service.go (4 hunks)
  • go/pkg/db/handle_err_no_rows.go (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-08-21T15:54:45.198Z
Learnt from: chronark
Repo: unkeyed/unkey PR: 3825
File: go/internal/services/usagelimiter/limit.go:38-0
Timestamp: 2025-08-21T15:54:45.198Z
Learning: In go/internal/services/usagelimiter/limit.go, the UpdateKeyCreditsDecrement operation cannot be safely wrapped with db.WithRetry due to the lack of idempotency mechanisms in the current tech stack. Retrying this non-idempotent write operation risks double-charging users if the first attempt commits but the client sees a transient error.

Applied to files:

  • go/apps/api/routes/v2_identities_list_identities/handler.go
📚 Learning: 2025-09-11T14:13:12.124Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/services/acme/providers/cloudflare_provider.go:56-69
Timestamp: 2025-09-11T14:13:12.124Z
Learning: For the default domain in the Unkey system (go/apps/ctrl/services/acme/providers/cloudflare_provider.go), only wildcard certificates are issued. When domain == p.defaultDomain, the code should always convert to the wildcard form ("*." + domain) and not attempt fallback lookups to the base domain form.

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token using assert.NotEmpty() when both c.Acme.Enabled and c.Acme.Cloudflare.Enabled are true, with the error message "cloudflare API token is required when cloudflare is enabled".

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token when ACME and Cloudflare are both enabled, checking that cfg.Acme.Cloudflare.ApiToken is not empty.

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-09-25T18:49:28.948Z
Learnt from: mcstepp
Repo: unkeyed/unkey PR: 4010
File: go/apps/ctrl/run.go:224-233
Timestamp: 2025-09-25T18:49:28.948Z
Learning: In go/apps/ctrl/run.go, mcstepp prefers to keep ACME service handlers wrapped with auth interceptor during the demo phase for simplicity, even though typically ACME routes would be exempt from authentication. This is part of the temporary demo authentication system.

Applied to files:

  • go/apps/ctrl/run.go
🧬 Code graph analysis (6)
go/apps/ctrl/workflows/certificate/service.go (4)
go/apps/ctrl/services/acme/service.go (1)
  • Service (9-13)
go/apps/ctrl/workflows/routing/service.go (1)
  • Service (18-23)
go/apps/ctrl/services/acme/providers/provider.go (1)
  • DNSProvider (26-30)
go/apps/ctrl/services/acme/providers/http_provider.go (1)
  • HTTPProvider (45-45)
go/apps/api/routes/v2_identities_list_identities/handler.go (1)
go/pkg/db/handle_err_no_rows.go (1)
  • IsNotFound (10-12)
go/apps/ctrl/services/acme/providers/route53_provider.go (2)
go/apps/ctrl/config.go (1)
  • Route53Config (35-51)
go/apps/ctrl/services/acme/providers/provider.go (3)
  • Provider (34-39)
  • NewProvider (49-66)
  • ProviderConfig (41-46)
go/apps/ctrl/run.go (6)
go/apps/ctrl/internal/caches/caches.go (2)
  • New (22-43)
  • Config (17-20)
go/pkg/uid/uid.go (1)
  • New (103-152)
go/apps/ctrl/services/acme/providers/http_provider.go (3)
  • NewHTTPProvider (54-64)
  • HTTPConfig (47-51)
  • HTTPProvider (45-45)
go/apps/ctrl/services/acme/providers/cloudflare_provider.go (2)
  • NewCloudflareProvider (21-38)
  • CloudflareConfig (13-18)
go/apps/ctrl/services/acme/providers/route53_provider.go (2)
  • NewRoute53Provider (33-55)
  • Route53Config (13-23)
go/pkg/db/custom_domain_upsert.sql_generated.go (1)
  • UpsertCustomDomainParams (22-29)
go/apps/ctrl/workflows/certificate/process_challenge_handler.go (7)
go/pkg/db/models_generated.go (1)
  • Certificate (703-711)
go/apps/ctrl/workflows/certificate/service.go (1)
  • Service (20-29)
go/gen/proto/hydra/v1/certificate.pb.go (6)
  • ProcessChallengeRequest (25-31)
  • ProcessChallengeRequest (44-44)
  • ProcessChallengeRequest (59-61)
  • ProcessChallengeResponse (77-83)
  • ProcessChallengeResponse (96-96)
  • ProcessChallengeResponse (111-113)
go/apps/ctrl/services/acme/user.go (2)
  • GetOrCreateUser (49-112)
  • UserConfig (41-47)
go/pkg/retry/retry.go (2)
  • Attempts (83-88)
  • Backoff (93-98)
go/apps/ctrl/services/acme/pem.go (1)
  • GetCertificateExpiry (11-23)
go/pkg/db/certificate_insert.sql_generated.go (1)
  • InsertCertificateParams (23-31)
go/apps/ctrl/services/acme/providers/http_provider.go (1)
go/apps/ctrl/services/acme/providers/provider.go (1)
  • ProviderConfig (41-46)
🔇 Additional comments (20)
go/pkg/db/handle_err_no_rows.go (1)

8-11: LGTM!

Clean utility function for consistent not-found error handling across the codebase. Using errors.Is correctly handles wrapped errors.

go/apps/api/routes/v2_identities_list_identities/handler.go (1)

107-111: LGTM!

Good adoption of the centralized db.IsNotFound helper for consistent not-found error handling.

go/apps/ctrl/services/acme/providers/route53_provider.go (1)

25-55: LGTM! Previous thread-safety concern has been addressed.

The os.Setenv("LEGO_DISABLE_CNAME_SUPPORT", "true") call has been correctly moved to application startup in run.go (line 60), and the documentation here properly notes this prerequisite. The implementation follows the established pattern from CloudflareProvider.

go/apps/ctrl/workflows/certificate/service.go (2)

22-28: LGTM!

The Service struct is cleanly extended with the new ACME provider fields. The inline comments clearly distinguish DNS-01 (wildcard certs) from HTTP-01 (regular certs) use cases.


43-55: LGTM!

Config fields are well-documented and properly aligned with the Service struct fields.

go/apps/ctrl/run.go (4)

57-60: LGTM! Thread-safety concern addressed.

Setting the environment variable once at application startup before any concurrent execution is the correct fix for the previously identified thread-safety issue.


260-313: LGTM! Previous DomainCache concerns addressed.

The cache initialization at lines 265-270 properly precedes provider creation, and DomainCache is correctly passed to all providers (Cloudflare at line 290, Route53 at line 305, HTTP at line 276).


391-403: LGTM! Idempotent cron startup.

Using restate.WithIdempotencyKey("cert-renewal-cron-startup") ensures that multiple application restarts don't create duplicate cron jobs.


544-559: Database constraints already prevent duplicate ACME challenges.

The acme_challenges table enforces a PRIMARY KEY constraint on domain_id, making duplicates impossible at the database level. The code's guard checking for existing custom domains (FindCustomDomainByDomain) ensures the function returns early on subsequent restarts, preventing any insertion attempt. The proposed fix references FindAcmeChallengeByDomainID, which does not exist in the codebase—only FindAcmeChallengeByToken is available. No changes are needed.

Likely an incorrect or invalid review comment.

go/apps/ctrl/services/acme/providers/http_provider.go (3)

22-35: LGTM!

The Present and CleanUp methods correctly delegate actual challenge handling to the generic Provider wrapper while maintaining proper logging. The no-op CleanUp is appropriate since HTTP-01 tokens can remain in DB until overwritten.


37-41: LGTM!

The timeout values (90s timeout, 3s interval) are appropriate for HTTP-01 challenges which typically resolve faster than DNS-based challenges.


43-64: LGTM!

The type alias provides a clear public interface while reusing the generic Provider implementation. The HTTPConfig struct aligns with ProviderConfig requirements, and the constructor properly wires all dependencies.

go/apps/ctrl/workflows/certificate/process_challenge_handler.go (8)

20-26: LGTM!

The EncryptedCertificate struct is well-designed to hold all necessary certificate data including the encrypted private key for secure storage.


65-70: LGTM!

The saga compensation correctly uses a defer to capture the final error state, ensuring markChallengeFailed is called on any failure path after claiming the challenge. The nil check on resp prevents panic when returning early with an error.


121-124: LGTM!

The wildcard detection correctly identifies domains starting with *. and handles edge cases by checking minimum length.


126-160: LGTM!

The provider selection logic correctly routes wildcard domains to DNS-01 and regular domains to HTTP-01. The nil checks provide clear error messages when providers aren't configured.


178-199: LGTM!

The backoff calculation now correctly uses (attempt-1) to produce the documented delays of 30s, 60s, 120s with a 5-minute cap. The retry logic is appropriate for handling transient ACME failures.


201-206: LGTM!

The 90-day fallback matches Let's Encrypt's standard certificate validity period, and logging the warning ensures visibility when parsing fails.


208-222: LGTM!

Encrypting the private key before storage is essential for security. Using dom.WorkspaceID as the keyring provides proper key isolation per workspace.


245-256: LGTM!

The compensation correctly logs failures without propagating errors, ensuring the main workflow's error state isn't overwritten. Discarding the outer restate.Run error is intentional since this is best-effort cleanup.

@Flo4604 Flo4604 marked this pull request as ready for review December 4, 2025 16:38
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
go/apps/ctrl/run.go (1)

526-561: Consider extracting "unkey_internal" workspace ID to a package-level constant.

The hardcoded "unkey_internal" string should be defined as a constant for clarity and to enable reuse across the codebase.

The empty Token and Authorization fields in the bootstrap challenge record are intentional and correct—the renewal cron queries challenges by status (via ListExecutableChallenges, which filters WHERE status = 'waiting'), not by token, so this record will be properly discovered during renewal.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe3c721 and 13acff3.

📒 Files selected for processing (5)
  • go/apps/ctrl/internal/caches/caches.go (1 hunks)
  • go/apps/ctrl/run.go (7 hunks)
  • go/apps/ctrl/services/acme/certificate_verification.go (2 hunks)
  • go/apps/ctrl/services/acme/service.go (1 hunks)
  • go/apps/ctrl/workflows/certificate/process_challenge_handler.go (6 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-15T18:12:01.503Z
Learnt from: mcstepp
Repo: unkeyed/unkey PR: 3952
File: go/apps/ctrl/run.go:212-213
Timestamp: 2025-09-15T18:12:01.503Z
Learning: In Go connect-go generated handlers, mux.Handle can accept New*ServiceHandler functions directly even though they return (string, http.Handler) tuples. The pattern mux.Handle(ctrlv1connect.New*ServiceHandler(...)) is valid and compiles successfully in unkey codebase.

Applied to files:

  • go/apps/ctrl/services/acme/service.go
  • go/apps/ctrl/run.go
📚 Learning: 2025-09-25T18:49:28.948Z
Learnt from: mcstepp
Repo: unkeyed/unkey PR: 4010
File: go/apps/ctrl/run.go:224-233
Timestamp: 2025-09-25T18:49:28.948Z
Learning: In go/apps/ctrl/run.go, mcstepp prefers to keep ACME service handlers wrapped with auth interceptor during the demo phase for simplicity, even though typically ACME routes would be exempt from authentication. This is part of the temporary demo authentication system.

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-09-11T14:13:12.124Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/services/acme/providers/cloudflare_provider.go:56-69
Timestamp: 2025-09-11T14:13:12.124Z
Learning: For the default domain in the Unkey system (go/apps/ctrl/services/acme/providers/cloudflare_provider.go), only wildcard certificates are issued. When domain == p.defaultDomain, the code should always convert to the wildcard form ("*." + domain) and not attempt fallback lookups to the base domain form.

Applied to files:

  • go/apps/ctrl/run.go
  • go/apps/ctrl/services/acme/certificate_verification.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token using assert.NotEmpty() when both c.Acme.Enabled and c.Acme.Cloudflare.Enabled are true, with the error message "cloudflare API token is required when cloudflare is enabled".

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token when ACME and Cloudflare are both enabled, checking that cfg.Acme.Cloudflare.ApiToken is not empty.

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-07-15T14:59:30.212Z
Learnt from: chronark
Repo: unkeyed/unkey PR: 3560
File: go/deploy/metald/internal/database/repository.go:0-0
Timestamp: 2025-07-15T14:59:30.212Z
Learning: go/deploy/metald cannot currently import helpers from go/pkg/db because it is not yet part of the main Go module; avoid suggesting such imports until the modules are unified.

Applied to files:

  • go/apps/ctrl/services/acme/certificate_verification.go
🧬 Code graph analysis (3)
go/apps/ctrl/internal/caches/caches.go (6)
go/pkg/cache/interface.go (1)
  • Cache (7-49)
go/pkg/db/models_generated.go (2)
  • CustomDomain (727-734)
  • AcmeChallenge (627-637)
go/pkg/clock/interface.go (1)
  • Clock (12-18)
go/apps/ctrl/services/acme/service.go (2)
  • New (25-33)
  • Config (18-23)
go/apps/ctrl/workflows/certificate/service.go (2)
  • New (58-69)
  • Config (34-55)
go/apps/ctrl/config.go (1)
  • Config (95-171)
go/apps/ctrl/services/acme/service.go (5)
go/pkg/cache/interface.go (1)
  • Cache (7-49)
go/pkg/db/models_generated.go (2)
  • CustomDomain (727-734)
  • AcmeChallenge (627-637)
go/apps/ctrl/internal/caches/caches.go (2)
  • New (23-58)
  • Config (18-21)
go/apps/ctrl/workflows/certificate/service.go (3)
  • New (58-69)
  • Config (34-55)
  • Service (20-29)
go/apps/ctrl/workflows/routing/service.go (3)
  • New (40-47)
  • Config (28-37)
  • Service (18-23)
go/apps/ctrl/workflows/certificate/process_challenge_handler.go (4)
go/pkg/db/models_generated.go (2)
  • Certificate (703-711)
  • CustomDomain (727-734)
go/apps/ctrl/workflows/certificate/service.go (2)
  • Service (20-29)
  • New (58-69)
go/apps/ctrl/services/acme/user.go (2)
  • GetOrCreateUser (49-112)
  • UserConfig (41-47)
go/pkg/db/handle_err_no_rows.go (1)
  • IsNotFound (10-12)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Go API Local / Test
  • GitHub Check: Lint Go Code / Lint
  • GitHub Check: Test API / API Test Local
  • GitHub Check: Build / Build
  • GitHub Check: Test Packages / Test
🔇 Additional comments (14)
go/apps/ctrl/services/acme/certificate_verification.go (2)

20-35: LGTM on domain cache lookup with SWR pattern.

The stale-while-revalidate cache pattern is correctly implemented with proper error handling - checking for non-NotFound errors first, then handling cache misses via cache.Null or db.IsNotFound.


37-55: Challenge cache lookup is well-structured.

The composite key format (domainID|token) ensures uniqueness, and the short TTL (10s fresh / 30s stale from caches.go) is appropriate for challenge data that changes during the ACME flow.

go/apps/ctrl/services/acme/service.go (1)

10-32: LGTM on cache dependency injection.

The cache fields follow the established service configuration pattern in the codebase (e.g., routing.Service, certificate.Service). The caches are properly wired from the centralized caches.Caches container.

go/apps/ctrl/internal/caches/caches.go (1)

23-57: Well-designed cache configuration.

Good choices:

  • Clock injection for testability with sensible default
  • Short TTL (10s/30s) for challenges that change during ACME flow
  • Longer TTL (5m/10m) for relatively stable domain data
  • Proper error propagation from cache initialization
go/apps/ctrl/run.go (4)

260-266: LGTM on cache initialization placement.

Caches are correctly initialized before the ACME providers that depend on them.


268-313: Provider setup correctly wires domain cache.

All three providers (HTTP, Cloudflare, Route53) now receive the DomainCache as required. The previous review comments about missing cache initialization have been addressed.


389-403: LGTM on renewal cron initialization.

Good practices:

  • Idempotency key prevents duplicate crons across restarts
  • Send() for fire-and-forget is appropriate for cron startup
  • Non-fatal warning on failure allows service to continue

57-60: The environment variable timing here is correct. The lego library reads LEGO_DISABLE_CNAME_SUPPORT from the environment at the time providers are created, not at import/init time. Since os.Setenv is called at line 60 and providers are created much later (lines 268-313), the variable is reliably available when needed. The existing code comment adequately documents the requirement.

go/apps/ctrl/workflows/certificate/process_challenge_handler.go (6)

65-70: Compensation logic has a subtle issue with response handling.

The defer checks resp != nil && resp.Status == "failed", but at line 78-83 when obtainCertificate fails, you return a response with Status: "failed" and err = nil. This correctly triggers the compensation.

However, the current code has (resp *hydrav1.ProcessChallengeResponse, err error) as named returns. If any later step fails with a non-nil error, resp will be nil, and the check resp != nil && resp.Status == "failed" won't match, but err != nil will. This is correct.

One edge case: if err != nil at line 78, you still return nil for error - verify this is intentional for Restate workflow semantics (returning a "failed" status vs propagating error).


176-192: Backoff calculation now matches documentation.

The formula 30<<(attempt-1) correctly produces 30s, 60s, 120s delays for attempts 1, 2, 3, matching the comment. The past review issue has been addressed.


218-248: Certificate ID handling for renewals is now correct.

The code properly handles the renewal case by:

  1. Checking for existing certificate by hostname (line 224)
  2. Reusing the existing ID for renewals (lines 228-231)
  3. Only using the new ID for first-time issuance

This addresses the past review concern about ID mismatches.


126-160: LGTM on ACME client setup with provider selection.

Clean separation of concerns:

  • Single global ACME user for simplified certificate management
  • DNS-01 for wildcard domains (required by ACME spec)
  • HTTP-01 for regular domains (faster, no DNS propagation)
  • Clear error messages when required providers are missing

250-261: Compensation handler correctly uses best-effort semantics.

Ignoring the error from restate.Run is appropriate for compensation - the operation is best-effort and shouldn't fail the main workflow if it can't mark the challenge as failed. The error is logged for debugging.


162-216: LGTM on certificate obtain flow.

Good practices:

  • Comment explains why ACME client creation is inside this function (not serializable)
  • Sensible 90-day default expiry matches Let's Encrypt standard
  • Private key encryption via Vault before storage
  • Retry with exponential backoff for transient ACME failures

@Flo4604 Flo4604 force-pushed the feat/ctrl-lets-encrypt-challenges branch from 13acff3 to 7b70331 Compare December 4, 2025 17:36
@vercel vercel bot temporarily deployed to Preview – engineering December 4, 2025 17:37 Inactive
@vercel vercel bot temporarily deployed to Preview – dashboard December 4, 2025 17:37 Inactive
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
go/apps/ctrl/services/acme/user.go (1)

103-108: Consider the impact of silently failing to persist the registration URI.

If UpdateAcmeUserRegistrationURI fails, subsequent calls will repeatedly trigger ACME registration (lines 90-99), potentially hitting rate limits. While a warning log is reasonable for transient failures, consider whether repeated failures warrant returning an error to avoid unnecessary ACME calls.

go/apps/ctrl/config.go (1)

227-236: LGTM on Route53 validation, but consider validating EmailDomain.

The Route53 validation follows the same pattern as Cloudflare. However, EmailDomain (line 58) is used in ACME registration emails but isn't validated when ACME is enabled. Consider adding:

 // Validate Cloudflare configuration if enabled
 if c.Acme.Enabled && c.Acme.Cloudflare.Enabled {
+    if err := assert.NotEmpty(c.Acme.EmailDomain, "acme email domain is required when acme is enabled"); err != nil {
+        return err
+    }
     if err := assert.NotEmpty(c.Acme.Cloudflare.ApiToken, "cloudflare API token is required when cloudflare is enabled"); err != nil {
         return err
     }
 }

Or validate it when any DNS provider is enabled.

go/apps/ctrl/workflows/certificate/service.go (1)

58-68: Consider validating the new configuration fields.

The New constructor doesn't validate that EmailDomain, DefaultDomain, DNSProvider, or HTTPProvider are properly set. If these are required for ACME operations, missing values could cause runtime failures later.

Consider adding validation similar to other services in the codebase:

func New(cfg Config) (*Service, error) {
	if cfg.EmailDomain == "" {
		return nil, errors.New("email domain is required")
	}
	if cfg.DefaultDomain == "" {
		return nil, errors.New("default domain is required")
	}
	// ... rest of initialization
}

Alternatively, if these fields are optional depending on which challenge types are enabled, document that clearly.

go/apps/ctrl/services/acme/providers/provider.go (1)

71-92: Consider documenting the cache key format.

The composite cache key domain + "|" + wildcardDomain is a good approach for caching both exact and wildcard lookups together. However, if the domain itself could contain |, this could cause cache key collisions.

Consider using a separator that's invalid in domain names, or add a brief comment explaining the key format:

 func (p *Provider) resolveDomain(ctx context.Context, domain string) (db.CustomDomain, error) {
 	wildcardDomain := "*." + domain
+	// Cache key combines exact and wildcard domains for single lookup caching
 	cacheKey := domain + "|" + wildcardDomain
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13acff3 and 7b70331.

⛔ Files ignored due to path filters (3)
  • go/gen/proto/hydra/v1/certificate.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • go/gen/proto/hydra/v1/certificate_restate.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • go/go.sum is excluded by !**/*.sum
📒 Files selected for processing (26)
  • go/apps/api/routes/v2_identities_list_identities/handler.go (1 hunks)
  • go/apps/ctrl/config.go (2 hunks)
  • go/apps/ctrl/internal/caches/caches.go (1 hunks)
  • go/apps/ctrl/run.go (7 hunks)
  • go/apps/ctrl/services/acme/certificate_verification.go (2 hunks)
  • go/apps/ctrl/services/acme/pem.go (1 hunks)
  • go/apps/ctrl/services/acme/providers/cloudflare_provider.go (1 hunks)
  • go/apps/ctrl/services/acme/providers/http_provider.go (1 hunks)
  • go/apps/ctrl/services/acme/providers/provider.go (1 hunks)
  • go/apps/ctrl/services/acme/providers/route53_provider.go (1 hunks)
  • go/apps/ctrl/services/acme/service.go (1 hunks)
  • go/apps/ctrl/services/acme/user.go (5 hunks)
  • go/apps/ctrl/workflows/certificate/process_challenge_handler.go (6 hunks)
  • go/apps/ctrl/workflows/certificate/renew_handler.go (1 hunks)
  • go/apps/ctrl/workflows/certificate/service.go (4 hunks)
  • go/cmd/ctrl/main.go (2 hunks)
  • go/go.mod (1 hunks)
  • go/pkg/db/bulk_custom_domain_upsert.sql_generated.go (1 hunks)
  • go/pkg/db/custom_domain_find_by_domain_or_wildcard.sql_generated.go (1 hunks)
  • go/pkg/db/custom_domain_upsert.sql_generated.go (1 hunks)
  • go/pkg/db/handle_err_no_rows.go (1 hunks)
  • go/pkg/db/querier_bulk_generated.go (1 hunks)
  • go/pkg/db/querier_generated.go (2 hunks)
  • go/pkg/db/queries/custom_domain_find_by_domain_or_wildcard.sql (1 hunks)
  • go/pkg/db/queries/custom_domain_upsert.sql (1 hunks)
  • go/proto/hydra/v1/certificate.proto (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • go/pkg/db/querier_generated.go
  • go/apps/api/routes/v2_identities_list_identities/handler.go
  • go/go.mod
  • go/pkg/db/bulk_custom_domain_upsert.sql_generated.go
  • go/pkg/db/queries/custom_domain_find_by_domain_or_wildcard.sql
  • go/apps/ctrl/workflows/certificate/renew_handler.go
  • go/proto/hydra/v1/certificate.proto
  • go/cmd/ctrl/main.go
  • go/pkg/db/custom_domain_find_by_domain_or_wildcard.sql_generated.go
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: mcstepp
Repo: unkeyed/unkey PR: 4010
File: go/apps/ctrl/run.go:224-233
Timestamp: 2025-09-25T18:49:28.948Z
Learning: In go/apps/ctrl/run.go, mcstepp prefers to keep ACME service handlers wrapped with auth interceptor during the demo phase for simplicity, even though typically ACME routes would be exempt from authentication. This is part of the temporary demo authentication system.
📚 Learning: 2025-09-11T14:24:40.988Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/services/deployment/deploy_workflow.go:326-334
Timestamp: 2025-09-11T14:24:40.988Z
Learning: The InsertDomains method in the bulk queries uses ON DUPLICATE KEY UPDATE, making it an upsert operation that is idempotent and safe for retries, despite the "Insert" naming convention.

Applied to files:

  • go/pkg/db/queries/custom_domain_upsert.sql
  • go/pkg/db/querier_bulk_generated.go
  • go/pkg/db/custom_domain_upsert.sql_generated.go
📚 Learning: 2025-07-17T14:24:20.403Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3631
File: go/pkg/db/bulk_keyring_insert.sql.go:23-25
Timestamp: 2025-07-17T14:24:20.403Z
Learning: In go/pkg/db/bulk_keyring_insert.sql.go and similar bulk insert generated files, hardcoded zero values for fields like size_approx and size_last_updated_at are intentional and reflect the original SQL query structure, not missing parameters.

Applied to files:

  • go/pkg/db/querier_bulk_generated.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token when ACME and Cloudflare are both enabled, checking that cfg.Acme.Cloudflare.ApiToken is not empty.

Applied to files:

  • go/apps/ctrl/config.go
  • go/apps/ctrl/services/acme/providers/cloudflare_provider.go
  • go/apps/ctrl/run.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token using assert.NotEmpty() when both c.Acme.Enabled and c.Acme.Cloudflare.Enabled are true, with the error message "cloudflare API token is required when cloudflare is enabled".

Applied to files:

  • go/apps/ctrl/config.go
  • go/apps/ctrl/services/acme/providers/cloudflare_provider.go
  • go/apps/ctrl/run.go
📚 Learning: 2025-09-11T14:13:12.124Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/services/acme/providers/cloudflare_provider.go:56-69
Timestamp: 2025-09-11T14:13:12.124Z
Learning: For the default domain in the Unkey system (go/apps/ctrl/services/acme/providers/cloudflare_provider.go), only wildcard certificates are issued. When domain == p.defaultDomain, the code should always convert to the wildcard form ("*." + domain) and not attempt fallback lookups to the base domain form.

Applied to files:

  • go/apps/ctrl/services/acme/certificate_verification.go
  • go/apps/ctrl/services/acme/providers/cloudflare_provider.go
  • go/apps/ctrl/run.go
📚 Learning: 2025-07-15T14:59:30.212Z
Learnt from: chronark
Repo: unkeyed/unkey PR: 3560
File: go/deploy/metald/internal/database/repository.go:0-0
Timestamp: 2025-07-15T14:59:30.212Z
Learning: go/deploy/metald cannot currently import helpers from go/pkg/db because it is not yet part of the main Go module; avoid suggesting such imports until the modules are unified.

Applied to files:

  • go/apps/ctrl/services/acme/certificate_verification.go
📚 Learning: 2025-09-15T18:12:01.503Z
Learnt from: mcstepp
Repo: unkeyed/unkey PR: 3952
File: go/apps/ctrl/run.go:212-213
Timestamp: 2025-09-15T18:12:01.503Z
Learning: In Go connect-go generated handlers, mux.Handle can accept New*ServiceHandler functions directly even though they return (string, http.Handler) tuples. The pattern mux.Handle(ctrlv1connect.New*ServiceHandler(...)) is valid and compiles successfully in unkey codebase.

Applied to files:

  • go/apps/ctrl/services/acme/service.go
  • go/apps/ctrl/run.go
📚 Learning: 2025-09-25T18:49:28.948Z
Learnt from: mcstepp
Repo: unkeyed/unkey PR: 4010
File: go/apps/ctrl/run.go:224-233
Timestamp: 2025-09-25T18:49:28.948Z
Learning: In go/apps/ctrl/run.go, mcstepp prefers to keep ACME service handlers wrapped with auth interceptor during the demo phase for simplicity, even though typically ACME routes would be exempt from authentication. This is part of the temporary demo authentication system.

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-10-30T15:10:52.743Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 4190
File: go/internal/services/keys/verifier.go:51-53
Timestamp: 2025-10-30T15:10:52.743Z
Learning: PR #4190 for unkeyed/unkey is focused solely on database schema and query changes for identity-based credits. It adds IdentityCredits and KeyCredits fields to structs and queries, but does not implement the priority enforcement logic in the usagelimiter. The logic implementation is intentionally deferred to a later PR in the stack.

Applied to files:

  • go/apps/ctrl/workflows/certificate/process_challenge_handler.go
🧬 Code graph analysis (12)
go/apps/ctrl/services/acme/providers/provider.go (5)
go/pkg/cache/interface.go (2)
  • Cache (7-49)
  • Null (59-59)
go/pkg/db/models_generated.go (2)
  • CustomDomain (727-734)
  • AcmeChallengesStatusPending (62-62)
go/pkg/db/queries.go (1)
  • Query (29-29)
go/pkg/db/custom_domain_find_by_domain_or_wildcard.sql_generated.go (1)
  • FindCustomDomainByDomainOrWildcardParams (20-24)
go/internal/services/caches/op.go (1)
  • DefaultFindFirstOp (9-22)
go/pkg/db/querier_bulk_generated.go (2)
go/pkg/db/interface.go (1)
  • DBTX (29-34)
go/pkg/db/custom_domain_upsert.sql_generated.go (1)
  • UpsertCustomDomainParams (22-29)
go/pkg/db/custom_domain_upsert.sql_generated.go (3)
go/pkg/db/models_generated.go (1)
  • CustomDomainsChallengeType (144-144)
go/pkg/db/queries.go (1)
  • Queries (3-3)
go/pkg/db/interface.go (1)
  • DBTX (29-34)
go/apps/ctrl/config.go (2)
go/apps/ctrl/services/acme/providers/cloudflare_provider.go (1)
  • CloudflareConfig (13-18)
go/apps/ctrl/services/acme/providers/route53_provider.go (1)
  • Route53Config (13-23)
go/apps/ctrl/services/acme/certificate_verification.go (8)
go/pkg/db/models_generated.go (2)
  • CustomDomain (727-734)
  • AcmeChallenge (627-637)
go/pkg/db/queries.go (1)
  • Query (29-29)
go/internal/services/caches/op.go (1)
  • DefaultFindFirstOp (9-22)
go/pkg/db/handle_err_no_rows.go (1)
  • IsNotFound (10-12)
go/pkg/cache/interface.go (1)
  • Null (59-59)
go/apps/ctrl/internal/caches/caches.go (1)
  • New (23-58)
go/apps/ctrl/services/acme/service.go (1)
  • New (25-33)
go/pkg/db/acme_challenge_find_by_token.sql_generated.go (1)
  • FindAcmeChallengeByTokenParams (16-20)
go/apps/ctrl/services/acme/providers/cloudflare_provider.go (4)
go/pkg/cache/interface.go (1)
  • Cache (7-49)
go/pkg/db/models_generated.go (1)
  • CustomDomain (727-734)
go/apps/ctrl/config.go (1)
  • CloudflareConfig (27-33)
go/apps/ctrl/services/acme/providers/provider.go (3)
  • Provider (34-39)
  • NewProvider (49-66)
  • ProviderConfig (41-46)
go/apps/ctrl/internal/caches/caches.go (4)
go/pkg/cache/interface.go (1)
  • Cache (7-49)
go/pkg/db/models_generated.go (2)
  • CustomDomain (727-734)
  • AcmeChallenge (627-637)
go/pkg/clock/interface.go (1)
  • Clock (12-18)
go/apps/ctrl/config.go (1)
  • Config (95-171)
go/apps/ctrl/services/acme/user.go (4)
go/pkg/db/models_generated.go (1)
  • AcmeUser (639-646)
go/pkg/db/queries.go (1)
  • Query (29-29)
go/pkg/db/acme_user_update_registration_uri.sql_generated.go (1)
  • UpdateAcmeUserRegistrationURIParams (17-20)
go/pkg/db/types/null_string.go (1)
  • NullString (10-10)
go/apps/ctrl/services/acme/service.go (3)
go/pkg/cache/interface.go (1)
  • Cache (7-49)
go/pkg/db/models_generated.go (2)
  • CustomDomain (727-734)
  • AcmeChallenge (627-637)
go/apps/ctrl/internal/caches/caches.go (2)
  • New (23-58)
  • Config (18-21)
go/apps/ctrl/run.go (7)
go/apps/ctrl/internal/caches/caches.go (2)
  • New (23-58)
  • Config (18-21)
go/apps/ctrl/services/acme/providers/provider.go (2)
  • Provider (34-39)
  • DNSProvider (26-30)
go/apps/ctrl/services/acme/providers/http_provider.go (3)
  • NewHTTPProvider (54-64)
  • HTTPConfig (47-51)
  • HTTPProvider (45-45)
go/apps/ctrl/services/acme/providers/cloudflare_provider.go (2)
  • NewCloudflareProvider (21-38)
  • CloudflareConfig (13-18)
go/apps/ctrl/services/acme/providers/route53_provider.go (2)
  • NewRoute53Provider (33-55)
  • Route53Config (13-23)
go/pkg/db/custom_domain_upsert.sql_generated.go (1)
  • UpsertCustomDomainParams (22-29)
go/pkg/db/acme_challenge_insert.sql_generated.go (1)
  • InsertAcmeChallengeParams (37-47)
go/apps/ctrl/workflows/certificate/service.go (2)
go/apps/ctrl/services/acme/providers/provider.go (1)
  • DNSProvider (26-30)
go/apps/ctrl/services/acme/providers/http_provider.go (1)
  • HTTPProvider (45-45)
go/apps/ctrl/workflows/certificate/process_challenge_handler.go (8)
go/pkg/db/models_generated.go (2)
  • Certificate (703-711)
  • CustomDomain (727-734)
go/apps/ctrl/workflows/certificate/service.go (2)
  • Service (20-29)
  • New (58-69)
go/apps/ctrl/services/acme/user.go (2)
  • GetOrCreateUser (49-112)
  • UserConfig (41-47)
go/pkg/retry/retry.go (2)
  • Attempts (83-88)
  • Backoff (93-98)
go/apps/ctrl/services/acme/pem.go (1)
  • GetCertificateExpiry (11-23)
go/gen/proto/vault/v1/service.pb.go (3)
  • EncryptRequest (104-110)
  • EncryptRequest (123-123)
  • EncryptRequest (138-140)
go/pkg/uid/uid.go (1)
  • CertificatePrefix (53-53)
go/pkg/db/certificate_insert.sql_generated.go (1)
  • InsertCertificateParams (23-31)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Test API / API Test Local
  • GitHub Check: Test Go API Local / Test
  • GitHub Check: Lint Go Code / Lint
  • GitHub Check: Test Packages / Test
  • GitHub Check: Build / Build
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (34)
go/pkg/db/handle_err_no_rows.go (1)

8-12: LGTM!

Clean utility function using errors.Is for proper error chain handling. This centralizes not-found error detection correctly.

go/apps/ctrl/services/acme/user.go (3)

22-31: LGTM!

Good refactoring to make the email domain configurable via struct field rather than hardcoded value.


41-47: LGTM!

Clear struct extension with documented purpose for the email domain field.


114-173: LGTM!

The register function correctly handles the full user creation flow. The stricter error handling on line 169 (vs. warning-only on line 107) makes sense: for new users, failing to persist the registration URI would leave an orphaned user record.

go/pkg/db/queries/custom_domain_upsert.sql (1)

1-7: LGTM!

Clean upsert query using MySQL's ON DUPLICATE KEY UPDATE. The separate placeholder for updated_at in the UPDATE clause is correct since it's not included in the INSERT values. Based on learnings, this pattern is idempotent and safe for retries.

go/pkg/db/querier_bulk_generated.go (1)

16-16: LGTM!

Generated interface method follows the established pattern and is correctly placed.

go/apps/ctrl/services/acme/pem.go (1)

10-23: LGTM!

Clean implementation for extracting certificate expiry. Correctly handles the common case where only the first (leaf) certificate's expiry is needed from a potential chain.

go/apps/ctrl/services/acme/certificate_verification.go (2)

30-35: LGTM!

The error handling correctly distinguishes between "not found" (returns 404) and other errors (returns 500). The dual check for cache.Null and db.IsNotFound(err) covers both cached null values and fresh not-found results.


37-55: LGTM!

The composite cache key format and challenge lookup follow the same pattern as the domain lookup. The key construction is straightforward and the error handling is consistent.

go/apps/ctrl/services/acme/service.go (1)

10-32: LGTM!

Clean extension of the service with domain and challenge caches via dependency injection. The pattern is consistent with other service dependencies (db, logger).

go/apps/ctrl/config.go (2)

35-52: LGTM!

Well-documented Route53 configuration struct following the same pattern as CloudflareConfig. The HostedZoneID comment helpfully explains when the bypass is needed.


53-65: LGTM!

AcmeConfig cleanly extended to support both Cloudflare and Route53 DNS providers, plus the configurable email domain.

go/apps/ctrl/internal/caches/caches.go (1)

1-58: LGTM! Well-structured cache container with appropriate TTL settings.

The short TTL for challenges (10s/30s) is appropriate for the time-sensitive ACME flow, and the longer TTL for domains (5m/10m) balances cache efficiency with data freshness. The clock injection pattern supports testability.

go/pkg/db/custom_domain_upsert.sql_generated.go (1)

1-49: Generated code looks correct.

This is sqlc-generated code implementing an idempotent upsert operation. The ON DUPLICATE KEY UPDATE pattern is consistent with other upsert operations in this codebase. Based on learnings, this naming convention is established practice.

go/apps/ctrl/services/acme/providers/cloudflare_provider.go (1)

13-37: Clean refactor to use the generic Provider wrapper.

The simplified implementation correctly delegates to the generic Provider while maintaining consistent configuration (TTL, propagation timeout) with the Route53 provider. The DomainCache requirement will be validated by NewProvider.

go/apps/ctrl/services/acme/providers/provider.go (2)

94-124: Note: context.Background() is a constraint of the lego interface.

The lego challenge.Provider interface doesn't accept a context parameter, so using context.Background() here is unavoidable. This is worth noting for future maintainers, as it means these operations cannot be cancelled externally.

The DB update after DNS presentation follows an "at-least-once" pattern - if the DB update fails after DNS record creation, the record exists but isn't tracked. This is acceptable since the record will be cleaned up by TTL or subsequent attempts.


126-136: CleanUp intentionally swallows errors - verify this is the desired behavior.

The CleanUp method logs DNS cleanup failures as warnings but returns nil, which means cleanup errors won't propagate. This is a reasonable choice to avoid blocking the certificate flow, but could leave orphaned DNS records.

Is this the intended behavior? Orphaned TXT records will eventually expire based on TTL (10 minutes), but if cleanup consistently fails, you may want alerting on these warnings.

go/apps/ctrl/services/acme/providers/route53_provider.go (1)

25-33: The environment variable LEGO_DISABLE_CNAME_SUPPORT is already properly set to "true" at application startup in go/apps/ctrl/run.go:60, confirming the documented requirement is fulfilled and no additional action is needed.

go/apps/ctrl/run.go (6)

57-60: LGTM! Good guard against CNAME resolution issues.

Setting this environment variable before provider creation prevents lego from following wildcard CNAMEs to load balancers, which would cause Route53 zone lookup failures. The comment clearly explains the purpose.


260-266: LGTM! Cache initialization addresses prior review feedback.

The shared caches are properly initialized before provider setup, resolving the previous issue about missing DomainCache parameters. Creating caches unconditionally aligns with the comment that they're needed for verification endpoints regardless of ACME configuration.


268-313: LGTM! Provider setup properly addresses past issues.

Both DNS providers (Cloudflare and Route53) now receive the DomainCache parameter, resolving the critical issues from previous reviews. The separation between HTTP-01 (regular domains) and DNS-01 (wildcard domains) is clear and well-commented.


315-325: LGTM! Appropriate timeout for DNS-01 challenges.

The 15-minute inactivity timeout accounts for DNS propagation delays (5-10 minutes) during DNS-01 challenges. All required ACME configuration parameters are properly wired.


383-403: LGTM! Good use of idempotency and appropriate error handling.

The idempotency key prevents duplicate cron jobs on restart, which is essential for reliable operation. Bootstrap and cron startup appropriately run only when DNS-01 is available (required for wildcard certificates). Using warnings instead of hard failures for these operations is sensible since they're initialization tasks.


441-446: LGTM! Cache wiring is complete.

Both DomainCache and ChallengeCache are properly passed to the ACME service, enabling efficient domain and challenge lookups.

Based on learnings, the auth interceptor is intentionally kept during the demo phase.

go/apps/ctrl/services/acme/providers/http_provider.go (2)

15-41: LGTM! Refactoring aligns with generic provider architecture.

The httpDNS type correctly implements the DNSProvider interface. The Present and CleanUp methods serve as hooks for the generic Provider wrapper (from provider.go), which handles actual database persistence. The comments clearly explain this delegation pattern.

The 90-second timeout and 3-second interval are appropriate for HTTP-01 challenges, which resolve faster than DNS-01.


43-63: LGTM! Clean provider construction pattern.

The type alias HTTPProvider = Provider makes the provider's purpose explicit while leveraging the generic Provider wrapper. The DomainCache is properly passed through to enable efficient domain lookups during challenge processing. This pattern is consistent with other providers in the PR (Cloudflare, Route53).

go/apps/ctrl/workflows/certificate/process_challenge_handler.go (8)

20-26: LGTM! CertificateID field enables proper certificate tracking.

Adding the CertificateID field to the struct enables proper certificate identification and management, particularly important for renewal scenarios handled in persistCertificate.


36-39: LGTM! Named returns enable saga compensation pattern.

The named return values allow the deferred compensation function (lines 65-70) to inspect both the error and response status, which is essential for properly marking failed challenges.


65-70: LGTM! Well-implemented saga compensation pattern.

The deferred compensation ensures that challenges are properly marked as failed if any error occurs or if the workflow explicitly returns a "failed" status. This prevents challenges from remaining in the "pending" state indefinitely and is a solid implementation of the saga pattern for distributed workflows.


118-124: LGTM! Single global ACME user simplifies certificate management.

Using a fixed global ACME user ID for all certificates simplifies account management and aligns with common ACME deployment patterns. The isWildcard helper is correct and clear.


126-160: LGTM! Clear DNS-01 vs HTTP-01 provider selection.

The function correctly routes wildcard domains to DNS-01 challenges and regular domains to HTTP-01 challenges. Provider availability is checked before use, with clear error messages. The logging helps with debugging by indicating which challenge type is being used.


162-216: LGTM! Robust certificate obtainment with proper error handling.

The backoff calculation 30<<(attempt-1) correctly produces the documented sequence (30s, 60s, 120s), addressing previous review feedback. The retry logic with exponential backoff is appropriate for ACME operations which can be transiently unavailable.

The 90-day fallback for certificate expiry is reasonable (matches Let's Encrypt standard), and vault-based private key encryption ensures secure storage. The newly generated CertificateID at line 211 is correctly handled in persistCertificate which preserves existing IDs during renewals.


218-248: LGTM! Properly handles certificate ID during renewals.

This function correctly addresses the previous critical issue about certificate ID mismatches on renewals. By checking for an existing certificate (lines 224-231) and preserving its ID, the function ensures that renewals update the existing row rather than creating orphaned IDs. The returned ID matches what's actually stored in the database.


250-261: LGTM! Appropriate error handling for compensation logic.

The function correctly marks challenges as failed during compensation. Wrapping in restate.Run ensures durability, and logging errors without propagating them is appropriate for cleanup operations—you don't want compensation failures to cascade.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
go/apps/ctrl/workflows/certificate/renew_handler.go (1)

87-95: Use injected clock for testability.

Line 87 uses time.Now() directly, making this code difficult to test deterministically. The Service struct should have a clock.Clock field, allowing tests to control time progression.

This was flagged in a prior review. Consider adding clock injection for testability.

🧹 Nitpick comments (5)
go/apps/ctrl/workflows/certificate/process_challenge_handler.go (3)

139-157: Consider logging the challenge type decision before provider setup.

The logic correctly selects DNS-01 for wildcards and HTTP-01 for regular domains. However, the info logs appear after the provider setup attempt. If setup fails, the user won't know which path was taken.

 	// Wildcard certificates require DNS-01 challenge
 	// Regular domains use HTTP-01 (faster, no DNS propagation wait)
 	if isWildcard(domain) {
+		s.logger.Info("wildcard domain detected, using DNS-01 challenge", "domain", domain)
 		if s.dnsProvider == nil {
 			return nil, fmt.Errorf("DNS provider required for wildcard certificate: %s", domain)
 		}
 		if err := client.Challenge.SetDNS01Provider(s.dnsProvider); err != nil {
 			return nil, fmt.Errorf("failed to set DNS-01 provider: %w", err)
 		}
-		s.logger.Info("using DNS-01 challenge for wildcard domain", "domain", domain)
 	} else {
+		s.logger.Info("regular domain, using HTTP-01 challenge", "domain", domain)
 		if s.httpProvider == nil {
 			return nil, fmt.Errorf("HTTP provider required for certificate: %s", domain)
 		}
 		if err := client.Challenge.SetHTTP01Provider(s.httpProvider); err != nil {
 			return nil, fmt.Errorf("failed to set HTTP-01 provider: %w", err)
 		}
-		s.logger.Info("using HTTP-01 challenge for domain", "domain", domain)
 	}

162-162: Remove unused parameter.

The _ string parameter (originally workspaceID) is not used in the function body. If it's intentionally unused, consider removing it from the signature to avoid confusion.

-func (s *Service) obtainCertificate(ctx context.Context, _ string, dom db.CustomDomain, domain string) (EncryptedCertificate, error) {
+func (s *Service) obtainCertificate(ctx context.Context, dom db.CustomDomain, domain string) (EncryptedCertificate, error) {

Note: This would require updating the caller at line 76:

-		return s.obtainCertificate(stepCtx, req.GetWorkspaceId(), dom, req.GetDomain())
+		return s.obtainCertificate(stepCtx, dom, req.GetDomain())

251-262: Errors from markChallengeFailed are silently discarded.

The return values from restate.Run are discarded with _, _ =. While logging occurs inside, if the entire restate.Run call fails (not just the inner DB update), that error is lost. Consider at minimum logging at the call site.

 func (s *Service) markChallengeFailed(ctx restate.ObjectContext, domainID string) {
-	_, _ = restate.Run(ctx, func(stepCtx restate.RunContext) (restate.Void, error) {
+	_, runErr := restate.Run(ctx, func(stepCtx restate.RunContext) (restate.Void, error) {
 		if updateErr := db.Query.UpdateAcmeChallengeStatus(stepCtx, s.db.RW(), db.UpdateAcmeChallengeStatusParams{
 			DomainID:  domainID,
 			Status:    db.AcmeChallengesStatusFailed,
 			UpdatedAt: sql.NullInt64{Valid: true, Int64: time.Now().UnixMilli()},
 		}); updateErr != nil {
 			s.logger.Error("failed to update challenge status", "error", updateErr, "domain_id", domainID)
 		}
 		return restate.Void{}, nil
 	}, restate.WithName("mark failed"))
+	if runErr != nil {
+		s.logger.Error("failed to run mark-failed step", "error", runErr, "domain_id", domainID)
+	}
 }
go/apps/ctrl/workflows/certificate/renew_handler.go (1)

27-28: DaysBeforeExpiry field is unused.

The DaysBeforeExpiry field from the request (line 27) is never read in this handler. The query ListExecutableChallenges seems to have its own logic for determining which certificates need processing. Either use this field or document that it's reserved for future use.

Also applies to: 91-91

go/apps/ctrl/run.go (1)

534-535: Document or make the internal workspace ID configurable.

The hardcoded "unkey_internal" workspace ID is used for platform-managed resources. Consider extracting this to a constant or making it configurable to avoid magic strings.

+const (
+	// internalWorkspaceID is used for platform-managed resources like wildcard certificates
+	internalWorkspaceID = "unkey_internal"
+)
+
 func bootstrapWildcardDomain(ctx context.Context, database db.Database, logger logging.Logger, defaultDomain string) {
 	...
-	workspaceID := "unkey_internal"
+	workspaceID := internalWorkspaceID
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b70331 and 717add7.

📒 Files selected for processing (4)
  • go/apps/ctrl/run.go (7 hunks)
  • go/apps/ctrl/services/acme/user.go (5 hunks)
  • go/apps/ctrl/workflows/certificate/process_challenge_handler.go (6 hunks)
  • go/apps/ctrl/workflows/certificate/renew_handler.go (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-10-30T15:10:52.743Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 4190
File: go/internal/services/keys/verifier.go:51-53
Timestamp: 2025-10-30T15:10:52.743Z
Learning: PR #4190 for unkeyed/unkey is focused solely on database schema and query changes for identity-based credits. It adds IdentityCredits and KeyCredits fields to structs and queries, but does not implement the priority enforcement logic in the usagelimiter. The logic implementation is intentionally deferred to a later PR in the stack.

Applied to files:

  • go/apps/ctrl/workflows/certificate/process_challenge_handler.go
📚 Learning: 2025-09-11T14:13:12.124Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/services/acme/providers/cloudflare_provider.go:56-69
Timestamp: 2025-09-11T14:13:12.124Z
Learning: For the default domain in the Unkey system (go/apps/ctrl/services/acme/providers/cloudflare_provider.go), only wildcard certificates are issued. When domain == p.defaultDomain, the code should always convert to the wildcard form ("*." + domain) and not attempt fallback lookups to the base domain form.

Applied to files:

  • go/apps/ctrl/workflows/certificate/process_challenge_handler.go
  • go/apps/ctrl/run.go
📚 Learning: 2025-09-25T18:49:28.948Z
Learnt from: mcstepp
Repo: unkeyed/unkey PR: 4010
File: go/apps/ctrl/run.go:224-233
Timestamp: 2025-09-25T18:49:28.948Z
Learning: In go/apps/ctrl/run.go, mcstepp prefers to keep ACME service handlers wrapped with auth interceptor during the demo phase for simplicity, even though typically ACME routes would be exempt from authentication. This is part of the temporary demo authentication system.

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token using assert.NotEmpty() when both c.Acme.Enabled and c.Acme.Cloudflare.Enabled are true, with the error message "cloudflare API token is required when cloudflare is enabled".

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token when ACME and Cloudflare are both enabled, checking that cfg.Acme.Cloudflare.ApiToken is not empty.

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-07-15T14:59:30.212Z
Learnt from: chronark
Repo: unkeyed/unkey PR: 3560
File: go/deploy/metald/internal/database/repository.go:0-0
Timestamp: 2025-07-15T14:59:30.212Z
Learning: go/deploy/metald cannot currently import helpers from go/pkg/db because it is not yet part of the main Go module; avoid suggesting such imports until the modules are unified.

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-09-15T18:12:01.503Z
Learnt from: mcstepp
Repo: unkeyed/unkey PR: 3952
File: go/apps/ctrl/run.go:212-213
Timestamp: 2025-09-15T18:12:01.503Z
Learning: In Go connect-go generated handlers, mux.Handle can accept New*ServiceHandler functions directly even though they return (string, http.Handler) tuples. The pattern mux.Handle(ctrlv1connect.New*ServiceHandler(...)) is valid and compiles successfully in unkey codebase.

Applied to files:

  • go/apps/ctrl/run.go
🧬 Code graph analysis (3)
go/apps/ctrl/workflows/certificate/process_challenge_handler.go (5)
go/pkg/db/models_generated.go (1)
  • Certificate (703-711)
go/apps/ctrl/workflows/certificate/service.go (2)
  • Service (20-29)
  • New (58-69)
go/gen/proto/hydra/v1/certificate.pb.go (6)
  • ProcessChallengeRequest (25-31)
  • ProcessChallengeRequest (44-44)
  • ProcessChallengeRequest (59-61)
  • ProcessChallengeResponse (77-83)
  • ProcessChallengeResponse (96-96)
  • ProcessChallengeResponse (111-113)
go/apps/ctrl/services/acme/user.go (2)
  • GetOrCreateUser (49-114)
  • UserConfig (41-47)
go/pkg/db/certificate_insert.sql_generated.go (1)
  • InsertCertificateParams (23-31)
go/apps/ctrl/run.go (9)
go/apps/ctrl/internal/caches/caches.go (2)
  • New (23-58)
  • Config (18-21)
go/apps/ctrl/config.go (3)
  • Config (95-171)
  • CloudflareConfig (27-33)
  • Route53Config (35-51)
go/pkg/clock/interface.go (1)
  • Clock (12-18)
go/apps/ctrl/services/acme/providers/http_provider.go (3)
  • NewHTTPProvider (54-64)
  • HTTPConfig (47-51)
  • HTTPProvider (45-45)
go/apps/ctrl/services/acme/providers/cloudflare_provider.go (2)
  • NewCloudflareProvider (21-38)
  • CloudflareConfig (13-18)
go/apps/ctrl/services/acme/providers/route53_provider.go (2)
  • NewRoute53Provider (33-55)
  • Route53Config (13-23)
go/gen/proto/hydra/v1/certificate.pb.go (3)
  • RenewExpiringCertificatesRequest (129-135)
  • RenewExpiringCertificatesRequest (148-148)
  • RenewExpiringCertificatesRequest (163-165)
go/pkg/db/custom_domain_upsert.sql_generated.go (1)
  • UpsertCustomDomainParams (22-29)
go/pkg/db/models_generated.go (2)
  • CustomDomainsChallengeTypeDNS01 (148-148)
  • AcmeChallengesStatusWaiting (61-61)
go/apps/ctrl/services/acme/user.go (4)
go/pkg/db/models_generated.go (1)
  • AcmeUser (639-646)
go/pkg/db/queries.go (1)
  • Query (29-29)
go/pkg/db/acme_user_update_registration_uri.sql_generated.go (1)
  • UpdateAcmeUserRegistrationURIParams (17-20)
go/pkg/db/types/null_string.go (1)
  • NullString (10-10)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Go API Local / Test
  • GitHub Check: Lint Go Code / Lint
  • GitHub Check: Build / Build
  • GitHub Check: Test API / API Test Local
  • GitHub Check: Test Packages / Test
🔇 Additional comments (18)
go/apps/ctrl/workflows/certificate/process_challenge_handler.go (5)

21-26: LGTM on the EncryptedCertificate struct.

The struct cleanly encapsulates all certificate data needed for the workflow, including the new CertificateID and ExpiresAt fields for tracking certificate lifecycle.


118-124: LGTM on the global ACME user approach and wildcard detection.

Using a single global ACME account simplifies key management. The isWildcard helper correctly identifies wildcard domains by checking for the *. prefix.


177-184: Backoff formula now matches the documented delays.

The formula 30 << (attempt-1) correctly produces delays of 30s, 60s, 120s (capped at 5min) for attempts 1, 2, 3, matching the comment. This addresses the prior review feedback.


219-249: Certificate ID mismatch issue addressed.

The code now correctly checks for an existing certificate by hostname and reuses its ID on renewal. This ensures the returned certID matches what's stored in the database, addressing the prior review feedback about ID mismatches.


65-70: Defer compensation behavior is correct as written.

The defer at line 30-35 is properly positioned after the successful claim (line 28). With named return values, the defer captures references to resp and err which are evaluated when the function returns. The compensation logic correctly triggers markChallengeFailed for all failure scenarios after claiming:

  • If err != nil (failures at obtain, persist, or mark verified steps)
  • If response status is "failed" (certificate obtain returns failed status but nil error)

Early returns before the claim (lines 14-15, 26-27) execute before the defer is registered, so no compensation runs for errors before claiming—which is correct behavior. The code is sound.

go/apps/ctrl/services/acme/user.go (4)

22-31: LGTM on EmailDomain integration.

The EmailDomain field allows configurable ACME registration emails. The GetEmail() method correctly composes the email using the workspace ID and domain.


71-84: LGTM on user rehydration with optional registration.

The code correctly initializes the user with the decrypted key and only sets Registration when a valid URI exists in the database.


116-175: LGTM on the register flow.

The registration flow correctly:

  1. Generates a new ECDSA P-256 key
  2. Encrypts and persists the key before registration
  3. Registers with the CA
  4. Persists the registration URI

The order ensures the key is safely stored before any CA interaction.


92-111: Registration URI persistence failure does not require fatal error handling.

ACME specification (RFC 8555) makes the newAccount operation idempotent by account key. If registration is attempted again with the same key, the server must return the existing account with a 200 OK response—it will not reject the duplicate registration. This means re-attempting registration if the URI persists fails is safe and will consistently return the same account. The current approach of logging a warning is appropriate.

go/apps/ctrl/workflows/certificate/renew_handler.go (3)

11-17: LGTM on constants for renewal scheduling.

Clear constants for the 24-hour renewal interval and the singleton key pattern.


25-43: LGTM on the Restate durable workflow pattern.

The handler correctly uses restate.Run to durably query for executable challenges, ensuring the query is replayed correctly on workflow resume.


49-76: Fire-and-forget pattern handles failures gracefully.

The loop correctly:

  • Triggers each renewal via Send() (non-blocking)
  • Collects failed domains for reporting
  • Uses restate.Sleep for throttling (durable sleep)

The 100ms delay prevents overwhelming the system while still being responsive.

go/apps/ctrl/run.go (6)

58-61: LGTM on disabling CNAME support for lego.

Setting LEGO_DISABLE_CNAME_SUPPORT prevents lego from following CNAMEs that could cause Route53 zone lookup failures for wildcard domains.


261-268: LGTM on cache initialization.

Caches are correctly initialized before provider setup, addressing prior review feedback about missing DomainCache.


270-315: LGTM on ACME provider setup.

Both HTTP-01 and DNS-01 providers are correctly configured with the shared DomainCache. The DNS provider selection (Cloudflare vs Route53) is mutually exclusive as expected.


519-528: LGTM on not-found error handling.

The code correctly uses db.IsNotFound(err) instead of comparing to sql.ErrNoRows directly, addressing prior review feedback.


445-450: LGTM on ACME service handler wiring.

The ACME service correctly receives both DomainCache and ChallengeCache for efficient lookups during HTTP-01 challenge verification.


550-561: Token and Authorization fields are intentionally empty initially and are populated during the ACME challenge flow.

The challenge is created with empty Token and Authorization fields because these values are only provided by the ACME server during the challenge lifecycle. When the renewal cron triggers ProcessChallenge, the ACME workflow calls the DNS provider's Present method (via provider.go), which receives the actual token and key authorization from the lego library and updates the database with UpdateAcmeChallengePending. The VerifyCertificate service then reads these populated values to return the authorization to clients.

This design is correct—initializing with empty strings and populating during the ACME flow prevents storing incomplete data.

@vercel vercel bot temporarily deployed to Preview – engineering December 9, 2025 17:20 Inactive
@vercel vercel bot temporarily deployed to Preview – dashboard December 9, 2025 17:22 Inactive
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
go/apps/ctrl/run.go (1)

416-432: Remove or relax the overly restrictive renewal cron condition.

The renewal cron is gated by cfg.Acme.Enabled && dnsProvider != nil, but this is incorrect. The RenewExpiringCertificates handler explicitly supports both DNS-01 (lines 31-34 in renew_handler.go) and HTTP-01 challenges. The ProcessChallenge workflow dynamically selects the appropriate provider based on domain type: DNS-01 for wildcards and HTTP-01 for regular domains (lines 141-157 in process_challenge_handler.go).

This condition prevents HTTP-01 certificate renewals for regular domains, even though the workflow supports them. Regular domains configured with HTTP-01 will fail to renew because the cron job never starts.

Correct the condition to: cfg.Acme.Enabled && (dnsProvider != nil || httpProvider != nil)

🧹 Nitpick comments (1)
go/apps/ctrl/config.go (1)

233-242: Consider validating HostedZoneID when auto-discovery is expected to fail.

The validation correctly requires AccessKeyID, SecretAccessKey, and Region when Route53 is enabled. However, HostedZoneID is optional and only validated when provided. According to the comment in the struct definition (line 48-49), HostedZoneID is "required when domains have CNAMEs that confuse the zone lookup."

While making it always required would be too strict, consider whether the validation should warn or document when it should be set based on domain configuration.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 717add7 and 01b0c96.

⛔ Files ignored due to path filters (1)
  • go/go.sum is excluded by !**/*.sum
📒 Files selected for processing (5)
  • go/apps/ctrl/config.go (2 hunks)
  • go/apps/ctrl/run.go (7 hunks)
  • go/cmd/ctrl/main.go (2 hunks)
  • go/go.mod (1 hunks)
  • go/pkg/db/querier_generated.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • go/go.mod
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-09-11T14:13:12.124Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/services/acme/providers/cloudflare_provider.go:56-69
Timestamp: 2025-09-11T14:13:12.124Z
Learning: For the default domain in the Unkey system (go/apps/ctrl/services/acme/providers/cloudflare_provider.go), only wildcard certificates are issued. When domain == p.defaultDomain, the code should always convert to the wildcard form ("*." + domain) and not attempt fallback lookups to the base domain form.

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token using assert.NotEmpty() when both c.Acme.Enabled and c.Acme.Cloudflare.Enabled are true, with the error message "cloudflare API token is required when cloudflare is enabled".

Applied to files:

  • go/apps/ctrl/run.go
  • go/cmd/ctrl/main.go
  • go/apps/ctrl/config.go
📚 Learning: 2025-09-11T14:12:30.570Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/run.go:292-299
Timestamp: 2025-09-11T14:12:30.570Z
Learning: In go/apps/ctrl/config.go, the Validate() method already includes validation for Cloudflare API token when ACME and Cloudflare are both enabled, checking that cfg.Acme.Cloudflare.ApiToken is not empty.

Applied to files:

  • go/apps/ctrl/run.go
  • go/cmd/ctrl/main.go
  • go/apps/ctrl/config.go
📚 Learning: 2025-07-15T14:59:30.212Z
Learnt from: chronark
Repo: unkeyed/unkey PR: 3560
File: go/deploy/metald/internal/database/repository.go:0-0
Timestamp: 2025-07-15T14:59:30.212Z
Learning: go/deploy/metald cannot currently import helpers from go/pkg/db because it is not yet part of the main Go module; avoid suggesting such imports until the modules are unified.

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-09-25T18:49:28.948Z
Learnt from: mcstepp
Repo: unkeyed/unkey PR: 4010
File: go/apps/ctrl/run.go:224-233
Timestamp: 2025-09-25T18:49:28.948Z
Learning: In go/apps/ctrl/run.go, mcstepp prefers to keep ACME service handlers wrapped with auth interceptor during the demo phase for simplicity, even though typically ACME routes would be exempt from authentication. This is part of the temporary demo authentication system.

Applied to files:

  • go/apps/ctrl/run.go
  • go/cmd/ctrl/main.go
📚 Learning: 2025-09-15T18:12:01.503Z
Learnt from: mcstepp
Repo: unkeyed/unkey PR: 3952
File: go/apps/ctrl/run.go:212-213
Timestamp: 2025-09-15T18:12:01.503Z
Learning: In Go connect-go generated handlers, mux.Handle can accept New*ServiceHandler functions directly even though they return (string, http.Handler) tuples. The pattern mux.Handle(ctrlv1connect.New*ServiceHandler(...)) is valid and compiles successfully in unkey codebase.

Applied to files:

  • go/apps/ctrl/run.go
📚 Learning: 2025-09-11T14:24:40.988Z
Learnt from: Flo4604
Repo: unkeyed/unkey PR: 3944
File: go/apps/ctrl/services/deployment/deploy_workflow.go:326-334
Timestamp: 2025-09-11T14:24:40.988Z
Learning: The InsertDomains method in the bulk queries uses ON DUPLICATE KEY UPDATE, making it an upsert operation that is idempotent and safe for retries, despite the "Insert" naming convention.

Applied to files:

  • go/pkg/db/querier_generated.go
🧬 Code graph analysis (4)
go/apps/ctrl/run.go (9)
go/apps/ctrl/internal/caches/caches.go (2)
  • New (23-58)
  • Config (18-21)
go/pkg/uid/uid.go (1)
  • New (103-152)
go/apps/ctrl/config.go (2)
  • Config (95-177)
  • CloudflareConfig (27-33)
go/pkg/clock/interface.go (1)
  • Clock (12-18)
go/apps/ctrl/services/acme/providers/http_provider.go (3)
  • NewHTTPProvider (54-64)
  • HTTPConfig (47-51)
  • HTTPProvider (45-45)
go/apps/ctrl/services/acme/providers/cloudflare_provider.go (2)
  • NewCloudflareProvider (21-38)
  • CloudflareConfig (13-18)
go/pkg/db/handle_err_no_rows.go (1)
  • IsNotFound (10-12)
go/pkg/db/custom_domain_upsert.sql_generated.go (1)
  • UpsertCustomDomainParams (22-29)
go/pkg/db/models_generated.go (2)
  • CustomDomainsChallengeTypeDNS01 (148-148)
  • AcmeChallengesStatusWaiting (61-61)
go/pkg/db/querier_generated.go (4)
go/pkg/db/interface.go (1)
  • DBTX (29-34)
go/pkg/db/custom_domain_find_by_domain_or_wildcard.sql_generated.go (1)
  • FindCustomDomainByDomainOrWildcardParams (20-24)
go/pkg/db/models_generated.go (1)
  • CustomDomain (769-776)
go/pkg/db/custom_domain_upsert.sql_generated.go (1)
  • UpsertCustomDomainParams (22-29)
go/cmd/ctrl/main.go (2)
go/pkg/cli/flag.go (4)
  • String (419-453)
  • Default (364-416)
  • EnvVar (320-339)
  • Bool (497-535)
go/apps/ctrl/config.go (2)
  • CloudflareConfig (27-33)
  • Route53Config (35-51)
go/apps/ctrl/config.go (2)
go/apps/ctrl/services/acme/providers/cloudflare_provider.go (1)
  • CloudflareConfig (13-18)
go/apps/ctrl/services/acme/providers/route53_provider.go (1)
  • Route53Config (13-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Test API / API Test Local
  • GitHub Check: Lint Go Code / Lint
  • GitHub Check: Test Go API Local / Test
  • GitHub Check: Test Packages / Test
  • GitHub Check: Build / Build
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (13)
go/pkg/db/querier_generated.go (2)

180-187: LGTM!

The FindCustomDomainByDomainOrWildcard method correctly implements the logic to find a custom domain by either exact match or wildcard pattern, prioritizing exact matches via the ORDER BY CASE clause.


2142-2150: LGTM. This generated method correctly implements the upsert pattern. The only call site in go/apps/ctrl/run.go uses a hardcoded "unkey_internal" workspace for platform-managed ACME certificate infrastructure, so workspace reassignment is controlled and intentional.

go/cmd/ctrl/main.go (3)

116-116: LGTM!

The acme-email-domain flag is properly configured with a default value and environment variable support. The default domain "unkey.com" aligns with the organization's namespace for ACME registration emails.


235-241: LGTM!

The Route53 configuration is properly wired from CLI flags into the AcmeConfig struct, mapping all five Route53-specific fields correctly.


118-127: Route53 credential validation is already enforced in the config layer.

The Validate() method in go/apps/ctrl/config.go (lines 234-238) ensures that when acme-route53-enabled is true, the following fields are required: AccessKeyID, SecretAccessKey, and Region. The CLI-level flags are correctly deferred to config validation, making this deferral appropriate.

go/apps/ctrl/config.go (2)

35-51: LGTM!

The Route53Config struct is well-documented and follows the same pattern as CloudflareConfig. The comment on HostedZoneID clearly explains when it's needed (bypassing auto-discovery for domains with wildcard CNAMEs).


57-64: LGTM!

The AcmeConfig struct is properly extended with EmailDomain and Route53 fields, maintaining consistency with the existing Cloudflare field structure. The comments clearly document the purpose of each field.

go/apps/ctrl/run.go (6)

286-293: LGTM!

The caches are initialized unconditionally (even when ACME is disabled) because they're needed for the ACME verification endpoint regardless of provider configuration. This is a good design decision that ensures the verification endpoint is always available.


295-340: LGTM! DNS provider selection follows proper exclusive pattern.

The provider setup correctly implements:

  1. HTTP-01 provider is always created when ACME is enabled (for non-wildcard domains)
  2. DNS-01 provider is conditionally created based on Cloudflare OR Route53 being enabled
  3. Proper error handling for provider creation failures
  4. Both providers receive the shared DomainCache from the caches module

The else if ensures only one DNS provider is active at a time, which is the correct behavior.


342-352: LGTM! Certificate service properly configured with extended timeout.

The certificate service is correctly wired with:

  1. All required dependencies (database, vault, logger, caches)
  2. New EmailDomain and DefaultDomain configuration
  3. Both DNSProvider and HTTPProvider for handling wildcard and regular certificates
  4. Extended inactivity timeout (15 minutes) appropriate for DNS-01 propagation delays

The configuration aligns with the service's requirements for ACME challenge handling.


411-414: Bootstrap condition correctly gates wildcard domain creation.

The bootstrap is properly gated by:

  • cfg.Acme.Enabled - ACME must be enabled
  • dnsProvider != nil - DNS-01 provider must be configured (wildcards require DNS-01)
  • cfg.DefaultDomain != "" - Must have a default domain

This ensures the wildcard domain is only created when all prerequisites are met.


470-475: LGTM! ACME service properly wired with domain and challenge caches.

The ACME service handler is correctly configured with DomainCache and ChallengeCache from the shared caches module, enabling proper challenge verification and domain lookups.


539-593: LGTM! Bootstrap helper properly implements domain and challenge creation.

The bootstrapWildcardDomain function correctly:

  1. Constructs wildcard domain with "*." prefix (line 542)
  2. Uses db.IsNotFound(err) for not-found detection (line 550), addressing past review feedback
  3. Generates unique domain ID using uid.New(uid.DomainPrefix) (line 556)
  4. Uses "unkey_internal" workspace for platform-managed resources (line 560)
  5. Creates domain with DNS-01 challenge type (line 566)
  6. Creates ACME challenge with waiting status so renewal cron picks it up (line 581)
  7. Properly handles errors with informative logging

The implementation is robust and follows the codebase's patterns.

Comment on lines +58 to +61
// Disable CNAME following in lego to prevent it from following wildcard CNAMEs
// (e.g., *.example.com -> loadbalancer.aws.com) and failing Route53 zone lookup.
// Must be set before creating any ACME DNS providers.
os.Setenv("LEGO_DISABLE_CNAME_SUPPORT", "true")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n go/apps/ctrl/run.go | head -80 | tail -30

Repository: unkeyed/unkey

Length of output: 1104


🏁 Script executed:

rg -n "LEGO_DISABLE_CNAME_SUPPORT" go/apps/ctrl/run.go -B 5 -A 5

Repository: unkeyed/unkey

Length of output: 466


🏁 Script executed:

fd "cloudflare" go/apps/ctrl --type f

Repository: unkeyed/unkey

Length of output: 116


🏁 Script executed:

sed -n '50,300p' go/apps/ctrl/run.go | head -150

Repository: unkeyed/unkey

Length of output: 4600


🏁 Script executed:

cat go/apps/ctrl/services/acme/providers/cloudflare_provider.go

Repository: unkeyed/unkey

Length of output: 1094


🏁 Script executed:

sed -n '61,400p' go/apps/ctrl/run.go | grep -n -A 5 -B 5 "cfg.Acme\|ACME" | head -100

Repository: unkeyed/unkey

Length of output: 3909


🏁 Script executed:

rg -n "cfg\.Acme\.Enabled|if.*Acme" go/apps/ctrl/run.go -B 2 -A 2

Repository: unkeyed/unkey

Length of output: 1728


🌐 Web query:

lego LEGO_DISABLE_CNAME_SUPPORT cloudflare DNS-01 challenge impact

💡 Result:

Short answer: LEGO_DISABLE_CNAME_SUPPORT controls whether the lego ACME client follows CNAMEs when resolving dns‑01 challenge names. If set (true), lego will NOT follow CNAME aliases; if unset (default) lego will follow them. [1][2]

Impact when using Cloudflare and dns‑01:

  • If you use CNAME aliases (e.g., delegating _acme-challenge.example.com to another zone or to acme-dns), disabling CNAME support stops lego from following that alias and will look for/create TXT records in the original zone — causing validation failures. [1][4]
  • Cloudflare features (proxying / CNAME flattening or a proxied record) can already break dns‑01 verification because the TXT/CNAME may not resolve as expected; combining that with LEGO_DISABLE_CNAME_SUPPORT=true makes debugging harder and is likely to fail if the TXT actually lives under the aliased zone. See Cloudflare CNAME/CNAME-flattening notes. [3]
  • Some users/tools (e.g., Traefik integrations) mention toggling this env var when CNAME handling causes issues; often the root problem is DNS configuration or resolver choices rather than lego itself. Setting explicit DNS resolvers or fixing the CNAME/proxy settings on Cloudflare is another remedy. [5][4]

Recommendations:

  1. If you delegate _acme-challenge via CNAME to a different provider (acme-dns or another Cloudflare zone), leave CNAME support enabled (unset LEGO_DISABLE_CNAME_SUPPORT). [1][4]
  2. Ensure the CNAME record is DNS‑only (not proxied) and that Cloudflare CNAME flattening won’t interfere. [3]
  3. If you must disable CNAME following, make sure the TXT records are writable in the zone lego will query (or change your provider config so the TXT is created where lego expects). [1][4]
  4. If failures persist, check logs, confirm TXT resolution from public resolvers, or configure explicit resolvers for the DNS challenge. [5]

Sources:

  • lego CLI options (LEGO_DISABLE_CNAME_SUPPORT description). [1]
  • lego project features (CNAME support default). [2]
  • Cloudflare DNS docs on CNAME / flattening / proxied records. [3]
  • go-acme/lego GitHub issue discussing CNAME/dns‑01 delegation problems. [4]
  • Traefik community post referencing LEGO_DISABLE_CNAME_SUPPORT and resolver workarounds. [5]

Environment variable set globally before ACME providers are created; clarify Cloudflare DNS-01 impact.

The LEGO_DISABLE_CNAME_SUPPORT setting is correctly placed after validation but before any ACME DNS providers are instantiated. However, this global environment variable affects both Route53 and Cloudflare providers:

  • Route53: Disabling CNAME following prevents issues with wildcard CNAME records (e.g., *.example.com → loadbalancer.aws.com) that would fail zone lookups.
  • Cloudflare: Disabling CNAME support will break dns-01 challenges if the setup delegates _acme-challenge via CNAME to another zone or relies on CNAME flattening. This is only safe if using Cloudflare's standard direct zone record management (not CNAME-delegated challenges).

If Cloudflare users employ CNAME-based challenge delegation, this environment variable will cause dns-01 validation failures. Consider either:

  1. Documenting that Cloudflare integration requires direct zone records (not CNAME delegation), or
  2. Making this setting conditional per-provider (more complex but provider-specific).
🤖 Prompt for AI Agents
In go/apps/ctrl/run.go around lines 58-61 the global
os.Setenv("LEGO_DISABLE_CNAME_SUPPORT","true") unconditionally disables CNAME
following for all ACME DNS providers which breaks Cloudflare setups that
delegate _acme-challenge via CNAME; change this so the env var is only set when
using providers that require CNAME following disabled (e.g., Route53) or make it
configurable: detect the configured DNS provider(s) before creating providers
and either (a) set LEGO_DISABLE_CNAME_SUPPORT only if the provider is
Route53/other non-CNAME-delegated provider, or (b) add a config
flag/documentation indicating Cloudflare requires direct zone records and do not
set the env var when Cloudflare is in use. Ensure the check runs before provider
construction and keep the default behavior unchanged for existing Route53 users.

@chronark chronark merged commit dbb7bda into main Dec 9, 2025
21 of 22 checks passed
@chronark chronark deleted the feat/ctrl-lets-encrypt-challenges branch December 9, 2025 17:26
Flo4604 added a commit that referenced this pull request Dec 10, 2025
* feat: add lets encrypt challenges

* always disable cname following

* cleanup some code

* cleanup some code

* cleanup some code

* cleanup some code

* cleanup some code

* fix golint issues

* fix golint issues
chronark added a commit that referenced this pull request Dec 19, 2025
* feat: add environment variables db schema and queries

* fix db query

* feat: add SecretsConfig proto for encrypted env vars

* [autofix.ci] apply automated fixes

* feat: dashboard UI for environment variables management

* fix comment and rename file

* fix file export name

* Remove unnecessary comments from add-env-vars

* add toasts for environment variable operations

* [autofix.ci] apply automated fixes

* fix: add try/catch error handling to env var mutations

* unfmt file

* [autofix.ci] apply automated fixes

* feat: decrypt env vars in CTRL workflow before passing to Krane

* feat: inject env vars into pod spec via Krane

* feat: add customer-workload service account for pod isolation

* remove gw from k8s manifest, add agent fix ctrl vault for certs

* seperate master keys too

* add inital webhook stuff

* add generated stuff

* adjust comments

* use otel lgtm stack in k8s too

* fix some rabbit comments

* fix some rabbit comments

* get rid of some unncessary comments

* actually add unkey env cmd gitignores...

* fix golint issues

* Fix/update validation issues status label (#4478)

* fix: update API key status label from 'Potential issues' to 'High Error Rate'

Changed the validation-issues status label to more clearly communicate
that the key is receiving invalid requests, rather than implying the
API or key itself is broken.

Changes:
- Label: 'Potential issues' → 'High Error Rate'
- Tooltip: Updated to clarify that requests are invalid (rate limited,
  unauthorized, etc.) rather than suggesting system issues

Fixes #4474

* chore: apply biome formatting

* fix: update status label to 'Elevated Rejections' per review

---------

Co-authored-by: CodeReaper <148160799+MichaelUnkey@users.noreply.github.com>

* chore: Remove un-used UI components (#4472)

* removed un used components

* updated members refs

---------

Co-authored-by: James P <james@unkey.dev>
Co-authored-by: Andreas Thomas <dev@chronark.com>

* perf: fix n+1 (#4484)

* fix: add 403 error when 0 key verification perms (#4483)

* fix: add 403 error when 0 key verification perms

* cleanup tests

* feat: add environment variables db schema and queries (#4450)

* feat: add environment variables db schema and queries

* fix db query

* feat: add SecretsConfig proto for encrypted env vars (#4451)

* feat: add environment variables db schema and queries

* fix db query

* feat: add SecretsConfig proto for encrypted env vars

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Andreas Thomas <dev@chronark.com>

* feat: add GetPullToken

* feat: dashboard UI for environment variables management (#4452)

* feat: add environment variables db schema and queries

* fix db query

* feat: add SecretsConfig proto for encrypted env vars

* [autofix.ci] apply automated fixes

* feat: dashboard UI for environment variables management

* fix comment and rename file

* fix file export name

* Remove unnecessary comments from add-env-vars

* add toasts for environment variable operations

* [autofix.ci] apply automated fixes

* fix: add try/catch error handling to env var mutations

* unfmt file

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Andreas Thomas <dev@chronark.com>

* feat: decrypt env vars in CTRL workflow before passing to Krane (#4453)

* feat: add environment variables db schema and queries

* fix db query

* feat: add SecretsConfig proto for encrypted env vars

* [autofix.ci] apply automated fixes

* feat: dashboard UI for environment variables management

* fix comment and rename file

* fix file export name

* Remove unnecessary comments from add-env-vars

* add toasts for environment variable operations

* [autofix.ci] apply automated fixes

* fix: add try/catch error handling to env var mutations

* unfmt file

* [autofix.ci] apply automated fixes

* feat: decrypt env vars in CTRL workflow before passing to Krane

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Andreas Thomas <dev@chronark.com>

* feat: inject env vars into pod spec via Krane (#4454)

* feat: add environment variables db schema and queries

* fix db query

* feat: add SecretsConfig proto for encrypted env vars

* [autofix.ci] apply automated fixes

* feat: dashboard UI for environment variables management

* fix comment and rename file

* fix file export name

* Remove unnecessary comments from add-env-vars

* add toasts for environment variable operations

* [autofix.ci] apply automated fixes

* fix: add try/catch error handling to env var mutations

* unfmt file

* [autofix.ci] apply automated fixes

* feat: decrypt env vars in CTRL workflow before passing to Krane

* feat: inject env vars into pod spec via Krane

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Andreas Thomas <dev@chronark.com>

* feat: add customer-workload service account for pod isolation (#4455)

* feat: add environment variables db schema and queries

* fix db query

* feat: add SecretsConfig proto for encrypted env vars

* [autofix.ci] apply automated fixes

* feat: dashboard UI for environment variables management

* fix comment and rename file

* fix file export name

* Remove unnecessary comments from add-env-vars

* add toasts for environment variable operations

* [autofix.ci] apply automated fixes

* fix: add try/catch error handling to env var mutations

* unfmt file

* [autofix.ci] apply automated fixes

* feat: decrypt env vars in CTRL workflow before passing to Krane

* feat: inject env vars into pod spec via Krane

* feat: add customer-workload service account for pod isolation

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Andreas Thomas <dev@chronark.com>

* remove gw from k8s manifest, add agent fix ctrl vault for certs (#4463)

* feat: add environment variables db schema and queries

* fix db query

* feat: add SecretsConfig proto for encrypted env vars

* [autofix.ci] apply automated fixes

* feat: dashboard UI for environment variables management

* fix comment and rename file

* fix file export name

* Remove unnecessary comments from add-env-vars

* add toasts for environment variable operations

* [autofix.ci] apply automated fixes

* fix: add try/catch error handling to env var mutations

* unfmt file

* [autofix.ci] apply automated fixes

* feat: decrypt env vars in CTRL workflow before passing to Krane

* feat: inject env vars into pod spec via Krane

* feat: add customer-workload service account for pod isolation

* remove gw from k8s manifest, add agent fix ctrl vault for certs

* seperate master keys too

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Andreas Thomas <dev@chronark.com>

* chore: Make Stripe Great Again (#4479)

* fix: Make stripe webhooks more robust

* chore: Move alert to UI (#4485)

* Moved alert to ui and swapped usages

* feat: better env var injection (#4468)

* feat: add environment variables db schema and queries

* fix db query

* feat: add SecretsConfig proto for encrypted env vars

* [autofix.ci] apply automated fixes

* feat: dashboard UI for environment variables management

* fix comment and rename file

* fix file export name

* Remove unnecessary comments from add-env-vars

* add toasts for environment variable operations

* [autofix.ci] apply automated fixes

* fix: add try/catch error handling to env var mutations

* unfmt file

* [autofix.ci] apply automated fixes

* feat: decrypt env vars in CTRL workflow before passing to Krane

* feat: inject env vars into pod spec via Krane

* feat: add customer-workload service account for pod isolation

* remove gw from k8s manifest, add agent fix ctrl vault for certs

* seperate master keys too

* add inital webhook stuff

* add generated stuff

* adjust comments

* use otel lgtm stack in k8s too

* fix some rabbit comments

* fix some rabbit comments

* get rid of some unncessary comments

* actually add unkey env cmd gitignores...

* fix golint issues (#4477)

* [autofix.ci] apply automated fixes

* fix fmt

* linter be happy

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Andreas Thomas <dev@chronark.com>

* make token pod owned

* feat: add lets encrypt challenges (#4471)

* feat: add lets encrypt challenges

* always disable cname following

* cleanup some code

* cleanup some code

* cleanup some code

* cleanup some code

* cleanup some code

* fix golint issues

* fix golint issues

* fmt

* remove old webhook code

* remove old webhook code

* make build id not optiona

* cleanup

* cleanup

* fmt

* fmt

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: abhay <88815641+theabhayprajapati@users.noreply.github.com>
Co-authored-by: CodeReaper <148160799+MichaelUnkey@users.noreply.github.com>
Co-authored-by: James P <james@unkey.dev>
Co-authored-by: Andreas Thomas <dev@chronark.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Re-enable ACME functionality as Restate cron job

2 participants