Skip to content

Conversation

@ddl-dclegg
Copy link
Contributor

@ddl-dclegg ddl-dclegg commented Oct 2, 2025

What issue does this pull request solve?

After merging #353, the Hephaestus controller was experiencing a crash loop with the error:

Error: listen tcp :9443: bind: address already in use

The controller code in pkg/controller/start.go was trying to start the webhook server twice:

  1. Manual Start (lines 183-191): A goroutine explicitly called opts.WebhookServer.Start(ctx)
  2. Automatic Start: Controller-runtime's manager automatically starts the webhook server when mgr.Start() is called

Sequence of Events Leading to Crash

[Time 0] Manual webhook server goroutine starts
         ↓
[Time 0+] Webhook server binds to port 9443 successfully
         ↓
[Time 0+] mgr.Start() called
         ↓
[Time 0+] Controller-runtime detects webhook server already running
         ↓
[Time 0+] Manager enters shutdown sequence immediately
         ↓
[Time 0+] All contexts are canceled (causing "context canceled" errors)
         ↓
[Time 0+] Worker pool initialization fails
         ↓
[Time 0+] Controllers shut down before starting
         ↓
[Time 0+] Second webhook server attempts to bind during cleanup
         ↓
[Error] "bind: address already in use" because first server still owns port
         ↓
[Result] Container crashes and restarts (CrashLoopBackOff)

What is the solution?

The fix removes the manual webhook server initialization since controller-runtime handles this automatically:

Changes Made

  1. Removed orphaned context creation (line 175)
  2. Removed manual webhook server goroutine (lines 183-191)
  3. Added explanatory comment about automatic startup

Modified Code

Before:

opts.WebhookServer = webhook.NewServer(webhookOpts)
ctx, _ := context.WithCancel(context.Background())

// ... webhook registration code ...

go func() {
    log.Info("Starting webhook server in background")
    if err := opts.WebhookServer.Start(ctx); err != nil && err != http.ErrServerClosed {
        log.Error(err, "Webhook server failed unexpectedly", "port", cfg.WebhookPort)
    } else {
        log.Info("Webhook server shut down gracefully")
    }
}()

log.Info("Creating new controller manager")

After:

opts.WebhookServer = webhook.NewServer(webhookOpts)

// ... webhook registration code ...

// Webhook server will be started automatically by controller-runtime manager
log.Info("Creating new controller manager")

Testing

Aftter setting the manager image to quay.io/domino/hephaestus:sha-2d28d8c (this build):

  1. Controller starts normally
  2. Webhook server is registered but not started manually
  3. Manager starts successfully
  4. Manager automatically starts the webhook server (once)
  5. Worker pool initializes successfully
  6. Controllers enter their reconciliation loops
  7. No crash, no restart

@ddl-dclegg ddl-dclegg marked this pull request as ready for review October 2, 2025 16:52
@ddl-dclegg ddl-dclegg merged commit 4c16910 into main Oct 2, 2025
5 checks passed
@ddl-dclegg ddl-dclegg deleted the heph-webhook-startup branch October 2, 2025 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants