fix(target-allocator): add TLS certificate hot-reload for mTLS#4597
Conversation
The Target Allocator now automatically reloads TLS certificates when they are renewed by cert-manager. Previously, certificate renewals required a pod restart because certificates were only loaded once at startup. The fix uses fsnotify to watch the certificate directory and dynamically reloads certificates via the GetCertificate callback, enabling seamless certificate rotation without downtime. Fixes #4368 Signed-off-by: Charlie Le <charlie_le@apple.com>
pavolloffay
left a comment
There was a problem hiding this comment.
Great work. Just some minor comments.
|
|
||
| // In Kubernetes, secret updates create a new symlink target. | ||
| // We look for Create or Write events on any file in the directory. | ||
| if event.Op&(fsnotify.Create|fsnotify.Write) != 0 { |
There was a problem hiding this comment.
This implementation seems fine. In the contrib repo we had a race condition in one watcher impl https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/44104/changes#diff-6e4e17a95db08bf6e5ca6307cea3be19b08d3f3efe62bd6808ec4ffec7871938L128
There was a problem hiding this comment.
Thanks for the review and link to another implementation, @pavolloffay. This implementation looks at the cert directory instead of the actual files, so we should be good.
|
Hi @jaronoff97, could you help review? Thanks! |
|
|
||
| // In Kubernetes, secret updates create a new symlink target. | ||
| // We look for Create or Write events on any file in the directory. | ||
| if event.Op&(fsnotify.Create|fsnotify.Write) != 0 { |
There was a problem hiding this comment.
Don't we also need to check if the event involves the file we're interested in? We could have other files in this directory. Technically, this would just cause spurious reloads, so not a big deal, but if we can avoid them easily, we should do so.
There was a problem hiding this comment.
Added a debouncer since we're dealing with symlinks.
|
Also, shouldn't this change make fsnotify a direct dependency in go.mod? |
jaronoff97
left a comment
There was a problem hiding this comment.
It looks like @swiatekm got to all the comments i was going to say and more haha, otherwise this looks great, excited for this functionality to be in here!
When Kubernetes updates a Secret/ConfigMap, it performs atomic updates using symlinks which generate 3 filesystem events (timestamped dir, temp symlink, final symlink). The TLS certificate reloader was triggering a reload for each event, resulting in 3 reloads instead of 1. This commit adds 100ms time-based debouncing to batch rapid events: - scheduleReload() resets timer on each event - Single reload fires after events settle - Thread-safe timer management with mutex - Notification channel for debounced reload signal Added comprehensive tests covering: - Multiple rapid events → 1 reload - Timer reset behavior - Kubernetes atomic update simulation - Single event handling - Reload failure recovery - Context cancellation - Concurrent events Signed-off-by: Charlie Le <charlie_le@apple.com>
Previously, the TLS certificate reloader only watched the directory containing the certificate file. If the key or CA files were in different directories, changes to those files would not trigger a reload. This commit adds support for watching multiple directories: - Collects all unique directories for cert, key, and CA files - Watches each unique directory for changes - Debouncing works across all watched directories - Events in any directory trigger a single debounced reload Added test coverage: - Certificates in 3 separate directories - Events across multiple directories → 1 reload - Individual directory updates still work Signed-off-by: Charlie Le <charlie_le@apple.com>
…eReload Previously, most TLS reloader tests called scheduleReload() directly, bypassing the fsnotify integration. This meant: - Tests didn't verify fsnotify events properly trigger reloads - Tests were unit tests of debouncing, not integration tests - Real-world behavior wasn't being tested Fixed 7 tests to trigger actual filesystem events: 1. TestCertificateReloader_DebounceMultipleEvents - Now creates temp files instead of calling scheduleReload() - Tests full fsnotify → debounce → reload flow 2. TestCertificateReloader_DebounceResetTimer - Creates actual files to trigger CREATE events - Verifies timer reset with real fsnotify events 3. TestCertificateReloader_SimulateKubernetesAtomicUpdate - Properly simulates K8s symlink pattern - Creates timestamped dir, temp symlink, final symlink - Writes to actual cert files to ensure reload works 4. TestCertificateReloader_SingleEventStillWorks - Creates single file to trigger fsnotify event - Tests legitimate single events aren't delayed 5. TestCertificateReloader_ReloadFailureDoesntStopWatching - Removed duplicate scheduleReload() calls - Relies only on fsnotify WRITE events from file updates 6. TestCertificateReloader_ContextCancellationDuringDebounce - Creates filesystem event before cancelling context - Tests cancellation during debounce window 7. TestCertificateReloader_ConcurrentEvents - Creates concurrent filesystem events from goroutines - Tests thread safety with real fsnotify events All tests now properly verify the full integration: fsnotify events → Watch() loop → scheduleReload() → debounce → Reload() Signed-off-by: Charlie Le <charlie_le@apple.com>
Updated go.mod, thanks! |
…counts Replace test verification approach from counting reload callbacks to comparing actual certificate values. This provides more direct validation that certificates are being reloaded correctly. Changes: - Remove testReloadCallback mechanism from CertificateReloader - Add getCertificateBytes helper to extract certificates for comparison - Update all 8 tests to verify certificate changes instead of counts - Tests now write actual new certificates to trigger real reloads Signed-off-by: Charlie Le <charlie_le@apple.com>
…bouncing The previous debounce implementation could indefinitely postpone certificate reloads if filesystem events arrived faster than the 100ms debounce delay. This adds a maximum wait time (1 second) to guarantee reloads happen even during continuous events, while preserving the existing debounce behavior for normal Kubernetes secret updates. Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
CA certs still need to be handled as well, so I've set up ca_reloader to handle that. Signed-off-by: Charlie Le <charlie_le@apple.com>
swiatekm
left a comment
There was a problem hiding this comment.
Using controller-runtime's cert watcher is a big improvement. This mostly looks good to me right now, had a question and some comments about tests.
Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
|
Tests are failing for filelog-receiver, which seems unrelated to this change. |
The previous implementation incorrectly verified all certificates in cs.PeerCertificates individually against the root CA. This fails for certificate chains with intermediates, as intermediate certificates won't verify directly against the root CA. This fix follows the standard X.509 verification pattern: - Only verify the leaf certificate (cs.PeerCertificates[0]) - Add intermediate certificates (cs.PeerCertificates[1:]) to the Intermediates pool - Add KeyUsages check for client authentication - Let x509.Certificate.Verify() build and validate the chain Added comprehensive tests covering: - Certificate chains with intermediates - Missing intermediate certificates (should fail) - Multiple intermediate certificates - No client certificate (should fail) Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
swiatekm
left a comment
There was a problem hiding this comment.
LGTM. Thank you for the change and the patience in responding to review comments!
|
Thanks for reviewing @swiatekm, is there anything else preventing this PR from being merged? |
It's a significant change, so I'd like at least one more review before merging. |
jaronoff97
left a comment
There was a problem hiding this comment.
looks great, thank you @CharlieTLe
Description: Fix TLS certificate hot-reload for mTLS connections in the Target Allocator.
The Target Allocator now automatically reloads TLS certificates when they are renewed
by cert-manager. Previously, certificate renewals required a pod restart because
certificates were only loaded once at startup.
The fix introduces a
CertificateReloaderstruct that:GetCertificatecallback for seamless TLS handshakes with updated certsThis enables seamless certificate rotation without downtime.
Link to tracking Issue(s):
Testing: Added e2e chainsaw test that verifies certificate rotation by:
cmctl renewto trigger certificate renewalDocumentation: Added chloggen changelog entry describing the fix.