Skip to content

fix(target-allocator): add TLS certificate hot-reload for mTLS#4597

Merged
swiatekm merged 20 commits into
open-telemetry:mainfrom
CharlieTLe:ta-mtls-bug
Jan 27, 2026
Merged

fix(target-allocator): add TLS certificate hot-reload for mTLS#4597
swiatekm merged 20 commits into
open-telemetry:mainfrom
CharlieTLe:ta-mtls-bug

Conversation

@CharlieTLe

Copy link
Copy Markdown
Contributor

Description: Fix TLS certificate hot-reload for mTLS connections in the Target Allocator.

The Target Allocator now automatically reloads TLS certificates when they are renewed
by cert-manager. Previously, certificate renewals required a pod restart because
certificates were only loaded once at startup.

The fix introduces a CertificateReloader struct that:

  • Uses fsnotify to watch the certificate directory for changes
  • Dynamically reloads certificates when files are modified
  • Provides a GetCertificate callback for seamless TLS handshakes with updated certs

This enables seamless certificate rotation without downtime.

Link to tracking Issue(s):

Testing: Added e2e chainsaw test that verifies certificate rotation by:

  1. Capturing the initial certificate date before renewal
  2. Running cmctl renew to trigger certificate renewal
  3. Verifying the new certificate date differs from the initial date
  4. Confirming the Target Allocator continues serving with the renewed certificate

Documentation: Added chloggen changelog entry describing the fix.

The Target Allocator now automatically reloads TLS certificates when
they are renewed by cert-manager. Previously, certificate renewals
required a pod restart because certificates were only loaded once at
startup.

The fix uses fsnotify to watch the certificate directory and
dynamically reloads certificates via the GetCertificate callback,
enabling seamless certificate rotation without downtime.

Fixes #4368

Signed-off-by: Charlie Le <charlie_le@apple.com>
@CharlieTLe CharlieTLe requested a review from a team as a code owner January 9, 2026 00:26

@pavolloffay pavolloffay left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Just some minor comments.


// In Kubernetes, secret updates create a new symlink target.
// We look for Create or Write events on any file in the directory.
if event.Op&(fsnotify.Create|fsnotify.Write) != 0 {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and link to another implementation, @pavolloffay. This implementation looks at the cert directory instead of the actual files, so we should be good.

@CharlieTLe

Copy link
Copy Markdown
Contributor Author

Hi @jaronoff97, could you help review? Thanks!

Comment thread cmd/otel-allocator/internal/config/tls_reloader.go Outdated

// In Kubernetes, secret updates create a new symlink target.
// We look for Create or Write events on any file in the directory.
if event.Op&(fsnotify.Create|fsnotify.Write) != 0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we also need to check if the event involves the file we're interested in? We could have other files in this directory. Technically, this would just cause spurious reloads, so not a big deal, but if we can avoid them easily, we should do so.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a debouncer since we're dealing with symlinks.

Comment thread cmd/otel-allocator/internal/config/tls_reloader.go Outdated
@swiatekm

Copy link
Copy Markdown
Contributor

Also, shouldn't this change make fsnotify a direct dependency in go.mod?

@jaronoff97 jaronoff97 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like @swiatekm got to all the comments i was going to say and more haha, otherwise this looks great, excited for this functionality to be in here!

When Kubernetes updates a Secret/ConfigMap, it performs atomic updates
using symlinks which generate 3 filesystem events (timestamped dir,
temp symlink, final symlink). The TLS certificate reloader was
triggering a reload for each event, resulting in 3 reloads instead of 1.

This commit adds 100ms time-based debouncing to batch rapid events:
- scheduleReload() resets timer on each event
- Single reload fires after events settle
- Thread-safe timer management with mutex
- Notification channel for debounced reload signal

Added comprehensive tests covering:
- Multiple rapid events → 1 reload
- Timer reset behavior
- Kubernetes atomic update simulation
- Single event handling
- Reload failure recovery
- Context cancellation
- Concurrent events

Signed-off-by: Charlie Le <charlie_le@apple.com>
Previously, the TLS certificate reloader only watched the directory
containing the certificate file. If the key or CA files were in
different directories, changes to those files would not trigger
a reload.

This commit adds support for watching multiple directories:
- Collects all unique directories for cert, key, and CA files
- Watches each unique directory for changes
- Debouncing works across all watched directories
- Events in any directory trigger a single debounced reload

Added test coverage:
- Certificates in 3 separate directories
- Events across multiple directories → 1 reload
- Individual directory updates still work

Signed-off-by: Charlie Le <charlie_le@apple.com>
…eReload

Previously, most TLS reloader tests called scheduleReload() directly,
bypassing the fsnotify integration. This meant:
- Tests didn't verify fsnotify events properly trigger reloads
- Tests were unit tests of debouncing, not integration tests
- Real-world behavior wasn't being tested

Fixed 7 tests to trigger actual filesystem events:

1. TestCertificateReloader_DebounceMultipleEvents
   - Now creates temp files instead of calling scheduleReload()
   - Tests full fsnotify → debounce → reload flow

2. TestCertificateReloader_DebounceResetTimer
   - Creates actual files to trigger CREATE events
   - Verifies timer reset with real fsnotify events

3. TestCertificateReloader_SimulateKubernetesAtomicUpdate
   - Properly simulates K8s symlink pattern
   - Creates timestamped dir, temp symlink, final symlink
   - Writes to actual cert files to ensure reload works

4. TestCertificateReloader_SingleEventStillWorks
   - Creates single file to trigger fsnotify event
   - Tests legitimate single events aren't delayed

5. TestCertificateReloader_ReloadFailureDoesntStopWatching
   - Removed duplicate scheduleReload() calls
   - Relies only on fsnotify WRITE events from file updates

6. TestCertificateReloader_ContextCancellationDuringDebounce
   - Creates filesystem event before cancelling context
   - Tests cancellation during debounce window

7. TestCertificateReloader_ConcurrentEvents
   - Creates concurrent filesystem events from goroutines
   - Tests thread safety with real fsnotify events

All tests now properly verify the full integration:
fsnotify events → Watch() loop → scheduleReload() → debounce → Reload()

Signed-off-by: Charlie Le <charlie_le@apple.com>
@CharlieTLe

Copy link
Copy Markdown
Contributor Author

Also, shouldn't this change make fsnotify a direct dependency in go.mod?

Updated go.mod, thanks!

Comment thread cmd/otel-allocator/internal/config/tls_reloader.go Outdated
Comment thread cmd/otel-allocator/internal/config/tls_reloader.go Outdated
Comment thread cmd/otel-allocator/internal/config/tls_reloader_test.go Outdated
…counts

Replace test verification approach from counting reload callbacks to
comparing actual certificate values. This provides more direct validation
that certificates are being reloaded correctly.

Changes:
- Remove testReloadCallback mechanism from CertificateReloader
- Add getCertificateBytes helper to extract certificates for comparison
- Update all 8 tests to verify certificate changes instead of counts
- Tests now write actual new certificates to trigger real reloads

Signed-off-by: Charlie Le <charlie_le@apple.com>
…bouncing

The previous debounce implementation could indefinitely postpone certificate
reloads if filesystem events arrived faster than the 100ms debounce delay.
This adds a maximum wait time (1 second) to guarantee reloads happen even
during continuous events, while preserving the existing debounce behavior
for normal Kubernetes secret updates.

Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
@CharlieTLe CharlieTLe requested a review from swiatekm January 16, 2026 19:18
Comment thread cmd/otel-allocator/internal/config/tls_reloader.go Outdated
CA certs still need to be handled as well, so I've set up ca_reloader to
handle that.

Signed-off-by: Charlie Le <charlie_le@apple.com>
@CharlieTLe CharlieTLe requested a review from swiatekm January 23, 2026 01:27
@github-actions

github-actions Bot commented Jan 23, 2026

Copy link
Copy Markdown
Contributor

E2E Test Results

 34 files  ±0  227 suites  ±0   2h 0m 19s ⏱️ - 3m 14s
 90 tests ±0   90 ✅ +1  0 💤 ±0  0 ❌  - 1 
231 runs  ±0  231 ✅ +1  0 💤 ±0  0 ❌  - 1 

Results for commit 6a9b4d5. ± Comparison against base commit b2970a1.

♻️ This comment has been updated with latest results.

@swiatekm swiatekm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using controller-runtime's cert watcher is a big improvement. This mostly looks good to me right now, had a question and some comments about tests.

Comment thread cmd/otel-allocator/internal/config/config.go
Comment thread cmd/otel-allocator/internal/config/ca_reloader_test.go Outdated
Comment thread cmd/otel-allocator/internal/config/ca_reloader.go Outdated
Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
@CharlieTLe

Copy link
Copy Markdown
Contributor Author

Tests are failing for filelog-receiver, which seems unrelated to this change.

The previous implementation incorrectly verified all certificates in
cs.PeerCertificates individually against the root CA. This fails for
certificate chains with intermediates, as intermediate certificates
won't verify directly against the root CA.

This fix follows the standard X.509 verification pattern:
- Only verify the leaf certificate (cs.PeerCertificates[0])
- Add intermediate certificates (cs.PeerCertificates[1:]) to the
  Intermediates pool
- Add KeyUsages check for client authentication
- Let x509.Certificate.Verify() build and validate the chain

Added comprehensive tests covering:
- Certificate chains with intermediates
- Missing intermediate certificates (should fail)
- Multiple intermediate certificates
- No client certificate (should fail)

Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>

@swiatekm swiatekm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for the change and the patience in responding to review comments!

@CharlieTLe

Copy link
Copy Markdown
Contributor Author

Thanks for reviewing @swiatekm, is there anything else preventing this PR from being merged?

@swiatekm

Copy link
Copy Markdown
Contributor

Thanks for reviewing @swiatekm, is there anything else preventing this PR from being merged?

It's a significant change, so I'd like at least one more review before merging.

@jaronoff97 jaronoff97 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, thank you @CharlieTLe

@swiatekm swiatekm merged commit 605a8d7 into open-telemetry:main Jan 27, 2026
85 of 88 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Renewed certificates are not reloaded by target allocator

4 participants