Bug 1765044: Adds proxy support to ingress operator #334
openshift-merge-robot merged 6 commits into openshift:master
Conversation
pkg/dns/aws/filewatcher.go
Outdated
```go
m.route53.Client.Config.HTTPClient.Transport = certsutil.MakeTLSTransport(certs)
m.elb.Client.Config.HTTPClient.Transport = certsutil.MakeTLSTransport(certs)
m.tags.Client.Config.HTTPClient.Transport = certsutil.MakeTLSTransport(certs)
```
Should this be made thread-safe?
pkg/dns/aws/filewatcher.go
Outdated
```go
	errChan <- m.fileWatcher.Start(stop)
}()

// Wait for the watcher to exit or an explicit stop.
```
this is a bit strange. Why launch two goroutines and then have a dummy waiter here? Can we just make m.fileWatcher.Start blocking, or is that not something we control?
pkg/dns/aws/filewatcher.go
Outdated
```go
// ca bundle of FileWatcher, returning false and the current ca
// bundle if the two are not equal.
func (m *Provider) caBundlesEqual() (bool, []byte, error) {
	watchedCAs, err := ioutil.ReadFile(m.fileWatcher.GetFile())
```
This is expensive. Can we just compare Stat mtimes, like I floated in golang/go#35887? I tested this locally, bumping the ConfigMap content and seeing the mtime change (although you need Stat to walk the symlink to see the real mtime in the linked file). On the other hand, reading the certs once a minute shouldn't be that bad, even if the reads are expensive ;).
And I guess more fundamentally, if we're going to read the whole file off the disk, it seems like we might as well update the CAs without bothering to check if there were any changes. It's not obvious to me that a bytewise comparison in memory is all that much cheaper than just parsing out the certs into Go CAs.
Another way might be to forgo fs notifications entirely and wrap access to the underlying file with a struct that implements a TTL for refresh. So like, calling CertFile.Get() would return cached data until its internal TTL (e.g. 30s, 1min) expires, at which point it will lock/refresh/re-cache/reset TTL. Such an approach wouldn't require any goroutines at all.
Performance considerations seem pretty inconsequential in this context — we're talking about reading a secret mount, which I believe is memory mapped? We could unconditionally re-read the contents every 15 seconds or something without any diffing at all and I'm not sure it would have any consequential impact on the platform
pkg/util/certs/certs.go
Outdated
```go
}

// CertsFromPEM parses pemCerts into a list of certificates.
func CertsFromPEM(pemCerts []byte) ([]*x509.Certificate, error) {
```
It's really unfortunate that we need to reproduce this code instead of being able to use the stdlib's CertPool.AppendCertsFromPEM. I guess AWS doesn't take a CertPool for wherever this is being consumed? Can we bump Go's DefaultTransport TLSClientConfig RootCAs instead of talking to the AWS libraries directly?
Oh, actually, you can drop CertsFromPEM and use AppendCertsFromPEM instead in ensureDNSClientsTLSTransport to construct a transport. Then pass that transport in instead of your current

```go
m.route53.Client.Config.HTTPClient.Transport = certsutil.MakeTLSTransport(certs)
m.elb.Client.Config.HTTPClient.Transport = certsutil.MakeTLSTransport(certs)
m.tags.Client.Config.HTTPClient.Transport = certsutil.MakeTLSTransport(certs)
```

which makes three identical transports.
Annoying right? The cloud provider libs are idiosyncratic... Azure won't use the stdlib transport the way you would want either, so we have to wire up everything in a way the lib demands
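A sketch of the AppendCertsFromPEM suggestion (`makeTLSTransport` is an illustrative name, not the PR's `certsutil.MakeTLSTransport`): build a single shared `http.Transport` whose `RootCAs` pool is filled by the stdlib, instead of hand-parsing PEM and constructing three identical transports.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"net/http"
	"time"
)

// makeTLSTransport builds one transport from a PEM CA bundle using the
// stdlib's AppendCertsFromPEM; the same *http.Transport can then be
// shared by the route53, elb, and tagging clients.
func makeTLSTransport(pemCerts []byte) (*http.Transport, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemCerts) {
		return nil, fmt.Errorf("no certificates found in PEM data")
	}
	return &http.Transport{TLSClientConfig: &tls.Config{RootCAs: pool}}, nil
}

func main() {
	// Generate a throwaway self-signed CA purely to exercise the helper.
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "test-ca"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
	}
	der, _ := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	pemBytes := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})

	transport, err := makeTLSTransport(pemBytes)
	fmt.Println(err == nil, transport != nil)

	// Garbage input is rejected instead of silently yielding an empty pool.
	_, err = makeTLSTransport([]byte("not a pem"))
	fmt.Println(err != nil)
}
```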
pkg/dns/aws/filewatcher.go
Outdated
```go
m.route53.Client.Config.HTTPClient.Transport = certsutil.MakeTLSTransport(certs)
m.elb.Client.Config.HTTPClient.Transport = certsutil.MakeTLSTransport(certs)
m.tags.Client.Config.HTTPClient.Transport = certsutil.MakeTLSTransport(certs)
```
You cannot mutate the clients.
pkg/operator/watcher/filewatcher.go
Outdated
```go
// New returns a new FileWatcher watching the given file.
func New(file string) (*FileWatcher, error) {
	var err error
```
pkg/operator/watcher/filewatcher.go
Outdated
```go
if isRemove(event) {
	if err := fw.watcher.Add(event.Name); err != nil {
		log.Error(err, "error re-watching file %s", fw.file)
	}
```
Return here? Or do we want to try to read the file after receiving a remove event?
This is what the watcher sees when the trusted ca bundle configmap is updated (i.e. a new ca cert is added):
2019-12-09T19:55:31.185Z INFO operator.filewatcher watcher/filewatcher.go:113 watched file change {"event": "\"/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem\": REMOVE"}
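The behavior in that log can be mimicked with a minimal sketch (the `event` type and `handle` function are illustrative stand-ins for the fsnotify handling, not the PR's code): the kubelet updates a mounted ConfigMap by swapping a symlink, which surfaces to the watcher as a REMOVE even though the path is still readable, so rather than returning early we re-add the watch and then re-read the file.

```go
package main

import "fmt"

// event mimics the shape of an fsnotify event for illustration.
type event struct {
	name string
	op   string
}

// handle re-adds the watch on REMOVE (the path still exists after the
// symlink swap) and then reads the file either way.
func handle(ev event, reAdd func(string) error, read func(string) ([]byte, error)) ([]byte, error) {
	if ev.op == "REMOVE" {
		if err := reAdd(ev.name); err != nil {
			return nil, err
		}
	}
	return read(ev.name)
}

func main() {
	var readded bool
	data, err := handle(
		event{name: "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem", op: "REMOVE"},
		func(name string) error { readded = true; return nil },
		func(name string) ([]byte, error) { return []byte("new bundle"), nil },
	)
	fmt.Println(readded, string(data), err == nil)
}
```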
pkg/dns/aws/filewatcher.go
Outdated
```go
	if err := m.ensureDNSClientsTLSTransport(); err != nil {
		log.Error(err, "failed to ensure dns client tls transport")
	}
}, 1*time.Minute, stop)
```
Is this a second, polling-based watcher built on top of the fsnotify-based watcher? This is very confusing.
@Miciah the fsnotify-based watcher is continuously watching fileName and updating currentData. startWatcher() starts the fsnotify-based watcher and starts a goroutine that periodically ensures dns clients are using the latest ca bundle by comparing the fsnotify-based watcher's currentData with the data from reading /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem. Let me know if you have any suggestions to improve. Maybe have handleEvent() update the dns clients?
What I find confusing is that you essentially have two watchers: a polling-based watcher and an fsnotify-based watcher, where the former is built atop the latter, and half the watcher logic is in the aws provider package and half is in the watcher package. Is the polling-based watcher logic going to be repeated in every provider implementation?
I would instead amend the fsnotify-based watcher to watch two channels: the fsnotify events channel and a ticker channel. Dan did something similar in openshift/router@4a4f5ab (that commit has some extra logic to distinguish between reloads triggered by fsnotify events and reloads triggered by ticks, but otherwise it does essentially what I am suggesting). Although others may disagree, I find the single combined fsnotify- and ticker-based watcher easier to understand and reason about.
@Miciah commit 4c6eb35 updated the PR and my testing against a live cluster produced expected results. However, the commit did not use a bool chan to update elb, tagging and route53 sessions. Commit 97eb21d implements the chan approach that you suggest. I'm waiting for my cluster to finish installing to test. I'll start testing this commit tomorrow morning PT. Thanks for the guidance!
@danehans: An error was encountered adding this pull request to the external tracker bugs for bug 1765044 on the Bugzilla server at https://bugzilla.redhat.com. Please contact an administrator to resolve this issue, then request a bug refresh.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh
/bugzilla refresh
@cliles: An error was encountered adding this pull request to the external tracker bugs for bug 1765044 on the Bugzilla server at https://bugzilla.redhat.com. Please contact an administrator to resolve this issue, then request a bug refresh.
```go
}
if len(region) == 0 {
	return nil, fmt.Errorf("region is required")
}
```
I'm not sure it is safe to move this validation from the session setup logic here to createDNSProvider. Here, we validate the region from the session configuration, which may be reading from ~/.aws/config or may be using the region from the cluster config, but createDNSProvider only has the cluster config, so it is not validating the same thing. @ironcladlou, do I have that right? It should only matter for local development though, so maybe it isn't worth the trouble to check the session config.
@Miciah see #334 (comment), the check is bubbled up to the aws provider of createDNSProvider().
cmd/ingress-operator/start.go
Outdated
```go
case configv1.AWSPlatformType:
	if len(platformStatus.AWS.Region) == 0 {
		return nil, fmt.Errorf("region is required")
	}
```
I wonder if this is backwards-compatible with existing storage (openshift/api@dedfb47)...
Is this new check necessary?
Nevermind, I misread — by the time you get here, the platformStatus has been created by getPlatformStatus and should be normalized/valid no matter the source. I think your validation here is correct.
@ironcladlou it's not a new check. I bubbled this check up from the dns/aws pkg.
pkg/dns/aws/filewatcher.go
Outdated
```go
m.elb = elb.New(sess, aws.NewConfig().WithRegion(m.config.Region))
m.route53 = route53.New(sess)
m.tags = resourcegroupstaggingapi.New(sess, aws.NewConfig().WithRegion("us-east-1"))
m.lock.Unlock()
```
Is this actually thread-safe absent locks near the readers throughout the rest of the code?
The FileWatcher struct uses a mutex to lock/unlock any mutations to the watched file data. The handleEvent() method uses a chan to indicate when the watcher updates the current data of the watched file. The ensureSessionTransport() updates the aws dns sessions when a value is received on the chan. StartWatcher() runs ensureSessionTransport() and filewatcher Start(), which blocks until the stop channel is closed and listens for fsnotify events and errors on chans. Let me know if you have any suggestions to improve thread safety.
The problem is we have code such as the following:

```go
outerError := m.tags.GetResourcesPages(&resourcegroupstaggingapi.GetResourcesInput{
	ResourceTypeFilters: []*string{aws.String("route53:hostedzone")},
	TagFilters:          tagFilters,
}, f)
```

To be thread safe, we now need to take a lock before using m.tags:

```go
m.lock.Lock()
outerError := m.tags.GetResourcesPages(&resourcegroupstaggingapi.GetResourcesInput{
	ResourceTypeFilters: []*string{aws.String("route53:hostedzone")},
	TagFilters:          tagFilters,
}, f)
m.lock.Unlock()
```

And ditto for m.route53 and m.elb.
The lock/unlock occurs here for that tagging api call, here for the route53 api call, and here for elb api calls.
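One way to keep reads race-free without holding the mutex across slow AWS API calls is to take the lock only long enough to snapshot the current client pointer, then invoke the API on the snapshot. This is a sketch under stated assumptions, not the PR's implementation; the client type below is a stand-in for the SDK clients.

```go
package main

import (
	"fmt"
	"sync"
)

// taggingClient stands in for an AWS SDK client whose transport gets
// rebuilt when the CA bundle changes.
type taggingClient struct{ transport string }

type Provider struct {
	lock sync.Mutex
	tags *taggingClient
}

// tagsClient returns a snapshot of the current client; callers then make
// API calls on the snapshot without holding the lock.
func (p *Provider) tagsClient() *taggingClient {
	p.lock.Lock()
	defer p.lock.Unlock()
	return p.tags
}

// swapTags installs a replacement client, e.g. after a CA bundle change.
func (p *Provider) swapTags(c *taggingClient) {
	p.lock.Lock()
	defer p.lock.Unlock()
	p.tags = c
}

func main() {
	p := &Provider{tags: &taggingClient{transport: "v1"}}
	inFlight := p.tagsClient() // an in-progress API call keeps using v1
	p.swapTags(&taggingClient{transport: "v2"})
	fmt.Println(inFlight.transport, p.tagsClient().transport)
}
```

The trade-off is that a call already in flight finishes on the old transport, which is usually acceptable for a CA rotation.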
I'm getting ready to push a commit that will cause the ingress-operator pod to gracefully restart when the ca bundle changes to simplify the implementation and remove the need to worry about runaway goroutines.
The operator already supports graceful shutdown. What if the operator were to shut down when the file changes just like when it receives a TERM? The operator would be immediately restarted and refresh its state with the latest cert contents. The shutdown approach doesn't require a sidecar container and would work for all current and future providers without the need for new state management code (assuming they're wiring up TLS when constructing their initial clients, which they all should be anyway.)
@ironcladlou is it common for customers to monitor operators? If so, do you think it would trigger an alarm and cause cluster admins to wonder why the operator restarted?
/bugzilla refresh
@ironcladlou: This pull request references Bugzilla bug 1765044, which is invalid.
/bugzilla refresh
@ironcladlou: This pull request references Bugzilla bug 1765044, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.
c9cbdc4 to d122bec
Miciah left a comment
A couple minor stylistic comments. Nothing important enough to worry about if CI passes.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: danehans, Miciah
The full list of commands accepted by this bot can be found here. The pull request process is described here.
/retest
Please review the full test history for this PR and help us cut down flakes.
@danehans: All pull requests linked via external trackers have merged. Bugzilla bug 1765044 has been moved to the MODIFIED state.
/cherrypick release-4.3
@ironcladlou: new pull request created: #340
@kalexand-rh if we have any docs related to manually managing ingress dns records for proxy environments, the limitation can now be removed.
@danehans: Bugzilla bug 1765044 is in an unrecognized state (VERIFIED) and will not be moved to the MODIFIED state.
I would like to get feedback on the implementation before proceeding with providers other than aws.

Update: The implementation is provider agnostic.