Reload Collectors without restart when ConfigMap changes #1903
Conversation
Thanks Mike, added a few comments.
I wonder if we should now revisit this mechanism:

```go
func (dm *DelayManager) RunSyncDaemonSetWithDelayAndSkipNewCalls(delay time.Duration, retries int, dests *odigosv1.DestinationList,
```

It was added to prevent a burst of restarts to the node collectors when sources are reconciled in a batch, one by one. We can also do this as a follow-up PR. If we can remove it, the odigos pipeline will start 5 seconds earlier.
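For reference, a minimal sketch of the debounce pattern behind this mechanism (hypothetical, simplified names; the real method also handles retries and the destination list): the first call schedules a sync after `delay`, and calls arriving while one is pending are skipped, so a burst of per-source reconciles produces a single daemonset sync.

```go
package sketch

import (
	"sync"
	"time"
)

// DelayManager is a hypothetical, condensed stand-in for the real type.
type DelayManager struct {
	mu      sync.Mutex
	pending bool
}

func (dm *DelayManager) RunWithDelayAndSkipNewCalls(delay time.Duration, syncFn func()) {
	dm.mu.Lock()
	defer dm.mu.Unlock()
	if dm.pending {
		return // a sync is already scheduled; drop this call
	}
	dm.pending = true
	time.AfterFunc(delay, func() {
		dm.mu.Lock()
		dm.pending = false
		dm.mu.Unlock()
		syncFn()
	})
}
```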
```go
if err != nil {
	return nil, err
}
err = watcher.Add(file)
```
According to the provider docs:

```go
// Provider is an interface that helps to retrieve a config map and watch for any
// changes to the config map. Implementations may load the config from a file,
// a database or any other source.
//
// The typical usage is the following:
//
// r, err := provider.Retrieve("file:/path/to/config")
// // Use r.Map; wait for watcher to be called.
// r.Close()
// r, err = provider.Retrieve("file:/path/to/config")
// // Use r.Map; wait for watcher to be called.
// r.Close()
// // repeat retrieve/wait/close cycle until it is time to shut down the Collector process.
// // ...
// provider.Shutdown()
```
Seems the `Retrieve` function may be called once per file update, which can accumulate over time and `Add` many watchers on the same file. Do we need to remove this watch after invoking the `wf` callback to ensure we are not leaking resources?
I think we don't need to worry about adding multiple watches on the same file from the same provider; the fsnotify docs say: "A path can only be watched once; watching it more than once is a no-op and will not return an error."

But you are right that this code will leak goroutines, so I added some handling for that with a boolean to track whether a watcher loop is already running (so we don't start another one).
```go
}

// start a new watcher routine only if one isn't already running, since Retrieve could be called multiple times
if !fmp.running {
```
Can we use `sync.Once` for this pattern (instead of a boolean and mutex)?
I'm not sure `sync.Once` is what we want. I think that will only let the routine be called once, ever, for this instance of the provider. What I'm doing here considers the case where the watcher closes (for whatever reason) and ends the goroutine, then `Retrieve` is called again on the same provider. In that case, we should start up the watcher again; `sync.Once` won't start it a second time.
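A tiny illustration of the difference (hypothetical snippet): once a `sync.Once` has fired, it will never run the function again, even after the watcher loop has exited.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var once sync.Once
	startWatcher := func() { fmt.Println("watcher started") }

	once.Do(startWatcher) // prints "watcher started"
	// ...watcher loop exits for some reason...
	once.Do(startWatcher) // no-op: sync.Once never re-runs the function
}
```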
collector/builder-config.yaml
```yaml
- gomod: go.opentelemetry.io/collector/confmap/provider/envprovider v0.106.0
- gomod: go.opentelemetry.io/collector/confmap/provider/httpprovider v0.106.0
- gomod: go.opentelemetry.io/collector/confmap/provider/httpsprovider v0.106.0
- gomod: go.opentelemetry.io/collector/confmap/provider/yamlprovider v0.106.0
```
Why are those needed?
See #1903 (comment)
I think we can remove them now that we have a provider and we don't use them.
We probably still want the env provider at least; the others I think can be removed.
Looks great, left a few non-blocking comments (except maybe the concurrent access to the `running` flag in the provider).
```diff
-// Calculate the hash of the config data and the secrets version hash, this is used to make sure the gateway will restart when the config changes
-configDataHash := common.Sha256Hash(fmt.Sprintf("%s-%s", configData, secretsVersionHash))
+// Use the hash of the secrets to make sure the gateway will restart when the secrets (mounted as environment variables) changes
+configDataHash := common.Sha256Hash(secretsVersionHash)
```
Can we remove the `configData` argument from this function?
That's what I did: since we don't want a change in the config data to trigger a new rollout, we don't need to hash the config data anymore. Is that what you were asking?
I was referring to the function signature of `syncDeployment`:

```go
func syncDeployment(dests *odigosv1.DestinationList, gateway *odigosv1.CollectorsGroup, configData string,
	ctx context.Context, c client.Client, scheme *runtime.Scheme, imagePullSecrets []string, odigosVersion string) (*appsv1.Deployment, error)
```
Ah, updated in 6d53ab0
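Presumably the updated signature just drops the parameter, something like:

```go
func syncDeployment(dests *odigosv1.DestinationList, gateway *odigosv1.CollectorsGroup,
	ctx context.Context, c client.Client, scheme *runtime.Scheme,
	imagePullSecrets []string, odigosVersion string) (*appsv1.Deployment, error)
```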
```go
// start a new watcher routine only if one isn't already running, since Retrieve could be called multiple times
if !fmp.running {
	go func() {
		fmp.running = true
```
I think there is possible non-safe concurrent access to this flag, since it is set inside a goroutine and read from the outer `Retrieve`.
Updated this to lock around the variable. This feels a little like "communicating by sharing memory," but I'm not sure of a better way to do what we want here.
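For illustration, the race-free shape might look like this (hypothetical, condensed; `fileProvider` and its fields stand in for the PR's actual provider type): the flag is read and written only under the mutex, and set before the goroutine starts rather than inside it.

```go
import "sync"

type fileProvider struct {
	mu      sync.Mutex
	running bool
}

func (fmp *fileProvider) ensureWatcherRunning(loop func()) {
	fmp.mu.Lock()
	defer fmp.mu.Unlock()
	if !fmp.running {
		fmp.running = true // set under the lock, before the goroutine starts
		go func() {
			defer func() {
				fmp.mu.Lock()
				fmp.running = false // a later Retrieve may restart the loop
				fmp.mu.Unlock()
			}()
			loop()
		}()
	}
}
```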
```go
if !fmp.running {
	go func() {
		fmp.running = true
		defer func() { fmp.done <- struct{}{} }()
```
If this code is invoked from a flow which isn't `Shutdown`, this call will block indefinitely, right?
Good point, I switched this back to using a `sync.WaitGroup`, which I think will fix that.
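A sketch of the `sync.WaitGroup` shape (hypothetical type and field names, mirroring the code under discussion): unlike sending on an unbuffered `done` channel, `wg.Done` never blocks, and `Shutdown` simply waits for the goroutine to exit.

```go
import (
	"context"
	"sync"
)

type fileProvider struct {
	wg     sync.WaitGroup
	stopCh chan struct{} // closed by Shutdown to stop the watcher loop
}

func (fmp *fileProvider) startWatcher(loop func(stop <-chan struct{})) {
	fmp.wg.Add(1)
	go func() {
		defer fmp.wg.Done() // never blocks, even if Shutdown is never called
		loop(fmp.stopCh)
	}()
}

func (fmp *fileProvider) Shutdown(ctx context.Context) error {
	close(fmp.stopCh)
	fmp.wg.Wait() // returns once the watcher goroutine has exited
	return nil
}
```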
Kubernetes automatically updates mounted ConfigMaps in a pod (see https://kubernetes.io/docs/tutorials/configuration/updating-configuration-via-a-configmap/), but the workload needs to watch for the update itself. So in our case, if the collector knows to watch for file changes, it can signal a dynamic config reload without restarting the collector.

This forks the default fileprovider in our collectors to create a new odigosfileprovider that uses fsnotify to watch for updates to the collector ConfigMap. When the ConfigMap is updated, the new fileprovider signals the collector to hot-reload its config. Technically, we watch for `fsnotify.Remove` events, because the projected ConfigMap data is a symlink, not an actual copy of the ConfigMap (meaning that watching `fsnotify.Write` doesn't trigger any update).

This means we don't need to restart the collector deployments or daemonsets for basic config updates, so those controllers have been updated to no longer update deployments when just the ConfigMap has changed. They can of course still be manually redeployed with `kubectl`.

In my manual testing, it took about 1 minute for the change to be reflected, which is due to the default kubelet sync period (https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#mounted-configmaps-are-updated-automatically). Not sure how we could test for this automatically, but I'm working on that.
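Not the PR's exact code, but a condensed sketch of the watch-and-reload loop this describes: because the kubelet atomically swaps the projected symlink rather than writing the file in place, the provider listens for `fsnotify.Remove`, re-adds the watch on the path, and invokes the collector's `confmap.WatcherFunc` to trigger the reload.

```go
package odigosfileprovider // hypothetical package layout

import (
	"github.com/fsnotify/fsnotify"
	"go.opentelemetry.io/collector/confmap"
)

// watchFile re-arms the watch after each symlink swap and signals the
// collector to call Retrieve again for the updated config.
func watchFile(path string, watcher *fsnotify.Watcher, wf confmap.WatcherFunc, stop <-chan struct{}) {
	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			// Projected ConfigMap updates surface as Remove events.
			if event.Op&fsnotify.Remove != 0 {
				_ = watcher.Add(path)      // re-add the watch on the new symlink target
				wf(&confmap.ChangeEvent{}) // tell the collector to hot-reload its config
			}
		case <-watcher.Errors:
			return
		case <-stop:
			return
		}
	}
}
```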