Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reload Collectors without restart when ConfigMap changes #1903

Merged
merged 11 commits into from
Dec 11, 2024

Conversation

damemi
Copy link
Contributor

@damemi damemi commented Dec 2, 2024

Kubernetes automatically updates mounted configmaps in a pod (see https://kubernetes.io/docs/tutorials/configuration/updating-configuration-via-a-configmap/). But, the workload needs to watch for the update itself. So in our case, if the collector knows to watch for file changes, it can automatically signal a dynamic config reload without restarting the collector.

This forks the default fileprovider in our collectors to create a new odigosfileprovider that uses fsnotify to watch for updates to the collector ConfigMap. When the ConfigMap is updated, the new fileprovider will signal the collector to hot-reload its config. Technically, we watch for fsnotify.Remove events, because the projected configmap data is a symlink, not an actual copy of the configmap (meaning that watching fsnotify.Write doesn't trigger any update).

This means we don't need to restart the collector deployments or daemonsets for basic config updates, so those controllers have been updated to no longer update deployments when just the configmap has changed. They can of course still be manually redeployed with kubectl.

In my manual testing, it took about 1 minute for the change to be reflected, which is due to the default kubelet sync period (https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#mounted-configmaps-are-updated-automatically). Not sure how we could test for this automatically but working on that

Copy link

@damemi damemi requested a review from RonFed December 3, 2024 14:54
Copy link
Collaborator

@blumamir blumamir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks mike, added few comments.

I wonder if we should now revisit this mechanism:

func (dm *DelayManager) RunSyncDaemonSetWithDelayAndSkipNewCalls(delay time.Duration, retries int, dests *odigosv1.DestinationList,

It was added to prevent a burst of restarts to the node collectors when sources a reconciled in a batch one-by-one. we can do it also as a followup PR. If we can remove it, we will have odigos pipeline starting 5 seconds earlier.

collector/odigosotelcol/go.mod Outdated Show resolved Hide resolved
autoscaler/controllers/gateway/deployment.go Outdated Show resolved Hide resolved
collector/builder-config.yaml Outdated Show resolved Hide resolved
collector/providers/odigosfileprovider/provider.go Outdated Show resolved Hide resolved
if err != nil {
return nil, err
}
err = watcher.Add(file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the provider docs:

// Provider is an interface that helps to retrieve a config map and watch for any
// changes to the config map. Implementations may load the config from a file,
// a database or any other source.
//
// The typical usage is the following:
//
//	r, err := provider.Retrieve("file:/path/to/config")
//	// Use r.Map; wait for watcher to be called.
//	r.Close()
//	r, err = provider.Retrieve("file:/path/to/config")
//	// Use r.Map; wait for watcher to be called.
//	r.Close()
//	// repeat retrieve/wait/close cycle until it is time to shut down the Collector process.
//	// ...
//	provider.Shutdown()

Seems the Retrieve function may be called once per file update which can accumulate over time and Add many watchers on the same file.

Do we need to remove this watch after invoking the wf callback to ensure we are not leaking resources?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to worry about adding multiple watches on the same file from the same provider, the fsnotify docs say: A path can only be watched once; watching it more than once is a no-op and will not return an error.

But you are right that this code will leak goroutines, so I added some handling for that with a boolean to track if a watcher loop is already running (so we don't start another one)

collector/providers/odigosfileprovider/provider.go Outdated Show resolved Hide resolved
@damemi damemi force-pushed the dynamic-config branch 3 times, most recently from 84df7b0 to f8b2fdf Compare December 5, 2024 14:05
}

// start a new watcher routine only if one isn't already running, since Retrieve could be called multiple times
if !fmp.running {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use sync.Once for this pattern? (instead of boolean and mutex)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure sync.Once is what we want. I think that will only let the routine be called once, ever, for this instance of the provider. What I'm doing here considers the case where the watcher closes (for whatever reason) and ends the goroutine, then Retrieve is called again on the same Provider. In that case, we should start up the watcher again. sync.Once won't start it a 2nd time

tests/e2e/workload-lifecycle/chainsaw-test.yaml Outdated Show resolved Hide resolved
Comment on lines 111 to 114
- gomod: go.opentelemetry.io/collector/confmap/provider/envprovider v0.106.0
- gomod: go.opentelemetry.io/collector/confmap/provider/httpprovider v0.106.0
- gomod: go.opentelemetry.io/collector/confmap/provider/httpsprovider v0.106.0
- gomod: go.opentelemetry.io/collector/confmap/provider/yamlprovider v0.106.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are those needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove them now that we have a provider and we don't use them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably will still want the env provider at least, others I think can be removed

Copy link
Collaborator

@RonFed RonFed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, left a few non-blocking comments (except maybe the concurrent access to the running flag in the provider)

// Calculate the hash of the config data and the secrets version hash, this is used to make sure the gateway will restart when the config changes
configDataHash := common.Sha256Hash(fmt.Sprintf("%s-%s", configData, secretsVersionHash))
// Use the hash of the secrets to make sure the gateway will restart when the secrets (mounted as environment variables) changes
configDataHash := common.Sha256Hash(secretsVersionHash)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the configData argument from this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's what I did, since we don't want a change in the configdata to trigger a new rollout, we don't need to hash the config data anymore. is that what you were asking?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to the function signature of syncDeployment:

func syncDeployment(dests *odigosv1.DestinationList, gateway *odigosv1.CollectorsGroup, configData string,
	ctx context.Context, c client.Client, scheme *runtime.Scheme, imagePullSecrets []string, odigosVersion string) (*appsv1.Deployment, error)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, updated in 6d53ab0

collector/providers/odigosfileprovider/provider.go Outdated Show resolved Hide resolved
// start a new watcher routine only if one isn't already running, since Retrieve could be called multiple times
if !fmp.running {
go func() {
fmp.running = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is possible non-safe concurrent access to this flag since this is set inside a goroutine, and being read from the outer Retrieve

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated this to lock around the variable. this feels a little like "communicating by sharing memory" but I'm not sure of a better way to do what we want here

collector/providers/odigosfileprovider/go.mod Outdated Show resolved Hide resolved
collector/providers/odigosfileprovider/provider.go Outdated Show resolved Hide resolved
collector/providers/odigosfileprovider/provider.go Outdated Show resolved Hide resolved
if !fmp.running {
go func() {
fmp.running = true
defer func() { fmp.done <- struct{}{} }()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this code is invoked from a flow which isn't Shutdown, this call will block indefinitely, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I switched this back to using a waitgroup which I think will fix that

@damemi damemi enabled auto-merge (squash) December 11, 2024 14:22
@damemi damemi merged commit 9ee4343 into odigos-io:main Dec 11, 2024
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants