Reload Collectors without restart when ConfigMap changes #1903

damemi · 2024-12-02T16:23:50Z

Kubernetes automatically updates mounted ConfigMaps in a pod (see https://kubernetes.io/docs/tutorials/configuration/updating-configuration-via-a-configmap/). However, the workload has to watch for the update itself. So in our case, if the collector watches for file changes, it can signal a dynamic config reload on its own, without restarting the collector.

This forks the default fileprovider in our collectors to create a new odigosfileprovider that uses fsnotify to watch for updates to the collector ConfigMap. When the ConfigMap is updated, the new fileprovider will signal the collector to hot-reload its config. Technically, we watch for fsnotify.Remove events, because the projected configmap data is a symlink, not an actual copy of the configmap (meaning that watching fsnotify.Write doesn't trigger any update).
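To illustrate why a write event never shows up, here is a minimal stdlib-only sketch of the symlink layout a ConfigMap volume uses. The file name `collector-conf`, the version names, and the helpers `writeVersion` and `runDemo` are hypothetical; this is not the provider code from this PR. The path the collector reads is a symlink through `..data`, and an update swaps `..data` to a new directory and deletes the old one, so the old file is removed rather than written to.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeVersion creates a new payload directory with one config file,
// atomically repoints the "..data" symlink at it, and then deletes the
// previous payload directory. Names here are illustrative assumptions.
func writeVersion(root, version, content string) error {
	dir := filepath.Join(root, ".."+version)
	if err := os.Mkdir(dir, 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(dir, "collector-conf"), []byte(content), 0o644); err != nil {
		return err
	}
	dataLink := filepath.Join(root, "..data")
	old, _ := os.Readlink(dataLink) // empty on the first write
	tmp := filepath.Join(root, "..data_tmp")
	if err := os.Symlink(filepath.Base(dir), tmp); err != nil {
		return err
	}
	if err := os.Rename(tmp, dataLink); err != nil { // atomic symlink swap
		return err
	}
	if old != "" {
		// The old file is deleted here; nothing is ever written to it.
		return os.RemoveAll(filepath.Join(root, old))
	}
	return nil
}

// runDemo reads the same path before and after a simulated update.
func runDemo() (string, string, error) {
	root, err := os.MkdirTemp("", "configmap-volume")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)

	if err := writeVersion(root, "v1", "exporters: [a]"); err != nil {
		return "", "", err
	}
	// The file the workload reads is a symlink that goes through "..data".
	file := filepath.Join(root, "collector-conf")
	if err := os.Symlink(filepath.Join("..data", "collector-conf"), file); err != nil {
		return "", "", err
	}
	before, err := os.ReadFile(file)
	if err != nil {
		return "", "", err
	}
	if err := writeVersion(root, "v2", "exporters: [b]"); err != nil {
		return "", "", err
	}
	after, err := os.ReadFile(file)
	if err != nil {
		return "", "", err
	}
	return string(before), string(after), nil
}

func main() {
	before, after, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("before:", before)
	fmt.Println("after: ", after)
}
```

The path stays the same while the file behind it is replaced and the previous one is deleted, which fits the choice above to react to fsnotify.Remove rather than fsnotify.Write.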

This means we don't need to restart the collector deployments or daemonsets for basic config updates, so those controllers have been updated to no longer update deployments when just the configmap has changed. They can of course still be manually redeployed with kubectl.

In my manual testing, it took about 1 minute for the change to be reflected, which is due to the default kubelet sync period (https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#mounted-configmaps-are-updated-automatically). I'm not sure yet how we could test for this automatically, but I'm working on that.
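One possible shape for an automated check is to poll until the reload is visible, with a timeout longer than the kubelet sync delay. The sketch below is only an assumption about how such a helper could look; `waitFor`, `errTimeout`, and `runDemo` are hypothetical names, and the condition is a stand-in for whatever signal shows the collector picked up the new config.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the condition never became true in time.
var errTimeout = errors.New("condition not met before timeout")

// waitFor calls cond every interval until it returns true or timeout elapses.
func waitFor(timeout, interval time.Duration, cond func() bool) error {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return nil
		}
		if time.Now().After(deadline) {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

// runDemo exercises one condition that eventually succeeds and one that never does.
func runDemo() (eventually, never error) {
	calls := 0
	// Stand-in for "the collector reports the new config": true on the third poll.
	eventually = waitFor(time.Second, time.Millisecond, func() bool {
		calls++
		return calls >= 3
	})
	never = waitFor(10*time.Millisecond, time.Millisecond, func() bool { return false })
	return eventually, never
}

func main() {
	eventually, never := runDemo()
	fmt.Println("eventually:", eventually)
	fmt.Println("never:", never)
}
```

In a real test the timeout would need to comfortably exceed the roughly one-minute delay observed above.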

notion-workspace · 2024-12-02T16:25:09Z

Research: Dynamic collector configuration instead of static ConfigMap

blumamir

Thanks Mike, added a few comments.

I wonder if we should now revisit this mechanism:

odigos/autoscaler/controllers/datacollection/daemonset.go

Line 49 in 89aa4d0

    
           func (dm *DelayManager) RunSyncDaemonSetWithDelayAndSkipNewCalls(delay time.Duration, retries int, dests *odigosv1.DestinationList,

It was added to prevent a burst of restarts to the node collectors when sources are reconciled one by one in a batch. We can also do it as a follow-up PR. If we can remove it, the odigos pipeline will start 5 seconds earlier.
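For readers unfamiliar with the pattern, a "run with delay and skip new calls" guard can be sketched roughly as follows. This is not the DelayManager implementation from the repository; `delayedRunner`, `Schedule`, and `runDemo` are hypothetical names, and retries are left out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// delayedRunner runs a function once after a delay and ignores calls that
// arrive while a run is already pending.
type delayedRunner struct {
	mu      sync.Mutex
	pending bool
}

// Schedule reports true if fn was scheduled and false if the call was skipped.
func (d *delayedRunner) Schedule(delay time.Duration, fn func()) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.pending {
		return false
	}
	d.pending = true
	time.AfterFunc(delay, func() {
		fn()
		d.mu.Lock()
		d.pending = false
		d.mu.Unlock()
	})
	return true
}

// runDemo fires a burst of five calls and counts how many were scheduled and run.
func runDemo() (int, int) {
	var d delayedRunner
	scheduled, runs := 0, 0
	done := make(chan struct{}, 1)
	for i := 0; i < 5; i++ { // a burst of reconcile events
		if d.Schedule(20*time.Millisecond, func() {
			runs++
			done <- struct{}{}
		}) {
			scheduled++
		}
	}
	<-done // wait for the single delayed run to finish
	return scheduled, runs
}

func main() {
	scheduled, runs := runDemo()
	fmt.Println("scheduled:", scheduled, "runs:", runs)
}
```

The cost of this pattern is that the first sync always waits for the full delay, which is the startup latency mentioned above.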

collector/odigosotelcol/go.mod

autoscaler/controllers/gateway/deployment.go

collector/builder-config.yaml

collector/providers/odigosfileprovider/provider.go

blumamir · 2024-12-04T04:57:34Z

collector/providers/odigosfileprovider/provider.go

+	if err != nil {
+		return nil, err
+	}
+	err = watcher.Add(file)


According to the provider docs:

    // Provider is an interface that helps to retrieve a config map and watch for any
    // changes to the config map. Implementations may load the config from a file,
    // a database or any other source.
    //
    // The typical usage is the following:
    //
    //	r, err := provider.Retrieve("file:/path/to/config")
    //	// Use r.Map; wait for watcher to be called.
    //	r.Close()
    //	r, err = provider.Retrieve("file:/path/to/config")
    //	// Use r.Map; wait for watcher to be called.
    //	r.Close()
    //	// repeat retrieve/wait/close cycle until it is time to shut down the Collector process.
    //	// ...
    //	provider.Shutdown()

It seems the Retrieve function may be called once per file update, so over time this could Add many watchers on the same file.

Do we need to remove this watch after invoking the wf callback to ensure we are not leaking resources?

I think we don't need to worry about adding multiple watches on the same file from the same provider. The fsnotify docs say: "A path can only be watched once; watching it more than once is a no-op and will not return an error."

But you are right that this code will leak goroutines, so I added some handling for that with a boolean to track if a watcher loop is already running (so we don't start another one)

collector/providers/odigosfileprovider/provider.go

RonFed · 2024-12-05T16:29:43Z

collector/providers/odigosfileprovider/provider.go

+	}
+
+	// start a new watcher routine only if one isn't already running, since Retrieve could be called multiple times
+	if !fmp.running {


can we use sync.Once for this pattern? (instead of boolean and mutex)

I'm not sure sync.Once is what we want. I think that will only let the routine be called once, ever, for this instance of the provider. What I'm doing here considers the case where the watcher closes (for whatever reason) and ends the goroutine, then Retrieve is called again on the same Provider. In that case, we should start up the watcher again. sync.Once won't start it a 2nd time
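The difference can be sketched with the standard library alone. The type below is a hypothetical illustration, not the provider code in this PR: `watcherGuard`, `Start`, and `runDemo` are made-up names, and the loop body is a stand-in for the fsnotify event loop. The flag is set and cleared under a mutex, so a new loop can start after the previous one exits, whereas sync.Once runs its function only once for the lifetime of the value.

```go
package main

import (
	"fmt"
	"sync"
)

// watcherGuard starts a background loop only when none is running, and
// allows a new loop to start after the previous one has exited.
type watcherGuard struct {
	mu      sync.Mutex
	running bool
	wg      sync.WaitGroup
}

// Start launches loop in a goroutine unless one is already active.
// It reports whether a new goroutine was started.
func (g *watcherGuard) Start(loop func()) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.running {
		return false
	}
	g.running = true // set under the lock, before the goroutine starts
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		defer func() { // runs before wg.Done, so Wait sees running == false
			g.mu.Lock()
			g.running = false
			g.mu.Unlock()
		}()
		loop()
	}()
	return true
}

// runDemo compares the restartable guard with sync.Once.
func runDemo() (first, duplicate, restarted bool, onceRuns int) {
	var g watcherGuard
	stop := make(chan struct{})
	first = g.Start(func() { <-stop })     // loop blocks until stopped
	duplicate = g.Start(func() { <-stop }) // skipped: already running
	close(stop)                            // simulate the watcher closing
	g.wg.Wait()
	restarted = g.Start(func() {}) // allowed again after the loop exited
	g.wg.Wait()

	var once sync.Once
	for i := 0; i < 2; i++ {
		once.Do(func() { onceRuns++ }) // the second call is a no-op
	}
	return first, duplicate, restarted, onceRuns
}

func main() {
	first, duplicate, restarted, onceRuns := runDemo()
	fmt.Println(first, duplicate, restarted, onceRuns)
}
```

Setting the flag inside Start rather than inside the goroutine also avoids a window in which two Retrieve calls could both see it as false.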

collector/providers/odigosfileprovider/provider.go

tests/e2e/workload-lifecycle/chainsaw-test.yaml

RonFed · 2024-12-05T16:36:31Z

collector/builder-config.yaml

+  - gomod: go.opentelemetry.io/collector/confmap/provider/envprovider v0.106.0
+  - gomod: go.opentelemetry.io/collector/confmap/provider/httpprovider v0.106.0
+  - gomod: go.opentelemetry.io/collector/confmap/provider/httpsprovider v0.106.0
+  - gomod: go.opentelemetry.io/collector/confmap/provider/yamlprovider v0.106.0


Why are those needed?

See #1903 (comment)

I think we can remove them now that we have a provider and we don't use them.

We probably will still want the env provider at least, others I think can be removed

RonFed

Looks great, left a few non-blocking comments (except maybe the concurrent access to the running flag in the provider)

RonFed · 2024-12-07T13:07:57Z

autoscaler/controllers/gateway/deployment.go

-	// Calculate the hash of the config data and the secrets version hash, this is used to make sure the gateway will restart when the config changes
-	configDataHash := common.Sha256Hash(fmt.Sprintf("%s-%s", configData, secretsVersionHash))
+	// Use the hash of the secrets  to make sure the gateway will restart when the secrets (mounted as environment variables) changes
+	configDataHash := common.Sha256Hash(secretsVersionHash)


Can we remove the configData argument from this function?

that's what I did, since we don't want a change in the configdata to trigger a new rollout, we don't need to hash the config data anymore. is that what you were asking?

I was referring to the function signature of syncDeployment:

func syncDeployment(dests *odigosv1.DestinationList, gateway *odigosv1.CollectorsGroup, configData string, ctx context.Context, c client.Client, scheme *runtime.Scheme, imagePullSecrets []string, odigosVersion string) (*appsv1.Deployment, error)

Ah, updated in 6d53ab0
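As a rough illustration of the effect of the diff above, the sketch below compares the old and new hash inputs. `sha256Hex`, `oldRolloutHash`, and `newRolloutHash` are hypothetical stand-ins (the real code uses common.Sha256Hash, whose output format is not shown here), and the sample config strings are made up.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sha256Hex is a stand-in helper that returns a hex-encoded SHA-256 digest.
func sha256Hex(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// oldRolloutHash mixes the collector config into the hash, so any config
// edit produces a different value.
func oldRolloutHash(configData, secretsVersionHash string) string {
	return sha256Hex(fmt.Sprintf("%s-%s", configData, secretsVersionHash))
}

// newRolloutHash depends only on the secrets version, so config edits
// leave it unchanged.
func newRolloutHash(secretsVersionHash string) string {
	return sha256Hex(secretsVersionHash)
}

func main() {
	const secrets = "secrets-v1"
	configA, configB := "exporters: [a]", "exporters: [b]"

	fmt.Println("old hash changes with config:",
		oldRolloutHash(configA, secrets) != oldRolloutHash(configB, secrets))
	fmt.Println("new hash changes with secrets:",
		newRolloutHash(secrets) != newRolloutHash("secrets-v2"))
}
```

With the new input, a config-only change leaves the hash untouched and the hot-reload path handles it, while a secrets change still produces a different hash.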

collector/providers/odigosfileprovider/provider.go

RonFed · 2024-12-07T13:22:45Z

collector/providers/odigosfileprovider/provider.go

+	// start a new watcher routine only if one isn't already running, since Retrieve could be called multiple times
+	if !fmp.running {
+		go func() {
+			fmp.running = true


I think there is possible non-safe concurrent access to this flag since this is set inside a goroutine, and being read from the outer Retrieve

updated this to lock around the variable. this feels a little like "communicating by sharing memory" but I'm not sure of a better way to do what we want here

collector/providers/odigosfileprovider/go.mod

collector/providers/odigosfileprovider/provider.go

blumamir · 2024-12-09T13:11:02Z

collector/providers/odigosfileprovider/provider.go

+	if !fmp.running {
+		go func() {
+			fmp.running = true
+			defer func() { fmp.done <- struct{}{} }()


If this code is invoked from a flow which isn't Shutdown, this call will block indefinitely, right?

good point, I switched this back to using a waitgroup which I think will fix that
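The blocking behavior raised here can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the provider code: `finishedWithin` and `runDemo` are hypothetical helpers. A deferred send on an unbuffered channel keeps the goroutine alive until someone receives, while a deferred WaitGroup.Done returns immediately and still lets a shutdown path wait for the goroutine.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// finishedWithin runs body in a goroutine and reports whether it returned
// within d.
func finishedWithin(d time.Duration, body func()) bool {
	exited := make(chan struct{})
	go func() {
		body()
		close(exited)
	}()
	select {
	case <-exited:
		return true
	case <-time.After(d):
		return false
	}
}

// runDemo compares the two ways of signalling that the watcher loop ended.
func runDemo() (channelExit, waitGroupExit bool) {
	done := make(chan struct{}) // unbuffered, and nobody is receiving
	channelExit = finishedWithin(50*time.Millisecond, func() {
		// Blocks until a receiver shows up, so the goroutine never exits here.
		defer func() { done <- struct{}{} }()
	})

	var wg sync.WaitGroup
	wg.Add(1)
	waitGroupExit = finishedWithin(50*time.Millisecond, func() {
		defer wg.Done() // never blocks, regardless of who is waiting
	})
	wg.Wait() // what a Shutdown method could do to wait for the loop
	return channelExit, waitGroupExit
}

func main() {
	channelExit, waitGroupExit := runDemo()
	fmt.Println("channel variant exited:", channelExit)
	fmt.Println("waitgroup variant exited:", waitGroupExit)
}
```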

damemi requested a review from RonFed December 3, 2024 14:54

blumamir reviewed Dec 4, 2024

damemi force-pushed the dynamic-config branch 3 times, most recently from 84df7b0 to f8b2fdf Compare December 5, 2024 14:05

damemi added 6 commits December 5, 2024 09:28

Add odigosfileprovider for dynamic config reloading

82ffd3a

Switch to fsnotify.Remove

801aa9f

Don't restart data-collection or gateway for config updates

0801cec

Revert go toolchain change

8f70603

Chainsaw tests

4b90c80

Feedback

c1310ee

damemi force-pushed the dynamic-config branch from f8b2fdf to c1310ee Compare December 5, 2024 14:28

RonFed reviewed Dec 5, 2024

BenElferink and others added 2 commits December 5, 2024 19:46

Merge branch 'main' into dynamic-config

f5c1483

More feedback

f5212a3

RonFed approved these changes Dec 7, 2024

blumamir reviewed Dec 9, 2024

Feedback

fc3a40f

damemi force-pushed the dynamic-config branch from 2e20ea1 to fc3a40f Compare December 9, 2024 17:35

damemi added 2 commits December 9, 2024 16:48

Remove configdata argument from syncconfigmap funcs

6d53ab0

Merge branch 'main' into dynamic-config

7d0d243

damemi enabled auto-merge (squash) December 11, 2024 14:22

damemi merged commit 9ee4343 into odigos-io:main Dec 11, 2024
32 checks passed
