[receiver/k8sclusterreceiver] Switch to standby mode when leader lease is lost by paulojmdias · Pull Request #43084 · open-telemetry/opentelemetry-collector-contrib

paulojmdias · 2025-09-30T21:44:33Z

Description

Similar to what we did in #42330, this PR keeps behaviour consistent across components: when k8sleaderelector is used and the leader lease is lost, the receiver goes into standby instead of shutting down.

Link to tracking issue

Fixes #42707

Testing

Tested locally and added new tests to cover the new behaviour.

dmitryax · 2025-10-01T02:04:43Z

+	kr.wg.Add(1)
+	kr.mu.Unlock()


I'm not sure what this PR does other than adding extra synchronization safeguards (the WaitGroup and the mutex)... Doesn't the existing implementation already use the same "standby" approach with kr.cancel?

The conversation/idea came from this PR review.

If my understanding is correct, stopping the components completely could lead to a situation where the leader candidates in all Collector instances are stopped and no one is left to acquire the leadership lock.

This PR puts the component in standby mode instead of entirely stopping it. That’s the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver. The additional synchronization here enables us to pause and resume safely, rather than stopping it completely. PTAL, and let me know if this matches your understanding.
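For readers following the thread, the pause/resume idea described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the type `standbyReceiver` and its methods `onLeading`, `onStopLeading`, and `running` are hypothetical names, and the collection loop is replaced by a goroutine that only waits for cancellation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// standbyReceiver is a hypothetical stand-in for a receiver driven by
// leader election callbacks. It is not the k8sclusterreceiver type.
type standbyReceiver struct {
	mu       sync.Mutex
	wg       sync.WaitGroup
	cancel   context.CancelFunc // non-nil only while a leadership session runs
	sessions int                // number of sessions started so far
}

// onLeading starts a leadership session unless one is already running.
func (r *standbyReceiver) onLeading() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cancel != nil {
		return
	}
	ctx, cancel := context.WithCancel(context.Background())
	r.cancel = cancel
	r.sessions++
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		<-ctx.Done() // placeholder for the real collection loop
	}()
}

// onStopLeading ends the current session and waits for its goroutine,
// leaving the receiver ready to be started again.
func (r *standbyReceiver) onStopLeading() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cancel == nil {
		return
	}
	r.cancel()
	r.cancel = nil
	r.wg.Wait() // safe under the lock: the session goroutine never takes mu
}

// running reports whether a leadership session is currently active.
func (r *standbyReceiver) running() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.cancel != nil
}

func main() {
	r := &standbyReceiver{}
	r.onLeading()     // leadership gained
	r.onStopLeading() // leadership lost: session ends, receiver stays usable
	r.onLeading()     // leadership regained
	fmt.Println("running:", r.running(), "sessions:", r.sessions)
	r.onStopLeading()
}
```

In this sketch the mutex serializes the callbacks and the WaitGroup ensures the previous session's goroutine has exited before a new one can start.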

I'm confused by this statement: "That's the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver."

Can you clarify what you mean by "fully tear down"?

Looking at the code, both approaches cancel the context driving the same goroutine. Both call initialize() which explicitly sets informerFactories = nil, destroying all cached state. Both require full cache resyncs on leadership changes.

What specific part of the receiver is "standing by" in the new approach that wasn't before? The synchronization primitives (WaitGroup/mutex) prevent race conditions, but they don't change the fundamental teardown/rebuild behavior.

Hmm, I see where the confusion comes from. stopReceiver and Shutdown perform the same action internally (cancelling the receiver via kr.cancel()), but they are invoked at different points in the lifecycle. Meanwhile, the leader elector is not stopped, so the start callback can still restart the receiver when leadership is acquired. That was not clear to me in #42330 (comment), which is why I thought the receiver instance was no longer a leader candidate after it was stopped.

I still find this behaviour a bit confusing and cryptic, but I'm not sure if or how it could be improved. Maybe comments in the code would help future readers. Whatever we decide, it should be consistent across components that use the leader elector extension.

The benefit of the new approach is cleaner separation between component lifecycle (Start/Shutdown) and session lifecycle (leadership gain/loss).

With the old approach, calling kr.cancel() mixes these concerns by cancelling the component-level context during what should be a session-level operation. That said, the practical difference is subtle. The main value is consistency with PR #42330 and making the lifecycle boundaries more explicit through the two-context pattern, which should make the code easier to maintain.

But if you feel the added complexity isn't worth it, I'm open to reverting that, and I'm also available to revert the logic in the other related components.
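A rough sketch of the two-context pattern mentioned above, assuming a component-level context owned by Start/Shutdown and a session-level context derived from it for each leadership term. The `lifecycle` type and its field and method names are hypothetical and chosen only for illustration; locking is left out to keep the example short.

```go
package main

import (
	"context"
	"fmt"
)

// lifecycle is a hypothetical holder for the two contexts. It is not
// taken from the receiver's source.
type lifecycle struct {
	componentCtx    context.Context
	componentCancel context.CancelFunc
	sessionCtx      context.Context
	sessionCancel   context.CancelFunc
}

// Start creates the component-level context.
func (l *lifecycle) Start() {
	l.componentCtx, l.componentCancel = context.WithCancel(context.Background())
}

// onLeading derives a fresh session context from the component context.
func (l *lifecycle) onLeading() {
	l.sessionCtx, l.sessionCancel = context.WithCancel(l.componentCtx)
}

// onStopLeading cancels only the session context.
func (l *lifecycle) onStopLeading() {
	if l.sessionCancel != nil {
		l.sessionCancel()
		l.sessionCancel = nil
	}
}

// Shutdown cancels the component context, which also cancels any
// session context derived from it.
func (l *lifecycle) Shutdown() {
	if l.componentCancel != nil {
		l.componentCancel()
	}
}

func main() {
	l := &lifecycle{}
	l.Start()
	l.onLeading()
	l.onStopLeading()
	fmt.Println("component alive after leadership loss:", l.componentCtx.Err() == nil)
	l.onLeading()
	l.Shutdown()
	fmt.Println("session cancelled by shutdown:", l.sessionCtx.Err() != nil)
}
```

With a single context, as in the kr.cancel approach discussed in this thread, leadership loss and Shutdown both cancel the same context; the sketch only shows how the boundaries become explicit when they are split.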

The added complexity isn't justified from my perspective. It adds too many unnecessary concurrency controls.

Thank you @dmitryax

@ChrsMark I think we should revert the changes we introduced in the other components related to this approach. Can I start on it?

PRs created: #44512, #44513

Thank you all 🙏

ChrsMark

LGTM, with 2 nits.

atoulme · 2025-10-14T22:46:53Z

This needs another look from @dmitryax before it gets in.

github-actions · 2025-11-01T05:21:04Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

paulojmdias · 2025-11-01T19:02:45Z

/label -stale

paulojmdias · 2025-11-25T10:07:02Z

Closed, since the change is not beneficial, as discussed in the following thread.

…oss (#44513)

#### Description
This change ensures that all components using the `k8sleaderelector` extension share the same behaviour: shutdown instead of standby on leader election loss. As discussed in the following [thread](#43084 (comment)), standby is not beneficial enough and should be reverted so that all components using the `k8sleaderelector` extension behave the same way.

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
Co-authored-by: Christos Markou <chrismarkou92@gmail.com>

… loss (#44512)

#### Description
This PR reverts the change introduced in #43054. As discussed in the following [thread](#43084 (comment)), standby is not beneficial enough and should be reverted so that all components using the `k8sleaderelector` extension behave the same way.

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
Co-authored-by: Christos Markou <chrismarkou92@gmail.com>

…ordering initialization (#44136)

#### Description
Add RWMutex protection for the `metadataConsumers` and `entityLogConsumer` fields in resourceWatcher to prevent potential data races during concurrent access.

#### Link to tracking issue
Relates to #43084 (comment)

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
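As an illustration of the RWMutex protection described in #44136, here is a minimal sketch, assuming a watcher whose consumer list may be set from one goroutine while updates are published from another. The `watcher` and `metadataConsumer` types and the `setConsumers` and `publish` methods are hypothetical and do not mirror the actual resourceWatcher code.

```go
package main

import (
	"fmt"
	"sync"
)

// metadataConsumer is a hypothetical callback type used for illustration.
type metadataConsumer func(update string)

// watcher guards its consumer list with an RWMutex so that many readers
// can publish concurrently while writers get exclusive access.
type watcher struct {
	mu        sync.RWMutex
	consumers []metadataConsumer
}

// setConsumers replaces the consumer list under the write lock.
func (w *watcher) setConsumers(cs []metadataConsumer) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.consumers = cs
}

// publish delivers an update to every consumer under the read lock and
// returns how many consumers were notified.
func (w *watcher) publish(update string) int {
	w.mu.RLock()
	defer w.mu.RUnlock()
	for _, c := range w.consumers {
		c(update)
	}
	return len(w.consumers)
}

func main() {
	w := &watcher{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		w.setConsumers([]metadataConsumer{func(string) {}})
	}()
	go func() {
		defer wg.Done()
		w.publish("pod-added")
	}()
	wg.Wait()
	fmt.Println("consumers registered:", w.publish("done"))
}
```

Without the lock, the concurrent write in `setConsumers` and read in `publish` would be a data race that `go run -race` can report.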

[receiver/k8sclusterreceiver] Switch to standby mode when leader lease is lost

bd5a99f

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>

paulojmdias marked this pull request as ready for review September 30, 2025 22:04

paulojmdias requested review from a team, ChrsMark, TylerHelmuth and dmitryax as code owners September 30, 2025 22:04

github-actions Bot assigned crobert-1 Sep 30, 2025

github-actions Bot added the receiver/k8scluster label Sep 30, 2025

github-actions Bot requested a review from povilasv September 30, 2025 22:04

dmitryax reviewed Oct 1, 2025

feat: Reset per-leadership-session state informers

8365f3e

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>

odubajDT approved these changes Oct 2, 2025

ChrsMark approved these changes Oct 6, 2025

Comment thread receiver/k8sclusterreceiver/receiver.go Outdated

Comment thread receiver/k8sclusterreceiver/receiver.go

odubajDT requested a review from dmitryax October 6, 2025 12:43

paulojmdias and others added 7 commits October 8, 2025 10:40

feat: move wg to be session scoped

2f2c72e

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>

Merge branch 'main' into feat/42707

25b78d9

chore: fix changelog

a39cf6b

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>

Merge branch 'main' into feat/42707

8e7582f

Merge branch 'main' into feat/42707

34ab664

Merge branch 'main' into feat/42707

359536d

Merge branch 'main' into feat/42707

e1d376e

atoulme added the waiting-for-code-owners label Oct 14, 2025

paulojmdias added 2 commits October 15, 2025 22:15

Merge branch 'main' into feat/42707

2567dc7

Merge branch 'main' into feat/42707

57ec170

github-actions Bot added the Stale label Nov 1, 2025

Merge branch 'main' into feat/42707

d951582

github-actions Bot removed the Stale label Nov 2, 2025

paulojmdias added 2 commits November 4, 2025 21:20

Merge branch 'main' into feat/42707

8ff8532

Merge branch 'main' into feat/42707

b17ebf8

dmitryax reviewed Nov 6, 2025

paulojmdias and others added 2 commits November 6, 2025 23:22

Merge branch 'main' into feat/42707

cbb4c45

chore: add comprehensive comments explaining standby mode pattern

9170c5e

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>

paulojmdias mentioned this pull request Nov 10, 2025

[chore][receiver/k8scluster] Prevent concurrent metadata access by reordering initialization #44136

Merged

Merge branch 'main' into feat/42707

7052aaf

atoulme requested a review from dmitryax November 25, 2025 04:34

paulojmdias mentioned this pull request Nov 25, 2025

[receiver/k8sclusterreceiver] Switch to standby mode when leader lease is lost #42707

Closed

paulojmdias closed this Nov 25, 2025

This was referenced Nov 25, 2025

[receiver/k8sobjects] revert standby mode behavior on leader election loss #44512

Merged

[receiver/k8sevents] shutdown instead of standby on leader election loss #44513

Merged

Conversation

paulojmdias commented Sep 30, 2025

dmitryax Oct 1, 2025 (edited)

paulojmdias Oct 1, 2025

dmitryax Nov 6, 2025 (edited)

ChrsMark Nov 6, 2025

paulojmdias Nov 10, 2025

dmitryax Nov 24, 2025 (edited)

paulojmdias Nov 25, 2025

ChrsMark Nov 25, 2025

paulojmdias Nov 25, 2025

ChrsMark left a comment

atoulme commented Oct 14, 2025

github-actions Bot commented Nov 1, 2025

paulojmdias commented Nov 1, 2025

paulojmdias commented Nov 25, 2025

6 participants