Skip to content

[receiver/k8sclusterreceiver] Switch to standby mode when leader lease is lost#43084

Closed
paulojmdias wants to merge 17 commits into
open-telemetry:mainfrom
paulojmdias:feat/42707
Closed

[receiver/k8sclusterreceiver] Switch to standby mode when leader lease is lost#43084
paulojmdias wants to merge 17 commits into
open-telemetry:mainfrom
paulojmdias:feat/42707

Conversation

@paulojmdias
Copy link
Copy Markdown
Member

Description

Similar to what we did in #42330, this PR ensures similarity and puts the receiver in stand-by instead of shutdown when k8sleaderelector is used.

Link to tracking issue

Fixes #42707

Testing

Tested locally and added new tests to cover the new behaviour.

…e is lost

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
@paulojmdias paulojmdias marked this pull request as ready for review September 30, 2025 22:04
@github-actions github-actions Bot requested a review from povilasv September 30, 2025 22:04
Comment thread receiver/k8sclusterreceiver/receiver.go
Comment thread receiver/k8sclusterreceiver/receiver.go Outdated
Comment on lines +73 to +74
kr.wg.Add(1)
kr.mu.Unlock()
Copy link
Copy Markdown
Member

@dmitryax dmitryax Oct 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this PR does other than adding extra synchronization safeguards (working group and the mutex)... Doesn't the existing implementation use the same "standby" approach with kr.cancel?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conversation/idea came from this PR review.

If my understanding is correct, by stopping the components completely, there might be a situation that the leader competitors from all Collector instances are all stopped and there is no one left to get the leadership lock.

This PR puts the component in stand-by mode instead of entirely stopping it. That’s the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver. The additional synchronization here enables us to pause and resume safely, rather than stopping it completely. PTAL, and let me know if this matches your understanding.

Copy link
Copy Markdown
Member

@dmitryax dmitryax Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by this statement: "That's the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver."

Can you clarify what you mean by "fully tear down"?

Looking at the code, both approaches cancel the context driving the same goroutine. Both call initialize() which explicitly sets informerFactories = nil, destroying all cached state. Both require full cache resyncs on leadership changes.

What specific part of the receiver is "standing by" in the new approach that wasn't before? The synchronization primitives (WaitGroup/mutex) prevent race conditions, but they don't change the fundamental teardown/rebuild behavior.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I see what's the confusion here. stopReceiver and Shutdown perform the same action internally (cancel the receiver via kr.cancel()), but they are invoked in different lifecycles. At the same time the leader elector is not stopped and the start callback can still re-start the receiver on leadership acquisition. That was not clear to me at #42330 (comment) that's why I thought that the receiver instance is not a leader candidate anymore after its stopped.

I still find this a bit confusing and a bit cryptic behaviour but I'm not sure if and how this could be improved. Maybe just commenting within the code will help future readers. Whatever we decide it should be consistent across components that use the leader elector extension.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benefit of the new approach is cleaner separation between component lifecycle (Start/Shutdown) and session lifecycle (leadership gain/loss).

With the old approach, calling kr.cancel() mixes these concerns by cancelling the component-level context during what should be a session-level operation. That said, the practical difference is subtle. The main value is consistency with PR #42330 and making the lifecycle boundaries more explicit through the two-context pattern, which should make the code easier to maintain.

But, if you feel the added complexity isn't worth it, I'm open to reverting that, and I'm also available to revert the logic on the other related components.

Copy link
Copy Markdown
Member

@dmitryax dmitryax Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added complexity isn't justified from my perspective. It adds too many unnecessary concurrency controls

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dmitryax

@ChrsMark I think we should revert the changes we introduced on other components related with this approach. Can I start with it ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PRs created #44512 #44513

Thank you all 🙏

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
Copy link
Copy Markdown
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with 2 nits.

Comment thread receiver/k8sclusterreceiver/receiver.go Outdated
Comment thread receiver/k8sclusterreceiver/receiver.go
@odubajDT odubajDT requested a review from dmitryax October 6, 2025 12:43
@atoulme
Copy link
Copy Markdown
Contributor

atoulme commented Oct 14, 2025

this needs another look from @dmitryax before it gets in.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Nov 1, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions Bot added the Stale label Nov 1, 2025
@paulojmdias
Copy link
Copy Markdown
Member Author

/label -stale

@github-actions github-actions Bot removed the Stale label Nov 2, 2025
Comment thread receiver/k8sclusterreceiver/watcher.go
Comment thread receiver/k8sclusterreceiver/receiver.go Outdated
Comment on lines +73 to +74
kr.wg.Add(1)
kr.mu.Unlock()
Copy link
Copy Markdown
Member

@dmitryax dmitryax Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by this statement: "That's the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver."

Can you clarify what you mean by "fully tear down"?

Looking at the code, both approaches cancel the context driving the same goroutine. Both call initialize() which explicitly sets informerFactories = nil, destroying all cached state. Both require full cache resyncs on leadership changes.

What specific part of the receiver is "standing by" in the new approach that wasn't before? The synchronization primitives (WaitGroup/mutex) prevent race conditions, but they don't change the fundamental teardown/rebuild behavior.

@paulojmdias
Copy link
Copy Markdown
Member Author

Closed as it is not beneficial as discussed in the following thread

dmitryax pushed a commit that referenced this pull request Dec 11, 2025
…oss (#44513)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description

This change ensures that all components using the `k8sleaderelector`
extension use the same behaviour of shutdown instead of standby on
leader election loss.

As discussed in the following
[thread](#43084 (comment)),
this is not so beneficial and should be reverted to keep the same
behaviour in all the components which use the `k8sleaderelector`
extension.

---------

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
Co-authored-by: Christos Markou <chrismarkou92@gmail.com>
ChrsMark added a commit that referenced this pull request Dec 11, 2025
… loss (#44512)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description

This PR reverts the change introduced in #43054.

As discussed in the following
[thread](#43084 (comment)),
this is not so beneficial and should be reverted to keep the same
behaviour in all the components which use the `k8sleaderelector`
extension.

---------

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
Co-authored-by: Christos Markou <chrismarkou92@gmail.com>
ChrsMark pushed a commit that referenced this pull request Dec 17, 2025
…ordering initialization (#44136)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description

Add RWMutex protection for `metadataConsumers` and `entityLogConsumer`
fields in resourceWatcher to prevent potential data races during
concurrent access.

<!-- Issue number (e.g. #1234) or full URL to issue, if applicable. -->
#### Link to tracking issue
Relates to
#43084 (comment)

---------

Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[receiver/k8sclusterreceiver] Switch to standby mode when leader lease is lost

6 participants