[receiver/k8sclusterreceiver] Switch to standby mode when leader lease is lost#43084
[receiver/k8sclusterreceiver] Switch to standby mode when leader lease is lost#43084paulojmdias wants to merge 17 commits into
Conversation
…e is lost Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
| kr.wg.Add(1) | ||
| kr.mu.Unlock() |
There was a problem hiding this comment.
I'm not sure what this PR does other than adding extra synchronization safeguards (working group and the mutex)... Doesn't the existing implementation use the same "standby" approach with kr.cancel?
There was a problem hiding this comment.
The conversation/idea came from this PR review.
If my understanding is correct, by stopping the components completely, there might be a situation that the leader competitors from all Collector instances are all stopped and there is no one left to get the leadership lock.
This PR puts the component in stand-by mode instead of entirely stopping it. That’s the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver. The additional synchronization here enables us to pause and resume safely, rather than stopping it completely. PTAL, and let me know if this matches your understanding.
There was a problem hiding this comment.
I'm confused by this statement: "That's the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver."
Can you clarify what you mean by "fully tear down"?
Looking at the code, both approaches cancel the context driving the same goroutine. Both call initialize() which explicitly sets informerFactories = nil, destroying all cached state. Both require full cache resyncs on leadership changes.
What specific part of the receiver is "standing by" in the new approach that wasn't before? The synchronization primitives (WaitGroup/mutex) prevent race conditions, but they don't change the fundamental teardown/rebuild behavior.
There was a problem hiding this comment.
Hmm, I see what's the confusion here. stopReceiver and Shutdown perform the same action internally (cancel the receiver via kr.cancel()), but they are invoked in different lifecycles. At the same time the leader elector is not stopped and the start callback can still re-start the receiver on leadership acquisition. That was not clear to me at #42330 (comment) that's why I thought that the receiver instance is not a leader candidate anymore after its stopped.
I still find this a bit confusing and a bit cryptic behaviour but I'm not sure if and how this could be improved. Maybe just commenting within the code will help future readers. Whatever we decide it should be consistent across components that use the leader elector extension.
There was a problem hiding this comment.
The benefit of the new approach is cleaner separation between component lifecycle (Start/Shutdown) and session lifecycle (leadership gain/loss).
With the old approach, calling kr.cancel() mixes these concerns by cancelling the component-level context during what should be a session-level operation. That said, the practical difference is subtle. The main value is consistency with PR #42330 and making the lifecycle boundaries more explicit through the two-context pattern, which should make the code easier to maintain.
But, if you feel the added complexity isn't worth it, I'm open to reverting that, and I'm also available to revert the logic on the other related components.
There was a problem hiding this comment.
The added complexity isn't justified from my perspective. It adds too many unnecessary concurrency controls
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
|
this needs another look from @dmitryax before it gets in. |
|
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
|
/label -stale |
| kr.wg.Add(1) | ||
| kr.mu.Unlock() |
There was a problem hiding this comment.
I'm confused by this statement: "That's the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver."
Can you clarify what you mean by "fully tear down"?
Looking at the code, both approaches cancel the context driving the same goroutine. Both call initialize() which explicitly sets informerFactories = nil, destroying all cached state. Both require full cache resyncs on leadership changes.
What specific part of the receiver is "standing by" in the new approach that wasn't before? The synchronization primitives (WaitGroup/mutex) prevent race conditions, but they don't change the fundamental teardown/rebuild behavior.
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
|
Closed as it is not beneficial as discussed in the following thread |
…oss (#44513) <!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description This change ensures that all components using the `k8sleaderelector` extension use the same behaviour of shutdown instead of standby on leader election loss. As discussed in the following [thread](#43084 (comment)), this is not so beneficial and should be reverted to keep the same behaviour in all the components which use the `k8sleaderelector` extension. --------- Signed-off-by: Paulo Dias <paulodias.gm@gmail.com> Co-authored-by: Christos Markou <chrismarkou92@gmail.com>
… loss (#44512) <!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description This PR reverts the change introduced in #43054. As discussed in the following [thread](#43084 (comment)), this is not so beneficial and should be reverted to keep the same behaviour in all the components which use the `k8sleaderelector` extension. --------- Signed-off-by: Paulo Dias <paulodias.gm@gmail.com> Co-authored-by: Christos Markou <chrismarkou92@gmail.com>
…ordering initialization (#44136) <!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description Add RWMutex protection for `metadataConsumers` and `entityLogConsumer` fields in resourceWatcher to prevent potential data races during concurrent access. <!-- Issue number (e.g. #1234) or full URL to issue, if applicable. --> #### Link to tracking issue Relates to #43084 (comment) --------- Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
Description
Similar to what we did in #42330, this PR ensures similarity and puts the receiver in stand-by instead of shutdown when k8sleaderelector is used.
Link to tracking issue
Fixes #42707
Testing
Tested locally and added new tests to cover the new behaviour.