pimd: Prevent t_join_timer thread from being canceled multiple times #17812

routingrocks · 2025-01-09T20:14:44Z

Issue:
We stop all PIM timers during instance shutdown. Simultaneously, if there are any changes to the next hop (ZEBRA_NEXTHOP_UPDATE), it triggers an RPF update to the upstream next hop based on the new update received from Zebra.
This leads to an attempt to stop an already stopped timer.

Fix:
Ensure that join_timer_stop is not called on an already canceled thread.

Signed-off-by: Rajesh Varatharaj <[email protected]>

mjstapp · 2025-01-09T20:38:36Z

pimd/pim_upstream.c

@@ -332,7 +332,14 @@ static void join_timer_stop(struct pim_upstream *up)
 {
 	struct pim_neighbor *nbr = NULL;

-	EVENT_OFF(up->t_join_timer);
+    if (up->t_join_timer) {


I'm not usually a fan of testing these pointers in application code, since there's no locking done. The actual event lib call does use a lock, so why is this code a problem? It's not unusual to see code that uses the "OFF" or "cancel" APIs like this.
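To make the point about cancel-style APIs concrete, here is a minimal sketch, assuming a cancel call that takes the address of the caller's handle, does its work under a lock, and clears the handle afterwards. All `toy_*` names are hypothetical stand-ins written for this illustration and are not FRR's event library; only the field name `t_join_timer` is borrowed from the PR.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for illustration only; these are NOT FRR's real types. */
struct toy_loop {
	pthread_mutex_t mtx; /* protects the loop's bookkeeping */
	int cancelled;       /* how many tasks were actually cancelled */
};

struct toy_event {
	struct toy_loop *loop;
};

/* Schedule a task and store its handle in *ref. */
static void toy_event_add(struct toy_loop *loop, struct toy_event **ref)
{
	struct toy_event *ev = malloc(sizeof(*ev));

	assert(ev != NULL);
	ev->loop = loop;
	*ref = ev;
}

/* Cancel the task behind *ref under the loop's lock, then clear *ref. */
static void toy_event_cancel(struct toy_event **ref)
{
	struct toy_loop *loop;

	if (ref == NULL || *ref == NULL)
		return; /* nothing scheduled, or already cancelled */

	loop = (*ref)->loop;
	pthread_mutex_lock(&loop->mtx);
	loop->cancelled++;
	free(*ref);
	*ref = NULL; /* handle is reset, so a repeated cancel is a no-op */
	pthread_mutex_unlock(&loop->mtx);
}

/* Hypothetical owner object, playing the role of "up". */
struct toy_upstream {
	struct toy_event *t_join_timer;
};

/* Cancel the same timer twice; returns how many real cancels happened. */
static int toy_double_cancel(void)
{
	struct toy_loop loop = { .mtx = PTHREAD_MUTEX_INITIALIZER,
				 .cancelled = 0 };
	struct toy_upstream up = { .t_join_timer = NULL };

	toy_event_add(&loop, &up.t_join_timer);

	toy_event_cancel(&up.t_join_timer); /* e.g. the shutdown path */
	toy_event_cancel(&up.t_join_timer); /* e.g. a later RPF update */

	assert(up.t_join_timer == NULL);
	return loop.cancelled;
}
```

In this model the second call sees a cleared handle and returns early, so only one real cancel takes place as long as the owning object itself is still valid memory.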

I understand that the event library cancel logic includes locking, and it is common to see APIs like "OFF" or "cancel" used directly.

The issue here is a race condition during PIM instance shutdown combined with RPF updates. Without this check, we can see crashes when attempting to cancel a previously canceled thread.

Another approach I am considering is to call thread_cancel only when the thread hasn't already been cancelled:

#define THREAD_OFF(thread) \
	do { \
		if ((thread) && (thread)->master != NULL) { \
			thread_cancel(&(thread)); \
		} \
	} while (0)

Isn't that a universal solution? I'd like your opinion.
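The behaviour of the guarded macro can be tried in isolation with a small sketch. The `struct thread`, `struct thread_master`, and `thread_cancel` below are local stand-ins defined only so the macro compiles on its own. They are assumptions for illustration, not FRR's definitions, and the meaning given to a NULL `master` is likewise assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins so the macro compiles alone; not FRR's definitions. */
struct thread_master {
	int cancel_calls;
};

struct thread {
	struct thread_master *master; /* NULL assumed to mean "not scheduled" */
};

/* Stand-in cancel: count the call, detach the task, clear the handle. */
static void thread_cancel(struct thread **thread)
{
	(*thread)->master->cancel_calls++;
	(*thread)->master = NULL;
	*thread = NULL;
}

/* The guarded macro as proposed in the comment above. */
#define THREAD_OFF(thread)                                  \
	do {                                                \
		if ((thread) && (thread)->master != NULL) { \
			thread_cancel(&(thread));           \
		}                                           \
	} while (0)

/* Run the macro on a live handle, then again; returns cancel_calls. */
static int guarded_off_twice(void)
{
	struct thread_master m = { .cancel_calls = 0 };
	struct thread t = { .master = &m };
	struct thread *handle = &t;

	THREAD_OFF(handle); /* cancels and clears handle */
	THREAD_OFF(handle); /* handle is NULL now: skipped */
	return m.cancel_calls;
}

/* A handle whose task has no master is skipped and left untouched. */
static int guarded_off_detached(void)
{
	struct thread t = { .master = NULL };
	struct thread *handle = &t;

	THREAD_OFF(handle);
	return handle == &t;
}
```

Note that the guard reads `(thread)->master`, so it can only help while the memory behind the handle is still valid. A stale pointer to a freed task would still be dereferenced by the check itself.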

I ... don't think I understand the race condition. Can you say what it is that goes wrong? I mean, if the "up" object is still valid, then cancelling a task should be safe. If the object isn't valid, all bets are off, and a minor change in this path isn't going to be a real fix.

@mjstapp, let me go back and dig up some more info. In the meantime, I'm closing this.

frrbot bot added the pim label Jan 9, 2025

github-actions bot added master size/XS labels Jan 9, 2025

mjstapp reviewed Jan 9, 2025

routingrocks marked this pull request as draft January 10, 2025 19:24

routingrocks closed this Jan 10, 2025

routingrocks commented Jan 9, 2025

mjstapp Jan 9, 2025

routingrocks Jan 9, 2025

routingrocks Jan 9, 2025

mjstapp Jan 10, 2025

routingrocks Jan 10, 2025
