Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pimd: Prevent t_join_timer thread from being canceled multiple times #17812

Closed
wants to merge 1 commit into from

Conversation

routingrocks
Copy link
Contributor

Issue:
We stop all PIM timers during instance shutdown. Simultaneously, if there are any changes to the next hop (ZEBRA_NEXTHOP_UPDATE), it triggers an RPF update to the upstream next hop based on the new update received from Zebra.
This leads to stop an already stopped timer.

Fix:
Ensure that join_timer_stop is not called on an already canceled thread.

signed-off-by: Rajesh Varatharaj[email protected]

Issue:
We stop all PIM timers during instance shutdown. Simultaneously,
if there are any changes to the next hop (ZEBRA_NEXTHOP_UPDATE),
it triggers an RPF update to the upstream next hop based on the
new update received from Zebra.
This leads to stop an already stopped timer.

Fix:
Ensure that join_timer_stop is not called on an already canceled thread.

signed-off-by: Rajesh Varatharaj<[email protected]>
@@ -332,7 +332,14 @@ static void join_timer_stop(struct pim_upstream *up)
{
struct pim_neighbor *nbr = NULL;

EVENT_OFF(up->t_join_timer);
if (up->t_join_timer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not usually a fan of testing these pointers in application code, since there's no locking done. the actual event lib call does use a lock - why is this code a problem? it's not unusual to see code that uses the "OFF" or "cancel" apis like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that the event library cancel logic includes locking, and it is common to see APIs like "OFF" or "cancel" used directly.

The issue here is a race condition during PIM instance shutdown combined with RPF updates. Without this check, we can see crashes when attempting to cancel a previously canceled thread.

Another approach I am considering is to call thread_cancel only when it hasnt already cancelled,

#define THREAD_OFF(thread)                                             \
    do {                                                               \
        if ((thread) && (thread)->master != NULL) {                    \
            thread_cancel(&(thread));                                  \
        }                                                              \
    } while (0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that a universal solution,? I need your opinion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ... don't think I understand the race condition - can you say what it is that goes wrong? I mean, if the "up" object is still valid, then cancelling a task should be safe. if the object isn't valid, all bets are off, and a minor change in this path isn't going to be a real fix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjstapp, let me go back, and dig some more info. Meantime closing this

@routingrocks routingrocks marked this pull request as draft January 10, 2025 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants