Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the signal handling to avoid crashes during stop #26679

Open
wants to merge 2 commits into
base: 4.10.2
Choose a base branch
from

Conversation

pereyra-m
Copy link
Member

@pereyra-m pereyra-m commented Nov 1, 2024

Related issue
Closes #26536

Description

To avoid the crash of the wazuh-modulesd during the shutdown, some improvements were added:

  • The signals are ignored at startup and until the modules have started. This avoids the termination of the module during the sleep() calls.
  • There is a wait time of 1 second after the modules have started to allow them settle.
  • The signal handling setup was moved after the added wait time to avoid calling a stop before the start has finished.
  • The signal handling now has a mutex, it will only process one signal and will ignore the rest

Update

The original proposal described above might have many changes. A simplified one is under analysis, where the sleep occurs at the stop method and no sig-ignore is configured at startup.

Tests

The reproduction steps now don't cause crashes.

2024-11-01_11-37

  • Compilation without warnings in every supported platform
    • Linux

@pereyra-m pereyra-m self-assigned this Nov 1, 2024
@pereyra-m pereyra-m linked an issue Nov 1, 2024 that may be closed by this pull request
Copy link
Member

@sebasfalcone sebasfalcone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work @pereyra-m


static int flag_foreground = 0; // Running in foreground.

static bool wm_sig_received = false; // Flag to indicate if a signal is being handled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be defined inside the void wm_handler(int signum) function to reduce the scope

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right.

// Main function

int main(int argc, char **argv)
{
// Ignore signals until the modules have started
wm_signals_ignore();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Performed some testing over this

If the module initialization takes too long or it hungs we are still able to kill the process, I don't know if this desirable or not but its working as intended

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to comments here. Ignoring the signals also avoids the Termination of the process when it receives a signal during the starting phase, but it might change the current behavior too much.

The real fix is in the mutex, I can revert this.

Copy link
Member

@vikman90 vikman90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed solution is on the right track! However, after reviewing the trace, I would suggest considering some alternative approaches.

One concern from the trace is that the wm_handler exit routine is executed by the Vuln Detector thread, while ideally, it should be handled by the main thread.

Additionally, the use of sleep(1) raises some concerns. This approach could lead to inconsistent behavior across different systems—on some, it might be too long, while on others, it could be insufficient. It appears that a sleep function is being used to mitigate a race condition, which is not an ideal solution.

Given that the modules seem to have a flag for handling graceful shutdowns, I would recommend exploring the replacement of wm_signals_ignore with a per-thread signal mask, ensuring proper signal handling across different threads.

Cheers!

@pereyra-m
Copy link
Member Author

Hi @vikman90

Thank you for the review.

One concern from the trace is that the wm_handler exit routine is executed by the Vuln Detector thread, while ideally, it should be handled by the main thread.

I don't understand why this happens. Can you share the backtrace? I think that setting the signal handling in the main thread shouldn't have this behavior.

Additionally, the use of sleep(1) raises some concerns. This approach could lead to inconsistent behavior across different systems—on some, it might be too long, while on others, it could be insufficient. It appears that a sleep function is being used to mitigate a race condition, which is not an ideal solution.

I don't like it either, but I thought that if we had some sleeps at the start of other modules, it wouldn't be that bad. I agree that a real coordination of all the modules is much better, because we could know the exact moment when all of them are ready. But this issue isn't reproduced naturally in the product, so I preferred to make fewer changes and not to have a major rework.
I proposed it here to allow all the modules to settle, but it isn't the most important part of the fix and I can remove it.

I would recommend exploring the replacement of wm_signals_ignore with a per-thread signal mask, ensuring proper signal handling across different threads.

I proposed the wm_signals_ignore() method to avoid a premature shutdown of the module. I meant to ignore all signals until the daemon is ready to process them. I can remove it also, because in my tests, the most important part of the fix is the mutex at wm_handler()

@pereyra-m
Copy link
Member Author

@sebasfalcone @vikman90

I've uploaded another proposal for your consideration.

Given that making each module start routine notify the main thread that is ready is a big change, I moved the sleep to the stop section. I try to cause the less impact possible here, but without this, the SIGTERM signal happens too fast and the module crashes. Now I don't affect the time it takes the module to start.

Also, I've removed the signal ignore method at startup to have fewer changes in the PR. Consider that sometimes, the process is terminated because the signal is received during a pre-existent sleep and it isn't ignored anymore; but this isn't a crash.

Copy link
Member

@sebasfalcone sebasfalcone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

The segfault prevented with this approach, in a normal situation it shouldn't happen anymore

Comment on lines +217 to +218
// We wait in case the modules haven't completed the initialization
sleep(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We know this is not the best approach, but is the best we can do without inquiring about a big refactor regarding the initialization and stopping process of the modules

The stop-and-start interfaces we currently have are not prepared for this kind of behaviour

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

wazuh-modulesd crashes at Vulnerability Scanner module
3 participants