Race condition in control requests #950
jphickey added a commit to jphickey/cFE that referenced this issue on Oct 19, 2020:
Because the process of handling a control request involves calling other subsystems, the ES lock needs to be released. However, this also means that the app record can change state for other reasons, such as the app self-exiting at the same time. To avoid this possibility, process in two phases: First assemble a list of tasks that have timed out and need to be cleaned up, while ES is locked. Next actually perform the cleanup, while ES is unlocked. In areas during cleanup that need to update the ES global, the lock is locally re-acquired and released.
jphickey added further commits to jphickey/cFE that referenced this issue on Oct 21 and Oct 26, 2020, with the same commit message.
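The commit message describes a two-phase approach. As a rough, stand-alone sketch of that pattern (not the actual cFE change; it uses a plain pthread mutex and hypothetical record/field names purely for illustration):

```c
/* Minimal sketch of the two-phase cleanup pattern described above.
 * Illustrative only; names and structure are NOT the actual cFE code. */
#include <pthread.h>
#include <stdbool.h>

#define MAX_APPS 4

typedef struct
{
    bool InUse;             /* whether the slot holds a live app */
    bool CleanupTimedOut;   /* hypothetical "needs cleanup" flag */
} AppRecord_t;

static pthread_mutex_t GlobalLock = PTHREAD_MUTEX_INITIALIZER;
static AppRecord_t     AppTable[MAX_APPS];

/* Placeholder for calls into other subsystems (SB, TBL, FS, OSAL), which
 * cannot be made while holding the ES lock. */
static void CallOtherSubsystems(int AppIdx) { (void)AppIdx; }

void CleanupApps_TwoPhase(void)
{
    int PendingList[MAX_APPS];
    int NumPending = 0;
    int i;

    /* Phase 1: assemble the list of apps needing cleanup, with the lock held */
    pthread_mutex_lock(&GlobalLock);
    for (i = 0; i < MAX_APPS; ++i)
    {
        if (AppTable[i].InUse && AppTable[i].CleanupTimedOut)
        {
            PendingList[NumPending++] = i;
        }
    }
    pthread_mutex_unlock(&GlobalLock);

    /* Phase 2: perform the cleanup with the lock released */
    for (i = 0; i < NumPending; ++i)
    {
        CallOtherSubsystems(PendingList[i]);

        /* Re-acquire the lock only for the final update of the global state */
        pthread_mutex_lock(&GlobalLock);
        AppTable[PendingList[i]].InUse = false;
        pthread_mutex_unlock(&GlobalLock);
    }
}
```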
Describe the bug
Due to the order of operations in cleanup, the ES global lock is given up and then re-acquired (see cFE/fsw/cfe-core/src/es/cfe_es_apps.c, lines 859 to 861 at commit dc3d62b).
The problem is that this provides a window of opportunity for the underlying state to change externally while the global data is unlocked.
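To make that window concrete, here is a minimal stand-alone sketch of the unlock/re-lock pattern, again using a plain pthread mutex and hypothetical names rather than the real ES data structures:

```c
/* Minimal illustration of the race (NOT the actual cFE code): the record is
 * checked under the lock, the lock is dropped to call other subsystems, and
 * the record is updated afterwards -- but in the unlocked window the owning
 * task may change or invalidate the record itself. */
#include <pthread.h>
#include <stdbool.h>

typedef struct
{
    bool InUse;     /* whether the slot holds a live app */
    int  State;     /* hypothetical app state */
} AppRecord_t;

static pthread_mutex_t GlobalLock = PTHREAD_MUTEX_INITIALIZER;
static AppRecord_t     AppTable[4];

/* Placeholder for SB/TBL/FS/OSAL resource cleanup, which cannot be done
 * while holding the ES lock. */
static void CallOtherSubsystems(int AppIdx) { (void)AppIdx; }

void CleanupApp_Racy(int AppIdx)
{
    bool DoCleanup;

    pthread_mutex_lock(&GlobalLock);
    DoCleanup = AppTable[AppIdx].InUse;
    pthread_mutex_unlock(&GlobalLock);      /* lock given up ... */

    if (DoCleanup)
    {
        CallOtherSubsystems(AppIdx);        /* ... window: the app may exit on
                                               its own and the record may be
                                               reused or invalidated here */

        pthread_mutex_lock(&GlobalLock);    /* ... lock re-acquired */
        AppTable[AppIdx].InUse = false;     /* may clobber a record that no
                                               longer belongs to this app */
        pthread_mutex_unlock(&GlobalLock);
    }
}
```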
To Reproduce
This can happen, for instance, if the task that is being cleaned up calls CFE_ES_ExitApp() while this state machine is also cleaning up the app. This actually does happen, because CFE_ES_RunLoop() will return false if there is an exit request pending. It is just masked by the fact that most apps are pending in a message receive queue, so they don't self-exit; they are deleted by ES instead. I was able to get CFE to segfault/crash by allowing SAMPLE_APP to exit itself at the very same time that this state machine was also cleaning it up.
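For context, a typical cFE 6.x-era app main loop looks roughly like the sketch below, following the general sample_app pattern. The pipe variable, the omitted initialization, and the exact constant names are assumptions on my part and may not match the real code:

```c
#include "cfe.h"

static CFE_SB_PipeId_t SAMPLE_CommandPipe;   /* assumed created during init */

void SAMPLE_AppMain(void)
{
    uint32          RunStatus = CFE_ES_RunStatus_APP_RUN;
    CFE_SB_MsgPtr_t MsgPtr;
    int32           Status;

    /* ... app and pipe initialization would happen here ... */

    while (CFE_ES_RunLoop(&RunStatus) == true)
    {
        /* Most apps block here forever, so they never reach the self-exit
         * path below; ES deletes them instead while they are blocked. */
        Status = CFE_SB_RcvMsg(&MsgPtr, SAMPLE_CommandPipe, CFE_SB_PEND_FOREVER);

        if (Status == CFE_SUCCESS)
        {
            /* ... dispatch the received message ... */
        }
    }

    /* Reached only when CFE_ES_RunLoop() returns false, e.g. because a
     * delete or exit request is pending: this is the self-exit path that
     * can collide with the ES cleanup state machine. */
    CFE_ES_ExitApp(RunStatus);
}
```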
Expected behavior
No crashes, proper cleanup.
System observed on:
Ubuntu 20.04
Additional context
Due to the ~5 second exit/cleanup delay it is unlikely to occur "in the wild", but it can easily be forced to happen. In my test I just used a slightly modified sample_app that doesn't pend forever on CFE_SB_RcvMsg, and also delays itself such that it self-exits at the exact same time that the ES background job is running, which reliably segfaults every time.
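A reproducer along the lines described might look roughly like the variant below (reusing the declarations from the earlier sketch). The 1000 ms receive timeout and 4500 ms delay are purely illustrative guesses, not the values actually used in the report:

```c
/* Hypothetical variant of the loop above, modified as described: a finite
 * receive timeout plus a delay before exiting, so the self-exit lands in
 * the same window as the ES background cleanup. */
void SAMPLE_AppMain_ForcedRace(void)
{
    uint32          RunStatus = CFE_ES_RunStatus_APP_RUN;
    CFE_SB_MsgPtr_t MsgPtr;

    /* ... initialization as before ... */

    while (CFE_ES_RunLoop(&RunStatus) == true)
    {
        /* Finite timeout instead of CFE_SB_PEND_FOREVER, so the loop
         * cycles back to CFE_ES_RunLoop() and notices the exit request. */
        (void)CFE_SB_RcvMsg(&MsgPtr, SAMPLE_CommandPipe, 1000);
    }

    /* Hold off just long enough that CFE_ES_ExitApp() executes while the
     * ES background cleanup is also tearing this app down. */
    OS_TaskDelay(4500);

    CFE_ES_ExitApp(RunStatus);
}
```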
Reporter Info
Joseph Hickey, Vantage Systems, Inc.