Race condition in control requests #950
jphickey added a commit to jphickey/cFE that referenced this issue on Oct 19, 2020:
Because the process of handling a control request involves calling other subsystems, the ES lock needs to be released. However, this also means that the app record can change state for other reasons, such as the app self-exiting at the same time. To avoid this possibility, process in two phases: First assemble a list of tasks that have timed out and need to be cleaned up, while ES is locked. Next actually perform the cleanup, while ES is unlocked. In areas during cleanup that need to update the ES global, the lock is locally re-acquired and released.
jphickey added further commits to jphickey/cFE that referenced this issue on Oct 21 and Oct 26, 2020, with the same commit message.
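The commit message describes a two-phase approach. As a rough, stand-alone sketch of that pattern (not the actual cFE change; it uses a plain pthread mutex and hypothetical record/field names purely for illustration):

```c
/* Minimal sketch of the two-phase cleanup pattern described above.
 * Illustrative only; names and structure are NOT the actual cFE code. */
#include <pthread.h>
#include <stdbool.h>

#define MAX_APPS 4

typedef struct
{
    bool InUse;             /* whether the slot holds a live app */
    bool CleanupTimedOut;   /* hypothetical "needs cleanup" flag */
} AppRecord_t;

static pthread_mutex_t GlobalLock = PTHREAD_MUTEX_INITIALIZER;
static AppRecord_t     AppTable[MAX_APPS];

/* Placeholder for calls into other subsystems (SB, TBL, FS, OSAL), which
 * cannot be made while holding the ES lock. */
static void CallOtherSubsystems(int AppIdx) { (void)AppIdx; }

void CleanupApps_TwoPhase(void)
{
    int PendingList[MAX_APPS];
    int NumPending = 0;
    int i;

    /* Phase 1: assemble the list of apps needing cleanup, with the lock held */
    pthread_mutex_lock(&GlobalLock);
    for (i = 0; i < MAX_APPS; ++i)
    {
        if (AppTable[i].InUse && AppTable[i].CleanupTimedOut)
        {
            PendingList[NumPending++] = i;
        }
    }
    pthread_mutex_unlock(&GlobalLock);

    /* Phase 2: perform the cleanup with the lock released */
    for (i = 0; i < NumPending; ++i)
    {
        CallOtherSubsystems(PendingList[i]);

        /* Re-acquire the lock only for the final update of the global state */
        pthread_mutex_lock(&GlobalLock);
        AppTable[PendingList[i]].InUse = false;
        pthread_mutex_unlock(&GlobalLock);
    }
}
```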
Describe the bug
Due to the order of operations in cleanup, the ES global lock is given up and then re-acquired (see cFE/fsw/cfe-core/src/es/cfe_es_apps.c, lines 859 to 861 at commit dc3d62b).
The problem is that this provides a window of opportunity for the underlying state to change externally while the global data is unlocked.
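To make that window concrete, here is a minimal stand-alone sketch of the unlock/re-lock pattern, again using a plain pthread mutex and hypothetical names rather than the real ES data structures:

```c
/* Minimal illustration of the race (NOT the actual cFE code): the record is
 * checked under the lock, the lock is dropped to call other subsystems, and
 * the record is updated afterwards -- but in the unlocked window the owning
 * task may change or invalidate the record itself. */
#include <pthread.h>
#include <stdbool.h>

typedef struct
{
    bool InUse;     /* whether the slot holds a live app */
    int  State;     /* hypothetical app state */
} AppRecord_t;

static pthread_mutex_t GlobalLock = PTHREAD_MUTEX_INITIALIZER;
static AppRecord_t     AppTable[4];

/* Placeholder for SB/TBL/FS/OSAL resource cleanup, which cannot be done
 * while holding the ES lock. */
static void CallOtherSubsystems(int AppIdx) { (void)AppIdx; }

void CleanupApp_Racy(int AppIdx)
{
    bool DoCleanup;

    pthread_mutex_lock(&GlobalLock);
    DoCleanup = AppTable[AppIdx].InUse;
    pthread_mutex_unlock(&GlobalLock);      /* lock given up ... */

    if (DoCleanup)
    {
        CallOtherSubsystems(AppIdx);        /* ... window: the app may exit on
                                               its own and the record may be
                                               reused or invalidated here */

        pthread_mutex_lock(&GlobalLock);    /* ... lock re-acquired */
        AppTable[AppIdx].InUse = false;     /* may clobber a record that no
                                               longer belongs to this app */
        pthread_mutex_unlock(&GlobalLock);
    }
}
```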
To Reproduce
This can happen, for instance, if the task that is being cleaned up calls CFE_ES_ExitApp() while this state machine is also cleaning up the app. This actually does happen, because CFE_ES_RunLoop() will return false if there is an exit request pending. It is just masked by the fact that most apps are pending in a message receive queue, so they don't self-exit; they are deleted by ES instead. I was able to get CFE to segfault/crash by allowing SAMPLE_APP to exit itself at the very same time that this state machine was also cleaning it up.
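For context, a typical cFE 6.x-era app main loop looks roughly like the sketch below, following the general sample_app pattern. The pipe variable, the omitted initialization, and the exact constant names are assumptions on my part and may not match the real code:

```c
#include "cfe.h"

static CFE_SB_PipeId_t SAMPLE_CommandPipe;   /* assumed created during init */

void SAMPLE_AppMain(void)
{
    uint32          RunStatus = CFE_ES_RunStatus_APP_RUN;
    CFE_SB_MsgPtr_t MsgPtr;
    int32           Status;

    /* ... app and pipe initialization would happen here ... */

    while (CFE_ES_RunLoop(&RunStatus) == true)
    {
        /* Most apps block here forever, so they never reach the self-exit
         * path below; ES deletes them instead while they are blocked. */
        Status = CFE_SB_RcvMsg(&MsgPtr, SAMPLE_CommandPipe, CFE_SB_PEND_FOREVER);

        if (Status == CFE_SUCCESS)
        {
            /* ... dispatch the received message ... */
        }
    }

    /* Reached only when CFE_ES_RunLoop() returns false, e.g. because a
     * delete or exit request is pending: this is the self-exit path that
     * can collide with the ES cleanup state machine. */
    CFE_ES_ExitApp(RunStatus);
}
```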
Expected behavior
No crashes, proper cleanup.
System observed on:
Ubuntu 20.04
Additional context
Due to the ~5 second exit/cleanup delay it is unlikely to occur "in the wild", but it can easily be forced to happen. In my test I just used a slightly modified sample_app that doesn't pend forever on CFE_SB_RcvMsg, and also delays itself such that it self-exits at the exact same time that the ES background job is running, which reliably segfaults every time.
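A reproducer along the lines described might look roughly like the variant below (reusing the declarations from the earlier sketch). The 1000 ms receive timeout and 4500 ms delay are purely illustrative guesses, not the values actually used in the report:

```c
/* Hypothetical variant of the loop above, modified as described: a finite
 * receive timeout plus a delay before exiting, so the self-exit lands in
 * the same window as the ES background cleanup. */
void SAMPLE_AppMain_ForcedRace(void)
{
    uint32          RunStatus = CFE_ES_RunStatus_APP_RUN;
    CFE_SB_MsgPtr_t MsgPtr;

    /* ... initialization as before ... */

    while (CFE_ES_RunLoop(&RunStatus) == true)
    {
        /* Finite timeout instead of CFE_SB_PEND_FOREVER, so the loop
         * cycles back to CFE_ES_RunLoop() and notices the exit request. */
        (void)CFE_SB_RcvMsg(&MsgPtr, SAMPLE_CommandPipe, 1000);
    }

    /* Hold off just long enough that CFE_ES_ExitApp() executes while the
     * ES background cleanup is also tearing this app down. */
    OS_TaskDelay(4500);

    CFE_ES_ExitApp(RunStatus);
}
```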
Reporter Info
Joseph Hickey, Vantage Systems, Inc.