Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasional deadlock issue in tests that delete tasks #1027

Closed
jphickey opened this issue May 19, 2021 · 1 comment · Fixed by #1028 or #1050
Closed

Occasional deadlock issue in tests that delete tasks #1027

jphickey opened this issue May 19, 2021 · 1 comment · Fixed by #1028 or #1050
Assignees
Labels
Milestone

Comments

@jphickey
Copy link
Contributor

Describe the bug
When running the unit tests repeatedly, occasionally some tests are getting into a deadlock. These tests are ones that:

  1. use sub-tasks to do various work
  2. those sub-tasks use OS_printf()
  3. use an asynchronous OS_TaskDelete during their cleanup/teardown

In the event that the sub-task was in the midst of an OS_printf() call when OS_TaskDelete was invoked, the underlying BSP lock will not get released.

Observed in mutex-test, but others may have similar patterns.

To Reproduce
Run mutex-test repeatedly, may deadlock at some runs. (it is a race condition, not 100% reproducible)

Expected behavior
Should run consistently.

System observed on:
Ubuntu

Additional context
This really just a symptom of a generic/known issue with OS_TaskDelete, in that other resources held by that task are not necessarily tracked or freed, depending on what it was doing at the time it was deleted.

Linux/Pthreads does have a workaround but the issue is likely to exist on all OS's

Reporter Info
Joseph Hickey, Vantage Systems, Inc.

@jphickey jphickey added the bug label May 19, 2021
@jphickey jphickey self-assigned this May 19, 2021
@skliper skliper added this to the 6.0.0 milestone May 19, 2021
@jphickey
Copy link
Contributor Author

Did some more digging on this one, because I am using the async console/utility task enabled for OS_printf, this should protect against any issues with the OS_printf calls.

In this case the issue is a little deeper:

  1. When OS_TaskDelete is called by the main thread, the task record/ID is invalidated internally
  2. But on Pthreads/Linux the task (task 2 in this case) keeps running until it hits a cancellation point
  3. Next cancellation point will be during the clock_nanosleep() call (part of OS_TaskDelay at the top of the loop)
  4. So Task 2 will still attempt to give the mutex one more time (OS_MutSemGive).
  5. Inside OS_MutSemGive, this calls OS_TaskGetId as part of the debug check to confirm that this task owns the mutex
  6. But since the task is pending deletion, OS_TaskGetId actually returns OS_OBJECT_ID_UNDEFINED here:
    task_id = OS_OBJECT_ID_UNDEFINED;
  7. This in turn causes it to fail the test and generate a debug message here:
    OS_DEBUG("WARNING: Task %lu giving mutex %lu while owned by task %lu\n", OS_ObjectIdToInteger(self_task),
  8. The debug message is output synchronously, and will directly invoke the BSP to print the message on the console. This will lock the BSP, then call OS_BSP_ConsoleOutput_Impl, which in turn invokes write().
  9. But write() is also cancellation point - so this is where the cancellation will happen, and it will leave the BSP locked.

So in the end, this particular sequence of events is really a product of the deferred thread cancellation that exists on POSIX, and the fact that the write system call is a cancellation point, and that the implementation now calls this function with the mutex locked.

It also wouldn't happen if OS_CONFIG_DEBUG_PRINTF is disabled, and also related to the fact that the OS_TaskGetId doesn't work once deletion is pending. Basically if any one of these is changed, the issue doesn't happen.

jphickey added a commit to jphickey/osal that referenced this issue May 19, 2021
Resolves two related issues:
- OS_TaskGetId does not return a valid value for tasks where cancellation
  is pending, but they are still running.  This in turn is likely to trigger
  other (bogus) debug checks which invoke OS_DEBUG and in turn do console writes.
- The console write itself is a cancellation point, which is now done while
  holding a BSP mutex.  If canceled here, then the mutex is not released.

Solution is in two parts:
- OS_TaskGetId should return the task ID it knows about, regardless of whether
  the task is pending cancellation or not.
- Defer cancellation of the task while the BSP is locked, ensure it reaches the
  unlock, then restore the previous cancel state.
jphickey added a commit to jphickey/osal that referenced this issue May 19, 2021
Resolves two related issues:
- OS_TaskGetId does not return a valid value for tasks where cancellation
  is pending, but they are still running.  This in turn is likely to trigger
  other (bogus) debug checks which invoke OS_DEBUG and in turn do console writes.
- The console write itself is a cancellation point, which is now done while
  holding a BSP mutex.  If canceled here, then the mutex is not released.

Solution is in two parts:
- OS_TaskGetId should return the task ID it knows about, regardless of whether
  the task is pending cancellation or not.
- Defer cancellation of the task while the BSP is locked, ensure it reaches the
  unlock, then restore the previous cancel state.
astrogeco added a commit that referenced this issue May 20, 2021
Fix #1027, defer cancellation when BSP locked
pepepr08 pushed a commit to pepepr08/osal that referenced this issue Jun 9, 2021
Resolves two related issues:
- OS_TaskGetId does not return a valid value for tasks where cancellation
  is pending, but they are still running.  This in turn is likely to trigger
  other (bogus) debug checks which invoke OS_DEBUG and in turn do console writes.
- The console write itself is a cancellation point, which is now done while
  holding a BSP mutex.  If canceled here, then the mutex is not released.

Solution is in two parts:
- OS_TaskGetId should return the task ID it knows about, regardless of whether
  the task is pending cancellation or not.
- Defer cancellation of the task while the BSP is locked, ensure it reaches the
  unlock, then restore the previous cancel state.
jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022
jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022
Remove reference to `CFE_MSG_Message_t` in doxygen comments
jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022
cFE Integration Candidate: 2020-11-24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants