
[BUG] Deadlock when ending task scheduler on POSIX #1217

Closed
denravonska opened this issue Jan 3, 2025 · 21 comments
Labels
bug Something isn't working

Comments

@denravonska

denravonska commented Jan 3, 2025

Describe the bug
We have a unit test runner that spawns a FreeRTOS task that runs our test suite and then calls vTaskEndScheduler to allow the main function to exit. This works most of the time but we noticed that there's an occasional deadlock.

Target

  • Development board: Host
  • Instruction Set Architecture: x64
  • IDE and version: Visual Studio Code 1.96.2
  • Toolchain and version: gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
  • FreeRTOS commit: e55bde2

Host

  • Host OS: Ubuntu
  • Version: 24.04

To Reproduce
Example code:

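#include <FreeRTOS.h>
#include <task.h>
#include <stdio.h>

// Single task that immediately asks the scheduler to stop.
void Task(void *)
{
    vTaskEndScheduler();
    vTaskDelete(nullptr);
}

int main(int argc, char ** argv)
{
    xTaskCreate(Task, "MainTask", 8192, nullptr, 6, nullptr);
    vTaskStartScheduler();

    // Reached only after vTaskEndScheduler() hands control back to main.
    printf("Done\n");
    return 0;
}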

Running this in a loop helps trigger the issue. For me it triggers faster if I switch to another terminal:

while /bin/true; do ./test ; done

Looking at the threads we can see that the main task is stuck trying to take a mutex:

(gdb) info threads
  Id   Target Id                                             Frame 
* 1    Thread 0x78b84772ae40 (LWP 4087203) "Scheduler"       0x000078b846045fb8 in __GI___sigtimedwait (set=set@entry=0x78b84400ae60, info=info@entry=0x7ffd41908d10, timeout=timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/sigtimedwait.c:31
  2    Thread 0x78b83f8006c0 (LWP 4087207) "Scheduler timer" 0x000078b8460ecadf in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x78b83f7ffb40, rem=rem@entry=0x0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  3    Thread 0x78b842e006c0 (LWP 4087204) "MainTask"        futex_wait (private=0, expected=2, futex_word=0x5080000000a0) at ../sysdeps/nptl/futex-internal.h:146
  
(gdb) thread 3
[Switching to thread 3 (Thread 0x78b842e006c0 (LWP 4087204))]
#0  futex_wait (private=0, expected=2, futex_word=0x5080000000a0) at ../sysdeps/nptl/futex-internal.h:146
warning: 146	../sysdeps/nptl/futex-internal.h: No such file or directory
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x5080000000a0) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x5080000000a0, private=0) at ./nptl/lowlevellock.c:49
#2  0x000078b8460a00f1 in lll_mutex_lock_optimized (mutex=0x5080000000a0) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x5080000000a0) at ./nptl/pthread_mutex_lock.c:93
#4  0x000062f66ed88eb7 in event_signal (ev=0x5080000000a0) at ../third-party/freertos/repo/portable/ThirdParty/GCC/Posix/utils/wait_for_event.c:104
#5  0x000062f66ed886cf in vPortCancelThread (pxTaskToDelete=<optimized out>) at ../third-party/freertos/repo/portable/ThirdParty/GCC/Posix/port.c:445
#6  0x000062f66ed774a0 in prvDeleteTCB (pxTCB=pxTCB@entry=0x62f66efdf720 <xIdleTaskTCB.3>) at ../third-party/freertos/repo/tasks.c:6445
#7  0x000062f66ed78726 in vTaskDelete (xTaskToDelete=<optimized out>) at ../third-party/freertos/repo/tasks.c:2316
#8  0x000062f66ed79fea in vTaskEndScheduler () at ../third-party/freertos/repo/tasks.c:3797
#9  0x000062f66eceeb26 in Task () at ../test/src/main.cpp:12
#10 0x000062f66ed881b0 in prvWaitForStart (pvParams=pvParams@entry=0x62f66eff0928 <ucHeap+65512>) at ../third-party/freertos/repo/portable/ThirdParty/GCC/Posix/port.c:465
#11 0x000078b84705ea42 in asan_thread_start (arg=0x78b846ef9000) at ../../../../src/libsanitizer/asan/asan_interceptors.cpp:234
#12 0x000078b84609ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#13 0x000078b846129c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

What's interesting is, if I interpret this correctly, that the mutex owner no longer exists:

(gdb) print ev.mutex
$3 = {__data = {__lock = 2, __count = 0, __owner = 4087205, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
  __size = "\002\000\000\000\000\000\000\000\245]>\000\001", '\000' <repeats 26 times>, __align = 2}
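For readers unfamiliar with this failure mode, here is a minimal sketch of one way a mutex can end up owned by a thread that no longer exists. It is not the FreeRTOS port code: the helper names `lock_and_exit` and `trylock_after_owner_died` are made up for illustration, and the behavior shown is what a default (non-robust) mutex does on Linux with glibc, where the lock simply stays held after its owner terminates.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: lock the mutex, then let the thread end without unlocking.
static void *lock_and_exit(void *arg)
{
    pthread_mutex_lock(static_cast<pthread_mutex_t *>(arg));
    return nullptr; // The thread terminates while still owning the mutex.
}

// Hypothetical helper: returns the result of pthread_mutex_trylock() on a
// default mutex whose owning thread has already terminated.
int trylock_after_owner_died()
{
    pthread_mutex_t mutex;
    int rc = pthread_mutex_init(&mutex, nullptr); // Default, non-robust attributes.
    assert(rc == 0);

    pthread_t owner;
    rc = pthread_create(&owner, nullptr, lock_and_exit, &mutex);
    assert(rc == 0);
    rc = pthread_join(owner, nullptr); // The owner is gone; the mutex is still locked.
    assert(rc == 0);
    (void)rc;

    // A blocking pthread_mutex_lock() here would wait forever, which is the
    // futex_wait seen in the backtrace. trylock reports the state instead.
    return pthread_mutex_trylock(&mutex);
}
```

Calling `trylock_after_owner_died()` is expected to return `EBUSY` on Linux with glibc, since nothing is left to release the lock.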
@denravonska denravonska added the bug Something isn't working label Jan 3, 2025
@rawalexe
Member

rawalexe commented Jan 7, 2025

Hello @denravonska,
Thank you for your report, I'll forward this to the team and have a look.

@rawalexe
Member

rawalexe commented Jan 7, 2025

How long are you waiting? I modified the code a bit and tried it, but I cannot reproduce the issue.

@denravonska
Author

Tried it with this script:

#!/bin/bash

counter=0

while true; do
   echo -n "$counter "
   ./test.out
   let counter++
done

Hung on run:

  • 9
  • 5481
  • 390
  • 1264
  • 735

@denravonska
Author

Adding the config we're using if it helps.
FreeRTOSConfig.h.txt

@rawalexe
Member

rawalexe commented Jan 10, 2025

I tried to build with the config and got a build error:

./FreeRTOS.h:2674:10: error: #error If configGENERATE_RUN_TIME_STATS is defined then portCONFIGURE_TIMER_FOR_RUN_TIME_STATS must also be defined. portCONFIGURE_TIMER_FOR_RUN_TIME_STATS should call a port layer function to setup a peripheral timer/counter that can then be used as the run time counter time base.
 2674 |         #error If configGENERATE_RUN_TIME_STATS is defined then portCONFIGURE_TIMER_FOR_RUN_TIME_STATS must also be defined.  portCONFIGURE_TIMER_FOR_RUN_TIME_STATS should call a port layer function to setup a peripheral timer/counter that can then be used as the run time counter time base.
./FreeRTOS.h:2679:14: error: #error If configGENERATE_RUN_TIME_STATS is defined then either portGET_RUN_TIME_COUNTER_VALUE or portALT_GET_RUN_TIME_COUNTER_VALUE must also be defined. See the examples provided and the FreeRTOS web site for more information.
 2679 |             #error If configGENERATE_RUN_TIME_STATS is defined then either portGET_RUN_TIME_COUNTER_VALUE or portALT_GET_RUN_TIME_COUNTER_VALUE must also be defined.  See the examples provided and the FreeRTOS web site for more information.

I can define the value but wanted to know what do you have for it?

@rawalexe
Member

Can you provide the whole application and email it to me at [email protected]?

@denravonska
Author

I tried to build with the config and got a build error:

./FreeRTOS.h:2674:10: error: #error If configGENERATE_RUN_TIME_STATS is defined then portCONFIGURE_TIMER_FOR_RUN_TIME_STATS must also be defined. portCONFIGURE_TIMER_FOR_RUN_TIME_STATS should call a port layer function to setup a peripheral timer/counter that can then be used as the run time counter time base.
 2674 |         #error If configGENERATE_RUN_TIME_STATS is defined then portCONFIGURE_TIMER_FOR_RUN_TIME_STATS must also be defined.  portCONFIGURE_TIMER_FOR_RUN_TIME_STATS should call a port layer function to setup a peripheral timer/counter that can then be used as the run time counter time base.
./FreeRTOS.h:2679:14: error: #error If configGENERATE_RUN_TIME_STATS is defined then either portGET_RUN_TIME_COUNTER_VALUE or portALT_GET_RUN_TIME_COUNTER_VALUE must also be defined. See the examples provided and the FreeRTOS web site for more information.
 2679 |             #error If configGENERATE_RUN_TIME_STATS is defined then either portGET_RUN_TIME_COUNTER_VALUE or portALT_GET_RUN_TIME_COUNTER_VALUE must also be defined.  See the examples provided and the FreeRTOS web site for more information.

I can define the value but wanted to know what do you have for it?

That's really weird. We don't define portCONFIGURE_TIMER_FOR_RUN_TIME_STATS at all, and I've verified that the FreeRTOSConfig.h gets included.

I have sent you a binary built with the following:

gcc -static -o freeze -O2 -g -ggdb \
    -I ../third-party/freertos/config  \
    -I $FREERTOS_ROOT/include \
    -I $FREERTOS_ROOT/portable/ThirdParty/GCC/Posix \
    src/main.cpp \
    $FREERTOS_ROOT/*.c \
    $FREERTOS_ROOT/portable/ThirdParty/GCC/Posix/port.c \
    $FREERTOS_ROOT/portable/ThirdParty/GCC/Posix/utils/wait_for_event.c \
    $FREERTOS_ROOT/portable/MemMang/heap_4.c

where ../third-party/freertos/config is the location of the above config and src/main.cpp is the above example. After sending I noticed that it also freezes with -O0 so I can provide you with a binary of that as well if it helps debugging.

@denravonska
Author

denravonska commented Jan 10, 2025

I did some more testing with the config from examples/template_configuration and I am getting the freeze there as well. I had to modify my example by reducing the stack size and priority.

#include <FreeRTOS.h>
#include <task.h>
#include <stdio.h>

void vApplicationStackOverflowHook( TaskHandle_t xTask, char *pcTaskName)
{
    printf("OVERFLOW!\n");
}

void Task(void *)
{
    vTaskEndScheduler();
    vTaskDelete(nullptr);
}

int main(int argc, char ** argv)
{
    xTaskCreate(Task, "MainTask", 256, nullptr, 4, nullptr);
    vTaskStartScheduler();

    printf("Done\n");
    return 0;
}

Edit: I've also switched laptops and I get it on Arch in addition to Ubuntu.

@rawalexe
Member

The document that you emailed me isn't the correct one. Can you send me a valid zip or tar file?

@denravonska
Author

denravonska commented Jan 15, 2025

I'm not sure what you mean. It's a minimal binary that has the deadlock issue. If you need the source code rather than the binary it's in the post above.

@tymmej

tymmej commented Jan 22, 2025

I've encountered the same issue after slightly modifying the example app: https://forums.freertos.org/t/freertos-hangs-after-vtaskendscheduler-in-posix-gcc-port/22287

@denravonska
Author

denravonska commented Jan 24, 2025

Now we are seeing it when our test runner calls exit(). That will in turn destroy our Thread objects (FreeRTOS task wrappers) which will kill their wrapped task. Sometimes it gets stuck on the same mutex which is held by a destroyed task, presumably the timer task again.

(gdb) info threads
  Id   Target Id                                         Frame 
* 1    Thread 0x7fce3177be40 (LWP 662) "Scheduler"       0x00007fce317c2fb8 in __GI___sigtimedwait (set=set@entry=0x7fce2f609e40, info=info@entry=0x7ffd2d7bbf00, timeout=timeout@entry=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:31
  2    Thread 0x7fce2bc006c0 (LWP 667) "Scheduler timer" 0x00007fce31869adf in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fce2bbffb40, rem=rem@entry=0x0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  3    Thread 0x7fce2c6006c0 (LWP 666) "Tmr Svc"         0x00007fce31815d61 in __futex_abstimed_wait_common64 (private=32718, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5080000001f0) at ./nptl/futex-internal.c:57
  4    Thread 0x7fce2d0006c0 (LWP 665) "IDLE"            0x00007fce31815d61 in __futex_abstimed_wait_common64 (private=32718, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x508000000170) at ./nptl/futex-internal.c:57
  5    Thread 0x7fce2da006c0 (LWP 664) "MainTask"        futex_wait (private=0, expected=2, futex_word=0x508000000020) at ../sysdeps/nptl/futex-internal.h:146
(gdb) thread 5
[Switching to thread 5 (Thread 0x7fce2da006c0 (LWP 664))]
#0  futex_wait (private=0, expected=2, futex_word=0x508000000020) at ../sysdeps/nptl/futex-internal.h:146
warning: 146	../sysdeps/nptl/futex-internal.h: No such file or directory

So it might not just be limited to vTaskEndScheduler as we don't call that anymore.

@aggarg
Member

aggarg commented Jan 24, 2025

Would you please try the following patch - posix_port.patch?

@denravonska
Author

denravonska commented Jan 24, 2025

Would you please try the following patch - posix_port.patch?

Tested with this example and I can't reproduce it with the patch applied. Without it, it still hangs after around 1k iterations; with it, it's still running after 247k iterations. I'll leave it running overnight but it looks promising.

Edit: Turned it off after 4 million iterations.

@aggarg
Member

aggarg commented Jan 25, 2025

I just want to confirm that we can reasonably conclude that the patch addresses the problem?

@denravonska
Author

I just want to confirm that we can reasonably conclude that the patch addresses the problem?

I can confirm that this fixes both my problems:

  • Calling vTaskEndScheduler
  • Killing tasks via C++ destructors that run during exit

@aggarg
Member

aggarg commented Jan 25, 2025

Thank you for confirming!

@denravonska
Author

denravonska commented Jan 25, 2025

Looking at the documentation it might be a good idea to incorporate this:

If a mutex is initialized with the PTHREAD_MUTEX_ROBUST
attribute and its owner dies without unlocking it, any
future attempts to call pthread_mutex_lock(3) on this
mutex will succeed and return EOWNERDEAD to indicate that
the original owner no longer exists and the mutex is in an
inconsistent state. Usually after EOWNERDEAD is returned,
the next owner should call pthread_mutex_consistent(3) on
the acquired mutex to make it consistent again before
using it any further.
If the next owner unlocks the mutex using
pthread_mutex_unlock(3) before making it consistent, the
mutex will be permanently unusable and any subsequent
attempts to lock it using pthread_mutex_lock(3) will fail
with the error ENOTRECOVERABLE. The only permitted
operation on such a mutex is pthread_mutex_destroy(3).
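The recovery sequence described in that passage can be sketched roughly as follows. This is a generic pthreads illustration, not the patch applied to the FreeRTOS POSIX port; `die_holding_lock`, `RobustResult` and `recover_dead_owner_mutex` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: the owner thread terminates without unlocking.
static void *die_holding_lock(void *arg)
{
    pthread_mutex_lock(static_cast<pthread_mutex_t *>(arg));
    return nullptr;
}

// Hypothetical result type: return codes of the two lock attempts.
struct RobustResult
{
    int first_lock;  // Lock attempt right after the owner died.
    int second_lock; // Lock attempt after the mutex was made consistent.
};

RobustResult recover_dead_owner_mutex()
{
    // Request a robust mutex instead of the default (stalled) kind.
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);

    pthread_mutex_t mutex;
    int rc = pthread_mutex_init(&mutex, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    pthread_t owner;
    rc = pthread_create(&owner, nullptr, die_holding_lock, &mutex);
    assert(rc == 0);
    pthread_join(owner, nullptr);
    (void)rc;

    RobustResult result{};

    // The lock is granted, but the return code flags the dead owner.
    result.first_lock = pthread_mutex_lock(&mutex);
    if (result.first_lock == EOWNERDEAD)
    {
        // Mark the protected state as usable again before unlocking;
        // skipping this would leave the mutex ENOTRECOVERABLE.
        pthread_mutex_consistent(&mutex);
    }
    pthread_mutex_unlock(&mutex);

    // After recovery the mutex behaves like a normal one again.
    result.second_lock = pthread_mutex_lock(&mutex);
    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return result;
}
```

The key design point is that a caller of `pthread_mutex_lock` on a robust mutex has to check the return value: `EOWNERDEAD` means the lock was acquired, so treating it as a plain failure would leak the lock again.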

aggarg added a commit to aggarg/FreeRTOS-Kernel that referenced this issue Jan 25, 2025
Prevent application hangs that occur when a thread dies while holding a
mutex, particularly during vTaskEndScheduler or exit calls. This is
achieved by setting the PTHREAD_MUTEX_ROBUST attribute on the mutex.

Fixes:
- GitHub issue: FreeRTOS#1217
- Forum thread: freertos.org/t/22287

Signed-off-by: Gaurav Aggarwal <[email protected]>
@aggarg
Member

aggarg commented Jan 25, 2025

Sure, thank you for the suggestion! I have raised the following PR: #1233.

Would be grateful if you can give it a try as well.

@denravonska
Author

Sure, thank you for the suggestion! I have raised the following PR: #1233.

Would be grateful if you can give it a try as well.

Tested and it seems to be working fine. Thanks!

aggarg added a commit that referenced this issue Jan 29, 2025
Mark mutex as robust to prevent deadlocks

Prevent application hangs that occur when a thread dies while holding a
mutex, particularly during vTaskEndScheduler or exit calls. This is
achieved by setting the PTHREAD_MUTEX_ROBUST attribute on the mutex.

Fixes:
- GitHub issue: #1217
- Forum thread: freertos.org/t/22287

Signed-off-by: Gaurav Aggarwal <[email protected]>
@aggarg
Member

aggarg commented Jan 29, 2025

Thank you @denravonska for reporting the issue and testing the fix!
