Removed the 'configASSERT( xInheritanceOccurred == pdFALSE )' assertion from xQueueSemaphoreTake (IDFGH-8766)#10199
Conversation
Fixes rare deadlock on heavy loaded multicore-systems.
|
Thanks for your contribution. |
There was a problem hiding this comment.
@timoxd7 Changes LGTM.
I think the issue described in the post is a manifestation of more general "spinlock acquire race condition" that can occur on SMP systems under the current FreeRTOS queue/semaphore/mutex implementation.
Basically, because there is a gap in the critical section between each loop iteration (see below), this can lead to some race conditions when there are multiple cores running concurrently.
BaseType_t xQueueSend/Receive/SemaphoreTake(...)
{
// Main fucntion loop
for(;;) {
taskENTER_CRITICAL();
// Check for empty/full queue/semaphore/mutex
taskEXIT_CRITICAL();
// Handle timeout stuff
taskENTER_CRITICAL();
if (timed_out) {
portYIELD_WITHIN_API();
}
taskEXIT_CRITICAL();
/*
- If yielding, it occurs after we exit the critical section.
- Once we schedule back here, we need to enter the critical section again
which can lead to race conditions if the other CPU gets enters it
faster than we do.
*/
}
}
We've run into a similar issue with xQueueReceive() before where a lower priority task can "steal" a message intended for a higher priority task that was blocked on the queue. The flow would look like the following:
- Task A (CPU 0, Priority 5) calls
xQueueReceive()on a empty queue, thus it blocks - Task B (CPU 0, Priority 4) calls
xQueueSend()to send a message to the queue, thus unblocking Task A. - Task A (on CPU 0) has how scheduled back, and needs to enter the critical section again to read the message.
- At the same time a Task C (CPU 1, Priority 3) also calls
xQueueReceive()at the same time, and is quicker than Task A to enter the critical section. - Task C ends up reading the message while Task A spins waiting to enter the critical section. Thus Task C as "stolen" Task A's message, even though Task C had a lower priority.
I'm not sure if there's clean way we can work around this SMP race condition, unless we add some information in the queue/mutexes regarding the last task that was unblocked on the queue.
Regardless, the PR changes LGTM. Thanks for the contribution.
|
sha=355abfdff6f1039f3b98b598c484f6103ade9749 |
|
@Dazza0 Stable branches also need this fix? |
|
Thanks for merging this into IDF! I would greatly appreciate it if it would be merged into 4.4.3 . I have manually applied this change to 4.4.2 and it works fine for me. |
|
@AxelLin @Erlkoenig90 Backports have already been internally raised and pending merge. |
|
@Dazza0 The fix is in v4.4+, but v4.3 also needs fix. |
FreeRTOS References:
Commit: FreeRTOS/FreeRTOS-Kernel@4e2bf2c
Pull-Request: FreeRTOS/FreeRTOS-Kernel#576
Problem Description: https://forums.freertos.org/t/15967
This pull-request is purposed to copy the bugfix from the FreeRTOS Repo into esp-idf.
All credit goes to @Erlkoenig90.
Original FreeRTOS Pull Request Description
Removed the 'configASSERT( xInheritanceOccurred == pdFALSE )' assertion from xQueueSemaphoreTake as the reasoning behind it is wrong; it can trigger on wrongly on highly-contested semaphores on multicore systems. See the forum for details.