-
Notifications
You must be signed in to change notification settings - Fork 313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ipc timeout caused due to incorrect task scheduling #709
Conversation
This is an alternative to #704. |
6db4c57
to
1d84dba
Compare
@@ -280,6 +280,8 @@ static inline void idc_do_cmd(void *data) | |||
|
|||
idc_cmd(&idc->received_msg); | |||
|
|||
schedule_task_complete(&idc->idc_task); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to be changing scheduler task state outside of scheduler/task subsystems.
1d84dba
to
318b4ed
Compare
@lgirdwood @xiulipan I have updated the PR based on your suggestion. But this is still untested. I will request for validation tonight and update you when the stress test finishes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xiulipan can you check this tomorrow if it fails the stress test. Thanks
There is a KW issue: Critical | SOF_FW/src/arch/xtensa/include/arch/task.h | Pointer may be dereferenced after it was positively checked for NULL |
schedule_task_running(task); | ||
run_task = 1; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ranj063 best set run_task = 0 here to fix KW issue;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood I moved the check for task->func before actually calling it. That should fix the kw issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually i take that back. I've set run_task to 0 instead. so we set the task state correctly too.
318b4ed
to
bdd1261
Compare
Because of a race condition between when the DONE interrupt is sent to the host and when the previous ipc task's state is updated to COMPLETED, new ipc's from the host could end up not being scheduled. This leads to ipc time outs on the host. In order to prevent this, this patch introduces a new task state called PENDING which is assigned to the task when it is picked as the next task to be run. The state is then updated to running when the task function is executed. This way when a ipc task comes in, a RUNNING task could get scheduled again and assigned the PENDING state to ensure that it doesnt get missed. Signed-off-by: Ranjani Sridharan <[email protected]>
bdd1261
to
f765638
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still seeing risk with this version, imagine this case:
- IPC1 arrived and be added a list, be run in _irq_task(), task->func() running and done bit is set to driver. now, task->state is running.
- IPC2 arrived before schedule_task_complete() is run, task->state is running, so task->state will be changed QUEUED and the new IPC2 will be added to list and be scheduled(sch->lock might be not held by anybody else at that moment), and then another SCHEDULE_IRQ will gonna be triggered.
- I believe this latter IRQ will be actually triggered before the former one finished, the issue is that if the former one run schedule_task_complete() now, it will change the task->state to be COMPLETED, and then the latter IPC won't be run! (as task->state != TASK_STATE_PENDING)
So, I think the patch might reduce possibility of the issue's happening, but not eliminate it thoroughly.
@keyonjie what you are describing is not possible. It is the same task. We are not really spawning off 2 different threads for the 2 ipc's. It is really only one task. |
I know we have only one task, and that is the reason why issue happens. |
@ranj063 @keyonjie @keqiaozhang @jocelyn-li @mengdonglin does this pass stress test ? |
@lgirdwood it passed on 3 boards at my end. @keqiaozhang is verifying it in PRC but there are no functional regressions with it. So I think it is good to merge. |
@lgirdwood I don't have the failed patterns to verify if this PR can fix it yet. @xiulipan can you help check? |
@ranj063 can you address my concern, why my described is not possible? |
@keyonjie maybe I am not able to imagine the scenario you explained. Lets talk Monday. According to me once the new ipc comes in, the former ipc is officially non-existent. So it is never going to come back to set the task state as COMPLETED. So that part is a bit confusing for me. |
@lgirdwood @ranj063 |
@ranj063 @keyonjie @lgirdwood Following is the workaround added by @tlauda f396500 #125 (discussion about the fix is here)
So if the task is going to accepted running task to be added. You need to remove the above code or add more check in the |
@lgirdwood @mmaka1 what do you think? Are you ok with those change? |
@ranj063
|
@@ -159,13 +160,21 @@ static void _irq_task(void *arg) | |||
task = container_of(clist, struct task, irq_list); | |||
list_item_del(clist); | |||
|
|||
if (task->func && task->state == TASK_STATE_PENDING) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should not check state here. In case previous task complete have changed the state to COMPLETED to make us drop ipc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, let's change this line to
if (task->func) {
for hot fix.
diff --git a/src/arch/xtensa/include/arch/task.h b/src/arch/xtensa/include/arch/task.h
index e981fdc8..e8eb3a2a 100644
--- a/src/arch/xtensa/include/arch/task.h
+++ b/src/arch/xtensa/include/arch/task.h
@@ -150,6 +150,8 @@ static void _irq_task(void *arg)
if (task->func && task->state == TASK_STATE_PENDING) {
schedule_task_running(task);
run_task = 1;
+ } else if (task->state == TASK_STATE_COMPLETED) {
+ trace_error(0, "PXL debug! Wrong state machine!")
} else {
run_task = 0;
} With debug code above, reproduced the potential risk more clear:
|
@xiulipan I get the problem now. I think what might fix the issue is if we check if task->state is RUNNING before setting it to COMPLETED in schedule_task_complete(). What do you think? |
@ranj063 |
|
scheduler: allow RUNNING tasks to be re-scheduled in the future.
Because of a race condition between when the DONE interrupt
is sent to the host and when the previous ipc task's
state is updated to COMPLETED, new ipc's from the host
could end up not being scheduled. This leads to ipc
time outs on the host.
In order to prevent this, this patch introduces a new task state
called PENDING which is assigned to the task when it is picked
as the next task to be run. The state is then updated to running
when the task function is executed. This way when a ipc task
comes in, a RUNNING task could get scheduled again and assigned
the PENDING state to ensure that it doesnt get missed.