
Fix ipc timeout caused due to incorrect task scheduling #709

Merged (1 commit) on Dec 17, 2018

Conversation

ranj063 (Collaborator) commented Dec 13, 2018

scheduler: allow RUNNING tasks to be re-scheduled in the future.

Because of a race condition between when the DONE interrupt
is sent to the host and when the previous IPC task's
state is updated to COMPLETED, new IPCs from the host
could end up not being scheduled. This leads to IPC
timeouts on the host.

To prevent this, this patch introduces a new task state
called PENDING, which is assigned to the task when it is picked
as the next task to be run. The state is then updated to RUNNING
when the task function is executed. This way, when an IPC task
comes in, a RUNNING task can be scheduled again and assigned
the PENDING state to ensure that it doesn't get missed.
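For illustration, a minimal hypothetical sketch of the state flow described above (names follow this thread; the real SOF scheduler types and helpers differ):

    /* Hypothetical, simplified model of the task states discussed in this PR. */
    enum task_state {
            TASK_STATE_QUEUED,      /* scheduled, waiting in the run list */
            TASK_STATE_PENDING,     /* picked as the next task to run */
            TASK_STATE_RUNNING,     /* task function currently executing */
            TASK_STATE_COMPLETED,   /* finished; may be scheduled again */
    };

    struct task {
            enum task_state state;
            void (*func)(void *data);
            void *data;
    };

    /* Re-scheduling is allowed even while the task is RUNNING: the state
     * goes back to QUEUED and becomes PENDING when the task is picked, so
     * an IPC that arrives during the previous IPC's run is not lost.
     */
    static void schedule_task(struct task *task)
    {
            task->state = TASK_STATE_QUEUED;
    }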

ranj063 (Collaborator, Author) commented Dec 13, 2018

This is an alternative to #704.

@@ -280,6 +280,8 @@ static inline void idc_do_cmd(void *data)

idc_cmd(&idc->received_msg);

schedule_task_complete(&idc->idc_task);
Member (inline comment):

I don't think we want to be changing scheduler task state outside of scheduler/task subsystems.

lgirdwood (Member) commented Dec 13, 2018

@ranj063 @xiulipan please see the example code for fixing this that I've appended to #704. We basically add a new scheduler state and allow tasks that are running to be scheduled again in the future.

ranj063 (Collaborator, Author) commented Dec 13, 2018

@lgirdwood @xiulipan I have updated the PR based on your suggestion.

But this is still untested. I will request validation tonight and update you when the stress test finishes.

ranj063 changed the title from "ipc: fix ipc timeout caused due to incorrect task state" to "Fix ipc timeout caused due to incorrect task scheduling" on Dec 13, 2018
lgirdwood (Member) left a review comment:

@xiulipan can you check this tomorrow if it fails the stress test? Thanks.

michalgrodzicki commented Dec 13, 2018

There is a KW (Klocwork) issue:

Critical | SOF_FW/src/arch/xtensa/include/arch/task.h | Pointer may be dereferenced after it was positively checked for NULL

schedule_task_running(task);
run_task = 1;
}

Member (inline comment):

@ranj063 best to set run_task = 0 here to fix the KW issue.

ranj063 (Collaborator, Author, inline comment):

@lgirdwood I moved the check for task->func before actually calling it. That should fix the KW issue.

ranj063 (Collaborator, Author, inline comment):

Actually, I take that back. I've set run_task to 0 instead, so we set the task state correctly too.
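For reference, a sketch of the shape this ends up with (inferred from the diff excerpts quoted later in this thread; not the verbatim patch):

    /* In _irq_task(): only run tasks that are still PENDING, and assign
     * run_task on every path so a NULL task->func can never be called
     * later (the Klocwork complaint above).
     */
    if (task->func && task->state == TASK_STATE_PENDING) {
            schedule_task_running(task);
            run_task = 1;
    } else {
            run_task = 0;
    }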

scheduler: allow RUNNING tasks to be re-scheduled in the future (commit message matches the PR description above)

Signed-off-by: Ranjani Sridharan <[email protected]>
keyonjie (Contributor) left a review comment:

I still see a risk with this version; imagine this case:

  1. IPC1 arrives and is added to the list, then run in _irq_task(); task->func() runs and the DONE bit is set to the driver. Now task->state is RUNNING.
  2. IPC2 arrives before schedule_task_complete() runs. task->state is RUNNING, so task->state will be changed to QUEUED, the new IPC2 will be added to the list and scheduled (sch->lock might not be held by anybody else at that moment), and then another SCHEDULE_IRQ will be triggered.
  3. I believe this latter IRQ will actually be triggered before the former one finishes. The issue is that if the former one runs schedule_task_complete() now, it will change task->state to COMPLETED, and then the latter IPC won't be run (as task->state != TASK_STATE_PENDING).

So I think the patch might reduce the likelihood of the issue, but not eliminate it completely.
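Using the simplified types sketched near the top of this thread, the window described above would look roughly like this (an assumed shape for schedule_task_complete(), not the actual SOF source):

    /* Illustrative only: an unconditional completion clobbers whatever
     * state the second IPC has just set, so _irq_task() later sees
     * state != TASK_STATE_PENDING and skips the task.
     */
    static void schedule_task_complete(struct task *task)
    {
            task->state = TASK_STATE_COMPLETED;
    }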

ranj063 (Collaborator, Author) commented Dec 14, 2018

@keyonjie what you are describing is not possible. It is the same task. We are not really spawning off two different threads for the two IPCs; it is really only one task.

keyonjie (Contributor) commented:

> @keyonjie what you are describing is not possible. It is the same task. We are not really spawning off two different threads for the two IPCs; it is really only one task.

I know we have only one task, and that is exactly why the issue happens.

lgirdwood (Member) commented:

@ranj063 @keyonjie @keqiaozhang @jocelyn-li @mengdonglin does this pass the stress test?

ranj063 (Collaborator, Author) commented Dec 14, 2018

@lgirdwood it passed on three boards at my end. @keqiaozhang is verifying it in PRC, but there are no functional regressions with it, so I think it is good to merge.

keyonjie (Contributor) commented:

@lgirdwood I don't yet have the failing patterns to verify whether this PR fixes it. @xiulipan can you help check?

keyonjie (Contributor) commented:

@ranj063 can you address my concern: why is the scenario I described not possible?

ranj063 (Collaborator, Author) commented Dec 14, 2018

> @ranj063 can you address my concern: why is the scenario I described not possible?

@keyonjie maybe I am not able to imagine the scenario you explained. Let's talk on Monday.

As I see it, once the new IPC comes in, the former IPC is officially non-existent, so it is never going to come back and set the task state to COMPLETED. That part is a bit confusing to me.

keqiaozhang (Collaborator) commented:

@lgirdwood @ranj063
I think this PR is good to merge now.
I tested on one Yorp and one Bobba; both of them passed 2500+ iterations.

xiulipan (Contributor) commented:

@ranj063 @keyonjie @lgirdwood
One hint for the scheduler refinement:
For pipeline_task, pipeline_xrun_recover() will schedule a task from pipeline_trigger(COMP_TRIGGER_START) that will be ignored, since the xrun recovery is called from pipeline_task itself.

Following is the workaround added by @tlauda in f396500 #125 (discussion about the fix is here):

        /* for playback copy it here, because scheduling won't work
         * on this interrupt level
         */
        if (p->sched_comp->params.direction == SOF_IPC_STREAM_PLAYBACK) {
                ret = pipeline_copy(p->sched_comp);
                if (ret < 0) {
                        trace_pipe_error_with_ids(p, "pipeline_xrun_recover() "
                                                  "error: pipeline_copy() "
                                                  "failed, ret = %d", ret);
                        return ret;
                }
        }

So if the scheduler is going to accept a running task being added again, you need to remove the above code or add more checks in pipeline_trigger().

michalgrodzicki commented:

@lgirdwood @mmaka1 what do you think? Are you ok with these changes?

xiulipan (Contributor) commented:

@ranj063
It seems the same risk happens in this PR as in your first one.
Say we have task A coming first and task B coming second. In some cases, task A will change the state to COMPLETED so that task B cannot be run.

    0      2          IPC             73024.635417        43.541668            apl-ipc.c:65     IRQ
    0      2          IPC             73028.020833         3.385417            apl-ipc.c:72     Nms
    0      2         PIPE             73031.406250         3.385417           schedule.c:257    ad!
    0      2         PIPE             73037.291667         5.885417           schedule.c:364    run
    0      2         PIPE             73040.729167         3.437500           schedule.c:183    edf
    0      2         PIPE             73049.218750         8.489583           schedule.c:337    com //task A
    0      2         PIPE             73054.583333         5.364583           schedule.c:337    com //task B
    0      2          IPC             73061.145833         6.562500            apl-ipc.c:172    Msg
    0      2          IPC             73113.645833        52.500000            apl-ipc.c:65     IRQ
    0      2          IPC             73117.135417         3.489583            apl-ipc.c:72     Nms
    0      1          IPC             73120.416667         3.281250            apl-ipc.c:82     Pen
    0      2          IPC             73123.750000         3.333333            apl-ipc.c:92     Rpy
    0      2         HOST            572977.447917    499853.687500            hda-dma.c:231    GwU
    0      2          IPC            573022.343750        44.895832            apl-ipc.c:172    Msg
    0      2          IPC            573401.406250       379.062500            apl-ipc.c:65     IRQ
    0      2          IPC            573404.895833         3.489583            apl-ipc.c:72     Nms
    0      1          IPC            573408.072917         3.177083            apl-ipc.c:82     Pen
    0      2          IPC            573410.937500         2.864583            apl-ipc.c:92     Rpy

@@ -159,13 +160,21 @@ static void _irq_task(void *arg)
task = container_of(clist, struct task, irq_list);
list_item_del(clist);

if (task->func && task->state == TASK_STATE_PENDING) {
Contributor (inline comment):

Maybe we should not check the state here, in case a previous task completion has already changed the state to COMPLETED and makes us drop the IPC.

Contributor (inline comment):

Agree, let's change this line to

if (task->func) {

as a hot fix.

xiulipan (Contributor) commented:

diff --git a/src/arch/xtensa/include/arch/task.h b/src/arch/xtensa/include/arch/task.h
index e981fdc8..e8eb3a2a 100644
--- a/src/arch/xtensa/include/arch/task.h
+++ b/src/arch/xtensa/include/arch/task.h
@@ -150,6 +150,8 @@ static void _irq_task(void *arg)
                if (task->func && task->state == TASK_STATE_PENDING) {
                        schedule_task_running(task);
                        run_task = 1;
+               } else if (task->state == TASK_STATE_COMPLETED) {
+                       trace_error(0, "PXL debug! Wrong state machine!");
                } else {
                        run_task = 0;
                }

With the debug code above, the potential risk is reproduced more clearly:

    0      2          IPC             72082.864583        44.114582            apl-ipc.c:65     IRQ
    0      2          IPC             72086.354167         3.489583            apl-ipc.c:72     Nms
    0      2         PIPE             72089.843750         3.489583           schedule.c:257    ad!
    0      2         PIPE             72095.625000         5.781250           schedule.c:364    run
    0      2         PIPE             72099.322917         3.697917           schedule.c:183    edf
    0      2         PIPE             72107.604167         8.281250           schedule.c:337    com
    0      1      unknown             72112.552083         4.947917 /home/pxl/work/sof/sofp3/src/arch/xtensa/include/arch/task.h:154    PXL debug! Wrong state machine!
    0      2         PIPE             72116.093750         3.541667           schedule.c:337    com
    0      2          IPC             72122.812500         6.718750            apl-ipc.c:172    Msg
    0      2          IPC             72176.666667        53.854168            apl-ipc.c:65     IRQ
    0      2          IPC             72180.156250         3.489583            apl-ipc.c:72     Nms
    0      1          IPC             72183.333333         3.177083            apl-ipc.c:82     Pen
    0      2          IPC             72186.250000         2.916667            apl-ipc.c:92     Rpy
    0      2         HOST            572034.895833    499848.656250            hda-dma.c:231    GwU

ranj063 (Collaborator, Author) commented Dec 17, 2018

@xiulipan I get the problem now. I think what might fix the issue is checking whether task->state is RUNNING before setting it to COMPLETED in schedule_task_complete(). What do you think?
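A hedged sketch of that idea, using the same simplified types as earlier (illustrative only, not the actual SOF implementation):

    /* Only mark the task COMPLETED if it is still RUNNING; if a new IPC
     * has already re-scheduled it (state back to QUEUED or PENDING),
     * leave the state alone so the pending run is not dropped.
     */
    static void schedule_task_complete(struct task *task)
    {
            if (task->state == TASK_STATE_RUNNING)
                    task->state = TASK_STATE_COMPLETED;
    }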

xiulipan (Contributor) commented:

@ranj063
Then what will the state be set to if it is not RUNNING?
The riskiest thing now is that the two IPC tasks are modifying the same state, and they overlap.

lgirdwood merged commit 6d59eac into thesofproject:master on Dec 17, 2018
lgirdwood (Member) commented:

@ranj063 either @keyonjie or @xiulipan will send a subsequent PR with an update for further stress testing.

ranj063 (Collaborator, Author) commented Dec 17, 2018

> Then what will the state be set to if it is not RUNNING?

@xiulipan It will be whatever task B set it to, which in the case you described will be PENDING. And that will allow it to be run again.

> The riskiest thing now is that the two IPC tasks are modifying the same state, and they overlap.
