
Fix ipc timeout caused due to incorrect task scheduling #709

Merged (1 commit) on Dec 17, 2018

Conversation

ranj063 (Collaborator) commented Dec 13, 2018

scheduler: allow RUNNING tasks to be re-scheduled in the future.

Because of a race condition between when the DONE interrupt
is sent to the host and when the previous IPC task's
state is updated to COMPLETED, new IPCs from the host
could end up not being scheduled. This leads to IPC
timeouts on the host.

To prevent this, this patch introduces a new task state
called PENDING, which is assigned to the task when it is picked
as the next task to be run. The state is then updated to RUNNING
when the task function is executed. This way, when an IPC task
comes in, a RUNNING task can be scheduled again and assigned
the PENDING state to ensure that it doesn't get missed.
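For illustration, a minimal hypothetical sketch of the state flow described above (names follow this thread; the real SOF scheduler types and helpers differ):

    /* Hypothetical, simplified model of the task states discussed in this PR. */
    enum task_state {
            TASK_STATE_QUEUED,      /* scheduled, waiting in the run list */
            TASK_STATE_PENDING,     /* picked as the next task to run */
            TASK_STATE_RUNNING,     /* task function currently executing */
            TASK_STATE_COMPLETED,   /* finished; may be scheduled again */
    };

    struct task {
            enum task_state state;
            void (*func)(void *data);
            void *data;
    };

    /* Re-scheduling is allowed even while the task is RUNNING: the state
     * goes back to QUEUED and becomes PENDING when the task is picked, so
     * an IPC that arrives during the previous IPC's run is not lost.
     */
    static void schedule_task(struct task *task)
    {
            task->state = TASK_STATE_QUEUED;
    }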

ranj063 (Collaborator, Author) commented Dec 13, 2018

This is an alternative to #704.

@@ -280,6 +280,8 @@ static inline void idc_do_cmd(void *data)

idc_cmd(&idc->received_msg);

schedule_task_complete(&idc->idc_task);
Member (inline comment):

I don't think we want to be changing scheduler task state outside of scheduler/task subsystems.

lgirdwood (Member) commented Dec 13, 2018

@ranj063 @xiulipan please see the example code for fixing this that I've appended to #704. We basically add a new scheduler state and allow tasks that are running to be scheduled again in the future.

ranj063 (Collaborator, Author) commented Dec 13, 2018

@lgirdwood @xiulipan I have updated the PR based on your suggestion.

But this is still untested. I will request validation tonight and update you when the stress test finishes.

ranj063 changed the title from "ipc: fix ipc timeout caused due to incorrect task state" to "Fix ipc timeout caused due to incorrect task scheduling" on Dec 13, 2018
lgirdwood (Member) left a review comment:

@xiulipan can you check this tomorrow if it fails the stress test? Thanks.

michalgrodzicki commented Dec 13, 2018

There is a KW (Klocwork) issue:

Critical | SOF_FW/src/arch/xtensa/include/arch/task.h | Pointer may be dereferenced after it was positively checked for NULL

schedule_task_running(task);
run_task = 1;
}

Member (inline comment):

@ranj063 best to set run_task = 0 here to fix the KW issue.

ranj063 (Collaborator, Author, inline comment):

@lgirdwood I moved the check for task->func before actually calling it. That should fix the KW issue.

ranj063 (Collaborator, Author, inline comment):

Actually, I take that back. I've set run_task to 0 instead, so we set the task state correctly too.
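For reference, a sketch of the shape this ends up with (inferred from the diff excerpts quoted later in this thread; not the verbatim patch):

    /* In _irq_task(): only run tasks that are still PENDING, and assign
     * run_task on every path so a NULL task->func can never be called
     * later (the Klocwork complaint above).
     */
    if (task->func && task->state == TASK_STATE_PENDING) {
            schedule_task_running(task);
            run_task = 1;
    } else {
            run_task = 0;
    }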

scheduler: allow RUNNING tasks to be re-scheduled in the future (commit message matches the PR description above)

Signed-off-by: Ranjani Sridharan <[email protected]>
keyonjie (Contributor) left a review comment:

I still see a risk with this version; imagine this case:

  1. IPC1 arrives and is added to the list, then run in _irq_task(); task->func() runs and the DONE bit is set to the driver. Now task->state is RUNNING.
  2. IPC2 arrives before schedule_task_complete() runs. task->state is RUNNING, so task->state will be changed to QUEUED, the new IPC2 will be added to the list and scheduled (sch->lock might not be held by anybody else at that moment), and then another SCHEDULE_IRQ will be triggered.
  3. I believe this latter IRQ will actually be triggered before the former one finishes. The issue is that if the former one runs schedule_task_complete() now, it will change task->state to COMPLETED, and then the latter IPC won't be run (as task->state != TASK_STATE_PENDING).

So I think the patch might reduce the likelihood of the issue, but not eliminate it completely.
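Using the simplified types sketched near the top of this thread, the window described above would look roughly like this (an assumed shape for schedule_task_complete(), not the actual SOF source):

    /* Illustrative only: an unconditional completion clobbers whatever
     * state the second IPC has just set, so _irq_task() later sees
     * state != TASK_STATE_PENDING and skips the task.
     */
    static void schedule_task_complete(struct task *task)
    {
            task->state = TASK_STATE_COMPLETED;
    }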

ranj063 (Collaborator, Author) commented Dec 14, 2018

@keyonjie what you are describing is not possible. It is the same task. We are not really spawning off two different threads for the two IPCs; it is really only one task.

keyonjie (Contributor) commented:

> @keyonjie what you are describing is not possible. It is the same task. We are not really spawning off two different threads for the two IPCs; it is really only one task.

I know we have only one task, and that is exactly why the issue happens.

lgirdwood (Member) commented:

@ranj063 @keyonjie @keqiaozhang @jocelyn-li @mengdonglin does this pass the stress test?

ranj063 (Collaborator, Author) commented Dec 14, 2018

@lgirdwood it passed on three boards at my end. @keqiaozhang is verifying it in PRC, but there are no functional regressions with it, so I think it is good to merge.

keyonjie (Contributor) commented:

@lgirdwood I don't yet have the failing patterns to verify whether this PR fixes it. @xiulipan can you help check?

keyonjie (Contributor) commented:

@ranj063 can you address my concern: why is the scenario I described not possible?

ranj063 (Collaborator, Author) commented Dec 14, 2018

> @ranj063 can you address my concern: why is the scenario I described not possible?

@keyonjie maybe I am not able to imagine the scenario you explained. Let's talk on Monday.

As I see it, once the new IPC comes in, the former IPC is officially non-existent, so it is never going to come back and set the task state to COMPLETED. That part is a bit confusing to me.

keqiaozhang (Collaborator) commented:

@lgirdwood @ranj063
I think this PR is good to merge now.
I tested on one Yorp and one Bobba; both of them passed 2500+ iterations.

xiulipan (Contributor) commented:

@ranj063 @keyonjie @lgirdwood
One hint for the scheduler refinement:
For pipeline_task, pipeline_xrun_recover() will schedule a task from pipeline_trigger(COMP_TRIGGER_START) that will be ignored, since the xrun recovery is called from pipeline_task itself.

Following is the workaround added by @tlauda in f396500 #125 (discussion about the fix is here):

        /* for playback copy it here, because scheduling won't work
         * on this interrupt level
         */
        if (p->sched_comp->params.direction == SOF_IPC_STREAM_PLAYBACK) {
                ret = pipeline_copy(p->sched_comp);
                if (ret < 0) {
                        trace_pipe_error_with_ids(p, "pipeline_xrun_recover() "
                                                  "error: pipeline_copy() "
                                                  "failed, ret = %d", ret);
                        return ret;
                }
        }

So if the scheduler is going to accept a running task being added again, you need to remove the above code or add more checks in pipeline_trigger().

michalgrodzicki commented:

@lgirdwood @mmaka1 what do you think? Are you ok with these changes?

xiulipan (Contributor) commented:

@ranj063
It seems the same risk happens in this PR as in your first one.
Say we have task A coming first and task B coming second. In some cases, task A will change the state to COMPLETED so that task B cannot be run.

    0      2          IPC             73024.635417        43.541668            apl-ipc.c:65     IRQ
    0      2          IPC             73028.020833         3.385417            apl-ipc.c:72     Nms
    0      2         PIPE             73031.406250         3.385417           schedule.c:257    ad!
    0      2         PIPE             73037.291667         5.885417           schedule.c:364    run
    0      2         PIPE             73040.729167         3.437500           schedule.c:183    edf
    0      2         PIPE             73049.218750         8.489583           schedule.c:337    com //task A
    0      2         PIPE             73054.583333         5.364583           schedule.c:337    com //task B
    0      2          IPC             73061.145833         6.562500            apl-ipc.c:172    Msg
    0      2          IPC             73113.645833        52.500000            apl-ipc.c:65     IRQ
    0      2          IPC             73117.135417         3.489583            apl-ipc.c:72     Nms
    0      1          IPC             73120.416667         3.281250            apl-ipc.c:82     Pen
    0      2          IPC             73123.750000         3.333333            apl-ipc.c:92     Rpy
    0      2         HOST            572977.447917    499853.687500            hda-dma.c:231    GwU
    0      2          IPC            573022.343750        44.895832            apl-ipc.c:172    Msg
    0      2          IPC            573401.406250       379.062500            apl-ipc.c:65     IRQ
    0      2          IPC            573404.895833         3.489583            apl-ipc.c:72     Nms
    0      1          IPC            573408.072917         3.177083            apl-ipc.c:82     Pen
    0      2          IPC            573410.937500         2.864583            apl-ipc.c:92     Rpy

@@ -159,13 +160,21 @@ static void _irq_task(void *arg)
task = container_of(clist, struct task, irq_list);
list_item_del(clist);

if (task->func && task->state == TASK_STATE_PENDING) {
Contributor (inline comment):

Maybe we should not check the state here, in case a previous task completion has already changed the state to COMPLETED and makes us drop the IPC.

Contributor (inline comment):

Agree, let's change this line to

if (task->func) {

as a hot fix.

xiulipan (Contributor) commented:

diff --git a/src/arch/xtensa/include/arch/task.h b/src/arch/xtensa/include/arch/task.h
index e981fdc8..e8eb3a2a 100644
--- a/src/arch/xtensa/include/arch/task.h
+++ b/src/arch/xtensa/include/arch/task.h
@@ -150,6 +150,8 @@ static void _irq_task(void *arg)
                if (task->func && task->state == TASK_STATE_PENDING) {
                        schedule_task_running(task);
                        run_task = 1;
+               } else if (task->state == TASK_STATE_COMPLETED) {
+                       trace_error(0, "PXL debug! Wrong state machine!");
                } else {
                        run_task = 0;
                }

With the debug code above, the potential risk is reproduced more clearly:

    0      2          IPC             72082.864583        44.114582            apl-ipc.c:65     IRQ
    0      2          IPC             72086.354167         3.489583            apl-ipc.c:72     Nms
    0      2         PIPE             72089.843750         3.489583           schedule.c:257    ad!
    0      2         PIPE             72095.625000         5.781250           schedule.c:364    run
    0      2         PIPE             72099.322917         3.697917           schedule.c:183    edf
    0      2         PIPE             72107.604167         8.281250           schedule.c:337    com
    0      1      unknown             72112.552083         4.947917 /home/pxl/work/sof/sofp3/src/arch/xtensa/include/arch/task.h:154    PXL debug! Wrong state machine!
    0      2         PIPE             72116.093750         3.541667           schedule.c:337    com
    0      2          IPC             72122.812500         6.718750            apl-ipc.c:172    Msg
    0      2          IPC             72176.666667        53.854168            apl-ipc.c:65     IRQ
    0      2          IPC             72180.156250         3.489583            apl-ipc.c:72     Nms
    0      1          IPC             72183.333333         3.177083            apl-ipc.c:82     Pen
    0      2          IPC             72186.250000         2.916667            apl-ipc.c:92     Rpy
    0      2         HOST            572034.895833    499848.656250            hda-dma.c:231    GwU

ranj063 (Collaborator, Author) commented Dec 17, 2018

@xiulipan I get the problem now. I think what might fix the issue is checking whether task->state is RUNNING before setting it to COMPLETED in schedule_task_complete(). What do you think?
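A hedged sketch of that idea, using the same simplified types as earlier (illustrative only, not the actual SOF implementation):

    /* Only mark the task COMPLETED if it is still RUNNING; if a new IPC
     * has already re-scheduled it (state back to QUEUED or PENDING),
     * leave the state alone so the pending run is not dropped.
     */
    static void schedule_task_complete(struct task *task)
    {
            if (task->state == TASK_STATE_RUNNING)
                    task->state = TASK_STATE_COMPLETED;
    }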

xiulipan (Contributor) commented:

@ranj063
Then what will the state be set to if it is not RUNNING?
The riskiest thing now is that the two IPC tasks are modifying the same state, and they overlap.

lgirdwood merged commit 6d59eac into thesofproject:master on Dec 17, 2018
lgirdwood (Member) commented:

@ranj063 either @keyonjie or @xiulipan will send a subsequent PR with an update for further stress testing.

ranj063 (Collaborator, Author) commented Dec 17, 2018

> Then what will the state be set to if it is not RUNNING?

@xiulipan It will be whatever task B set it to, which in the case you described will be PENDING. And that will allow it to be run again.

> The riskiest thing now is that the two IPC tasks are modifying the same state, and they overlap.
