-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Initial FW watchdog implementation #841
Conversation
Implement a basic recovery flow for the case when hardware hits a fatal error and stops responding to IPC messages. The SOF core is left in an unrecoverable state in this case, as power down sequence is blocked by inability to send state save IPC to the firmware. To force recovery, detect when IPC fails to a timeout on the suspend call path, and force device to suspend despite of the error. This patch only helps on devices with runtime PM enabled. Signed-off-by: Kai Vehmanen <[email protected]>
To ensure fatal FW errors are detected even if runtime PM is not in use, add an additional check to 'hw_free' method. By forcing a suspend-resume cycle on IPC timeout errors, we ensure FW is not left in error state at stream close. Signed-off-by: Kai Vehmanen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this will only bind with pcm and pm?
How about topology loading (may cause endless loop because of bad tplg, maybe a retry times needed)
Controls IPC like volume control, which we are suffering from WHL.
And please see my inline comments about debug options. Maybe better to wrap the the reset into a function to give chance for better option control.
@@ -201,6 +201,14 @@ static int sof_pcm_hw_free(struct snd_pcm_substream *substream) | |||
/* send IPC to the DSP */ | |||
ret = sof_ipc_tx_message(sdev->ipc, stream.hdr.cmd, &stream, | |||
sizeof(stream), &reply, sizeof(reply)); | |||
/* force power down if FW is in error state */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kv2019i
I suggest to add some debug option here to give developer some chance to recover the scene. As we will lost the error or the panic exception in memory bank as the DSP is reset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xiulipan thanks for the review!
This is actually not so bad for debug as the client will be stuck waiting on sample availability. So e.g. with aplay, you have plenty of time to dump the DSP memories, before you hit ctrl-C or otherwise kill the application.
There may of course be issue with apps that have their own timeout mechanism and will force the app to close if the audio device seems stuck.
This seems like a very specific debug option -- do you think it's worth its own Kconfig option or is it enough just to have a separate function that can be commented out easily by developers when debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments below.
Indeed you'd need a true watchdog in the firmware to track that it's functional, this is just an recovery on IPC.
Maybe you could add some comments on what forcing a suspend results in, I guess somehow the application will get an error and re-open the device?
if (ret == -ETIMEDOUT) { | ||
dev_warn(sdev->dev, | ||
"warn: ctx_save ipc timeout, forced suspend\n"); | ||
} else if (ret < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why do we prevent suspend on any IPC error on CTX_SAVE? @ranj063?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@plbossart in the past we did let the DSP suspend even in the case of errors. But more recently we've fixed the power down sequence in the FW and so this should technically never fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But coming back to this patch, if the DSP crashed even before the CTX_SAVE is sent, why do we even send out the ipc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should come up with a more intelligent way to tell that the DSP has crashed (based on a previous ipc timeout) and not this ipc in the first place to allow for the dsp to be powered down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@plbossart in the past we did let the DSP suspend even in the case of errors. But more recently we've fixed the power down sequence in the FW and so this should technically never fail.
Humm, I am old enough to question these statements. Murphy's Law applies. If there is by-design no error then let's remove it. If not let's document what the expected errors might bet and leave room for the 'known unknowns'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ranj063 and @plbossart for the comments!
I'll change this so that we do not prevent suspend on any errors. This would indeed seem like the right thing to do.
On @ranj063 comment:
I think we should come up with a more intelligent way to tell that the DSP has crashed
(based on a previous ipc timeout) and not this ipc in the first place to allow for the dsp
@xiulipan also raised a question about mixer interface.
I still think doing the liveliness check at stream close is pretty good compromise that addresses the main concern Pierre had -> if something goes haywire in FW, driver should not just bail out, but rather leave the hardware in a state where system can recover reliable with limited impact to applications.
I can see what a real PCM watchdog would look like. We could monitor e.g. the HW pointer and see whether it has gone stuck. But it will require complex logic to ensure it doesn't get fooled by state changes and we will have a difficult time to report this sanely to the apps. We could try a power recycle and show it as a XRUN, or just use SNDRV_PCM_STATE_DISCONNECTED. In either case, this will be a rather complicated flow to test&maintain, so I'd like to first merge the simple recovery patchset and then follow-up with the active watchdog.
@@ -201,6 +201,14 @@ static int sof_pcm_hw_free(struct snd_pcm_substream *substream) | |||
/* send IPC to the DSP */ | |||
ret = sof_ipc_tx_message(sdev->ipc, stream.hdr.cmd, &stream, | |||
sizeof(stream), &reply, sizeof(reply)); | |||
/* force power down if FW is in error state */ | |||
if (ret == -ETIMEDOUT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question, why just a time-out? have we checked what types of errors we can get here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@plbossart the ETIMEOUT error makes sense because if there are other errors it surely means the DSP is still alive. Its only when the DSP crashes that we run into IPC timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the DSP is still alive but throw errors due to a bad programming or configurations, they should not come in the form of a -EXYZ form, they should be a message that's says NAK or something. We are blurring layers if we mix low-level communication issues and protocol issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@plbossart it actually does send a NAK in the form of setting the ipc reply's error field. The problem is that we are translating it into a -EINVAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no let me correct myself, -EINVAL is only when the reply size doesnt match the expected reply size.
Otherwise, it is the error value that is propagated from the DSP. which will certainly be something other than -ETIMEOUT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ranj063 @plbossart My rationale for limiting to ETIMEDOUT is that hw_free can be called in various situations and we could even have invalid call sequences from the ALSA core (or introduced by user somehow) and power recycling on these case might lead to hard to debug scenarios.
ETIMEDOUT OTOH seems a safe bet. The FW is alive and it should always respond to the IPC. If it is not, it is definitely not working as specific and semantically at 'hw_free', power cycling the FW seems safe.
Maybe this is too conservative, but it is safe and handles a large number of real-life scenarios.
@xiulipan wrote:
Hmm, that's an interesting question. I think for most of these we should be able to provide an immediate error response. E.g. if topology loading fails, driver should detect that and fail the card creation. I don't think we need a watchdog mechanism for this. Mixer controls are a bit more complicated. There too, the primary method should be to provide a direct response if the action fails. Odds are high that if FW suddenly stops processing mixer control IPCs, the PCM paths will be affected soon as well. But hmm, I'll think about. Gut feeling that it is still better to focus on the PCM activity and not add too many different recovery paths (these all need to be tested&maintained so fewer we have the better). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more comments below
if (ret == -ETIMEDOUT) { | ||
dev_warn(sdev->dev, | ||
"warn: ctx_save ipc timeout, forced suspend\n"); | ||
} else if (ret < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading this again, the commit message is not quite aligned with the code.
In this patch you only force suspend even when the IPC fails. It's not a generic mechanism to force suspend when any IPC fails as the text hints at.
ret = snd_sof_suspend(sdev->dev); | ||
if (!ret) | ||
ret = snd_sof_resume(sdev->dev); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this commit should come first.
I am also wondering if instead of doing this for PCM only we should do this inside the tx_message function, so that all IPCs would be treated the same, not just the PCM stuff.
++. Wouldn't it indeed be good to implement a proper hardware watchdog in the FW? Do all our DSPs have that functionality? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also need to be linked into IPC timeout and FW panic logic within the driver. We should also make sure we have the option to block the power cycle in DEBUG mode. i.e. for dumps, log and attaching GDB etc
The PR is "on hold" at the moment. One patch of the series was now submitted for merging as #1004 but rest of this PR is still on hold. |
Closing this PR as I have not had time to work on this. I'll reopen when work is resumed. |
Avoid that running test nvme/012 from the blktests suite triggers the following false positive lockdep complaint: ============================================ WARNING: possible recursive locking detected 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted -------------------------------------------- ksoftirqd/1/16 is trying to acquire lock: 000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 but task is already holding lock: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&fq->mq_flush_lock)->rlock); lock(&(&fq->mq_flush_lock)->rlock); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by ksoftirqd/1/16: #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 stack backtrace: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: dump_stack+0x67/0x90 __lock_acquire.cold.45+0x2b4/0x313 lock_acquire+0x98/0x160 _raw_spin_lock_irqsave+0x3b/0x80 flush_end_io+0x4e/0x1d0 blk_mq_complete_request+0x76/0x110 nvmet_req_complete+0x15/0x110 [nvmet] nvmet_bio_done+0x27/0x50 [nvmet] blk_update_request+0xd7/0x2d0 blk_mq_end_request+0x1a/0x100 blk_flush_complete_seq+0xe5/0x350 flush_end_io+0x12f/0x1d0 blk_done_softirq+0x9f/0xd0 __do_softirq+0xca/0x440 run_ksoftirqd+0x24/0x50 smpboot_thread_fn+0x113/0x1e0 kthread+0x121/0x140 ret_from_fork+0x3a/0x50 Cc: Christoph Hellwig <[email protected]> Cc: Ming Lei <[email protected]> Cc: Hannes Reinecke <[email protected]> Signed-off-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
Currently, for hashmap, the bpf iterator will grab a bucket lock, a spinlock, before traversing the elements in the bucket. This can ensure all bpf visted elements are valid. But this mechanism may cause deadlock if update/deletion happens to the same bucket of the visited map in the program. For example, if we added bpf_map_update_elem() call to the same visited element in selftests bpf_iter_bpf_hash_map.c, we will have the following deadlock: ============================================ WARNING: possible recursive locking detected 5.9.0-rc1+ #841 Not tainted -------------------------------------------- test_progs/1750 is trying to acquire lock: ffff9a5bb73c5e70 (&htab->buckets[i].raw_lock){....}-{2:2}, at: htab_map_update_elem+0x1cf/0x410 but task is already holding lock: ffff9a5bb73c5e20 (&htab->buckets[i].raw_lock){....}-{2:2}, at: bpf_hash_map_seq_find_next+0x94/0x120 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&htab->buckets[i].raw_lock); lock(&htab->buckets[i].raw_lock); *** DEADLOCK *** ... Call Trace: dump_stack+0x78/0xa0 __lock_acquire.cold.74+0x209/0x2e3 lock_acquire+0xba/0x380 ? htab_map_update_elem+0x1cf/0x410 ? __lock_acquire+0x639/0x20c0 _raw_spin_lock_irqsave+0x3b/0x80 ? htab_map_update_elem+0x1cf/0x410 htab_map_update_elem+0x1cf/0x410 ? lock_acquire+0xba/0x380 bpf_prog_ad6dab10433b135d_dump_bpf_hash_map+0x88/0xa9c ? find_held_lock+0x34/0xa0 bpf_iter_run_prog+0x81/0x16e __bpf_hash_map_seq_show+0x145/0x180 bpf_seq_read+0xff/0x3d0 vfs_read+0xad/0x1c0 ksys_read+0x5f/0xe0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ... The bucket_lock first grabbed in seq_ops->next() called by bpf_seq_read(), and then grabbed again in htab_map_update_elem() in the bpf program, causing deadlocks. Actually, we do not need bucket_lock here, we can just use rcu_read_lock() similar to netlink iterator where the rcu_read_{lock,unlock} likes below: seq_ops->start(): rcu_read_lock(); seq_ops->next(): rcu_read_unlock(); /* next element */ rcu_read_lock(); seq_ops->stop(); rcu_read_unlock(); Compared to old bucket_lock mechanism, if concurrent updata/delete happens, we may visit stale elements, miss some elements, or repeat some elements. I think this is a reasonable compromise. For users wanting to avoid stale, missing/repeated accesses, bpf_map batch access syscall interface can be used. Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
Here's an intial set of patches to fix #452
I split the solution to two patches, first a simple one to sort out a minor issue that prevented the existing runtime-PM logic to perform basic recovery. Second is more invasive as it adds a conditional forced suspend/resume cycle when closing the stream.
I tested the patches on UP2 with a modified FW that starts busylooping on the audio task after 30secs. This will eventually lead to a DSP oops. Without these patches, it is not possible to recover playback without rebooting. With the patches, I can succesfully recover both with and without runtime PM.
The caveat is that there is no actual watchdog that fires during PCM playback. With no_wakeup_mode, there is no continuous IPC traffic between host and DSP and the watchdog would have to monitor the state of xruns (which could get complicated as ALSA has many operational modes for this). Rather the recovery implemented now is based on checking FW status when stream is terminated. If IPC fails on these paths to a timeout, we know FW is not OK and we can force a power cycle.
This is practically a a feasible solution as most audio applications will regularly open and close the device (e.g. pulseaudio does this). With applications like aplay, issuing a ctrl-C is enough to initiate the watchdog.
Please comment.