[RFC] Initial FW watchdog implementation #841

kv2019i · 2019-04-17T12:24:04Z

Here's an intial set of patches to fix #452

I split the solution to two patches, first a simple one to sort out a minor issue that prevented the existing runtime-PM logic to perform basic recovery. Second is more invasive as it adds a conditional forced suspend/resume cycle when closing the stream.

I tested the patches on UP2 with a modified FW that starts busylooping on the audio task after 30secs. This will eventually lead to a DSP oops. Without these patches, it is not possible to recover playback without rebooting. With the patches, I can succesfully recover both with and without runtime PM.

The caveat is that there is no actual watchdog that fires during PCM playback. With no_wakeup_mode, there is no continuous IPC traffic between host and DSP and the watchdog would have to monitor the state of xruns (which could get complicated as ALSA has many operational modes for this). Rather the recovery implemented now is based on checking FW status when stream is terminated. If IPC fails on these paths to a timeout, we know FW is not OK and we can force a power cycle.

This is practically a a feasible solution as most audio applications will regularly open and close the device (e.g. pulseaudio does this). With applications like aplay, issuing a ctrl-C is enough to initiate the watchdog.

Please comment.

Implement a basic recovery flow for the case when hardware hits a fatal error and stops responding to IPC messages. The SOF core is left in an unrecoverable state in this case, as power down sequence is blocked by inability to send state save IPC to the firmware. To force recovery, detect when IPC fails to a timeout on the suspend call path, and force device to suspend despite of the error. This patch only helps on devices with runtime PM enabled. Signed-off-by: Kai Vehmanen <[email protected]>

To ensure fatal FW errors are detected even if runtime PM is not in use, add an additional check to 'hw_free' method. By forcing a suspend-resume cycle on IPC timeout errors, we ensure FW is not left in error state at stream close. Signed-off-by: Kai Vehmanen <[email protected]>

xiulipan

So this will only bind with pcm and pm?
How about topology loading (may cause endless loop because of bad tplg, maybe a retry times needed)
Controls IPC like volume control, which we are suffering from WHL.

And please see my inline comments about debug options. Maybe better to wrap the the reset into a function to give chance for better option control.

xiulipan · 2019-04-17T15:05:53Z

sound/soc/sof/pcm.c

@@ -201,6 +201,14 @@ static int sof_pcm_hw_free(struct snd_pcm_substream *substream)
 	/* send IPC to the DSP */
 	ret = sof_ipc_tx_message(sdev->ipc, stream.hdr.cmd, &stream,
 				 sizeof(stream), &reply, sizeof(reply));
+	/* force power down if FW is in error state */


@kv2019i
I suggest to add some debug option here to give developer some chance to recover the scene. As we will lost the error or the panic exception in memory bank as the DSP is reset.

@xiulipan thanks for the review!

This is actually not so bad for debug as the client will be stuck waiting on sample availability. So e.g. with aplay, you have plenty of time to dump the DSP memories, before you hit ctrl-C or otherwise kill the application.

There may of course be issue with apps that have their own timeout mechanism and will force the app to close if the audio device seems stuck.

This seems like a very specific debug option -- do you think it's worth its own Kconfig option or is it enough just to have a separate function that can be commented out easily by developers when debugging?

plbossart

Comments below.
Indeed you'd need a true watchdog in the firmware to track that it's functional, this is just an recovery on IPC.
Maybe you could add some comments on what forcing a suspend results in, I guess somehow the application will get an error and re-open the device?

plbossart · 2019-04-17T21:05:05Z

sound/soc/sof/pm.c

+	if (ret == -ETIMEDOUT) {
+		dev_warn(sdev->dev,
+			 "warn: ctx_save ipc timeout, forced suspend\n");
+	} else if (ret < 0) {


question: why do we prevent suspend on any IPC error on CTX_SAVE? @ranj063?

@plbossart in the past we did let the DSP suspend even in the case of errors. But more recently we've fixed the power down sequence in the FW and so this should technically never fail.

But coming back to this patch, if the DSP crashed even before the CTX_SAVE is sent, why do we even send out the ipc?

I think we should come up with a more intelligent way to tell that the DSP has crashed (based on a previous ipc timeout) and not this ipc in the first place to allow for the dsp to be powered down.

@plbossart in the past we did let the DSP suspend even in the case of errors. But more recently we've fixed the power down sequence in the FW and so this should technically never fail.

Humm, I am old enough to question these statements. Murphy's Law applies. If there is by-design no error then let's remove it. If not let's document what the expected errors might bet and leave room for the 'known unknowns'

Thanks @ranj063 and @plbossart for the comments!

I'll change this so that we do not prevent suspend on any errors. This would indeed seem like the right thing to do.

On @ranj063 comment:

I think we should come up with a more intelligent way to tell that the DSP has crashed
(based on a previous ipc timeout) and not this ipc in the first place to allow for the dsp

@xiulipan also raised a question about mixer interface.

I still think doing the liveliness check at stream close is pretty good compromise that addresses the main concern Pierre had -> if something goes haywire in FW, driver should not just bail out, but rather leave the hardware in a state where system can recover reliable with limited impact to applications.

I can see what a real PCM watchdog would look like. We could monitor e.g. the HW pointer and see whether it has gone stuck. But it will require complex logic to ensure it doesn't get fooled by state changes and we will have a difficult time to report this sanely to the apps. We could try a power recycle and show it as a XRUN, or just use SNDRV_PCM_STATE_DISCONNECTED. In either case, this will be a rather complicated flow to test&maintain, so I'd like to first merge the simple recovery patchset and then follow-up with the active watchdog.

plbossart · 2019-04-17T21:07:18Z

sound/soc/sof/pcm.c

@@ -201,6 +201,14 @@ static int sof_pcm_hw_free(struct snd_pcm_substream *substream)
 	/* send IPC to the DSP */
 	ret = sof_ipc_tx_message(sdev->ipc, stream.hdr.cmd, &stream,
 				 sizeof(stream), &reply, sizeof(reply));
+	/* force power down if FW is in error state */
+	if (ret == -ETIMEDOUT) {


same question, why just a time-out? have we checked what types of errors we can get here?

@plbossart the ETIMEOUT error makes sense because if there are other errors it surely means the DSP is still alive. Its only when the DSP crashes that we run into IPC timeout

if the DSP is still alive but throw errors due to a bad programming or configurations, they should not come in the form of a -EXYZ form, they should be a message that's says NAK or something. We are blurring layers if we mix low-level communication issues and protocol issues.

@plbossart it actually does send a NAK in the form of setting the ipc reply's error field. The problem is that we are translating it into a -EINVAL.

no let me correct myself, -EINVAL is only when the reply size doesnt match the expected reply size.
Otherwise, it is the error value that is propagated from the DSP. which will certainly be something other than -ETIMEOUT

@ranj063 @plbossart My rationale for limiting to ETIMEDOUT is that hw_free can be called in various situations and we could even have invalid call sequences from the ALSA core (or introduced by user somehow) and power recycling on these case might lead to hard to debug scenarios.

ETIMEDOUT OTOH seems a safe bet. The FW is alive and it should always respond to the IPC. If it is not, it is definitely not working as specific and semantically at 'hw_free', power cycling the FW seems safe.

Maybe this is too conservative, but it is safe and handles a large number of real-life scenarios.

kv2019i · 2019-04-18T18:34:53Z

@xiulipan wrote:

So this will only bind with pcm and pm?
How about topology loading (may cause endless loop because of bad tplg, maybe a retry times needed)
Controls IPC like volume control, which we are suffering from WHL.

Hmm, that's an interesting question. I think for most of these we should be able to provide an immediate error response. E.g. if topology loading fails, driver should detect that and fail the card creation. I don't think we need a watchdog mechanism for this.

Mixer controls are a bit more complicated. There too, the primary method should be to provide a direct response if the action fails. Odds are high that if FW suddenly stops processing mixer control IPCs, the PCM paths will be affected soon as well. But hmm, I'll think about. Gut feeling that it is still better to focus on the PCM activity and not add too many different recovery paths (these all need to be tested&maintained so fewer we have the better).

plbossart

more comments below

plbossart · 2019-04-19T15:55:19Z

sound/soc/sof/pm.c

+	if (ret == -ETIMEDOUT) {
+		dev_warn(sdev->dev,
+			 "warn: ctx_save ipc timeout, forced suspend\n");
+	} else if (ret < 0) {


Reading this again, the commit message is not quite aligned with the code.
In this patch you only force suspend even when the IPC fails. It's not a generic mechanism to force suspend when any IPC fails as the text hints at.

plbossart · 2019-04-19T15:57:37Z

sound/soc/sof/pcm.c

+		ret = snd_sof_suspend(sdev->dev);
+		if (!ret)
+			ret = snd_sof_resume(sdev->dev);
+	}


maybe this commit should come first.
I am also wondering if instead of doing this for PCM only we should do this inside the tx_message function, so that all IPCs would be treated the same, not just the PCM stuff.

lyakh · 2019-05-03T12:21:22Z

Indeed you'd need a true watchdog in the firmware to track that it's functional, this is just an recovery on IPC.

++. Wouldn't it indeed be good to implement a proper hardware watchdog in the FW? Do all our DSPs have that functionality?

lgirdwood

This will also need to be linked into IPC timeout and FW panic logic within the driver. We should also make sure we have the option to block the power cycle in DEBUG mode. i.e. for dumps, log and attaching GDB etc

kv2019i · 2019-06-03T13:30:41Z

The PR is "on hold" at the moment. One patch of the series was now submitted for merging as #1004 but rest of this PR is still on hold.

kv2019i · 2019-08-19T08:50:37Z

Closing this PR as I have not had time to work on this. I'll reopen when work is resumed.

Avoid that running test nvme/012 from the blktests suite triggers the following false positive lockdep complaint: ============================================ WARNING: possible recursive locking detected 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted -------------------------------------------- ksoftirqd/1/16 is trying to acquire lock: 000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 but task is already holding lock: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&fq->mq_flush_lock)->rlock); lock(&(&fq->mq_flush_lock)->rlock); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by ksoftirqd/1/16: #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 stack backtrace: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: dump_stack+0x67/0x90 __lock_acquire.cold.45+0x2b4/0x313 lock_acquire+0x98/0x160 _raw_spin_lock_irqsave+0x3b/0x80 flush_end_io+0x4e/0x1d0 blk_mq_complete_request+0x76/0x110 nvmet_req_complete+0x15/0x110 [nvmet] nvmet_bio_done+0x27/0x50 [nvmet] blk_update_request+0xd7/0x2d0 blk_mq_end_request+0x1a/0x100 blk_flush_complete_seq+0xe5/0x350 flush_end_io+0x12f/0x1d0 blk_done_softirq+0x9f/0xd0 __do_softirq+0xca/0x440 run_ksoftirqd+0x24/0x50 smpboot_thread_fn+0x113/0x1e0 kthread+0x121/0x140 ret_from_fork+0x3a/0x50 Cc: Christoph Hellwig <[email protected]> Cc: Ming Lei <[email protected]> Cc: Hannes Reinecke <[email protected]> Signed-off-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>

Currently, for hashmap, the bpf iterator will grab a bucket lock, a spinlock, before traversing the elements in the bucket. This can ensure all bpf visted elements are valid. But this mechanism may cause deadlock if update/deletion happens to the same bucket of the visited map in the program. For example, if we added bpf_map_update_elem() call to the same visited element in selftests bpf_iter_bpf_hash_map.c, we will have the following deadlock: ============================================ WARNING: possible recursive locking detected 5.9.0-rc1+ #841 Not tainted -------------------------------------------- test_progs/1750 is trying to acquire lock: ffff9a5bb73c5e70 (&htab->buckets[i].raw_lock){....}-{2:2}, at: htab_map_update_elem+0x1cf/0x410 but task is already holding lock: ffff9a5bb73c5e20 (&htab->buckets[i].raw_lock){....}-{2:2}, at: bpf_hash_map_seq_find_next+0x94/0x120 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&htab->buckets[i].raw_lock); lock(&htab->buckets[i].raw_lock); *** DEADLOCK *** ... Call Trace: dump_stack+0x78/0xa0 __lock_acquire.cold.74+0x209/0x2e3 lock_acquire+0xba/0x380 ? htab_map_update_elem+0x1cf/0x410 ? __lock_acquire+0x639/0x20c0 _raw_spin_lock_irqsave+0x3b/0x80 ? htab_map_update_elem+0x1cf/0x410 htab_map_update_elem+0x1cf/0x410 ? lock_acquire+0xba/0x380 bpf_prog_ad6dab10433b135d_dump_bpf_hash_map+0x88/0xa9c ? find_held_lock+0x34/0xa0 bpf_iter_run_prog+0x81/0x16e __bpf_hash_map_seq_show+0x145/0x180 bpf_seq_read+0xff/0x3d0 vfs_read+0xad/0x1c0 ksys_read+0x5f/0xe0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ... The bucket_lock first grabbed in seq_ops->next() called by bpf_seq_read(), and then grabbed again in htab_map_update_elem() in the bpf program, causing deadlocks. Actually, we do not need bucket_lock here, we can just use rcu_read_lock() similar to netlink iterator where the rcu_read_{lock,unlock} likes below: seq_ops->start(): rcu_read_lock(); seq_ops->next(): rcu_read_unlock(); /* next element */ rcu_read_lock(); seq_ops->stop(); rcu_read_unlock(); Compared to old bucket_lock mechanism, if concurrent updata/delete happens, we may visit stale elements, miss some elements, or repeat some elements. I think this is a reasonable compromise. For users wanting to avoid stale, missing/repeated accesses, bpf_map batch access syscall interface can be used. Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]

kv2019i added 2 commits April 17, 2019 15:09

kv2019i requested review from juimonen, plbossart, lyakh, ranj063, keyonjie and xiulipan April 17, 2019 12:24

xiulipan suggested changes Apr 17, 2019

View reviewed changes

plbossart reviewed Apr 17, 2019

View reviewed changes

plbossart reviewed Apr 19, 2019

View reviewed changes

lgirdwood reviewed May 3, 2019

View reviewed changes

plbossart added the Unclear No agreement on problem statement and resolution label May 10, 2019

kv2019i closed this Aug 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Initial FW watchdog implementation #841

[RFC] Initial FW watchdog implementation #841

kv2019i commented Apr 17, 2019

xiulipan left a comment

xiulipan Apr 17, 2019

kv2019i Apr 18, 2019

plbossart left a comment

plbossart Apr 17, 2019

ranj063 Apr 17, 2019

ranj063 Apr 17, 2019

ranj063 Apr 17, 2019

plbossart Apr 18, 2019

kv2019i Apr 18, 2019 •

edited

Loading

plbossart Apr 17, 2019

ranj063 Apr 17, 2019

plbossart Apr 18, 2019

ranj063 Apr 18, 2019 •

edited

Loading

ranj063 Apr 18, 2019

kv2019i Apr 18, 2019

kv2019i commented Apr 18, 2019

plbossart left a comment

plbossart Apr 19, 2019

plbossart Apr 19, 2019

lyakh commented May 3, 2019

lgirdwood left a comment •

edited

Loading

kv2019i commented Jun 3, 2019

kv2019i commented Aug 19, 2019

[RFC] Initial FW watchdog implementation #841

[RFC] Initial FW watchdog implementation #841

Conversation

kv2019i commented Apr 17, 2019

xiulipan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

plbossart left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kv2019i Apr 18, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ranj063 Apr 18, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kv2019i commented Apr 18, 2019

plbossart left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lyakh commented May 3, 2019

lgirdwood left a comment • edited Loading

Choose a reason for hiding this comment

kv2019i commented Jun 3, 2019

kv2019i commented Aug 19, 2019

kv2019i Apr 18, 2019 •

edited

Loading

ranj063 Apr 18, 2019 •

edited

Loading

lgirdwood left a comment •

edited

Loading