
[RFC] Initial FW watchdog implementation #841

Closed
kv2019i wants to merge 2 commits

Conversation

kv2019i
Collaborator

@kv2019i kv2019i commented Apr 17, 2019

Here's an initial set of patches to fix #452.

I split the solution into two patches. The first is a simple one that sorts out a minor issue which prevented the existing runtime-PM logic from performing basic recovery. The second is more invasive, as it adds a conditional forced suspend/resume cycle when closing the stream.

I tested the patches on UP2 with a modified FW that starts busylooping in the audio task after 30 seconds. This will eventually lead to a DSP oops. Without these patches, it is not possible to recover playback without rebooting. With the patches, I can successfully recover both with and without runtime PM.

The caveat is that there is no actual watchdog that fires during PCM playback. With no_wakeup_mode, there is no continuous IPC traffic between host and DSP, and the watchdog would have to monitor the state of xruns (which could get complicated, as ALSA has many operational modes for this). Rather, the recovery implemented here is based on checking the FW status when the stream is terminated. If IPC on these paths fails with a timeout, we know the FW is not OK and we can force a power cycle.
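Condensed from the diff hunks quoted in the review below, the close-path check amounts to roughly the following (the suspend/resume helpers are the driver's existing snd_sof_suspend()/snd_sof_resume()):

/* in sof_pcm_hw_free(): send IPC to the DSP */
ret = sof_ipc_tx_message(sdev->ipc, stream.hdr.cmd, &stream,
			 sizeof(stream), &reply, sizeof(reply));

/* force power down if FW is in error state */
if (ret == -ETIMEDOUT) {
	/* the DSP no longer answers: force a suspend/resume power cycle */
	ret = snd_sof_suspend(sdev->dev);
	if (!ret)
		ret = snd_sof_resume(sdev->dev);
}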

In practice this is a feasible solution, as most audio applications will regularly open and close the device (e.g. PulseAudio does this). With applications like aplay, issuing a Ctrl-C is enough to trigger the recovery.

Please comment.

Implement a basic recovery flow for the case when the hardware
hits a fatal error and stops responding to IPC messages.
The SOF core is left in an unrecoverable state in this case, as the
power-down sequence is blocked by the inability to send the state-save
IPC to the firmware.

To force recovery, detect when an IPC fails with a timeout on the
suspend call path, and force the device to suspend despite the error.

This patch only helps on devices with runtime PM enabled.

Signed-off-by: Kai Vehmanen <[email protected]>
To ensure fatal FW errors are detected even if runtime PM is not
in use, add an additional check to the 'hw_free' method. By forcing
a suspend/resume cycle on IPC timeout errors, we ensure the FW is
not left in an error state at stream close.

Signed-off-by: Kai Vehmanen <[email protected]>

@xiulipan xiulipan left a comment


So this will only hook into PCM and PM?
How about topology loading (which may cause an endless loop because of a bad tplg; maybe a retry count is needed)?
And control IPCs like volume control, which we are suffering from on WHL?

And please see my inline comments about debug options. Maybe it would be better to wrap the reset into a function, to give a chance for better option control.

@@ -201,6 +201,14 @@ static int sof_pcm_hw_free(struct snd_pcm_substream *substream)
/* send IPC to the DSP */
ret = sof_ipc_tx_message(sdev->ipc, stream.hdr.cmd, &stream,
sizeof(stream), &reply, sizeof(reply));
/* force power down if FW is in error state */


@kv2019i
I suggest adding a debug option here to give the developer a chance to inspect the scene, as we will lose the error or the panic exception in the memory bank once the DSP is reset.

Collaborator Author


@xiulipan thanks for the review!

This is actually not so bad for debugging, as the client will be stuck waiting on sample availability. So e.g. with aplay, you have plenty of time to dump the DSP memories before you hit Ctrl-C or otherwise kill the application.

There may of course be an issue with apps that have their own timeout mechanism and will close if the audio device seems stuck.

This seems like a very specific debug option -- do you think it's worth its own Kconfig option or is it enough just to have a separate function that can be commented out easily by developers when debugging?

Member

@plbossart plbossart left a comment


Comments below.
Indeed you'd need a true watchdog in the firmware to track that it's functional; this is just a recovery based on IPC.
Maybe you could add some comments on what forcing a suspend results in; I guess the application will somehow get an error and re-open the device?

if (ret == -ETIMEDOUT) {
dev_warn(sdev->dev,
"warn: ctx_save ipc timeout, forced suspend\n");
} else if (ret < 0) {
Member


Question: why do we prevent suspend on any IPC error on CTX_SAVE? @ranj063?

Collaborator


@plbossart in the past we did let the DSP suspend even in the case of errors. But more recently we've fixed the power down sequence in the FW and so this should technically never fail.

Collaborator


But coming back to this patch: if the DSP crashed even before the CTX_SAVE is sent, why do we even send out the IPC?

Collaborator


I think we should come up with a more intelligent way to tell that the DSP has crashed (based on a previous IPC timeout) and not send this IPC in the first place, to allow the DSP to be powered down.

Member


@plbossart in the past we did let the DSP suspend even in the case of errors. But more recently we've fixed the power down sequence in the FW and so this should technically never fail.

Hmm, I am old enough to question these statements. Murphy's Law applies. If there is by design no error, then let's remove it. If not, let's document what the expected errors might be and leave room for the 'known unknowns'.

Collaborator Author

@kv2019i kv2019i Apr 18, 2019


Thanks @ranj063 and @plbossart for the comments!

I'll change this so that we do not prevent suspend on any errors. This would indeed seem like the right thing to do.

On @ranj063's comment:

I think we should come up with a more intelligent way to tell that the DSP has crashed
(based on a previous IPC timeout) and not send this IPC in the first place, to allow the DSP

@xiulipan also raised a question about the mixer interface.

I still think doing the liveness check at stream close is a pretty good compromise that addresses the main concern Pierre had: if something goes haywire in FW, the driver should not just bail out, but rather leave the hardware in a state where the system can recover reliably with limited impact on applications.

I can see what a real PCM watchdog would look like. We could monitor e.g. the HW pointer and see whether it has gotten stuck. But it will require complex logic to ensure it doesn't get fooled by state changes, and we will have a difficult time reporting this sanely to the apps. We could try a power recycle and show it as an XRUN, or just use SNDRV_PCM_STATE_DISCONNECTED. In either case, this will be a rather complicated flow to test and maintain, so I'd like to first merge the simple recovery patchset and then follow up with the active watchdog.
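As a rough, hypothetical sketch of that idea (none of this is in the patchset; sof_force_power_cycle() and the 500 ms poll interval are invented for illustration), a delayed work item could poll the hardware pointer and treat a stalled pointer on a RUNNING stream as a dead DSP:

#include <linux/jiffies.h>
#include <linux/workqueue.h>
#include <sound/pcm.h>

struct sof_pcm_watchdog {
	struct snd_pcm_substream *substream;
	struct delayed_work work;
	snd_pcm_uframes_t last_hw_ptr;
};

static void sof_pcm_watchdog_fn(struct work_struct *work)
{
	struct sof_pcm_watchdog *wd =
		container_of(to_delayed_work(work), struct sof_pcm_watchdog, work);
	struct snd_pcm_runtime *runtime = wd->substream->runtime;

	/* only a RUNNING stream is expected to make progress */
	if (runtime->status->state == SNDRV_PCM_STATE_RUNNING &&
	    runtime->status->hw_ptr == wd->last_hw_ptr) {
		/*
		 * Pointer stalled for a whole poll interval: assume the DSP
		 * is stuck. Report it as an XRUN (alternatively the stream
		 * could be moved to SNDRV_PCM_STATE_DISCONNECTED), then
		 * hand off to a recovery helper.
		 */
		snd_pcm_stop_xrun(wd->substream);
		/* sof_force_power_cycle(wd->substream); placeholder */
		return;
	}

	wd->last_hw_ptr = runtime->status->hw_ptr;
	schedule_delayed_work(&wd->work, msecs_to_jiffies(500));
}

static void sof_pcm_watchdog_start(struct sof_pcm_watchdog *wd,
				   struct snd_pcm_substream *substream)
{
	wd->substream = substream;
	wd->last_hw_ptr = substream->runtime->status->hw_ptr;
	INIT_DELAYED_WORK(&wd->work, sof_pcm_watchdog_fn);
	schedule_delayed_work(&wd->work, msecs_to_jiffies(500));
}

This also shows the state-change problem mentioned above: the poll would have to be cancelled or re-armed across trigger transitions, or a legitimately paused stream could be misread as a stall.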

@@ -201,6 +201,14 @@ static int sof_pcm_hw_free(struct snd_pcm_substream *substream)
/* send IPC to the DSP */
ret = sof_ipc_tx_message(sdev->ipc, stream.hdr.cmd, &stream,
sizeof(stream), &reply, sizeof(reply));
/* force power down if FW is in error state */
if (ret == -ETIMEDOUT) {
Member


Same question: why just a timeout? Have we checked what types of errors we can get here?

Collaborator


@plbossart the -ETIMEDOUT error makes sense, because if there are other errors it surely means the DSP is still alive. It's only when the DSP crashes that we run into an IPC timeout.

Member


If the DSP is still alive but throws errors due to bad programming or configuration, they should not come in the form of a -EXYZ code; they should be a message that says NAK or something. We are blurring layers if we mix low-level communication issues and protocol issues.

Collaborator

@ranj063 ranj063 Apr 18, 2019


@plbossart it actually does send a NAK, in the form of setting the IPC reply's error field. The problem is that we are translating it into a -EINVAL.

Collaborator


No, let me correct myself: -EINVAL is returned only when the reply size doesn't match the expected reply size.
Otherwise, it is the error value propagated from the DSP, which will certainly be something other than -ETIMEDOUT.
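For context, the distinction this discussion relies on looks roughly like the following in a blocking IPC wait path (a simplified sketch, not the SOF driver's actual code; the struct and timeout constant are invented for illustration):

#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/wait.h>

#define IPC_TIMEOUT_MS	300	/* illustrative timeout */

struct ipc_msg_sketch {
	wait_queue_head_t waitq;	/* woken by the reply IRQ handler */
	bool complete;
	int reply_error;		/* error code carried in the DSP's reply ("NAK") */
};

static int ipc_tx_and_wait_sketch(struct ipc_msg_sketch *msg)
{
	long ret = wait_event_timeout(msg->waitq, msg->complete,
				      msecs_to_jiffies(IPC_TIMEOUT_MS));

	if (ret == 0)
		return -ETIMEDOUT;	/* transport level: the DSP never replied */

	return msg->reply_error;	/* protocol level: whatever the DSP reported */
}

So -ETIMEDOUT can only come from the wait itself, while a NAK from a live DSP surfaces as the reply's own error code.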

Collaborator Author


@ranj063 @plbossart My rationale for limiting this to -ETIMEDOUT is that hw_free can be called in various situations, and we could even get invalid call sequences from the ALSA core (or introduced by the user somehow); power recycling in these cases might lead to hard-to-debug scenarios.

-ETIMEDOUT, on the other hand, seems a safe bet. If the FW is alive, it should always respond to the IPC. If it does not, it is definitely not working as specified, and semantically, at 'hw_free', power cycling the FW seems safe.

Maybe this is too conservative, but it is safe and handles a large number of real-life scenarios.

@kv2019i
Collaborator Author

kv2019i commented Apr 18, 2019

@xiulipan wrote:

So this will only hook into PCM and PM?
How about topology loading (which may cause an endless loop because of a bad tplg; maybe a retry count is needed)?
And control IPCs like volume control, which we are suffering from on WHL?

Hmm, that's an interesting question. I think for most of these we should be able to provide an immediate error response. E.g. if topology loading fails, the driver should detect that and fail card creation. I don't think we need a watchdog mechanism for this.

Mixer controls are a bit more complicated. There too, the primary method should be to provide a direct response if the action fails. Odds are high that if the FW suddenly stops processing mixer control IPCs, the PCM paths will be affected soon as well. But hmm, I'll think about it. My gut feeling is that it is still better to focus on the PCM activity and not add too many different recovery paths (these all need to be tested and maintained, so the fewer we have the better).

Member

@plbossart plbossart left a comment


More comments below.

if (ret == -ETIMEDOUT) {
dev_warn(sdev->dev,
"warn: ctx_save ipc timeout, forced suspend\n");
} else if (ret < 0) {
Member


Reading this again, the commit message is not quite aligned with the code.
In this patch you only force suspend when the ctx_save IPC fails; it's not a generic mechanism to force suspend when any IPC fails, as the text hints.

ret = snd_sof_suspend(sdev->dev);
if (!ret)
ret = snd_sof_resume(sdev->dev);
}
Member


Maybe this commit should come first.
I am also wondering whether, instead of doing this for PCM only, we should do this inside the tx_message function, so that all IPCs would be treated the same, not just the PCM stuff.
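A hypothetical sketch of that centralization (not part of this PR; sof_force_power_cycle() is a placeholder for the suspend/resume cycle, and the wrapper assumes the ipc handle carries a pointer back to the device):

static int sof_ipc_tx_message_with_recovery(struct snd_sof_ipc *ipc,
					    u32 header, void *msg_data,
					    size_t msg_bytes, void *reply_data,
					    size_t reply_bytes)
{
	int ret = sof_ipc_tx_message(ipc, header, msg_data, msg_bytes,
				     reply_data, reply_bytes);

	/* only a timeout implies a dead DSP; NAKs arrive as other codes */
	if (ret == -ETIMEDOUT)
		sof_force_power_cycle(ipc->sdev);	/* placeholder helper */

	return ret;
}

Callers would then use the wrapper instead of sof_ipc_tx_message(), so PCM, PM, topology and control IPCs would all share one recovery path.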

@lyakh

lyakh commented May 3, 2019

Indeed you'd need a true watchdog in the firmware to track that it's functional; this is just a recovery based on IPC.

++. Wouldn't it indeed be good to implement a proper hardware watchdog in the FW? Do all our DSPs have that functionality?

Member

@lgirdwood lgirdwood left a comment


This will also need to be linked into the IPC timeout and FW panic logic within the driver. We should also make sure we have the option to block the power cycle in DEBUG mode, i.e. for dumps, logs, attaching GDB, etc.
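One way to provide such a block, sketched here with a hypothetical module parameter (sof_block_recovery and sof_recovery_allowed() are invented names, not existing driver code):

#include <linux/module.h>

static bool sof_block_recovery;
module_param(sof_block_recovery, bool, 0644);
MODULE_PARM_DESC(sof_block_recovery,
		 "Keep a crashed DSP powered for dumps, logs and GDB");

/* call before forcing the suspend/resume cycle */
static bool sof_recovery_allowed(struct snd_sof_dev *sdev)
{
	if (sof_block_recovery) {
		dev_info(sdev->dev, "DSP recovery blocked for debugging\n");
		return false;
	}
	return true;
}

A Kconfig option gating the parameter, as discussed earlier in the thread, would serve the same purpose.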

@plbossart plbossart added the Unclear No agreement on problem statement and resolution label May 10, 2019
@kv2019i
Collaborator Author

kv2019i commented Jun 3, 2019

This PR is "on hold" at the moment. One patch of the series has now been submitted for merging as #1004, but the rest of the PR is still on hold.

@kv2019i
Collaborator Author

kv2019i commented Aug 19, 2019

Closing this PR as I have not had time to work on this. I'll reopen when work is resumed.

@kv2019i kv2019i closed this Aug 19, 2019
plbossart pushed a commit that referenced this pull request Jan 7, 2020
Avoid that running test nvme/012 from the blktests suite triggers the
following false positive lockdep complaint:

============================================
WARNING: possible recursive locking detected
5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted
--------------------------------------------
ksoftirqd/1/16 is trying to acquire lock:
000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

but task is already holding lock:
00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&fq->mq_flush_lock)->rlock);
  lock(&(&fq->mq_flush_lock)->rlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

1 lock held by ksoftirqd/1/16:
 #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

stack backtrace:
CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 dump_stack+0x67/0x90
 __lock_acquire.cold.45+0x2b4/0x313
 lock_acquire+0x98/0x160
 _raw_spin_lock_irqsave+0x3b/0x80
 flush_end_io+0x4e/0x1d0
 blk_mq_complete_request+0x76/0x110
 nvmet_req_complete+0x15/0x110 [nvmet]
 nvmet_bio_done+0x27/0x50 [nvmet]
 blk_update_request+0xd7/0x2d0
 blk_mq_end_request+0x1a/0x100
 blk_flush_complete_seq+0xe5/0x350
 flush_end_io+0x12f/0x1d0
 blk_done_softirq+0x9f/0xd0
 __do_softirq+0xca/0x440
 run_ksoftirqd+0x24/0x50
 smpboot_thread_fn+0x113/0x1e0
 kthread+0x121/0x140
 ret_from_fork+0x3a/0x50

Cc: Christoph Hellwig <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
plbossart pushed a commit that referenced this pull request Sep 28, 2020
Currently, for hashmap, the bpf iterator will grab a bucket lock, a
spinlock, before traversing the elements in the bucket. This can ensure
all bpf visited elements are valid. But this mechanism may cause
deadlock if update/deletion happens to the same bucket of the
visited map in the program. For example, if we added bpf_map_update_elem()
call to the same visited element in selftests bpf_iter_bpf_hash_map.c,
we will have the following deadlock:

  ============================================
  WARNING: possible recursive locking detected
  5.9.0-rc1+ #841 Not tainted
  --------------------------------------------
  test_progs/1750 is trying to acquire lock:
  ffff9a5bb73c5e70 (&htab->buckets[i].raw_lock){....}-{2:2}, at: htab_map_update_elem+0x1cf/0x410

  but task is already holding lock:
  ffff9a5bb73c5e20 (&htab->buckets[i].raw_lock){....}-{2:2}, at: bpf_hash_map_seq_find_next+0x94/0x120

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&htab->buckets[i].raw_lock);
    lock(&htab->buckets[i].raw_lock);

   *** DEADLOCK ***
   ...
  Call Trace:
   dump_stack+0x78/0xa0
   __lock_acquire.cold.74+0x209/0x2e3
   lock_acquire+0xba/0x380
   ? htab_map_update_elem+0x1cf/0x410
   ? __lock_acquire+0x639/0x20c0
   _raw_spin_lock_irqsave+0x3b/0x80
   ? htab_map_update_elem+0x1cf/0x410
   htab_map_update_elem+0x1cf/0x410
   ? lock_acquire+0xba/0x380
   bpf_prog_ad6dab10433b135d_dump_bpf_hash_map+0x88/0xa9c
   ? find_held_lock+0x34/0xa0
   bpf_iter_run_prog+0x81/0x16e
   __bpf_hash_map_seq_show+0x145/0x180
   bpf_seq_read+0xff/0x3d0
   vfs_read+0xad/0x1c0
   ksys_read+0x5f/0xe0
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ...

The bucket_lock first grabbed in seq_ops->next() called by bpf_seq_read(),
and then grabbed again in htab_map_update_elem() in the bpf program, causing
deadlocks.

Actually, we do not need bucket_lock here, we can just use rcu_read_lock()
similar to the netlink iterator, where the rcu_read_{lock,unlock} usage looks like below:
 seq_ops->start():
     rcu_read_lock();
 seq_ops->next():
     rcu_read_unlock();
     /* next element */
     rcu_read_lock();
 seq_ops->stop();
     rcu_read_unlock();

Compared to the old bucket_lock mechanism, if concurrent update/delete happens,
we may visit stale elements, miss some elements, or repeat some elements.
I think this is a reasonable compromise. For users wanting to avoid
stale, missing/repeated accesses, bpf_map batch access syscall interface
can be used.

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Labels
Unclear No agreement on problem statement and resolution
Development

Successfully merging this pull request may close these issues.

Driver support for firmware recovery
6 participants