Improved NULL check, nulled `bus_data`, clear some status variables during init #2

fedefrancescon · 2023-12-18T13:37:21Z

When we kfree(sdio_priv), we should probably also set bus_data to NULL.

Using fully updated `wilc_stability_issues`

This reverts commit 9dd7b23.

fedefrancescon · 2023-12-20T14:51:53Z

TLDR

I'm testing and, as of now, the results seem to be on par with the previous versions (with sleeps). To be honest, I don't know why.
It seems promising, but I'm still unsure about re-adding the sleep. Let's see how the tests go tomorrow.
The problem still seems to be present, at the same rate as with the sleep.
A tester could be:

for i in {1..100}; do
  for act in up down; do
    ip link set wlan-sta $act && continue
    echo "[$i] $act Failed"
    sleep 30
  done
done
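
A variant of the loop above that also counts the failed transitions might make runs easier to compare. This is only a minimal sketch: the function name `cycle_iface` and the `IP_CMD` override are assumptions added for illustration (the override exists only so the loop can be dry-run without touching a real interface), while `wlan-sta` is the interface name used above.

```shell
# Sketch: cycle an interface up/down and count failed transitions.
# IP_CMD is a hypothetical override for dry runs; it defaults to the real `ip`.
IP_CMD="${IP_CMD:-ip}"

# usage: cycle_iface IFACE ROUNDS [PAUSE_SECONDS]
# The body runs in a subshell, so its variables do not leak to the caller.
cycle_iface() (
  iface=$1 rounds=$2 pause=${3:-30}
  failures=0
  i=1
  while [ "$i" -le "$rounds" ]; do
    for act in up down; do
      if ! "$IP_CMD" link set "$iface" "$act"; then
        echo "[$i] $act failed" >&2
        failures=$((failures + 1))
        sleep "$pause"   # same back-off as the loop above
      fi
    done
    i=$((i + 1))
  done
  echo "$failures"       # total failed transitions, on stdout
)
```

For example, `cycle_iface wlan-sta 100` would run the same 100 rounds and print the number of failed `up`/`down` calls at the end, with the per-failure messages going to stderr.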

close / quit

In some places I've added a reset for wilc->close and wilc->quit. As far as I can tell these are used as "status" flags, but in some code/event flows they are not always properly reset at deinit/init.
Adjusting these seems to have greatly reduced the occurrence of the "broken" state.

wilc_bt_power_down

I've added this function call because its twin is called on init and seems to be doing something interesting. I'm not sure whether it's really needed or whether it's in the correct place.

rssi !initialized

I've added an init check because the RSSI function seems to be called at a very early stage, when the module is not fully initialized, and it always fails. I'm not sure it's a real problem, but it seems pointless to query something that I know is not going to work.
At first I returned an error instead of 0, but it seems better to let the init continue. What do you think?

`bus_data = NULL`

Added some nulling of bus_data. Looking at the code, this should never be needed, since it seems intended that bus_data is freed only on module unload. Maybe better safe than sorry?


fedefrancescon · 2024-01-16T12:08:05Z

Summary for PR commit message

The wilc1000 SDIO device has some issues during interface down/up operations.
The problem shows up as a failure to load the device firmware, after which driver initialization fails.
This issue seems to affect only wilc1000, although we could test on only one wilc3000 evaluation board.

The issue seems to be triggered by the down operation, which probably leaves the chip firmware in some kind of broken state.
This PR DOES NOT FIX the problem but, at least, greatly reduces its occurrence.
As an (ugly) workaround, multiple consecutive calls to ip link set [interface] up seem to restore the wilc1000 to a working state.

Example:

ip link set wlan0 down
for i in {1..20}; do
  ip link set wlan0 up && break
  echo "Failed $i"
done
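
The same workaround could be wrapped in a small helper that reports how many attempts were needed and returns a non-zero status when every attempt fails. This is a sketch under assumptions, not part of the PR: the name `iface_up_retry` and the `IP_CMD` override are hypothetical (the override is there only so the logic can be tried without hardware), and the default of 20 attempts simply mirrors the example above.

```shell
# Sketch: retry `ip link set IFACE up` a bounded number of times.
# IP_CMD is a hypothetical override for dry runs; it defaults to the real `ip`.
IP_CMD="${IP_CMD:-ip}"

# usage: iface_up_retry IFACE [MAX_ATTEMPTS]
# Prints the attempt number that succeeded; returns 1 if all attempts fail.
# The body runs in a subshell, so its variables do not leak to the caller.
iface_up_retry() (
  iface=$1 max=${2:-20}
  n=1
  while [ "$n" -le "$max" ]; do
    if "$IP_CMD" link set "$iface" up; then
      echo "$n"
      return 0
    fi
    echo "Failed $n" >&2
    n=$((n + 1))
  done
  return 1
)
```

Typical use would be `ip link set wlan0 down` followed by `iface_up_retry wlan0 20 || echo "still broken"`, so a caller can tell a recovered interface apart from one that never came up.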

Changes

close / quit / initialized

The wilc structure uses some flags to represent the current status of the driver/interface: close, quit, initialized.
Some other similar flags are used in more specific steps.
The usage of these flags within the chip state flow is not very clear, and a missing reset of these flags could have left the device in an "inconsistent" state.

`wilc_bt_power_down`

Its twin function is called on init, and this one seems to do something useful on power-down too.

rssi !initialized

The RSSI function seems to be called at a very early stage, when the module is not fully initialized, and it always fails. This does not seem to be a real problem, but it is pointless to query something that is not going to work.

Nulling `bus_data`

Added some nulling of bus_data. Looking at the code, this should never be needed, since it seems intended that bus_data is freed only on module unload; anyway, it looks safer to explicitly set it to NULL.


fedefrancescon and others added 11 commits December 18, 2023 14:35

nulling bus_data when freeing sdio_priv

7f4f22d

added some error check

f2c69cc

fix build issues

bcf9e8d

Merge pull request #1 from fedefrancescon:wilc_stability_issues

5a6760a

Using fully updated `wilc_stability_issues`

avoid errors on RSSI if driver has not yet been initialized

3a1d70f

added missing reset options (not sure they're really really needed)

c2ff900

added wilc powerdown on deinit

2187e89

removed sleeps

27b6b66

%% original patch: 0001-Disable_noisy_WILC1000.patch

9dd7b23

Revert "%% original patch: 0001-Disable_noisy_WILC1000.patch"

a4acb1c

This reverts commit 9dd7b23.

added reset for wilc->close

96103b2

fedefrancescon changed the title ~~Nulling sdio_priv~~ Improved NULL check, nulled bus_data, clear some status variables during init Dec 20, 2023

fedefrancescon marked this pull request as ready for review December 20, 2023 14:09

Federico Francescon added 5 commits January 3, 2024 12:38

removed duplicated property init (already in wilc_wlan_init)

49e2558

CHECK: wilc_wlan_deinitialize now respects the order of operations as…

634bae8

… in the initialize

CHECK: the wilc->initialized flag should be set within wilc_wlan_init…

b3c730b

…ialize to avoid possible collisions

missing initialized flag reset

059a594

made some error message clearer

8a8e1bf

fedepell merged commit 9f87329 into fedepell:wilc_stability_issues Jan 16, 2024
