Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: [DGX-8847] Backport MPAM patch to 24.04-linux-nvidia-adv-6.8-next #26

Conversation

KobaKoNvidia
Copy link
Collaborator

@KobaKoNvidia KobaKoNvidia commented Oct 11, 2024

BugLink: https://jirasw.nvidia.com/browse/DGX-8847

[Description]
Backport MPAM patches from MPAM-dev[0]
Also tried to backport MPAM patches based on PR25 to branch [1]

[0], https://github.com/NVIDIA-BaseOS-6/linux-nvidia-6.5/tree/MPAM-dev
[1], https://github.com/KobaKoNvidia/NV-Kernels/tree/kobak_dgx8849_pr-nv-kernels-25_backportMPAM

[Test Procedure]

  1. mount mpam resctrl and create partition P2
  2. Run stress-ng and assign PIDs to p2/task
$ killall stress-ng ; stress-ng --stream 144 --stream-l3-size 114m --stream-madvise hugepage & pidof stress-ng | xargs -rn1 echo | sudo tee /sys/fs/resctrl/p2/task
  1. Use perf to monitor MB
$ sudo ./perf stat -I 1000 -e resctrl_pmu/mbm_local_bytes,resctrl_id=`cat /sys/fs/resctrl/p2/id`,domain=1/
  1. Check Cache Portion
$ cat /sys/fs/resctrl/p2/mon_data/mon_L3_0*/llc_occupancy
  1. Check [1] for the details
    [2]. Test Results after backporting patches to linux-nvidia-adv-6.8-next
    [3] Test Results with Lego-CG1-QS-38, https://confluence.nvidia.com/display/DPS/Test+Results+after+backporting+patches+to+linux-nvidia-adv-6.8-next

Jira: DGX-8847
ChangeLog: feat

@clsotog
Copy link
Collaborator

clsotog commented Oct 15, 2024

I try to review the patches tomorrow but I have one question from the results page (https://confluence.nvidia.com/display/DPS/Test+Results+after+backporting+patches+to+linux-nvidia-adv-6.8-next)
its using the kernel 6.8.0-1002-nvidia-adv but the perf tool is like from directory nv_linux-6.5/tools/perf. Does that perf tool is from kernel 6.5? Should we use the perf tool from the kernel tree 6.8.0-1002-nvidia-adv to show results.

@tdavenvidia
Copy link
Collaborator

You should use perf tool from the respective kernel.

Copy link
Collaborator

@jamieNguyenNVIDIA jamieNguyenNVIDIA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the comment on "6554ed347e3f arm_mpam: Add workaround for T241-MPAM-6" still accurate?

"[tdave: reworked the patch to make it apply on 6.5]"

debian.nvidia/changelog Outdated Show resolved Hide resolved
@KobaKoNvidia
Copy link
Collaborator Author

KobaKoNvidia commented Oct 16, 2024

from directory nv_linux-6.5/tools/perf. Does that perf tool is from kernel 6.5? Should we

@clsotog sorry for confusing you. nv_linux-6.5 is the name of directory.
I checkouted to the target branch 6.8.0-1002-nvidia-adv.
I think I can show the branch name in the that line.
Will provide the update

Thanks

@KobaKoNvidia KobaKoNvidia force-pushed the kobak_dgx8849_24.04_linux-nvidia-adv-6.8-next_backportMPAM branch from 95bd55b to a4ff7aa Compare October 16, 2024 03:30
@clsotog
Copy link
Collaborator

clsotog commented Oct 16, 2024

@KobaKoNvidia yeah probably showing the branch will help at the results to avoid confusion.
Quick question when I was looking at first commit "x86/resctrl: Track the closid with the rmid" and then compare with (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=40fc735b78f0c81cea7d1c511cfd83892cb4d679) then there are some differences like the link I pointed has these variables defined RESCTRL_RESERVED_CLOSID and RESCTRL_RESERVED_RMID. Which kernel tree I should compare your patches?

drivers/perf/resctrl_pmu.c Outdated Show resolved Hide resolved
@clsotog
Copy link
Collaborator

clsotog commented Nov 8, 2024

I tried the mpam tree at Bianca and I see stack trace that I do not see with 6.8 kernel. I also tried like lego-cg1 and I do not see the stack trace so Im thinking something different at Bianca. Sharing the dmesg I try to look more next week.
dmesg.mpam.tree2.txt

@KobaKoNvidia KobaKoNvidia force-pushed the kobak_dgx8849_24.04_linux-nvidia-adv-6.8-next_backportMPAM branch from e432182 to 78cc4dc Compare November 11, 2024 07:05
@KobaKoNvidia
Copy link
Collaborator Author

I tried the mpam tree at Bianca and I see stack trace that I do not see with 6.8 kernel. I also tried like lego-cg1 and I do not see the stack trace so Im thinking something different at Bianca. Sharing the dmesg I try to look more next week. dmesg.mpam.tree2.txt

@clsotog could you share the Bianca with me, i would like to check further.
Thanks

@clsotog
Copy link
Collaborator

clsotog commented Nov 12, 2024

I tried the mpam tree at Bianca and I see stack trace that I do not see with 6.8 kernel. I also tried like lego-cg1 and I do not see the stack trace so Im thinking something different at Bianca. Sharing the dmesg I try to look more next week. dmesg.mpam.tree2.txt

@clsotog could you share the Bianca with me, i would like to check further. Thanks

I only have the bianca during my day time. At your day time someone else use it.

@KobaKoNvidia
Copy link
Collaborator Author

d

I tried the mpam tree at Bianca and I see stack trace that I do not see with 6.8 kernel. I also tried like lego-cg1 and I do not see the stack trace so Im thinking something different at Bianca. Sharing the dmesg I try to look more next week. dmesg.mpam.tree2.txt

@clsotog could you share the Bianca with me, i would like to check further. Thanks

I only have the bianca during my day time. At your day time someone else use it.

@clsotog
I built debs and install.

$ export $(sudo dpkg-architecture -aarm64); export CROSS_COMPILE=aarch64-linux-gnu-;time env CONCURRENCY_LEVEL=$(nproc) gcc=gcc-12\
                sh -c "rm -f ../*.deb;fakeroot debian/rules clean && fakeroot debian/rules binary-headers binary-nvidia-adv \
                binary-perarch do_linux_tools=false do_zfs=false do_mstflint_access=false do_nvidia-fs=false do_skip_checks=true; \
                echo $? > build_lock" 2>&1 | tee build.log
$ sudo dpkg -i *.deb

Didn't find "Call Trace"
Here're the logs, https://drive.google.com/drive/folders/1ZLOsznyv4gfpoBdrXSDIZvxuVde1UF3C?usp=sharing

@clsotog
Copy link
Collaborator

clsotog commented Nov 13, 2024

d

I tried the mpam tree at Bianca and I see stack trace that I do not see with 6.8 kernel. I also tried like lego-cg1 and I do not see the stack trace so Im thinking something different at Bianca. Sharing the dmesg I try to look more next week. dmesg.mpam.tree2.txt

@clsotog could you share the Bianca with me, i would like to check further. Thanks

I only have the bianca during my day time. At your day time someone else use it.

@clsotog I built debs and install.

$ export $(sudo dpkg-architecture -aarm64); export CROSS_COMPILE=aarch64-linux-gnu-;time env CONCURRENCY_LEVEL=$(nproc) gcc=gcc-12\
                sh -c "rm -f ../*.deb;fakeroot debian/rules clean && fakeroot debian/rules binary-headers binary-nvidia-adv \
                binary-perarch do_linux_tools=false do_zfs=false do_mstflint_access=false do_nvidia-fs=false do_skip_checks=true; \
                echo $? > build_lock" 2>&1 | tee build.log
$ sudo dpkg -i *.deb

Didn't find "Call Trace" Here're the logs, https://drive.google.com/drive/folders/1ZLOsznyv4gfpoBdrXSDIZvxuVde1UF3C?usp=sharing

Actually I see the stack trace with 1014 kernel so its not mpam related. I have to debug it a little more on my side. If using noble I see it but if using jammy I do not see it at the same system.

@clsotog
Copy link
Collaborator

clsotog commented Nov 13, 2024

d

I tried the mpam tree at Bianca and I see stack trace that I do not see with 6.8 kernel. I also tried like lego-cg1 and I do not see the stack trace so Im thinking something different at Bianca. Sharing the dmesg I try to look more next week. dmesg.mpam.tree2.txt

@clsotog could you share the Bianca with me, i would like to check further. Thanks

I only have the bianca during my day time. At your day time someone else use it.

@clsotog I built debs and install.

$ export $(sudo dpkg-architecture -aarm64); export CROSS_COMPILE=aarch64-linux-gnu-;time env CONCURRENCY_LEVEL=$(nproc) gcc=gcc-12\
                sh -c "rm -f ../*.deb;fakeroot debian/rules clean && fakeroot debian/rules binary-headers binary-nvidia-adv \
                binary-perarch do_linux_tools=false do_zfs=false do_mstflint_access=false do_nvidia-fs=false do_skip_checks=true; \
                echo $? > build_lock" 2>&1 | tee build.log
$ sudo dpkg -i *.deb

Didn't find "Call Trace" Here're the logs, https://drive.google.com/drive/folders/1ZLOsznyv4gfpoBdrXSDIZvxuVde1UF3C?usp=sharing

Actually I see the stack trace with 1014 kernel so its not mpam related. I have to debug it a little more on my side. If using noble I see it but if using jammy I do not see it at the same system.

Found the issue at my side forgot that I was using that system to debug customer issue and changed the kernel parameters so we should be good.

nvidia-bfigg pushed a commit that referenced this pull request Nov 15, 2024
Syzkaller reported this warning:
 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 16 at net/ipv4/af_inet.c:156 inet_sock_destruct+0x1c5/0x1e0
 Modules linked in:
 CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc5 #26
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
 RIP: 0010:inet_sock_destruct+0x1c5/0x1e0
 Code: 24 12 4c 89 e2 5b 48 c7 c7 98 ec bb 82 41 5c e9 d1 18 17 ff 4c 89 e6 5b 48 c7 c7 d0 ec bb 82 41 5c e9 bf 18 17 ff 0f 0b eb 83 <0f> 0b eb 97 0f 0b eb 87 0f 0b e9 68 ff ff ff 66 66 2e 0f 1f 84 00
 RSP: 0018:ffffc9000008bd90 EFLAGS: 00010206
 RAX: 0000000000000300 RBX: ffff88810b172a90 RCX: 0000000000000007
 RDX: 0000000000000002 RSI: 0000000000000300 RDI: ffff88810b172a00
 RBP: ffff88810b172a00 R08: ffff888104273c00 R09: 0000000000100007
 R10: 0000000000020000 R11: 0000000000000006 R12: ffff88810b172a00
 R13: 0000000000000004 R14: 0000000000000000 R15: ffff888237c31f78
 FS:  0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ffc63fecac8 CR3: 000000000342e000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? __warn+0x88/0x130
  ? inet_sock_destruct+0x1c5/0x1e0
  ? report_bug+0x18e/0x1a0
  ? handle_bug+0x53/0x90
  ? exc_invalid_op+0x18/0x70
  ? asm_exc_invalid_op+0x1a/0x20
  ? inet_sock_destruct+0x1c5/0x1e0
  __sk_destruct+0x2a/0x200
  rcu_do_batch+0x1aa/0x530
  ? rcu_do_batch+0x13b/0x530
  rcu_core+0x159/0x2f0
  handle_softirqs+0xd3/0x2b0
  ? __pfx_smpboot_thread_fn+0x10/0x10
  run_ksoftirqd+0x25/0x30
  smpboot_thread_fn+0xdd/0x1d0
  kthread+0xd3/0x100
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x34/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 ---[ end trace 0000000000000000 ]---

Its possible that two threads call tcp_v6_do_rcv()/sk_forward_alloc_add()
concurrently when sk->sk_state == TCP_LISTEN with sk->sk_lock unlocked,
which triggers a data-race around sk->sk_forward_alloc:
tcp_v6_rcv
    tcp_v6_do_rcv
        skb_clone_and_charge_r
            sk_rmem_schedule
                __sk_mem_schedule
                    sk_forward_alloc_add()
            skb_set_owner_r
                sk_mem_charge
                    sk_forward_alloc_add()
        __kfree_skb
            skb_release_all
                skb_release_head_state
                    sock_rfree
                        sk_mem_uncharge
                            sk_forward_alloc_add()
                            sk_mem_reclaim
                                // set local var reclaimable
                                __sk_mem_reclaim
                                    sk_forward_alloc_add()

In this syzkaller testcase, two threads call
tcp_v6_do_rcv() with skb->truesize=768, the sk_forward_alloc changes like
this:
 (cpu 1)             | (cpu 2)             | sk_forward_alloc
 ...                 | ...                 | 0
 __sk_mem_schedule() |                     | +4096 = 4096
                     | __sk_mem_schedule() | +4096 = 8192
 sk_mem_charge()     |                     | -768  = 7424
                     | sk_mem_charge()     | -768  = 6656
 ...                 |    ...              |
 sk_mem_uncharge()   |                     | +768  = 7424
 reclaimable=7424    |                     |
                     | sk_mem_uncharge()   | +768  = 8192
                     | reclaimable=8192    |
 __sk_mem_reclaim()  |                     | -4096 = 4096
                     | __sk_mem_reclaim()  | -8192 = -4096 != 0

The skb_clone_and_charge_r() should not be called in tcp_v6_do_rcv() when
sk->sk_state is TCP_LISTEN, it happens later in tcp_v6_syn_recv_sock().
Fix the same issue in dccp_v6_do_rcv().

Suggested-by: Eric Dumazet <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Fixes: e994b2f ("tcp: do not lock listener to process SYN packets")
Signed-off-by: Wang Liang <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
nvidia-bfigg and others added 7 commits November 19, 2024 17:00
Signed-off-by: Brad Figg <[email protected]>
(cherry picked from commit 207132556b8f2b1b4353b4d4e85eb56f9fca6a68 https://github.com/NVIDIA-BaseOS-6/linux-nvidia-6.5/tree/MPAM-dev)
Signed-off-by: KobaK <[email protected]>
The resctrl has helper functions for updating CPOR and MBW configuration
updates. While the existing resctrl_arch_update_one() effectively updates
the specified configuration as intended, it inadvertently overrides other
configurations with default values.

Example bandwdith value is not applied:
 root# cat /sys/fs/resctrl/schemata
 MB:1=100
 L3:1=fff

 root# echo -e "L3:1=fff\nMB:1=50" > /sys/fs/resctrl/schemata
 root# cat /sys/fs/resctrl/schemata
 MB:1=100
 L3:1=fff

Fix the potential loss of accuracy during the conversion of MBW_MAX from
percentage to fixed-point representation, and vice versa. The updated
functions provide fixed-point values that closely align with the values
specified in the MPAM specification, Table 9-3 for Fraction Widths And
Hex Representation.

Before this fix:
 root# echo -e "MB:1=1" > /sys/fs/resctrl/schemata
 root# cat /sys/fs/resctrl/schemata
 MB:1=000
 L3:1=fff

 root# echo -e "MB:1=2" > /sys/fs/resctrl/schemata
 root# cat /sys/fs/resctrl/schemata
 MB:1=001
 L3:1=fff

 root# echo -e "MB:1=3" > /sys/fs/resctrl/schemata
 root@pset# cat /sys/fs/resctrl/schemata
 MB:1=001
 L3:1=fff

With this patch:
 root# echo -e "MB:1=1" > /sys/fs/resctrl/schemata
 root# cat /sys/fs/resctrl/schemata
 MB:1=001
 L3:1=fff

 root# echo -e "MB:1=2" > /sys/fs/resctrl/schemata
 root# cat /sys/fs/resctrl/schemata
 MB:1=002
 L3:1=fff

 root# echo -e "MB:1=3" > /sys/fs/resctrl/schemata
 root@pset# cat /sys/fs/resctrl/schemata
 MB:1=003
 L3:1=fff

Signed-off-by: Shanker Donthineni <[email protected]>
[backported from https://patchwork.kernel.org/project/linux-arm-kernel/patch/[email protected]/]
Signed-off-by: koba <[email protected]>
~~~
Don't find in mainline kernel and JMorse branches
The MPAM specification includes the MPAMF_IIDR, which serves to
uniquely identify the MSC implementation through a combination of
implementer details, product ID, variant, and revision. Certain
hardware issues/errata can be resolved using software workarounds.

Introduce a quirk framework to allow workarounds to be enabled based
on the MPAMF_IIDR value.

Signed-off-by: Shanker Donthineni <[email protected]>
[ morse: Stash the IIDR so this doesn't need an IPI, enable quirks only
  once, move the description to the callback so it can be pr_once()d, add
  an enum of workarounds for popular errata. Add macros for making lists
  of product/revision/vendor half readable ]
Signed-off-by: James Morse <[email protected]>
(backported from commit d0a88d4feda72bada81791170ee63d0df4774dc7 https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.10-rc1)
Signed-off-by: KobaK <[email protected]>
The MPAM bandwidth partitioning controls will not be correctly configured,
and hardware will retain default configuration register values, meaning
generally that bandwidth will remain unprovisioned.

To address the issue, follow the below steps after updating the MBW_MIN
and/or MBW_MAX registers.

 - Perform 64b reads from all 12 bridge MPAM shadow registers at offsets
   (0x360048 + slice*0x10000 + partid*8). These registers are read-only.
 - Continue iterating until all 12 shadow register values match in a loop.
   Notify the user if the values fail to match within the loop count 1000.
 - Perform 64b writes with the value 0x0 to the two spare registers at
   offsets 0x1b0000 and 0x1c0000.

In the hardware, writes to the MPAMCFG_MBW_MAX MPAMCFG_MBW_MIN registers
are transformed into broadcast writes to the 12 shadow registers. The
final two writes to the spare registers cause a final rank of downstream
micro-architectural MPAM registers to be updated from the shadow copies.
The intervening loop to read the 12 shadow registers helps avoid a race
condition where writes to the spare registers occur before all shadow
registers have been updated.

Signed-off-by: Shanker Donthineni <[email protected]>
 [ morse: Merged the min/max update into a single
   mpam_quirk_post_config_change() helper. Stashed the t241_id in the msc
   instead of carrying the physical address around. Test the msc quirk bit
   instead of a static key. ]
 Signed-off-by: James Morse <[email protected]>
(backported from commit 90df4cd5e7792b26a619b05ff63264f34065e161 https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.10-rc1)
Signed-off-by: KobaK <[email protected]>
In the T241 implementation of memory-bandwidth partitioning, in the
absence of contention for bandwidth, the minimum bandwidth setting
can affect the amount of achieved bandwidth. Specifically, the
achieved bandwidth in the absence of contention can settle to any
value between the values of MPAMCFG_MBW_MIN and MPAMCFG_MBW_MAX.
Also, if MPAMCFG_MBW_MIN is set zero (below 0.78125%), once a core
enters a throttled state, it will never leave that state.

The first issue is not a cocern if the MPAM software allows to
program MPAMCFG_MBW_MIN through the sysfs interface. This patch
ensures program MBW_MIN=1 (0.78125%) whenever MPAMCFG_MBW_MIN=0
is programmed.

In the scenario where the resctrl doesn't support the MBW_MIN
interface via sysfs, to achieve bandwidth closer to MW_MAX in the
absence of contention, software should configure a relatively narrow
gap between MBW_MIN and MBW_MAX. The recommendation is to use a 5%
gap to mitigate the problem.

Signed-off-by: Shanker Donthineni <[email protected]>
[ morse: Added as s econd quirk, addapted to use the new intermediate values
  in mpam_extend_config() ]
Signed-off-by: James Morse <[email protected]>
(backported from commit d9ba67a1a8dc6551a0b3254a8f2ee9993ad17957
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.10-rc1)
Signed-off-by: KobaK <[email protected]>
~~~
Don't need [0] because backport remove the related function.
[0] arm_mpam: Fix T241-MPAM-4 workaround, https://patchwork.kernel.org/project/linux-arm-kernel/patch/[email protected]/
The registers MSMON_MBWU_L and MSMON_MBWU return the number of
requests rather than the number of bytes transferred.

Bandwidth resource monitoring is performed at the last level cache,
where each request arrive in 64Byte granularity. The current
implementation returns the number of transactions received at the
last level cache but does not provide the value in bytes. Scaling
by 64 gives an accurate byte count to match the MPAM specification
for the MSMON_MBWU and MSMON_MBWU_L registers. This patch fixes
the issue by reporting the actual number of bytes instead of the
number of transactions from __ris_msmon_read().

Signed-off-by: Shanker Donthineni <[email protected]>
(backported from https://patchwork.kernel.org/project/linux-arm-kernel/patch/[email protected]/)
Signed-off-by: KobaK <[email protected]>
The resctrl_pmu driver calls resctrl_arch_rmid_read() from interrupt
context. This eventually ends up in mpam_msmon_read() which cross-calls to
access remote cache-slices, and sleeps if the monitor returned not-ready.

To support resctrl_pmu, remove these behaviours if irqs are masked.

Instead of cross-calling, return an error. This means the resctrl_pmu can't
be supported on platforms with cache-slices that aren't accessible from
every CPU. Instead of sleeping when the counter is busy, return -EBUSY. The
resctrl_pmu driver will not update the counter in this case.

Signed-off-by: James Morse <[email protected]>
Signed-off-by: Tushar Dave <[email protected]>
(cherry picked from commit 6f1849529673a493c9ecd4459bf4a5e7bb69c56f
https://github.com/NVIDIA-BaseOS-6/linux-nvidia-6.5/tree/MPAM-dev)
Signed-off-by: KobaK <[email protected]>
@KobaKoNvidia KobaKoNvidia force-pushed the kobak_dgx8849_24.04_linux-nvidia-adv-6.8-next_backportMPAM branch from 9132bd9 to f2410f2 Compare November 20, 2024 01:41
@clsotog clsotog self-requested a review November 20, 2024 05:51
Copy link
Collaborator

@clsotog clsotog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acked-by: Carol L Soto ([email protected])

@KobaKoNvidia KobaKoNvidia requested a review from clsotog November 20, 2024 06:24
Copy link
Collaborator

@nvmochs nvmochs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified all picked commits against their source. Reviewed backported commits for intent and context.

Acked-by: Matthew R. Ochs [email protected]

@tdavenvidia
Copy link
Collaborator

Signed-off-by: Tushar Dave [email protected]

Copy link
Collaborator

@tdavenvidia tdavenvidia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add my SOB when commit.
LGTM.

@nvmochs
Copy link
Collaborator

nvmochs commented Nov 22, 2024

Added trailers and pushed to -next.

@nvmochs nvmochs closed this Nov 22, 2024
nvidia-bfigg pushed a commit that referenced this pull request Dec 13, 2024
This fixes the circular locking dependency warning below, by reworking
iso_sock_recvmsg, to ensure that the socket lock is always released
before calling a function that locks hdev.

[  561.670344] ======================================================
[  561.670346] WARNING: possible circular locking dependency detected
[  561.670349] 6.12.0-rc6+ #26 Not tainted
[  561.670351] ------------------------------------------------------
[  561.670353] iso-tester/3289 is trying to acquire lock:
[  561.670355] ffff88811f600078 (&hdev->lock){+.+.}-{3:3},
               at: iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670405]
               but task is already holding lock:
[  561.670407] ffff88815af58258 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0},
               at: iso_sock_recvmsg+0xbf/0x500 [bluetooth]
[  561.670450]
               which lock already depends on the new lock.

[  561.670452]
               the existing dependency chain (in reverse order) is:
[  561.670453]
               -> #2 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}:
[  561.670458]        lock_acquire+0x7c/0xc0
[  561.670463]        lock_sock_nested+0x3b/0xf0
[  561.670467]        bt_accept_dequeue+0x1a5/0x4d0 [bluetooth]
[  561.670510]        iso_sock_accept+0x271/0x830 [bluetooth]
[  561.670547]        do_accept+0x3dd/0x610
[  561.670550]        __sys_accept4+0xd8/0x170
[  561.670553]        __x64_sys_accept+0x74/0xc0
[  561.670556]        x64_sys_call+0x17d6/0x25f0
[  561.670559]        do_syscall_64+0x87/0x150
[  561.670563]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670567]
               -> #1 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[  561.670571]        lock_acquire+0x7c/0xc0
[  561.670574]        lock_sock_nested+0x3b/0xf0
[  561.670577]        iso_sock_listen+0x2de/0xf30 [bluetooth]
[  561.670617]        __sys_listen_socket+0xef/0x130
[  561.670620]        __x64_sys_listen+0xe1/0x190
[  561.670623]        x64_sys_call+0x2517/0x25f0
[  561.670626]        do_syscall_64+0x87/0x150
[  561.670629]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670632]
               -> #0 (&hdev->lock){+.+.}-{3:3}:
[  561.670636]        __lock_acquire+0x32ad/0x6ab0
[  561.670639]        lock_acquire.part.0+0x118/0x360
[  561.670642]        lock_acquire+0x7c/0xc0
[  561.670644]        __mutex_lock+0x18d/0x12f0
[  561.670647]        mutex_lock_nested+0x1b/0x30
[  561.670651]        iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670687]        iso_sock_recvmsg+0x3e9/0x500 [bluetooth]
[  561.670722]        sock_recvmsg+0x1d5/0x240
[  561.670725]        sock_read_iter+0x27d/0x470
[  561.670727]        vfs_read+0x9a0/0xd30
[  561.670731]        ksys_read+0x1a8/0x250
[  561.670733]        __x64_sys_read+0x72/0xc0
[  561.670736]        x64_sys_call+0x1b12/0x25f0
[  561.670738]        do_syscall_64+0x87/0x150
[  561.670741]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670744]
               other info that might help us debug this:

[  561.670745] Chain exists of:
&hdev->lock --> sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> sk_lock-AF_BLUETOOTH

[  561.670751]  Possible unsafe locking scenario:

[  561.670753]        CPU0                    CPU1
[  561.670754]        ----                    ----
[  561.670756]   lock(sk_lock-AF_BLUETOOTH);
[  561.670758]                                lock(sk_lock
                                              AF_BLUETOOTH-BTPROTO_ISO);
[  561.670761]                                lock(sk_lock-AF_BLUETOOTH);
[  561.670764]   lock(&hdev->lock);
[  561.670767]
                *** DEADLOCK ***

Fixes: 07a9342 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync")
Signed-off-by: Iulia Tanasescu <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants