-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: [DGX-8847] Backport MPAM patch to 24.04-linux-nvidia-adv-6.8-next #26
Conversation
I try to review the patches tomorrow but I have one question from the results page (https://confluence.nvidia.com/display/DPS/Test+Results+after+backporting+patches+to+linux-nvidia-adv-6.8-next) |
You should use perf tool from the respective kernel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the comment on "6554ed347e3f arm_mpam: Add workaround for T241-MPAM-6" still accurate?
"[tdave: reworked the patch to make it apply on 6.5]"
@clsotog sorry for confusing you. nv_linux-6.5 is the name of directory. Thanks |
95bd55b
to
a4ff7aa
Compare
@KobaKoNvidia yeah probably showing the branch will help at the results to avoid confusion. |
I tried the mpam tree at Bianca and I see stack trace that I do not see with 6.8 kernel. I also tried like lego-cg1 and I do not see the stack trace so Im thinking something different at Bianca. Sharing the dmesg I try to look more next week. |
e432182
to
78cc4dc
Compare
@clsotog could you share the Bianca with me, i would like to check further. |
I only have the bianca during my day time. At your day time someone else use it. |
d
@clsotog
Didn't find "Call Trace" |
Actually I see the stack trace with 1014 kernel so its not mpam related. I have to debug it a little more on my side. If using noble I see it but if using jammy I do not see it at the same system. |
Found the issue at my side forgot that I was using that system to debug customer issue and changed the kernel parameters so we should be good. |
Syzkaller reported this warning: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 16 at net/ipv4/af_inet.c:156 inet_sock_destruct+0x1c5/0x1e0 Modules linked in: CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc5 #26 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:inet_sock_destruct+0x1c5/0x1e0 Code: 24 12 4c 89 e2 5b 48 c7 c7 98 ec bb 82 41 5c e9 d1 18 17 ff 4c 89 e6 5b 48 c7 c7 d0 ec bb 82 41 5c e9 bf 18 17 ff 0f 0b eb 83 <0f> 0b eb 97 0f 0b eb 87 0f 0b e9 68 ff ff ff 66 66 2e 0f 1f 84 00 RSP: 0018:ffffc9000008bd90 EFLAGS: 00010206 RAX: 0000000000000300 RBX: ffff88810b172a90 RCX: 0000000000000007 RDX: 0000000000000002 RSI: 0000000000000300 RDI: ffff88810b172a00 RBP: ffff88810b172a00 R08: ffff888104273c00 R09: 0000000000100007 R10: 0000000000020000 R11: 0000000000000006 R12: ffff88810b172a00 R13: 0000000000000004 R14: 0000000000000000 R15: ffff888237c31f78 FS: 0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffc63fecac8 CR3: 000000000342e000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __warn+0x88/0x130 ? inet_sock_destruct+0x1c5/0x1e0 ? report_bug+0x18e/0x1a0 ? handle_bug+0x53/0x90 ? exc_invalid_op+0x18/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? inet_sock_destruct+0x1c5/0x1e0 __sk_destruct+0x2a/0x200 rcu_do_batch+0x1aa/0x530 ? rcu_do_batch+0x13b/0x530 rcu_core+0x159/0x2f0 handle_softirqs+0xd3/0x2b0 ? __pfx_smpboot_thread_fn+0x10/0x10 run_ksoftirqd+0x25/0x30 smpboot_thread_fn+0xdd/0x1d0 kthread+0xd3/0x100 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]--- Its possible that two threads call tcp_v6_do_rcv()/sk_forward_alloc_add() concurrently when sk->sk_state == TCP_LISTEN with sk->sk_lock unlocked, which triggers a data-race around sk->sk_forward_alloc: tcp_v6_rcv tcp_v6_do_rcv skb_clone_and_charge_r sk_rmem_schedule __sk_mem_schedule sk_forward_alloc_add() skb_set_owner_r sk_mem_charge sk_forward_alloc_add() __kfree_skb skb_release_all skb_release_head_state sock_rfree sk_mem_uncharge sk_forward_alloc_add() sk_mem_reclaim // set local var reclaimable __sk_mem_reclaim sk_forward_alloc_add() In this syzkaller testcase, two threads call tcp_v6_do_rcv() with skb->truesize=768, the sk_forward_alloc changes like this: (cpu 1) | (cpu 2) | sk_forward_alloc ... | ... | 0 __sk_mem_schedule() | | +4096 = 4096 | __sk_mem_schedule() | +4096 = 8192 sk_mem_charge() | | -768 = 7424 | sk_mem_charge() | -768 = 6656 ... | ... | sk_mem_uncharge() | | +768 = 7424 reclaimable=7424 | | | sk_mem_uncharge() | +768 = 8192 | reclaimable=8192 | __sk_mem_reclaim() | | -4096 = 4096 | __sk_mem_reclaim() | -8192 = -4096 != 0 The skb_clone_and_charge_r() should not be called in tcp_v6_do_rcv() when sk->sk_state is TCP_LISTEN, it happens later in tcp_v6_syn_recv_sock(). Fix the same issue in dccp_v6_do_rcv(). Suggested-by: Eric Dumazet <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Fixes: e994b2f ("tcp: do not lock listener to process SYN packets") Signed-off-by: Wang Liang <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Brad Figg <[email protected]> (cherry picked from commit 207132556b8f2b1b4353b4d4e85eb56f9fca6a68 https://github.com/NVIDIA-BaseOS-6/linux-nvidia-6.5/tree/MPAM-dev) Signed-off-by: KobaK <[email protected]>
The resctrl has helper functions for updating CPOR and MBW configuration updates. While the existing resctrl_arch_update_one() effectively updates the specified configuration as intended, it inadvertently overrides other configurations with default values. Example bandwdith value is not applied: root# cat /sys/fs/resctrl/schemata MB:1=100 L3:1=fff root# echo -e "L3:1=fff\nMB:1=50" > /sys/fs/resctrl/schemata root# cat /sys/fs/resctrl/schemata MB:1=100 L3:1=fff Fix the potential loss of accuracy during the conversion of MBW_MAX from percentage to fixed-point representation, and vice versa. The updated functions provide fixed-point values that closely align with the values specified in the MPAM specification, Table 9-3 for Fraction Widths And Hex Representation. Before this fix: root# echo -e "MB:1=1" > /sys/fs/resctrl/schemata root# cat /sys/fs/resctrl/schemata MB:1=000 L3:1=fff root# echo -e "MB:1=2" > /sys/fs/resctrl/schemata root# cat /sys/fs/resctrl/schemata MB:1=001 L3:1=fff root# echo -e "MB:1=3" > /sys/fs/resctrl/schemata root@pset# cat /sys/fs/resctrl/schemata MB:1=001 L3:1=fff With this patch: root# echo -e "MB:1=1" > /sys/fs/resctrl/schemata root# cat /sys/fs/resctrl/schemata MB:1=001 L3:1=fff root# echo -e "MB:1=2" > /sys/fs/resctrl/schemata root# cat /sys/fs/resctrl/schemata MB:1=002 L3:1=fff root# echo -e "MB:1=3" > /sys/fs/resctrl/schemata root@pset# cat /sys/fs/resctrl/schemata MB:1=003 L3:1=fff Signed-off-by: Shanker Donthineni <[email protected]> [backported from https://patchwork.kernel.org/project/linux-arm-kernel/patch/[email protected]/] Signed-off-by: koba <[email protected]> ~~~ Don't find in mainline kernel and JMorse branches
The MPAM specification includes the MPAMF_IIDR, which serves to uniquely identify the MSC implementation through a combination of implementer details, product ID, variant, and revision. Certain hardware issues/errata can be resolved using software workarounds. Introduce a quirk framework to allow workarounds to be enabled based on the MPAMF_IIDR value. Signed-off-by: Shanker Donthineni <[email protected]> [ morse: Stash the IIDR so this doesn't need an IPI, enable quirks only once, move the description to the callback so it can be pr_once()d, add an enum of workarounds for popular errata. Add macros for making lists of product/revision/vendor half readable ] Signed-off-by: James Morse <[email protected]> (backported from commit d0a88d4feda72bada81791170ee63d0df4774dc7 https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.10-rc1) Signed-off-by: KobaK <[email protected]>
The MPAM bandwidth partitioning controls will not be correctly configured, and hardware will retain default configuration register values, meaning generally that bandwidth will remain unprovisioned. To address the issue, follow the below steps after updating the MBW_MIN and/or MBW_MAX registers. - Perform 64b reads from all 12 bridge MPAM shadow registers at offsets (0x360048 + slice*0x10000 + partid*8). These registers are read-only. - Continue iterating until all 12 shadow register values match in a loop. Notify the user if the values fail to match within the loop count 1000. - Perform 64b writes with the value 0x0 to the two spare registers at offsets 0x1b0000 and 0x1c0000. In the hardware, writes to the MPAMCFG_MBW_MAX MPAMCFG_MBW_MIN registers are transformed into broadcast writes to the 12 shadow registers. The final two writes to the spare registers cause a final rank of downstream micro-architectural MPAM registers to be updated from the shadow copies. The intervening loop to read the 12 shadow registers helps avoid a race condition where writes to the spare registers occur before all shadow registers have been updated. Signed-off-by: Shanker Donthineni <[email protected]> [ morse: Merged the min/max update into a single mpam_quirk_post_config_change() helper. Stashed the t241_id in the msc instead of carrying the physical address around. Test the msc quirk bit instead of a static key. ] Signed-off-by: James Morse <[email protected]> (backported from commit 90df4cd5e7792b26a619b05ff63264f34065e161 https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.10-rc1) Signed-off-by: KobaK <[email protected]>
In the T241 implementation of memory-bandwidth partitioning, in the absence of contention for bandwidth, the minimum bandwidth setting can affect the amount of achieved bandwidth. Specifically, the achieved bandwidth in the absence of contention can settle to any value between the values of MPAMCFG_MBW_MIN and MPAMCFG_MBW_MAX. Also, if MPAMCFG_MBW_MIN is set zero (below 0.78125%), once a core enters a throttled state, it will never leave that state. The first issue is not a cocern if the MPAM software allows to program MPAMCFG_MBW_MIN through the sysfs interface. This patch ensures program MBW_MIN=1 (0.78125%) whenever MPAMCFG_MBW_MIN=0 is programmed. In the scenario where the resctrl doesn't support the MBW_MIN interface via sysfs, to achieve bandwidth closer to MW_MAX in the absence of contention, software should configure a relatively narrow gap between MBW_MIN and MBW_MAX. The recommendation is to use a 5% gap to mitigate the problem. Signed-off-by: Shanker Donthineni <[email protected]> [ morse: Added as s econd quirk, addapted to use the new intermediate values in mpam_extend_config() ] Signed-off-by: James Morse <[email protected]> (backported from commit d9ba67a1a8dc6551a0b3254a8f2ee9993ad17957 https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.10-rc1) Signed-off-by: KobaK <[email protected]> ~~~ Don't need [0] because backport remove the related function. [0] arm_mpam: Fix T241-MPAM-4 workaround, https://patchwork.kernel.org/project/linux-arm-kernel/patch/[email protected]/
The registers MSMON_MBWU_L and MSMON_MBWU return the number of requests rather than the number of bytes transferred. Bandwidth resource monitoring is performed at the last level cache, where each request arrive in 64Byte granularity. The current implementation returns the number of transactions received at the last level cache but does not provide the value in bytes. Scaling by 64 gives an accurate byte count to match the MPAM specification for the MSMON_MBWU and MSMON_MBWU_L registers. This patch fixes the issue by reporting the actual number of bytes instead of the number of transactions from __ris_msmon_read(). Signed-off-by: Shanker Donthineni <[email protected]> (backported from https://patchwork.kernel.org/project/linux-arm-kernel/patch/[email protected]/) Signed-off-by: KobaK <[email protected]>
The resctrl_pmu driver calls resctrl_arch_rmid_read() from interrupt context. This eventually ends up in mpam_msmon_read() which cross-calls to access remote cache-slices, and sleeps if the monitor returned not-ready. To support resctrl_pmu, remove these behaviours if irqs are masked. Instead of cross-calling, return an error. This means the resctrl_pmu can't be supported on platforms with cache-slices that aren't accessible from every CPU. Instead of sleeping when the counter is busy, return -EBUSY. The resctrl_pmu driver will not update the counter in this case. Signed-off-by: James Morse <[email protected]> Signed-off-by: Tushar Dave <[email protected]> (cherry picked from commit 6f1849529673a493c9ecd4459bf4a5e7bb69c56f https://github.com/NVIDIA-BaseOS-6/linux-nvidia-6.5/tree/MPAM-dev) Signed-off-by: KobaK <[email protected]>
9132bd9
to
f2410f2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acked-by: Carol L Soto ([email protected])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verified all picked commits against their source. Reviewed backported commits for intent and context.
Acked-by: Matthew R. Ochs [email protected]
Signed-off-by: Tushar Dave [email protected] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add my SOB when commit.
LGTM.
Added trailers and pushed to -next. |
This fixes the circular locking dependency warning below, by reworking iso_sock_recvmsg, to ensure that the socket lock is always released before calling a function that locks hdev. [ 561.670344] ====================================================== [ 561.670346] WARNING: possible circular locking dependency detected [ 561.670349] 6.12.0-rc6+ #26 Not tainted [ 561.670351] ------------------------------------------------------ [ 561.670353] iso-tester/3289 is trying to acquire lock: [ 561.670355] ffff88811f600078 (&hdev->lock){+.+.}-{3:3}, at: iso_conn_big_sync+0x73/0x260 [bluetooth] [ 561.670405] but task is already holding lock: [ 561.670407] ffff88815af58258 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}, at: iso_sock_recvmsg+0xbf/0x500 [bluetooth] [ 561.670450] which lock already depends on the new lock. [ 561.670452] the existing dependency chain (in reverse order) is: [ 561.670453] -> #2 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}: [ 561.670458] lock_acquire+0x7c/0xc0 [ 561.670463] lock_sock_nested+0x3b/0xf0 [ 561.670467] bt_accept_dequeue+0x1a5/0x4d0 [bluetooth] [ 561.670510] iso_sock_accept+0x271/0x830 [bluetooth] [ 561.670547] do_accept+0x3dd/0x610 [ 561.670550] __sys_accept4+0xd8/0x170 [ 561.670553] __x64_sys_accept+0x74/0xc0 [ 561.670556] x64_sys_call+0x17d6/0x25f0 [ 561.670559] do_syscall_64+0x87/0x150 [ 561.670563] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 561.670567] -> #1 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}: [ 561.670571] lock_acquire+0x7c/0xc0 [ 561.670574] lock_sock_nested+0x3b/0xf0 [ 561.670577] iso_sock_listen+0x2de/0xf30 [bluetooth] [ 561.670617] __sys_listen_socket+0xef/0x130 [ 561.670620] __x64_sys_listen+0xe1/0x190 [ 561.670623] x64_sys_call+0x2517/0x25f0 [ 561.670626] do_syscall_64+0x87/0x150 [ 561.670629] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 561.670632] -> #0 (&hdev->lock){+.+.}-{3:3}: [ 561.670636] __lock_acquire+0x32ad/0x6ab0 [ 561.670639] lock_acquire.part.0+0x118/0x360 [ 561.670642] lock_acquire+0x7c/0xc0 [ 561.670644] __mutex_lock+0x18d/0x12f0 [ 561.670647] mutex_lock_nested+0x1b/0x30 [ 561.670651] iso_conn_big_sync+0x73/0x260 [bluetooth] [ 561.670687] iso_sock_recvmsg+0x3e9/0x500 [bluetooth] [ 561.670722] sock_recvmsg+0x1d5/0x240 [ 561.670725] sock_read_iter+0x27d/0x470 [ 561.670727] vfs_read+0x9a0/0xd30 [ 561.670731] ksys_read+0x1a8/0x250 [ 561.670733] __x64_sys_read+0x72/0xc0 [ 561.670736] x64_sys_call+0x1b12/0x25f0 [ 561.670738] do_syscall_64+0x87/0x150 [ 561.670741] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 561.670744] other info that might help us debug this: [ 561.670745] Chain exists of: &hdev->lock --> sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> sk_lock-AF_BLUETOOTH [ 561.670751] Possible unsafe locking scenario: [ 561.670753] CPU0 CPU1 [ 561.670754] ---- ---- [ 561.670756] lock(sk_lock-AF_BLUETOOTH); [ 561.670758] lock(sk_lock AF_BLUETOOTH-BTPROTO_ISO); [ 561.670761] lock(sk_lock-AF_BLUETOOTH); [ 561.670764] lock(&hdev->lock); [ 561.670767] *** DEADLOCK *** Fixes: 07a9342 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync") Signed-off-by: Iulia Tanasescu <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
BugLink: https://jirasw.nvidia.com/browse/DGX-8847
[Description]
Backport MPAM patches from MPAM-dev[0]
Also tried to backport MPAM patches based on PR25 to branch [1]
[0], https://github.com/NVIDIA-BaseOS-6/linux-nvidia-6.5/tree/MPAM-dev
[1], https://github.com/KobaKoNvidia/NV-Kernels/tree/kobak_dgx8849_pr-nv-kernels-25_backportMPAM
[Test Procedure]
[2]. Test Results after backporting patches to linux-nvidia-adv-6.8-next
[3] Test Results with Lego-CG1-QS-38, https://confluence.nvidia.com/display/DPS/Test+Results+after+backporting+patches+to+linux-nvidia-adv-6.8-next
Jira: DGX-8847
ChangeLog: feat