Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpf, arm64: fix bpf line info #3

Closed
wants to merge 3 commits into from

Conversation

kernel-patches-bot
Copy link

Pull request for series with
subject: bpf, arm64: fix bpf line info
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002

@kernel-patches-bot
Copy link
Author

Master branch: edc21dc
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: d2b94f3
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: 8cbf062
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: 8cbf062
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: 477bb4c
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: 2e3f7be
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: f76d850
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: 9b6eb04
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: 1b8c924
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: 9e98ace
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: 1b8c924
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: b38101c
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

@kernel-patches-bot
Copy link
Author

Master branch: b75daca
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002
version: 3

kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Aug 21, 2024
Eduard Zingerman says:

====================
__jited test tag to check disassembly after jit

Some of the logic in the BPF jits might be non-trivial.
It might be useful to allow testing this logic by comparing
generated native code with expected code template.
This patch set adds a macro __jited() that could be used for
test_loader based tests in a following manner:

    SEC("tp")
    __arch_x86_64
    __jited("   endbr64")
    __jited("   nopl    (%rax,%rax)")
    __jited("   xorq    %rax, %rax")
    ...
    __naked void some_test(void) { ... }

Also add a test for jit code generated for tail calls handling to
demonstrate the feature.

The feature uses LLVM libraries to do the disassembly.
At selftests compilation time Makefile detects if these libraries are
available. When libraries are not available tests using __jit_x86()
are skipped.
Current CI environment does not include llvm development libraries,
but changes to add these are trivial.

This was previously discussed here:
https://lore.kernel.org/bpf/[email protected]/

Patch-set includes a few auxiliary steps:
- patches #2 and #3 fix a few bugs in test_loader behaviour;
- patch #4 replaces __regex macro with ability to specify regular
  expressions in __msg and __xlated using "{{" "}}" escapes;
- patch #8 updates __xlated to match disassembly lines consequently,
  same way as __jited does.

Changes v2->v3:
- changed macro name from __jit_x86 to __jited with __arch_* to
  specify disassembly arch (Yonghong);
- __jited matches disassembly lines consequently with "..."
  allowing to skip some number of lines (Andrii);
- __xlated matches disassembly lines consequently, same as __jited;
- "{{...}}" regex brackets instead of __regex macro;
- bug fixes for old commits.

Changes v1->v2:
- stylistic changes suggested by Yonghong;
- fix for -Wformat-truncation related warning when compiled with
  llvm15 (Yonghong).

v1: https://lore.kernel.org/bpf/[email protected]/
v2: https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Aug 22, 2024
Lockdep reported a warning in Linux version 6.6:

[  414.344659] ================================
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.356863]   scsi_io_completion+0x177/0x1610
[  414.357379]   scsi_complete+0x12f/0x260
[  414.357856]   blk_complete_reqs+0xba/0xf0
[  414.358338]   __do_softirq+0x1b0/0x7a2
[  414.358796]   irq_exit_rcu+0x14b/0x1a0
[  414.359262]   sysvec_call_function_single+0xaf/0xc0
[  414.359828]   asm_sysvec_call_function_single+0x1a/0x20
[  414.360426]   default_idle+0x1e/0x30
[  414.360873]   default_idle_call+0x9b/0x1f0
[  414.361390]   do_idle+0x2d2/0x3e0
[  414.361819]   cpu_startup_entry+0x55/0x60
[  414.362314]   start_secondary+0x235/0x2b0
[  414.362809]   secondary_startup_64_no_verify+0x18f/0x19b
[  414.363413] irq event stamp: 428794
[  414.363825] hardirqs last  enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200
[  414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50
[  414.365629] softirqs last  enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2
[  414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0
[  414.367425]
               other info that might help us debug this:
[  414.368194]  Possible unsafe locking scenario:
[  414.368900]        CPU0
[  414.369225]        ----
[  414.369548]   lock(&sbq->ws[i].wait);
[  414.370000]   <Interrupt>
[  414.370342]     lock(&sbq->ws[i].wait);
[  414.370802]
                *** DEADLOCK ***
[  414.371569] 5 locks held by kworker/u10:3/1152:
[  414.372088]  #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0
[  414.373180]  #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0
[  414.374384]  #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00
[  414.375342]  #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.376377]  #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0
[  414.378607]
               stack backtrace:
[  414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6
[  414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  414.381177] Workqueue: writeback wb_workfn (flush-253:0)
[  414.381805] Call Trace:
[  414.382136]  <TASK>
[  414.382429]  dump_stack_lvl+0x91/0xf0
[  414.382884]  mark_lock_irq+0xb3b/0x1260
[  414.383367]  ? __pfx_mark_lock_irq+0x10/0x10
[  414.383889]  ? stack_trace_save+0x8e/0xc0
[  414.384373]  ? __pfx_stack_trace_save+0x10/0x10
[  414.384903]  ? graph_lock+0xcf/0x410
[  414.385350]  ? save_trace+0x3d/0xc70
[  414.385808]  mark_lock.part.20+0x56d/0xa90
[  414.386317]  mark_held_locks+0xb0/0x110
[  414.386791]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.387320]  lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.387901]  ? _raw_spin_unlock_irq+0x28/0x50
[  414.388422]  trace_hardirqs_on+0x58/0x100
[  414.388917]  _raw_spin_unlock_irq+0x28/0x50
[  414.389422]  __blk_mq_tag_busy+0x1d6/0x2a0
[  414.389920]  __blk_mq_get_driver_tag+0x761/0x9f0
[  414.390899]  blk_mq_dispatch_rq_list+0x1780/0x1ee0
[  414.391473]  ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10
[  414.392070]  ? sbitmap_get+0x2b8/0x450
[  414.392533]  ? __blk_mq_get_driver_tag+0x210/0x9f0
[  414.393095]  __blk_mq_sched_dispatch_requests+0xd99/0x1690
[  414.393730]  ? elv_attempt_insert_merge+0x1b1/0x420
[  414.394302]  ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10
[  414.394970]  ? lock_acquire+0x18d/0x460
[  414.395456]  ? blk_mq_run_hw_queue+0x637/0xa00
[  414.395986]  ? __pfx_lock_acquire+0x10/0x10
[  414.396499]  blk_mq_sched_dispatch_requests+0x109/0x190
[  414.397100]  blk_mq_run_hw_queue+0x66e/0xa00
[  414.397616]  blk_mq_flush_plug_list.part.17+0x614/0x2030
[  414.398244]  ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10
[  414.398897]  ? writeback_sb_inodes+0x241/0xcc0
[  414.399429]  blk_mq_flush_plug_list+0x65/0x80
[  414.399957]  __blk_flush_plug+0x2f1/0x530
[  414.400458]  ? __pfx___blk_flush_plug+0x10/0x10
[  414.400999]  blk_finish_plug+0x59/0xa0
[  414.401467]  wb_writeback+0x7cc/0x920
[  414.401935]  ? __pfx_wb_writeback+0x10/0x10
[  414.402442]  ? mark_held_locks+0xb0/0x110
[  414.402931]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.403462]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.404062]  wb_workfn+0x2b3/0xcf0
[  414.404500]  ? __pfx_wb_workfn+0x10/0x10
[  414.404989]  process_scheduled_works+0x432/0x13f0
[  414.405546]  ? __pfx_process_scheduled_works+0x10/0x10
[  414.406139]  ? do_raw_spin_lock+0x101/0x2a0
[  414.406641]  ? assign_work+0x19b/0x240
[  414.407106]  ? lock_is_held_type+0x9d/0x110
[  414.407604]  worker_thread+0x6f2/0x1160
[  414.408075]  ? __kthread_parkme+0x62/0x210
[  414.408572]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.409168]  ? __kthread_parkme+0x13c/0x210
[  414.409678]  ? __pfx_worker_thread+0x10/0x10
[  414.410191]  kthread+0x33c/0x440
[  414.410602]  ? __pfx_kthread+0x10/0x10
[  414.411068]  ret_from_fork+0x4d/0x80
[  414.411526]  ? __pfx_kthread+0x10/0x10
[  414.411993]  ret_from_fork_asm+0x1b/0x30
[  414.412489]  </TASK>

When interrupt is turned on while a lock holding by spin_lock_irq it
throws a warning because of potential deadlock.

blk_mq_prep_dispatch_rq
 blk_mq_get_driver_tag
  __blk_mq_get_driver_tag
   __blk_mq_alloc_driver_tag
    blk_mq_tag_busy -> tag is already busy
    // failed to get driver tag
 blk_mq_mark_tag_wait
  spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait)
  __add_wait_queue(wq, wait) -> wait queue active
  blk_mq_get_driver_tag
  __blk_mq_tag_busy
-> 1) tag must be idle, which means there can't be inflight IO
   spin_lock_irq(&tags->lock) -> lock B (hctx->tags)
   spin_unlock_irq(&tags->lock) -> unlock B, turn on interrupt accidentally
-> 2) context must be preempt by IO interrupt to trigger deadlock.

As shown above, the deadlock is not possible in theory, but the warning
still need to be fixed.

Fix it by using spin_lock_irqsave to get lockB instead of spin_lock_irq.

Fixes: 4f1731d ("blk-mq: fix potential io hang by wrong 'wake_batch'")
Signed-off-by: Li Lingfeng <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Aug 27, 2024
Tariq Toukan says:

====================
mlx5 misc patches 2024-08-08

This patchset contains multiple enhancements from the team to the mlx5
core and Eth drivers.

Patch #1 by Chris bumps a defined value to permit more devices doing TC
offloads.

Patch #2 by Jianbo adds an IPsec fast-path optimization to replace the
slow async handling.

Patches #3 and #4 by Jianbo add TC offload support for complicated rules
to overcome firmware limitation.

Patch #5 by Gal unifies the access macro to advertised/supported link
modes.

Patches #6 to #9 by Gal adds extack messages in ethtool ops to replace
prints to the kernel log.

Patch #10 by Cosmin switches to using 'update' verb instead of 'replace'
to better reflect the operation.

Patch #11 by Cosmin exposes an update connection tracking operation to
replace the assumed delete+add implementaiton.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Aug 27, 2024
Petr Machata says:

====================
net: nexthop: Increase weight to u16

In CLOS networks, as link failures occur at various points in the network,
ECMP weights of the involved nodes are adjusted to compensate. With high
fan-out of the involved nodes, and overall high number of nodes,
a (non-)ECMP weight ratio that we would like to configure does not fit into
8 bits. Instead of, say, 255:254, we might like to configure something like
1000:999. For these deployments, the 8-bit weight may not be enough.

To that end, in this patchset increase the next hop weight from u8 to u16.

Patch #1 adds a flag that indicates whether the reserved fields are zeroed.
This is a follow-up to a new fix merged in commit 6d745cd ("net:
nexthop: Initialize all fields in dumped nexthops"). The theory behind this
patch is that there is a strict ordering between the fields actually being
zeroed, the kernel declaring that they are, and the kernel repurposing the
fields. Thus clients can use the flag to tell if it is safe to interpret
the reserved fields in any way.

Patch #2 contains the substantial code and the commit message covers the
details of the changes.

Patches #3 to #6 add selftests.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Aug 27, 2024
Ido Schimmel says:

====================
Preparations for FIB rule DSCP selector

This patchset moves the masking of the upper DSCP bits in 'flowi4_tos'
to the core instead of relying on callers of the FIB lookup API to do
it.

This will allow us to start changing users of the API to initialize the
'flowi4_tos' field with all six bits of the DSCP field. In turn, this
will allow us to extend FIB rules with a new DSCP selector.

By masking the upper DSCP bits in the core we are able to maintain the
behavior of the TOS selector in FIB rules and routes to only match on
the lower DSCP bits.

While working on this I found two users of the API that do not mask the
upper DSCP bits before performing the lookup. The first is an ancient
netlink family that is unlikely to be used. It is adjusted in patch #1
to mask both the upper DSCP bits and the ECN bits before calling the
API.

The second user is a nftables module that differs in this regard from
its equivalent iptables module. It is adjusted in patch #2 to invoke the
API with the upper DSCP bits masked, like all other callers. The
relevant selftest passed, but in the unlikely case that regressions are
reported because of this change, we can restore the existing behavior
using a new flow information flag as discussed here [1].

The last patch moves the masking of the upper DSCP bits to the core,
making the first two patches redundant, but I wanted to post them
separately to call attention to the behavior change for these two users
of the FIB lookup API.

Future patchsets (around 3) will start unmasking the upper DSCP bits
throughout the networking stack before adding support for the new FIB
rule DSCP selector.

Changes from v1 [2]:

Patch #3: Include <linux/ip.h> in <linux/in_route.h> instead of
including it in net/ip_fib.h

[1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/
[2] https://lore.kernel.org/netdev/[email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Aug 27, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 disable BH when collecting stats via hardware offload to ensure
         concurrent updates from packet path do not result in losing stats.
         From Sebastian Andrzej Siewior.

Patch #2 uses write seqcount to reset counters serialize against reader.
         Also from Sebastian Andrzej Siewior.

Patch #3 ensures vlan header is in place before accessing its fields,
         according to KMSAN splat triggered by syzbot.

* tag 'nf-24-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: flowtable: validate vlan header
  netfilter: nft_counter: Synchronize nft_counter_reset() against reader.
  netfilter: nft_counter: Disable BH in nft_counter_offload_stats().
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Aug 27, 2024
…rnel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next:

Patch #1 fix checksum calculation in nfnetlink_queue with SCTP,
	 segment GSO packet since skb_zerocopy() does not support
	 GSO_BY_FRAGS, from Antonio Ojea.

Patch #2 extend nfnetlink_queue coverage to handle SCTP packets,
	 from Antonio Ojea.

Patch #3 uses consume_skb() instead of kfree_skb() in nfnetlink,
         from Donald Hunter.

Patch #4 adds a dedicate commit list for sets to speed up
	 intra-transaction lookups, from Florian Westphal.

Patch #5 skips removal of element from abort path for the pipapo
         backend, ditching the shadow copy of this datastructure
	 is sufficient.

Patch #6 moves nf_ct_netns_get() out of nf_conncount_init() to
	 let users of conncoiunt decide when to enable conntrack,
	 this is needed by openvswitch, from Xin Long.

Patch #7 pass context to all nft_parse_register_load() in
	 preparation for the next patch.

Patches #8 and #9 reject loads from uninitialized registers from
	 control plane to remove register initialization from
	 datapath. From Florian Westphal.

* tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: don't initialize registers in nft_do_chain()
  netfilter: nf_tables: allow loads only when register is initialized
  netfilter: nf_tables: pass context structure to nft_parse_register_load
  netfilter: move nf_ct_netns_get out of nf_conncount_init
  netfilter: nf_tables: do not remove elements if set backend implements .abort
  netfilter: nf_tables: store new sets in dedicated list
  netfilter: nfnetlink: convert kfree_skb to consume_skb
  selftests: netfilter: nft_queue.sh: sctp coverage
  netfilter: nfnetlink_queue: unbreak SCTP traffic
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 9, 2024
llvm change [1] made a change such that __sync_fetch_and_{and,or,xor}()
will generate atomic_fetch_*() insns even if the return value is not used.
This is a deliberate choice to make sure barrier semantics are preserved
from source code to asm insn.

But the change in [1] caused arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,v
ar_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The reason of the failure is due to [2] where atomic{64,}_fetch_{and,or,xor}() are not
allowed by arena addresses.

Version 2 of the patch fixed the issue by using inline asm ([3]). But further discussion
suggested to find a way from source to generate locked insn which is more user
friendly. So in not-merged llvm patch ([4]), if relax memory ordering is used and
the return value is not used, locked insn could be generated.

So with llvm patch [4] to compile the bpf selftest, the following code
  __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
is able to generate locked insn, hence fixing the selftest failure.

  [1] llvm/llvm-project#106494
  [2] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
  [3] https://lore.kernel.org/bpf/[email protected]/
  [4] llvm/llvm-project#107343

Signed-off-by: Yonghong Song <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 9, 2024
llvm change [1] made a change such that __sync_fetch_and_{and,or,xor}()
will generate atomic_fetch_*() insns even if the return value is not used.
This is a deliberate choice to make sure barrier semantics are preserved
from source code to asm insn.

But the change in [1] caused arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,v
ar_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The reason of the failure is due to [2] where atomic{64,}_fetch_{and,or,xor}() are not
allowed by arena addresses.

Version 2 of the patch fixed the issue by using inline asm ([3]). But further discussion
suggested to find a way from source to generate locked insn which is more user
friendly. So in not-merged llvm patch ([4]), if relax memory ordering is used and
the return value is not used, locked insn could be generated.

So with llvm patch [4] to compile the bpf selftest, the following code
  __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
is able to generate locked insn, hence fixing the selftest failure.

  [1] llvm/llvm-project#106494
  [2] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
  [3] https://lore.kernel.org/bpf/[email protected]/
  [4] llvm/llvm-project#107343

Signed-off-by: Yonghong Song <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 9, 2024
llvm change [1] made a change such that __sync_fetch_and_{and,or,xor}()
will generate atomic_fetch_*() insns even if the return value is not used.
This is a deliberate choice to make sure barrier semantics are preserved
from source code to asm insn.

But the change in [1] caused arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,v
ar_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The reason of the failure is due to [2] where atomic{64,}_fetch_{and,or,xor}() are not
allowed by arena addresses.

Version 2 of the patch fixed the issue by using inline asm ([3]). But further discussion
suggested to find a way from source to generate locked insn which is more user
friendly. So in not-merged llvm patch ([4]), if relax memory ordering is used and
the return value is not used, locked insn could be generated.

So with llvm patch [4] to compile the bpf selftest, the following code
  __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
is able to generate locked insn, hence fixing the selftest failure.

  [1] llvm/llvm-project#106494
  [2] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
  [3] https://lore.kernel.org/bpf/[email protected]/
  [4] llvm/llvm-project#107343

Signed-off-by: Yonghong Song <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 10, 2024
llvm change [1] made a change such that __sync_fetch_and_{and,or,xor}()
will generate atomic_fetch_*() insns even if the return value is not used.
This is a deliberate choice to make sure barrier semantics are preserved
from source code to asm insn.

But the change in [1] caused arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,v
ar_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The reason of the failure is due to [2] where atomic{64,}_fetch_{and,or,xor}() are not
allowed by arena addresses.

Version 2 of the patch fixed the issue by using inline asm ([3]). But further discussion
suggested to find a way from source to generate locked insn which is more user
friendly. So in not-merged llvm patch ([4]), if relax memory ordering is used and
the return value is not used, locked insn could be generated.

So with llvm patch [4] to compile the bpf selftest, the following code
  __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
is able to generate locked insn, hence fixing the selftest failure.

  [1] llvm/llvm-project#106494
  [2] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
  [3] https://lore.kernel.org/bpf/[email protected]/
  [4] llvm/llvm-project#107343

Signed-off-by: Yonghong Song <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 10, 2024
llvm change [1] made a change such that __sync_fetch_and_{and,or,xor}()
will generate atomic_fetch_*() insns even if the return value is not used.
This is a deliberate choice to make sure barrier semantics are preserved
from source code to asm insn.

But the change in [1] caused arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,v
ar_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The reason of the failure is due to [2] where atomic{64,}_fetch_{and,or,xor}() are not
allowed by arena addresses.

Version 2 of the patch fixed the issue by using inline asm ([3]). But further discussion
suggested to find a way from source to generate locked insn which is more user
friendly. So in not-merged llvm patch ([4]), if relax memory ordering is used and
the return value is not used, locked insn could be generated.

So with llvm patch [4] to compile the bpf selftest, the following code
  __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
is able to generate locked insn, hence fixing the selftest failure.

  [1] llvm/llvm-project#106494
  [2] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
  [3] https://lore.kernel.org/bpf/[email protected]/
  [4] llvm/llvm-project#107343

Signed-off-by: Yonghong Song <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 11, 2024
llvm change [1] made a change such that __sync_fetch_and_{and,or,xor}()
will generate atomic_fetch_*() insns even if the return value is not used.
This is a deliberate choice to make sure barrier semantics are preserved
from source code to asm insn.

But the change in [1] caused arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,v
ar_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The reason of the failure is due to [2] where atomic{64,}_fetch_{and,or,xor}() are not
allowed by arena addresses.

Version 2 of the patch fixed the issue by using inline asm ([3]). But further discussion
suggested to find a way from source to generate locked insn which is more user
friendly. So in not-merged llvm patch ([4]), if relax memory ordering is used and
the return value is not used, locked insn could be generated.

So with llvm patch [4] to compile the bpf selftest, the following code
  __c11_atomic_fetch_and(&and64_value, 0x011ull << 32, memory_order_relaxed);
is able to generate locked insn, hence fixing the selftest failure.

  [1] llvm/llvm-project#106494
  [2] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
  [3] https://lore.kernel.org/bpf/[email protected]/
  [4] llvm/llvm-project#107343

Signed-off-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 13, 2024
Ido Schimmel says:

====================
Unmask upper DSCP bits - part 2

tl;dr - This patchset continues to unmask the upper DSCP bits in the
IPv4 flow key in preparation for allowing IPv4 FIB rules to match on
DSCP. No functional changes are expected. Part 1 was merged in commit
("Merge branch 'unmask-upper-dscp-bits-part-1'").

The TOS field in the IPv4 flow key ('flowi4_tos') is used during FIB
lookup to match against the TOS selector in FIB rules and routes.

It is currently impossible for user space to configure FIB rules that
match on the DSCP value as the upper DSCP bits are either masked in the
various call sites that initialize the IPv4 flow key or along the path
to the FIB core.

In preparation for adding a DSCP selector to IPv4 and IPv6 FIB rules, we
need to make sure the entire DSCP value is present in the IPv4 flow key.
This patchset continues to unmask the upper DSCP bits, but this time in
the output route path.

Patches #1-#3 unmask the upper DSCP bits in the various places that
invoke the core output route lookup functions directly.

Patches #4-#6 do the same in three helpers that are widely used in the
output path to initialize the TOS field in the IPv4 flow key.

The rest of the patches continue to unmask these bits in call sites that
invoke the following wrappers around the core lookup functions:

Patch #7 - __ip_route_output_key()
Patches #8-#12 - ip_route_output_flow()

The next patchset will handle the callers of ip_route_output_ports() and
ip_route_output_key().

No functional changes are expected as commit 1fa3314 ("ipv4:
Centralize TOS matching") moved the masking of the upper DSCP bits to
the core where 'flowi4_tos' is matched against the TOS selector.

Changes since v1 [1]:

* Remove IPTOS_RT_MASK in patch #7 instead of in patch #6

[1] https://lore.kernel.org/netdev/[email protected]/
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 13, 2024
Daniel Machon says:

====================
net: microchip: add FDMA library and use it for Sparx5

This patch series is the first of a 2-part series, that adds a new
common FDMA library for Microchip switch chips Sparx5 and lan966x. These
chips share the same FDMA engine, and as such will benefit from a
common library with a common implementation.  This also has the benefit
of removing a lot open-coded bookkeeping and duplicate code for the two
drivers.

Additionally, upstreaming efforts for a third chip, lan969x, will begin
in the near future. This chip will use the new library too.

In this first series, the FDMA library is introduced and used by the
Sparx5 switch driver.

 ###################
 # Example of use: #
 ###################

- Initialize the rx and tx fdma structs with values for: number of
  DCB's, number of DB's, channel ID, DB size (data buffer size), and
  total size of the requested memory. Also provide two callbacks:
  nextptr_cb() and dataptr_cb() for getting the nextptr and dataptr.

- Allocate memory using fdma_alloc_phys() or fdma_alloc_coherent().

- Initialize the DCB's with fdma_dcb_init().

- Add new DCB's with fdma_dcb_add().

- Free memory with fdma_free_phys() or fdma_free_coherent().

 #####################
 # Patch  breakdown: #
 #####################

Patch #1:  introduces library and selects it for Sparx5.

Patch #2:  includes the fdma_api.h header and removes old symbols.

Patch #3:  replaces old rx and tx variables with equivalent ones from the
           fdma struct. Only the variables that can be changed without
           breaking traffic is changed in this patch.

Patch #4:  uses the library for allocation of rx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch #5:  uses the library for adding DCB's in the rx path.

Patch #6:  uses the library for freeing rx buffers.

Patch #7:  uses the library helpers in the rx path.

Patch #8:  uses the library for allocation of tx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch #9:  uses the library for adding DCB's in the tx path.

Patch #10: uses the library helpers in the tx path.

Patch #11: ditches the existing linked list for storing buffer addresses,
           and instead uses offsets into contiguous memory.

Patch #12: modifies existing rx and tx functions to be direction
           independent.
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 13, 2024
…rnel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

Patch #1 adds ctnetlink support for kernel side filtering for
	 deletions, from Changliang Wu.

Patch #2 updates nft_counter support to Use u64_stats_t,
	 from Sebastian Andrzej Siewior.

Patch #3 uses kmemdup_array() in all xtables frontends,
	 from Yan Zhen.

Patch #4 is a oneliner to use ERR_CAST() in nf_conntrack instead
	 opencoded casting, from Shen Lichuan.

Patch #5 removes unused argument in nftables .validate interface,
	 from Florian Westphal.

Patch #6 is a oneliner to correct a typo in nftables kdoc,
	 from Simon Horman.

Patch #7 fixes missing kdoc in nftables, also from Simon.

Patch #8 updates nftables to handle timeout less than CONFIG_HZ.

Patch #9 rejects element expiration if timeout is zero,
	 otherwise it is silently ignored.

Patch #10 disallows element expiration larger than timeout.

Patch #11 removes unnecessary READ_ONCE annotation while mutex is held.

Patch #12 adds missing READ_ONCE/WRITE_ONCE annotation in dynset.

Patch #13 annotates data-races around element expiration.

Patch #14 allocates timeout and expiration in one single set element
	  extension, they are tighly couple, no reason to keep them
	  separated anymore.

Patch #15 updates nftables to interpret zero timeout element as never
	  times out. Note that it is already possible to declare sets
	  with elements that never time out but this generalizes to all
	  kind of set with timeouts.

Patch #16 supports for element timeout and expiration updates.

* tag 'nf-next-24-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: set element timeout update support
  netfilter: nf_tables: zero timeout means element never times out
  netfilter: nf_tables: consolidate timeout extension for elements
  netfilter: nf_tables: annotate data-races around element expiration
  netfilter: nft_dynset: annotate data-races around set timeout
  netfilter: nf_tables: remove annotation to access set timeout while holding lock
  netfilter: nf_tables: reject expiration higher than timeout
  netfilter: nf_tables: reject element expiration with no timeout
  netfilter: nf_tables: elements with timeout below CONFIG_HZ never expire
  netfilter: nf_tables: Add missing Kernel doc
  netfilter: nf_tables: Correct spelling in nf_tables.h
  netfilter: nf_tables: drop unused 3rd argument from validate callback ops
  netfilter: conntrack: Convert to use ERR_CAST()
  netfilter: Use kmemdup_array instead of kmemdup for multiple allocation
  netfilter: nft_counter: Use u64_stats_t for statistic.
  netfilter: ctnetlink: support CTA_FILTER for flush
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 13, 2024
Daniel Machon says:

====================
net: lan966x: use the newly introduced FDMA library

This patch series is the second of a 2-part series [1], that adds a new
common FDMA library for Microchip switch chips Sparx5 and lan966x. These
chips share the same FDMA engine, and as such will benefit from a common
library with a common implementation.  This also has the benefit of
removing a lot of open-coded bookkeeping and duplicate code for the two
drivers.

In this second series, the FDMA library will be taken into use by the
lan966x switch driver.

 ###################
 # Example of use: #
 ###################

- Initialize the rx and tx fdma structs with values for: number of
  DCB's, number of DB's, channel ID, DB size (data buffer size), and
  total size of the requested memory. Also provide two callbacks:
  nextptr_cb() and dataptr_cb() for getting the nextptr and dataptr.

- Allocate memory using fdma_alloc_phys() or fdma_alloc_coherent().

- Initialize the DCB's with fdma_dcb_init().

- Add new DCB's with fdma_dcb_add().

- Free memory with fdma_free_phys() or fdma_free_coherent().

 #####################
 # Patch  breakdown: #
 #####################

Patch #1:  select FDMA library for lan966x.

Patch #2:  includes the fdma_api.h header and removes old symbols.

Patch #3:  replaces old rx and tx variables with equivalent ones from the
           fdma struct. Only the variables that can be changed without
           breaking traffic is changed in this patch.

Patch #4:  uses the library for allocation of rx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch #5:  uses the library for adding DCB's in the rx path.

Patch #6:  uses the library for freeing rx buffers.

Patch #7:  uses the library for allocation of tx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch #8:  uses the library for adding DCB's in the tx path.

Patch #9:  uses the library helpers in the tx path.

Patch #10: ditch last_in_use variable and use library instead.

Patch #11: uses library helpers throughout.

Patch #12: refactor lan966x_fdma_reload() function.

[1] https://lore.kernel.org/netdev/[email protected]/

Signed-off-by: Daniel Machon <[email protected]>
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 23, 2024
Ido Schimmel says:

====================
net: fib_rules: Add DSCP selector support

Currently, the kernel rejects IPv4 FIB rules that try to match on the
upper three DSCP bits:

 # ip -4 rule add tos 0x1c table 100
 # ip -4 rule add tos 0x3c table 100
 Error: Invalid tos.

The reason for that is that historically users of the FIB lookup API
only populated the lower three DSCP bits in the TOS field of the IPv4
flow key ('flowi4_tos'), which fits the TOS definition from the initial
IPv4 specification (RFC 791).

This is not very useful nowadays and instead some users want to be able
to match on the six bits DSCP field, which replaced the TOS and IP
precedence fields over 25 years ago (RFC 2474). In addition, the current
behavior differs between IPv4 and IPv6 which does allow users to match
on the entire DSCP field using the TOS selector.

Recent patchsets made sure that callers of the FIB lookup API now
populate the entire DSCP field in the IPv4 flow key. Therefore, it is
now possible to extend FIB rules to match on DSCP.

This is done by adding a new DSCP attribute which is implemented for
both IPv4 and IPv6 to provide user space programs a consistent behavior
between both address families.

The behavior of the old TOS selector is unchanged and IPv4 FIB rules
using it will only match on the lower three DSCP bits. The kernel will
reject rules that try to use both selectors.

Patch #1 adds the new DSCP attribute but rejects its usage.

Patches #2-#3 implement IPv4 and IPv6 support.

Patch #4 allows user space to use the new attribute.

Patches #5-#6 add selftests.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 23, 2024
Nelson Escobar says:

====================
enic: Report per queue stats

Patch #1: Use a macro instead of static const variables for array sizes.  I
          didn't want to add more static const variables in the next patch
          so clean up the existing ones first.

Patch #2: Collect per queue statistics

Patch #3: Report per queue stats in netdev qstats

Patch #4: Report some per queue stats in ethtool

 # NETIF="eno6" tools/testing/selftests/drivers/net/stats.py
KTAP version 1
1..5
ok 1 stats.check_pause # XFAIL pause not supported by the device
ok 2 stats.check_fec # XFAIL FEC not supported by the device
ok 3 stats.pkt_byte_sum
ok 4 stats.qstat_by_ifindex
ok 5 stats.check_down

 # tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
     --dump qstats-get --json '{"ifindex": "34"}'
[{'ifindex': 34,
  'rx-bytes': 66762680,
  'rx-csum-unnecessary': 1009345,
  'rx-hw-drop-overruns': 0,
  'rx-hw-drops': 0,
  'rx-packets': 1009673,
  'tx-bytes': 137936674899,
  'tx-csum-none': 125,
  'tx-hw-gso-packets': 2408712,
  'tx-needs-csum': 2431531,
  'tx-packets': 15475466,
  'tx-stop': 0,
  'tx-wake': 0}]

v2: https://lore.kernel.org/[email protected]
v1: https://lore.kernel.org/[email protected]
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 23, 2024
iter_finish_branch_entry() doesn't put the branch_info from/to map
elements creating memory leaks. This can be seen with:

```
$ perf record -e cycles -b perf test -w noploop
$ perf report -D
...
Direct leak of 984344 byte(s) in 123043 object(s) allocated from:
    #0 0x7fb2654f3bd7 in malloc libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x564d3400d10b in map__get util/map.h:186
    #2 0x564d3400d10b in ip__resolve_ams util/machine.c:1981
    #3 0x564d34014d81 in sample__resolve_bstack util/machine.c:2151
    #4 0x564d34094790 in iter_prepare_branch_entry util/hist.c:898
    #5 0x564d34098fa4 in hist_entry_iter__add util/hist.c:1238
    #6 0x564d33d1f0c7 in process_sample_event tools/perf/builtin-report.c:334
    #7 0x564d34031eb7 in perf_session__deliver_event util/session.c:1655
    #8 0x564d3403ba52 in do_flush util/ordered-events.c:245
    #9 0x564d3403ba52 in __ordered_events__flush util/ordered-events.c:324
    #10 0x564d3402d32e in perf_session__process_user_event util/session.c:1708
    #11 0x564d34032480 in perf_session__process_event util/session.c:1877
    #12 0x564d340336ad in reader__read_event util/session.c:2399
    #13 0x564d34033fdc in reader__process_events util/session.c:2448
    #14 0x564d34033fdc in __perf_session__process_events util/session.c:2495
    #15 0x564d34033fdc in perf_session__process_events util/session.c:2661
    #16 0x564d33d27113 in __cmd_report tools/perf/builtin-report.c:1065
    #17 0x564d33d27113 in cmd_report tools/perf/builtin-report.c:1805
    #18 0x564d33e0ccb7 in run_builtin tools/perf/perf.c:350
    #19 0x564d33e0d45e in handle_internal_command tools/perf/perf.c:403
    #20 0x564d33cdd827 in run_argv tools/perf/perf.c:447
    #21 0x564d33cdd827 in main tools/perf/perf.c:561
...
```

Clearing up the map_symbols properly creates maps reference count
issues so resolve those. Resolving this issue doesn't improve peak
heap consumption for the test above.

Committer testing:

  $ sudo dnf install libasan
  $ make -k CORESIGHT=1 EXTRA_CFLAGS="-fsanitize=address" CC=clang O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin

Reviewed-by: Kan Liang <[email protected]>
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sun Haiyong <[email protected]>
Cc: Yanteng Si <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 23, 2024
AddressSanitizer found a use-after-free bug in the symbol code which
manifested as 'perf top' segfaulting.

  ==1238389==ERROR: AddressSanitizer: heap-use-after-free on address 0x60b00c48844b at pc 0x5650d8035961 bp 0x7f751aaecc90 sp 0x7f751aaecc80
  READ of size 1 at 0x60b00c48844b thread T193
      #0 0x5650d8035960 in _sort__sym_cmp util/sort.c:310
      #1 0x5650d8043744 in hist_entry__cmp util/hist.c:1286
      #2 0x5650d8043951 in hists__findnew_entry util/hist.c:614
      #3 0x5650d804568f in __hists__add_entry util/hist.c:754
      #4 0x5650d8045bf9 in hists__add_entry util/hist.c:772
      #5 0x5650d8045df1 in iter_add_single_normal_entry util/hist.c:997
      #6 0x5650d8043326 in hist_entry_iter__add util/hist.c:1242
      #7 0x5650d7ceeefe in perf_event__process_sample /home/matt/src/linux/tools/perf/builtin-top.c:845
      #8 0x5650d7ceeefe in deliver_event /home/matt/src/linux/tools/perf/builtin-top.c:1208
      #9 0x5650d7fdb51b in do_flush util/ordered-events.c:245
      #10 0x5650d7fdb51b in __ordered_events__flush util/ordered-events.c:324
      #11 0x5650d7ced743 in process_thread /home/matt/src/linux/tools/perf/builtin-top.c:1120
      #12 0x7f757ef1f133 in start_thread nptl/pthread_create.c:442
      #13 0x7f757ef9f7db in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

When updating hist maps it's also necessary to update the hist symbol
reference because the old one gets freed in map__put().

While this bug was probably introduced with 5c24b67 ("perf
tools: Replace map->referenced & maps->removed_maps with map->refcnt"),
the symbol objects were leaked until c087e94 ("perf machine:
Fix refcount usage when processing PERF_RECORD_KSYMBOL") was merged so
the bug was masked.

Fixes: c087e94 ("perf machine: Fix refcount usage when processing PERF_RECORD_KSYMBOL")
Reported-by: Yunzhao Li <[email protected]>
Signed-off-by: Matt Fleming (Cloudflare) <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: [email protected]
Cc: Namhyung Kim <[email protected]>
Cc: Riccardo Mancini <[email protected]>
Cc: [email protected] # v5.13+
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 23, 2024
The fields in the hist_entry are filled on-demand which means they only
have meaningful values when relevant sort keys are used.

So if neither of 'dso' nor 'sym' sort keys are used, the map/symbols in
the hist entry can be garbage.  So it shouldn't access it
unconditionally.

I got a segfault, when I wanted to see cgroup profiles.

  $ sudo perf record -a --all-cgroups --synth=cgroup true

  $ sudo perf report -s cgroup

  Program received signal SIGSEGV, Segmentation fault.
  0x00005555557a8d90 in map__dso (map=0x0) at util/map.h:48
  48		return RC_CHK_ACCESS(map)->dso;
  (gdb) bt
  #0  0x00005555557a8d90 in map__dso (map=0x0) at util/map.h:48
  #1  0x00005555557aa39b in map__load (map=0x0) at util/map.c:344
  #2  0x00005555557aa592 in map__find_symbol (map=0x0, addr=140736115941088) at util/map.c:385
  #3  0x00005555557ef000 in hists__findnew_entry (hists=0x555556039d60, entry=0x7fffffffa4c0, al=0x7fffffffa8c0, sample_self=true)
      at util/hist.c:644
  #4  0x00005555557ef61c in __hists__add_entry (hists=0x555556039d60, al=0x7fffffffa8c0, sym_parent=0x0, bi=0x0, mi=0x0, ki=0x0,
      block_info=0x0, sample=0x7fffffffaa90, sample_self=true, ops=0x0) at util/hist.c:761
  #5  0x00005555557ef71f in hists__add_entry (hists=0x555556039d60, al=0x7fffffffa8c0, sym_parent=0x0, bi=0x0, mi=0x0, ki=0x0,
      sample=0x7fffffffaa90, sample_self=true) at util/hist.c:779
  #6  0x00005555557f00fb in iter_add_single_normal_entry (iter=0x7fffffffa900, al=0x7fffffffa8c0) at util/hist.c:1015
  #7  0x00005555557f09a7 in hist_entry_iter__add (iter=0x7fffffffa900, al=0x7fffffffa8c0, max_stack_depth=127, arg=0x7fffffffbce0)
      at util/hist.c:1260
  #8  0x00005555555ba7ce in process_sample_event (tool=0x7fffffffbce0, event=0x7ffff7c14128, sample=0x7fffffffaa90, evsel=0x555556039ad0,
      machine=0x5555560388e8) at builtin-report.c:334
  #9  0x00005555557b30c8 in evlist__deliver_sample (evlist=0x555556039010, tool=0x7fffffffbce0, event=0x7ffff7c14128,
      sample=0x7fffffffaa90, evsel=0x555556039ad0, machine=0x5555560388e8) at util/session.c:1232
  #10 0x00005555557b32bc in machines__deliver_event (machines=0x5555560388e8, evlist=0x555556039010, event=0x7ffff7c14128,
      sample=0x7fffffffaa90, tool=0x7fffffffbce0, file_offset=110888, file_path=0x555556038ff0 "perf.data") at util/session.c:1271
  #11 0x00005555557b3848 in perf_session__deliver_event (session=0x5555560386d0, event=0x7ffff7c14128, tool=0x7fffffffbce0,
      file_offset=110888, file_path=0x555556038ff0 "perf.data") at util/session.c:1354
  #12 0x00005555557affaf in ordered_events__deliver_event (oe=0x555556038e60, event=0x555556135aa0) at util/session.c:132
  #13 0x00005555557bb605 in do_flush (oe=0x555556038e60, show_progress=false) at util/ordered-events.c:245
  #14 0x00005555557bb95c in __ordered_events__flush (oe=0x555556038e60, how=OE_FLUSH__ROUND, timestamp=0) at util/ordered-events.c:324
  #15 0x00005555557bba46 in ordered_events__flush (oe=0x555556038e60, how=OE_FLUSH__ROUND) at util/ordered-events.c:342
  #16 0x00005555557b1b3b in perf_event__process_finished_round (tool=0x7fffffffbce0, event=0x7ffff7c15bb8, oe=0x555556038e60)
      at util/session.c:780
  #17 0x00005555557b3b27 in perf_session__process_user_event (session=0x5555560386d0, event=0x7ffff7c15bb8, file_offset=117688,
      file_path=0x555556038ff0 "perf.data") at util/session.c:1406

As you can see the entry->ms.map was NULL even if he->ms.map has a
value.  This is because 'sym' sort key is not given, so it cannot assume
whether he->ms.sym and entry->ms.sym is the same.  I only checked the
'sym' sort key here as it implies 'dso' behavior (so maps are the same).

Fixes: ac01c8c ("perf hist: Update hist symbol when updating maps")
Signed-off-by: Namhyung Kim <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 25, 2024
When ib_cache_update return an error, we exit ib_cache_setup_one
instantly with no proper cleanup, even though before this we had
already successfully done gid_table_setup_one, that results in
the kernel WARN below.

Do proper cleanup using gid_table_cleanup_one before returning
the err in order to fix the issue.

WARNING: CPU: 4 PID: 922 at drivers/infiniband/core/cache.c:806 gid_table_release_one+0x181/0x1a0
Modules linked in:
CPU: 4 UID: 0 PID: 922 Comm: c_repro Not tainted 6.11.0-rc1+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:gid_table_release_one+0x181/0x1a0
Code: 44 8b 38 75 0c e8 2f cb 34 ff 4d 8b b5 28 05 00 00 e8 23 cb 34 ff 44 89 f9 89 da 4c 89 f6 48 c7 c7 d0 58 14 83 e8 4f de 21 ff <0f> 0b 4c 8b 75 30 e9 54 ff ff ff 48 8    3 c4 10 5b 5d 41 5c 41 5d 41
RSP: 0018:ffffc90002b835b0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff811c8527
RDX: 0000000000000000 RSI: ffffffff811c8534 RDI: 0000000000000001
RBP: ffff8881011b3d00 R08: ffff88810b3abe00 R09: 205d303839303631
R10: 666572207972746e R11: 72746e6520444947 R12: 0000000000000001
R13: ffff888106390000 R14: ffff8881011f2110 R15: 0000000000000001
FS:  00007fecc3b70800(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000340 CR3: 000000010435a001 CR4: 00000000003706b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? show_regs+0x94/0xa0
 ? __warn+0x9e/0x1c0
 ? gid_table_release_one+0x181/0x1a0
 ? report_bug+0x1f9/0x340
 ? gid_table_release_one+0x181/0x1a0
 ? handle_bug+0xa2/0x110
 ? exc_invalid_op+0x31/0xa0
 ? asm_exc_invalid_op+0x16/0x20
 ? __warn_printk+0xc7/0x180
 ? __warn_printk+0xd4/0x180
 ? gid_table_release_one+0x181/0x1a0
 ib_device_release+0x71/0xe0
 ? __pfx_ib_device_release+0x10/0x10
 device_release+0x44/0xd0
 kobject_put+0x135/0x3d0
 put_device+0x20/0x30
 rxe_net_add+0x7d/0xa0
 rxe_newlink+0xd7/0x190
 nldev_newlink+0x1b0/0x2a0
 ? __pfx_nldev_newlink+0x10/0x10
 rdma_nl_rcv_msg+0x1ad/0x2e0
 rdma_nl_rcv_skb.constprop.0+0x176/0x210
 netlink_unicast+0x2de/0x400
 netlink_sendmsg+0x306/0x660
 __sock_sendmsg+0x110/0x120
 ____sys_sendmsg+0x30e/0x390
 ___sys_sendmsg+0x9b/0xf0
 ? kstrtouint+0x6e/0xa0
 ? kstrtouint_from_user+0x7c/0xb0
 ? get_pid_task+0xb0/0xd0
 ? proc_fail_nth_write+0x5b/0x140
 ? __fget_light+0x9a/0x200
 ? preempt_count_add+0x47/0xa0
 __sys_sendmsg+0x61/0xd0
 do_syscall_64+0x50/0x110
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 1901b91 ("IB/core: Fix potential NULL pointer dereference in pkey cache")
Signed-off-by: Patrisious Haddad <[email protected]>
Reviewed-by: Maher Sanalla <[email protected]>
Link: https://patch.msgid.link/79137687d829899b0b1c9835fcb4b258004c439a.1725273354.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 30, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

v2: with kdoc fixes per Paolo Abeni.

The following patchset contains Netfilter fixes for net:

Patch #1 and #2 handle an esoteric scenario: Given two tasks sending UDP
packets to one another, two packets of the same flow in each direction
handled by different CPUs that result in two conntrack objects in NEW
state, where reply packet loses race. Then, patch #3 adds a testcase for
this scenario. Series from Florian Westphal.

1) NAT engine can falsely detect a port collision if it happens to pick
   up a reply packet as NEW rather than ESTABLISHED. Add extra code to
   detect this and suppress port reallocation in this case.

2) To complete the clash resolution in the reply direction, extend conntrack
   logic to detect clashing conntrack in the reply direction to existing entry.

3) Adds a test case.

Then, an assorted list of fixes follow:

4) Add a selftest for tproxy, from Antonio Ojea.

5) Guard ctnetlink_*_size() functions under
   #if defined(CONFIG_NETFILTER_NETLINK_GLUE_CT) || defined(CONFIG_NF_CONNTRACK_EVENTS)
   From Andy Shevchenko.

6) Use -m socket --transparent in iptables tproxy documentation.
   From XIE Zhibang.

7) Call kfree_rcu() when releasing flowtable hooks to address race with
   netlink dump path, from Phil Sutter.

8) Fix compilation warning in nf_reject with CONFIG_BRIDGE_NETFILTER=n.
   From Simon Horman.

9) Guard ctnetlink_label_size() under CONFIG_NF_CONNTRACK_EVENTS which
   is its only user, to address a compilation warning. From Simon Horman.

10) Use rcu-protected list iteration over basechain hooks from netlink
    dump path.

11) Fix memcg for nf_tables, use GFP_KERNEL_ACCOUNT is not complete.

12) Remove old nfqueue conntrack clash resolution. Instead trying to
    use same destination address consistently which requires double DNAT,
    use the existing clash resolution which allows clashing packets
    go through with different destination. Antonio Ojea originally
    reported an issue from the postrouting chain, I proposed a fix:
    https://lore.kernel.org/netfilter-devel/ZuwSwAqKgCB2a51-@calendula/T/
    which he reported it did not work for him.

13) Adds a selftest for patch 12.

14) Fixes ipvs.sh selftest.

netfilter pull request 24-09-26

* tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: netfilter: Avoid hanging ipvs.sh
  kselftest: add test for nfqueue induced conntrack race
  netfilter: nfnetlink_queue: remove old clash resolution logic
  netfilter: nf_tables: missing objects with no memcg accounting
  netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path
  netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS
  netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n
  netfilter: nf_tables: Keep deleted flowtable hooks until after RCU
  docs: tproxy: ignore non-transparent sockets in iptables
  netfilter: ctnetlink: Guard possible unused functions
  selftests: netfilter: nft_tproxy.sh: add tcp tests
  selftests: netfilter: add reverse-clash resolution test case
  netfilter: conntrack: add clash resolution for reverse collisions
  netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 4, 2024
The following calculation used in coalesced_mmio_has_room() to check
whether the ring buffer is full is wrong and results in premature exits if
the start of the valid entries is in the first half of the ring buffer.

  avail = (ring->first - last - 1) % KVM_COALESCED_MMIO_MAX;
  if (avail == 0)
	  /* full */

Because negative values are handled using two's complement, and KVM
computes the result as an unsigned value, the above will get a false
positive if "first < last" and the ring is half-full.

The above might have worked as expected in python for example:
  >>> (-86) % 170
  84

However it doesn't work the same way in C.

  printf("avail: %d\n", (-86) % 170);
  printf("avail: %u\n", (-86) % 170);
  printf("avail: %u\n", (-86u) % 170u);

Using gcc-11 these print:

  avail: -86
  avail: 4294967210
  avail: 0

For illustration purposes, given a 4-bit integer and a ring size of 0xA
(unsigned), 0xA == 0x1010 == -6, and thus (-6u % 0xA) == 0.

Fix the calculation and allow all but one entries in the buffer to be
used as originally intended.

Note, KVM's behavior is self-healing to some extent, as KVM will allow the
entire buffer to be used if ring->first is beyond the halfway point.  In
other words, in the unlikely scenario that a use case benefits from being
able to coalesce more than 86 entries at once, KVM will still provide such
behavior, sometimes.

Note #2, the % operator in C is not the modulo operator but the remainder
operator. Modulo and remainder operators differ with respect to negative
values.  But, the relevant values in KVM are all unsigned, so it's a moot
point in this case anyway.

Note #3, this is almost a pure revert of the buggy commit, plus a
READ_ONCE() to provide additional safety.  Thue buggy commit justified the
change with "it paves the way for making this function lockless", but it's
not at all clear what was intended, nor is there any evidence that the
buggy code was somehow safer.  (a) the fields in question were already
accessed locklessly, from the perspective that they could be modified by
userspace at any time, and (b) the lock guarding the ring itself was
changed, but never dropped, i.e. whatever lockless scheme (SRCU?) was
planned never landed.

Fixes: 105f8d4 ("KVM: Calculate available entries in coalesced mmio ring")
Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Paul Durrant <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: rework changelog to clarify behavior, call out weirdness of buggy commit]
Signed-off-by: Sean Christopherson <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 4, 2024
Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock
on x86 due to a chain of locks and SRCU synchronizations.  Translating the
below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on
CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the
fairness of r/w semaphores).

    CPU0                     CPU1                     CPU2
1   lock(&kvm->slots_lock);
2                                                     lock(&vcpu->mutex);
3                                                     lock(&kvm->srcu);
4                            lock(cpu_hotplug_lock);
5                            lock(kvm_lock);
6                            lock(&kvm->slots_lock);
7                                                     lock(cpu_hotplug_lock);
8   sync(&kvm->srcu);

Note, there are likely more potential deadlocks in KVM x86, e.g. the same
pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with
__kvmclock_cpufreq_notifier():

  cpuhp_cpufreq_online()
  |
  -> cpufreq_online()
     |
     -> cpufreq_gov_performance_limits()
        |
        -> __cpufreq_driver_target()
           |
           -> __target_index()
              |
              -> cpufreq_freq_transition_begin()
                 |
                 -> cpufreq_notify_transition()
                    |
                    -> ... __kvmclock_cpufreq_notifier()

But, actually triggering such deadlocks is beyond rare due to the
combination of dependencies and timings involved.  E.g. the cpufreq
notifier is only used on older CPUs without a constant TSC, mucking with
the NX hugepage mitigation while VMs are running is very uncommon, and
doing so while also onlining/offlining a CPU (necessary to generate
contention on cpu_hotplug_lock) would be even more unusual.

The most robust solution to the general cpu_hotplug_lock issue is likely
to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq
notifier doesn't to take kvm_lock.  For now, settle for fixing the most
blatant deadlock, as switching to an RCU-protected list is a much more
involved change, but add a comment in locking.rst to call out that care
needs to be taken when walking holding kvm_lock and walking vm_list.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S         O
  ------------------------------------------------------
  tee/35048 is trying to acquire lock:
  ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]

  but task is already holding lock:
  ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]

  which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

  -> #3 (kvm_lock){+.+.}-{3:3}:
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         kvm_dev_ioctl+0x4fb/0xe50 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #2 (cpu_hotplug_lock){++++}-{0:0}:
         cpus_read_lock+0x2e/0xb0
         static_key_slow_inc+0x16/0x30
         kvm_lapic_set_base+0x6a/0x1c0 [kvm]
         kvm_set_apic_base+0x8f/0xe0 [kvm]
         kvm_set_msr_common+0x9ae/0xf80 [kvm]
         vmx_set_msr+0xa54/0xbe0 [kvm_intel]
         __kvm_set_msr+0xb6/0x1a0 [kvm]
         kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm]
         kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #1 (&kvm->srcu){.+.+}-{0:0}:
         __synchronize_srcu+0x44/0x1a0
         synchronize_srcu_expedited+0x21/0x30
         kvm_swap_active_memslots+0x110/0x1c0 [kvm]
         kvm_set_memslot+0x360/0x620 [kvm]
         __kvm_set_memory_region+0x27b/0x300 [kvm]
         kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
         kvm_vm_ioctl+0x295/0x650 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #0 (&kvm->slots_lock){+.+.}-{3:3}:
         __lock_acquire+0x15ef/0x2e30
         lock_acquire+0xe0/0x260
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         set_nx_huge_pages+0x179/0x1e0 [kvm]
         param_attr_store+0x93/0x100
         module_attr_store+0x22/0x40
         sysfs_kf_write+0x81/0xb0
         kernfs_fop_write_iter+0x133/0x1d0
         vfs_write+0x28d/0x380
         ksys_write+0x70/0xe0
         __x64_sys_write+0x1f/0x30
         x64_sys_call+0x281b/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: Chao Gao <[email protected]>
Fixes: 0bf5049 ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
Cc: [email protected]
Reviewed-by: Kai Huang <[email protected]>
Acked-by: Kai Huang <[email protected]>
Tested-by: Farrah Chen <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 15, 2024
Tariq Toukan says:

====================
net/mlx5: hw counters refactor

This is a patchset re-post, see:
https://lore.kernel.org/[email protected]

In this patchset, Cosmin refactors hw counters and solves perf scaling
issue.

Series generated against:
commit c824deb ("cxgb4: clip_tbl: Fix spelling mistake "wont" -> "won't"")

HW counters are central to mlx5 driver operations. They are hardware
objects created and used alongside most steering operations, and queried
from a variety of places. Most counters are queried in bulk from a
periodic task in fs_counters.c.

Counter performance is important and as such, a variety of improvements
have been done over the years. Currently, counters are allocated from
pools, which are bulk allocated to amortize the cost of firmware
commands. Counters are managed through an IDR, a doubly linked list and
two atomic single linked lists. Adding/removing counters is a complex
dance between user contexts requesting it and the mlx5_fc_stats_work
task which does most of the work.

Under high load (e.g. from connection tracking flow insertion/deletion),
the counter code becomes a bottleneck, as seen on flame graphs. Whenever
a counter is deleted, it gets added to a list and the wq task is
scheduled to run immediately to actually delete it. This is done via
mod_delayed_work which uses an internal spinlock. In some tests, waiting
for this spinlock took up to 66% of all samples.

This series refactors the counter code to use a more straight-forward
approach, avoiding the mod_delayed_work problem and making the code
easier to understand. For that:

- patch #1 moves counters data structs to a more appropriate place.
- patch #2 simplifies the bulk query allocation scheme by using vmalloc.
- patch #3 replaces the IDR+3 lists with an xarray. This is the main
  patch of the series, solving the spinlock congestion issue.
- patch #4 removes an unnecessary cacheline alignment causing a lot of
  memory to be wasted.
- patches #5 and #6 are small cleanups enabled by the refactoring.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 15, 2024
Edward Cree says:

====================
sfc: per-queue stats

This series implements the netdev_stat_ops interface for per-queue
 statistics in the sfc driver, partly using existing counters that
 were originally added for ethtool -S output.

Changed in v4:
* remove RFC tags

Changed in v3:
* make TX stats count completions rather than enqueues
* add new patch #4 to account for XDP TX separately from netdev
  traffic and include it in base_stats
* move the tx_queue->old_* members out of the fastpath cachelines
* note on patch #6 that our hw_gso stats still count enqueues
* RFC since net-next is closed right now

Changed in v2:
* exclude (dedicated) XDP TXQ stats from per-queue TX stats
* explain patch #3 better
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 15, 2024
Daniel Machon says:

====================
net: sparx5: prepare for lan969x switch driver

== Description:

This series is the first of a multi-part series, that prepares and adds
support for the new lan969x switch driver.

The upstreaming efforts is split into multiple series (might change a
bit as we go along):

    1) Prepare the Sparx5 driver for lan969x (this series)
    2) Add support lan969x (same basic features as Sparx5 provides +
       RGMII, excl.  FDMA and VCAP)
    3) Add support for lan969x FDMA
    4) Add support for lan969x VCAP

== Lan969x in short:

The lan969x Ethernet switch family [1] provides a rich set of
switching features and port configurations (up to 30 ports) from 10Mbps
to 10Gbps, with support for RGMII, SGMII, QSGMII, USGMII, and USXGMII,
ideal for industrial & process automation infrastructure applications,
transport, grid automation, power substation automation, and ring &
intra-ring topologies. The LAN969x family is hardware and software
compatible and scalable supporting 46Gbps to 102Gbps switch bandwidths.

== Preparing Sparx5 for lan969x:

The lan969x switch chip reuses many of the IP's of the Sparx5 switch
chip, therefore it has been decided to add support through the existing
Sparx5 driver, in order to avoid a bunch of duplicate code. However, in
order to reuse the Sparx5 switch driver, we have to introduce some
mechanisms to handle the chip differences that are there.  These
mechanisms are:

    - Platform match data to contain all the differences that needs to
      be handled (constants, ops etc.)

    - Register macro indirection layer so that we can reuse the existing
      register macros.

    - Function for branching out on platform type where required.

In some places we ops out functions and in other places we branch on the
chip type. Exactly when we choose one over the other, is an estimate in
each case.

After this series is applied, the Sparx5 driver will be prepared for
lan969x and still function exactly as before.

== Patch breakdown:

Patch #1        adds private match data

Patch #2        adds register macro indirection layer

Patch #3-#4     does some preparation work

Patch #5-#7     adds chip constants and updates the code to use them

Patch #8-#13    adds and uses ops for handling functions differently on the
                two platforms.

Patch #14       adds and uses a macro for branching out on the chip type.

Patch #15 (NEW) redefines macros for internal ports and PGID's.

[1] https://www.microchip.com/en-us/product/lan9698

To: David S. Miller <[email protected]>
To: Eric Dumazet <[email protected]>
To: Jakub Kicinski <[email protected]>
To: Paolo Abeni <[email protected]>
To: Lars Povlsen <[email protected]>
To: Steen Hegelund <[email protected]>
To: [email protected]
To: [email protected]
To: [email protected]
To: Richard Cochran <[email protected]>
To: [email protected]
To: [email protected]
To: [email protected]
To: [email protected]
To: [email protected]
To: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Signed-off-by: Daniel Machon <[email protected]>
====================

Link: https://patch.msgid.link/20241004-b4-sparx5-lan969x-switch-driver-v2-0-d3290f581663@microchip.com
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Oct 15, 2024
Eric Dumazet says:

====================
net: remove RTNL from fib_seq_sum()

This series is inspired by a syzbot report showing
rtnl contention and one thread blocked in:

7 locks held by syz-executor/10835:
  #0: ffff888033390420 (sb_writers#8){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2931 [inline]
  #0: ffff888033390420 (sb_writers#8){.+.+}-{0:0}, at: vfs_write+0x224/0xc90 fs/read_write.c:679
  #1: ffff88806df6bc88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x1ea/0x500 fs/kernfs/file.c:325
  #2: ffff888026fcf3c8 (kn->active#50){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x20e/0x500 fs/kernfs/file.c:326
  #3: ffffffff8f56f848 (nsim_bus_dev_list_lock){+.+.}-{3:3}, at: new_device_store+0x1b4/0x890 drivers/net/netdevsim/bus.c:166
  #4: ffff88805e0140e8 (&dev->mutex){....}-{3:3}, at: device_lock include/linux/device.h:1014 [inline]
  #4: ffff88805e0140e8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x8e/0x520 drivers/base/dd.c:1005
  #5: ffff88805c5fb250 (&devlink->lock_key#55){+.+.}-{3:3}, at: nsim_drv_probe+0xcb/0xb80 drivers/net/netdevsim/dev.c:1534
  #6: ffffffff8fcd1748 (rtnl_mutex){+.+.}-{3:3}, at: fib_seq_sum+0x31/0x290 net/core/fib_notifier.c:46
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant