Commit 5b25b13

compudj authored and torvalds committed
sys_membarrier(): system-wide memory barrier (generic, x86)
Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to distribute
the cost of user-space memory barriers asymmetrically by transforming
pairs of memory barriers into pairs consisting of sys_membarrier() and a
compiler barrier. For synchronization primitives that distinguish between
read-side and write-side (e.g. userspace RCU [1], rwlocks), the read-side
can be accelerated significantly by moving the bulk of the memory barrier
overhead to the write-side.

The existing applications of which I am aware that would be improved by
this system call are as follows:

* Through Userspace RCU library (http://urcu.so)
  - DNS server (Knot DNS) https://www.knot-dns.cz/
  - Network sniffer (http://netsniff-ng.org/)
  - Distributed object storage (https://sheepdog.github.io/sheepdog/)
  - User-space tracing (http://lttng.org)
  - Network storage system (https://www.gluster.org/)
  - Virtual routers
    (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
  - Financial software (https://lkml.org/lkml/2015/3/23/189)

Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking. Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().

* Direct users of sys_membarrier
  - core dotnet garbage collector
    (https://github.com/dotnet/coreclr/issues/198)

Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux. They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.

To explain the benefit of this scheme, let's introduce two example
threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each smp_mb()
within Thread A into calls to sys_membarrier() and each smp_mb() within
Thread B into compiler barriers "barrier()".

Before the change, we had, for each pair of smp_mb():

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they do
(2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in ordering
them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program order
thanks to the compiler barrier, will be "upgraded" to full smp_mb() by
synchronize_sched().

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
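The Thread A / Thread B transformation described above can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: the flag names and the functions reader_enter(), reader_exit() and writer_begin() are hypothetical, C11 atomic_signal_fence() stands in for the kernel-style barrier(), and the system call is reached through syscall(2) with the SYS_membarrier number from the C library headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value of MEMBARRIER_CMD_SHARED in the patch's enum membarrier_cmd. */
#define SKETCH_CMD_SHARED (1 << 0)

/* Hypothetical shared flags, one written by each side. */
static atomic_int reader_active;   /* written by Thread B (frequent) */
static atomic_int writer_pending;  /* written by Thread A (non-frequent) */

/* Thread B: store, compiler barrier (formerly smp_mb()), then load. */
static int reader_enter(void)
{
	atomic_store_explicit(&reader_active, 1, memory_order_relaxed);
	atomic_signal_fence(memory_order_seq_cst);	/* stand-in for barrier() */
	return atomic_load_explicit(&writer_pending, memory_order_relaxed);
}

static void reader_exit(void)
{
	atomic_signal_fence(memory_order_seq_cst);	/* stand-in for barrier() */
	atomic_store_explicit(&reader_active, 0, memory_order_relaxed);
}

/*
 * Thread A: store, sys_membarrier() (formerly smp_mb()), then load.
 * Returns 1 or 0 depending on whether a reader is seen as active, or -1
 * if the system call failed. On failure the compiler-barrier-only
 * readers above are NOT safe and a scheme with real barriers is needed.
 */
static int writer_begin(void)
{
	atomic_store_explicit(&writer_pending, 1, memory_order_relaxed);
	if (syscall(SYS_membarrier, SKETCH_CMD_SHARED, 0) != 0)
		return -1;
	return atomic_load_explicit(&reader_active, memory_order_relaxed);
}
```

The reader pays only for a compiler barrier on every call, while the writer absorbs the full cost of the system call, which matches the asymmetry the commit message describes.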
* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes permit
us to remove the memory barrier from the userspace RCU rcu_read_lock() and
rcu_read_unlock() primitives, thus significantly accelerating them. These
memory barriers are replaced by compiler barriers on the read-side, and
all matching memory barriers on the write-side are turned into an
invocation of a memory barrier on all active threads in the process. By
letting the kernel perform this synchronization rather than dumbly sending
a signal to every thread of the process (as we currently do), we diminish
the number of unnecessary wake ups and only issue the memory barriers on
active threads. Non-running threads do not need to execute such a barrier
anyway, because these are implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 2202847 writes
signal-based scheme:          9830061167 reads,    6700 writes
sys_membarrier:               9952759104 reads,     425 writes
sys_membarrier (dyn. check):  7970328887 reads,     425 writes

The dynamic sys_membarrier availability check adds some overhead to the
read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However, this
non-expedited sys_membarrier implementation has a much slower grace period
than signal and memory barrier schemes.

Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries, and
with processes injected into for tracing purposes, for which we cannot
expect that signals will be unused by the application.

An expedited version of this system call can be added later on to speed up
the grace period. Its implementation will likely depend on reading the
cpu_curr()->mm without holding each CPU's rq lock.
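The "dyn. check" row above refers to testing at run time whether sys_membarrier is available. A minimal sketch of what such a check could look like is shown below. It is not liburcu's actual code: the names has_sys_membarrier, sketch_init(), read_side_barrier() and write_side_barrier() are hypothetical, and C11 fences stand in for barrier() and smp_mb().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the patch's enum membarrier_cmd. */
#define SKETCH_CMD_QUERY  0
#define SKETCH_CMD_SHARED (1 << 0)

/*
 * 1 if MEMBARRIER_CMD_SHARED is reported by the kernel, 0 otherwise.
 * Assumed to be set once, before any reader or writer thread starts.
 */
static int has_sys_membarrier;

static void sketch_init(void)
{
	long mask = syscall(SYS_membarrier, SKETCH_CMD_QUERY, 0);

	has_sys_membarrier = (mask > 0) && (mask & SKETCH_CMD_SHARED);
}

/* Read side: the extra branch is the overhead of the dynamic check. */
static inline void read_side_barrier(void)
{
	if (has_sys_membarrier)
		atomic_signal_fence(memory_order_seq_cst);	/* compiler barrier */
	else
		atomic_thread_fence(memory_order_seq_cst);	/* full barrier */
}

/* Write side: must use the heavy barrier whenever readers use the light one. */
static inline void write_side_barrier(void)
{
	if (has_sys_membarrier)
		(void)syscall(SYS_membarrier, SKETCH_CMD_SHARED, 0);
	else
		atomic_thread_fence(memory_order_seq_cst);
}
```

Ignoring the return value on the write side leans on the man page's statement that, for a given command with flags set to 0, the call returns the same value until reboot. Code that cannot rely on that should check the result.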
This patch adds the system call to x86 and to asm-generic.

[1] http://urcu.so

membarrier(2) man page:

MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)

NAME
       membarrier - issue memory barriers on a set of threads

SYNOPSIS
       #include <linux/membarrier.h>

       int membarrier(int cmd, int flags);

DESCRIPTION
       The cmd argument is one of the following:

       MEMBARRIER_CMD_QUERY
              Query the set of supported commands. It returns a bitmask of
              supported commands.

       MEMBARRIER_CMD_SHARED
              Execute a memory barrier on all threads running on the
              system. Upon return from system call, the caller thread is
              ensured that all running threads have passed through a state
              where all memory accesses to user-space addresses match
              program order between entry to and return from the system
              call (non-running threads are de facto in such a state).
              This covers threads from all processes running on the
              system. This command returns 0.

       The flags argument needs to be 0. For future extensions.

       All memory accesses performed in program order from each targeted
       thread are guaranteed to be ordered with respect to
       sys_membarrier(). If we use the semantic "barrier()" to represent a
       compiler barrier forcing memory accesses to be performed in program
       order across the barrier, and smp_mb() to represent explicit memory
       barriers forcing full memory ordering across the barrier, we have
       the following ordering table for each pair of barrier(),
       sys_membarrier() and smp_mb():

       The pair ordering is detailed as (O: ordered, X: not ordered):

                              barrier()   smp_mb() sys_membarrier()
              barrier()          X           X            O
              smp_mb()           X           O            O
              sys_membarrier()   O           O            O

RETURN VALUE
       On success, these system calls return zero. On error, -1 is
       returned, and errno is set appropriately. For a given command, with
       flags argument set to 0, this system call is guaranteed to always
       return the same value until reboot.

ERRORS
       ENOSYS System call is not implemented.

       EINVAL Invalid arguments.

Linux                            2015-04-15                      MEMBARRIER(2)

Signed-off-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Nicholas Miell <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: David Howells <[email protected]>
Cc: Pranith Kumar <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
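The man page above declares int membarrier(int cmd, int flags), but the C library does not necessarily provide a wrapper of that name, so a caller would typically go through syscall(2). The sketch below assumes that approach. sketch_membarrier() and sketch_barrier_all_threads() are hypothetical names, and the command values are copied from the patch rather than taken from an installed header.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Command values as defined by the patch's enum membarrier_cmd. */
enum { SKETCH_CMD_QUERY = 0, SKETCH_CMD_SHARED = 1 << 0 };

/* Thin wrapper matching the SYNOPSIS: returns -1 and sets errno on error. */
static int sketch_membarrier(int cmd, int flags)
{
	return (int)syscall(SYS_membarrier, cmd, flags);
}

/*
 * Issue MEMBARRIER_CMD_SHARED if the kernel reports it as supported.
 * Returns 0 on success or a negative errno value, e.g. -ENOSYS when the
 * system call is not implemented or -EINVAL when the command is missing
 * from the bitmask returned by MEMBARRIER_CMD_QUERY.
 */
static int sketch_barrier_all_threads(void)
{
	int mask = sketch_membarrier(SKETCH_CMD_QUERY, 0);

	if (mask < 0)
		return -errno;
	if (!(mask & SKETCH_CMD_SHARED))
		return -EINVAL;
	if (sketch_membarrier(SKETCH_CMD_SHARED, 0) < 0)
		return -errno;
	return 0;
}
```

Querying first keeps the ENOSYS and EINVAL cases from the ERRORS section distinguishable for the caller before any barrier is attempted.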
1 parent 7c0d35a commit 5b25b13

11 files changed, +151 -1 lines changed


MAINTAINERS

+8
@@ -6789,6 +6789,14 @@ W: http://www.mellanox.com
 Q: http://patchwork.ozlabs.org/project/netdev/list/
 F: drivers/net/ethernet/mellanox/mlxsw/
 
+MEMBARRIER SUPPORT
+M: Mathieu Desnoyers <[email protected]>
+M: "Paul E. McKenney" <[email protected]>
+
+S: Supported
+F: kernel/membarrier.c
+F: include/uapi/linux/membarrier.h
+
 MEMORY MANAGEMENT
 
 W: http://www.linux-mm.org

arch/x86/entry/syscalls/syscall_32.tbl

+1
@@ -381,3 +381,4 @@
 372 i386 recvmsg sys_recvmsg compat_sys_recvmsg
 373 i386 shutdown sys_shutdown
 374 i386 userfaultfd sys_userfaultfd
+375 i386 membarrier sys_membarrier

arch/x86/entry/syscalls/syscall_64.tbl

+1
@@ -330,6 +330,7 @@
 321 common bpf sys_bpf
 322 64 execveat stub_execveat
 323 common userfaultfd sys_userfaultfd
+324 common membarrier sys_membarrier
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact

include/linux/syscalls.h

+2
@@ -885,4 +885,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *argv,
 			const char __user *const __user *envp, int flags);
 
+asmlinkage long sys_membarrier(int cmd, int flags);
+
 #endif

include/uapi/asm-generic/unistd.h

+3-1
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
 __SYSCALL(__NR_bpf, sys_bpf)
 #define __NR_execveat 281
 __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_membarrier 282
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283
 
 /*
  * All syscalls below here should go away really,

include/uapi/linux/Kbuild

+1
@@ -252,6 +252,7 @@ header-y += mdio.h
 header-y += media.h
 header-y += media-bus-format.h
 header-y += mei.h
+header-y += membarrier.h
 header-y += memfd.h
 header-y += mempolicy.h
 header-y += meye.h

include/uapi/linux/membarrier.h

+53
@@ -0,0 +1,53 @@
+#ifndef _UAPI_LINUX_MEMBARRIER_H
+#define _UAPI_LINUX_MEMBARRIER_H
+
+/*
+ * linux/membarrier.h
+ *
+ * membarrier system call API
+ *
+ * Copyright (c) 2010, 2015 Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/**
+ * enum membarrier_cmd - membarrier system call command
+ * @MEMBARRIER_CMD_QUERY:   Query the set of supported commands. It returns
+ *                          a bitmask of valid commands.
+ * @MEMBARRIER_CMD_SHARED:  Execute a memory barrier on all running threads.
+ *                          Upon return from system call, the caller thread
+ *                          is ensured that all running threads have passed
+ *                          through a state where all memory accesses to
+ *                          user-space addresses match program order between
+ *                          entry to and return from the system call
+ *                          (non-running threads are de facto in such a
+ *                          state). This covers threads from all processes
+ *                          running on the system. This command returns 0.
+ *
+ * Command to be passed to the membarrier system call. The commands need to
+ * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
+ * the value 0.
+ */
+enum membarrier_cmd {
+	MEMBARRIER_CMD_QUERY = 0,
+	MEMBARRIER_CMD_SHARED = (1 << 0),
+};
+
+#endif /* _UAPI_LINUX_MEMBARRIER_H */
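The header comment requires every command except MEMBARRIER_CMD_QUERY to be a single bit, which is what lets the query result act as a bitmask. A small sketch of how a caller might interpret that mask follows. The helper names are hypothetical and the enum is re-declared locally so the example stands alone.

```c
#include <stdbool.h>

/* Values copied from the patch's enum membarrier_cmd. */
enum sketch_membarrier_cmd {
	SKETCH_CMD_QUERY  = 0,
	SKETCH_CMD_SHARED = (1 << 0),
};

/* True if cmd has exactly one bit set, as required of non-QUERY commands. */
static bool sketch_is_single_bit(int cmd)
{
	return cmd > 0 && (cmd & (cmd - 1)) == 0;
}

/*
 * Interpret a value returned by MEMBARRIER_CMD_QUERY.
 * QUERY itself has the value 0, so it can never appear in the mask; it
 * is treated here as supported whenever the query succeeded (mask >= 0).
 */
static bool sketch_cmd_supported(int query_mask, int cmd)
{
	if (query_mask < 0)
		return false;
	if (cmd == SKETCH_CMD_QUERY)
		return true;
	return sketch_is_single_bit(cmd) && (query_mask & cmd) != 0;
}
```

With only MEMBARRIER_CMD_SHARED defined by this patch, a successful query returns 1, and sketch_cmd_supported(1, SKETCH_CMD_SHARED) is true.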

init/Kconfig

+12
@@ -1602,6 +1602,18 @@ config PCI_QUIRKS
 	  bugs/quirks. Disable this only if your target machine is
 	  unaffected by PCI quirks.
 
+config MEMBARRIER
+	bool "Enable membarrier() system call" if EXPERT
+	default y
+	help
+	  Enable the membarrier() system call that allows issuing memory
+	  barriers across all running threads, which can be used to distribute
+	  the cost of user-space memory barriers asymmetrically by transforming
+	  pairs of memory barriers into pairs consisting of membarrier() and a
+	  compiler barrier.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y

kernel/Makefile

+1
@@ -100,6 +100,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
 

kernel/membarrier.c

+66
@@ -0,0 +1,66 @@
+/*
+ * Copyright (C) 2010, 2015 Mathieu Desnoyers <[email protected]>
+ *
+ * membarrier system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/membarrier.h>
+
+/*
+ * Bitmask made from a "or" of all commands within enum membarrier_cmd,
+ * except MEMBARRIER_CMD_QUERY.
+ */
+#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
+
+/**
+ * sys_membarrier - issue memory barriers on a set of threads
+ * @cmd:   Takes command values defined in enum membarrier_cmd.
+ * @flags: Currently needs to be 0. For future extensions.
+ *
+ * If this system call is not implemented, -ENOSYS is returned. If the
+ * command specified does not exist, or if the command argument is invalid,
+ * this system call returns -EINVAL. For a given command, with flags argument
+ * set to 0, this system call is guaranteed to always return the same value
+ * until reboot.
+ *
+ * All memory accesses performed in program order from each targeted thread
+ * is guaranteed to be ordered with respect to sys_membarrier(). If we use
+ * the semantic "barrier()" to represent a compiler barrier forcing memory
+ * accesses to be performed in program order across the barrier, and
+ * smp_mb() to represent explicit memory barriers forcing full memory
+ * ordering across the barrier, we have the following ordering table for
+ * each pair of barrier(), sys_membarrier() and smp_mb():
+ *
+ * The pair ordering is detailed as (O: ordered, X: not ordered):
+ *
+ *                        barrier()   smp_mb() sys_membarrier()
+ *        barrier()          X           X            O
+ *        smp_mb()           X           O            O
+ *        sys_membarrier()   O           O            O
+ */
+SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
+{
+	if (unlikely(flags))
+		return -EINVAL;
+	switch (cmd) {
+	case MEMBARRIER_CMD_QUERY:
+		return MEMBARRIER_CMD_BITMASK;
+	case MEMBARRIER_CMD_SHARED:
+		if (num_online_cpus() > 1)
+			synchronize_sched();
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
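The control flow of the system call above is small enough to mirror in a user-space model, which can help when reasoning about its return values. The sketch below is not kernel code: model_membarrier() is a hypothetical function, the number of online CPUs is passed in as a parameter, and a counter stands in for synchronize_sched().

```c
#include <errno.h>

/* Values mirrored from enum membarrier_cmd in the patch. */
enum { MODEL_CMD_QUERY = 0, MODEL_CMD_SHARED = 1 << 0 };
#define MODEL_CMD_BITMASK (MODEL_CMD_SHARED)

/* Counts how often the stand-in for synchronize_sched() ran. */
static int model_sync_calls;

static void model_synchronize_sched(void)
{
	model_sync_calls++;
}

/* Same decision structure as SYSCALL_DEFINE2(membarrier, ...) above. */
static long model_membarrier(int cmd, int flags, unsigned int online_cpus)
{
	if (flags)
		return -EINVAL;		/* flags are reserved and must be 0 */
	switch (cmd) {
	case MODEL_CMD_QUERY:
		return MODEL_CMD_BITMASK;
	case MODEL_CMD_SHARED:
		/* With a single CPU there is no other running thread to order. */
		if (online_cpus > 1)
			model_synchronize_sched();
		return 0;
	default:
		return -EINVAL;
	}
}
```

In this model, a query returns the bitmask 1, any non-zero flags value or unknown command yields -EINVAL, and the expensive grace-period wait is skipped on a single-CPU system.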

kernel/sys_ni.c

+3
@@ -245,3 +245,6 @@ cond_syscall(sys_bpf);
 
 /* execveat */
 cond_syscall(sys_execveat);
+
+/* membarrier */
+cond_syscall(sys_membarrier);
