    sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
    Mathieu Desnoyers authored
    Here is an implementation of a new system call, sys_membarrier(), which
    executes a memory barrier on all threads running on the system.  It is
    implemented by calling synchronize_sched().  It can be used to
    distribute the cost of user-space memory barriers asymmetrically by
    transforming pairs of memory barriers into pairs consisting of
    sys_membarrier() and a compiler barrier.  For synchronization primitives
    that distinguish between read-side and write-side (e.g.  userspace RCU
    [1], rwlocks), the read-side can be accelerated significantly by moving
    the bulk of the memory barrier overhead to the write-side.
    The existing applications of which I am aware that would be improved by
    this system call are as follows:
    * Through Userspace RCU library (http://urcu.so)
      - DNS server (Knot DNS) https://www.knot-dns.cz/
      - Network sniffer (http://netsniff-ng.org/)
      - Distributed object storage (https://sheepdog.github.io/sheepdog/)
      - User-space tracing (http://lttng.org)
      - Network storage system (https://www.gluster.org/)
      - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
      - Financial software (https://lkml.org/lkml/2015/3/23/189)
    Those projects use RCU in userspace to increase read-side speed and
    scalability compared to locking.  Especially in the case of RCU used by
    libraries, sys_membarrier can speed up the read-side by moving the bulk of
    the memory barrier cost to synchronize_rcu().
    * Direct users of sys_membarrier
      - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
    Microsoft core dotnet GC developers are planning to use the mprotect()
    side-effect of issuing memory barriers through IPIs as a way to implement
    Windows FlushProcessWriteBuffers() on Linux.  They are referring to
    sys_membarrier in their github thread, specifically stating that
    sys_membarrier() is what they are looking for.
    To explain the benefit of this scheme, let's introduce two example threads:
    Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
    Thread B (frequent, e.g. executing liburcu
    rcu_read_lock()/rcu_read_unlock())
    In a scheme where all smp_mb() in thread A are ordering memory accesses
    with respect to smp_mb() present in Thread B, we can change each
    smp_mb() within Thread A into calls to sys_membarrier() and each
    smp_mb() within Thread B into compiler barriers "barrier()".
    Before the change, we had, for each smp_mb() pair:
    Thread A                    Thread B
    previous mem accesses       previous mem accesses
    smp_mb()                    smp_mb()
    following mem accesses      following mem accesses
    After the change, these pairs become:
    Thread A                    Thread B
    prev mem accesses           prev mem accesses
    sys_membarrier()            barrier()
    follow mem accesses         follow mem accesses
    As we can see, there are two possible scenarios: either Thread B memory
    accesses do not happen concurrently with Thread A accesses (1), or they
    do (2).
    1) Non-concurrent Thread A vs Thread B accesses:
    Thread A                    Thread B
    prev mem accesses
    sys_membarrier()
    follow mem accesses
                                prev mem accesses
                                barrier()
                                follow mem accesses
    In this case, thread B accesses will be weakly ordered. This is OK,
    because at that point, thread A is not particularly interested in
    ordering them with respect to its own accesses.
    2) Concurrent Thread A vs Thread B accesses
    Thread A                    Thread B
    prev mem accesses           prev mem accesses
    sys_membarrier()            barrier()
    follow mem accesses         follow mem accesses
    In this case, thread B accesses, which are ensured to be in program
    order thanks to the compiler barrier, will be "upgraded" to full
    smp_mb() by synchronize_sched().
    * Benchmarks
    On Intel Xeon E5405 (8 cores)
    (one thread is calling sys_membarrier, the other 7 threads are busy
    looping)
    1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
    * User-space user of this system call: Userspace RCU library
    Both the signal-based and the sys_membarrier userspace RCU schemes
    permit us to remove the memory barrier from the userspace RCU
    rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
    accelerating them. These memory barriers are replaced by compiler
    barriers on the read-side, and all matching memory barriers on the
    write-side are turned into an invocation of a memory barrier on all
    active threads in the process. By letting the kernel perform this
    synchronization rather than dumbly sending a signal to every thread of
    the process (as we currently do), we diminish the number of unnecessary
    wake-ups and only issue the memory barriers on active threads.
    Non-running threads do not need to execute such barriers anyway,
    because these are implied by the scheduler context switches.
    Results in liburcu:
    Operations in 10s, 6 readers, 2 writers:
    memory barriers in reader:    1701557485 reads, 2202847 writes
    signal-based scheme:          9830061167 reads,    6700 writes
    sys_membarrier:               9952759104 reads,     425 writes
    sys_membarrier (dyn. check):  7970328887 reads,     425 writes
    The dynamic sys_membarrier availability check adds some overhead to
    the read-side compared to the signal-based scheme, but besides that,
    sys_membarrier slightly outperforms the signal-based scheme. However,
    this non-expedited sys_membarrier implementation has a much slower grace
    period than signal and memory barrier schemes.
    Besides diminishing the number of wake-ups, one major advantage of the
    membarrier system call over the signal-based scheme is that it does not
    need to reserve a signal. This plays much more nicely with libraries,
    and with processes that are injected into for tracing purposes, where
    we cannot expect signals to be unused by the application.
    An expedited version of this system call can be added later on to speed
    up the grace period. Its implementation will likely depend on reading
    the cpu_curr()->mm without holding each CPU's rq lock.
    This patch adds the system call to x86 and to asm-generic.
    [1] http://urcu.so
    membarrier(2) man page:
    MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
    NAME
           membarrier - issue memory barriers on a set of threads
    SYNOPSIS
           #include <linux/membarrier.h>
           int membarrier(int cmd, int flags);
    DESCRIPTION
           The cmd argument is one of the following:
           MEMBARRIER_CMD_QUERY
                  Query  the  set  of  supported commands. It returns a bitmask of
                  supported commands.
           MEMBARRIER_CMD_SHARED
                  Execute a memory barrier on all threads running on  the  system.
                  Upon  return from system call, the caller thread is ensured that
                  all running threads have passed through a state where all memory
                  accesses  to  user-space  addresses  match program order between
                  entry to and return from the system  call  (non-running  threads
                  are de facto in such a state). This covers threads from all
                  processes running on the system.  This command returns 0.
           The flags argument must be 0; it is reserved for future extensions.
           All memory accesses performed  in  program  order  from  each  targeted
           thread are guaranteed to be ordered with respect to sys_membarrier(). If
           we use the semantic "barrier()" to represent a compiler barrier forcing
           memory  accesses  to  be performed in program order across the barrier,
           and smp_mb() to represent explicit memory barriers forcing full  memory
           ordering  across  the barrier, we have the following ordering table for
           each pair of barrier(), sys_membarrier() and smp_mb():
           The pair ordering is detailed as (O: ordered, X: not ordered):
                                  barrier()   smp_mb() sys_membarrier()
                  barrier()          X           X            O
                  smp_mb()           X           O            O
                  sys_membarrier()   O           O            O
    RETURN VALUE
           On success, MEMBARRIER_CMD_QUERY returns a bitmask of supported
           commands and MEMBARRIER_CMD_SHARED returns zero.  On error, -1 is
           returned, and errno is set appropriately. For a given command, with
           flags argument set to 0, this system call is guaranteed to always
           return the same value until reboot.
    ERRORS
           ENOSYS System call is not implemented.
           EINVAL Invalid arguments.
    Linux                             2015-04-15                     MEMBARRIER(2)
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Nicholas Miell <nmiell@comcast.net>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
    Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
    Cc: Stephen Hemminger <stephen@networkplumber.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Pranith Kumar <bobby.prani@gmail.com>
    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Shuah Khan <shuahkh@osg.samsung.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>