  • epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT · b6a515c8
    Jason Baron authored
    
    
    In the current implementation of the EPOLLEXCLUSIVE flag (added for
    4.5-rc1), if epoll waiters create different POLL* sets and register them
    as exclusive against the same target fd, the wakeup path will stop
    waking any further waiters once it finds the first idle waiter.
    This means that waiters could miss wakeups in certain cases.
    
    For example, when we wake up a pipe for reading we do:

        wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM);

    So if one epoll set, epfd, is added to pipe p with POLLIN and a second
    set, epfd2, is added to pipe p with POLLRDNORM, only epfd may receive
    the wakeup, since the current implementation stops after it finds any
    intersection of events with a waiter that is blocked in epoll_wait().
    
    We could potentially address this by requiring all epoll waiters that
    are added to p to pass the same set of POLL* events, i.e. the first
    EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set of POLL*
    flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
    However, I think it might be a somewhat confusing interface as we would
    have to reference count the number of users for that set, and so
    userspace would have to keep track of that count, or we would need a
    more involved interface.  It also adds some shared state that we'd have
    to store somewhere.  I don't think anybody will want to bloat
    __wait_queue_head for this.
    
    I think what we could do instead is to simply restrict EPOLLEXCLUSIVE
    such that it can only be specified with EPOLLIN and/or EPOLLOUT.  So
    that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
    once we hit the first idle waiter that specifies the EPOLLIN bit, since
    any remaining waiters that only have 'POLLOUT' set wouldn't need to be
    woken.  Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
    bit set and not 'POLLIN'.  If both 'POLLOUT' and 'POLLIN' are set in the
    wake bit set (there is at least one example of this I saw in fs/pipe.c),
    then we just wake the entire exclusive list.  Having both 'POLLOUT' and
    'POLLIN' set should not be on any performance critical path, so I
    think that's ok (in fs/pipe.c it's in pipe_release()).  We also continue
    to include EPOLLERR and EPOLLHUP by default in any exclusive set.  Thus,
    the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
    so.
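
    The resulting wake policy can be sketched as a small userspace model.
    This is an illustration under stated assumptions rather than the
    kernel implementation: INOUT_BITS, stops_walk() and wake_exclusive()
    are hypothetical names, all waiters are assumed idle, and the handling
    of a wakeup key with neither POLLIN nor POLLOUT (stop at the first
    waiter) is an assumption of the sketch, since the text above does not
    cover that case.

```c
#include <poll.h>
#include <stddef.h>

#define INOUT_BITS (POLLIN | POLLOUT)  /* hypothetical name */

/* Returns 1 if waking a waiter with these events may end the walk. */
static int stops_walk(unsigned waiter_events, unsigned key)
{
    switch (key & INOUT_BITS) {
    case POLLIN:
        return (waiter_events & POLLIN) != 0;
    case POLLOUT:
        return (waiter_events & POLLOUT) != 0;
    case 0:
        return 1;  /* assumption: any one waiter can take ERR/HUP */
    default:
        return 0;  /* both bits set: wake the entire exclusive list */
    }
}

/*
 * Walk the exclusive waiters in order. woken[i] is set to 1 for each
 * waiter the model wakes. Returns the number of waiters woken.
 */
static size_t wake_exclusive(const unsigned *events, size_t n,
                             unsigned key, int *woken)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        woken[i] = 0;

    for (size_t i = 0; i < n; i++) {
        /* POLLERR and POLLHUP are implicitly part of every set. */
        unsigned mask = events[i] | POLLERR | POLLHUP;

        if (!(mask & key))
            continue;  /* waiter is not interested in this wakeup */
        woken[i] = 1;
        count++;
        if (stops_walk(events[i], key))
            break;
    }
    return count;
}
```

    For waiters { POLLOUT, POLLIN, POLLIN }, a POLLIN wakeup skips the
    first waiter and stops at the second, while a POLLIN | POLLOUT wakeup
    reaches all three.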
    
    Since epoll waiters may be interested in other events as well besides
    EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
    doing a 'dup' call on the target fd and adding that as one normally
    would with EPOLL_CTL_ADD.  Since I think that the POLLIN and POLLOUT
    events are what we are interested in balancing, I think that the 'dup'
    thing could perhaps be added to only one of the waiter threads.
    However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
    sufficient for the majority of use-cases.
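
    A minimal userspace sketch of the 'dup' approach, assuming a kernel
    and libc that support EPOLLEXCLUSIVE.  The helper name
    add_exclusive_with_extra() is hypothetical, and the fallback #define
    is only there for older headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)  /* fallback for older libc headers */
#endif

/*
 * Hypothetical helper: register fd on epfd as an exclusive EPOLLIN
 * waiter, then register a dup of fd, non-exclusively, for extra events.
 * Returns the dup'ed fd on success, or -1 on failure.
 */
static int add_exclusive_with_extra(int epfd, int fd, uint32_t extra_events)
{
    struct epoll_event ev = { 0 };
    int dupfd;

    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
        return -1;

    dupfd = dup(fd);
    if (dupfd < 0) {
        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
        return -1;
    }

    /* The dup is added as one normally would, without EPOLLEXCLUSIVE. */
    ev.events = extra_events;
    ev.data.fd = dupfd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, dupfd, &ev) < 0) {
        close(dupfd);
        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
        return -1;
    }
    return dupfd;
}
```

    A caller might pass EPOLLPRI as extra_events from a single waiter
    thread, leaving the other threads with only the exclusive
    registration.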
    
    Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
    among multiple epfds, where between 1 and n of the epfds may receive an
    event, it does not satisfy the semantics of EPOLLONESHOT where only 1
    epfd would get an event.  Thus, it is not allowed to be specified in
    conjunction with EPOLLEXCLUSIVE.
    
    EPOLL_CTL_MOD is also not allowed if the fd was previously added as
    EPOLLEXCLUSIVE.  With the limited number of flags, modifying an
    exclusive registration does not seem as interesting, but this could be
    relaxed at some later point.
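
    The flag restrictions above can be summarized as a userspace model of
    the validation.  This is a sketch, not the kernel code:
    EXCLUSIVE_OK_BITS and exclusive_request_ok() are hypothetical names.
    Allowing EPOLLET and EPOLLWAKEUP, and rejecting EPOLL_CTL_MOD when the
    new events include EPOLLEXCLUSIVE, follow the epoll_ctl(2) man page
    rather than the text above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)  /* fallback for older libc headers */
#endif

/* Hypothetical name: bits accepted together with EPOLLEXCLUSIVE. */
#define EXCLUSIVE_OK_BITS \
    (EPOLLIN | EPOLLOUT | EPOLLERR | EPOLLHUP | \
     EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE)

/*
 * Model of the checks: returns true if the request would be accepted,
 * false where the kernel would be expected to fail with EINVAL.
 * was_added_exclusive says whether the fd is already registered with
 * EPOLLEXCLUSIVE on this epfd.
 */
static bool exclusive_request_ok(int op, uint32_t events,
                                 bool was_added_exclusive)
{
    if (op == EPOLL_CTL_MOD)
        return !was_added_exclusive && !(events & EPOLLEXCLUSIVE);

    if (op == EPOLL_CTL_ADD && (events & EPOLLEXCLUSIVE))
        return (events & ~(uint32_t)EXCLUSIVE_OK_BITS) == 0;

    return true;  /* non-exclusive requests are not restricted here */
}
```

    Under this model, EPOLLEXCLUSIVE | EPOLLIN is accepted on
    EPOLL_CTL_ADD, while adding EPOLLONESHOT or EPOLLRDNORM to that mask
    is rejected.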
    
    Signed-off-by: Jason Baron <jbaron@akamai.com>
    Tested-by: Madars Vitolins <m@silodev.com>
    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Al Viro <viro@ftp.linux.org.uk>
    Cc: Eric Wong <normalperson@yhbt.net>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Hagen Paul Pfeifer <hagen@jauu.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>