1. 19 May, 2016 1 commit
  2. 17 Mar, 2016 1 commit
    • John Stultz's avatar
      timer: convert timer_slack_ns from unsigned long to u64 · da8b44d5
      John Stultz authored
      This patchset introduces a /proc/<pid>/timerslack_ns interface which
      would allow controlling processes to be able to set the timerslack value
      on other processes in order to save power by avoiding wakeups (Something
      Android currently does via out-of-tree patches).
      
      The first patch tries to fix the internal timer_slack_ns usage which was
      defined as a long, which limits the slack range to ~4 seconds on 32bit
      systems.  It converts it to a u64, which provides the same basically
      unlimited slack (500 years) on both 32bit and 64bit machines.
      
      The second patch introduces the /proc/<pid>/timerslack_ns interface
      which allows the full 64bit slack range for a task to be read or set on
      both 32bit and 64bit machines.
      
      With these two patches, on a 32bit machine, after setting the slack on
      bash to 10 seconds:
      
      $ time sleep 1
      
      real    0m10.747s
      user    0m0.001s
      sys     0m0.005s
      
      The first patch is a little ugly, since I had to chase the slack delta
      arguments through a number of functions converting them to u64s.  Let me
      know if it makes sense to break that up more or not.
      
      Other than that things are fairly straightforward.
      
      This patch (of 2):
      
      The timer_slack_ns value in the task struct is currently a unsigned
      long.  This means that on 32bit applications, the maximum slack is just
      over 4 seconds.  However, on 64bit machines, its much much larger (~500
      years).
      
      This disparity could make application development a little (as well as
      the default_slack) to a u64.  This means both 32bit and 64bit systems
      have the same effective internal slack range.
      
      Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
      the interface as a unsigned long, so we preserve that limitation on
      32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
      long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
      actually larger then what can be stored by an unsigned long.
      
      This patch also modifies hrtimer functions which specified the slack
      delta as a unsigned long.
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      da8b44d5
  3. 05 Feb, 2016 1 commit
    • Jason Baron's avatar
      epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT · b6a515c8
      Jason Baron authored
      In the current implementation of the EPOLLEXCLUSIVE flag (added for
      4.5-rc1), if epoll waiters create different POLL* sets and register them
      as exclusive against the same target fd, the current implementation will
      stop waking any further waiters once it finds the first idle waiter.
      This means that waiters could miss wakeups in certain cases.
      
      For example, when we wake up a pipe for reading we do:
      wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if
      one epoll set or epfd is added to pipe p with POLLIN and a second set
      epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the
      wakeup since the current implementation will stop after it finds any
      intersection of events with a waiter that is blocked in epoll_wait().
      
      We could potentially address this by requiring all epoll waiters that
      are added to p be required to pass the same set of POLL* events.  IE the
      first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL*
      flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
      However, I think it might be somewhat confusing interface as we would
      have to reference count the number of users for that set, and so
      userspace would have to keep track of that count, or we would need a
      more involved interface.  It also adds some shared state that we'd have
      store somewhere.  I don't think anybody will want to bloat
      __wait_queue_head for this.
      
      I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE
      such that it can only be specified with EPOLLIN and/or EPOLLOUT.  So
      that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
      once we hit the first idle waiter that specifies the EPOLLIN bit, since
      any remaining waiters that only have 'POLLOUT' set wouldn't need to be
      woken.  Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
      bit set and not 'POLLIN'.  If both 'POLLOUT' and 'POLLIN' are set in the
      wake bit set (there is at least one example of this I saw in fs/pipe.c),
      then we just wake the entire exclusive list.  Having both 'POLLOUT' and
      'POLLIN' both set should not be on any performance critical path, so I
      think that's ok (in fs/pipe.c its in pipe_release()).  We also continue
      to include EPOLLERR and EPOLLHUP by default in any exclusive set.  Thus,
      the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
      so.
      
      Since epoll waiters may be interested in other events as well besides
      EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
      doing a 'dup' call on the target fd and adding that as one normally
      would with EPOLL_CTL_ADD.  Since I think that the POLLIN and POLLOUT
      events are what we are interest in balancing, I think that the 'dup'
      thing could perhaps be added to only one of the waiter threads.
      However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
      sufficient for the majority of use-cases.
      
      Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
      among multiple epfds, where between 1 and n of the epfds may receive an
      event, it does not satisfy the semantics of EPOLLONESHOT where only 1
      epfd would get an event.  Thus, it is not allowed to be specified in
      conjunction with EPOLLEXCLUSIVE.
      
      EPOLL_CTL_MOD is also not allowed if the fd was previously added as
      EPOLLEXCLUSIVE.  It seems with the limited number of flags to not be as
      interesting, but this could be relaxed at some further point.
      Signed-off-by: default avatarJason Baron <jbaron@akamai.com>
      Tested-by: default avatarMadars Vitolins <m@silodev.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6a515c8
  4. 20 Jan, 2016 1 commit
    • Jason Baron's avatar
      epoll: add EPOLLEXCLUSIVE flag · df0108c5
      Jason Baron authored
      Currently, epoll file descriptors or epfds (the fd returned from
      epoll_create[1]()) that are added to a shared wakeup source are always
      added in a non-exclusive manner.  This means that when we have multiple
      epfds attached to a shared fd source they are all woken up.  This creates
      thundering herd type behavior.
      
      Introduce a new 'EPOLLEXCLUSIVE' flag that can be passed as part of the
      'event' argument during an epoll_ctl() EPOLL_CTL_ADD operation.  This new
      flag allows for exclusive wakeups when there are multiple epfds attached
      to a shared fd event source.
      
      The implementation walks the list of exclusive waiters, and queues an
      event to each epfd, until it finds the first waiter that has threads
      blocked on it via epoll_wait().  The idea is to search for threads which
      are idle and ready to process the wakeup events.  Thus, we queue an event
      to at least 1 epfd, but may still potentially queue an event to all epfds
      that are attached to the shared fd source.
      
      Performance testing was done by Madars Vitolins using a modified version
      of Enduro/X.  The use of the 'EPOLLEXCLUSIVE' flag reduce the length of
      this particular workload from 860s down to 24s.
      
      Sample epoll_clt text:
      
      EPOLLEXCLUSIVE
      
        Sets an exclusive wakeup mode for the epfd file descriptor that is
        being attached to the target file descriptor, fd.  Thus, when an event
        occurs and multiple epfd file descriptors are attached to the same
        target file using EPOLLEXCLUSIVE, one or more epfds will receive an
        event with epoll_wait(2).  The default in this scenario (when
        EPOLLEXCLUSIVE is not set) is for all epfds to receive an event.
        EPOLLEXCLUSIVE may only be specified with the op EPOLL_CTL_ADD.
      Signed-off-by: default avatarJason Baron <jbaron@akamai.com>
      Tested-by: default avatarMadars Vitolins <m@silodev.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df0108c5
  5. 13 Feb, 2015 1 commit
  6. 05 Nov, 2014 1 commit
  7. 10 Sep, 2014 1 commit
    • Nicolas Iooss's avatar
      eventpoll: fix uninitialized variable in epoll_ctl · c680e41b
      Nicolas Iooss authored
      When calling epoll_ctl with operation EPOLL_CTL_DEL, structure epds is
      not initialized but ep_take_care_of_epollwakeup reads its event field.
      When this unintialized field has EPOLLWAKEUP bit set, a capability check
      is done for CAP_BLOCK_SUSPEND in ep_take_care_of_epollwakeup.  This
      produces unexpected messages in the audit log, such as (on a system
      running SELinux):
      
          type=AVC msg=audit(1408212798.866:410): avc:  denied
          { block_suspend } for  pid=7754 comm="dbus-daemon" capability=36
          scontext=unconfined_u:unconfined_r:unconfined_t
          tcontext=unconfined_u:unconfined_r:unconfined_t
          tclass=capability2 permissive=1
      
          type=SYSCALL msg=audit(1408212798.866:410): arch=c000003e syscall=233
          success=yes exit=0 a0=3 a1=2 a2=9 a3=7fffd4d66ec0 items=0 ppid=1
          pid=7754 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
          fsgid=0 tty=(none) ses=3 comm="dbus-daemon"
          exe="/usr/bin/dbus-daemon"
          subj=unconfined_u:unconfined_r:unconfined_t key=(null)
      
      ("arch=c000003e syscall=233 a1=2" means "epoll_ctl(op=EPOLL_CTL_DEL)")
      
      Remove use of epds in epoll_ctl when op == EPOLL_CTL_DEL.
      
      Fixes: 4d7e30d9 ("epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready")
      Signed-off-by: default avatarNicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c680e41b
  8. 16 Jun, 2014 1 commit
  9. 06 Jun, 2014 1 commit
  10. 02 Jan, 2014 1 commit
    • Jason Baron's avatar
      epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL · 4ff36ee9
      Jason Baron authored
      The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock.
      That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with
      epoll_ctl(b, EPOLL_CTL_DEL, a, x).  The deadlock was introduced with
      commmit 67347fe4 ("epoll: do not take global 'epmutex' for simple
      topologies").
      
      The acquistion of the ep->mtx for the destination 'ep' was added such
      that a concurrent EPOLL_CTL_ADD operation would see the correct state of
      the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links')
      
      However, by simply not acquiring the lock, we do not serialize behind
      the ep->mtx from the add path, and thus may perform a full path check
      when if we had waited a little longer it may not have been necessary.
      However, this is a transient state, and performing the full loop
      checking in this case is not harmful.
      
      The important point is that we wouldn't miss doing the full loop
      checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
      its operating upon.  The reason we don't need to do lock ordering in the
      add path, is that we are already are holding the global 'epmutex'
      whenever we do the double lock.  Further, the original posting of this
      patch, which was tested for the intended performance gains, did not
      perform this additional locking.
      Signed-off-by: default avatarJason Baron <jbaron@akamai.com>
      Cc: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ff36ee9
  11. 03 Dec, 2013 1 commit
  12. 12 Nov, 2013 2 commits
    • Jason Baron's avatar
      epoll: do not take global 'epmutex' for simple topologies · 67347fe4
      Jason Baron authored
      When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
      directly to a wakeup source, we do not need to take the global 'epmutex',
      unless the epoll file descriptor is nested.  The purpose of taking the
      'epmutex' on add is to prevent complex topologies such as loops and deep
      wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
      operations.  However, for the simple case of an epoll file descriptor
      attached directly to a wakeup source (with no nesting), we do not need to
      hold the 'epmutex'.
      
      This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
      scalability on larger systems.  Quoting Nathan Zimmer's mail on SPECjbb
      performance:
      
      "On the 16 socket run the performance went from 35k jOPS to 125k jOPS.  In
      addition the benchmark when from scaling well on 10 sockets to scaling
      well on just over 40 sockets.
      
      ...
      
      Currently the benchmark stops scaling at around 40-44 sockets but it seems like
      I found a second unrelated bottleneck."
      
      [akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc]
      Signed-off-by: default avatarJason Baron <jbaron@akamai.com>
      Tested-by: default avatarNathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67347fe4
    • Jason Baron's avatar
      epoll: optimize EPOLL_CTL_DEL using rcu · ae10b2b4
      Jason Baron authored
      Nathan Zimmer found that once we get over 10+ cpus, the scalability of
      SPECjbb falls over due to the contention on the global 'epmutex', which is
      taken in on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.
      
      Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path
      by using rcu to guard against any concurrent traversals.
      
      Patch #2 remove the 'epmutex' lock from EPOLL_CTL_ADD operations for
      simple topologies.  IE when adding a link from an epoll file descriptor to
      a wakeup source, where the epoll file descriptor is not nested.
      
      This patch (of 2):
      
      Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
      converting the file->f_ep_links list into an rcu one.  In this way, we can
      traverse the epoll network on the add path in parallel with deletes.
      Since deletes can't create loops or worse wakeup paths, this is safe.
      
      This patch in combination with the patch "epoll: Do not take global 'epmutex'
      for simple topologies", shows a dramatic performance improvement in
      scalability for SPECjbb.
      Signed-off-by: default avatarJason Baron <jbaron@akamai.com>
      Tested-by: default avatarNathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      CC: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae10b2b4
  13. 30 Oct, 2013 1 commit
  14. 24 Oct, 2013 1 commit
  15. 11 Sep, 2013 1 commit
  16. 03 Sep, 2013 1 commit
  17. 03 Jul, 2013 1 commit
  18. 12 May, 2013 1 commit
    • Colin Cross's avatar
      epoll: use freezable blocking call · 1c441e92
      Colin Cross authored
      Avoid waking up every thread sleeping in an epoll_wait call during
      suspend and resume by calling a freezable blocking call.  Previous
      patches modified the freezer to avoid sending wakeups to threads
      that are blocked in freezable blocking calls.
      
      This call was selected to be converted to a freezable call because
      it doesn't hold any locks or release any resources when interrupted
      that might be needed by another freezing task or a kernel driver
      during suspend, and is a common site where idle userspace tasks are
      blocked.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarColin Cross <ccross@android.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      1c441e92
  19. 30 Apr, 2013 5 commits
  20. 03 Mar, 2013 1 commit
  21. 02 Jan, 2013 1 commit
    • Eric Wong's avatar
      epoll: prevent missed events on EPOLL_CTL_MOD · 128dd175
      Eric Wong authored
      EPOLL_CTL_MOD sets the interest mask before calling f_op->poll() to
      ensure events are not missed.  Since the modifications to the interest
      mask are not protected by the same lock as ep_poll_callback, we need to
      ensure the change is visible to other CPUs calling ep_poll_callback.
      
      We also need to ensure f_op->poll() has an up-to-date view of past
      events which occured before we modified the interest mask.  So this
      barrier also pairs with the barrier in wq_has_sleeper().
      
      This should guarantee either ep_poll_callback or f_op->poll() (or both)
      will notice the readiness of a recently-ready/modified item.
      
      This issue was encountered by Andreas Voellmy and Junchang(Jason) Wang in:
      http://thread.gmane.org/gmane.linux.kernel/1408782/Signed-off-by: default avatarEric Wong <normalperson@yhbt.net>
      Cc: Hans Verkuil <hans.verkuil@cisco.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Hans de Goede <hdegoede@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Voellmy <andreas.voellmy@yale.edu>
      Tested-by: default avatar"Junchang(Jason) Wang" <junchang.wang@yale.edu>
      Cc: netdev@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      128dd175
  22. 17 Dec, 2012 1 commit
    • Cyrill Gorcunov's avatar
      fs, epoll: add procfs fdinfo helper · 138d22b5
      Cyrill Gorcunov authored
      This allows us to print out eventpoll target file descriptor, events and
      data, the /proc/pid/fdinfo/fd consists of
      
       | pos:	0
       | flags:	02
       | tfd:        5 events:       1d data: ffffffffffffffff enabled: 1
      
      [avagin@: fix for unitialized ret variable]
      Signed-off-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: James Bottomley <jbottomley@parallels.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Helsley <matt.helsley@gmail.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      138d22b5
  23. 08 Nov, 2012 1 commit
    • Andrew Morton's avatar
      revert "epoll: support for disabling items, and a self-test app" · a80a6b85
      Andrew Morton authored
      Revert commit 03a7beb5 ("epoll: support for disabling items, and a
      self-test app") pending resolution of the issues identified by Michael
      Kerrisk, copied below.
      
      We'll revisit this for 3.8.
      
      : I've taken a look at this patch as it currently stands in 3.7-rc1, and
      : done a bit of testing. (By the way, the test program
      : tools/testing/selftests/epoll/test_epoll.c does not compile...)
      :
      : There are one or two places where the behavior seems a little strange,
      : so I have a question or two at the end of this mail. But other than
      : that, I want to check my understanding so that the interface can be
      : correctly documented.
      :
      : Just to go though my understanding, the problem is the following
      : scenario in a multithreaded application:
      :
      : 1. Multiple threads are performing epoll_wait() operations,
      :    and maintaining a user-space cache that contains information
      :    corresponding to each file descriptor being monitored by
      :    epoll_wait().
      :
      : 2. At some point, a thread wants to delete (EPOLL_CTL_DEL)
      :    a file descriptor from the epoll interest list, and
      :    delete the corresponding record from the user-space cache.
      :
      : 3. The problem with (2) is that some other thread may have
      :    previously done an epoll_wait() that retrieved information
      :    about the fd in question, and may be in the middle of using
      :    information in the cache that relates to that fd. Thus,
      :    there is a potential race.
      :
      : 4. The race can't solved purely in user space, because doing
      :    so would require applying a mutex across the epoll_wait()
      :    call, which would of course blow thread concurrency.
      :
      : Right?
      :
      : Your solution is the EPOLL_CTL_DISABLE operation. I want to
      : confirm my understanding about how to use this flag, since
      : the description that has accompanied the patches so far
      : has been a bit sparse
      :
      : 0. In the scenario you're concerned about, deleting a file
      :    descriptor means (safely) doing the following:
      :    (a) Deleting the file descriptor from the epoll interest list
      :        using EPOLL_CTL_DEL
      :    (b) Deleting the corresponding record in the user-space cache
      :
      : 1. It's only meaningful to use this EPOLL_CTL_DISABLE in
      :    conjunction with EPOLLONESHOT.
      :
      : 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in
      :    conjunction is a logical error.
      :
      : 3. The correct way to code multithreaded applications using
      :    EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows:
      :
      :    a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should
      :       should EPOLLONESHOT.
      :
      :    b. When a thread wants to delete a file descriptor, it
      :       should do the following:
      :
      :       [1] Call epoll_ctl(EPOLL_CTL_DISABLE)
      :       [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE)
      :           was zero, then the file descriptor can be safely
      :           deleted by the thread that made this call.
      :       [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY,
      :           then the descriptor is in use. In this case, the calling
      :           thread should set a flag in the user-space cache to
      :           indicate that the thread that is using the descriptor
      :           should perform the deletion operation.
      :
      : Is all of the above correct?
      :
      : The implementation depends on checking on whether
      : (events & ~EP_PRIVATE_BITS) == 0
      : This replies on the fact that EPOLL_CTL_AD and EPOLL_CTL_MOD always
      : set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT
      : causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be
      : cleared.
      :
      : A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE
      : is only useful in conjunction with EPOLLONESHOT. However, as things
      : stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does
      : not have EPOLLONESHOT set in 'events' This results in the following
      : (slightly surprising) behavior:
      :
      : (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0
      :     (the indicator that the file descriptor can be safely deleted).
      : (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY.
      :
      : This doesn't seem particularly useful, and in fact is probably an
      : indication that the user made a logic error: they should only be using
      : epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which
      : EPOLLONESHOT was set in 'events'. If that is correct, then would it
      : not make sense to return an error to user space for this case?
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paton J. Lewis" <palewis@adobe.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a80a6b85
  24. 05 Oct, 2012 1 commit
  25. 26 Sep, 2012 2 commits
  26. 22 Aug, 2012 1 commit
  27. 17 Jul, 2012 1 commit
  28. 01 Jun, 2012 1 commit
  29. 22 May, 2012 1 commit
    • Rafael J. Wysocki's avatar
      epoll: Fix user space breakage related to EPOLLWAKEUP · a8159414
      Rafael J. Wysocki authored
      Commit 4d7e30d9 (epoll: Add a flag, EPOLLWAKEUP, to prevent
      suspend while epoll events are ready) caused some applications to
      malfunction, because they set the bit corresponding to the new
      EPOLLWAKEUP flag in their eventpoll flags and they don't have the
      new CAP_EPOLLWAKEUP capability.
      
      To prevent that from happening, change epoll_ctl() to clear
      EPOLLWAKEUP in epds.events if the caller doesn't have the
      CAP_EPOLLWAKEUP capability instead of failing and returning an
      error code, which allows the affected applications to function
      normally.
      Reported-and-tested-by: default avatarJiri Slaby <jslaby@suse.cz>
      Signed-off-by: default avatarRafael J. Wysocki <rjw@sisk.pl>
      a8159414
  30. 05 May, 2012 1 commit
  31. 25 Apr, 2012 1 commit
  32. 28 Mar, 2012 1 commit
  33. 23 Mar, 2012 2 commits
    • Hans Verkuil's avatar
      poll: add poll_requested_events() and poll_does_not_wait() functions · 626cf236
      Hans Verkuil authored
      In some cases the poll() implementation in a driver has to do different
      things depending on the events the caller wants to poll for.  An example
      is when a driver needs to start a DMA engine if the caller polls for
      POLLIN, but doesn't want to do that if POLLIN is not requested but instead
      only POLLOUT or POLLPRI is requested.  This is something that can happen
      in the video4linux subsystem among others.
      
      Unfortunately, the current epoll/poll/select implementation doesn't
      provide that information reliably.  The poll_table_struct does have it: it
      has a key field with the event mask.  But once a poll() call matches one
      or more bits of that mask any following poll() calls are passed a NULL
      poll_table pointer.
      
      Also, the eventpoll implementation always left the key field at ~0 instead
      of using the requested events mask.
      
      This was changed in eventpoll.c so the key field now contains the actual
      events that should be polled for as set by the caller.
      
      The solution to the NULL poll_table pointer is to set the qproc field to
      NULL in poll_table once poll() matches the events, not the poll_table
      pointer itself.  That way drivers can obtain the mask through a new
      poll_requested_events inline.
      
      The poll_table_struct can still be NULL since some kernel code calls it
      internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h).  In
      that case poll_requested_events() returns ~0 (i.e.  all events).
      
      Very rarely drivers might want to know whether poll_wait will actually
      wait.  If another earlier file descriptor in the set already matched the
      events the caller wanted to wait for, then the kernel will return from the
      select() call without waiting.  This might be useful information in order
      to avoid doing expensive work.
      
      A new helper function poll_does_not_wait() is added that drivers can use
      to detect this situation.  This is now used in sock_poll_wait() in
      include/net/sock.h.  This was the only place in the kernel that needed
      this information.
      
      Drivers should no longer access any of the poll_table internals, but use
      the poll_requested_events() and poll_does_not_wait() access functions
      instead.  In order to enforce that the poll_table fields are now prepended
      with an underscore and a comment was added warning against using them
      directly.
      
      This required a change in unix_dgram_poll() in unix/af_unix.c which used
      the key field to get the requested events.  It's been replaced by a call
      to poll_requested_events().
      
      For qproc it was especially important to change its name since the
      behavior of that field changes with this patch since this function pointer
      can now be NULL when that wasn't possible in the past.
      
      Any driver accessing the qproc or key fields directly will now fail to compile.
      
      Some notes regarding the correctness of this patch: the driver's poll()
      function is called with a 'struct poll_table_struct *wait' argument.  This
      pointer may or may not be NULL, drivers can never rely on it being one or
      the other as that depends on whether or not an earlier file descriptor in
      the select()'s fdset matched the requested events.
      
      There are only three things a driver can do with the wait argument:
      
      1) obtain the key field:
      
      	events = wait ? wait->key : ~0;
      
         This will still work although it should be replaced with the new
         poll_requested_events() function (which does exactly the same).
         This will now even work better, since wait is no longer set to NULL
         unnecessarily.
      
      2) use the qproc callback. This could be deadly since qproc can now be
         NULL. Renaming qproc should prevent this from happening. There are no
         kernel drivers that actually access this callback directly, BTW.
      
      3) test whether wait == NULL to determine whether poll would return without
         waiting. This is no longer sufficient as the correct test is now
         wait == NULL || wait->_qproc == NULL.
      
         However, the worst that can happen here is a slight performance hit in
         the case where wait != NULL and wait->_qproc == NULL. In that case the
         driver will assume that poll_wait() will actually add the fd to the set
         of waiting file descriptors. Of course, poll_wait() will not do that
         since it tests for wait->_qproc. This will not break anything, though.
      
         There is only one place in the whole kernel where this happens
         (sock_poll_wait() in include/net/sock.h) and that code will be replaced
         by a call to poll_does_not_wait() in the next patch.
      
         Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
         actually waiting. The next file descriptor from the set might match the
         event mask and thus any possible waits will never happen.
      Signed-off-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Reviewed-by: default avatarJonathan Corbet <corbet@lwn.net>
      Reviewed-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: default avatarHans de Goede <hdegoede@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      626cf236
    • Dan Carpenter's avatar
      da0503aa