  1. Jul 25, 2011
    • mm/futex: fix futex writes on archs with SW tracking of dirty & young · 2efaca92
      Benjamin Herrenschmidt authored
      
      I haven't reproduced it myself but the fail scenario is that on such
      machines (notably ARM and some embedded powerpc), if you manage to hit
      that futex path on a writable page whose dirty bit has gone from the PTE,
      you'll livelock inside the kernel from what I can tell.
      
      It will go in a loop of trying the atomic access, failing, trying gup to
      "fix it up", getting success from gup, going back to the atomic access,
      failing again because dirty wasn't fixed, etc.
      
      So I think you essentially hang in the kernel.
      
      The scenario is probably rare'ish because affected architectures are
      embedded and tend to not swap much (if at all) so we probably rarely hit
      the case where dirty is missing or young is missing, but I think Shan has
      a piece of SW that can reliably reproduce it using a shared writable
      mapping & fork or something like that.
      
      On archs that use SW tracking of dirty & young, a page without dirty is
      effectively mapped read-only and a page without young is inaccessible in
      the PTE.
      
      Additionally, some architectures might lazily flush the TLB when relaxing
      write protection (by doing only a local flush), and expect a fault to
      invalidate the stale entry if it's still present on another processor.
      
      The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
      "fix it up" by calling get_user_pages(), which would then be equivalent to
      taking the fault.
      
      However that isn't the case.  get_user_pages() will not call
      handle_mm_fault() in the case where the PTE seems to have the right
      permissions, regardless of the dirty and young state.  It will eventually
      update those bits ...  in the struct page, but not in the PTE.
      
      Additionally, it will not handle the lazy TLB flushing that can be
      required by some architectures in the fault case.
      
      Basically, gup is the wrong interface for the job.  The patch provides a
      more appropriate one which boils down to just calling handle_mm_fault()
      since what we are trying to do is simulate a real page fault.
      
      The futex code currently attempts to write to user memory within a
      pagefault disabled section, and if that fails, tries to fix it up using
      get_user_pages().
      
      This doesn't work on archs where the dirty and young bits are maintained
      by software, since they will gate access permission in the TLB, and will
      not be updated by gup().
      
      In addition, there's an expectation on some archs that a spurious write
      fault triggers a local TLB flush, and that is missing from the picture as
      well.
      
      I decided that adding those "features" to gup() would be too much for this
      already too complex function, and instead added a new simpler
      fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
      which the futex code can call.
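      A rough user-space model of the livelock may help. This is a hypothetical
      simulation, not kernel code: struct sim_pte, sim_gup_fixup() and
      sim_fault_fixup() are invented names standing in for a PTE with
      software-tracked dirty/young bits, the gup-based fixup, and a
      handle_mm_fault()-style fixup in the spirit of fixup_user_fault(). The retry
      loop is bounded so the livelock shows up as a -1 return instead of a hang.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical PTE on an arch with software-managed dirty/young bits. */
struct sim_pte {
	bool present;
	bool writable;  /* permission that a gup-style lookup checks */
	bool dirty;     /* SW dirty: while clear, the TLB entry is read-only */
	bool young;     /* SW young: while clear, the TLB entry is invalid */
};

/* Stand-in for the pagefault-disabled futex write. */
static int sim_atomic_write(const struct sim_pte *pte)
{
	return (pte->present && pte->writable && pte->dirty && pte->young)
		? 0 : -EFAULT;
}

/* Simulated write fault: the fault handler updates the SW bits in the PTE. */
static void sim_fault_fixup(struct sim_pte *pte)
{
	pte->present = true;
	if (pte->writable)
		pte->dirty = pte->young = true;
}

/* gup-style fixup: if the permissions look right it never takes the fault
 * path, so dirty/young in the PTE stay exactly as they were. */
static void sim_gup_fixup(struct sim_pte *pte)
{
	if (pte->present && pte->writable)
		return;
	sim_fault_fixup(pte);
}

/* The futex retry loop, bounded so a livelock returns -1 instead of hanging.
 * On success it returns the number of attempts that were needed. */
static int sim_futex_write(struct sim_pte *pte,
			   void (*fixup)(struct sim_pte *), int max_tries)
{
	for (int i = 1; i <= max_tries; i++) {
		if (sim_atomic_write(pte) == 0)
			return i;
		fixup(pte);
	}
	return -1;
}
```

      With a writable page whose dirty bit is clear, the gup-style fixup never
      touches the PTE, so the loop gives up; the fault-style fixup sets the bits
      and the second attempt succeeds.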
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reported-by: Shan Hai <haishan.bai@gmail.com>
      Tested-by: Shan Hai <haishan.bai@gmail.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Darren Hart <darren.hart@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Jul 08, 2011
  3. Apr 15, 2011
  4. Mar 25, 2011
  5. Mar 23, 2011
  6. Mar 14, 2011
    • futex: Deobfuscate handle_futex_death() · 6e0aa9f8
      Thomas Gleixner authored
      
      handle_futex_death() uses futex_atomic_cmpxchg_inatomic() without
      disabling page faults. That's ok, but totally non-obvious.
      
      We don't hold locks so we actually can and want to fault here, because
      the get_user() before futex_atomic_cmpxchg_inatomic() does not
      guarantee a R/W mapping.
      
      We could just add a big fat comment to explain this, but actually
      changing the code so that the functionality is entirely clear is
      better.
      
      Use the helper function which disables page faults around the
      futex_atomic_cmpxchg_inatomic() and handle a fault with a call to
      fault_in_user_writeable() as all other places in the futex code do as
      well.
      
      Pointed-out-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Darren Hart <darren@dvhart.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      LKML-Reference: <alpine.LFD.2.00.1103141126590.2787@localhost6.localdomain6>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  7. Mar 11, 2011
    • futex,plist: Remove debug lock assignment from plist_node · 017f2b23
      Lai Jiangshan authored
      
      The original code used &plist_node->plist as the fake head of
      the priority list for plist_del(); the debug locks in
      the fake head were needed for CONFIG_DEBUG_PI_LIST.
      
      But now that we always pass the real head to plist_del(), the debug
      locks in plist_node are no longer used, so remove these assignments.
      
      Acked-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <4D10797E.7040803@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • futex,plist: Pass the real head of the priority list to plist_del() · 2e12978a
      Lai Jiangshan authored
      
      Some plist_del()s in kernel/futex.c are passed a faked head of the
      priority list.
      
      This does not fail because the current plist_del() does not require the
      real head: it only uses the head for checking, so passing a faked head
      does not cause a bad result.
      
      But it is undocumented usage:
      
      /**
       * plist_del - Remove a @node from plist.
       *
       * @node:	&struct plist_node pointer - entry to be removed
       * @head:	&struct plist_head pointer - list head
       */
      
      The documentation says that @head is the "list head", i.e. the head of the
      priority list.
      
      In the futex code, several places use "plist_del(&q->list, &q->list.plist);",
      which passes a fake head. We need to fix them all.
      
      Thanks to Darren Hart for many suggestions.
      
      Acked-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <4D11984A.5030203@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • futex: Sanitize cmpxchg_futex_value_locked API · 37a9d912
      Michel Lespinasse authored
      
      The cmpxchg_futex_value_locked API was funny in that it returned either
      the original, user-exposed futex value OR an error code such as -EFAULT.
      This was confusing at best, and could be a source of livelocks in places
      that retry the cmpxchg_futex_value_locked after trying to fix the issue
      by running fault_in_user_writeable().
          
      This change makes the cmpxchg_futex_value_locked API more similar to the
      get_futex_value_locked one, returning an error code and updating the
      original value through a reference argument.
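      A small sketch of why the old return convention was ambiguous: a futex word
      may legitimately hold the bit pattern of (u32)-EFAULT. old_cmpxchg() and
      new_cmpxchg() are hypothetical user-space stand-ins that use C11 atomics
      instead of a user-space access, and the fault argument merely simulates a
      faulting access. The new shape mirrors the description above: an error code
      as the return value, the current futex value through a pointer.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Old shape: returns the previous futex value OR a negative errno squeezed
 * into the same u32, so a caller cannot always tell them apart. */
static uint32_t old_cmpxchg(_Atomic uint32_t *uaddr, int fault,
			    uint32_t uval, uint32_t newval)
{
	if (fault)                       /* simulated faulting access */
		return (uint32_t)-EFAULT;
	/* Afterwards uval holds the value *uaddr had before the call. */
	atomic_compare_exchange_strong(uaddr, &uval, newval);
	return uval;
}

/* New shape: error code as the return value, current value via curval. */
static int new_cmpxchg(uint32_t *curval, _Atomic uint32_t *uaddr, int fault,
		       uint32_t uval, uint32_t newval)
{
	if (fault)
		return -EFAULT;
	atomic_compare_exchange_strong(uaddr, &uval, newval);
	*curval = uval;
	return 0;
}
```

      With the old shape, a caller that sees (u32)-EFAULT cannot know whether to
      fault the page in or treat it as a real futex value; the new shape keeps
      the two channels separate.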
          
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>  [tile]
      Acked-by: Tony Luck <tony.luck@intel.com>  [ia64]
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Michal Simek <monstr@monstr.eu>  [microblaze]
      Acked-by: David Howells <dhowells@redhat.com> [frv]
      Cc: Darren Hart <darren@dvhart.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110311024851.GC26122@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • futex: Avoid redundant evaluation of task_pid_vnr() · c0c9ed15
      Thomas Gleixner authored
      
      The result is not going to change under us, so no need to reevaluate
      this over and over. Seems to be a leftover from the mechanical mass
      conversion of task->pid to task_pid_vnr(tsk).
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  8. Mar 10, 2011
  9. Jan 27, 2011
    • rtmutex: Simplify PI algorithm and make highest prio task get lock · 8161239a
      Lai Jiangshan authored
      
      In the current rtmutex, the pending owner may be boosted by the tasks
      in the rtmutex's waitlist when the pending owner is deboosted
      or a task in the waitlist is boosted. This boosting is unrelated,
      because the pending owner does not really hold the rtmutex.
      It is not reasonable.
      
      Example.
      
      time1:
      A (high prio) owns the rtmutex.
      B (mid prio) and C (low prio) are in the waitlist.
      
      time2:
      A releases the lock, B becomes the pending owner.
      A (or another high prio task) continues to run. B's prio is lower
      than A's, so B is just queued on the runqueue.
      
      time3:
      A or another high prio task sleeps, but some time has passed.
      B's and C's prios changed in the period (time2 ~ time3)
      due to boosting or deboosting. Now C has a higher priority
      than B. ***Is it reasonable that C has to boost B and help B to
      get the rtmutex?
      
      NO!! I think it is unrelated/unneeded boosting before B really
      owns the rtmutex. We should give C a chance to beat B and
      win the rtmutex.
      
      This is the motivation of this patch. This patch *ensures*
      only the top waiter or higher priority task can take the lock.
      
      How?
      1) We don't dequeue the top waiter on unlock; if the top waiter
         changes, the old top waiter will fail and go to sleep again.
      2) When acquiring the lock, a task gets it if the lock is not taken and:
         there are no waiters OR it has higher priority than the waiters OR it is the top waiter.
      3) Any time the top waiter changes, the (new) top waiter will be woken up.
      
      The algorithm is much simpler than before, no pending owner, no
      boosting for pending owner.
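      The acquisition rule in item 2 above can be written as a single predicate.
      This is a simplified, hypothetical model (sim_task, sim_rtmutex and
      sim_try_take() are not kernel names): only the top waiter is tracked,
      dequeueing is not modelled, and a lower prio number means higher priority,
      as in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

struct sim_task { int prio; };   /* lower value = higher priority */

struct sim_rtmutex {
	struct sim_task *owner;       /* NULL when the lock is not taken */
	struct sim_task *top_waiter;  /* NULL when there are no waiters */
};

/* Take the lock only if it is free and there are no waiters, or the task
 * beats the top waiter, or the task is the top waiter itself. */
static bool sim_try_take(struct sim_rtmutex *lock, struct sim_task *task)
{
	if (lock->owner)
		return false;
	if (lock->top_waiter && lock->top_waiter != task &&
	    task->prio >= lock->top_waiter->prio)
		return false;
	lock->owner = task;
	return true;
}
```

      In the example above, once C has become the top waiter, B's attempt fails
      and C's succeeds; in this sketch a non-waiter of equal priority does not
      beat the top waiter.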
      
      Other advantages of this patch:
      1) The number of rtmutex states is reduced by half, making the code easier to read.
      2) The code becomes shorter.
      3) The top waiter is not dequeued until it really takes the lock,
         so waiters retain FIFO order when the lock is stolen.
      
      Neither advantage nor disadvantage:
      1) Even though we may wake up multiple waiters (any time the top waiter
         changes), we hardly cause a "thundering herd";
         the number of woken tasks is likely 1 or very few.
      2) Two APIs are changed:
         rt_mutex_owner() will not return the pending owner; it will return NULL
                          when the top waiter is going to take the lock.
         rt_mutex_next_owner() always returns the top waiter;
                          it will not return NULL if we have waiters,
                          because the top waiter is not dequeued.
      
         I have fixed the code that uses these APIs.
      
      Needs updating after this patch is accepted:
      1) Document/*
      2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst
      
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <4D3012D5.4060709@cn.fujitsu.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  10. Jan 13, 2011
    • thp: update futex compound knowledge · a5b338f2
      Andrea Arcangeli authored
      
      Futex code is smarter than most other gup_fast O_DIRECT code and knows
      about the compound internals.  However, now doing a put_page(head_page)
      will not release the pin on the tail page taken by gup-fast, leading to
      all sorts of refcounting bugchecks.  Getting a stable head_page is a little
      tricky.
      
      page_head = page is there because if this is not a tail page it's also the
      page_head.  Only in case this is a tail page, compound_head is called,
      otherwise it's guaranteed unnecessary.  And if it's a tail page
      compound_head has to run atomically inside irq disabled section
      __get_user_pages_fast before returning.  Otherwise ->first_page won't be a
      stable pointer.
      
      Disabling irqs before __get_user_pages_fast and re-enabling them after running
      compound_head is needed because if __get_user_pages_fast returns 1, it
      means the huge pmd is established and cannot go away from under us.
      pmdp_splitting_flush_notify in __split_huge_page_splitting will have to
      wait for local_irq_enable before the IPI delivery can return.  This means
      __split_huge_page_refcount can't be running from under us, and in turn
      when we run compound_head(page) we're not reading a dangling pointer from
      tailpage->first_page.  Then after we get to stable head page, we are
      always safe to call compound_lock and after taking the compound lock on
      head page we can finally re-check if the page returned by gup-fast is
      still a tail page, in which case we're set and we didn't need to split
      the hugepage in order to take a futex on it.
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. Jan 11, 2011
  12. Nov 10, 2010
    • futex: Add futex_q static initializer · 5bdb05f9
      Darren Hart authored
      
      The futex_q struct has grown considerably over the last couple years. I
      believe it now merits a static initializer to avoid uninitialized data
      errors (having spent more time than I care to admit debugging an uninitialized
      q.bitset in an experimental new op code).
      
      With the key initializer built in, several of the FUTEX_KEY_INIT calls can
      be removed.
      
      V2: use a static variable instead of an init macro.
          use a C99 initializer and don't rely on variable ordering in the struct.
      V3: make futex_q_init const
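      A trimmed illustration of the static-initializer idea, assuming a heavily
      reduced struct: sim_futex_q and sim_futex_key are hypothetical stand-ins
      for struct futex_q and union futex_key, and SIM_BITSET_MATCH_ANY mirrors
      FUTEX_BITSET_MATCH_ANY. The point is the const template built with C99
      designated initializers and copied by plain assignment.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_BITSET_MATCH_ANY 0xffffffffu  /* mirrors FUTEX_BITSET_MATCH_ANY */

/* Trimmed, hypothetical stand-ins for union futex_key / struct futex_q. */
union sim_futex_key {
	struct { unsigned long word; void *ptr; int offset; } both;
};

struct sim_futex_q {
	void *task;
	void *lock_ptr;
	union sim_futex_key key;
	uint32_t bitset;
};

/* One const template: members not named here are zero/NULL, and the
 * initializer does not depend on member order in the struct. */
static const struct sim_futex_q sim_futex_q_init = {
	.key = { .both = { .ptr = NULL } },
	.bitset = SIM_BITSET_MATCH_ANY,
};
```

      A caller would then write "struct sim_futex_q q = sim_futex_q_init;", and
      every member the template does not name, such as task and lock_ptr, starts
      out zeroed instead of holding stack garbage.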
      
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <1289252428-18383-1-git-send-email-dvhart@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • futex: Replace fshared and clockrt with combined flags · b41277dc
      Darren Hart authored
      
      In the early days we passed the mmap sem around. That became the
      "int fshared" with the fast gup improvements. Then we added
      "int clockrt" in places. This patch unifies these options as "flags".
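      A sketch of how the two old ints can collapse into one flags word decoded
      from the futex op. The op modifier values 128 and 256 mirror
      FUTEX_PRIVATE_FLAG and FUTEX_CLOCK_REALTIME from <linux/futex.h>; the
      SIM_FLAGS_* names and bit values are assumptions for illustration.

```c
/* Op modifiers, same values as FUTEX_PRIVATE_FLAG / FUTEX_CLOCK_REALTIME. */
#define SIM_FUTEX_PRIVATE_FLAG   128
#define SIM_FUTEX_CLOCK_REALTIME 256

/* Hypothetical internal flag bits carried through the futex code. */
#define SIM_FLAGS_SHARED  0x01u
#define SIM_FLAGS_CLOCKRT 0x02u

static unsigned int sim_op_to_flags(int op)
{
	unsigned int flags = 0;

	if (!(op & SIM_FUTEX_PRIVATE_FLAG))
		flags |= SIM_FLAGS_SHARED;   /* replaces "int fshared" */
	if (op & SIM_FUTEX_CLOCK_REALTIME)
		flags |= SIM_FLAGS_CLOCKRT;  /* replaces "int clockrt" */
	return flags;
}
```

      A single unsigned int can then be passed down instead of a growing list of
      boolean parameters, and adding a new option means adding a bit rather than
      changing every function signature.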
      
      [ tglx: Split out the stale fshared cleanup ]
      
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <1289250609-16304-1-git-send-email-dvhart@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • futex: Cleanup stale fshared flag interfaces · ae791a2d
      Thomas Gleixner authored
      
      The fast GUP changes stopped using the fshared flag in
      put_futex_keys(), but we kept the interface the same.
      
      Cleanup all stale users.
      
      This patch is split out from Darren Hart's combo patch, which also
      combines various flags. This way the changes are clearly separated.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Darren Hart <dvhart@linux.intel.com>
      LKML-Reference: <1289250609-16304-1-git-send-email-dvhart@linux.intel.com>
    • futex: Address compiler warnings in exit_robust_list · 4c115e95
      Darren Hart authored
      
      Since commit 1dcc41bb (futex: Change 3rd arg of fetch_robust_entry()
      to unsigned int*) some gcc versions decided to emit the following
      warning:
      
      kernel/futex.c: In function ‘exit_robust_list’:
      kernel/futex.c:2492: warning: ‘next_pi’ may be used uninitialized in this function
      
      The commit did not introduce the warning as gcc should have warned
      before that commit as well. It's just gcc being silly.
      
      The code path really can't result in next_pi being uninitialized (or
      should not), but let's keep the build clean. Annotate next_pi as an
      uninitialized_var.
      
      [ tglx: Addressed the same issue in futex_compat.c and massaged the
        changelog ]
      
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Tested-by: Matt Fleming <matt@console-pimps.org>
      Tested-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <1288897200-13008-1-git-send-email-dvhart@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  13. Oct 25, 2010
  14. Oct 19, 2010
    • futex: Fix errors in nested key ref-counting · 7ada876a
      Darren Hart authored
      
      futex_wait() is leaking key references due to futex_wait_setup()
      acquiring an additional reference via the queue_lock() routine. The
      nested key ref-counting has been masking bugs and complicating code
      analysis. queue_lock() is only called with a previously ref-counted
      key, so remove the additional ref-counting from the queue_(un)lock()
      functions.
      
      Also futex_wait_requeue_pi() drops one key reference too many in
      unqueue_me_pi(). Remove the key reference handling from
      unqueue_me_pi(). This was paired with a queue_lock() in
      futex_lock_pi(), so the count remains unchanged.
      
      Document remaining nested key ref-counting sites.
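      A toy reference counter makes the leak described above concrete. This is a
      hypothetical model: struct sim_key and sim_futex_wait() are invented, and
      old_queue_lock selects the pre-fix behaviour in which queue_lock() took a
      nested reference that the wait path never dropped.

```c
/* Hypothetical futex key with a single reference counter. */
struct sim_key { int refs; };

static void sim_get_key(struct sim_key *k) { k->refs++; }
static void sim_put_key(struct sim_key *k) { k->refs--; }

/* Model of a futex_wait() pass; old_queue_lock != 0 reproduces the nested
 * reference that the old queue_lock() took on an already-held key. */
static void sim_futex_wait(struct sim_key *k, int old_queue_lock)
{
	sim_get_key(k);              /* get_futex_key() */
	if (old_queue_lock)
		sim_get_key(k);      /* nested ref, never dropped on this path */
	/* ... enqueue, sleep, get woken, unqueue ... */
	sim_put_key(k);              /* put_futex_key() on the way out */
}
```

      With the nested get, every wait leaves one reference behind; with the fix,
      get_futex_key() and put_futex_key() are the only pair and the count
      returns to zero.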
      
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Reported-and-tested-by: Matthieu Fertré <matthieu.fertre@kerlabs.com>
      Reported-by: Louis Rilling <louis.rilling@kerlabs.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      LKML-Reference: <4CBB17A8.70401@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@kernel.org
  15. Oct 14, 2010
  16. Sep 18, 2010
  17. Jun 30, 2010
    • futex: futex_find_get_task remove credentials check · 7a0ea09a
      Michal Hocko authored
      
      futex_find_get_task is currently used (through lookup_pi_state) from two
      contexts, futex_requeue and futex_lock_pi_atomic.  None of the paths
      appears to need the credentials check, though.  Different (e)uids
      shouldn't matter at all because the only thing that is important for
      shared futexes is the accessibility of the shared memory.
      
      The credential check results in a glibc assert failure or a process hang (if
      glibc is compiled without assert support) for a shared robust pthread
      mutex with priority inheritance if a process tries to lock an already held
      lock owned by a process with a different euid:
      
      pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.
      
      The problem is that futex_lock_pi_atomic, which is called when we try to
      lock an already held lock, checks the current holder (the tid is stored in
      the futex value) to get the PI state.  It uses lookup_pi_state, which in
      turn gets the task struct from futex_find_get_task.  ESRCH is returned
      either when the task is not found or if the credentials check fails.
      
      futex_lock_pi_atomic simply returns if it gets ESRCH.  glibc code,
      however, doesn't expect a robust lock to return ESRCH, because it
      should get either success or owner died.
      
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. Feb 03, 2010
    • futex: Handle futex value corruption gracefully · 59647b6a
      Thomas Gleixner authored
      
      The WARN_ON in lookup_pi_state which complains about a mismatch
      between pi_state->owner->pid and the pid which we retrieved from the
      user space futex is completely bogus.
      
      The code just emits the warning and then continues despite the fact
      that it detected an inconsistent state of the futex. A convenient way
      for user space to spam the syslog.
      
      Replace the WARN_ON by a consistency check. If the values do not match
      return -EINVAL and let user space deal with the mess it created.
      
      This also fixes the missing task_pid_vnr() when we compare the
      pi_state->owner pid with the futex value.
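      The replacement for the WARN_ON amounts to a check of roughly this shape.
      sim_check_pi_owner() is a hypothetical helper: pi_owner_vpid stands for
      task_pid_vnr(pi_state->owner), and SIM_FUTEX_TID_MASK mirrors
      FUTEX_TID_MASK (0x3fffffff), which strips the FUTEX_WAITERS and
      FUTEX_OWNER_DIED bits from the futex word.

```c
#include <errno.h>
#include <stdint.h>

#define SIM_FUTEX_TID_MASK 0x3fffffffu  /* same value as FUTEX_TID_MASK */

static int sim_check_pi_owner(uint32_t uval, uint32_t pi_owner_vpid)
{
	uint32_t tid = uval & SIM_FUTEX_TID_MASK;  /* drop WAITERS/OWNER_DIED */

	/* A mismatch means user space corrupted the futex: refuse, don't WARN. */
	if (tid != pi_owner_vpid)
		return -EINVAL;
	return 0;
}
```

      Comparing against the namespace-local pid (the task_pid_vnr() value)
      matters because the futex word holds the TID as user space sees it.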
      
      Reported-by: Jermome Marchand <jmarchan@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Darren Hart <dvhltc@us.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
    • futex: Handle user space corruption gracefully · 51246bfd
      Thomas Gleixner authored
      
      If the owner of a PI futex dies we fix up the pi_state and set
      pi_state->owner to NULL. When a malicious or just sloppily programmed
      user space application sets the futex value to 0, e.g. by calling
      pthread_mutex_init(), then the futex can be acquired again. A new
      waiter manages to enqueue itself on the pi_state w/o damage, but on
      unlock the kernel dereferences pi_state->owner and oopses.
      
      Prevent this by checking pi_state->owner in the unlock path. If
      pi_state->owner is not current we know that user space manipulated the
      futex value. Ignore the mess and return -EINVAL.
      
      This catches the above case and also the case where a task hijacks the
      futex by setting the tid value and then tries to unlock it.
      
      Reported-by: Jermome Marchand <jmarchan@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Darren Hart <dvhltc@us.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
    • futex_lock_pi() key refcnt fix · 5ecb01cf
      Mikael Pettersson authored
      
      This fixes a futex key reference count bug in futex_lock_pi(),
      where a key's reference count is incremented twice but decremented
      only once, causing the backing object to not be released.
      
      If the futex is created in a temporary file in an ext3 file system,
      this bug causes the file's inode to become an "undead" orphan,
      which causes an oops from a BUG_ON() in ext3_put_super() when the
      file system is unmounted. glibc's test suite is known to trigger this,
      see <http://bugzilla.kernel.org/show_bug.cgi?id=14256>.
      
      The bug is a regression from 2.6.28-git3, namely Peter Zijlstra's
      38d47c1b "[PATCH] futex: rely on
      get_user_pages() for shared futexes". That commit made get_futex_key()
      also increment the reference count of the futex key, and updated its
      callers to decrement the key's reference count before returning.
      Unfortunately the normal exit path in futex_lock_pi() wasn't corrected:
      the reference count is incremented by get_futex_key() and queue_lock(),
      but the normal exit path only decrements once, via unqueue_me_pi().
      The fix is to put_futex_key() after unqueue_me_pi(), since 2.6.31
      this is easily done by 'goto out_put_key' rather than 'goto out'.
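      The imbalance and its fix can be modelled with a plain counter and the two
      exit labels. sim_futex_lock_pi() and struct sim_pi_key are hypothetical;
      the fixed argument chooses between the buggy "goto out" and the corrected
      "goto out_put_key" on the normal exit path.

```c
/* Hypothetical futex key with a single reference counter. */
struct sim_pi_key { int refs; };

static int sim_futex_lock_pi(struct sim_pi_key *key, int fixed)
{
	int ret = 0;

	key->refs++;            /* get_futex_key() */
	key->refs++;            /* queue_lock() takes a nested reference */

	/* ... take the rt_mutex, fix up the pi_state owner ... */

	key->refs--;            /* unqueue_me_pi() drops the queue_lock() ref */
	if (fixed)
		goto out_put_key;
	goto out;               /* pre-fix: skips the matching put */

out_put_key:
	key->refs--;            /* put_futex_key() balances get_futex_key() */
out:
	return ret;
}
```

      Ordering the labels so that out_put_key falls through into out lets every
      exit that holds the key reference share a single put.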
      
      Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Darren Hart <dvhltc@us.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@kernel.org>
  19. Jan 13, 2010
    • futexes: Remove rw parameter from get_futex_key() · 7485d0d3
      KOSAKI Motohiro authored
      
      Currently, futexes have two problems:
      
      A) The current futex code doesn't handle private file mappings properly.
      
      get_futex_key() uses PageAnon() to distinguish file and
      anon, which can cause the following bad scenario:
      
        1) thread-A calls futex(private-mapping, FUTEX_WAIT); it
           sleeps on the file mapping object.
        2) thread-B writes a variable, which makes it COW.
        3) thread-B calls futex(private-mapping, FUTEX_WAKE); it
           wakes up threads blocked on the anonymous page (but there are none).
      
      B) The current futex code doesn't handle the zero page properly.
      
      Read mode get_user_pages() can return the zero page, but the current
      futex code doesn't handle it at all. The zero page then causes an
      infinite loop internally.
      
      The solution is to always use write mode get_user_pages() for
      page lookup. This prevents the lookup of both file pages of private
      mappings and the zero page.
      
      Performance concerns:
      
      Probably very little, because glibc always initializes futex variables
      before calling futex(). This means glibc users never see the
      overhead of this patch.
      
      Compatibility concerns:
      
      This patch has few compatibility issues. After this patch,
      FUTEX_WAIT requires writable access to futex variables (read-only
      mappings result in EFAULT). But practically it's not a problem:
      glibc always initializes futex variables explicitly, so nobody
      uses read-only mappings.
      
      Reported-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: <stable@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Ulrich Drepper <drepper@gmail.com>
      LKML-Reference: <20100105162633.45A2.A69D9226@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  20. Dec 14, 2009
  21. Dec 08, 2009
  22. Oct 28, 2009
    • futex: Fix spurious wakeup for requeue_pi really · 11df6ddd
      Thomas Gleixner authored
      
      The requeue_pi path doesn't use unqueue_me() (and the racy lock_ptr ==
      NULL test), nor does it use the wake_list of futex_wake(), which were
      the reason for commit 41890f2 (futex: Handle spurious wake up).
      
      See the debugging discussion on LKML, Message-ID: <4AD4080C.20703@us.ibm.com>
      
      The changes in this fix to the wait_requeue_pi path were considered to
      be a likely unnecessary but harmless safety net. But it turns out that,
      because for unknown $@#!*( reasons EWOULDBLOCK is defined as EAGAIN,
      we built an endless loop in the code path which correctly returns
      EWOULDBLOCK.
      
      Spurious wakeups in wait_requeue_pi code path are unlikely so we do
      the easy solution and return EWOULDBLOCK^WEAGAIN to user space and let
      it deal with the spurious wakeup.
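      The pitfall is easy to reproduce in user space: on Linux EWOULDBLOCK and
      EAGAIN have the same value, so a check meant for one also matches the
      other. sim_should_retry() is a hypothetical helper standing in for the
      internal "try again" test.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical internal test: retry in the kernel on "try again". */
static bool sim_should_retry(int ret)
{
	return ret == -EAGAIN;
}
```

      Because the two constants compare equal, a path that deliberately
      returned -EWOULDBLOCK was also treated as a retry request, which is how the
      endless loop arose; handing the value back to user space avoids that.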
      
      Cc: Darren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      LKML-Reference: <4AE23C74.1090502@us.ibm.com>
      Cc: stable@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  23. Oct 16, 2009
    • futex: Move drop_futex_key_refs out of spinlock'ed region · 89061d3d
      Darren Hart authored
      
      When requeuing tasks from one futex to another, the reference held
      by the requeued task to the original futex location needs to be
      dropped eventually.
      
      Dropping the reference may ultimately lead to a call to
      "iput_final" and subsequently call into filesystem-specific code,
      which may be non-atomic.
      
      It is therefore safer to defer this drop operation until after the
      futex_hash_bucket spinlock has been dropped.
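      One way to picture the deferral: count the references to drop while the
      lock is held and drop them only after unlocking. This sketch uses a pthread
      mutex in place of the futex_hash_bucket spinlock, and sim_ref,
      sim_drop_ref() and sim_requeue() are hypothetical; released_under_lock
      records whether the final put ever ran while the lock was held.

```c
#include <pthread.h>

/* Hypothetical object whose final put may sleep (think iput_final()). */
struct sim_ref {
	int refs;
	int released_under_lock;  /* set if the last put ran inside the lock */
};

static pthread_mutex_t sim_hb_lock = PTHREAD_MUTEX_INITIALIZER;
static int sim_hb_locked;         /* are we inside sim_hb_lock right now? */

static void sim_drop_ref(struct sim_ref *r)
{
	if (--r->refs == 0 && sim_hb_locked)
		r->released_under_lock = 1;   /* the unsafe case */
}

/* Requeue nr_waiters tasks; note how many refs to drop, drop them later. */
static void sim_requeue(struct sim_ref *src, int nr_waiters)
{
	int deferred = 0;

	pthread_mutex_lock(&sim_hb_lock);
	sim_hb_locked = 1;
	for (int i = 0; i < nr_waiters; i++)
		deferred++;             /* move waiter i; defer its ref drop */
	sim_hb_locked = 0;
	pthread_mutex_unlock(&sim_hb_lock);

	while (deferred-- > 0)
		sim_drop_ref(src);      /* a final put now runs outside the lock */
}
```

      Only a count needs to survive the unlock here, because every deferred
      reference is on the same source key.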
      
      Originally-From: Helge Bahmann <hcb@chaoticmind.net>
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: <stable@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      Cc: Sven-Thorsten Dietrich <sdietrich@novell.com>
      Cc: John Kacur <jkacur@redhat.com>
      LKML-Reference: <4AD7A298.5040802@us.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  24. Oct 14, 2009
    • futex: Check for NULL keys in match_futex · 2bc87203
      Darren Hart authored
      
      If userspace tries to perform a requeue_pi on a non-requeue_pi waiter,
      it will find the futex_q->requeue_pi_key to be NULL and OOPS.
      
      Check for NULL in match_futex() instead of doing explicit NULL pointer
      checks on all call sites.  While match_futex(NULL, NULL) returning
      false is a little odd, it's still correct as we expect valid key
      references.
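      A sketch of the NULL-tolerant comparison, assuming a trimmed key with the
      word, ptr and offset fields that futex keys carry; struct sim_match_key and
      sim_match_futex() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Trimmed, hypothetical futex key. */
struct sim_match_key {
	unsigned long word;
	void *ptr;
	int offset;
};

static bool sim_match_futex(const struct sim_match_key *k1,
			    const struct sim_match_key *k2)
{
	/* A NULL key (e.g. an unset requeue_pi_key) never matches. */
	return k1 && k2 &&
	       k1->word == k2->word &&
	       k1->ptr == k2->ptr &&
	       k1->offset == k2->offset;
}
```

      Folding the check into the comparison means no call site can forget it,
      at the cost of sim_match_futex(NULL, NULL) returning false.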
      
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Dinakar Guniguntala <dino@in.ibm.com>
      CC: John Stultz <johnstul@us.ibm.com>
      Cc: stable@kernel.org
      LKML-Reference: <4AD60687.10306@us.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  25. Oct 13, 2009
    • futex: Handle spurious wake up · d58e6576
      Thomas Gleixner authored
      
      The futex code does not handle spurious wake up in futex_wait and
      futex_wait_requeue_pi.
      
      The code assumes that any wake up which was not caused by futex_wake /
      requeue or by a timeout was caused by a signal wake up and returns one
      of the syscall restart error codes.
      
      In case of a spurious wake up the signal delivery code which deals
      with the restart error codes is not invoked and we return that error
      code to user space. That causes applications which actually check the
      return codes to fail. Blaise reported that on preempt-rt a python test
      program ran into an exception trap. -rt exposed that due to a built-in
      spurious wake up accelerator :)
      
      Solve this by checking signal_pending(current) in the wake up path and
      handle the spurious wake up case w/o returning to user space.
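      The decision described above can be summarized as a function of why the
      sleeper woke. This is a hypothetical sketch: struct sim_wakeup, SIM_RETRY
      and SIM_ERESTARTSYS are invented for illustration (ERESTARTSYS is
      kernel-internal and not in the user-space <errno.h>, so its kernel value
      512 is defined locally).

```c
#include <errno.h>
#include <stdbool.h>

#define SIM_ERESTARTSYS 512  /* kernel-internal restart code */
#define SIM_RETRY       1    /* hypothetical: go back to sleep in the kernel */

/* Hypothetical summary of why a futex_wait() sleeper woke up. */
struct sim_wakeup {
	bool dequeued;        /* removed from the queue by futex_wake()/requeue */
	bool timed_out;       /* the timeout fired */
	bool signal_pending;  /* models signal_pending(current) */
};

static int sim_wait_result(const struct sim_wakeup *w)
{
	if (w->dequeued)
		return 0;
	if (w->timed_out)
		return -ETIMEDOUT;
	if (w->signal_pending)
		return -SIM_ERESTARTSYS;   /* handled by the signal delivery code */
	return SIM_RETRY;                  /* spurious: never leaks to user space */
}
```

      The key change is the last branch: before the fix, the no-signal case fell
      into the restart path and its error code reached user space.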
      
      Reported-by: Blaise Gassend <blaise@willowgarage.com>
      Debugged-by: Darren Hart <dvhltc@us.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@kernel.org
      LKML-Reference: <new-submission>
  26. Oct 07, 2009
    • futex: fix requeue_pi key imbalance · da085681
      Darren Hart authored
      
      If futex_wait_requeue_pi() wakes prior to requeue, we drop the
      reference to the source futex_key twice, once in
      handle_early_requeue_pi_wakeup() and once on our way out.
      
      Remove the drop from the handle_early_requeue_pi_wakeup() and keep
      the get/drops together in futex_wait_requeue_pi().
      
      Reported-by: Helge Bahmann <hcb@chaoticmind.net>
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: Helge Bahmann <hcb@chaoticmind.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      Cc: stable-2.6.31 <stable@kernel.org>
      LKML-Reference: <4ACCE21E.5030805@us.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  27. Oct 05, 2009
  28. Sep 24, 2009