1. 30 Jul, 2012 3 commits
  2. 22 Jul, 2012 1 commit
  3. 14 Jul, 2012 2 commits
  4. 07 Jun, 2012 1 commit
  5. 01 Jun, 2012 2 commits
  6. 31 May, 2012 10 commits
    • Doug Ledford's avatar
      ipc/mqueue: add rbtree node caching support · ce2d52cc
      Doug Ledford authored
      
      
      When I wrote the first patch that added the rbtree support for message
      queue insertion, it sped up the case where the queue was very full
      drastically from the original code.  It, however, slowed down the case
      where the queue was empty (not drastically though).
      
      This patch caches the last freed rbtree node struct so we can quickly
      reuse it when we get a new message.  This is the common path for any queue
      that very frequently goes from 0 to 1 then back to 0 messages in queue.
      
      Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
      msg_insert, so this patch attempts to speculatively allocate a new node
      struct outside of the spin lock when we know we need it, but will still
      fall back to a GFP_ATOMIC allocation if it has to.
      
      Once I added the caching, the necessary various ret = ; spin_unlock
      gyrations in mq_timedsend were getting pretty ugly, so this also slightly
      refactors that function to streamline the flow of the code and the
      function exit.
      
      Finally, while working on getting performance back I made sure that all of
      the node structs were always fully initialized when they were first used,
      rendering the use of kzalloc unnecessary and a waste of CPU cycles.
      
      The net result of all of this is:
      
      1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
         on it when necessary.
      
      2) We will speculatively allocate a node struct using GFP_KERNEL if our
         cache is empty (and save the struct to our cache if it's still empty
         after we have obtained the spin lock).
      
      3) The performance of the common queue empty case has significantly
         improved and is now much more in line with the older performance for
         this case.
      
      The performance changes are:
      
                  Old mqueue      new mqueue      new mqueue + caching
      queue empty
      send/recv   305/288ns       349/318ns       310/322ns
      
      I don't think we'll ever be able to get the recv performance back, but
      that's because the old recv performance was a direct result and
      consequence of the old methods abysmal send performance.  The recv path
      simply must do more so that the send path does not incur such a penalty
      under higher queue depths.
      
      As it turns out, the new caching code also sped up the various queue full
      cases relative to my last patch.  That could be because of the difference
      between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
      change in code flow in the mq_timedsend routine.  Regardless, I'll take
      it.  It wasn't huge, and I *would* say it was within the margin for error,
      but after many repeated runs what I'm seeing is that the old numbers trend
      slightly higher (about 10 to 20ns depending on which test is the one
      running).
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce2d52cc
    • Doug Ledford's avatar
      ipc/mqueue: strengthen checks on mqueue creation · 113289cc
      Doug Ledford authored
      
      
      We already check the mq attr struct if it's passed in, but now that the
      admin can set system wide defaults separate from maximums, it's actually
      possible to set the defaults to something that would overflow.  So, if
      there is no attr struct passed in to the open call, check the default
      values.
      
      While we are at it, simplify mq_attr_ok() by making it return 0 or an
      error condition, so that way if we add more tests to it later, we have the
      option of what error should be returned instead of the calling location
      having to pick a possibly inaccurate error code.
      
      [akpm@linux-foundation.org: s/ENOMEM/EOVERFLOW/]
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      113289cc
    • Doug Ledford's avatar
      ipc/mqueue: correct mq_attr_ok test · 2c12ea49
      Doug Ledford authored
      
      
      While working on the other parts of the mqueue stuff, I noticed that the
      calculation for overflow in mq_attr_ok didn't actually match reality (this
      is especially true since my last patch which changed how we account memory
      slightly).
      
      In particular, we used to test for overflow using:
        msgs * msgsize + msgs * sizeof(struct msg_msg *)
      
      That was never really correct because each message we allocate via
      load_msg() is actually a struct msg_msg followed by the data for the
      message (and if struct msg_msg + data exceeds PAGE_SIZE we end up
      allocating struct msg_msgseg structs too, but accounting for them would
      get really tedious, so let's ignore those...they're only a pointer in size
      anyway).  This patch updates the calculation to be more accurate in
      regards to maximum possible memory consumption by the mqueue.
      
      [akpm@linux-foundation.org: add a local to simplify overflow-checking expression]
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2c12ea49
    • Doug Ledford's avatar
      ipc/mqueue: improve performance of send/recv · d6629859
      Doug Ledford authored
      
      
      The existing implementation of the POSIX message queue send and recv
      functions is, well, abysmal.  Even worse than abysmal.  I submitted a
      patch to increase the maximum POSIX message queue limit to 65536 due to
      customer needs, however, upon looking over the send/recv implementation, I
      realized that my customer needs help with that too even if they don't know
      it.  The basic problem is that, given the fairly typical use case scenario
      for a large queue of queueing lots of messages all at the same priority (I
      verified with my customer that this is indeed what their app does), the
      msg_insert routine is basically a frikkin' bubble sort.  I mean, whoa,
      that's *so* middle school.
      
      OK, OK, to not slam the original author too much, I'm sure they didn't
      envision a queue depth of 50,000+ messages.  No one would think that
      moving elements in an array, one at a time, and dereferencing each pointer
      in that array to check priority of the message being pointed too, again
      one at a time, for 50,000+ times would be good.  So let's assume that, as
      is typical, the users have found a way to break our code simply by using
      it in a way we didn't envision.  Fair enough.
      
      "So, just how broken is it?", you ask.  I wondered the same thing, so I
      wrote an app to let me know.  It's my next patch.  It gave me some
      interesting results.  Here's what it tested:
      
      Interference with other apps - In continuous mode, the app just sits there
      and hits a message queue forever, while you go do something productive on
      another terminal using other CPUs.  You then measure how long it takes you
      to do that something productive.  Then you restart the app in fake
      continuous mode, and it sits in a tight loop on a CPU while you repeat
      your tests.  The whole point of this is to keep one CPU tied up (so it
      can't be used in your other work) but in one case tied up hitting the
      mqueue code so we can see the effect of walking that 65,528 element array
      one pointer at a time on the global CPU cache.  If it's bad, then it will
      slow down your app on the other CPUs just by polluting cache mercilessly.
      In the fake case, it will be in a tight loop, but not polluting cache.
      Testing the mqueue subsystem directly - Here we just run a number of tests
      to see how the mqueue subsystem performs under different conditions.  A
      couple conditions are known to be worst case for the old system, and some
      routines, so this tests all of them.
      
      So, on to the results already:
      
      Subsystem/Test                  Old                         New
      
      Time to compile linux
      kernel (make -j12 on a
      6 core CPU)
        Running mqueue test     user 49m10.744s             user 45m26.294s
      			   sys  5m51.924s              sys  4m59.894s
      			 total 55m02.668s            total 50m26.188s
      
        Running fake test       user 45m32.686s             user 45m18.552s
                                 sys  5m12.465s              sys  4m56.468s
                               total 50m45.151s            total 50m15.020s
      
        % slowdown from mqueue
          cache thrashing            ~8%                         ~.5%
      
      Avg time to send/recv (in nanoseconds per message)
        when queue empty            305/288                    349/318
        when queue full (65528 messages)
          constant priority      526589/823                    362/314
          increasing priority    403105/916                    495/445
          decreasing priority     73420/594                    482/409
          random priority        280147/920                    546/436
      
      Time to fill/drain queue (65528 messages, in seconds)
        constant priority         17.37/.12                    .13/.12
        increasing priority        4.14/.14                    .21/.18
        decreasing priority       12.93/.13                    .21/.18
        random priority            8.88/.16                    .22/.17
      
      So, I think the results speak for themselves.  It's possible this
      implementation could be improved by cacheing at least one priority level
      in the node tree (that would bring the queue empty performance more in
      line with the old implementation), but this works and is *so* much better
      than what we had, especially for the common case of a single priority in
      use, that further refinements can be in follow on patches.
      
      [akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
      [levinsasha928@gmail.com: use correct gfp flags in msg_insert]
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarSasha Levin <levinsasha928@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d6629859
    • KOSAKI Motohiro's avatar
      mqueue: separate mqueue default value from maximum value · cef0184c
      KOSAKI Motohiro authored
      Commit b231cca4
      
       ("message queues: increase range limits") changed
      mqueue default value when attr parameter is specified NULL from hard
      coded value to fs.mqueue.{msg,msgsize}_max sysctl value.
      
      This made large side effect.  When user need to use two mqueue
      applications 1) using !NULL attr parameter and it require big message
      size and 2) using NULL attr parameter and only need small size message,
      app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large
      memory size even though it doesn't need.
      
      Doug Ledford propsed to switch back it to static hard coded value.
      However it also has a compatibility problem.  Some applications might
      started depend on the default value is tunable.
      
      The solution is to separate default value from maximum value.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Acked-by: default avatarDoug Ledford <dledford@redhat.com>
      Acked-by: default avatarJoe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Acked-by: default avatarSerge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cef0184c
    • KOSAKI Motohiro's avatar
      mqueue: don't use kmalloc with KMALLOC_MAX_SIZE · fd1f87d2
      KOSAKI Motohiro authored
      
      
      KMALLOC_MAX_SIZE is not a good threshold.  It is extremely high and
      problematic.  Unfortunately, some silly drivers depend on this and we
      can't change it.  But any new code needn't use such extreme ugly high
      order allocations.  It brings us awful fragmentation issues and system
      slowdown.
      Signed-off-by: default avatarKOSAKI Motohiro <mkosaki@jp.fujitsu.com>
      Acked-by: default avatarDoug Ledford <dledford@redhat.com>
      Acked-by: default avatarJoe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fd1f87d2
    • Doug Ledford's avatar
      ipc/mqueue: update maximums for the mqueue subsystem · 5b5c4d1a
      Doug Ledford authored
      Commit b231cca4 ("message queues: increase range limits") changed the
      maximum size of a message in a message queue from INT_MAX to 8192*128.
      Unfortunately, we had customers that relied on a size much larger than
      8192*128 on their production systems.  After reviewing POSIX, we found
      that it is silent on the maximum message size.  We did find a couple other
      areas in which it was not silent.  Fix up the mqueue maximums so that the
      customer's system can continue to work, and document both the POSIX and
      real world requirements in ipc_namespace.h so that we don't have this
      issue crop back up.
      
      Also, commit 9cf18e1d
      
       ("ipc: HARD_MSGMAX should be higher not lower
      on 64bit") fiddled with HARD_MSGMAX without realizing that the number was
      intentionally in place to limit the msg queue depth to one that was small
      enough to kmalloc an array of pointers (hence why we divided 128k by
      sizeof(long)).  If we wish to meet POSIX requirements, we have no choice
      but to change our allocation to a vmalloc instead (at least for the large
      queue size case).  With that, it's possible to increase our allowed
      maximum to the POSIX requirements (or more if we choose).
      
      [sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h]
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b5c4d1a
    • Doug Ledford's avatar
      ipc/mqueue: enforce hard limits · 02967ea0
      Doug Ledford authored
      
      
      In two places we don't enforce the hard limits for CAP_SYS_RESOURCE apps.
      In preparation for making more reasonable hard limits, start enforcing
      them even on CAP_SYS_RESOURCE.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      02967ea0
    • Doug Ledford's avatar
      ipc/mqueue: switch back to using non-max values on create · 858ee378
      Doug Ledford authored
      Commit b231cca4
      
       ("message queues: increase range limits") changed
      how we create a queue that does not include an attr struct passed to
      open so that it creates the queue with whatever the maximum values are.
      However, if the admin has set the maximums to allow flexibility in
      creating a queue (aka, both a large size and large queue are allowed,
      but combined they create a queue too large for the RLIMIT_MSGQUEUE of
      the user), then attempts to create a queue without an attr struct will
      fail.  Switch back to using acceptable defaults regardless of what the
      maximums are.
      
      Note: so far, we only know of a few applications that rely on this
      behavior (specifically, set the maximums in /proc, then run the
      application which calls mq_open() without passing in an attr struct, and
      the application expects the newly created message queue to have the
      maximum sizes that were set in /proc used on the mq_open() call, and all
      of those applications that we know of are actually part of regression
      test suites that were coded to do something like this:
      
      for size in 4096 65536 $((1024 * 1024)) $((16 * 1024 * 1024)); do
      	echo $size > /proc/sys/fs/mqueue/msgsize_max
      	mq_open || echo "Error opening mq with size $size"
      done
      
      These test suites that depend on any behavior like this are broken.  The
      concept that programs should rely upon the system wide maximum in order
      to get their desired results instead of simply using a attr struct to
      specify what they want is fundamentally unfriendly programming practice
      for any multi-tasking OS.
      
      Fixing this will break those few apps that we know of (and those app
      authors recognize the brokenness of their code and the need to fix it).
      However, the following patch "mqueue: separate mqueue default value"
      allows a workaround in the form of new knobs for the default msg queue
      creation parameters for any software out there that we don't already
      know about that might rely on this behavior at the moment.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      858ee378
    • Doug Ledford's avatar
      ipc/mqueue: cleanup definition names and locations · 93e6f119
      Doug Ledford authored
      Since commit b231cca4 ("message queues: increase range limits") on
      Oct 18, 2008, calls to mq_open() that did not pass in an attribute
      struct and expected to get default values for the size of the queue and
      the max message size now get the system wide maximums instead of
      hardwired defaults like they used to get.
      
      This was uncovered when one of the earlier patches in this patch set
      increased the default system wide maximums at the same time it increased
      the hard ceiling on the system wide maximums (a customer specifically
      needed the hard ceiling brought back up, the new ceiling that commit
      b231cca4
      
       introduced was too low for their production systems).  By
      increasing the default maximums and not realising they were tied to any
      attempt to create a message queue without an attribute struct, I had
      inadvertently made it such that all message queue creation attempts
      without an attribute struct were failing because the new default
      maximums would create a queue that exceeded the default rlimit for
      message queue bytes.
      
      As a result, the system wide defaults were brought back down to their
      previous levels, and the system wide ceilings on the maximums were
      raised to meet the customer's needs.  However, the fact that the no
      attribute struct behavior of mq_open() could be broken by changing the
      system wide maximums for message queues was seen as fundamentally broken
      itself.  So we hardwired the no attribute case back like it used to be.
      But, then we realized that on the very off chance that some piece of
      software in the wild depended on that behavior, we could work around
      that issue by adding two new knobs to /proc that allowed setting the
      defaults for message queues created without an attr struct separately
      from the system wide maximums.
      
      What is not an option IMO is to leave the current behavior in place.  No
      piece of software should ever rely on setting the system wide maximums
      in order to get a desired message queue.  Such a reliance would be so
      fundamentally multitasking OS unfriendly as to not really be tolerable.
      Fortunately, we don't know of any software in the wild that uses this
      except for a regression test program that caught the issue in the first
      place.  If there is though, we have made accommodations with the two new
      /proc knobs (and that's all the accommodations such fundamentally broken
      software can be allowed)..
      
      This patch:
      
      The various defines for minimums and maximums of the sysctl controllable
      mqueue values are scattered amongst different files and named
      inconsistently.  Move them all into ipc_namespace.h and make them have
      consistent names.  Additionally, make the number of queues per namespace
      also have a minimum and maximum and use the same sysctl function as the
      other two settable variables.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Acked-by: default avatarSerge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      93e6f119
  7. 05 May, 2012 1 commit
  8. 03 May, 2012 1 commit
  9. 07 Apr, 2012 2 commits
  10. 21 Mar, 2012 1 commit
  11. 20 Mar, 2012 1 commit
  12. 15 Mar, 2012 1 commit
    • Chris Metcalf's avatar
      [PATCH v3] ipc: provide generic compat versions of IPC syscalls · 48b25c43
      Chris Metcalf authored
      
      
      When using the "compat" APIs, architectures will generally want to
      be able to make direct syscalls to msgsnd(), shmctl(), etc., and
      in the kernel we would want them to be handled directly by
      compat_sys_xxx() functions, as is true for other compat syscalls.
      
      However, for historical reasons, several of the existing compat IPC
      syscalls do not do this.  semctl() expects a pointer to the fourth
      argument, instead of the fourth argument itself.  msgsnd(), msgrcv()
      and shmat() expect arguments in different order.
      
      This change adds an ARCH_WANT_OLD_COMPAT_IPC config option that can be
      set to preserve this behavior for ports that use it (x86, sparc, powerpc,
      s390, and mips).  No actual semantics are changed for those architectures,
      and there is only a minimal amount of code refactoring in ipc/compat.c.
      
      Newer architectures like tile (and perhaps future architectures such
      as arm64 and unicore64) should not select this option, and thus can
      avoid having any IPC-specific code at all in their architecture-specific
      compat layer.  In the same vein, if this option is not selected, IPC_64
      mode is assumed, since that's what the <asm-generic> headers expect.
      
      The workaround code in "tile" for msgsnd() and msgrcv() is removed
      with this change; it also fixes the bug that shmat() and semctl() were
      not being properly handled.
      Reviewed-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarChris Metcalf <cmetcalf@tilera.com>
      48b25c43
  13. 13 Feb, 2012 1 commit
  14. 23 Jan, 2012 3 commits
    • Hugh Dickins's avatar
      SHM_UNLOCK: fix Unevictable pages stranded after swap · 24513264
      Hugh Dickins authored
      Commit cc39c6a9
      
       ("mm: account skipped entries to avoid looping in
      find_get_pages") correctly fixed an infinite loop; but left a problem
      that find_get_pages() on shmem would return 0 (appearing to callers to
      mean end of tree) when it meets a run of nr_pages swap entries.
      
      The only uses of find_get_pages() on shmem are via pagevec_lookup(),
      called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
      scan_mapping_unevictable_pages().  The first is already commented, and
      not worth worrying about; but the second can leave pages on the
      Unevictable list after an unusual sequence of swapping and locking.
      
      Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
      swap) instead of pagevec_lookup().
      
      But I don't want to contaminate vmscan.c with shmem internals, nor
      shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
      shmem.c, renaming it shmem_unlock_mapping(); and rename
      check_move_unevictable_page() to check_move_unevictable_pages(), looping
      down an array of pages, oftentimes under the same lock.
      
      Leave out the "rotate unevictable list" block: that's a leftover from
      when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
      handling involved looking at pages at tail of LRU.
      
      Was there significance to the sequence first ClearPageUnevictable, then
      test page_evictable, then SetPageUnevictable here? I think not, we're
      under LRU lock, and have no barriers between those.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24513264
    • Hugh Dickins's avatar
      SHM_UNLOCK: fix long unpreemptible section · 85046579
      Hugh Dickins authored
      
      
      scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
      evictable again once the shared memory is unlocked.  It does this with
      pagevec_lookup()s across the whole object (which might occupy most of
      memory), and takes 300ms to unlock 7GB here.  A cond_resched() every
      PAGEVEC_SIZE pages would be good.
      
      However, KOSAKI-san points out that this is called under shmem.c's
      info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
      There is no strong reason for that: we need to take these pages off the
      unevictable list soonish, but those locks are not required for it.
      
      So move the call to scan_mapping_unevictable_pages() from shmem.c's
      unlock handling up to shm.c's unlock handling.  Remove the recently
      added barrier, not needed now we have spin_unlock() before the scan.
      
      Use get_file(), with subsequent fput(), to make sure we have a reference
      to mapping throughout scan_mapping_unevictable_pages(): that's something
      that was previously guaranteed by the shm_lock().
      
      Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
      time, and we lazily discover them to be Unevictable later, so it serves
      no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
      pages still on pagevec are not marked Unevictable.
      
      The original code avoided redundant rescans by checking VM_LOCKED flag
      at its level: now avoid them by checking shp's SHM_LOCKED.
      
      The original code called scan_mapping_unevictable_pages() on a locked
      area at shm_destroy() time: perhaps we once had accounting cross-checks
      which required that, but not now, so skip the overhead and just let
      inode eviction deal with them.
      
      Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
      under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
      more as comment than to save space; comment them used for SHM_UNLOCK.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85046579
    • Davidlohr Bueso's avatar
      ipc/mqueue: simplify reading msgqueue limit · 2a4e64b8
      Davidlohr Bueso authored
      
      
      Because the current task is being used to get the limit, we can simply
      use rlimit() instead of task_rlimit().
      Signed-off-by: default avatarDavidlohr Bueso <dave@gnu.org>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a4e64b8
  15. 10 Jan, 2012 1 commit
    • Serge E. Hallyn's avatar
      user namespace: make signal.c respect user namespaces · 6b550f94
      Serge E. Hallyn authored
      
      
      ipc/mqueue.c: for __SI_MESQ, convert the uid being sent to recipient's
      user namespace. (new, thanks Oleg)
      
      __send_signal: convert current's uid to the recipient's user namespace
      for any siginfo which is not SI_FROMKERNEL (patch from Oleg, thanks
      again :)
      
      do_notify_parent and do_notify_parent_cldstop: map task's uid to parent's
      user namespace
      
      ptrace_signal maps parent's uid into current's user namespace before
      including in signal to current.  IIUC Oleg has argued that this shouldn't
      matter as the debugger will play with it, but it seems like not converting
      the value currently being set is misleading.
      
      Changelog:
      Sep 20: Inspired by Oleg's suggestion, define map_cred_ns() helper to
      	simplify callers and help make clear what we are translating
              (which uid into which namespace).  Passing the target task would
      	make callers even easier to read, but we pass in user_ns because
      	current_user_ns() != task_cred_xxx(current, user_ns).
      Sep 20: As recommended by Oleg, also put task_pid_vnr() under rcu_read_lock
      	in ptrace_signal().
      Sep 23: In send_signal(), detect when (user) signal is coming from an
      	ancestor or unrelated user namespace.  Pass that on to __send_signal,
      	which sets si_uid to 0 or overflowuid if needed.
      Oct 12: Base on Oleg's fixup_uid() patch.  On top of that, handle all
      	SI_FROMKERNEL cases at callers, because we can't assume sender is
      	current in those cases.
      Nov 10: (mhelsley) rename fixup_uid to more meaningful usern_fixup_signal_uid
      Nov 10: (akpm) make the !CONFIG_USER_NS case clearer
      Signed-off-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      From: Serge Hallyn <serge.hallyn@canonical.com>
      Subject: __send_signal: pass q->info, not info, to userns_fixup_signal_uid (v2)
      
      Eric Biederman pointed out that passing info is a bug and could lead to a
      NULL pointer deref to boot.
      
      A collection of signal, securebits, filecaps, cap_bounds, and a few other
      ltp tests passed with this kernel.
      
      Changelog:
          Nov 18: previous patch missed a leading '&'
      Signed-off-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      From: Dan Carpenter <dan.carpenter@oracle.com>
      Subject: ipc/mqueue: lock() => unlock() typo
      
      There was a double lock typo introduced in b085f4bd6b21 "user namespace:
      make signal.c respect user namespaces"
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: default avatarSerge Hallyn <serge@hallyn.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b550f94
  16. 03 Jan, 2012 4 commits
  17. 08 Dec, 2011 1 commit
  18. 02 Nov, 2011 3 commits
  19. 31 Oct, 2011 1 commit