1. 10 Jul, 2013 1 commit
  2. 09 Jul, 2013 2 commits
  3. 03 Jul, 2013 1 commit
  4. 09 May, 2013 1 commit
  5. 07 May, 2013 1 commit
  6. 29 Apr, 2013 7 commits
    • Cyril Hrubis's avatar
      mm/mmap: check for RLIMIT_AS before unmapping · e8420a8e
      Cyril Hrubis authored
      Fix a corner case for MAP_FIXED when requested mapping length is larger
      than rlimit for virtual memory.  In such case any overlapping mappings
      are unmapped before we check for the limit and return ENOMEM.
      
      The check is moved before the loop that unmaps overlapping parts of
      existing mappings.  When we are about to hit the limit (currently mapped
      pages + len > limit) we scan for overlapping pages and check again
      accounting for them.
      
      This fixes situation when userspace program expects that the previous
      mappings are preserved after the mmap() syscall has returned with error.
      (POSIX clearly states that successfull mapping shall replace any
      previous mappings.)
      
      This corner case was found and can be tested with LTP testcase:
      
      testcases/open_posix_testsuite/conformance/interfaces/mmap/24-2.c
      
      In this case the mmap, which is clearly over current limit, unmaps
      dynamic libraries and the testcase segfaults right after returning into
      userspace.
      
      I've also looked at the second instance of the unmapping loop in the
      do_brk().  The do_brk() is called from brk() syscall and from vm_brk().
      The brk() syscall checks for overlapping mappings and bails out when
      there are any (so it can't be triggered from the brk syscall).  The
      vm_brk() is called only from binmft handlers so it shouldn't be
      triggered unless binmft handler created overlapping mappings.
      Signed-off-by: default avatarCyril Hrubis <chrubis@suse.cz>
      Reviewed-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e8420a8e
    • Andrew Shewmaker's avatar
      mm: reinititalise user and admin reserves if memory is added or removed · 1640879a
      Andrew Shewmaker authored
      Alter the admin and user reserves of the previous patches in this series
      when memory is added or removed.
      
      If memory is added and the reserves have been eliminated or increased
      above the default max, then we'll trust the admin.
      
      If memory is removed and there isn't enough free memory, then we need to
      reset the reserves.
      
      Otherwise keep the reserve set by the admin.
      
      The reserve reset code is the same as the reserve initialization code.
      
      I tested hot addition and removal by triggering it via sysfs.  The
      reserves shrunk when they were set high and memory was removed.  They
      were reset higher when memory was added again.
      
      [akpm@linux-foundation.org: use register_hotmemory_notifier()]
      [akpm@linux-foundation.org: init_user_reserve() and init_admin_reserve can no longer be __meminit]
      [fengguang.wu@intel.com: make init_reserve_notifier() static]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarAndrew Shewmaker <agshew@gmail.com>
      Signed-off-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1640879a
    • Andrew Shewmaker's avatar
      mm: replace hardcoded 3% with admin_reserve_pages knob · 4eeab4f5
      Andrew Shewmaker authored
      Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
      memory reserve to something other than 3%, which may be multiple
      gigabytes on large memory systems.  Only about 8MB is necessary to
      enable recovery in the default mode, and only a few hundred MB are
      required even when overcommit is disabled.
      
      This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.
      
      admin_reserve_kbytes is initialized to min(3% free pages, 8MB)
      
      I arrived at 8MB by summing the RSS of sshd or login, bash, and top.
      
      Please see first patch in this series for full background, motivation,
      testing, and full changelog.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: make init_admin_reserve() static]
      Signed-off-by: default avatarAndrew Shewmaker <agshew@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4eeab4f5
    • Andrew Shewmaker's avatar
      mm: limit growth of 3% hardcoded other user reserve · c9b1d098
      Andrew Shewmaker authored
      Add user_reserve_kbytes knob.
      
      Limit the growth of the memory reserved for other user processes to
      min(3% current process size, user_reserve_pages).  Only about 8MB is
      necessary to enable recovery in the default mode, and only a few hundred
      MB are required even when overcommit is disabled.
      
      user_reserve_pages defaults to min(3% free pages, 128MB)
      
      I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
      then adding the RSS of each.
      
      This only affects OVERCOMMIT_NEVER mode.
      
      Background
      
      1. user reserve
      
      __vm_enough_memory reserves a hardcoded 3% of the current process size for
      other applications when overcommit is disabled.  This was done so that a
      user could recover if they launched a memory hogging process.  Without the
      reserve, a user would easily run into a message such as:
      
      bash: fork: Cannot allocate memory
      
      2. admin reserve
      
      Additionally, a hardcoded 3% of free memory is reserved for root in both
      overcommit 'guess' and 'never' modes.  This was intended to prevent a
      scenario where root-cant-log-in and perform recovery operations.
      
      Note that this reserve shrinks, and doesn't guarantee a useful reserve.
      
      Motivation
      
      The two hardcoded memory reserves should be updated to account for current
      memory sizes.
      
      Also, the admin reserve would be more useful if it didn't shrink too much.
      
      When the current code was originally written, 1GB was considered
      "enterprise".  Now the 3% reserve can grow to multiple GB on large memory
      systems, and it only needs to be a few hundred MB at most to enable a user
      or admin to recover a system with an unwanted memory hogging process.
      
      I've found that reducing these reserves is especially beneficial for a
      specific type of application load:
      
       * single application system
       * one or few processes (e.g. one per core)
       * allocating all available memory
       * not initializing every page immediately
       * long running
      
      I've run scientific clusters with this sort of load.  A long running job
      sometimes failed many hours (weeks of CPU time) into a calculation.  They
      weren't initializing all of their memory immediately, and they weren't
      using calloc, so I put systems into overcommit 'never' mode.  These
      clusters run diskless and have no swap.
      
      However, with the current reserves, a user wishing to allocate as much
      memory as possible to one process may be prevented from using, for
      example, almost 2GB out of 32GB.
      
      The effect is less, but still significant when a user starts a job with
      one process per core.  I have repeatedly seen a set of processes
      requesting the same amount of memory fail because one of them could not
      allocate the amount of memory a user would expect to be able to allocate.
      For example, Message Passing Interfce (MPI) processes, one per core.  And
      it is similar for other parallel programming frameworks.
      
      Changing this reserve code will make the overcommit never mode more useful
      by allowing applications to allocate nearly all of the available memory.
      
      Also, the new admin_reserve_kbytes will be safer than the current behavior
      since the hardcoded 3% of available memory reserve can shrink to something
      useless in the case where applications have grabbed all available memory.
      
      Risks
      
      * "bash: fork: Cannot allocate memory"
      
        The downside of the first patch-- which creates a tunable user reserve
        that is only used in overcommit 'never' mode--is that an admin can set
        it so low that a user may not be able to kill their process, even if
        they already have a shell prompt.
      
        Of course, a user can get in the same predicament with the current 3%
        reserve--they just have to launch processes until 3% becomes negligible.
      
      * root-cant-log-in problem
      
        The second patch, adding the tunable rootuser_reserve_pages, allows
        the admin to shoot themselves in the foot by setting it too small.  They
        can easily get the system into a state where root-can't-log-in.
      
        However, the new admin_reserve_kbytes will be safer than the current
        behavior since the hardcoded 3% of available memory reserve can shrink
        to something useless in the case where applications have grabbed all
        available memory.
      
      Alternatives
      
       * Memory cgroups provide a more flexible way to limit application memory.
      
         Not everyone wants to set up cgroups or deal with their overhead.
      
       * We could create a fourth overcommit mode which provides smaller reserves.
      
         The size of useful reserves may be drastically different depending
         on the whether the system is embedded or enterprise.
      
       * Force users to initialize all of their memory or use calloc.
      
         Some users don't want/expect the system to overcommit when they malloc.
         Overcommit 'never' mode is for this scenario, and it should work well.
      
      The new user and admin reserve tunables are simple to use, with low
      overhead compared to cgroups.  The patches preserve current behavior where
      3% of memory is less than 128MB, except that the admin reserve doesn't
      shrink to an unusable size under pressure.  The code allows admins to tune
      for embedded and enterprise usage.
      
      FAQ
      
       * How is the root-cant-login problem addressed?
         What happens if admin_reserve_pages is set to 0?
      
         Root is free to shoot themselves in the foot by setting
         admin_reserve_kbytes too low.
      
         On x86_64, the minimum useful reserve is:
           8MB for overcommit 'guess'
         128MB for overcommit 'never'
      
         admin_reserve_pages defaults to min(3% free memory, 8MB)
      
         So, anyone switching to 'never' mode needs to adjust
         admin_reserve_pages.
      
       * How do you calculate a minimum useful reserve?
      
         A user or the admin needs enough memory to login and perform
         recovery operations, which includes, at a minimum:
      
         sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
      
         For overcommit 'guess', we can sum resident set sizes (RSS)
         because we only need enough memory to handle what the recovery
         programs will typically use. On x86_64 this is about 8MB.
      
         For overcommit 'never', we can take the max of their virtual sizes (VSZ)
         and add the sum of their RSS. We use VSZ instead of RSS because mode
         forces us to ensure we can fulfill all of the requested memory allocations--
         even if the programs only use a fraction of what they ask for.
         On x86_64 this is about 128MB.
      
         When swap is enabled, reserves are useful even when they are as
         small as 10MB, regardless of overcommit mode.
      
         When both swap and overcommit are disabled, then the admin should
         tune the reserves higher to be absolutley safe. Over 230MB each
         was safest in my testing.
      
       * What happens if user_reserve_pages is set to 0?
      
         Note, this only affects overcomitt 'never' mode.
      
         Then a user will be able to allocate all available memory minus
         admin_reserve_kbytes.
      
         However, they will easily see a message such as:
      
         "bash: fork: Cannot allocate memory"
      
         And they won't be able to recover/kill their application.
         The admin should be able to recover the system if
         admin_reserve_kbytes is set appropriately.
      
       * What's the difference between overcommit 'guess' and 'never'?
      
         "Guess" allows an allocation if there are enough free + reclaimable
         pages. It has a hardcoded 3% of free pages reserved for root.
      
         "Never" allows an allocation if there is enough swap + a configurable
         percentage (default is 50) of physical RAM. It has a hardcoded 3% of
         free pages reserved for root, like "Guess" mode. It also has a
         hardcoded 3% of the current process size reserved for additional
         applications.
      
       * Why is overcommit 'guess' not suitable even when an app eventually
         writes to every page? It takes free pages, file pages, available
         swap pages, reclaimable slab pages into consideration. In other words,
         these are all pages available, then why isn't overcommit suitable?
      
         Because it only looks at the present state of the system. It
         does not take into account the memory that other applications have
         malloced, but haven't initialized yet. It overcommits the system.
      
      Test Summary
      
      There was little change in behavior in the default overcommit 'guess'
      mode with swap enabled before and after the patch. This was expected.
      
      Systems run most predictably (i.e. no oom kills) in overcommit 'never'
      mode with swap enabled. This also allowed the most memory to be allocated
      to a user application.
      
      Overcommit 'guess' mode without swap is a bad idea. It is easy to
      crash the system. None of the other tested combinations crashed.
      This matches my experience on the Roadrunner supercomputer.
      
      Without the tunable user reserve, a system in overcommit 'never' mode
      and without swap does not allow the admin to recover, although the
      admin can.
      
      With the new tunable reserves, a system in overcommit 'never' mode
      and without swap can be configured to:
      
      1. maximize user-allocatable memory, running close to the edge of
      recoverability
      
      2. maximize recoverability, sacrificing allocatable memory to
      ensure that a user cannot take down a system
      
      Test Description
      
      Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
      
      System is booted into multiuser console mode, with unnecessary services
      turned off. Caches were dropped before each test.
      
      Hogs are user memtester processes that attempt to allocate all free memory
      as reported by /proc/meminfo
      
      In overcommit 'never' mode, memory_ratio=100
      
      Test Results
      
      3.9.0-rc1-mm1
      
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5432/5432       no     yes             yes
      guess        yes    4      5444/5444       1      yes             yes
      guess        no     1      5302/5449       no     yes             yes
      guess        no     4      -               crash  no              no
      
      never        yes    1      5460/5460       1      yes             yes
      never        yes    4      5460/5460       1      yes             yes
      never        no     1      5218/5432       no     no              yes
      never        no     4      5203/5448       no     no              yes
      
      3.9.0-rc1-mm1-tunablereserves
      
      User and Admin Recovery show their respective reserves, if applicable.
      
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5419/5419       no     - yes           8MB yes
      guess        yes    4      5436/5436       1      - yes           8MB yes
      guess        no     1      5440/5440       *      - yes           8MB yes
      guess        no     4      -               crash  - no            8MB no
      
      * process would successfully mlock, then the oom killer would pick it
      
      never        yes    1      5446/5446       no     10MB yes        20MB yes
      never        yes    4      5456/5456       no     10MB yes        20MB yes
      never        no     1      5387/5429       no     128MB no        8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      
      never        no     1      5359/5448       no     10MB no         10MB barely
      
      never        no     1      5323/5428       no     0MB no          10MB barely
      never        no     1      5332/5428       no     0MB no          50MB yes
      never        no     1      5293/5429       no     0MB no          90MB yes
      
      never        no     1      5001/5427       no     230MB yes       338MB yes
      never        no     4*     4998/5424       no     230MB yes       338MB yes
      
      * more memtesters were launched, able to allocate approximately another 100MB
      
      Future Work
      
       - Test larger memory systems.
      
       - Test an embedded image.
      
       - Test other architectures.
      
       - Time malloc microbenchmarks.
      
       - Would it be useful to be able to set overcommit policy for
         each memory cgroup?
      
       - Some lines are slightly above 80 chars.
         Perhaps define a macro to convert between pages and kb?
         Other places in the kernel do this.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: make init_user_reserve() static]
      Signed-off-by: default avatarAndrew Shewmaker <agshew@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c9b1d098
    • Hampson, Steven T's avatar
      mm: merging memory blocks resets mempolicy · 1444f92c
      Hampson, Steven T authored
      Using mbind to change the mempolicy to MPOL_BIND on several adjacent
      mmapped blocks may result in a reset of the mempolicy to MPOL_DEFAULT in
      vma_adjust.
      
      Test code.  Correct result is three lines containing "OK".
      
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <numaif.h>
      #include <errno.h>
      
      /* gcc mbind_test.c -lnuma -o mbind_test -Wall */
      #define MAXNODE 4096
      
      void allocate()
      {
      	int ret;
      	int len;
      	int policy = -1;
      	unsigned char *p;
      	unsigned long mask[MAXNODE] = { 0 };
      	unsigned long retmask[MAXNODE] = { 0 };
      
      	len = getpagesize() * 0x2fc00;
      	p = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
      		 -1, 0);
      	if (p == MAP_FAILED)
      		printf("mbind err: %d\n", errno);
      
      	mask[0] = 1;
      	ret = mbind(p, len, MPOL_BIND, mask, MAXNODE, 0);
      	if (ret < 0)
      		printf("mbind err: %d %d\n", ret, errno);
      	ret = get_mempolicy(&policy, retmask, MAXNODE, p, MPOL_F_ADDR);
      	if (ret < 0)
      		printf("get_mempolicy err: %d %d\n", ret, errno);
      
      	if (policy == MPOL_BIND)
      		printf("OK\n");
      	else
      		printf("ERROR: policy is %d\n", policy);
      }
      
      int main()
      {
      	allocate();
      	allocate();
      	allocate();
      	return 0;
      }
      Signed-off-by: default avatarSteven T Hampson <steven.t.hampson@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1444f92c
    • Hugh Dickins's avatar
      mm: allow arch code to control the user page table ceiling · 6ee8630e
      Hugh Dickins authored
      On architectures where a pgd entry may be shared between user and kernel
      (e.g.  ARM+LPAE), freeing page tables needs a ceiling other than 0.
      This patch introduces a generic USER_PGTABLES_CEILING that arch code can
      override.  It is the responsibility of the arch code setting the ceiling
      to ensure the complete freeing of the page tables (usually in
      pgd_free()).
      
      [catalin.marinas@arm.com: commit log; shift_arg_pages(), asm-generic/pgtables.h changes]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[3.3+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6ee8630e
    • Zhang Yanfei's avatar
      mmap: find_vma: remove the WARN_ON_ONCE(!mm) check · ee5df057
      Zhang Yanfei authored
      Remove the WARN_ON_ONCE(!mm) check as the comment suggested.  Kernel
      code calls find_vma only when it is absolutely sure that the mm_struct
      arg to it is non-NULL.
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: k80c <k80ck80c@gmail.com>
      Cc: Michel Lespinasse <walken@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ee5df057
  7. 25 Apr, 2013 1 commit
  8. 04 Apr, 2013 1 commit
    • Jan Stancek's avatar
      mm: prevent mmap_cache race in find_vma() · b6a9b7f6
      Jan Stancek authored
      find_vma() can be called by multiple threads with read lock
      held on mm->mmap_sem and any of them can update mm->mmap_cache.
      Prevent compiler from re-fetching mm->mmap_cache, because other
      readers could update it in the meantime:
      
                     thread 1                             thread 2
                                              |
        find_vma()                            |  find_vma()
          struct vm_area_struct *vma = NULL;  |
          vma = mm->mmap_cache;               |
          if (!(vma && vma->vm_end > addr     |
              && vma->vm_start <= addr)) {    |
                                              |    mm->mmap_cache = vma;
          return vma;                         |
           ^^ compiler may optimize this      |
              local variable out and re-read  |
              mm->mmap_cache                  |
      
      This issue can be reproduced with gcc-4.8.0-1 on s390x by running
      mallocstress testcase from LTP, which triggers:
      
        kernel BUG at mm/rmap.c:1088!
          Call Trace:
           ([<000003d100c57000>] 0x3d100c57000)
            [<000000000023a1c0>] do_wp_page+0x2fc/0xa88
            [<000000000023baae>] handle_pte_fault+0x41a/0xac8
            [<000000000023d832>] handle_mm_fault+0x17a/0x268
            [<000000000060507a>] do_protection_exception+0x1e2/0x394
            [<0000000000603a04>] pgm_check_handler+0x138/0x13c
            [<000003fffcf1f07a>] 0x3fffcf1f07a
          Last Breaking-Event-Address:
            [<000000000024755e>] page_add_new_anon_rmap+0xc2/0x168
      
      Thanks to Jakub Jelinek for his insight on gcc and helping to
      track this down.
      Signed-off-by: default avatarJan Stancek <jstancek@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6a9b7f6
  9. 28 Mar, 2013 1 commit
  10. 27 Feb, 2013 1 commit
    • Linus Torvalds's avatar
      mm: do not grow the stack vma just because of an overrun on preceding vma · 09884964
      Linus Torvalds authored
      The stack vma is designed to grow automatically (marked with VM_GROWSUP
      or VM_GROWSDOWN depending on architecture) when an access is made beyond
      the existing boundary.  However, particularly if you have not limited
      your stack at all ("ulimit -s unlimited"), this can cause the stack to
      grow even if the access was really just one past *another* segment.
      
      And that's wrong, especially since we first grow the segment, but then
      immediately later enforce the stack guard page on the last page of the
      segment.  So _despite_ first growing the stack segment as a result of
      the access, the kernel will then make the access cause a SIGSEGV anyway!
      
      So do the same logic as the guard page check does, and consider an
      access to within one page of the next segment to be a bad access, rather
      than growing the stack to abut the next segment.
      Reported-and-tested-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09884964
  11. 23 Feb, 2013 8 commits
  12. 22 Feb, 2013 1 commit
  13. 07 Feb, 2013 1 commit
  14. 05 Feb, 2013 1 commit
  15. 11 Jan, 2013 1 commit
    • Jiri Kosina's avatar
      mm: mmap: annotate vm_lock_anon_vma locking properly for lockdep · 572043c9
      Jiri Kosina authored
      Commit 5a505085 ("mm/rmap: Convert the struct anon_vma::mutex to an
      rwsem") turned anon_vma mutex to rwsem.
      
      However, the properly annotated nested locking in mm_take_all_locks()
      has been converted from
      
      	mutex_lock_nest_lock(&anon_vma->root->mutex, &mm->mmap_sem);
      
      to
      
      	down_write(&anon_vma->root->rwsem);
      
      which is incomplete, and causes the false positive report from lockdep
      below.
      
      Annotate the fact that mmap_sem is used as an outter lock to serialize
      taking of all the anon_vma rwsems at once no matter the order, using the
      down_write_nest_lock() primitive.
      
      This patch fixes this lockdep report:
      
       =============================================
       [ INFO: possible recursive locking detected ]
       3.8.0-rc2-00036-g5f738967 #171 Not tainted
       ---------------------------------------------
       qemu-kvm/2315 is trying to acquire lock:
        (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0
      
       but task is already holding lock:
        (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&anon_vma->rwsem);
         lock(&anon_vma->rwsem);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       4 locks held by qemu-kvm/2315:
        #0:  (&mm->mmap_sem){++++++}, at: do_mmu_notifier_register+0xfc/0x170
        #1:  (mm_all_locks_mutex){+.+...}, at: mm_take_all_locks+0x36/0x1b0
        #2:  (&mapping->i_mmap_mutex){+.+...}, at: mm_take_all_locks+0xc9/0x1b0
        #3:  (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0
      
       stack backtrace:
       Pid: 2315, comm: qemu-kvm Not tainted 3.8.0-rc2-00036-g5f738967 #171
       Call Trace:
         print_deadlock_bug+0xf2/0x100
         validate_chain+0x4f6/0x720
         __lock_acquire+0x359/0x580
         lock_acquire+0x121/0x190
         down_write+0x3f/0x70
         mm_take_all_locks+0x149/0x1b0
         do_mmu_notifier_register+0x68/0x170
         mmu_notifier_register+0xe/0x10
         kvm_create_vm+0x22b/0x330 [kvm]
         kvm_dev_ioctl+0xf8/0x1a0 [kvm]
         do_vfs_ioctl+0x9d/0x350
         sys_ioctl+0x91/0xb0
         system_call_fastpath+0x16/0x1b
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Tested-by: default avatarSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      572043c9
  16. 12 Dec, 2012 2 commits
    • Michel Lespinasse's avatar
      mm: protect against concurrent vma expansion · 4128997b
      Michel Lespinasse authored
      expand_stack() runs with a shared mmap_sem lock.  Because of this, there
      could be multiple concurrent stack expansions in the same mm, which may
      cause problems in the vma gap update code.
      
      I propose to solve this by taking the mm->page_table_lock around such vma
      expansions, in order to avoid the concurrency issue.  We only have to
      worry about concurrent expand_stack() calls here, since we hold a shared
      mmap_sem lock and all vma modificaitons other than expand_stack() are done
      under an exclusive mmap_sem lock.
      
      I previously tried to achieve the same effect by making sure all growable
      vmas in a given mm would share the same anon_vma, which we already lock
      here.  However this turned out to be difficult - all of the schemes I
      tried for refcounting the growable anon_vma and clearing turned out ugly.
      So, I'm now proposing only the minimal fix.
      
      The overhead of taking the page table lock during stack expansion is
      expected to be small: glibc doesn't use expandable stacks for the threads
      it creates, so having multiple growable stacks is actually uncommon and we
      don't expect the page table lock to get bounced between threads.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4128997b
    • Joonsoo Kim's avatar
      mm: WARN_ON_ONCE if f_op->mmap() change vma's start address · 2897b4d2
      Joonsoo Kim authored
      During reviewing the source code, I found a comment which mention that
      after f_op->mmap(), vma's start address can be changed.  I didn't verify
      that it is really possible, because there are so many f_op->mmap()
      implementation.  But if there are some mmap() which change vma's start
      address, it is possible error situation, because we already prepare prev
      vma, rb_link and rb_parent and these are related to original address.
      
      So add WARN_ON_ONCE for finding that this situtation really happens.
      Signed-off-by: default avatarJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2897b4d2
  17. 11 Dec, 2012 6 commits
    • Michel Lespinasse's avatar
      mm: vm_unmapped_area() lookup function · db4fbfb9
      Michel Lespinasse authored
      Implement vm_unmapped_area() using the rb_subtree_gap and highest_vm_end
      information to look up for suitable virtual address space gaps.
      
      struct vm_unmapped_area_info is used to define the desired allocation
      request:
       - lowest or highest possible address matching the remaining constraints
       - desired gap length
       - low/high address limits that the gap must fit into
       - alignment mask and offset
      
      Also update the generic arch_get_unmapped_area[_topdown] functions to make
      use of vm_unmapped_area() instead of implementing a brute force search.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      db4fbfb9
    • Michel Lespinasse's avatar
      mm: check rb_subtree_gap correctness · 5a0768f6
      Michel Lespinasse authored
      When CONFIG_DEBUG_VM_RB is enabled, check that rb_subtree_gap is correctly
      set for every vma and that mm->highest_vm_end is also correct.
      
      Also add an explicit 'bug' variable to track if browse_rb() detected any
      invalid condition.
      
      [akpm@linux-foundation.org: repair innovative coding-style inventions]
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5a0768f6
    • Michel Lespinasse's avatar
      mm: augment vma rbtree with rb_subtree_gap · d3737187
      Michel Lespinasse authored
      Define vma->rb_subtree_gap as the largest gap between any vma in the
      subtree rooted at that vma, and their predecessor.  Or, for a recursive
      definition, vma->rb_subtree_gap is the max of:
      
       - vma->vm_start - vma->vm_prev->vm_end
       - rb_subtree_gap fields of the vmas pointed by vma->rb.rb_left and
         vma->rb.rb_right
      
      This will allow get_unmapped_area_* to find a free area of the right
      size in O(log(N)) time, instead of potentially having to do a linear
      walk across all the VMAs.
      
      Also define mm->highest_vm_end as the vm_end field of the highest vma,
      so that we can easily check if the following gap is suitable.
      
      This does have the potential to make unmapping VMAs more expensive,
      especially for processes with very large numbers of VMAs, where the VMA
      rbtree can grow quite deep.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d3737187
    • Andi Kleen's avatar
      mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB · 42d7395f
      Andi Kleen authored
      There was some desire in large applications using MAP_HUGETLB or
      SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
      others.  This is useful together with NUMA policy: use 2MB interleaving
      on some mappings, but 1GB on local mappings.
      
      This patch extends the IPC/SHM syscall interfaces slightly to allow
      specifying the page size.
      
      It borrows some upper bits in the existing flag arguments and allows
      encoding the log of the desired page size in addition to the *_HUGETLB
      flag.  When 0 is specified the default size is used, this makes the
      change fully compatible.
      
      Extending the internal hugetlb code to handle this is straight forward.
      Instead of a single mount it just keeps an array of them and selects the
      right mount based on the specified page size.  When no page size is
      specified it uses the mount of the default page size.
      
      The change is not visible in /proc/mounts because internal mounts don't
      appear there.  It also has very little overhead: the additional mounts
      just consume a super block, but not more memory when not used.
      
      I also exported the new flags to the user headers (they were previously
      under __KERNEL__).  Right now only symbols for x86 and some other
      architecture for 1GB and 2MB are defined.  The interface should already
      work for all other architectures though.  Only architectures that define
      multiple hugetlb sizes actually need it (that is currently x86, tile,
      powerpc).  However tile and powerpc have user configurable hugetlb
      sizes, so it's not easy to add defines.  A program on those
      architectures would need to query sysfs and use the appropiate log2.
      
      [akpm@linux-foundation.org: cleanups]
      [rientjes@google.com: fix build]
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      42d7395f
    • Ingo Molnar's avatar
      mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable · 4fc3f1d6
      Ingo Molnar authored
      rmap_walk_anon() and try_to_unmap_anon() appears to be too
      careful about locking the anon vma: while it needs protection
      against anon vma list modifications, it does not need exclusive
      access to the list itself.
      
      Transforming this exclusive lock to a read-locked rwsem removes
      a global lock from the hot path of page-migration intense
      threaded workloads which can cause pathological performance like
      this:
      
          96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                        |
                        --- perf_trace_sched_switch
                            __schedule
                            schedule
                            schedule_preempt_disabled
                            __mutex_lock_common.isra.6
                            __mutex_lock_slowpath
                            mutex_lock
                           |
                           |--50.61%-- rmap_walk
                           |          move_to_new_page
                           |          migrate_pages
                           |          migrate_misplaced_page
                           |          __do_numa_page.isra.69
                           |          handle_pte_fault
                           |          handle_mm_fault
                           |          __do_page_fault
                           |          do_page_fault
                           |          page_fault
                           |          __memset_sse2
                           |          |
                           |           --100.00%-- worker_thread
                           |                     |
                           |                      --100.00%-- start_thread
                           |
                            --49.39%-- page_lock_anon_vma
                                      try_to_unmap_anon
                                      try_to_unmap
                                      migrate_pages
                                      migrate_misplaced_page
                                      __do_numa_page.isra.69
                                      handle_pte_fault
                                      handle_mm_fault
                                      __do_page_fault
                                      do_page_fault
                                      page_fault
                                      __memset_sse2
                                      |
                                       --100.00%-- worker_thread
                                                 start_thread
      
      With this change applied the profile is now nicely flat
      and there's no anon-vma related scheduling/blocking.
      
      Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
      to make it clearer that it's an exclusive write-lock in
      that case - suggested by Rik van Riel.
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Turner <pjt@google.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      4fc3f1d6
    • Ingo Molnar's avatar
      mm/rmap: Convert the struct anon_vma::mutex to an rwsem · 5a505085
      Ingo Molnar authored
      Convert the struct anon_vma::mutex to an rwsem, which will help
      in solving a page-migration scalability problem. (Addressed in
      a separate patch.)
      
      The conversion is simple and straightforward: in every case
      where we mutex_lock()ed we'll now down_write().
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Turner <pjt@google.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      5a505085
  18. 16 Nov, 2012 1 commit
    • Michel Lespinasse's avatar
      mm: add anon_vma_lock to validate_mm() · 63c3b902
      Michel Lespinasse authored
      Iterating over the vma->anon_vma_chain without anon_vma_lock may cause
      NULL ptr deref in anon_vma_interval_tree_verify(), because the node in the
      chain might have been removed.
      
        BUG: unable to handle kernel paging request at fffffffffffffff0
        IP: [<ffffffff8122c29c>] anon_vma_interval_tree_verify+0xc/0xa0
        PGD 4e28067 PUD 4e29067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        CPU 0
        Pid: 9050, comm: trinity-child64 Tainted: G        W    3.7.0-rc2-next-20121025-sasha-00001-g673f98e-dirty #77
        RIP: 0010: anon_vma_interval_tree_verify+0xc/0xa0
        Process trinity-child64 (pid: 9050, threadinfo ffff880045f80000, task ffff880048eb0000)
        Call Trace:
          validate_mm+0x58/0x1e0
          vma_adjust+0x635/0x6b0
          __split_vma.isra.22+0x161/0x220
          split_vma+0x24/0x30
          sys_madvise+0x5da/0x7b0
          tracesys+0xe1/0xe6
        RIP  anon_vma_interval_tree_verify+0xc/0xa0
        CR2: fffffffffffffff0
      
      Figured out by Bob Liu.
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63c3b902
  19. 15 Nov, 2012 1 commit
  20. 09 Oct, 2012 1 commit
    • Michel Lespinasse's avatar
      mm: avoid taking rmap locks in move_ptes() · 38a76013
      Michel Lespinasse authored
      During mremap(), the destination VMA is generally placed after the
      original vma in rmap traversal order: in move_vma(), we always have
      new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >=
      vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one.
      
      When the destination VMA is placed after the original in rmap traversal
      order, we can avoid taking the rmap locks in move_ptes().
      
      Essentially, this reintroduces the optimization that had been disabled in
      "mm anon rmap: remove anon_vma_moveto_tail".  The difference is that we
      don't try to impose the rmap traversal order; instead we just rely on
      things being in the desired order in the common case and fall back to
      taking locks in the uncommon case.  Also we skip the i_mmap_mutex in
      addition to the anon_vma lock: in both cases, the vmas are traversed in
      increasing vm_pgoff order with ties resolved in tree insertion order.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      38a76013