1. 15 Apr, 2015 14 commits
    • mm/vmalloc: fix possible exhaustion of vmalloc space caused by vm_map_ram allocator · 68ac546f
      Roman Pen authored
      Recently I came across high fragmentation in the vm_map_ram allocator:
      a vmap_block has free space, but new blocks still continue to appear.
      Further investigation showed that certain mapping/unmapping sequences
      can exhaust vmalloc space.  On small 32-bit systems that's not a big
      problem, because purging will be called soon, on the first allocation
      failure (alloc_vmap_area), but on 64-bit machines, e.g. x86_64 with its
      45 bits of vmalloc space, that can be a disaster.
      
      1) I came up with a simple allocation sequence, which exhausts virtual
         space very quickly:
      
        while (iters--) {
      
                      /* Map/unmap big chunk */
                      vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
                      vm_unmap_ram(vaddr, 16);
      
                      /* Map/unmap small chunks.
                       *
                       * -1 for hole, which should be left at the end of each block
                       * to keep it partially used, with some free space available */
                      for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                              vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                              vm_unmap_ram(vaddr, 8);
                      }
        }
      
      The idea behind it is simple:

       1. We have to map a big chunk, e.g. 16 pages.

       2. Then we have to occupy the remaining space with smaller chunks, i.e.
          8 pages.  At the end a small hole should remain, to keep the block in
          the free list without letting a big chunk occupy the remaining space.

       3. Goto 1 - an allocation request of 16 pages can't be completed (only 8
          slots are left free in the block after step #2), so a new block will
          be allocated and all further requests will land in the newly
          allocated block.

      To have some measurement numbers for all further tests I set up ftrace and
      enabled 4 basic calls in a function profile:
      
              echo vm_map_ram              > /sys/kernel/debug/tracing/set_ftrace_filter;
              echo alloc_vmap_area        >> /sys/kernel/debug/tracing/set_ftrace_filter;
              echo vm_unmap_ram           >> /sys/kernel/debug/tracing/set_ftrace_filter;
              echo free_vmap_block        >> /sys/kernel/debug/tracing/set_ftrace_filter;
      
      So for this scenario I got these results:
      
      BEFORE (all new blocks are put to the head of a free list)
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                          126000    30683.30 us     0.243 us        30819.36 us
        vm_unmap_ram                        126000    22003.24 us     0.174 us        340.886 us
        alloc_vmap_area                       1000    4132.065 us     4.132 us        0.903 us
      
      AFTER (all new blocks are put to the tail of a free list)
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                          126000    28713.13 us     0.227 us        24944.70 us
        vm_unmap_ram                        126000    20403.96 us     0.161 us        1429.872 us
        alloc_vmap_area                        993    3916.795 us     3.944 us        29.370 us
        free_vmap_block                        992    654.157 us      0.659 us        1.273 us
      
      SUMMARY:
      
      The most interesting numbers in those tables are the numbers of block
      allocations and deallocations: the alloc_vmap_area and free_vmap_block
      calls show that before the change blocks were not freed, and virtual
      space and physical memory (vmap_block structure allocations, etc.) were
      consumed.

      The average time spent in vm_map_ram/vm_unmap_ram improved slightly.
      That can be explained by the more reasonable number of blocks in the
      free list which we need to iterate over to find a suitable free block.
      
      2) Another scenario is a random allocation:
      
        while (iters--) {
      
                      /* Randomly take number from a range [1..32/64] */
                      nr = rand(1, VMAP_MAX_ALLOC);
                      vaddr = vm_map_ram(pages, nr, -1, PAGE_KERNEL);
                      vm_unmap_ram(vaddr, nr);
        }
      
      I chose the Mersenne Twister PRNG to generate a persistent random state,
      which guarantees that both runs see the same random sequence.  For each
      vm_map_ram call a random number from [1..32/64] was taken as the number
      of pages to map.

      I did 10,000 vm_map_ram calls and got these two tables:
      
      BEFORE (all new blocks are put to the head of a free list)
      
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                           10000    10170.01 us     1.017 us        993.609 us
        vm_unmap_ram                         10000    5321.823 us     0.532 us        59.789 us
        alloc_vmap_area                        420    2150.239 us     5.119 us        3.307 us
        free_vmap_block                         37    159.587 us      4.313 us        134.344 us
      
      AFTER (all new blocks are put to the tail of a free list)
      
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                           10000    7745.637 us     0.774 us        395.229 us
        vm_unmap_ram                         10000    5460.573 us     0.546 us        67.187 us
        alloc_vmap_area                        414    2201.650 us     5.317 us        5.591 us
        free_vmap_block                        412    574.421 us      1.394 us        15.138 us
      
      SUMMARY:
      
      The 'BEFORE' table shows that 420 blocks were allocated and only 37 were
      freed.  The remaining 383 blocks are still in the free list, consuming
      virtual space and physical memory.

      The 'AFTER' table shows that 414 blocks were allocated and 412 were
      actually freed.  Only 2 blocks remained in the free list.

      So fragmentation was dramatically reduced.  Why?  Because when we put a
      newly allocated block at the head, all further requests will occupy the
      new block, regardless of the space remaining in other blocks.  In this
      scenario all requests arrive randomly.  Eventually the remaining free
      space will be less than the requested size, the free list will be
      iterated and possibly nothing will be found there, so finally a new block
      will be created.  So exhaustion in the random scenario happens for the
      maximum possible allocation size: 32 pages on a 32-bit system and 64
      pages on a 64-bit system.

      The average cost of vm_map_ram was also reduced, from 1.017 us to
      0.774 us.  Again, this can be explained by iterating through a shorter
      list of free blocks.
      
      3) Next simple scenario is a sequential allocation, when the allocation
         order is increased for each block.  This scenario forces allocator to
         reach maximum amount of partially free blocks in a free list:
      
        while (iters--) {

                      /* Populate free list with blocks with remaining space */
                      for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                              nr = VMAP_BBMAP_BITS / (1 << order);

                              /* Leave a hole */
                              nr -= 1;

                              for (i = 0; i < nr; i++) {
                                      vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                                      vm_unmap_ram(vaddr, (1 << order));
                              }
                      }
      
                      /* Completely occupy blocks from a free list */
                      for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                              vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                              vm_unmap_ram(vaddr, (1 << order));
                      }
        }
      
      Results which I got:
      
      BEFORE (all new blocks are put to the head of a free list)
      
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                         2032000    399545.2 us     0.196 us        467123.7 us
        vm_unmap_ram                       2032000    363225.7 us     0.178 us        111405.9 us
        alloc_vmap_area                       7001    30627.76 us     4.374 us        495.755 us
        free_vmap_block                       6993    7011.685 us     1.002 us        159.090 us
      
      AFTER (all new blocks are put to the tail of a free list)
      
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                         2032000    394259.7 us     0.194 us        589395.9 us
        vm_unmap_ram                       2032000    292500.7 us     0.143 us        94181.08 us
        alloc_vmap_area                       7000    31103.11 us     4.443 us        703.225 us
        free_vmap_block                       7000    6750.844 us     0.964 us        119.112 us
      
      SUMMARY:
      
      No surprises here, almost all numbers are the same.
      
      While fixing this fragmentation problem I also made some improvements to
      the allocation logic for a new vmap block: occupy the block immediately
      and get rid of the extra search in the free list.

      I also replaced the dirty bitmap with min/max dirty range values to make
      the logic simpler and slightly faster, since comparing two longs costs
      less than looping through a bitmap.
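
The min/max dirty range idea can be sketched in user space as follows. This is a minimal model, not the kernel code: the struct, the function names and the 256-slot capacity are assumptions chosen for illustration.

```c
#include <assert.h>

#define DIRTY_BLOCK_SLOTS 256UL  /* assumed block capacity in page slots */

struct dirty_range {
	unsigned long min;  /* first dirty slot */
	unsigned long max;  /* one past the last dirty slot */
};

/* An empty range is encoded as min >= max. */
void dirty_range_init(struct dirty_range *r)
{
	r->min = DIRTY_BLOCK_SLOTS;
	r->max = 0;
}

/* Record that slots [offset, offset + (1 << order)) became dirty. */
void dirty_range_add(struct dirty_range *r, unsigned long offset,
		     unsigned int order)
{
	unsigned long end = offset + (1UL << order);

	assert(end <= DIRTY_BLOCK_SLOTS);
	if (offset < r->min)
		r->min = offset;
	if (end > r->max)
		r->max = end;
}

/* Two comparisons replace a scan over a per-slot bitmap. */
int dirty_range_empty(const struct dirty_range *r)
{
	return r->min >= r->max;
}
```

The trade-off is that the range may cover clean slots lying between two dirty regions, so it is an over-approximation of what a bitmap would record.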
      
      This patchset raises several questions:
      
       Q: I think the problem you describe is already known, which is why I
          wrote a comment about it: "it could consume lots of address space
          through fragmentation".  Could you tell me about your situation and
          the reason why it should be avoided?
                                                                           Gioh Kim
      
       A: Indeed, there was commit 36437638, which added an explicit comment
          about fragmentation.  But the fragmentation described in that comment
          is caused by mixing long-lived and short-lived objects, when a whole
          block is pinned in memory because some page slots are still in use.
          Here I am talking about blocks which are free: nobody uses them, yet
          the allocator keeps them alive forever, continuously allocating new
          blocks.
      
       Q: I think that if you put a newly allocated block at the tail of the
          free list, the example below would result in enormous performance
          degradation.

          new block: 1MB (256 pages)

          while (iters--) {
            vm_map_ram(3 or something else that does not divide 256) * 85
            vm_unmap_ram(3) * 85
          }

          On every iteration it needs a newly allocated block, which is put at
          the tail of the free list, so finding it consumes a large amount of
          time.
                                                                          Joonsoo Kim
      
       A: The second patch in the current patchset gets rid of the extra search
          in the free list, so a new block will be occupied immediately.

          Also, the scenario above is impossible, because vm_map_ram allocates
          virtual ranges in orders, i.e. 2^n.  Passing 3 to vm_map_ram allocates
          4 slots in a block, and 256 slots (the capacity of a block) is of
          course divisible by 4, so the block will be completely occupied.

          But there is a worst case which we can reach: each free block has a
          hole equal to its order size.

          The maximum allocation size is 64 pages on a 64-bit system
          (if you try to map more, the original alloc_vmap_area will be called).

          So the maximum order is 6.  That means that in the worst case, before
          the allocator decides to allocate a new block, it has to iterate over
          7 blocks:
      
          HEAD
          1st block - has 1  page slot  free (order 0)
          2nd block - has 2  page slots free (order 1)
          3rd block - has 4  page slots free (order 2)
          4th block - has 8  page slots free (order 3)
          5th block - has 16 page slots free (order 4)
          6th block - has 32 page slots free (order 5)
          7th block - has 64 page slots free (order 6)
          TAIL
      
          So the worst scenario on a 64-bit system is that each CPU queue can
          have 7 blocks in its free list.

          This can happen if and only if you allocate blocks with increasing
          order (as I did in the function written in the comment of the first
          patch).  This is a weird and rare case, but it is still possible.
          Afterwards you will have 7 blocks in the list.

          All further requests will either be placed in a newly allocated block
          or find some free slots in the free list.  That does not look
          dramatically awful.
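
The order arithmetic used in this answer can be checked with a small user-space sketch. The helper names below are hypothetical; the kernel computes the order with its own helpers.

```c
#include <assert.h>

/* Smallest order n such that (1 << n) >= pages; pages must be >= 1. */
unsigned int order_for_pages(unsigned int pages)
{
	unsigned int order = 0;

	assert(pages >= 1);
	while ((1u << order) < pages)
		order++;
	return order;
}

/* Number of block slots actually consumed by a request of `pages` pages. */
unsigned int slots_for_pages(unsigned int pages)
{
	return 1u << order_for_pages(pages);
}

/*
 * Worst-case free list length described above: one partially used block
 * per order, from order 0 up to the order of the largest allocation.
 */
unsigned int worst_case_free_blocks(unsigned int max_alloc_pages)
{
	return order_for_pages(max_alloc_pages) + 1;
}
```

With these helpers a request for 3 pages consumes 4 slots, and a maximum allocation of 64 pages gives a worst case of 7 blocks, matching the list above.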
      
      This patch (of 3):
      
      If a suitable block can't be found, a new block is allocated and put at
      the head of the free list, so on the next iteration this new block will
      be found first.

      That's bad, because old blocks in the free list will not get a chance to
      be fully used, thus fragmentation will grow.
      
      Let's consider this simple example:
      
       #1 We have one block in a free list which is partially used, and where only
          one page is free:
      
          HEAD |xxxxxxxxx-| TAIL
                         ^
                         free space for 1 page, order 0
      
       #2 A new allocation request of order 1 (2 pages) comes in, and a new block
          is allocated since we do not have free space to complete this request.
          The new block is put at the head of the free list:
      
          HEAD |----------|xxxxxxxxx-| TAIL
      
       #3 Two pages are occupied in the newly allocated block:
      
          HEAD |xx--------|xxxxxxxxx-| TAIL
                ^
                two pages mapped here
      
       #4 A new allocation request of order 0 (1 page) comes in.  The block
          created in step #2 is located at the beginning of the free list, so it
          will be found first:
      
        HEAD |xxX-------|xxxxxxxxx-| TAIL
                ^                 ^
                page mapped here, but better to use this hole
      
      Obviously it is better to complete the request of step #4 using the old
      block, where free space is left, because otherwise fragmentation will
      increase considerably.

      But fragmentation is not the only problem.  The worst thing is that I can
      easily create a scenario in which the whole vmalloc space is exhausted by
      blocks which are not used, but are already dirty and have several free
      pages.

      Let's consider this function, whose execution should be pinned to one CPU:
      
      static void exhaust_virtual_space(struct page *pages[16], int iters)
      {
              /* Firstly we have to map a big chunk, e.g. 16 pages.
               * Then we have to occupy the remaining space with smaller
               * chunks, i.e. 8 pages. At the end small hole should remain.
               * So at the end of our allocation sequence block looks like
               * this:
               *                XX  big chunk
               * |XXxxxxxxx-|    x  small chunk
               *                 -  hole, which is enough for a small chunk,
               *                    but is not enough for a big chunk
               */
              while (iters--) {
                      int i;
                      void *vaddr;
      
                      /* Map/unmap big chunk */
                      vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
                      vm_unmap_ram(vaddr, 16);
      
                      /* Map/unmap small chunks.
                       *
                       * -1 for hole, which should be left at the end of each block
                       * to keep it partially used, with some free space available */
                      for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                              vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                              vm_unmap_ram(vaddr, 8);
                      }
              }
      }
      
      On every iteration a new block (1MB of vm area in my case) will be
      allocated and then occupied, without any attempt to resolve the small
      allocation requests using previously allocated blocks in the free list.
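
The effect of head versus tail insertion on this sequence can be reproduced with a simplified user-space model. It is a sketch under stated assumptions, not the kernel allocator: a block is reduced to a counter of free slots, the capacity is assumed to be 256 slots, every mapping is unmapped immediately, and a block is released as soon as all of its slots have been used. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SIM_BLOCK_SLOTS 256  /* assumed block capacity in page slots */

struct sim_stats {
	int allocated;  /* blocks created */
	int freed;      /* blocks fully used up and released */
};

struct sim_list {
	int *free;  /* free[i]: free slots left in the i-th block, head first */
	int len;
	int cap;
};

static void sim_insert(struct sim_list *l, int pos, int value)
{
	assert(l->len < l->cap);
	memmove(&l->free[pos + 1], &l->free[pos],
		(size_t)(l->len - pos) * sizeof(int));
	l->free[pos] = value;
	l->len++;
}

static void sim_remove(struct sim_list *l, int pos)
{
	memmove(&l->free[pos], &l->free[pos + 1],
		(size_t)(l->len - pos - 1) * sizeof(int));
	l->len--;
}

/* Map and immediately unmap `size` slots, first fit from the list head. */
static void sim_map_unmap(struct sim_list *l, int size, bool add_to_tail,
			  struct sim_stats *st)
{
	int i;

	for (i = 0; i < l->len; i++)
		if (l->free[i] >= size)
			break;

	if (i == l->len) {  /* nothing fits: create a new block */
		i = add_to_tail ? l->len : 0;
		sim_insert(l, i, SIM_BLOCK_SLOTS);
		st->allocated++;
	}

	l->free[i] -= size;
	if (l->free[i] == 0) {  /* every slot used and unmapped again */
		sim_remove(l, i);
		st->freed++;
	}
}

/* Run the big-chunk/small-chunk sequence from the function above. */
struct sim_stats simulate_exhaustion(int iters, bool add_to_tail)
{
	struct sim_stats st = { 0, 0 };
	struct sim_list l = { calloc((size_t)iters + 1, sizeof(int)), 0, iters + 1 };
	int i;

	assert(l.free);
	while (iters--) {
		sim_map_unmap(&l, 16, add_to_tail, &st);
		for (i = 0; i < (SIM_BLOCK_SLOTS - 16) / 8 - 1; i++)
			sim_map_unmap(&l, 8, add_to_tail, &st);
	}
	free(l.free);
	return st;
}
```

In this model, head insertion strands one block with an 8-slot hole per iteration and never frees any, while tail insertion lets the small chunks fill the old hole, so almost every block is eventually released. The absolute counts are not expected to match the ftrace tables, since the model ignores details of the real allocator.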
      
      In the case of random allocation (the size is randomly taken from the
      range [1..64] in the 64-bit case or [1..32] in the 32-bit case) the
      situation is the same: new blocks continue to appear whenever the maximum
      possible allocation size (32 or 64) is passed to the allocator, because
      none of the remaining blocks in the free list has enough free space to
      complete this allocation request.

      In summary, if new blocks are put at the head of the free list, virtual
      space will eventually be exhausted.

      In the current patch I simply put a newly allocated block at the tail of
      the free list, thus reducing fragmentation and giving a chance to resolve
      allocation requests using older blocks with possible holes left.
      Signed-off-by: Roman Pen <r.peniaev@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: Rob Jones <rob.jones@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: accept subpool min_size mount option and setup accordingly · 7ca02d0a
      Mike Kravetz authored
      Make 'min_size=<value>' be an option when mounting a hugetlbfs.  This
      option takes the same value as the 'size' option.  min_size can be
      specified without specifying size.  If both are specified, min_size must
      be less than or equal to size else the mount will fail.  If min_size is
      specified, then at mount time an attempt is made to reserve min_size
      pages.  If the reservation fails, the mount fails.  At umount time, the
      reserved pages are released.
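
The rule relating the two options can be written down as a small check. This is a hedged sketch: the function name and the NO_SIZE marker are hypothetical, and both values are assumed to be already converted to a number of huge pages.

```c
#include <errno.h>

#define NO_SIZE (-1L)  /* hypothetical marker: the option was not given */

/*
 * Validate a (size, min_size) pair expressed in huge pages.
 * Returns 0 if the combination is acceptable, -EINVAL otherwise.
 */
int check_subpool_sizes(long max_hpages, long min_hpages)
{
	/* min_size alone is fine; only a min_size above size is rejected. */
	if (max_hpages != NO_SIZE && min_hpages != NO_SIZE &&
	    min_hpages > max_hpages)
		return -EINVAL;
	return 0;
}
```

A mount that passes this check can still fail later if the min_size reservation itself cannot be satisfied.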
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: add minimum size accounting to subpools · 1c5ecae3
      Mike Kravetz authored
      The same routines that perform subpool maximum size accounting,
      hugepage_subpool_get/put_pages(), are modified to also perform minimum
      size accounting.  When a delta value is passed to these routines, they
      calculate how global reservations must be adjusted to maintain the subpool
      minimum size.  The routines now return this global reserve count
      adjustment, which is then passed to the global accounting routine
      hugetlb_acct_memory().
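
One way to picture the combined accounting is the user-space model below. It is an illustration only: the struct and function names are hypothetical stand-ins, locking is omitted, and the exact kernel logic may differ.

```c
#include <errno.h>

#define NO_LIMIT (-1L)  /* hypothetical marker: limit not configured */

struct subpool {
	long max_hpages;   /* maximum size, or NO_LIMIT */
	long used_hpages;  /* pages charged against the maximum */
	long min_hpages;   /* minimum size, or NO_LIMIT */
	long rsv_hpages;   /* pages currently reserved to meet the minimum */
};

/*
 * Take `delta` pages from the subpool.  Returns the number of pages that
 * still have to be reserved globally, or -ENOMEM if the maximum is exceeded.
 */
long subpool_get_pages(struct subpool *sp, long delta)
{
	long ret = delta;

	if (sp->max_hpages != NO_LIMIT) {
		if (sp->used_hpages + delta > sp->max_hpages)
			return -ENOMEM;
		sp->used_hpages += delta;
	}
	if (sp->min_hpages != NO_LIMIT) {
		if (delta > sp->rsv_hpages) {
			/* Only the part not covered by reserves is new. */
			ret = delta - sp->rsv_hpages;
			sp->rsv_hpages = 0;
		} else {
			ret = 0;
			sp->rsv_hpages -= delta;
		}
	}
	return ret;
}

/*
 * Give `delta` pages back.  Returns the number of pages whose global
 * reservation can be dropped; the rest refills the subpool minimum.
 */
long subpool_put_pages(struct subpool *sp, long delta)
{
	long ret = delta;

	if (sp->max_hpages != NO_LIMIT)
		sp->used_hpages -= delta;
	if (sp->min_hpages != NO_LIMIT) {
		long total = sp->rsv_hpages + delta;

		ret = total > sp->min_hpages ? total - sp->min_hpages : 0;
		sp->rsv_hpages = total > sp->min_hpages ? sp->min_hpages : total;
	}
	return ret;
}
```

In both directions the return value plays the role of the adjustment that the commit message says is handed on to hugetlb_acct_memory().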
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: add minimum size tracking fields to subpool structure · c6a91820
      Mike Kravetz authored
      hugetlbfs allocates huge pages from the global pool as needed.  Even if
      the global pool contains a sufficient number of pages for the filesystem
      size at mount time, those global pages could be grabbed for some other
      use.  As a result, filesystem huge page allocations may fail due to lack
      of pages.
      
      Applications such as a database want to use huge pages for performance
      reasons.  hugetlbfs filesystem semantics with ownership and modes work
      well to manage access to a pool of huge pages.  However, the application
      would like some reasonable assurance that allocations will not fail due to
      a lack of huge pages.  At application startup time, the application would
      like to configure itself to use a specific number of huge pages.  Before
      starting, the application can check to make sure that enough huge pages
      exist in the system global pools.  However, there are no guarantees that
      those pages will be available when needed by the application.  What the
      application wants is exclusive use of a subset of huge pages.
      
      Add a new hugetlbfs mount option 'min_size=<value>' to indicate that the
      specified number of pages will be available for use by the filesystem.  At
      mount time, this number of huge pages will be reserved for exclusive use
      of the filesystem.  If there is not a sufficient number of free pages, the
      mount will fail.  As pages are allocated to and freed from the
      filesystem, the number of reserved pages is adjusted so that the specified
      minimum is maintained.
      
      This patch (of 4):
      
      Add a field to the subpool structure to indicate the minimum number of
      huge pages to always be used by this subpool.  This minimum count includes
      allocated pages as well as reserved pages.  If the minimum number of pages
      for the subpool has not been allocated, pages are reserved up to this
      minimum.  An additional field (rsv_hpages) is used to track the number of
      pages reserved to meet this minimum size.  The hstate pointer in the
      subpool is convenient to have when reserving and unreserving the pages.
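
As a rough picture of the fields and of the mount-time reservation described earlier in this message, consider the following user-space sketch. Only rsv_hpages and the hstate pointer are named in the text; the other field names, the struct name, the init function and the single global free-page counter are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

#define NO_SIZE (-1L)  /* hypothetical marker: option not given */

struct hstate;  /* opaque stand-in for the per-size huge page state */

struct subpool_cfg {
	long max_hpages;        /* maximum size, or NO_SIZE */
	long used_hpages;       /* pages currently in use */
	struct hstate *hstate;  /* kept for reserving and unreserving */
	long min_hpages;        /* minimum size, or NO_SIZE */
	long rsv_hpages;        /* pages reserved to meet the minimum */
};

/*
 * Set up a subpool and reserve its minimum from a global free counter.
 * Returns 0 on success, -ENOMEM if the minimum cannot be reserved, in
 * which case the global counter is left untouched.
 */
int subpool_cfg_init(struct subpool_cfg *sp, struct hstate *h,
		     long max_hpages, long min_hpages, long *global_free)
{
	sp->max_hpages = max_hpages;
	sp->used_hpages = 0;
	sp->hstate = h;
	sp->min_hpages = min_hpages;
	sp->rsv_hpages = 0;

	if (min_hpages != NO_SIZE) {
		if (*global_free < min_hpages)
			return -ENOMEM;  /* the mount would fail here */
		*global_free -= min_hpages;
		sp->rsv_hpages = min_hpages;
	}
	return 0;
}
```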
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction: reset compaction scanner positions · 195b0c60
      Gioh Kim authored
      When compaction is activated via /proc/sys/vm/compact_memory it had
      better scan the whole zone.  Some platforms, for instance ARM, have the
      start_pfn of a zone at zero, and therefore the first attempt to compact
      via /proc doesn't work.  The compaction scanner positions need to be
      reset first.
      Signed-off-by: Gioh Kim <gioh.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: sync allocation and memcg charge gfp flags for THP · 3b363692
      Michal Hocko authored
      memcg currently uses hardcoded GFP_TRANSHUGE gfp flags for all THP
      charges.  THP allocations, however, might be using different flags
      depending on /sys/kernel/mm/transparent_hugepage/{,khugepaged/}defrag and
      the current allocation context.
      
      The primary difference is that defrag configured to the "madvise" value
      will clear the __GFP_WAIT flag from the core gfp mask to make the
      allocation lighter for all mappings which are not backed by VM_HUGEPAGE
      vmas.  If the memcg charge path ignores this fact we will get a light
      allocation, but a potential memcg reclaim would defeat the whole point
      of the configuration.

      Fix the mismatch by providing the same gfp mask used for the allocation to
      the charge functions.  This is quite easy for all paths except for the
      khugepaged kernel thread with !CONFIG_NUMA, which does a pre-allocation
      long before the allocated page is used in collapse_huge_page via
      khugepaged_alloc_page.  To avoid cluttering the whole code path from
      khugepaged_do_scan we simply return the current flags as per the
      khugepaged_defrag() value, which might have changed since the
      preallocation.  If somebody changed the value of the knob we would charge
      differently, but this shouldn't happen often and it is definitely not
      critical because it would only lead to a reduced success rate of one-off
      THP promotion.
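
The idea of computing the mask once and using it for both the allocation and the charge can be sketched like this. The bit values are placeholders and every name is hypothetical; only the shape of the logic (drop the wait bit when defrag is off, then reuse the same mask) follows the description above.

```c
/* Placeholder flag bits; the values do not match the kernel's. */
typedef unsigned int sketch_gfp_t;

#define SKETCH_GFP_WAIT      0x10u
#define SKETCH_GFP_TRANSHUGE (0x01u | 0x02u | SKETCH_GFP_WAIT)

/* Build the THP mask: without defrag the allocation must not wait. */
sketch_gfp_t thp_gfpmask(int defrag, sketch_gfp_t extra)
{
	return (SKETCH_GFP_TRANSHUGE & ~(defrag ? 0u : SKETCH_GFP_WAIT)) | extra;
}

struct thp_request {
	sketch_gfp_t alloc_gfp;   /* mask used for the page allocation */
	sketch_gfp_t charge_gfp;  /* mask used for the memcg charge */
};

/* Compute the mask once so that allocation and charge cannot disagree. */
struct thp_request thp_prepare(int defrag)
{
	sketch_gfp_t gfp = thp_gfpmask(defrag, 0);
	struct thp_request req = { gfp, gfp };

	return req;
}
```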
      
      [akpm@linux-foundation.org: fix weird code layout while we're there]
      [rientjes@google.com: clean up around alloc_hugepage_gfpmask()]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename deactivate_page to deactivate_file_page · cc5993bd
      Minchan Kim authored
      "deactivate_page" was created for file invalidation so it has too
      specific logic for file-backed pages.  So, let's change the name of the
      function and date to a file-specific one and yield the generic name.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Wang, Yalin <Yalin.Wang@sonymobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: allow compaction of unevictable pages · 5bbe3547
      Eric B Munson authored
      Currently, pages which are marked as unevictable are protected from
      compaction, but not from other types of migration.  The POSIX real time
      extension explicitly states that mlock() will prevent a major page
      fault, but the spirit of this is that mlock() should give a process the
      ability to control sources of latency, including minor page faults.
      However, the mlock manpage only explicitly says that a locked page will
      not be written to swap and this can cause some confusion.  The
      compaction code today does not give a developer who wants to avoid swap
      but wants to have large contiguous areas available any method to achieve
      this state.  This patch introduces a sysctl for controlling compaction
      behavior with respect to the unevictable lru.  Users who demand no page
      faults after a page is present can set compact_unevictable_allowed to 0
      and users who need the large contiguous areas can enable compaction on
      locked memory by leaving the default value of 1.
      
      To illustrate this problem I wrote a quick test program that mmaps a
      large number of 1MB files filled with random data.  These maps are
      created locked and read only.  Then every other mmap is unmapped and I
      attempt to allocate huge pages to the static huge page pool.  When the
      compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
      after fragmenting memory.  When the value is set to 1, allocations
      succeed.
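
The policy controlled by the sysctl boils down to a single decision, sketched here in isolation. The function name is hypothetical and every other condition that compaction checks before isolating a page is deliberately left out.

```c
#include <stdbool.h>

/*
 * Decide whether compaction may isolate a page, looking only at the
 * unevictable state and the compact_unevictable_allowed setting.
 */
bool compaction_may_isolate(bool page_unevictable,
			    int compact_unevictable_allowed)
{
	if (!page_unevictable)
		return true;
	/* 1 (the default) allows migrating locked pages, 0 forbids it. */
	return compact_unevictable_allowed != 0;
}
```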
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback: check-before-clear PageReclaim · a4bb3ecd
      Naoya Horiguchi authored
      With the page flag sanitization patchset, an invalid usage of
      ClearPageReclaim() is detected in set_page_dirty().  This can be called
      from __unmap_hugepage_range(), so let's check PageReclaim() before trying
      to clear it to avoid the misuse.
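
The check-before-clear pattern can be illustrated with a toy model. Everything here is an assumption made for the sketch: the struct, the flag bit, and the idea that the sanitized clear operation asserts on pages where the flag operation is not valid, and that such pages never have the flag set.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PG_RECLAIM (1u << 0)

struct toy_page {
	unsigned int flags;
	bool flag_ops_allowed;  /* false: the sanitized accessor rejects it */
};

bool toy_page_reclaim(const struct toy_page *p)
{
	return (p->flags & TOY_PG_RECLAIM) != 0;
}

/* Stand-in for a sanitized clear: traps when used on the wrong page. */
void toy_clear_page_reclaim(struct toy_page *p)
{
	assert(p->flag_ops_allowed);
	p->flags &= ~TOY_PG_RECLAIM;
}

/* Test first, so the clear is only reached where the flag can be set. */
void toy_clear_reclaim_if_set(struct toy_page *p)
{
	if (toy_page_reclaim(p))
		toy_clear_page_reclaim(p);
}
```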
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: check-before-clear PageSwapCache · b3b3a99c
      Naoya Horiguchi authored
      With the page flag sanitization patchset, an invalid usage of
      ClearPageSwapCache() is detected in migrate_page_copy().
      migrate_page_copy() is shared by both the normal and hugepage (both thp
      and hugetlb) code paths, so let's check PageSwapCache() and clear it only
      if it's set, to avoid the invalid clear operation.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: define page types for action_result() in one place · 64d37a2b
      Naoya Horiguchi authored
      This cleanup patch moves all strings passed to action_result() into a
      single array, action_page_type, so that a reader can easily find which
      kinds of action results are possible.  This patch also fixes odd lines
      that were printed out, such as "unknown page state page" or "free
      buddy, 2nd try page".
      
      [akpm@linux-foundation.org: rename messages, per David]
      [akpm@linux-foundation.org: s/DIRTY_UNEVICTABLE_LRU/CLEAN_UNEVICTABLE_LRU/, per Andi]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Xie XiuQi" <xiexiuqi@huawei.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64d37a2b
    • Vladimir Davydov's avatar
      memcg: remove obsolete comment · 2564f683
      Vladimir Davydov authored
      Low and high watermarks, as defined in the TODO in the mem_cgroup
      struct, have already been implemented by Johannes, so remove the stale
      comment.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2564f683
    • Vladimir Davydov's avatar
      memcg: zap mem_cgroup_lookup() · adbe427b
      Vladimir Davydov authored
      mem_cgroup_lookup() is a wrapper around mem_cgroup_from_id(), which
      checks that id != 0 before issuing the function call.  Today, there is
      no point in this additional check apart from optimization, because there
      is no css with id <= 0, so that css_from_id, called by
      mem_cgroup_from_id, will return NULL for any id <= 0.
      
      Since mem_cgroup_from_id is only called from mem_cgroup_lookup, let us
      zap mem_cgroup_lookup, substituting calls to it with mem_cgroup_from_id
      and moving the check if id > 0 to css_from_id.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      adbe427b
    • Yaowei Bai's avatar
      mm/oom_kill.c: fix typo in comment · bdddbcd4
      Yaowei Bai authored
      Alter 'taks' -> 'task'
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bdddbcd4
  2. 14 Apr, 2015 26 commits
    • Vladimir Murzin's avatar
      memtest: use phys_addr_t for physical addresses · 7f70baee
      Vladimir Murzin authored
      Since memtest might be used by other architectures, pass input parameters
      as phys_addr_t instead of long to prevent overflow.
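
      The overflow risk can be illustrated with a small user-space sketch.
      It assumes a platform where physical addresses are 64 bits wide while
      long is 32 bits; sk_phys_addr_t and sk_ulong32_t are hypothetical
      stand-ins for those two types, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sk_phys_addr_t;	/* stand-in for a 64-bit phys_addr_t */
typedef uint32_t sk_ulong32_t;		/* stand-in for a 32-bit long */

/* Size of [start, end) computed in the narrow type: high bits are lost. */
static sk_ulong32_t sk_size_narrow(sk_ulong32_t start, sk_ulong32_t end)
{
	return end - start;
}

/* Size of [start, end) computed in the wide type. */
static sk_phys_addr_t sk_size_wide(sk_phys_addr_t start, sk_phys_addr_t end)
{
	assert(end >= start);
	return end - start;
}
```

      For a range from 2 GiB to 6 GiB, the wide computation yields 4 GiB,
      whereas truncating both bounds to 32 bits first makes them equal and
      the narrow computation yields 0.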
      Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f70baee
    • Vladimir Murzin's avatar
      mm: move memtest under mm · 4a20799d
      Vladimir Murzin authored
      Memtest is a simple feature which fills the memory with a given set of
      patterns and validates memory contents.  If bad memory regions are
      detected, it reserves them via the memblock API.  Since the memblock API
      is widely used by other architectures, this feature can be enabled
      outside of the x86 world.
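
      The fill-and-verify idea can be sketched in user-space C.  This is a
      minimal illustration, not the kernel's memtest code: the sk_* helper
      names are hypothetical, and the sketch only reports the first bad word
      instead of reserving the region via memblock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the pattern into every word of buf[0..n). */
static void sk_fill_pattern(uint64_t *buf, size_t n, uint64_t pattern)
{
	assert(buf != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		buf[i] = pattern;
}

/* Return the index of the first word that differs from pattern, or n. */
static size_t sk_first_bad_word(const uint64_t *buf, size_t n, uint64_t pattern)
{
	for (size_t i = 0; i < n; i++)
		if (buf[i] != pattern)
			return i;
	return n;
}
```

      A return value of n means every word verified; any smaller value is
      the offset at which a caller would start marking memory as bad.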
      
      This patch set promotes memtest to live under the generic mm umbrella and
      enables the memtest feature for arm/arm64.
      
      It was reported that this patch set was useful for tracking down an issue
      with some errant DMA on an arm64 platform.
      
      This patch (of 6):
      
      There is nothing platform dependent in the core memtest code, so other
      platforms might benefit from this feature too.
      
      [linux@roeck-us.net: MEMTEST depends on MEMBLOCK]
      Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a20799d
    • David Rientjes's avatar
      mm, hugetlb: abort __get_user_pages if current has been oom killed · 02057967
      David Rientjes authored
      If __get_user_pages() is faulting a significant number of hugetlb pages,
      usually as the result of mmap(MAP_LOCKED), it can potentially allocate a
      very large amount of memory.
      
      If the process has been oom killed, allocating this much memory can
      potentially deplete memory reserves.
      
      In the same way that commit 4779280d ("mm: make get_user_pages()
      interruptible") aborted for pending SIGKILLs when faulting non-hugetlb
      memory, based on the premise of commit 462e00cc ("oom: stop
      allocating user memory if TIF_MEMDIE is set"), hugetlb page faults now
      terminate when the process has been oom killed.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Greg Thelen <gthelen@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02057967
    • David Rientjes's avatar
      mm, mempool: do not allow atomic resizing · 11d83360
      David Rientjes authored
      Allocating a large number of elements in atomic context could quickly
      deplete memory reserves, so just disallow atomic resizing entirely.
      
      Nothing currently uses mempool_resize() with anything other than
      GFP_KERNEL, so convert existing callers to drop the gfp_mask.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Steffen Maier <maier@linux.vnet.ibm.com>	[zfcp]
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Steve French <sfrench@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11d83360
    • Balasubramani Vivekanandan's avatar
      memcg: print cgroup information when system panics due to panic_on_oom · 2415b9f5
      Balasubramani Vivekanandan authored
      If the kernel panics due to an oom caused by a cgroup reaching its limit
      while 'compulsory panic_on_oom' is enabled, we only see that the OOM
      happened because "compulsory panic_on_oom is enabled", which doesn't
      tell the difference between mempolicy and memcg.  Dumping system-wide
      information is plain wrong and more confusing.  This patch provides the
      information of the cgroup whose limit triggered the panic.
      Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2415b9f5
    • Mel Gorman's avatar
      mm: numa: remove migrate_ratelimited · 2a8e7002
      Mel Gorman authored
      This code is dead since commit 9e645ab6 ("sched/numa: Continue PTE
      scanning even if migrate rate limited") so remove it.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a8e7002
    • Chen Gang's avatar
      mm: memcontrol: let mem_cgroup_move_account() have effect only if MMU enabled · b1b0deab
      Chen Gang authored
      When !MMU, the build reports a warning.  The related warning with
      allmodconfig under c6x:
      
          CC      mm/memcontrol.o
        mm/memcontrol.c:2802:12: warning: 'mem_cgroup_move_account' defined but not used [-Wunused-function]
         static int mem_cgroup_move_account(struct page *page,
                    ^
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1b0deab
    • Toshi Kani's avatar
      mm: change vunmap to tear down huge KVA mappings · b9820d8f
      Toshi Kani authored
      Change vunmap_pmd_range() and vunmap_pud_range() to tear down huge KVA
      mappings when they are set.  pud_clear_huge() and pmd_clear_huge() return
      zero when no operation is performed, i.e. a huge page mapping was not used.
      
      These changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined
      on the architecture.
      
      [akpm@linux-foundation.org: use consistent code layout]
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Robert Elliott <Elliott@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9820d8f
    • Toshi Kani's avatar
      mm: change __get_vm_area_node() to use fls_long() · 0f616be1
      Toshi Kani authored
      ioremap() and its related interfaces are used to create I/O mappings to
      memory-mapped I/O devices.  The mapping sizes of the traditional I/O
      devices are relatively small.  Non-volatile memory (NVM), however, has
      many GB and is going to have TB soon.  It is not very efficient to create
      large I/O mappings with 4KB pages.
      
      This patchset extends the ioremap() interfaces to transparently create I/O
      mappings with huge pages whenever possible.  ioremap() continues to use
      4KB mappings when a huge page does not fit into a requested range.  There
      is no change necessary to the drivers using ioremap().  A requested
      physical address must be aligned by a huge page size (1GB or 2MB on x86)
      for using huge page mapping, though.  The kernel huge I/O mapping will
      improve performance of NVM and other devices with large memory, and reduce
      the time to create their mappings as well.
      
      On x86, MTRRs can override PAT memory types with a 4KB granularity.  When
      using a huge page, MTRRs can override the memory type of the huge page,
      which may lead to a performance penalty.  The processor can also behave in an
      undefined manner if a huge page is mapped to a memory range that MTRRs
      have mapped with multiple different memory types.  Therefore, the mapping
      code falls back to use a smaller page size toward 4KB when a mapping range
      is covered by non-WB type of MTRRs.  The WB type of MTRRs has no effect on
      the PAT memory types.
      
      The patchset introduces HAVE_ARCH_HUGE_VMAP, which indicates that the arch
      supports huge KVA mappings for ioremap().  User may specify a new kernel
      option "nohugeiomap" to disable the huge I/O mapping capability of
      ioremap() when necessary.
      
      Patch 1-4 change common files to support huge I/O mappings.  There is no
      change in functionality unless HAVE_ARCH_HUGE_VMAP is defined on the
      architecture of the system.
      
      Patch 5-6 implement the HAVE_ARCH_HUGE_VMAP funcs on x86, and set
      HAVE_ARCH_HUGE_VMAP on x86.
      
      This patch (of 6):
      
      __get_vm_area_node() takes unsigned long size, which is a 64-bit value on
      a 64-bit kernel.  However, fls(size) simply ignores the upper 32 bits.
      Change to use fls_long() to handle the size properly.
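
      The difference can be reproduced outside the kernel with two small
      "find last set" helpers.  This is a sketch under stated assumptions:
      sk_fls32() and sk_fls64() are hypothetical user-space stand-ins that
      follow the usual convention (1-based bit index, 0 for no bits set);
      they are not the kernel's fls() or fls_long() implementations.

```c
#include <assert.h>	/* assert() is used by the checks described below */
#include <stdint.h>

/* Find last set bit of a 32-bit value: 1-based index, 0 if x == 0. */
static int sk_fls32(uint32_t x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Same operation on a 64-bit value, as needed for an unsigned long size. */
static int sk_fls64(uint64_t x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}
```

      For a 4 GiB size (1ULL << 32), truncating to 32 bits before the call
      leaves nothing for sk_fls32() to find, while sk_fls64() reports bit 33.
      For sizes below 4 GiB, such as 4096, the two helpers agree.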
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Robert Elliott <Elliott@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f616be1
    • Yaowei Bai's avatar
      42ff2703
    • Sasha Levin's avatar
      mm: cma: constify and use correct signness in mm/cma.c · ac173824
      Sasha Levin authored
      Constify function parameters and use correct signness where needed.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
      Acked-by: Gregory Fong <gregory.0xf0@gmail.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac173824
    • David Rientjes's avatar
      mm, thp: really limit transparent hugepage allocation to local node · 5265047a
      David Rientjes authored
      Commit 077fcf11 ("mm/thp: allocate transparent hugepages on local
      node") restructured alloc_hugepage_vma() with the intent of only
      allocating transparent hugepages locally when there was not an effective
      interleave mempolicy.
      
      alloc_pages_exact_node() does not limit the allocation to the single node,
      however, but rather prefers it.  This is because __GFP_THISNODE is not set
      which would cause the node-local nodemask to be passed.  Without it, only
      a nodemask that prefers the local node is passed.
      
      Fix this by passing __GFP_THISNODE and falling back to small pages when
      the allocation fails.
      
      Commit 9f1b868a ("mm: thp: khugepaged: add policy for finding target
      node") suffers from a similar problem for khugepaged, which is also fixed.
      
      Fixes: 077fcf11 ("mm/thp: allocate transparent hugepages on local node")
      Fixes: 9f1b868a ("mm: thp: khugepaged: add policy for finding target node")
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Jarno Rajahalme <jrajahalme@nicira.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5265047a
    • David Rientjes's avatar
      mm: remove GFP_THISNODE · 4167e9b2
      David Rientjes authored
      NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.
      
      GFP_THISNODE is a secret combination of gfp bits that have different
      behavior than expected.  It is a combination of __GFP_THISNODE,
      __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page
      allocator slowpath to fail without trying reclaim even though it may be
      used in combination with __GFP_WAIT.
      
      An example of the problem this creates: commit e97ca8e5 ("mm: fix
      GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
      that really just wanted __GFP_THISNODE.  The problem doesn't end there,
      however, because even that was a no-op for alloc_misplaced_dst_page(),
      which also sets __GFP_NORETRY and __GFP_NOWARN, and
      migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWARN
      are set in GFP_TRANSHUGE.  Converting GFP_THISNODE to __GFP_THISNODE is a
      no-op in these cases since the page allocator special-cases
      __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.
      
      It's time to just remove GFP_THISNODE entirely.  We leave __GFP_THISNODE
      to restrict an allocation to a local node, but remove GFP_THISNODE and
      its obscurity.  Instead, we require that a caller clear __GFP_WAIT if it
      wants to avoid reclaim.
      
      This allows the aforementioned functions to actually reclaim as they
      should.  It also enables any future callers that want to do
      __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim.  The
      rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.
      
      Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
      it is unchanged.
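
      The rule can be modelled with a few bit masks.  This is a sketch, not
      the page allocator's code: the SK_* values are hypothetical (the real
      bits are defined in include/linux/gfp.h), and the two predicates only
      model the behaviour described in this message.

```c
#include <assert.h>	/* assert() is used by the checks described below */
#include <stdbool.h>

/* Hypothetical bit values standing in for the __GFP_* flags. */
#define SK_WAIT     0x01u
#define SK_NORETRY  0x02u
#define SK_NOWARN   0x04u
#define SK_THISNODE 0x08u

/* The removed combination: THISNODE | NORETRY | NOWARN. */
#define SK_OLD_GFP_THISNODE (SK_THISNODE | SK_NORETRY | SK_NOWARN)

/* Old model: the three-bit combination skips reclaim even with WAIT set. */
static bool sk_old_may_reclaim(unsigned int gfp)
{
	if ((gfp & SK_OLD_GFP_THISNODE) == SK_OLD_GFP_THISNODE)
		return false;
	return (gfp & SK_WAIT) != 0;
}

/* New model: reclaim is governed by the WAIT bit alone. */
static bool sk_new_may_reclaim(unsigned int gfp)
{
	return (gfp & SK_WAIT) != 0;
}
```

      A mask with all four bits set may not reclaim under the old model but
      may under the new one; dropping SK_WAIT disables reclaim in both.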
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Jarno Rajahalme <jrajahalme@nicira.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4167e9b2
    • David Rientjes's avatar
      mm, mempolicy: migrate_to_node should only migrate to node · b360edb4
      David Rientjes authored
      migrate_to_node() is intended to migrate a page from one source node to
      a target node.
      
      Today, migrate_to_node() could end up migrating to any node, not only
      the target node.  This is because the page migration allocator,
      new_node_page(), does not pass __GFP_THISNODE to
      alloc_pages_exact_node().  This causes the target node to be preferred
      but allows fallback to any other node in order of affinity.
      
      Prevent this by allocating with __GFP_THISNODE.  If memory is not
      available, -ENOMEM will be returned as appropriate.
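
      The difference between "prefer the target node" and "only the target
      node" can be sketched with a toy allocator.  Everything here is an
      assumption made for illustration: SK_NR_NODES, the per-node free
      counters and sk_alloc_on_node() are hypothetical, and the fallback
      walks nodes in index order rather than in order of affinity.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_NR_NODES 4

/*
 * Take one page from the target node.  Without thisnode, fall back to the
 * first node that still has free pages.  Returns the node used, or -1 when
 * no page could be allocated (where the kernel would return -ENOMEM).
 */
static int sk_alloc_on_node(int free_pages[SK_NR_NODES], int target, bool thisnode)
{
	assert(target >= 0 && target < SK_NR_NODES);

	if (free_pages[target] > 0) {
		free_pages[target]--;
		return target;
	}
	if (thisnode)
		return -1;	/* strict: no fallback to other nodes */

	for (int n = 0; n < SK_NR_NODES; n++) {
		if (free_pages[n] > 0) {
			free_pages[n]--;
			return n;
		}
	}
	return -1;
}
```

      With an exhausted target node, the non-strict call silently lands on
      another node, which is the behaviour this patch prevents for
      migrate_to_node().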
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b360edb4
    • Vladimir Davydov's avatar
      cleancache: remove limit on the number of cleancache enabled filesystems · 3cb29d11
      Vladimir Davydov authored
      The limit equals 32 and is imposed by the number of entries in the
      fs_poolid_map and shared_fs_poolid_map.  Nowadays it is insufficient,
      because with containers on board a Linux host can have hundreds of
      active fs mounts.
      
      These maps were introduced by commit 49a9ab81 ("mm: cleancache:
      lazy initialization to allow tmem backends to build/run as modules") in
      order to allow compiling cleancache drivers as modules.  Real pool ids
      are stored in these maps while super_block->cleancache_poolid points to
      an entry in the map, so that on cleancache registration we can walk over
      all (if there are <= 32 of them, of course) cleancache-enabled super
      blocks and assign real pool ids.
      
      Actually, there is absolutely no need for these maps, because we can
      iterate over all super blocks immediately using iterate_supers.  This is
      not racy, because cleancache_init_ops is called from mount_fs with
      super_block->s_umount held for writing, while iterate_supers takes this
      semaphore for reading, so if we call iterate_supers after setting
      cleancache_ops, all super blocks that had been created before
      cleancache_register_ops was called will be assigned pool ids by the
      action function of iterate_supers while all newer super blocks will
      receive it in cleancache_init_fs.
      
      This patch therefore removes the maps and hence the artificial limit on
      the number of cleancache enabled filesystems.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Stefan Hengelein <ilendir@googlemail.com>
      Cc: Florian Schmaus <fschmaus@gmail.com>
      Cc: Andor Daam <andor.daam@googlemail.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3cb29d11
    • Vladimir Davydov's avatar
      cleancache: forbid overriding cleancache_ops · 53d85c98
      Vladimir Davydov authored
      Currently, cleancache_register_ops returns the previous value of
      cleancache_ops to allow chaining.  However, chaining, as it is
      implemented now, is extremely dangerous due to possible pool id
      collisions.  Suppose a new cleancache driver is registered after the
      previous one assigned an id to a super block.  If the new driver assigns
      the same id to another super block, which is perfectly possible, we will
      have two different filesystems using the same id.  No matter if the new
      driver implements chaining or not, we are likely to get data corruption
      with such a configuration eventually.
      
      This patch therefore disables the ability to override cleancache_ops
      altogether as potentially dangerous.  If a cleancache driver is already
      registered, all further calls to cleancache_register_ops will return
      EBUSY.  Since no user of cleancache implements chaining, we only need to
      make minor changes to the code outside the cleancache core.
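
      A register-once interface of this kind can be sketched as follows.
      The names struct sk_ops, sk_registered_ops and sk_register_ops() are
      hypothetical, the sketch is single-threaded and does not attempt the
      atomicity real registration code needs, and returning the negated
      errno value (-EBUSY) follows the usual kernel convention rather than
      anything stated in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver's operations table. */
struct sk_ops {
	const char *name;
};

static const struct sk_ops *sk_registered_ops;

/* Accept the first registration; refuse every later one with -EBUSY. */
static int sk_register_ops(const struct sk_ops *ops)
{
	assert(ops != NULL);

	if (sk_registered_ops)
		return -EBUSY;

	sk_registered_ops = ops;
	return 0;
}
```

      Because the second caller is refused outright, it never gets the
      chance to hand out pool ids that collide with the first driver's.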
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Stefan Hengelein <ilendir@googlemail.com>
      Cc: Florian Schmaus <fschmaus@gmail.com>
      Cc: Andor Daam <andor.daam@googlemail.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53d85c98
    • Vladimir Davydov's avatar
      cleancache: zap uuid arg of cleancache_init_shared_fs · 9de16262
      Vladimir Davydov authored
      Use super_block->s_uuid instead.  Every shared filesystem using cleancache
      must now initialize super_block->s_uuid before calling
      cleancache_init_shared_fs.  The only one in the tree, ocfs2, already meets
      this requirement.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Stefan Hengelein <ilendir@googlemail.com>
      Cc: Florian Schmaus <fschmaus@gmail.com>
      Cc: Andor Daam <andor.daam@googlemail.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9de16262
    • Shachar Raindel's avatar
      mm: refactor do_wp_page handling of shared vma into a function · 93e478d4
      Shachar Raindel authored
      The do_wp_page function is extremely long.  Extract the logic for
      handling a page belonging to a shared vma into a function of its own.
      
      This helps the readability of the code, without doing any functional
      change in it.
      Signed-off-by: Shachar Raindel <raindel@mellanox.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Haggai Eran <haggaie@mellanox.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93e478d4
    • Shachar Raindel's avatar
      mm: refactor do_wp_page, extract the page copy flow · 2f38ab2c
      Shachar Raindel authored
      In some cases, do_wp_page had to copy the page suffering a write fault
      to a new location.  If the function logic decided to do this, it
      was done by jumping with a "goto" operation to the relevant code block.
      This made the code really hard to understand.  It is also against the
      kernel coding style guidelines.
      
      This patch extracts the page copy and page table update logic to a
      separate function.  It also cleans up the naming, from "gotten" to
      "wp_page_copy", and adds a few comments.
      Signed-off-by: Shachar Raindel <raindel@mellanox.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Haggai Eran <haggaie@mellanox.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f38ab2c
    • Shachar Raindel's avatar
      mm: refactor do_wp_page - rewrite the unlock flow · 28766805
      Shachar Raindel authored
      When do_wp_page is ending, in several cases it needs to unlock the pages
      and ptls it was accessing.
      
      Currently, this logic is "called" by using a goto jump.  This makes
      following the control flow of the function harder.  Readability is
      further hampered by the unlock case containing a large amount of logic
      needed only in one of the 3 cases.
      
      Using goto for cleanup is generally allowed.  However, moving the
      trivial unlocking flows to the relevant call sites allows deeper
      refactoring in the next patch.
      Signed-off-by: Shachar Raindel <raindel@mellanox.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Haggai Eran <haggaie@mellanox.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28766805
    • Shachar Raindel's avatar
      mm: refactor do_wp_page, extract the reuse case · 4e047f89
      Shachar Raindel authored
      Currently do_wp_page contains 265 code lines.  It also contains 9 goto
      statements, of which 5 are targeting labels which are not cleanup
      related.  This makes the function extremely difficult to understand.
      
      The following patches are an attempt at breaking the function into its
      basic components, and making it easier to understand.
      
      The patches are straightforward function extractions from do_wp_page.
      As we extract functions, we remove unneeded parameters and simplify the
      code as much as possible.  However, the functionality is supposed to
      remain completely unchanged.  The patches also attempt to document the
      functionality of each extracted function.  In patch 2, we split the
      unlock logic so that each use case contains only the logic relevant to
      its specific needs, instead of having a huge number of conditional
      decisions in a single unlock flow.
      
      This patch (of 4):
      
      Near its end, do_wp_page needs in several cases to reuse the existing
      page.  This is achieved by making the page table entry writable, and
      possibly updating the page-cache state.
      
      Currently, this logic is "called" by using a goto jump.  This makes
      following the control flow of the function harder.  It is also against the
      coding style guidelines for using goto.
      
      As the code can easily be refactored into a specialized function, refactor
      it out and simplify the code flow in do_wp_page.
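      
      As a rough userspace illustration of the refactoring described above
      (not the kernel code itself), the sketch below models a shared "reuse"
      step as a helper function that each qualifying case calls directly.  All
      names (toy_pte, toy_wp_page_reuse, toy_do_wp_page, the TOY_FAULT_*
      codes) and the two reuse conditions are assumptions made up for this
      example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a page table entry; not the kernel's pte_t. */
struct toy_pte {
    bool writable;
    bool dirty;
    bool young;
};

/* Illustrative result codes; these are not kernel constants. */
enum toy_fault { TOY_FAULT_WRITE = 1, TOY_FAULT_COPY = 2 };

/*
 * The "reuse" step as its own function: mark the existing entry
 * accessed, dirty and writable instead of copying the page.
 */
static enum toy_fault toy_wp_page_reuse(struct toy_pte *pte)
{
    pte->young = true;
    pte->dirty = true;
    pte->writable = true;
    return TOY_FAULT_WRITE;
}

/*
 * Simplified write-protect fault handler.  Each case that can reuse the
 * page calls the helper and returns, where a goto-based version would
 * jump to a shared "reuse:" label in the middle of the function.
 */
static enum toy_fault toy_do_wp_page(struct toy_pte *pte, bool sole_owner,
                                     bool shared_writable)
{
    if (sole_owner)
        return toy_wp_page_reuse(pte);
    if (shared_writable)
        return toy_wp_page_reuse(pte);
    return TOY_FAULT_COPY; /* the copy path is elided in this sketch */
}
```

      With the helper in place, the remaining labels in the caller can be
      reserved for cleanup only.
      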
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Haggai Eran <haggaie@mellanox.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: completely remove dumping per-cpu lists from show_mem() · 761b0677
      Konstantin Khlebnikov authored
      It seems nobody needs this.
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hide per-cpu lists in output of show_mem() · d1bfcdb8
      Konstantin Khlebnikov authored
      This makes show_mem() much less verbose on huge machines.  Instead of a
      huge and almost useless dump of counters for each per-zone per-cpu list,
      this patch prints the sum of these counters for each zone (free_pcp) and
      the size of the per-cpu list for the current cpu (local_pcp).
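      
      The two summary values can be pictured with a small userspace model.
      This is only a sketch: the toy_zone structure, the fixed TOY_NR_CPUS and
      both helper names are assumptions for illustration and do not mirror the
      kernel's data structures.

```c
#include <stddef.h>

#define TOY_NR_CPUS 4 /* assumed CPU count for the sketch */

/* Hypothetical per-zone model: one per-cpu list size per CPU. */
struct toy_zone {
    const char *name;
    unsigned long pcp_count[TOY_NR_CPUS];
};

/* Sum of the per-cpu list sizes for one zone (the free_pcp value). */
static unsigned long toy_free_pcp(const struct toy_zone *zone)
{
    unsigned long sum = 0;

    for (size_t cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        sum += zone->pcp_count[cpu];
    return sum;
}

/* Size of the list of the CPU doing the dump (the local_pcp value). */
static unsigned long toy_local_pcp(const struct toy_zone *zone,
                                   size_t this_cpu)
{
    return zone->pcp_count[this_cpu];
}
```

      One line per zone then carries two numbers, regardless of how many CPUs
      the machine has.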
      
      The filter flag SHOW_MEM_PERCPU_LISTS reverts to the old verbose mode.
      
      [akpm@linux-foundation.org: update show_free_areas comment]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • page_writeback: clean up mess around cancel_dirty_page() · b9ea2515
      Konstantin Khlebnikov authored
      This patch replaces cancel_dirty_page() with a helper function
      account_page_cleaned() which only updates counters.  It's called from
      truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
      The page is locked in both cases, and the page lock protects against
      concurrent dirtiers: see commit 2d6d7f98 ("mm: protect set_page_dirty()
      from ongoing truncation").
      
      delete_from_page_cache() shouldn't be called for dirty pages; they must
      be handled by the caller (either written or truncated).  This patch
      treats the final dirty accounting fixup at the end of
      __delete_from_page_cache() as a debug check and adds WARN_ON_ONCE()
      around it.  If something removes dirty pages without proper handling,
      that might be a bug and unwritten data might be lost.
      
      Hugetlbfs has no dirty page accounting, so ClearPageDirty() is enough
      here.
      
      cancel_dirty_page() in nfs_wb_page_cancel() is redundant.  This is a
      helper for nfs_invalidate_page() and it's called only in the case of
      complete invalidation.
      
      The mess started in v2.6.20 after commits 46d2277c ("Clean up
      and make try_to_free_buffers() not race with dirty pages") and
      3e67c098 ("truncate: clear page dirtiness before running
      try_to_free_buffers()").  The first was reverted right in v2.6.20 in
      commit ecdfc978 ("Resurrect 'try_to_free_buffers()' VM hackery"), the
      second in v2.6.25 in commit a2b34564 ("Fix dirty page accounting leak
      with ext3 data=journal").
      
      Custom fixes were introduced between these points: the NFS fix in
      v2.6.23, commit 1b3b4a1a ("NFS: Fix a write request leak in
      nfs_invalidate_page()"), and the kludge in __delete_from_page_cache() in
      v2.6.24, commit 3a692790 ("Do dirty page accounting when removing a page
      from the page cache").  Since v2.6.25 all of them are redundant.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: incorporate zero pages into transparent huge pages · ca0984ca
      Ebru Akagunduz authored
      This patch improves THP collapse rates by allowing zero pages.
      
      Currently THP can collapse 4kB pages into a THP when there are up to
      khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch counts
      pte_none ptes and mapped zero pages with the same variable.
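      
      A minimal userspace sketch of that counting rule follows.  It is a toy
      model, not the khugepaged scan itself: the toy_pte_state enum and
      toy_can_collapse() are hypothetical names, and max_ptes_none stands in
      for khugepaged_max_ptes_none.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy classification of one entry in the scanned range. */
enum toy_pte_state { TOY_PTE_NONE, TOY_PTE_ZERO_PAGE, TOY_PTE_MAPPED };

/*
 * Returns true if the range may be collapsed.  Empty entries and
 * zero-page mappings share one counter, and the range is rejected as
 * soon as that counter exceeds max_ptes_none.
 */
static bool toy_can_collapse(const enum toy_pte_state *ptes, size_t nr,
                             unsigned int max_ptes_none)
{
    unsigned int none_or_zero = 0;

    for (size_t i = 0; i < nr; i++) {
        if (ptes[i] == TOY_PTE_NONE || ptes[i] == TOY_PTE_ZERO_PAGE) {
            if (++none_or_zero > max_ptes_none)
                return false;
        }
    }
    return true;
}
```

      In this model, a range that was only read (and therefore maps the zero
      page) is treated exactly like a range with untouched entries.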
      
      The patch was tested with a program that allocates 800MB of
      memory, and performs interleaved reads and writes, in a pattern
      that causes some 2MB areas to first see read accesses, resulting
      in the zero pfn being mapped there.
      
      To simulate memory fragmentation at allocation time, I modified
      do_huge_pmd_anonymous_page to return VM_FAULT_FALLBACK for read faults.
      
      Without the patch, only 50% of the program was collapsed into THP and the
      percentage did not increase over time.
      
      With this patch, after 10 minutes of waiting, khugepaged had collapsed 99%
      of the program's memory.
      
      [aarcange@redhat.com: fix bogus BUG()]
      Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction: enhance compaction finish condition · 2149cdae
      Joonsoo Kim authored
      Compaction has an anti-fragmentation algorithm: if we don't find any free
      page in the buddy list of the requested migratetype, a free page of
      pageblock order or higher is required to finish compaction.  This is
      meant to mitigate fragmentation, but it does not take the migratetype
      into account and it is too strict compared to the page allocator's
      anti-fragmentation algorithm.
      
      Not considering the migratetype can cause compaction to finish
      prematurely.  For example, if the allocation request is for the unmovable
      migratetype, a free page with the CMA migratetype doesn't help that
      allocation, and compaction should not be stopped.  But the current logic
      treats this situation as if compaction were no longer needed, and
      finishes the compaction.
      
      Secondly, the condition is too strict compared to the page allocator's
      logic.  In the page allocator, we can steal a free page from another
      migratetype and change the pageblock migratetype under more relaxed
      conditions.  This is designed to prevent fragmentation and we can use it
      here.  Imposing a hard constraint only on compaction doesn't help much in
      this case, since the page allocator would cause fragmentation again.
      
      To solve these problems, this patch borrows the anti-fragmentation logic
      from the page allocator.  It will reduce premature compaction finish in
      some cases and reduce excessive compaction work.
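      
      The migratetype-aware finish check can be sketched with a toy model.
      This is a heavily simplified illustration, not the kernel
      implementation: the toy_mt enum, toy_free_area, and both functions are
      hypothetical, and the fallback rule used here (any normal migratetype
      may be stolen from, CMA free pages help only movable requests) is an
      assumption; the page allocator's real stealing rules are more detailed.

```c
#include <stdbool.h>

enum toy_mt { TOY_UNMOVABLE, TOY_MOVABLE, TOY_CMA, TOY_NR_MT };

/* Free-page counts of a suitable order, split by migratetype. */
struct toy_free_area {
    unsigned int nr_free[TOY_NR_MT];
};

/*
 * Assumed rule for the sketch: a request can use its own migratetype,
 * can steal from another normal migratetype, and can use CMA free
 * pages only if the request itself is movable.
 */
static bool toy_can_use(enum toy_mt want, enum toy_mt have)
{
    if (have == want)
        return true;
    if (have == TOY_CMA)
        return want == TOY_MOVABLE;
    return true;
}

/* Compaction may finish only if some free page helps this request. */
static bool toy_compact_finished(const struct toy_free_area *area,
                                 enum toy_mt want)
{
    for (int mt = 0; mt < TOY_NR_MT; mt++) {
        if (area->nr_free[mt] > 0 && toy_can_use(want, (enum toy_mt)mt))
            return true;
    }
    return false;
}
```

      Under these assumptions, an unmovable request facing only CMA free pages
      keeps compacting, while a stealable movable free page lets it finish.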
      
      The stress-highalloc test in mmtests with non-movable order-7 allocations
      shows a considerable increase in the compaction success rate.
      
      Compaction success rate (Compaction success * 100 / Compaction stalls, %)
      31.82 : 42.20
      
      I tested it with 5 runs of the stress-highalloc benchmark without
      rebooting and found that the allocation success rate does not degrade any
      more than before.  That roughly means that this patch doesn't result in
      more fragmentation.
      
      Vlastimil suggested an additional idea: test for fallbacks only when the
      migration scanner has scanned a whole pageblock.  It looked good for
      fragmentation, because the chance of stealing increases as more free
      pages are made in a given pageblock.  So I tested it, but it results in a
      decreased compaction success rate, roughly 38.00.  I guess the reason is
      that if the system is in a low-memory condition, the watermark check can
      fail due to a lack of order-0 free pages, so sometimes we can't reach the
      fallback check even though migrate_pfn is aligned to pageblock_nr_pages.
      I could insert code to cope with this situation, but it would make the
      code more complicated, so I don't include his idea in this patch.
      
      [akpm@linux-foundation.org: fix CONFIG_CMA=n build]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>