1. 13 Dec, 2014 1 commit
  2. 10 Dec, 2014 10 commits
    • mm: move page->mem_cgroup bad page handling into generic code · 9edad6ea
      Johannes Weiner authored
      
      
      Now that the external page_cgroup data structure and its lookup are
      gone, let the generic bad_page() check for page->mem_cgroup sanity.
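      As a rough illustration, the check now sits next to the other generic
      bad-page checks in the free path.  A minimal sketch, modeled on
      free_pages_check() of that era with a hypothetical wrapper name (the
      exact kernel code may differ):

          static inline bool page_is_bad(struct page *page)
          {
                  const char *bad_reason = NULL;

                  if (unlikely(page_mapcount(page)))
                          bad_reason = "nonzero mapcount";
          #ifdef CONFIG_MEMCG
                  /* new: a page being freed must no longer be charged */
                  if (unlikely(page->mem_cgroup))
                          bad_reason = "page still charged to cgroup";
          #endif
                  if (unlikely(bad_reason)) {
                          bad_page(page, bad_reason, 0);
                          return true;
                  }
                  return false;
          }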
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9edad6ea
    • mm: embed the memcg pointer directly into struct page · 1306a85a
      Johannes Weiner authored
      
      
      Memory cgroups used to have 5 per-page pointers.  To allow users to
      disable that amount of overhead during runtime, those pointers were
      allocated in a separate array, with a translation layer between them and
      struct page.
      
      There is now only one page pointer remaining: the memcg pointer, which
      indicates which cgroup the page is associated with when charged.  The
      complexity of runtime allocation and the runtime translation overhead is
      no longer justified to save that *potential* 0.19% of memory.  With
      CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
      after the page->private member and doesn't even increase struct page,
      so this patch actually saves space.  Remaining users that care can
      still compile their kernels without CONFIG_MEMCG.
      
           text    data     bss     dec     hex     filename
        8828345 1725264  983040 11536649 b00909  vmlinux.old
        8827425 1725264  966656 11519345 afc571  vmlinux.new
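      For orientation, the embedded pointer roughly takes this shape (a
      simplified sketch of the relevant part of struct page; the real
      definition in include/linux/mm_types.h is a larger set of unions):

          struct page {
                  unsigned long flags;
                  /* ... mapping, index, counters ... */
                  unsigned long private;
          #ifdef CONFIG_MEMCG
                  /*
                   * Cgroup the page is charged to, or NULL.  With SLUB this
                   * lands in existing doubleword padding after ->private.
                   */
                  struct mem_cgroup *mem_cgroup;
          #endif
                  /* ... */
          };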
      
      [mhocko@suse.cz: update Documentation/cgroups/memory.txt]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1306a85a
    • mm: fix a spelling mistake · 26086de3
      Wei Yuan authored
      
      
      Signed-off-by: Wei Yuan <weiyuan.wei@huawei.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26086de3
    • mm, compaction: more focused lru and pcplists draining · fdaf7f5c
      Vlastimil Babka authored
      
      
      The goal of memory compaction is to create high-order freepages through
      page migration.  Page migration, however, puts pages on the per-cpu
      lru_add cache, which is later flushed to per-cpu pcplists, and only
      after the pcplists are drained can the pages actually merge.  The drain
      can happen because the per-cpu caches become full through further
      freeing, or explicitly.
      
      During direct compaction, it is useful to do the draining explicitly so
      that pages merge as soon as possible and compaction can detect success
      immediately and keep the latency impact at minimum.  However the current
      implementation is far from ideal.  Draining is done only in
      __alloc_pages_direct_compact(), after all zones have already been
      compacted, and the decisions to continue or stop compaction in
      individual zones were made without the last batch of migrations being
      merged.  It also misses draining the lru_add cache before the pcplists.
      
      This patch moves the draining for direct compaction into compact_zone().
      It adds the missing lru_cache draining and uses the newly introduced
      single zone pcplists draining to reduce overhead and avoid impact on
      unrelated zones.  Draining is only performed when it can actually lead to
      merging of a page of desired order (passed by cc->order).  This means it
      is only done when migration occurred in the previously scanned cc->order
      aligned block(s) and the migration scanner is now pointing to the next
      cc->order aligned block.
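      A hedged sketch of the drain trigger described above (field names such
      as last_migrated_pfn follow the patch description; the exact code in
      compact_zone() may differ):

          unsigned int cpu;

          /*
           * If migration happened in the block(s) just scanned and the
           * migration scanner has moved past the next cc->order aligned
           * boundary, drain so the migrated pages can merge up to cc->order.
           */
          if (cc->order > 0 && cc->last_migrated_pfn) {
                  unsigned long current_block_start =
                          cc->migrate_pfn & ~((1UL << cc->order) - 1);

                  if (cc->last_migrated_pfn < current_block_start) {
                          cpu = get_cpu();
                          lru_add_drain_cpu(cpu);   /* flush lru_add cache */
                          drain_local_pages(zone);  /* flush this zone's pcplists */
                          put_cpu();
                          cc->last_migrated_pfn = 0;
                  }
          }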
      
      The patch has been tested with stress-highalloc benchmark from mmtests.
      Although overall allocation success rates of the benchmark were not
      affected, the number of detected compaction successes has doubled.  This
      suggests that allocations were previously successful due to implicit
      merging caused by background activity, making a later allocation attempt
      succeed immediately, but not attributing the success to compaction.  Since
      stress-highalloc always tries to allocate almost the whole memory, it
      cannot show the improvement in its reported success rate metric.  However
      after this patch, compaction should detect success and terminate earlier,
      reducing the direct compaction latencies in a real scenario.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdaf7f5c
    • mm, compaction: simplify deferred compaction · 97d47a65
      Vlastimil Babka authored
      Since commit 53853e2d ("mm, compaction: defer each zone individually
      instead of preferred zone"), compaction is deferred for each zone where
      sync direct compaction fails, and reset where it succeeds.  However, it
      was observed that for the DMA zone, compaction often appeared to succeed
      while a subsequent allocation attempt would not, due to a different
      outcome of the watermark check.
      
      In order to properly defer compaction in this zone, the candidate zone
      has to be passed back to __alloc_pages_direct_compact() and compaction
      deferred in the zone after the allocation attempt fails.
      
      The large source of mismatch between watermark check in compaction and
      allocation was the lack of alloc_flags and classzone_idx values in
      compaction, which has been fixed in the previous patch.  So with this
      problem fixed, we can simplify the code by removing the candidate_zone
      parameter and deferring in __alloc_pages_direct_compact().
      
      After this patch, the compaction activity during stress-highalloc
      benchmark is still somewhat increased, but it's negligible compared to the
      increase that occurred without the better watermark checking.  This
      suggests that it is still possible to apparently succeed in compaction but
      fail to allocate, possibly due to parallel allocation activity.
      
      [akpm@linux-foundation.org: fix build]
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97d47a65
    • mm, compaction: pass classzone_idx and alloc_flags to watermark checking · ebff3980
      Vlastimil Babka authored
      
      
      Compaction relies on zone watermark checks for decisions such as whether
      it is worth starting to compact in compaction_suitable() or whether
      compaction should stop in compact_finished().  The watermark checks take
      classzone_idx and alloc_flags parameters, which are related to the memory
      allocation request.  But from the context of compaction they are
      currently passed as 0, even for direct compaction, which is invoked to
      satisfy an allocation request and could therefore know the proper values.
      
      The lack of proper values can lead to mismatch between decisions taken
      during compaction and decisions related to the allocation request.  Lack
      of proper classzone_idx value means that lowmem_reserve is not taken into
      account.  This has manifested (during recent changes to deferred
      compaction) when DMA zone was used as fallback for preferred Normal zone.
      compaction_suitable() without proper classzone_idx would think that the
      watermarks are already satisfied, but watermark check in
      get_page_from_freelist() would fail.  Because of this problem, deferring
      compaction has extra complexity that can be removed in the following
      patch.
      
      The issue (not confirmed in practice) with missing alloc_flags is opposite
      in nature.  For allocations that include ALLOC_HIGH, ALLOC_HARDER or
      ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
      CMA-enabled systems) the watermark checking in compaction with 0 passed
      will be stricter than in get_page_from_freelist().  In these cases
      compaction might be running for a longer time than is really needed.
      
      Another issue in compaction_suitable() is that the check for "does the
      zone need compaction at all?" comes only after the check "does the zone
      have enough free pages to succeed at compaction?".  The latter considers
      extra
      pages for migration and can therefore in some situations fail and return
      COMPACT_SKIPPED, although the high-order allocation would succeed and we
      should return COMPACT_PARTIAL.
      
      This patch fixes these problems by adding alloc_flags and classzone_idx to
      struct compact_control and related functions involved in direct compaction
      and watermark checking.  Where possible, all other callers of
      compaction_suitable() pass proper values where those are known.  This is
      currently limited to classzone_idx, which is sometimes known in kswapd
      context.  However, the direct reclaim callers should_continue_reclaim()
      and compaction_ready() do not currently know the proper values, so the
      coordination between reclaim and compaction may still not be as accurate
      as it could be.  This can be fixed later, if it's shown to be an issue.
      
      Additionally, the checks in compaction_suitable() are reordered to
      address the second issue described above.
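      As a rough sketch of the plumbing (simplified; the real fields live in
      struct compact_control in mm/internal.h and the check in
      compaction_suitable()):

          struct compact_control {
                  /* ... existing fields such as order, zone, migrate_pfn ... */
                  const int alloc_flags;     /* alloc flags of the direct compactor */
                  const int classzone_idx;   /* zone index of the direct compactor */
          };

          /*
           * Watermark check with the allocation's context instead of the
           * previously hard-coded classzone_idx = 0, alloc_flags = 0.
           */
          watermark = low_wmark_pages(zone) + (2UL << order);
          if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
                  return COMPACT_SKIPPED;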
      
      The effect of this patch should be slightly better high-order allocation
      success rates and/or less compaction overhead, depending on the type of
      allocations and presence of CMA.  It allows simplifying deferred
      compaction code in a followup patch.
      
      When testing with stress-highalloc, there was some slight improvement
      (which might be just due to variance) in success rates of non-THP-like
      allocations.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebff3980
    • mm: verify compound order when freeing a page · ab1f306f
      Yu Zhao authored
      
      
      This allows us to catch the bug fixed in the previous patch (mm: free
      compound page with correct order).
      
      Here we also verify whether a page is a tail page or not -- tail pages
      are supposed to be freed along with their head page, not by themselves.
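      A minimal sketch of such checks in the free path (hedged; the actual
      assertions use the kernel's VM_BUG_ON_PAGE machinery and may be phrased
      differently):

          /* Tail pages must be freed as part of their compound head. */
          VM_BUG_ON_PAGE(PageTail(page), page);
          /* The order passed to the free path must match the compound order. */
          VM_BUG_ON_PAGE(PageHead(page) && compound_order(page) != order, page);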
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab1f306f
    • mm, cma: drain single zone pcplists · 510f5507
      Vlastimil Babka authored
      
      
      CMA allocation drains pcplists so that pages can merge back into the
      buddy allocator.  Since it operates on a single zone, we can now reduce
      the pcplists drain to that single zone.

      The change should make CMA allocations faster and avoid disturbing
      unrelated pcplists.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      510f5507
    • mm: introduce single zone pcplists drain · 93481ff0
      Vlastimil Babka authored
      
      
      The functions for draining per-cpu pages back to buddy allocators
      currently always operate on all zones.  There are however several cases
      where the drain is only needed in the context of a single zone, and
      spilling other pcplists is a waste of time both due to the extra
      spilling and later refilling.
      
      This patch introduces a new zone pointer parameter to drain_all_pages()
      and changes the dummy parameter of drain_local_pages() to also be a zone
      pointer.  When NULL is passed, the functions operate on all zones as
      usual.  Passing a specific zone pointer reduces the work to that single
      zone.
      
      All callers are updated to pass the NULL pointer in this patch.
      Conversion to single zone (where appropriate) is done in further
      patches.
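      The resulting interface, roughly (a sketch; see mm/page_alloc.c for the
      real declarations), together with the kind of call sites later patches
      in the series convert:

          /* NULL zone means "all zones", preserving the old behaviour */
          void drain_all_pages(struct zone *zone);
          void drain_local_pages(struct zone *zone);

          /* this patch: all existing callers stay zone-agnostic */
          drain_all_pages(NULL);

          /* follow-up patches (e.g. CMA, compaction): target one zone */
          drain_all_pages(cc.zone);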
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93481ff0
  3. 13 Nov, 2014 6 commits
    • mm/debug-pagealloc: correct freepage accounting and order resetting · 57cbc87e
      Joonsoo Kim authored
      
      
      One thing I did in this patch is fixing the freepage accounting.  If we
      clear a guard page and link it onto the isolate buddy list, we should
      not increase the freepage count.  This patch adds a conditional branch
      to skip the counting in this case.  Without this patch, the overcounting
      happens frequently if the guard order is set and CMA is used.

      Another thing fixed in this patch is the target of the order reset.  In
      __free_one_page(), we check whether the buddy page is a guard page or
      not, and if so, we should clear the guard attribute on the buddy page
      and reset its order to 0.  But the current code resets the original
      page's order rather than the buddy's.  Maybe this doesn't cause any
      problem, because the whole merged page's order will be re-assigned soon,
      but it is better to correct the code.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57cbc87e
    • mm: alloc_contig_range: demote pages busy message from warn to info · dae803e1
      Michal Nazarewicz authored
      
      
      Having the test_pages_isolated failure message as a warning confuses
      users into thinking that it is more serious than it really is.  In
      reality, if called via CMA, the allocation will be retried, so a single
      test_pages_isolated failure does not prevent the allocation from
      succeeding.
      
      Demote the warning message to an info message and reformat it so that
      the text "failed" does not appear and a less worrying "PFNs busy" is
      used instead.
      
      This message is trivially reproducible on a 10GB x86 machine on 3.16.y
      kernels configured with CONFIG_DMA_CMA.
      Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dae803e1
    • mm/page_alloc: restrict max order of merging on isolated pageblock · 3c605096
      Joonsoo Kim authored
      
      
      The current pageblock isolation logic isolates each pageblock
      individually.  This causes a freepage accounting problem if a freepage
      with pageblock order on an isolated pageblock is merged with another
      freepage on a normal pageblock.  We can prevent such merging by
      restricting the max order of merging to pageblock order if the freepage
      is on an isolated pageblock.

      A side-effect of this change is that there could be a non-merged buddy
      freepage even after pageblock isolation finishes, because undoing
      pageblock isolation just moves freepages from the isolate buddy list to
      the normal buddy list without considering merging.  So the patch also
      makes undoing pageblock isolation consider freepage merging.  During
      un-isolation, a freepage with more than pageblock order and its buddy
      are checked.  If they are on a normal pageblock, instead of just moving,
      we isolate the freepage and free it so that it gets merged.
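      A hedged sketch of the merge-order cap in __free_one_page() (simplified;
      the real code also handles the CMA/isolate interaction):

          unsigned int max_order = MAX_ORDER - 1;

          /*
           * Never merge past a pageblock boundary when freeing into an
           * isolated pageblock, so isolate and normal freelists stay apart.
           */
          if (is_migrate_isolate(migratetype))
                  max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);

          while (order < max_order) {
                  /* usual buddy-merging loop */
          }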
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c605096
    • mm/page_alloc: move freepage counting logic to __free_one_page() · 8f82b55d
      Joonsoo Kim authored
      
      
      All the callers of __free_one_page() have similar freepage counting
      logic, so we can move it into __free_one_page().  This reduces the lines
      of code and helps future maintenance.

      This is also a preparation step for "mm/page_alloc: restrict max order
      of merging on isolated pageblock", which fixes the freepage counting
      problem for freepages with more than pageblock order.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f82b55d
    • mm/page_alloc: add freepage on isolate pageblock to correct buddy list · 51bb1a40
      Joonsoo Kim authored
      
      
      In free_pcppages_bulk(), we use the cached migratetype of a freepage to
      determine the buddy list the freepage will be added to.  This
      information is stored when the freepage is added to the pcp list, so if
      isolation of this freepage's pageblock begins after it was stored, the
      cached information could be stale.  In other words, it holds the
      original migratetype rather than MIGRATE_ISOLATE.
      
      There are two problems caused by this stale information.
      
      One is that we can't keep these freepages from being allocated.
      Although the pageblock is isolated, the freepage will be added to a
      normal buddy list, so it can be allocated without any restriction.  The
      other problem is incorrect freepage accounting: freepages on an isolated
      pageblock should not be counted as free pages.
      
      Following is the code snippet in free_pcppages_bulk().
      
          /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
          __free_one_page(page, page_to_pfn(page), zone, 0, mt);
          trace_mm_page_pcpu_drain(page, 0, mt);
          if (likely(!is_migrate_isolate_page(page))) {
              __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
              if (is_migrate_cma(mt))
                  __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
          }
      
      As you can see in the above snippet, the current code already handles
      the second problem, incorrect freepage accounting, by re-fetching the
      pageblock migratetype through is_migrate_isolate_page(page).

      But, because this re-fetched information isn't used for
      __free_one_page(), the first problem is not solved.  This patch solves
      it by re-fetching the pageblock migratetype before __free_one_page()
      and using it for __free_one_page().

      In addition to moving the re-fetch up, this patch applies an
      optimization: the migratetype is re-fetched only if the zone has an
      isolated pageblock.  Pageblock isolation is a rare event, so we can
      avoid re-fetching in the common case.

      This patch also corrects the migratetype in the tracepoint output.
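      A hedged sketch of the resulting free_pcppages_bulk() fragment (helper
      names such as has_isolate_pageblock() come from this patch series; the
      exact code may differ):

          mt = get_freepage_migratetype(page);

          /*
           * The cached migratetype may be stale if the pageblock was
           * isolated after the page landed on the pcp list; re-fetch it,
           * but only when the zone actually has isolated pageblocks.
           */
          if (unlikely(has_isolate_pageblock(zone)))
                  mt = get_pageblock_migratetype(page);

          /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
          __free_one_page(page, page_to_pfn(page), zone, 0, mt);
          trace_mm_page_pcpu_drain(page, 0, mt);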
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51bb1a40
    • mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype · ad53f92e
      Joonsoo Kim authored
      
      
      Before describing the bugs themselves, I first explain the definition of
      a freepage.

       1. Pages on a buddy list are counted as freepages.
       2. Pages on an isolate-migratetype buddy list are *not* counted as freepages.
       3. Pages on a CMA buddy list are also counted as CMA freepages.
      
      Now, I describe the problems and the related patches.

      Patch 1: There are race conditions when getting the pageblock
      migratetype that result in misplacement of freepages on the buddy lists,
      an incorrect freepage count and unavailability of freepages.

      Patch 2: Freepages on the pcp lists could carry stale cached information
      used to determine which buddy list they go to.  This causes misplacement
      of freepages on the buddy lists and an incorrect freepage count.

      Patch 4: Merging between freepages on pageblocks of different
      migratetypes causes a freepage accounting problem.  This patch fixes it.
      
      Without patchset [3], the above problems don't happen in my CMA
      allocation test, because CMA reserved pages aren't used at all, so there
      is no chance for the race.

      With patchset [3], I did a simple CMA allocation test and got the
      following result:

       - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
       - kernel build (make -j16) running in the background
       - 30 CMA allocation attempts (8MB * 30 = 240MB) at 5 sec intervals
       - Result: more than 5000 freepages missing from the count

      With patchset [3] and this patchset, no freepage count is missed, so I
      conclude that the problems are solved.

      These problems also occur in my simple memory offlining test.
      
      This patch (of 4):
      
      There are two paths to the core free function of the buddy allocator,
      __free_one_page(): one is free_one_page()->__free_one_page() and the
      other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().
      Each path has a race condition causing serious problems.  This patch
      focuses on the first type of freepath; the following patch will solve
      the problem in the second type.

      In the first type of freepath, we get the migratetype of the freeing
      page without holding the zone lock, so it can be racy.  There are two
      cases of this race.
      
       1. pages are added to the isolate buddy list after restoring the
          original migratetype
      
          CPU1                                   CPU2
      
          get migratetype => return MIGRATE_ISOLATE
          call free_one_page() with MIGRATE_ISOLATE
      
                                      grab the zone lock
                                      unisolate pageblock
                                      release the zone lock
      
          grab the zone lock
          call __free_one_page() with MIGRATE_ISOLATE
          the freepage goes onto the isolate buddy list,
          although the pageblock is already unisolated
      
      This may cause two problems.  One is that we can't use this page
      anymore until the next isolation attempt on this pageblock, because the
      freepage is on the isolate buddy list.  The other is that freepage
      accounting could be wrong due to merging between different buddy lists:
      freepages on the isolate buddy list aren't counted as freepages, but
      ones on normal buddy lists are.  If a merge happens, a buddy freepage on
      a normal buddy list is inevitably moved to the isolate buddy list
      without any adjustment of the freepage accounting, so the count can
      become incorrect.
      
       2. pages are added to a normal buddy list while the pageblock is
          isolated.  This is similar to the above case.

      This may also cause two problems.  One is that we can't keep these
      freepages from being allocated: although the pageblock is isolated, the
      freepage would be added to a normal buddy list and could be allocated
      without any restriction.  The other problem is the same as in case 1,
      that is, incorrect freepage accounting.
      
      This race condition can be prevented by checking the migratetype again
      while holding the zone lock.  Because that is a somewhat heavy operation
      and isn't needed in the common case, we want to avoid the recheck as
      much as possible.  So this patch introduces a new field,
      nr_isolate_pageblock in struct zone, to track whether the zone has any
      isolated pageblock.  With this, we can avoid re-checking the migratetype
      in the common case and do it only if there is an isolated pageblock or
      the migratetype is MIGRATE_ISOLATE.  This solves the above-mentioned
      problems.
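      A hedged sketch of the mechanism (simplified; the real counter and
      helpers live in include/linux/mmzone.h, mm/page_isolation.c and the
      free path of mm/page_alloc.c):

          struct zone {
                  /* ... */
          #ifdef CONFIG_MEMORY_ISOLATION
                  /*
                   * Number of isolated pageblocks; maintained by the
                   * set/unset isolation paths under the zone lock.
                   */
                  unsigned long nr_isolate_pageblock;
          #endif
          };

          /* free_one_page(), with zone->lock held: re-check only when needed */
          if (unlikely(has_isolate_pageblock(zone) ||
                       is_migrate_isolate(migratetype)))
                  migratetype = get_pfnblock_migratetype(page, pfn);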
      
      Changes from v3:
      Add one more check in free_one_page() for whether the migratetype is
      MIGRATE_ISOLATE or not.  Without this, the above-mentioned case 1 could
      still happen.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad53f92e
  4. 27 Oct, 2014 1 commit
    • cpuset: simplify cpuset_node_allowed API · 344736f2
      Vladimir Davydov authored
      The current cpuset API for checking if a zone/node is allowed to allocate
      from looks rather awkward. We have hardwall and softwall versions of
      cpuset_node_allowed with the softwall version doing literally the same
      as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
      If it isn't, the softwall version may check the given node against the
      enclosing hardwall cpuset, which it needs to take the callback lock to
      do.
      
      Such a distinction was introduced by commit 02a0e53d ("cpuset: rework
      cpuset_zone_allowed api").  Before, we had only one version, with the
      __GFP_HARDWALL flag determining its behavior.  The purpose of the
      commit was to avoid sleep-in-atomic bugs when someone would mistakenly
      call the function without the __GFP_HARDWALL flag for an atomic
      allocation. The suffixes introduced were intended to make the callers
      think before using the function.
      
      However, since the callback lock was converted from mutex to spinlock by
      the previous patch, the softwall check function cannot sleep, and these
      precautions are no longer necessary.
      
      So let's simplify the API back to the single check.
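      The simplified surface, roughly (a sketch; the exact declaration is in
      include/linux/cpuset.h and the return type may differ):

          /* single check again: __GFP_HARDWALL in gfp_mask selects hardwall */
          int cpuset_node_allowed(int node, gfp_t gfp_mask);

          /* typical caller shape in an allocator path */
          if (!cpuset_node_allowed(zone_to_nid(zone), gfp_mask))
                  continue;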
      Suggested-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      344736f2
  5. 21 Oct, 2014 1 commit
    • OOM, PM: OOM killed task shouldn't escape PM suspend · 5695be14
      Michal Hocko authored
      The PM freezer relies on having all tasks frozen by the time devices are
      getting frozen, so that no task will touch them while they are being
      frozen.  But the OOM killer is allowed to kill an already frozen task in
      order to handle an OOM situation.  To protect against late wake-ups, the
      OOM killer is disabled after all tasks are frozen.  This, however, still
      leaves a window open in which a killed task didn't manage to die by the
      time freeze_processes() finishes.
      
      Reduce the race window by checking all tasks after the OOM killer has
      been disabled.  This is unfortunately still not completely race free,
      because oom_killer_disable cannot stop an already ongoing OOM kill, so a
      task might still wake up from the fridge and get killed without
      freeze_processes() noticing.  Full synchronization of OOM and the
      freezer is, however, too heavyweight for this highly unlikely case.
      
      Introduce and check an oom_kills counter, which gets incremented early
      when the allocator enters the __alloc_pages_may_oom path, and only check
      all the tasks if the counter changes during the freezing attempt.  The
      counter is updated that early to reduce the race window, since the
      allocator checks oom_killer_disabled, which is set by the PM-freezing
      code.  A false positive will push the PM freezer into a slow path but
      that is not a big deal.
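      A hedged sketch of the counter and the freezer-side check described
      above (simplified; names such as check_frozen_processes() come from this
      changelog, and the real code spans mm/oom_kill.c, mm/page_alloc.c and
      kernel/power/process.c):

          static atomic_t oom_kills = ATOMIC_INIT(0);

          /* bumped early on the __alloc_pages_may_oom() path */
          void note_oom_kill(void)
          {
                  atomic_inc(&oom_kills);
          }

          int oom_kills_count(void)
          {
                  return atomic_read(&oom_kills);
          }

          /* freeze_processes(): only rescan tasks if an OOM kill raced in */
          int saved_oom_kills = oom_kills_count();

          error = try_to_freeze_tasks(true);
          if (!error && oom_kills_count() != saved_oom_kills &&
              !check_frozen_processes())
                  error = -EBUSY;   /* a task was OOM-killed during freezing */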
      
      Changes since v1
      - push the re-check loop out of freeze_processes into
        check_frozen_processes and invert the condition to make the code more
        readable as per Rafael
      
      Fixes: f660daac ("oom: thaw threads if oom killed thread is frozen before deferring")
      Cc: 3.2+ <stable@vger.kernel.org> # 3.2+
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      5695be14
  6. 09 Oct, 2014 13 commits
    • mm: move debug code out of page_alloc.c · 82742a3a
      Sasha Levin authored
      
      
      dump_page() and dump_vma() are not specific to page_alloc.c, move them out
      so page_alloc.c won't turn into the unofficial debug repository.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82742a3a
    • mm: page_alloc: default node-ordering on 64-bit NUMA, zone-ordering on 32-bit · 3193913c
      Mel Gorman authored
      
      
      Zones are allocated by the page allocator in either node or zone order.
      Node ordering is preferred in terms of locality and is applied
      automatically in one of three cases:
      
        1. If a node has only low memory
      
        2. If DMA/DMA32 is a high percentage of memory
      
        3. If low memory on a single node is greater than 70% of the node size
      
      Otherwise zone ordering is used to preserve low memory for devices that
      require it.  Unfortunately a consequence of this is that applications
      running on a machine with balanced NUMA nodes will experience different
      performance characteristics depending on which node they happen to start
      from.
      
      The point of zone ordering is to protect lower zones for devices that
      require DMA/DMA32 memory.  When NUMA was first introduced, this was
      critical as 32-bit NUMA machines existed and exhausting low memory
      triggered OOMs easily as so many allocations required low memory.  On
      64-bit machines the primary concern is devices that are 32-bit only which
      is less severe than the low memory exhaustion problem on 32-bit NUMA.  It
      seems there are really few devices that depend on it.
      
      AGP -- I assume this is getting more rare but even then I think the allocations
      	happen early in boot time where lowmem pressure is less of a problem
      
      DRM -- If the device is 32-bit only then there may be low pressure. I didn't
      	evaluate these in detail but it looks like some of these are mobile
      	graphics card. Not many NUMA laptops out there. DRM folk should know
      	better though.
      
      Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?
      
      B43 wireless card -- again not really a NUMA thing.
      
      I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
      machines in case someone throws a brain-damaged TV or graphics card in there.
      This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
      to make it default everywhere but I understand that some embedded arches may
      be using 32-bit NUMA where I cannot predict the consequences.
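      In effect the default selection collapses to something like this sketch
      (the real logic is default_zonelist_order() in mm/page_alloc.c; details
      may differ):

          static int default_zonelist_order(void)
          {
          #ifdef CONFIG_64BIT
                  /* 64-bit: prefer locality; 32-bit-only devices are the rare case */
                  return ZONELIST_ORDER_NODE;
          #else
                  /* 32-bit: protect lowmem from node-local exhaustion */
                  return ZONELIST_ORDER_ZONE;
          #endif
          }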
      
      The performance impact depends on the workload and the characteristics of
      the machine; the machine I tested on had a large Normal zone on node 0, so
      the impact is within the noise for the majority of tests.  The allocation
      stats show more allocation requests were from DMA32 and the local node.
      Running SpecJBB with multiple JVMs and automatic NUMA balancing disabled,
      the results were
      
      specjbb
                           3.17.0-rc2            3.17.0-rc2
                              vanilla        nodeorder-v1r1
      Min    1      29534.00 (  0.00%)     30020.00 (  1.65%)
      Min    10    115717.00 (  0.00%)    134038.00 ( 15.83%)
      Min    19    109718.00 (  0.00%)    114186.00 (  4.07%)
      Min    28    104459.00 (  0.00%)    103639.00 ( -0.78%)
      Min    37     98245.00 (  0.00%)    103756.00 (  5.61%)
      Min    46     97198.00 (  0.00%)     96197.00 ( -1.03%)
      Mean   1      30953.25 (  0.00%)     31917.75 (  3.12%)
      Mean   10    124432.50 (  0.00%)    140904.00 ( 13.24%)
      Mean   19    116033.50 (  0.00%)    119294.75 (  2.81%)
      Mean   28    108365.25 (  0.00%)    106879.50 ( -1.37%)
      Mean   37    102984.75 (  0.00%)    106924.25 (  3.83%)
      Mean   46    100783.25 (  0.00%)    105368.50 (  4.55%)
      Stddev 1       1260.38 (  0.00%)      1109.66 ( 11.96%)
      Stddev 10      7434.03 (  0.00%)      5171.91 ( 30.43%)
      Stddev 19      8453.84 (  0.00%)      5309.59 ( 37.19%)
      Stddev 28      4184.55 (  0.00%)      2906.63 ( 30.54%)
      Stddev 37      5409.49 (  0.00%)      3192.12 ( 40.99%)
      Stddev 46      4521.95 (  0.00%)      7392.52 (-63.48%)
      Max    1      32738.00 (  0.00%)     32719.00 ( -0.06%)
      Max    10    136039.00 (  0.00%)    148614.00 (  9.24%)
      Max    19    130566.00 (  0.00%)    127418.00 ( -2.41%)
      Max    28    115404.00 (  0.00%)    111254.00 ( -3.60%)
      Max    37    112118.00 (  0.00%)    111732.00 ( -0.34%)
      Max    46    108541.00 (  0.00%)    116849.00 (  7.65%)
      TPut   1     123813.00 (  0.00%)    127671.00 (  3.12%)
      TPut   10    497730.00 (  0.00%)    563616.00 ( 13.24%)
      TPut   19    464134.00 (  0.00%)    477179.00 (  2.81%)
      TPut   28    433461.00 (  0.00%)    427518.00 ( -1.37%)
      TPut   37    411939.00 (  0.00%)    427697.00 (  3.83%)
      TPut   46    403133.00 (  0.00%)    421474.00 (  4.55%)
      
                                  3.17.0-rc2      3.17.0-rc2
                                     vanilla  nodeorder-v1r1
      DMA allocs                           0           0
      DMA32 allocs                        57     1491992
      Normal allocs                 32543566    30026383
      Movable allocs                       0           0
      Direct pages scanned                 0           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed               0           0
      Kswapd efficiency                 100%        100%
      Kswapd velocity                  0.000       0.000
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       0.000
      Percentage direct scans             0%          0%
      Zone normal velocity             0.000       0.000
      Zone dma32 velocity              0.000       0.000
      Zone dma velocity                0.000       0.000
      THP fault alloc                  55164       52987
      THP collapse alloc                 139         147
      THP splits                          26          21
      NUMA alloc hit                 4169066     4250692
      NUMA alloc miss                      0           0
      
      Note that there were more DMA32 allocations with the patch applied.  In this
      particular case there was no difference in numa_hit and numa_miss. The
      expectation is that DMA32 was being used at the low watermark instead of
      falling into the slow path.  kswapd was not woken, but it is not woken for
      THP allocations anyway.
      
      On 32-bit, this patch defaults to zone-ordering as low memory depletion
      can be a serious problem on 32-bit large memory machines.  If the default
      ordering were node, then processes on node 0 would deplete the Normal zone
      due to normal activity.  The problem is worse if CONFIG_HIGHPTE is not
      set. If combined with large amounts of dirty/writeback pages in Normal
      zone then there is also a high risk of OOM. The heuristics are removed
      as it's not clear they were ever important on 32-bit. They were only
      relevant for setting node-ordering on 64-bit.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3193913c
    • mm: page_alloc: Make paranoid check in move_freepages a VM_BUG_ON · 97ee4ba7
      Mel Gorman authored
      
      
      Since 2.6.24 there has been a paranoid check in move_freepages that looks
      up the zone of two pages.  This is a very slow path and the only time I've
      seen this bug trigger recently is when memory initialisation was broken
      during patch development.  Despite the fact it's a slow path, this patch
      converts the check to a VM_BUG_ON anyway as it has served its purpose by
      now.
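      The check itself is tiny; as a sketch (the real code is in
      move_freepages() in mm/page_alloc.c):

          /* compiled out on production kernels, loud under CONFIG_DEBUG_VM */
          VM_BUG_ON(page_zone(start_page) != page_zone(end_page));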
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97ee4ba7
    • mm: clean up zone flags · 57054651
      Johannes Weiner authored
      
      
      Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
      sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
      reader through layers of indirection just to track down a simple bit.
      
      Remove all zone flag wrappers and just use bitops against zone->flags
      directly.  It's just as readable and the lines are barely any longer.
      
      Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
      remove the zone_flags_t typedef.
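      The before/after, in rough sketch form (the actual flag definitions are
      in include/linux/mmzone.h):

          /* before: wrappers and an extra name for the same bit */
          zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);
          if (zone_is_reclaim_dirty(zone))
                  /* ... */;

          /* after: plain bitops on zone->flags, bit renamed to match ZONE_WRITEBACK */
          set_bit(ZONE_DIRTY, &zone->flags);
          if (test_bit(ZONE_DIRTY, &zone->flags))
                  /* ... */;
          clear_bit(ZONE_DIRTY, &zone->flags);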
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57054651
    • mm: page_alloc: avoid wakeup kswapd on the unintended node · 7ade3c99
      Weijie Yang authored
      
      
      When entering the page_alloc slowpath, we wake up kswapd on every pgdat
      according to the zonelist and high_zoneidx.  However, this doesn't take
      the nodemask into account, and could prematurely wake up kswapd on some
      unintended nodes.
      
      This patch uses for_each_zone_zonelist_nodemask() instead of
      for_each_zone_zonelist() in wake_all_kswapds() to avoid the above
      situation.
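      A sketch of the changed loop in wake_all_kswapds() (simplified; argument
      names follow the allocator slowpath of that era):

          struct zoneref *z;
          struct zone *zone;

          /* nodemask-aware walk: skip zones on nodes the allocation excludes */
          for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                          high_zoneidx, nodemask)
                  wakeup_kswapd(zone, order, zone_idx(preferred_zone));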
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ade3c99
    • mm: introduce dump_vma · 0bf55139
      Sasha Levin authored
      
      
      Introduce a helper to dump information about a VMA; this also makes
      dump_page_flags more generic and re-uses it, so the output looks very
      similar to dump_page:
      
      [   61.903437] vma ffff88070f88be00 start 00007fff25970000 end 00007fff25992000
      [   61.903437] next ffff88070facd600 prev ffff88070face400 mm ffff88070fade000
      [   61.903437] prot 8000000000000025 anon_vma ffff88070fa1e200 vm_ops           (null)
      [   61.903437] pgoff 7ffffffdd file           (null) private_data           (null)
      [   61.909129] flags: 0x100173(read|write|mayread|maywrite|mayexec|growsdown|account)
      
      [akpm@linux-foundation.org: make dump_vma() require CONFIG_DEBUG_VM]
      [swarren@nvidia.com: fix dump_vma() compilation]
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Stephen Warren <swarren@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0bf55139
    • mm: rename allocflags_to_migratetype for clarity · 43e7a34d
      David Rientjes authored
      
      
      The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
      ALLOC_CPUSET) that have separate semantics.
      
      The function allocflags_to_migratetype() actually takes gfp flags, not
      alloc flags, and returns a migratetype.  Rename it to
      gfpflags_to_migratetype().
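      For reference, the renamed helper roughly looks like this (a sketch of
      the gfp-to-migratetype mapping; the authoritative definition is in
      include/linux/gfp.h):

          /* Convert GFP flags (not ALLOC_* flags) to their migratetype */
          static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
          {
                  if (unlikely(page_group_by_mobility_disabled))
                          return MIGRATE_UNMOVABLE;

                  /* Group based on mobility */
                  return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
                          ((gfp_flags & __GFP_RECLAIMABLE) != 0);
          }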
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43e7a34d
    • mm, compaction: khugepaged should not give up due to need_resched() · 1f9efdef
      Vlastimil Babka authored
      
      
      Async compaction aborts when it detects zone lock contention or
      need_resched() is true.  David Rientjes has reported that in practice,
      most direct async compactions for THP allocation abort due to
      need_resched().  This means that a second direct compaction is never
      attempted, which might be OK for a page fault, but khugepaged is intended
      to attempt a sync compaction in such a case, and in these cases it won't.
      
      This patch replaces "bool contended" in compact_control with an int that
      distinguishes between aborting due to need_resched() and aborting due to
      lock contention.  This allows propagating the abort through all compaction
      functions as before, but passing the abort reason up to
      __alloc_pages_slowpath() which decides when to continue with direct
      reclaim and another compaction attempt.
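      A hedged sketch of that distinction (names follow the patch description;
      the real definitions are in mm/internal.h):

          /* why async compaction backed off, propagated up to the allocator */
          enum compact_contended {
                  COMPACT_CONTENDED_NONE = 0,
                  COMPACT_CONTENDED_LOCK,   /* zone lock or lru_lock was contended */
                  COMPACT_CONTENDED_SCHED,  /* need_resched() or fatal signal pending */
          };

          struct compact_control {
                  /* ... */
                  enum compact_contended contended;  /* was: bool contended */
          };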
      
      Another problem is that try_to_compact_pages() did not act upon the
      reported contention (either need_resched() or lock contention)
      immediately and would proceed with another zone from the zonelist.  When
      need_resched() is true, that means initializing another zone compaction,
      only to check need_resched() again in isolate_migratepages() and abort.
      For zone lock contention, the unintended consequence is that the lock
      contended status reported back to the allocator is determined from the
      last zone where compaction was attempted, which is rather arbitrary.
      
      This patch fixes the problem in the following way:
      - async compaction of a zone aborting due to need_resched() or fatal signal
        pending means that further zones should not be tried. We report
        COMPACT_CONTENDED_SCHED to the allocator.
      - aborting zone compaction due to lock contention means we can still try
        another zone, since it has a different set of locks. We report back
        COMPACT_CONTENDED_LOCK only if compaction was aborted due to lock
        contention in *all* zones where it was attempted.
      
      As a result of these fixes, khugepaged will proceed with second sync
      compaction as intended, when the preceding async compaction aborted due to
      need_resched().  Page fault compactions aborting due to need_resched()
      will spare some cycles previously wasted by initializing another zone
      compaction only to abort again.  Lock contention will be reported only
      when compaction in all zones aborted due to lock contention, and therefore
      it's not a good idea to try again after reclaim.
      
      In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
      has improved number of THP collapse allocations by 10%, which shows
      positive effect on khugepaged.  The benchmark's success rates are
      unchanged as it is not recognized as khugepaged.  Numbers of compact_stall
      and compact_fail events have however decreased by 20%, with
      compact_success still a bit improved, which is good.  With benchmark
      configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
      collapse allocations, and only slight improvement in stalls and failures.
      
      [akpm@linux-foundation.org: fix warnings]
      Reported-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f9efdef
    • mm, compaction: move pageblock checks up from isolate_migratepages_range() · edc2ca61
      Vlastimil Babka authored
      
      
      isolate_migratepages_range() is the main function of the compaction
      scanner, called either on a single pageblock by isolate_migratepages()
      during regular compaction, or on an arbitrary range by CMA's
      __alloc_contig_migrate_range().  It currently performs two pageblock-wide
      compaction suitability checks, and because of the CMA callpath, it tracks
      whether it crossed a pageblock boundary in order to repeat those checks.
      
      However, closer inspection shows that those checks are always true for CMA:
      - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
      - migrate_async_suitable() check is skipped because CMA uses sync compaction
      
      We can therefore move the compaction-specific checks to
      isolate_migratepages() and simplify isolate_migratepages_range().
      Furthermore, we can mimic the freepage scanner family of functions, which
      has isolate_freepages_block() function called both by compaction from
      isolate_freepages() and by CMA from isolate_freepages_range(), where each
      use-case adds its own specific glue code.  This allows further code
      simplification.
      
      Thus, we rename isolate_migratepages_range() to
      isolate_migratepages_block() and limit its functionality to a single
      pageblock (or its subset).  For CMA, a new different
      isolate_migratepages_range() is created as a CMA-specific wrapper for the
      _block() function.  The checks specific to compaction are moved to
      isolate_migratepages().  As part of the unification of these two families
      of functions, we remove the redundant zone parameter where applicable,
      since the zone pointer is already passed in cc->zone.
      
      Furthermore, going back to compact_zone() and compact_finished() when
      a pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
      - the checks are meant to skip pageblocks quickly.  The patch therefore
      also introduces a simple loop into isolate_migratepages() so that it does
      not return immediately on failed pageblock checks, but keeps going until
      isolate_migratepages_block() gets called once.  Similarly to
      isolate_freepages(), the function periodically checks if it needs to
      reschedule or abort async compaction.
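
      A simplified sketch of that loop (helper and local variable names are
      abbreviated, not the verbatim kernel code):

        /* New isolate_migratepages(): iterate pageblocks, apply the
         * compaction-only checks here, and stop once a single call to
         * isolate_migratepages_block() has been made. */
        for (; low_pfn < zone_end_pfn; low_pfn = block_end_pfn) {
                block_end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);

                /* Periodically reschedule or abort async compaction,
                 * as isolate_freepages() already does. */
                cond_resched();

                page = pageblock_pfn_to_page(low_pfn, block_end_pfn, zone);
                if (!page)
                        continue;

                /* Checks moved up from the old isolate_migratepages_range() */
                if (!isolation_suitable(cc, page))
                        continue;
                if (cc->mode == MIGRATE_ASYNC &&
                    !migrate_async_suitable(get_pageblock_migratetype(page)))
                        continue;

                /* Isolate from this single pageblock (or a subset of it) */
                low_pfn = isolate_migratepages_block(cc, low_pfn, block_end_pfn,
                                                     isolate_mode);
                break;
        }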
      
      [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      edc2ca61
    • Vlastimil Babka's avatar
      mm, compaction: do not count compact_stall if all zones skipped compaction · 98dd3b48
      Vlastimil Babka authored
      
      
      The compact_stall vmstat counter counts the number of allocations stalled
      by direct compaction.  It does not count when all attempted zones had
      deferred compaction, but it does count when all zones skipped compaction.
      The skipping is decided based on a very early check of
      compaction_suitable(), which looks at watermarks and memory fragmentation.
      Therefore it makes sense not to count skipped compactions as stalls.
      Moreover, compact_success and compact_fail are already not counted when
      compaction was skipped, so this patch changes the compact_stall counting
      to match the other two.
      
      Additionally, restructure __alloc_pages_direct_compact() code for better
      readability.
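
      A rough sketch of the resulting counting in __alloc_pages_direct_compact()
      (argument lists abridged; the exact restructuring may differ):

        compact_result = try_to_compact_pages(zonelist, order, gfp_mask,
                                              nodemask, mode, contended);
        switch (compact_result) {
        case COMPACT_DEFERRED:
                *deferred_compaction = true;
                /* fall through */
        case COMPACT_SKIPPED:
                /* no zone even attempted compaction: not a stall */
                return NULL;
        default:
                break;
        }

        /* At least one zone really ran compaction, so count the stall. */
        count_vm_event(COMPACTSTALL);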
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      98dd3b48
    • Vlastimil Babka's avatar
      mm, compaction: defer each zone individually instead of preferred zone · 53853e2d
      Vlastimil Babka authored
      When direct sync compaction is often unsuccessful, it may become deferred
      for some time to avoid further useless attempts, both sync and async.
      Successful high-order allocations un-defer compaction, while further
      unsuccessful compaction attempts prolong the compaction deferred period.
      
      Currently the checking and setting deferred status is performed only on
      the preferred zone of the allocation that invoked direct compaction.  But
      compaction itself is attempted on all eligible zones in the zonelist, so
      the behavior is suboptimal and may lead to scenarios where 1) compaction
      is attempted uselessly, or 2) it is not attempted despite good chances of
      succeeding, as shown in the examples below:
      
      1) A direct compaction with Normal preferred zone failed and set
         deferred compaction for the Normal zone.  Another unrelated direct
         compaction with DMA32 as preferred zone will attempt to compact DMA32
         zone even though the first compaction attempt also included DMA32 zone.
      
         In another scenario, compaction with Normal preferred zone failed to
         compact Normal zone, but succeeded in the DMA32 zone, so it will not
         defer compaction.  In the next attempt, it will try Normal zone which
         will fail again, instead of skipping Normal zone and trying DMA32
         directly.
      
      2) Kswapd will balance DMA32 zone and reset defer status based on
         watermarks looking good.  A direct compaction with preferred Normal
         zone will skip compaction of all zones including DMA32 because Normal
         was still deferred.  The allocation might have succeeded in DMA32, but
         won't.
      
      This patch makes compaction deferring work on an individual zone basis
      instead of the preferred zone.  For each zone, it checks compaction_deferred()
      to decide if the zone should be skipped.  If watermarks fail after
      compacting the zone, defer_compaction() is called.  The zone where
      watermarks passed can still be deferred when the allocation attempt is
      unsuccessful.  When allocation is successful, compaction_defer_reset() is
      called for the zone containing the allocated page.  This approach should
      approximate calling defer_compaction() only on zones where compaction was
      attempted and did not yield an allocated page.  There might be corner
      cases, but that is inevitable as long as the decision to stop compacting
      does not guarantee that a page will be allocated.
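
      A condensed sketch of the per-zone logic (the watermark helper below is a
      placeholder for whatever check the patch actually uses):

        /* In try_to_compact_pages(): consult and set deferral per zone. */
        for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
                if (compaction_deferred(zone, order))
                        continue;                       /* skip this zone */

                status = compact_zone_order(zone, order, gfp_mask, mode,
                                            &contended);

                /* Compacted, but watermarks still fail: defer this zone. */
                if (watermarks_still_fail(zone, order))   /* placeholder name */
                        defer_compaction(zone, order);
        }

        /* In the allocator, once a page was actually allocated: */
        if (page)
                compaction_defer_reset(page_zone(page), order, true);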
      
      Due to a new COMPACT_DEFERRED return value, some functions relying
      implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
      more accurate.  The did_some_progress output parameter of
      __alloc_pages_direct_compact() is removed completely, as the caller
      actually does not use it after compaction sets it - it is only considered
      when direct reclaim sets it.
      
      During testing on a two-node machine with a single very small Normal zone
      on node 1, this patch has improved success rates in the stress-highalloc
      mmtests benchmark.  The success rates here were previously made worse by
      commit 3a025760 ("mm: page_alloc: spill to remote nodes before waking
      kswapd"), as kswapd was no longer resetting the deferred compaction for
      the Normal zone often enough, and DMA32 zones on both nodes were thus not
      considered for compaction.  On a different machine, success rates were
      improved with __GFP_NO_KSWAPD allocations.
      
      [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      53853e2d
    • Vlastimil Babka's avatar
      mm: page_alloc: determine migratetype only once · 21bb9bd1
      Vlastimil Babka authored
      
      
      The check for ALLOC_CMA in __alloc_pages_nodemask() derives migratetype
      from gfp_mask in each retry pass, although the migratetype variable
      already has the value determined and it does not change.  Use the variable
      and perform the check only once.  Also convert #ifdef CONFIG_CMA to
      IS_ENABLED.
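
      A short sketch of the intended shape (the helper for deriving the
      migratetype is shown as gfpflags_to_migratetype(); the exact name may
      differ in that kernel):

        /* Derive the migratetype from gfp_mask once, before the retry loop... */
        int migratetype = gfpflags_to_migratetype(gfp_mask);
        int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;

        /* ...and do the CMA check once, with IS_ENABLED() rather than
         * #ifdef CONFIG_CMA, so the code is always compile-tested. */
        if (IS_ENABLED(CONFIG_CMA) && migratetype == MIGRATE_MOVABLE)
                alloc_flags |= ALLOC_CMA;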
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      21bb9bd1
    • Joonsoo Kim's avatar
      topology: add support for node_to_mem_node() to determine the fallback node · ad2c8144
      Joonsoo Kim authored
      Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that
      on ppc LPARs with memoryless nodes, a large amount of memory was consumed
      by slabs and was marked unreclaimable.  He tracked it down to slab
      deactivations in the SLUB core when we allocate remotely, leading to poor
      efficiency whenever memoryless nodes are present.
      
      After much discussion, Joonsoo provided a few patches that help
      significantly.  They don't resolve the problem altogether:
      
       - memory hotplug still needs testing, that is, when a memoryless node
         becomes memory-ful, we want to do the right thing
       - there are other reasons for going off-node than memoryless nodes,
         e.g., fully exhausted local nodes
      
      Neither case is resolved with this series, but I don't think that should
      block their acceptance, as they can be explored/resolved with follow-on
      patches.
      
      The series consists of:
      
      [1/3] topology: add support for node_to_mem_node() to determine the
            fallback node
      
      [2/3] slub: fallback to node_to_mem_node() node if allocating on
            memoryless node
      
            - Joonsoo's patches to cache the nearest node with memory for each
              NUMA node
      
      [3/3] Partial revert of 81c98869 ("kthread: ensure locality of
            task_struct allocations")
      
       - At Tejun's request, keep the knowledge of memoryless node fallback
         to the allocator core.
      
      This patch (of 3):
      
      We need to determine the fallback node in the slub allocator if the
      allocation target node is a memoryless node.  Without it, SLUB wrongly
      selects a node which has no memory and cannot use a partial slab, because
      of the node mismatch.  The introduced function, node_to_mem_node(X),
      returns a node Y with memory that is at the nearest distance.  If X is a
      memoryless node, it returns the nearest node with memory; if X is a
      normal node, it returns X itself.
      
      We will use this function in the following patch to determine the
      fallback node, roughly as sketched below.
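
      A minimal sketch of how the follow-up SLUB patch is expected to pick the
      node to search (surrounding code abridged; exact call sites may differ):

        /* When picking a node to search for a partial slab: */
        int searchnode = node;

        if (node == NUMA_NO_NODE)
                searchnode = numa_mem_id();
        else if (!node_present_pages(node))
                /* memoryless node: fall back to its nearest node with memory */
                searchnode = node_to_mem_node(node);

        object = get_partial_node(s, get_node(s, searchnode), c, flags);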
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ad2c8144
  7. 02 Oct, 2014 1 commit
  8. 16 Sep, 2014 1 commit
    • Luiz Capitulino's avatar
      x86/mm/numa: Drop dead code and rename setup_node_data() to setup_alloc_data() · 8b375f64
      Luiz Capitulino authored
      
      
      The setup_node_data() function allocates a pg_data_t object,
      inserts it into the node_data[] array and initializes the
      following fields: node_id, node_start_pfn and
      node_spanned_pages.
      
      However, a few function calls later during the kernel boot,
      free_area_init_node() re-initializes those fields, possibly with a
      different value.  This means that the initialization done by
      setup_node_data() is not used.
      
      This causes a small glitch when running Linux as a hyperv numa
      guest:
      
        SRAT: PXM 0 -> APIC 0x00 -> Node 0
        SRAT: PXM 0 -> APIC 0x01 -> Node 0
        SRAT: PXM 1 -> APIC 0x02 -> Node 1
        SRAT: PXM 1 -> APIC 0x03 -> Node 1
        SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
        SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
        SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
        NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
        Initmem setup node 0 [mem 0x00000000-0x7fffffff]
          NODE_DATA [mem 0x7ffdc000-0x7ffeffff]
        Initmem setup node 1 [mem 0x80800000-0x1081fffff]
          NODE_DATA [mem 0x1081ea000-0x1081fdfff]
        crashkernel: memory value expected
         [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
         [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
        Zone ranges:
          DMA      [mem 0x00001000-0x00ffffff]
          DMA32    [mem 0x01000000-0xffffffff]
          Normal   [mem 0x100000000-0x1081fffff]
        Movable zone start for each node
        Early memory node ranges
          node   0: [mem 0x00001000-0x0009efff]
          node   0: [mem 0x00100000-0x7ffeffff]
          node   1: [mem 0x80200000-0xf7ffffff]
          node   1: [mem 0x100000000-0x1081fffff]
        On node 0 totalpages: 524174
          DMA zone: 64 pages used for memmap
          DMA zone: 21 pages reserved
          DMA zone: 3998 pages, LIFO batch:0
          DMA32 zone: 8128 pages used for memmap
          DMA32 zone: 520176 pages, LIFO batch:31
        On node 1 totalpages: 524288
          DMA32 zone: 7672 pages used for memmap
          DMA32 zone: 491008 pages, LIFO batch:31
          Normal zone: 520 pages used for memmap
          Normal zone: 33280 pages, LIFO batch:7
      
      In this dmesg, the SRAT table reports that the memory range for
      node 1 starts at 0x80200000.  However, the line starting with
      "Initmem" reports that node 1 memory range starts at 0x80800000.
       The "Initmem" line is reported by setup_node_data() and is
      wrong, because the kernel ends up using the range as reported in
      the SRAT table.
      
      This commit drops all that dead code from setup_node_data(),
      renames it to alloc_node_data() and adds a printk() to
      free_area_init_node() so that we report a node's memory range
      accurately.
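
      A sketch of the kind of printk() this adds (the exact format string is an
      assumption, inferred from the dmesg below):

        /* In free_area_init_node(): report the range the kernel will really
         * use, instead of the stale range printed by setup_node_data(). */
        pr_info("Initmem setup node %d [mem %#010Lx-%#010Lx]\n", nid,
                (u64)start_pfn << PAGE_SHIFT,
                ((u64)end_pfn << PAGE_SHIFT) - 1);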
      
      Here's the same dmesg section with this patch applied:
      
         SRAT: PXM 0 -> APIC 0x00 -> Node 0
         SRAT: PXM 0 -> APIC 0x01 -> Node 0
         SRAT: PXM 1 -> APIC 0x02 -> Node 1
         SRAT: PXM 1 -> APIC 0x03 -> Node 1
         SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
         SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
         SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
         NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
         NODE_DATA(0) allocated [mem 0x7ffdc000-0x7ffeffff]
         NODE_DATA(1) allocated [mem 0x1081ea000-0x1081fdfff]
         crashkernel: memory value expected
          [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
          [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
         Zone ranges:
           DMA      [mem 0x00001000-0x00ffffff]
           DMA32    [mem 0x01000000-0xffffffff]
           Normal   [mem 0x100000000-0x1081fffff]
         Movable zone start for each node
         Early memory node ranges
           node   0: [mem 0x00001000-0x0009efff]
           node   0: [mem 0x00100000-0x7ffeffff]
           node   1: [mem 0x80200000-0xf7ffffff]
           node   1: [mem 0x100000000-0x1081fffff]
         Initmem setup node 0 [mem 0x00001000-0x7ffeffff]
         On node 0 totalpages: 524174
           DMA zone: 64 pages used for memmap
           DMA zone: 21 pages reserved
           DMA zone: 3998 pages, LIFO batch:0
           DMA32 zone: 8128 pages used for memmap
           DMA32 zone: 520176 pages, LIFO batch:31
         Initmem setup node 1 [mem 0x80200000-0x1081fffff]
         On node 1 totalpages: 524288
           DMA32 zone: 7672 pages used for memmap
           DMA32 zone: 491008 pages, LIFO batch:31
           Normal zone: 520 pages used for memmap
           Normal zone: 33280 pages, LIFO batch:7
      
      This commit was tested on a two-node bare-metal NUMA machine and
      Linux as a numa guest on hyperv and qemu/kvm.
      
      PS: The wrong memory range reported by setup_node_data() seems to be
          harmless in the current kernel because it's just not used.  However,
          that bad range is used in kernel 2.6.32 to initialize the old boot
          memory allocator, which causes a crash during boot.
      Signed-off-by: default avatarLuiz Capitulino <lcapitulino@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8b375f64
  9. 06 Aug, 2014 6 commits
    • David Rientjes's avatar
      mm, thp: restructure thp avoidance of light synchronous migration · 8fe78048
      David Rientjes authored
      
      
      __GFP_NO_KSWAPD, once the way to determine if an allocation was for thp
      or not, has gained more users.  Their use is not necessarily wrong; they
      are trying to do a memory allocation that can easily fail without
      disturbing kswapd, so the bit has gained additional use cases.
      
      This restructures the check to determine whether MIGRATE_SYNC_LIGHT
      should be used for memory compaction in the page allocator.  Rather than
      testing solely for __GFP_NO_KSWAPD, test for all bits that must be set
      for thp allocations.
      
      This also moves the check to be done only after the page allocator is
      aborted for deferred or contended memory compaction since setting
      migration_mode for this case is pointless.
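
      Concretely, the new check is along these lines (a sketch, shown only to
      illustrate the "all bits" test):

        /* Fall back to light sync migration only for non-THP allocations,
         * i.e. when not every bit of GFP_TRANSHUGE is set in gfp_mask.
         * This runs only after async compaction was deferred or contended. */
        if ((gfp_mask & GFP_TRANSHUGE) != GFP_TRANSHUGE)
                migration_mode = MIGRATE_SYNC_LIGHT;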
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8fe78048
    • David Rientjes's avatar
      mm, oom: rename zonelist locking functions · e972a070
      David Rientjes authored
      
      
      try_set_zonelist_oom() and clear_zonelist_oom() are not named properly
      to imply that they require locking semantics to avoid out_of_memory()
      being reordered.
      
      zone_scan_lock is required for both functions to ensure that there is
      proper locking synchronization.
      
      Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
      clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper
      locking semantics.
      
      At the same time, convert oom_zonelist_trylock() to return bool instead
      of int since only success and failure are tested.
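
      The intended calling convention after the rename looks roughly like this
      (a sketch of the caller, not a new API):

        if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
                /* Another OOM kill is already in flight for these zones;
                 * back off instead of racing with it. */
                schedule_timeout_uninterruptible(1);
                return NULL;
        }

        out_of_memory(zonelist, gfp_mask, order, nodemask, false);
        oom_zonelist_unlock(zonelist, gfp_mask);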
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e972a070
    • Mel Gorman's avatar
      mm: page_alloc: reduce cost of the fair zone allocation policy · 4ffeaf35
      Mel Gorman authored
      
      
      The fair zone allocation policy round-robins allocations between zones
      within a node to avoid age inversion problems during reclaim.  If the
      first allocation fails, the batch counts are reset and a second attempt is
      made before entering the slow path.
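
      In outline, the policy works like this (simplified; identifiers such as
      ALLOC_FAIR, NR_ALLOC_BATCH and reset_alloc_batches() are from that era of
      the allocator, but the call shapes are abridged):

        /* Fast path: skip zones whose fairness batch is exhausted. */
        for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
                if ((alloc_flags & ALLOC_FAIR) &&
                    zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
                        continue;
                page = buffered_rmqueue(preferred_zone, zone, order,
                                        gfp_mask, migratetype);
                if (page)
                        return page;
        }

        /* First pass failed: refill the batches and retry the zonelist once
         * without the fairness constraint before entering the slowpath. */
        if (alloc_flags & ALLOC_FAIR) {
                reset_alloc_batches(preferred_zone);
                alloc_flags &= ~ALLOC_FAIR;
                goto retry;
        }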
      
      One assumption made with this scheme is that batches expire at roughly
      the same time and the resets each time are justified.  This assumption
      does not hold when zones reach their low watermark as the batches will
      be consumed at uneven rates.  Allocation failures due to watermark
      depletion result in additional zonelist scans for the reset and another
      watermark check before hitting the slowpath.
      
      On UMA, the benefit is negligible -- around 0.25%.  On a 4-socket NUMA
      machine it's variable due to the variability of measuring overhead with
      the vmstat changes.  The system CPU overhead comparison looks like
      
                3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
                   vanilla   vmstat-v5 lowercost-v5
      User          746.94      774.56      802.00
      System      65336.22    32847.27    40852.33
      Elapsed     27553.52    27415.04    27368.46
      
      However it is worth noting that the overall benchmark still completed
      faster and intuitively it makes sense to take as few passes as possible
      through the zonelists.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ffeaf35
    • Mel Gorman's avatar
      mm: page_alloc: abort fair zone allocation policy when remote nodes are encountered · f7b5d647
      Mel Gorman authored
      
      
      The purpose of numa_zonelist_order=zone is to preserve lower zones for
      use with 32-bit devices.  If locality is preferred then the
      numa_zonelist_order=node policy should be used.
      
      Unfortunately, the fair zone allocation policy overrides this by
      skipping zones on remote nodes until the lower one is found.  While this
      makes sense from a page aging and performance perspective, it breaks the
      expected zonelist policy.  This patch restores the expected behaviour
      for zone-list ordering.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7b5d647
    • Mel Gorman's avatar
      mm: move zone->pages_scanned into a vmstat counter · 0d5d823a
      Mel Gorman authored
      
      
      zone->pages_scanned sits on a write-intensive cache line during page
      reclaim and is also updated during page free.  Move the counter into vmstat to
      take advantage of the per-cpu updates and do not update it in the free
      paths unless necessary.
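
      The shape of the change, roughly (NR_PAGES_SCANNED is the vmstat item
      this patch introduces, to the best of my understanding):

        /* Reclaim path: account scanned pages through the per-cpu vmstat
         * machinery instead of writing zone->pages_scanned directly. */
        __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);

        /* Free path: only touch the counter when there is something to clear. */
        nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
        if (nr_scanned)
                __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);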
      
      On a small UMA machine running tiobench the difference is marginal.  On
      a 4-node machine the overhead is more noticeable.  Note that automatic
      NUMA balancing was disabled for this test as otherwise the system CPU
      overhead is unpredictable.
      
                3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
                   vanillarearrange-v5   vmstat-v5
      User          746.94      759.78      774.56
      System      65336.22    58350.98    32847.27
      Elapsed     27553.52    27282.02    27415.04
      
      Note that the overhead reduction will vary depending on where exactly
      pages are allocated and freed.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0d5d823a
    • Mel Gorman's avatar
      mm: rearrange zone fields into read-only, page alloc, statistics and page reclaim lines · 3484b2de
      Mel Gorman authored
      
      
      The arrangement of struct zone has changed over time and now it has
      reached the point where there is some inappropriate sharing going on.
      On x86-64 for example
      
      o The zone->node field is shared with the zone lock and zone->node is
        accessed frequently from the page allocator due to the fair zone
        allocation policy.
      
      o span_seqlock is almost never used but shares a line with free_area
      
      o Some zone statistics share a cache line with the LRU lock so
        reclaim-intensive and allocator-intensive workloads can bounce the cache
        line on a stat update
      
      This patch rearranges struct zone to put read-only and read-mostly
      fields together and then splits the page allocator intensive fields, the
      zone statistics and the page reclaim intensive fields into their own
      cache lines.  Note that the type of lowmem_reserve changes due to the
      watermark calculations being signed and avoiding a signed/unsigned
      conversion there.
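
      An abridged sketch of the resulting grouping (field lists trimmed;
      ZONE_PADDING() starts each group on its own cache line):

        struct zone {
                /* Read-mostly fields */
                unsigned long           watermark[NR_WMARK];
                long                    lowmem_reserve[MAX_NR_ZONES]; /* now signed */
                int                     node;
                struct pglist_data      *zone_pgdat;
                /* ... */

                ZONE_PADDING(_pad1_)
                /* Write-intensive fields used from the page allocator */
                spinlock_t              lock;
                struct free_area        free_area[MAX_ORDER];
                /* ... */

                ZONE_PADDING(_pad2_)
                /* Write-intensive fields used by page reclaim */
                spinlock_t              lru_lock;
                struct lruvec           lruvec;
                /* ... */

                ZONE_PADDING(_pad3_)
                /* Zone statistics */
                atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
        } ____cacheline_internodealigned_in_smp;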
      
      On the test configuration I used the overall size of struct zone shrunk
      by one cache line.  On smaller machines, this is not likely to be
      noticeable.  However, on a 4-node NUMA machine running tiobench the
      system CPU overhead is reduced by this patch.
      
                3.16.0-rc3  3.16.0-rc3
                   vanillarearrange-v5r9
      User          746.94      759.78
      System      65336.22    58350.98
      Elapsed     27553.52    27282.02
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3484b2de