1. 06 Jun, 2014 1 commit
  2. 04 Jun, 2014 21 commits
  3. 07 Apr, 2014 5 commits
    • John Hubbard's avatar
      mm/page_alloc.c: change mm debug routines back to EXPORT_SYMBOL · ed12d845
      John Hubbard authored
      A new dump_page() routine was recently added, and marked
      EXPORT_SYMBOL_GPL.  dump_page() was also added to the VM_BUG_ON_PAGE()
      macro, and so the end result is that non-GPL code can no longer call
      get_page() and a few other routines.
      
      This only happens if the kernel was compiled with CONFIG_DEBUG_VM.
      
      Change dump_page() to be EXPORT_SYMBOL.
      
      Longer explanation:
      
      Prior to commit 309381fe ("mm: dump page when hitting a VM_BUG_ON
      using VM_BUG_ON_PAGE") , it was possible to build MIT-licensed (non-GPL)
      drivers on Fedora.  Fedora is semi-unique, in that it sets
      CONFIG_VM_DEBUG.
      
      Because Fedora sets CONFIG_VM_DEBUG, they end up pulling in dump_page(),
      via VM_BUG_ON_PAGE, via get_page().  As one of the authors of NVIDIA's
      new, open source, "UVM-Lite" kernel module, I originally choose to use
      the kernel's get_page() routine from within nvidia_uvm_page_cache.c,
      because get_page() has always seemed to be very clearly intended for use
      by non-GPL, driver code.
      
      So I'm hoping that making get_page() widely accessible again will not be
      too controversial.  We did check with Fedora first, and they responded
      (https://bugzilla.redhat.com/show_bug.cgi?id=1074710#c3
      
      ) that we should
      try to get upstream changed, before asking Fedora to change.  Their
      reasoning seems beneficial to Linux: leaving CONFIG_DEBUG_VM set allows
      Fedora to help catch mm bugs.
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed12d845
    • Emil Medve's avatar
      memblock: use for_each_memblock() · 136199f0
      Emil Medve authored
      
      
      This is a small cleanup.
      Signed-off-by: default avatarEmil Medve <Emilian.Medve@Freescale.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      136199f0
    • Johannes Weiner's avatar
      mm: page_alloc: spill to remote nodes before waking kswapd · 3a025760
      Johannes Weiner authored
      On NUMA systems, a node may start thrashing cache or even swap anonymous
      pages while there are still free pages on remote nodes.
      
      This is a result of commits 81c0a2bb ("mm: page_alloc: fair zone
      allocator policy") and fff4068c
      
       ("mm: page_alloc: revert NUMA aspect
      of fair allocation policy").
      
      Before those changes, the allocator would first try all allowed zones,
      including those on remote nodes, before waking any kswapds.  But now,
      the allocator fastpath doubles as the fairness pass, which in turn can
      only consider the local node to prevent remote spilling based on
      exhausted fairness batches alone.  Remote nodes are only considered in
      the slowpath, after the kswapds are woken up.  But if remote nodes still
      have free memory, kswapd should not be woken to rebalance the local node
      or it may thrash cash or swap prematurely.
      
      Fix this by adding one more unfair pass over the zonelist that is
      allowed to spill to remote nodes after the local fairness pass fails but
      before entering the slowpath and waking the kswapds.
      
      This also gets rid of the GFP_THISNODE exemption from the fairness
      protocol because the unfair pass is no longer tied to kswapd, which
      GFP_THISNODE is not allowed to wake up.
      
      However, because remote spills can be more frequent now - we prefer them
      over local kswapd reclaim - the allocation batches on remote nodes could
      underflow more heavily.  When resetting the batches, use
      atomic_long_read() directly instead of zone_page_state() to calculate the
      delta as the latter filters negative counter values.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a025760
    • Kirill A. Shutemov's avatar
      mm: use 'const char *' insted of 'char *' for reason in dump_page() · d230dec1
      Kirill A. Shutemov authored
      
      
      I tried to use 'dump_page(page, __func__)' for debugging, but it triggers
      warning:
      
        warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]
      
      Let's convert 'reason' to 'const char *' in dump_page() and friends: we
      shouldn't modify it anyway.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d230dec1
    • Michal Hocko's avatar
      mm: exclude memoryless nodes from zone_reclaim · 70ef57e6
      Michal Hocko authored
      
      
      We had a report about strange OOM killer strikes on a PPC machine
      although there was a lot of swap free and a tons of anonymous memory
      which could be swapped out.  In the end it turned out that the OOM was a
      side effect of zone reclaim which wasn't unmapping and swapping out and
      so the system was pushed to the OOM.  Although this sounds like a bug
      somewhere in the kswapd vs.  zone reclaim vs.  direct reclaim
      interaction numactl on the said hardware suggests that the zone reclaim
      should not have been set in the first place:
      
        node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 2 cpus:
        node 2 size: 7168 MB
        node 2 free: 6019 MB
        node distances:
        node   0   2
        0:  10  40
        2:  40  10
      
      So all the CPUs are associated with Node0 which doesn't have any memory
      while Node2 contains all the available memory.  Node distances cause an
      automatic zone_reclaim_mode enabling.
      
      Zone reclaim is intended to keep the allocations local but this doesn't
      make any sense on the memoryless nodes.  So let's exclude such nodes for
      init_zone_allows_reclaim which evaluates zone reclaim behavior and
      suitable reclaim_nodes.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Tested-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      70ef57e6
  4. 03 Apr, 2014 1 commit
  5. 04 Mar, 2014 2 commits
    • Johannes Weiner's avatar
      mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness · 27329369
      Johannes Weiner authored
      Jan Stancek reports manual page migration encountering allocation
      failures after some pages when there is still plenty of memory free, and
      bisected the problem down to commit 81c0a2bb
      
       ("mm: page_alloc: fair
      zone allocator policy").
      
      The problem is that GFP_THISNODE obeys the zone fairness allocation
      batches on one hand, but doesn't reset them and wake kswapd on the other
      hand.  After a few of those allocations, the batches are exhausted and
      the allocations fail.
      
      Fixing this means either having GFP_THISNODE wake up kswapd, or
      GFP_THISNODE not participating in zone fairness at all.  The latter
      seems safer as an acute bugfix, we can clean up later.
      Reported-by: default avatarJan Stancek <jstancek@redhat.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      27329369
    • David Rientjes's avatar
      mm: close PageTail race · 668f9abb
      David Rientjes authored
      Commit bf6bddf1
      
       ("mm: introduce compaction and migration for
      ballooned pages") introduces page_count(page) into memory compaction
      which dereferences page->first_page if PageTail(page).
      
      This results in a very rare NULL pointer dereference on the
      aforementioned page_count(page).  Indeed, anything that does
      compound_head(), including page_count() is susceptible to racing with
      prep_compound_page() and seeing a NULL or dangling page->first_page
      pointer.
      
      This patch uses Andrea's implementation of compound_trans_head() that
      deals with such a race and makes it the default compound_head()
      implementation.  This includes a read memory barrier that ensures that
      if PageTail(head) is true that we return a head page that is neither
      NULL nor dangling.  The patch then adds a store memory barrier to
      prep_compound_page() to ensure page->first_page is set.
      
      This is the safest way to ensure we see the head page that we are
      expecting, PageTail(page) is already in the unlikely() path and the
      memory barriers are unfortunately required.
      
      Hugetlbfs is the exception, we don't enforce a store memory barrier
      during init since no race is possible.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      668f9abb
  6. 23 Jan, 2014 4 commits
  7. 21 Jan, 2014 6 commits
    • David Rientjes's avatar
      mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure · aed0a0e3
      David Rientjes authored
      
      
      __GFP_NOFAIL may return NULL when coupled with GFP_NOWAIT or GFP_ATOMIC.
      
      Luckily, nothing currently does such craziness.  So instead of causing
      such allocations to loop (potentially forever), we maintain the current
      behavior and also warn about the new users of the deprecated flag.
      Suggested-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aed0a0e3
    • Vlastimil Babka's avatar
      mm: compaction: encapsulate defer reset logic · de6c60a6
      Vlastimil Babka authored
      
      
      Currently there are several functions to manipulate the deferred
      compaction state variables.  The remaining case where the variables are
      touched directly is when a successful allocation occurs in direct
      compaction, or is expected to be successful in the future by kswapd.
      Here, the lowest order that is expected to fail is updated, and in the
      case of successful allocation, the deferred status and counter is reset
      completely.
      
      Create a new function compaction_defer_reset() to encapsulate this
      functionality and make it easier to understand the code.  No functional
      change.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de6c60a6
    • Santosh Shilimkar's avatar
      mm/page_alloc.c: use memblock apis for early memory allocations · 6782832e
      Santosh Shilimkar authored
      
      
      Switch to memblock interfaces for early memory allocator instead of
      bootmem allocator.  No functional change in beahvior than what it is in
      current code from bootmem users points of view.
      
      Archs already converted to NO_BOOTMEM now directly use memblock
      interfaces instead of bootmem wrappers build on top of memblock.  And
      the archs which still uses bootmem, these new apis just fallback to
      exiting bootmem APIs.
      Signed-off-by: default avatarGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: default avatarSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6782832e
    • Tang Chen's avatar
      x86, numa, acpi, memory-hotplug: make movable_node have higher priority · b2f3eebe
      Tang Chen authored
      
      
      If users specify the original movablecore=nn@ss boot option, the kernel
      will arrange [ss, ss+nn) as ZONE_MOVABLE.  The kernelcore=nn@ss boot
      option is similar except it specifies ZONE_NORMAL ranges.
      
      Now, if users specify "movable_node" in kernel commandline, the kernel
      will arrange hotpluggable memory in SRAT as ZONE_MOVABLE.  And if users
      do this, all the other movablecore=nn@ss and kernelcore=nn@ss options
      should be ignored.
      
      For those who don't want this, just specify nothing.  The kernel will
      act as before.
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Chen Tang <imtangchen@gmail.com>
      Cc: Gong Chen <gong.chen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Liu Jiang <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b2f3eebe
    • Mel Gorman's avatar
      mm, show_mem: remove SHOW_MEM_FILTER_PAGE_COUNT · aec6a888
      Mel Gorman authored
      Commit 4b59e6c4 ("mm, show_mem: suppress page counts in
      non-blockable contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to
      suppress PFN walks on large memory machines.  Commit c78e9363
      
       ("mm:
      do not walk all of system memory during show_mem") avoided a PFN walk in
      the generic show_mem helper which removes the requirement for
      SHOW_MEM_FILTER_PAGE_COUNT in that case.
      
      This patch removes PFN walkers from the arch-specific implementations
      that report on a per-node or per-zone granularity.  ARM and unicore32
      still do a PFN walk as they report memory usage on each bank which is a
      much finer granularity where the debugging information may still be of
      use.  As the remaining arches doing PFN walks have relatively small
      amounts of memory, this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT.
      
      [akpm@linux-foundation.org: fix parisc]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: James Bottomley <jejb@parisc-linux.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aec6a888
    • Yasuaki Ishimatsu's avatar
      mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve · 943dca1a
      Yasuaki Ishimatsu authored
      Yasuaki Ishimatsu reported memory hot-add spent more than 5 _hours_ on
      9TB memory machine since onlining memory sections is too slow.  And we
      found out setup_zone_migrate_reserve spent >90% of the time.
      
      The problem is, setup_zone_migrate_reserve scans all pageblocks
      unconditionally, but it is only necessary if the number of reserved
      block was reduced (i.e.  memory hot remove).
      
      Moreover, maximum MIGRATE_RESERVE per zone is currently 2.  It means
      that the number of reserved pageblocks is almost always unchanged.
      
      This patch adds zone->nr_migrate_reserve_block to maintain the number of
      MIGRATE_RESERVE pageblocks and it reduces the overhead of
      setup_zone_migrate_reserve dramatically.  The following table shows time
      of onlining a memory section.
      
        Amount of memory     | 128GB | 192GB | 256GB|
        ---------------------------------------------
        linux-3.12           |  23.9 |  31.4 | 44.5 |
        This patch           |   8.3 |   8.3 |  8.6 |
        Mel's proposal patch |  10.9 |  19.2 | 31.3 |
        ---------------------------------------------
                                         (millisecond)
      
        128GB : 4 nodes and each node has 32GB of memory
        192GB : 6 nodes and each node has 32GB of memory
        256GB : 8 nodes and each node has 32GB of memory
      
        (*1) Mel proposed his idea by the following threads.
             https://lkml.org/lkml/2013/10/30/272
      
      
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reported-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Tested-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      943dca1a