1. 11 Sep, 2013 6 commits
  2. 27 Aug, 2013 1 commit
  3. 09 Jul, 2013 4 commits
  4. 03 Jul, 2013 24 commits
    • mm: introduce helper function mem_init_print_info() to simplify mem_init() · 7ee3d4e8
      Jiang Liu authored
      
      
      Introduce helper function mem_init_print_info() to simplify mem_init()
      across different architectures, which also unifies the format and
      information printed.
      
      Function mem_init_print_info() calculates memory statistics information
      without walking each page, so it should be a little faster on some
      architectures.
      
      Also introduce another helper get_num_physpages() to kill the global
      variable num_physpages.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: report available pages as "MemTotal" for each NUMA node · cdd91a77
      Jiang Liu authored
      As reported at https://bugzilla.kernel.org/show_bug.cgi?id=53501 ,
      "MemTotal" from /proc/meminfo means memory pages managed by the buddy
      system (managed_pages), but "MemTotal" from /sys/.../node/nodex/meminfo
      means physical pages present (present_pages) within the NUMA node.
      There's a difference between managed_pages and present_pages due to
      bootmem allocator and reserved pages.
      
      And Documentation/filesystems/proc.txt says
          MemTotal: Total usable ram (i.e. physical ram minus a few reserved
                    bits and the kernel binary code)
      
      So change /sys/.../node/nodex/meminfo to report available pages within
      the node as "MemTotal".
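A minimal sketch of the distinction described above, assuming a simplified stand-in for the kernel's zone structure. The names `zone_sketch` and `node_mem_total()` are hypothetical and not kernel code; the only point is that the per-node total sums managed pages rather than present pages.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_sketch {
    unsigned long present_pages;  /* physical pages present in the zone */
    unsigned long managed_pages;  /* pages managed by the buddy system */
};

/* Per-node "MemTotal" in pages: sum managed_pages over the node's zones. */
unsigned long node_mem_total(const struct zone_sketch *zones, size_t nr_zones)
{
    unsigned long total = 0;

    for (size_t i = 0; i < nr_zones; i++)
        total += zones[i].managed_pages;
    return total;
}
```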
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Reported-by: <sworddragon2@aol.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: correctly update zone->managed_pages · 3dcc0571
      Jiang Liu authored
      
      
      Enhance adjust_managed_page_count() to adjust totalhigh_pages for
      highmem pages.  And change code which directly adjusts totalram_pages to
      use adjust_managed_page_count() because it adjusts totalram_pages,
      totalhigh_pages and zone->managed_pages altogether in a safe way.
      
      Remove inc_totalhigh_pages() and dec_totalhigh_pages() from xen/balloon
      driver because adjust_managed_page_count() has already adjusted
      totalhigh_pages.
      
      This patch also fixes two bugs:
      
      1) enhance virtio_balloon driver to adjust totalhigh_pages when
         reserving/unreserving pages.
      2) enhance memory_hotplug.c to adjust totalhigh_pages when hot-removing
         memory.
      
      We still need to deal with modifications of totalram_pages in file
      arch/powerpc/platforms/pseries/cmm.c, but need help from PPC experts.
      
      [akpm@linux-foundation.org: remove ifdef, per Wanpeng Li, virtio_balloon.c cleanup, per Sergei]
      [akpm@linux-foundation.org: export adjust_managed_page_count() to modules, for drivers/virtio/virtio_balloon.c]
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make __free_pages_bootmem() only available at boot time · 170a5a7e
      Jiang Liu authored
      
      
      In order to simplify management of totalram_pages and
      zone->managed_pages, make __free_pages_bootmem() only available at boot
      time.  With this change applied, __free_pages_bootmem() will only be
      used by bootmem.c and nobootmem.c at boot time, so mark it as __init.
      Other callers of __free_pages_bootmem() have been converted to use
      free_reserved_page(), which handles totalram_pages and
      zone->managed_pages in a safer way.
      
      This patch also fixes a bug in free_pagetable() for x86_64, which should
      increase zone->managed_pages instead of zone->present_pages when freeing
      reserved pages.
      
      And now we have managed_page_count_lock to protect totalram_pages and
      zone->managed_pages, so remove the redundant ppb_lock in
      put_page_bootmem().  This greatly simplifies the locking rules.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use a dedicated lock to protect totalram_pages and zone->managed_pages · c3d5f5f0
      Jiang Liu authored
      
      
      Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to
      protect totalram_pages and zone->managed_pages.  Other than the memory
      hotplug driver, totalram_pages and zone->managed_pages may also be
      modified at runtime by other drivers, such as Xen balloon,
      virtio_balloon etc.  For those cases, memory hotplug lock is a little
      too heavy, so introduce a dedicated lock to protect totalram_pages and
      zone->managed_pages.
      
      Now we have simplified locking rules for totalram_pages and
      zone->managed_pages:
      
      1) no locking for read accesses because they are unsigned long.
      2) no locking for write accesses at boot time in single-threaded context.
      3) serialize write accesses at runtime by acquiring the dedicated
         managed_page_count_lock.
      
      Also adjust zone->managed_pages when freeing reserved pages into the
      buddy system, to keep totalram_pages and zone->managed_pages
      consistent.
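The three locking rules can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel implementation: the C11 `atomic_flag` spin loop stands in for a kernel spinlock, and `locked_zone`, `adjust_managed_page_count_sketch()` and `read_totalram_pages_sketch()` are hypothetical names.

```c
#include <stdatomic.h>

/* Stand-in for the dedicated lock; a kernel spinlock is assumed in the real code. */
static atomic_flag managed_page_count_lock_sketch = ATOMIC_FLAG_INIT;
static unsigned long totalram_pages_sketch;

/* Hypothetical, reduced zone with only the counter of interest. */
struct locked_zone {
    unsigned long managed_pages;
};

/* Rule 3: runtime writers serialize on the dedicated lock. */
void adjust_managed_page_count_sketch(struct locked_zone *zone, long count)
{
    while (atomic_flag_test_and_set_explicit(&managed_page_count_lock_sketch,
                                             memory_order_acquire))
        ;  /* spin until the lock is free */

    /* Both counters change together, so they stay consistent. */
    zone->managed_pages += count;
    totalram_pages_sketch += count;

    atomic_flag_clear_explicit(&managed_page_count_lock_sketch,
                               memory_order_release);
}

/* Rule 1: readers take no lock; the value is a single unsigned long. */
unsigned long read_totalram_pages_sketch(void)
{
    return totalram_pages_sketch;
}
```

Rule 2 (boot-time writes in single-threaded context) needs no code: a plain assignment is enough while only one thread exists.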
      
      [akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)]
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: accurately calculate zone->managed_pages for highmem zones · 7b4b2a0d
      Jiang Liu authored
      
      
      Commit "mm: introduce new field 'managed_pages' to struct zone" assumes
      that all highmem pages will be freed into the buddy system by function
      mem_init().  But that's not always true: some architectures may reserve
      some highmem pages during boot.  For example PPC may allocate highmem
      pages for giant HugeTLB pages, and several architectures have code to
      check PageReserved flag to exclude highmem pages allocated during boot
      when freeing highmem pages into the buddy system.
      
      So treat highmem pages in the same way as normal pages, that is to:
      1) reset zone->managed_pages to zero in mem_init().
      2) recalculate managed_pages when freeing pages into the buddy system.
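The two steps above can be sketched as a small user-space model. The types `page_sketch` and `highmem_zone` and the function name are hypothetical; the `reserved` flag stands in for the PageReserved check mentioned in the message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page descriptor: only the reserved flag matters here. */
struct page_sketch {
    bool reserved;
};

struct highmem_zone {
    unsigned long managed_pages;
};

/* Returns the recalculated managed_pages for the zone. */
unsigned long free_highmem_pages_sketch(struct highmem_zone *zone,
                                        const struct page_sketch *pages,
                                        size_t nr_pages)
{
    zone->managed_pages = 0;              /* step 1: reset to zero */

    for (size_t i = 0; i < nr_pages; i++) {
        if (pages[i].reserved)
            continue;                     /* kept back during boot, not freed */
        zone->managed_pages++;            /* step 2: count each page freed */
    }
    return zone->managed_pages;
}
```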
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use managed_pages to calculate default zonelist order · 4f9f4774
      Jiang Liu authored
      
      
      Use zone->managed_pages instead of zone->present_pages to calculate
      default zonelist order because managed_pages means allocatable pages.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix some trivial typos in comments · 834405c3
      Jiang Liu authored
      
      
      Fix some trivial typos in comments.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: enhance free_reserved_area() to support poisoning memory with zero · dbe67df4
      Jiang Liu authored
      
      
      Address more review comments from last round of code review.
      1) Enhance free_reserved_area() to support poisoning freed memory with
         pattern '0'. This could be used to get rid of poison_init_mem()
         on ARM64.
      2) A previous patch disabled memory poisoning for initmem on s390
         by mistake, so restore the original behavior.
      3) Remove redundant PAGE_ALIGN() when calling free_reserved_area().
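A user-space sketch of item 1, showing how a poison argument can accept the pattern 0 as a valid value. The convention that any value in 0..0xFF is a poison byte and anything else (for example -1) means "do not poison" is an assumption of this sketch, as are the name `free_reserved_area_sketch()` and the fixed 4096-byte page size.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for the illustration */

/*
 * Walks [start, end) one whole page at a time and returns the number of
 * pages visited. A page is filled with the poison byte only when poison
 * fits in 0..0xFF, so 0 is a usable pattern.
 */
size_t free_reserved_area_sketch(unsigned char *start, unsigned char *end,
                                 int poison)
{
    size_t len = (size_t)(end - start);
    size_t pages = 0;

    for (size_t off = 0; off + SKETCH_PAGE_SIZE <= len;
         off += SKETCH_PAGE_SIZE, pages++) {
        if ((unsigned int)poison <= 0xFF)
            memset(start + off, poison, SKETCH_PAGE_SIZE);
        /* The real function would hand the page to the allocator here. */
    }
    return pages;
}
```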
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: change signature of free_reserved_area() to fix building warnings · 11199692
      Jiang Liu authored
      
      
      Change signature of free_reserved_area() according to Russell King's
      suggestion to fix the following build warnings:
      
        arch/arm/mm/init.c: In function 'mem_init':
        arch/arm/mm/init.c:603:2: warning: passing argument 1 of 'free_reserved_area' makes integer from pointer without a cast [enabled by default]
          free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);
          ^
        In file included from include/linux/mman.h:4:0,
                         from arch/arm/mm/init.c:15:
        include/linux/mm.h:1301:22: note: expected 'long unsigned int' but argument is of type 'void *'
         extern unsigned long free_reserved_area(unsigned long start, unsigned long end,
      
         mm/page_alloc.c: In function 'free_reserved_area':
      >> mm/page_alloc.c:5134:3: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [enabled by default]
         In file included from arch/mips/include/asm/page.h:49:0,
                          from include/linux/mmzone.h:20,
                          from include/linux/gfp.h:4,
                          from include/linux/mm.h:8,
                          from mm/page_alloc.c:18:
         arch/mips/include/asm/io.h:119:29: note: expected 'const volatile void *' but argument is of type 'long unsigned int'
         mm/page_alloc.c: In function 'free_area_init_nodes':
         mm/page_alloc.c:5030:34: warning: array subscript is below array bounds [-Warray-bounds]
      
      Also address some minor code review comments.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-hotplug: fix lowmem count overflow when offline pages · cea27eb2
      Wanpeng Li authored
      
      
      The logic for the memory-remove code fails to correctly account for the
      Total High Memory when a memory block which contains High Memory is
      offlined, as shown in the example below.  The following patch fixes it.
      
      Before logical memory remove:
      
      MemTotal:        7603740 kB
      MemFree:         6329612 kB
      Buffers:           94352 kB
      Cached:           872008 kB
      SwapCached:            0 kB
      Active:           626932 kB
      Inactive:         519216 kB
      Active(anon):     180776 kB
      Inactive(anon):   222944 kB
      Active(file):     446156 kB
      Inactive(file):   296272 kB
      Unevictable:           0 kB
      Mlocked:               0 kB
      HighTotal:       7294672 kB
      HighFree:        5704696 kB
      LowTotal:         309068 kB
      LowFree:          624916 kB
      
      After logical memory remove:
      
      MemTotal:        7079452 kB
      MemFree:         5805976 kB
      Buffers:           94372 kB
      Cached:           872000 kB
      SwapCached:            0 kB
      Active:           626936 kB
      Inactive:         519236 kB
      Active(anon):     180780 kB
      Inactive(anon):   222944 kB
      Active(file):     446156 kB
      Inactive(file):   296292 kB
      Unevictable:           0 kB
      Mlocked:               0 kB
      HighTotal:       7294672 kB
      HighFree:        5181024 kB
      LowTotal:       4294752076 kB
      LowFree:          624952 kB
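The LowTotal figure after removal is what unsigned 32-bit wraparound would produce if LowTotal is derived as MemTotal minus HighTotal: HighTotal stays at 7294672 kB while MemTotal drops below it. A minimal sketch of that arithmetic, assuming 32-bit counters (the helper name `low_total_kb()` is hypothetical):

```c
#include <stdint.h>

/* LowTotal as an unsigned 32-bit difference; wraps when high > total. */
uint32_t low_total_kb(uint32_t mem_total_kb, uint32_t high_total_kb)
{
    return mem_total_kb - high_total_kb;
}
```

With the "before" figures, 7603740 - 7294672 gives 309068 kB. With the "after" figures, 7079452 - 7294672 is -215220, which wraps to 4294967296 - 215220 = 4294752076 kB, matching the output shown.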
      
      [mhocko@suse.cz: fix CONFIG_HIGHMEM=n build]
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[2.6.24+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: add additional checking and return value for the 'table->data' · dacbde09
      Chen Gang authored
      
      
      - check the length of the procfs data before copying it into a fixed
        size array.
      
      - when __parse_numa_zonelist_order() fails, save the error code for
        return.
      
      - 'char*' --> 'char *' coding style fix
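The first two points can be illustrated with a small sketch: reject input that would not fit the fixed-size buffer, and propagate an error code to the caller. The buffer length of 16 and the names `ORDER_LEN_SKETCH` and `save_order_string()` are assumptions made for this example, not values taken from the patch.

```c
#include <errno.h>
#include <string.h>

#define ORDER_LEN_SKETCH 16  /* assumed size of the fixed array */

/*
 * Copies data into saved only if it fits, including the terminating NUL.
 * Returns 0 on success or -EINVAL if the string is too long; on failure
 * the destination is left untouched.
 */
int save_order_string(char saved[ORDER_LEN_SKETCH], const char *data)
{
    if (strlen(data) >= ORDER_LEN_SKETCH)
        return -EINVAL;

    strcpy(saved, data);  /* safe: length was checked above */
    return 0;
}
```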
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: don't re-init pageset in zone_pcp_update() · 169f6c19
      Cody P Schafer authored
      
      
      When memory hotplug is triggered, we call pageset_init() on
      per-cpu-pagesets which both contain pages and are in use, causing both the
      leakage of those pages and (potentially) bad behaviour if a page is
      allocated from a pageset while it is being cleared.
      
      Avoid this by factoring out pageset_set_high_and_batch() (which contains
      all needed logic to set a pageset's ->high and ->batch irrespective of
      system state) from zone_pageset_init() and using the new
      pageset_set_high_and_batch() instead of zone_pageset_init() in
      zone_pcp_update().
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch() · 3664033c
      Cody P Schafer authored
      
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: in zone_pcp_update(), use zone_pageset_init() · 737af4c0
      Cody P Schafer authored
      
      
      Previously, zone_pcp_update() called pageset_set_batch() directly,
      essentially assuming that percpu_pagelist_fraction == 0.
      
      Correct this by calling zone_pageset_init(), which chooses the
      appropriate ->batch and ->high calculations.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset() · 56cef2b8
      Cody P Schafer authored
      
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: relocate comment to be directly above code it refers to. · dd1895e2
      Cody P Schafer authored
      
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch() · 88c90dbc
      Cody P Schafer authored
      
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: when handling percpu_pagelist_fraction, don't needlessly recalculate high · 22a7f12b
      Cody P Schafer authored
      
      
      Simply moves calculation of the new 'high' value outside the
      for_each_possible_cpu() loop, as it does not depend on the cpu.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine() · 0a647f38
      Cody P Schafer authored
      
      
      zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a
      percpu pageset based on a zone's ->managed_pages.  We don't need to drain
      the entire percpu pageset just to modify these fields.
      
      This lets us avoid calling setup_pageset() (and the draining required to
      call it) and instead allows simply setting the fields' values (with some
      attention paid to memory barriers to prevent the relationship between
      ->batch and ->high from being thrown off).
      
      This does change the behavior of zone_pcp_update() as the percpu pagesets
      will not be drained when zone_pcp_update() is called (they will end up
      being shrunk, not completely drained, later when a 0-order page is freed
      in free_hot_cold_page()).
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a647f38
    • Cody P Schafer's avatar
      mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE · 998d39cb
      Cody P Schafer authored
      
      
      pcp->batch could change at any point; avoid relying on it being a stable
      value.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      998d39cb
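
A rough userspace illustration of the idea in this commit: read a field that may change concurrently exactly once into a local, then use only the local. The macro follows the form the kernel used at the time and relies on the GNU C `__typeof__` extension; the struct and function names are hypothetical.

```c
/* GNU C; forces a single, non-cached read of x. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for the per-cpu page list bookkeeping. */
struct pcp_once {
    int count;  /* pages currently on the list */
    int high;   /* trim the list once count reaches this */
    int batch;  /* how many pages to free per trim; may change under us */
};

/* Returns how many pages were trimmed from the list. */
int trim_pcp(struct pcp_once *pcp)
{
    if (pcp->count < pcp->high)
        return 0;

    /* Snapshot batch once so both uses below agree on its value. */
    int batch = ACCESS_ONCE(pcp->batch);

    pcp->count -= batch;
    return batch;
}
```

Without the snapshot, the compiler would be free to reload pcp->batch for each use, and the two uses could disagree if an updater ran in between.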
    • Cody P Schafer's avatar
      mm/page_alloc: insert memory barriers to allow async update of pcp batch and high · 8d7a8fa9
      Cody P Schafer authored
      Introduce pageset_update() to perform a safe transition from one set of
      pcp->{batch,high} to a new set using memory barriers.
      
      This ensures that batch is always set to a safe value (1) prior to
      updating high, and ensures that high is fully updated before setting the
      real value of batch.  It prevents ->batch from ever rising above ->high.
      
      Suggested by Gilad Ben-Yossef in these threads:
      
      	https://lkml.org/lkml/2013/4/9/23
      	https://lkml.org/lkml/2013/4/10/49
      
      Also reproduces his proposed comment.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d7a8fa9
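
The ordering this commit describes (safe batch, then high, then real batch) might look roughly like the following in userspace C11. This is a sketch under stated assumptions: release fences stand in for the kernel's write barriers, and the struct and function names are hypothetical rather than taken from the patch.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the two fields being updated. */
struct pcp_fence {
    _Atomic unsigned long high;
    _Atomic unsigned long batch;
};

void pageset_update_sketch(struct pcp_fence *pcp,
                           unsigned long high, unsigned long batch)
{
    /* Step 1: fall back to a batch value that is safe for any high. */
    atomic_store_explicit(&pcp->batch, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    /* Step 2: publish the new high while batch is still the safe value. */
    atomic_store_explicit(&pcp->high, high, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    /* Step 3: only now publish the real batch. */
    atomic_store_explicit(&pcp->batch, batch, memory_order_relaxed);
}
```

A reader that observes the new batch is thereby ordered after the store of the new high, which is what keeps batch from being seen above high.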
    • Cody P Schafer's avatar
      mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high · c8e251fa
      Cody P Schafer authored
      
      
      Because we are going to rely upon a careful transition between old and new
      ->high and ->batch values using memory barriers and will remove
      stop_machine(), we need to prevent multiple updaters from interleaving
      their memory writes.
      
      Add a simple mutex to protect both update loops.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8e251fa
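
The update-side locking in this commit can be sketched in userspace with a pthread mutex. The lock name, struct, and function are hypothetical; the point is only that the whole multi-step update of every pageset happens under one lock, so two updaters cannot interleave their writes.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical update-side lock; readers do not take it. */
static pthread_mutex_t pcp_update_lock = PTHREAD_MUTEX_INITIALIZER;

struct pcp_locked {
    unsigned long high;
    unsigned long batch;
};

void update_all_pagesets(struct pcp_locked *sets, size_t n,
                         unsigned long high, unsigned long batch)
{
    pthread_mutex_lock(&pcp_update_lock);
    for (size_t i = 0; i < n; i++) {
        /* Same three-step order as the barrier-based update. */
        sets[i].batch = 1;
        sets[i].high = high;
        sets[i].batch = batch;
    }
    pthread_mutex_unlock(&pcp_update_lock);
}
```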
    • Cody P Schafer's avatar
      mm/page_alloc: factor out setting of pcp->high and pcp->batch · 4008bab7
      Cody P Schafer authored
      
      
      "Problems" with the current code:
      
      1: there is a lack of synchronization in setting ->high and ->batch in
         percpu_pagelist_fraction_sysctl_handler()
      
      2: stop_machine() in zone_pcp_update() is unnecessary.
      
      3: zone_pcp_update() does not consider the case where
         percpu_pagelist_fraction is non-zero
      
      To fix:
      
      1: add memory barriers, a safe ->batch value, an update side mutex when
         updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users
         that expect a stable value.
      
      2: avoid draining pages in zone_pcp_update(), rely upon the memory
         barriers added to fix #1
      
      3: factor out quite a few functions, and then call the appropriate one.
      
      Note that it results in a change to the behavior of zone_pcp_update(),
      which is used by memory_hotplug.  I'm rather certain that I've discerned
      (and preserved) the essential behavior (changing ->high and ->batch), and
      only eliminated unneeded actions (draining the per cpu pages), but this
      may not be the case.
      
      Further note that the draining of pages that previously took place in
      zone_pcp_update() occurred after repeated draining when attempting to
      offline a page, and after the offline has "succeeded".  It appears that
      the draining was added to zone_pcp_update() to avoid refactoring
      setup_pageset() into 2 functions.
      
      This patch:
      
      Creates pageset_set_batch() for use in setup_pageset().
      pageset_set_batch() imitates the functionality of
      setup_pagelist_highmark(), but uses the boot time
      (percpu_pagelist_fraction == 0) calculations for determining ->high based
      on ->batch.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4008bab7
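
A minimal sketch of the boot-time helper this commit describes, deriving ->high from ->batch. The commit text does not give the ratio; the factor of 6 and the clamp of batch to at least 1 are assumptions recalled from kernels of that era, and the struct and function names are hypothetical.

```c
struct pcp_boot {
    unsigned long high;
    unsigned long batch;
};

/* Boot-time case (percpu_pagelist_fraction == 0): high follows batch. */
void pageset_set_batch_sketch(struct pcp_boot *pcp, unsigned long batch)
{
    pcp->high = 6 * batch;               /* assumed ratio, see above */
    pcp->batch = batch > 1 ? batch : 1;  /* never let batch drop to 0 */
}
```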
  5. 12 Jun, 2013 1 commit
    • Tomasz Stanislawski's avatar
      mm/page_alloc.c: fix watermark check in __zone_watermark_ok() · 026b0814
      Tomasz Stanislawski authored
      
      
      The watermark check consists of two sub-checks.  The first one is:
      
      	if (free_pages <= min + lowmem_reserve)
      		return false;
      
      The check ensures that there is a minimal amount of RAM in the zone.  If
      CMA is used, then free_pages is reduced by the number of free pages
      in CMA prior to the above-mentioned check.
      
      	if (!(alloc_flags & ALLOC_CMA))
      		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
      
      This prevents the zone from being drained of pages available for
      non-movable allocations.
      
      The second check prevents the zone from getting too fragmented.
      
      	for (o = 0; o < order; o++) {
      		free_pages -= z->free_area[o].nr_free << o;
      		min >>= 1;
      		if (free_pages <= min)
      			return false;
      	}
      
      The field z->free_area[o].nr_free is equal to the number of free pages
      including free CMA pages.  Therefore the CMA pages are subtracted twice.
      This may cause a spurious failure of __zone_watermark_ok() if the CMA
      area gets strongly fragmented.  In such a case there are many 0-order
      free pages located in CMA.  Those pages are subtracted twice, so
      they quickly drain free_pages during the check against
      fragmentation.  The test fails even though there are many free non-CMA
      pages in the zone.
      
      This patch fixes the issue by subtracting CMA pages only for the purpose
      of the (free_pages <= min + lowmem_reserve) check.
      
      Laura said:
      
        We were observing allocation failures of higher order pages (order 5 =
        128K typically) under tight memory conditions resulting in driver
        failure.  The output from the page allocation failure showed plenty of
        free pages of the appropriate order/type/zone and mostly CMA pages in
        the lower orders.
      
        For full disclosure, we still observed some page allocation failures
        even after applying the patch but the number was drastically reduced and
        those failures were attributed to fragmentation/other system issues.
      Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Tested-by: Laura Abbott <lauraa@codeaurora.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: <stable@vger.kernel.org>	[3.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      026b0814
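
The double subtraction described in this commit can be reproduced with a userspace sketch. Both functions below are simplified stand-ins for __zone_watermark_ok() for an allocation without ALLOC_CMA: plain arguments replace the zone structure, and the page counts are made-up numbers chosen so that all order-0 free pages sit in CMA.

```c
#include <stdbool.h>

/* Before the fix: CMA pages are removed from free_pages up front and
 * then removed again through nr_free[], which also counts CMA pages. */
bool watermark_ok_before(long free_pages, long free_cma, const long *nr_free,
                         int order, long min, long lowmem_reserve)
{
    free_pages -= free_cma;
    if (free_pages <= min + lowmem_reserve)
        return false;
    for (int o = 0; o < order; o++) {
        free_pages -= nr_free[o] << o;
        min >>= 1;
        if (free_pages <= min)
            return false;
    }
    return true;
}

/* After the fix: CMA pages are discounted only for the first check. */
bool watermark_ok_after(long free_pages, long free_cma, const long *nr_free,
                        int order, long min, long lowmem_reserve)
{
    if (free_pages - free_cma <= min + lowmem_reserve)
        return false;
    for (int o = 0; o < order; o++) {
        free_pages -= nr_free[o] << o;
        min >>= 1;
        if (free_pages <= min)
            return false;
    }
    return true;
}

/* Made-up zone: 400 order-0 pages (all CMA) plus 300 order-1 blocks,
 * i.e. 1000 free pages in total, 400 of them in CMA. */
const long example_nr_free[2] = { 400, 300 };
```

With free_pages = 1000, free_cma = 400, order = 1 and min = 500, the first variant rejects the request while the second accepts it, even though 600 non-CMA pages are free in both cases.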
  6. 22 May, 2013 1 commit
    • Ralf Baechle's avatar
      mm: Fix virt_to_page() warning · bb3ec6b0
      Ralf Baechle authored
      
      
      virt_to_page() is typically implemented as a macro containing a cast so
      that it will accept both pointers and unsigned long without causing a
      warning.
      
      But MIPS virt_to_page() uses virt_to_phys which is a function so passing
      an unsigned long will cause a warning:
      
          CC      mm/page_alloc.o
        mm/page_alloc.c: In function ‘free_reserved_area’:
        mm/page_alloc.c:5161:3: warning: passing argument 1 of ‘virt_to_phys’ makes pointer from integer without a cast [enabled by default]
        arch/mips/include/asm/io.h:119:100: note: expected ‘const volatile void *’ but argument is of type ‘long unsigned int’
      
      All other users of virt_to_page() in mm/ are passing a void *.
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Reported-by: Eunbong Song <eunb.song@samsung.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb3ec6b0
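
The warning in this commit arises whenever an address held in an unsigned long reaches a real function that takes a pointer. The sketch below mimics that situation in userspace; every name ending in _sketch is hypothetical, the identity "translation" is a pretend mapping, and the explicit cast is shown as one way to satisfy the prototype, not necessarily the exact change the commit made.

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12

/* A real function, like the MIPS helper: the argument type is enforced. */
static unsigned long virt_to_phys_sketch(const volatile void *address)
{
    return (unsigned long)(uintptr_t)address;  /* pretend identity mapping */
}

/* Macro wrapper without a cast of its own, so the caller's type leaks in. */
#define virt_to_pfn_sketch(kaddr) \
    (virt_to_phys_sketch(kaddr) >> PAGE_SHIFT_SKETCH)

unsigned long pfn_of_address(unsigned long pos)
{
    /* Passing pos directly would draw the "makes pointer from integer
     * without a cast" diagnostic; the explicit cast avoids it. */
    return virt_to_pfn_sketch((void *)pos);
}
```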
  7. 29 Apr, 2013 3 commits
    • Cody P Schafer's avatar
      page_alloc: make setup_nr_node_ids() usable for arch init code · f9872caf
      Cody P Schafer authored
      
      
      powerpc and x86 were open-coding copies of setup_nr_node_ids(), which
      page_alloc provides but makes static.  Make it available to the archs in
      linux/mm.h.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9872caf
    • Russ Anderson's avatar
      mm: speedup in __early_pfn_to_nid · 7c243c71
      Russ Anderson authored
      
      
      When booting on a large memory system, the kernel spends considerable
      time in memmap_init_zone() setting up memory zones.  Analysis shows
      significant time spent in __early_pfn_to_nid().
      
      The routine memmap_init_zone() checks each PFN to verify the nid is
      valid.  __early_pfn_to_nid() sequentially scans the list of pfn ranges
      to find the right range and returns the nid.  This does not scale well.
      On a 4 TB (single rack) system there are 308 memory ranges to scan.  The
      higher the PFN the more time spent sequentially spinning through memory
      ranges.
      
      Since memmap_init_zone() increments pfn, it will almost always be
      looking for the same range as the previous pfn, so check that range
      first.  If it is in the same range, return that nid.  If not, scan the
      list as before.
      
      A 4 TB (single rack) UV1 system takes 512 seconds to get through the
      zone code.  This performance optimization reduces the time by 189
      seconds, a 36% improvement.
      
      A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
      a 112.9 second (53%) reduction.
      
      [akpm@linux-foundation.org: make the statics __meminitdata]
      [akpm@linux-foundation.org: fix comment formatting]
      [akpm@linux-foundation.org: fix ia64, per yinghai]
      [akpm@linux-foundation.org: add missing semicolon, per Tony]
      Signed-off-by: Russ Anderson <rja@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Tested-by: "Luck, Tony" <tony.luck@intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c243c71
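
The "check the previous range first" optimization in this commit can be sketched as follows. The range table and its contents are made up for illustration and the function name is hypothetical; the cached variables are static, matching the commit note about statics, but the kernel's section annotations are omitted.

```c
#include <stddef.h>

struct pfn_range {
    unsigned long start_pfn;  /* inclusive */
    unsigned long end_pfn;    /* exclusive */
    int nid;
};

/* Made-up memory map; a large machine would have hundreds of entries. */
static const struct pfn_range ranges[] = {
    {   0, 100, 0 },
    { 100, 250, 1 },
    { 250, 400, 0 },
};

int early_pfn_to_nid_sketch(unsigned long pfn)
{
    /* Range that satisfied the previous lookup (empty at first). */
    static unsigned long last_start_pfn, last_end_pfn;
    static int last_nid;

    /* Fast path: consecutive pfns almost always hit the same range. */
    if (last_start_pfn <= pfn && pfn < last_end_pfn)
        return last_nid;

    /* Slow path: linear scan, then remember the range that matched. */
    for (size_t i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
        if (ranges[i].start_pfn <= pfn && pfn < ranges[i].end_pfn) {
            last_start_pfn = ranges[i].start_pfn;
            last_end_pfn = ranges[i].end_pfn;
            last_nid = ranges[i].nid;
            return last_nid;
        }
    }
    return -1;  /* pfn not covered by any range */
}
```

A caller that walks pfns in increasing order pays for the scan only when it crosses into a new range.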
    • Mel Gorman's avatar
      mm: page_alloc: avoid marking zones full prematurely after zone_reclaim() · fed2719e
      Mel Gorman authored
      The following problem was reported against a distribution kernel when
      zone_reclaim was enabled but the same problem applies to the mainline
      kernel.  The reproduction case was as follows
      
      1. Run numactl -m +0 dd if=largefile of=/dev/null
         This allocates a large number of clean pages in node 0
      
      2. numactl -N +0 memhog 0.5*Mg
         This starts a memory-using application in node 0.
      
      The expected behaviour is that the clean pages get reclaimed and the
      application uses node 0 for its memory.  The observed behaviour was that
      the memory for the memhog application was allocated off-node since
      commits cd38b115 ("mm: page allocator: initialise ZLC for first zone
      eligible for zone_reclaim") and 76d3fbf8 ("mm: page allocator:
      reconsider zones for allocation after direct reclaim").
      
      The assumption of those patches was that it was always preferable to
      allocate quickly than stall for long periods of time and they were meant
      to take care that the zone was only marked full when necessary but an
      important case was missed.
      
      In the allocator fast path, only the low watermarks are checked.  If the
      zone's free pages are between the low and min watermarks, then allocations
      from the allocator's slow path will succeed.  However, zone_reclaim will
      only reclaim SWAP_CLUSTER_MAX or 1<<order pages.  There is no guarantee
      that this will meet the low watermark, causing the zone to be marked full
      prematurely.
      
      This patch will only mark the zone full after zone_reclaim if the min
      watermarks are checked or if page reclaim failed to make sufficient
      progress.
      
      [mhocko@suse.cz: fix alloc_flags test]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Hedi Berriche <hedi@sgi.com>
      Tested-by: Hedi Berriche <hedi@sgi.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fed2719e
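
The rule stated at the end of this commit message can be written as a small predicate. This is a sketch of the decision only, with hypothetical names and boolean inputs in place of the allocator's flags and return codes.

```c
#include <stdbool.h>

/*
 * Decide whether a zone should be marked full after zone_reclaim() ran.
 * Inputs are hypothetical summaries of state the allocator already has.
 */
bool should_mark_zone_full(bool watermark_met_after_reclaim,
                           bool checking_min_watermark,
                           bool reclaim_made_sufficient_progress)
{
    /* Reclaim freed enough: the zone is usable, so never mark it full. */
    if (watermark_met_after_reclaim)
        return false;

    /* Otherwise mark it full only if the min watermark was the one being
     * checked, or reclaim failed to make sufficient progress. */
    return checking_min_watermark || !reclaim_made_sufficient_progress;
}
```

In the fast-path case from the message (low watermark checked, reclaim made progress but did not reach the low watermark), the predicate returns false, so the zone stays eligible for later attempts.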