1. 22 Sep, 2009 40 commits
    • flex_array: add flex_array_shrink function · 4af5a2f7
      David Rientjes authored
      Add a new function to the flex_array API:
      
      	int flex_array_shrink(struct flex_array *fa)
      
      This function will free all unused second-level pages.  Since elements are
      now poisoned if they are not allocated with __GFP_ZERO, it's possible to
      identify parts that consist solely of unused elements.
      
      flex_array_shrink() returns the number of pages freed.
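      For illustration, a minimal usage sketch (the element type, counts and
      caller are hypothetical; flex_array_clear() is the helper added by a
      commit further down this page):

      #include <linux/errno.h>
      #include <linux/flex_array.h>
      #include <linux/gfp.h>
      #include <linux/kernel.h>

      struct rec { int a; int b; };   /* hypothetical element type */

      static int flex_array_shrink_example(void)
      {
              struct flex_array *fa;
              struct rec r = { 1, 2 };
              int freed;

              fa = flex_array_alloc(sizeof(r), 4096, GFP_KERNEL);
              if (!fa)
                      return -ENOMEM;

              /* Element 1000 lands in a later part, which gets allocated here. */
              if (flex_array_put(fa, 1000, &r, GFP_KERNEL))
                      goto out;

              /* Mark it free again, leaving that part with no in-use elements... */
              flex_array_clear(fa, 1000);

              /* ...so flex_array_shrink() can hand its page back. */
              freed = flex_array_shrink(fa);
              pr_info("freed %d unused flex_array page(s)\n", freed);
      out:
              flex_array_free(fa);
              return 0;
      }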
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4af5a2f7
    • flex_array: poison free elements · 19da3dd1
      David Rientjes authored
      Newly initialized flex_array's and/or flex_array_part's are now poisoned
      with a new poison value, FLEX_ARRAY_FREE.  Its value is similar to
      POISON_FREE used in the various slab allocators, but is different so that
      kmem poisoned by flex_array can be distinguished from kmem poisoned by a
      slab allocator.
      
      This will allow us to identify flex_array_part's that only contain free
      elements (and free them with an addition to the flex_array API).  This
      could also be extended in the future to identify `get' uses on elements
      that have not been `put'.
      
      If __GFP_ZERO is passed for a part's gfp mask, the poisoning is avoided.
      These elements are considered to be in-use since they have been
      initialized.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19da3dd1
    • flex_array: add flex_array_clear function · e6de3988
      David Rientjes authored
      Add a new function to the flex_array API:
      
      	int flex_array_clear(struct flex_array *fa,
      				unsigned int element_nr)
      
      This function will zero the element at element_nr in the flex_array.
      
      Although this is equivalent to using flex_array_put() with a pointer to
      zero'd memory, flex_array_clear() does not require the caller to supply
      such a pointer, which would most likely have to be allocated on the
      caller's stack and could be significantly large depending on
      element_size.
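      For illustration, the two equivalent approaches side by side (element
      size is hypothetical):

      #include <linux/flex_array.h>
      #include <linux/gfp.h>
      #include <linux/string.h>

      #define REC_SIZE 256    /* hypothetical element_size */

      /* Old way: needs an element_size scratch buffer, typically on the stack. */
      static int clear_record_with_put(struct flex_array *fa, unsigned int nr)
      {
              char zeroes[REC_SIZE];

              memset(zeroes, 0, sizeof(zeroes));
              return flex_array_put(fa, nr, zeroes, GFP_KERNEL);
      }

      /* New way: no scratch buffer needed, whatever the element size. */
      static int clear_record(struct flex_array *fa, unsigned int nr)
      {
              return flex_array_clear(fa, nr);
      }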
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6de3988
    • cpuidle: fix the menu governor to boost IO performance · 69d25870
      Arjan van de Ven authored
      Fix the menu idle governor which balances power savings, energy efficiency
      and performance impact.
      
      The reason for a reworked governor is that there have been serious
      performance issues reported with the existing code on Nehalem server
      systems.
      
      To show this I'm sure Andrew wants to see benchmark results:
      (benchmark is "fio", "no cstates" is using "idle=poll")
      
      		no cstates	current linux	new algorithm
      1 disk		107 Mb/s	85 Mb/s		105 Mb/s
      2 disks		215 Mb/s	123 Mb/s	209 Mb/s
      12 disks	590 Mb/s	320 Mb/s	585 Mb/s
      
      In various power benchmark measurements, no degradation was found by our
      measurement & diagnostics team.  Obviously a small percentage more power
      was used in the "fio" benchmark, due to the much higher performance.
      
      While it would be a novel idea to describe the new algorithm in this
      commit message, I cheaped out and described it in comments in the code
      instead.
      
      [changes since first post: spelling fixes from akpm, review feedback,
      folded menu-tng into menu.c]
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69d25870
    • mm: move use_mm/unuse_mm from aio.c to mm/ · 3d2d827f
      Michael S. Tsirkin authored
      Anyone who wants to copy to/from user space from a kernel thread needs
      use_mm (like fs/aio has).  Move that into mm/, to make reusing and
      exporting easier down the line, and make aio use it.  The next intended
      user, besides aio, will be vhost-net.
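      For illustration, roughly the pattern a kernel thread would use (the
      worker function and buffer are hypothetical):

      #include <linux/errno.h>
      #include <linux/mmu_context.h>
      #include <linux/sched.h>
      #include <linux/uaccess.h>

      /* Hypothetical worker: copy a result into a user buffer owned by 'mm'. */
      static int push_result_to_user(struct mm_struct *mm, void __user *ubuf,
                                     const void *kbuf, size_t len)
      {
              int ret = 0;

              use_mm(mm);             /* temporarily adopt the target mm */
              if (copy_to_user(ubuf, kbuf, len))
                      ret = -EFAULT;
              unuse_mm(mm);           /* and drop it again */
              return ret;
      }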
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d2d827f
    • hugetlb: add MAP_HUGETLB for mmaping pseudo-anonymous huge page regions · 4e52780d
      Eric B Munson authored
      Add a flag for mmap that will be used to request a huge page region that
      will look like anonymous memory to userspace.  This is accomplished by
      using a file on the internal vfsmount.  MAP_HUGETLB is a modifier of
      MAP_ANONYMOUS and so must be specified with it.  The region will behave
      the same as a MAP_ANONYMOUS region using small pages.
      
      [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB]
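      For illustration, a minimal userspace sketch (the length is hypothetical
      and must be a multiple of the huge page size; the MAP_HUGETLB value shown
      is the x86 one, for libc headers that don't define it yet):

      #include <stdio.h>
      #include <sys/mman.h>

      #ifndef MAP_HUGETLB
      #define MAP_HUGETLB 0x40000
      #endif

      int main(void)
      {
              size_t len = 2UL * 1024 * 1024;         /* one 2MB huge page */
              void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

              if (p == MAP_FAILED) {
                      perror("mmap(MAP_HUGETLB)");    /* e.g. no huge pages reserved */
                      return 1;
              }
              /* From here on it behaves like an ordinary anonymous mapping. */
              munmap(p, len);
              return 0;
      }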
      Signed-off-by: Eric B Munson <ebmunson@us.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e52780d
    • mm: add MAP_HUGETLB for mmaping pseudo-anonymous huge page regions · 90f72aa5
      Arnd Bergmann authored
      Add a flag for mmap that will be used to request a huge page region that
      will look like anonymous memory to user space.  This is accomplished by
      using a file on the internal vfsmount.  MAP_HUGETLB is a modifier of
      MAP_ANONYMOUS and so must be specified with it.  The region will behave
      the same as a MAP_ANONYMOUS region using small pages.
      
      The patch also adds the MAP_STACK flag, which was previously defined only
      on some architectures but not on others.  Since MAP_STACK is meant to be a
      hint only, architectures can define it without assigning a specific
      meaning to it.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Eric B Munson <ebmunson@us.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90f72aa5
    • hugetlbfs: allow the creation of files suitable for MAP_PRIVATE on the vfs internal mount · 6bfde05b
      Eric B Munson authored
      This patchset adds a flag to mmap that allows the user to request that an
      anonymous mapping be backed with huge pages.  This mapping will borrow
      functionality from the huge page shm code to create a file on the kernel
      internal mount and use it to approximate an anonymous mapping.  The
      MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work without
      both flags being present.
      
      A new flag is necessary because there is no other way to hook into huge
      pages without creating a file on a hugetlbfs mount which wouldn't be
      MAP_ANONYMOUS.
      
      To userspace, this mapping will behave just like an anonymous mapping
      because the file is not accessible outside of the kernel.
      
      This patchset is meant to simplify the programming model.  Presently there
      is a large chunk of boilerplate code, contained in libhugetlbfs, required
      to create private, hugepage backed mappings.  This patch set would allow
      use of hugepages without linking to libhugetlbfs or having hugetlbfs
      mounted.
      
      Unification of the VM code would provide these same benefits, but it has
      been resisted each time that it has been suggested for several reasons: it
      would break PAGE_SIZE assumptions across the kernel, it makes page-table
      abstractions really expensive, and it does not provide any benefit on
      architectures that do not support huge pages, incurring fast path
      penalties without providing any benefit on these architectures.
      
      This patch:
      
      There are two means of creating mappings backed by huge pages:
      
              1. mmap() a file created on hugetlbfs
              2. Use shm which creates a file on an internal mount which essentially
                 maps it MAP_SHARED
      
      The internal mount is only used for shared mappings but there is very
      little that stops it being used for private mappings. This patch extends
      hugetlbfs_file_setup() to deal with the creation of files that will be
      mapped MAP_PRIVATE on the internal hugetlbfs mount. This extended API is
      used in a subsequent patch to implement the MAP_HUGETLB mmap() flag.
      Signed-off-by: Eric Munson <ebmunson@us.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6bfde05b
    • tmpfs: depend on shmem · 3f96b79a
      Hugh Dickins authored
      CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
      CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
      more sense of it by removing CONFIG_TMPFS altogether, always enabling its
      code when CONFIG_SHMEM is set; but so many defconfigs have CONFIG_SHMEM on
      with CONFIG_TMPFS off that we'd better leave that as is.
      
      But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
      make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL
      shmem_acl.o being pointlessly built into the kernel when SHMEM is off.
      
      And a selfish change, to prevent the world from being rebuilt when I
      switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
      header files is mm.h shmem_lock() - give that a shmem.c stub instead.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f96b79a
    • mm: FOLL flags for GUP flags · 58fa879e
      Hugh Dickins authored
      __get_user_pages() has been taking its own GUP flags, then processing
      them into FOLL flags for follow_page().  Though oddly named, the FOLL
      flags are more widely used, so pass them to __get_user_pages() now.
      Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.
      
      (The patch to __get_user_pages() looks peculiar, with both gup_flags
      and foll_flags: the gup_flags remain constant; but as before there's
      an exceptional case, out of scope of the patch, in which foll_flags
      per page have FOLL_WRITE masked off.)
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58fa879e
    • mm: follow_hugetlb_page flags · 2a15efc9
      Hugh Dickins authored
      follow_hugetlb_page() shouldn't be guessing about the coredump case
      either: pass the foll_flags down to it, instead of just the write bit.
      
      Remove that obscure huge_zeropage_ok() test.  The decision is easy,
      though unlike the non-huge case - here vm_ops->fault is always set.
      But we know that a fault would serve up zeroes, unless there's
      already a hugetlbfs pagecache page to back the range.
      
      (Alternatively, since hugetlb pages aren't swapped out under pressure,
      you could save more dump space by arguing that a page not yet faulted
      into this process cannot be relevant to the dump; but that would be
      more surprising.)
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a15efc9
    • mm: FOLL_DUMP replace FOLL_ANON · 8e4b9a60
      Hugh Dickins authored
      The "FOLL_ANON optimization" and its use_zero_page() test have caused
      confusion and bugs: why does it test VM_SHARED? for the very good but
      unsatisfying reason that VMware crashed without.  As we look to maybe
      reinstating anonymous use of the ZERO_PAGE, we need to sort this out.
      
      Easily done: it's silly for __get_user_pages() and follow_page() to
      be guessing whether it's safe to assume that they're being used for
      a coredump (which can take a shortcut snapshot where other uses must
      handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.
      
      get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e4b9a60
    • mm: add get_dump_page · f3e8fccd
      Hugh Dickins authored
      In preparation for the next patch, add a simple get_dump_page(addr)
      interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
      get_user_pages() directly.  They're not interested in errors: they
      just want to use holes as much as possible, to save space and make
      sure that the data is aligned where the headers said it would be.
      
      Oh, and don't use that horrid DUMP_SEEK(off) macro!
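      For illustration, roughly how a dumper might use it (the write-out and
      seek helpers are hypothetical stand-ins for the dumper's own):

      #include <linux/highmem.h>
      #include <linux/mm.h>
      #include <linux/pagemap.h>

      int dump_write(struct file *file, const void *buf, size_t len);  /* hypothetical */
      int dump_seek(struct file *file, loff_t off);                    /* hypothetical */

      /* Emit one page of the core dump, leaving a hole where there is no page. */
      static int dump_one_page(struct file *file, unsigned long addr)
      {
              struct page *page = get_dump_page(addr);
              int ok;

              if (!page)
                      return dump_seek(file, PAGE_SIZE);      /* hole: just skip */

              ok = dump_write(file, kmap(page), PAGE_SIZE);
              kunmap(page);
              page_cache_release(page);
              return ok;
      }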
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3e8fccd
    • page-allocator: split per-cpu list into one-list-per-migrate-type · 5f8dcc21
      Mel Gorman authored
      The following two patches remove searching in the page allocator fast-path
      by maintaining multiple free-lists in the per-cpu structure.  At the time
      the search was introduced, increasing the per-cpu structures would waste a
      lot of memory as per-cpu structures were statically allocated at
      compile-time.  This is no longer the case.
      
      The patches are as follows. They are based on mmotm-2009-08-27.
      
      Patch 1 adds multiple lists to struct per_cpu_pages, one per
      	migratetype that can be stored on the PCP lists.
      
      Patch 2 notes that the pcpu drain path checks empty lists multiple times. The
      	patch reduces the number of checks by maintaining a count of free
      	lists encountered. Lists containing pages will then free multiple
      	pages in batch
      
      The patches were tested with kernbench, netperf udp/tcp, hackbench and
      sysbench.  The netperf tests were not bound to any CPU in particular and
      were run such that the results should be 99% confidence that the reported
      results are within 1% of the estimated mean.  sysbench was run with a
      postgres background and read-only tests.  Similar to netperf, it was run
      multiple times so that the results are within 1% at 99% confidence.  The
      patches were tested on x86, x86-64 and ppc64 as follows:
      
      x86:	Intel Pentium D 3GHz with 8G RAM (no-brand machine)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 1.34% to 2.28% gain
      	netperf-tcp	- 0.45% to 1.22% gain
      	hackbench	- Small variances, very close to noise
      	sysbench	- Very small gains
      
      x86-64:	AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 1.83% to 10.42% gains
      	netperf-tcp	- Not conclusive until buffer >= PAGE_SIZE
      				4096	+15.83%
      				8192	+ 0.34% (not significant)
      				16384	+ 1%
      	hackbench	- Small gains, very close to noise
      	sysbench	- 0.79% to 1.6% gain
      
      ppc64:	PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 2-3% gain for almost all buffer sizes tested
      	netperf-tcp	- losses on small buffers, gains on larger buffers
      			  possibly indicates some bad caching effect.
      	hackbench	- No significant difference
      	sysbench	- 2-4% gain
      
      This patch:
      
      Currently the per-cpu page allocator searches the PCP list for pages of
      the correct migrate-type to reduce the possibility of pages being
      inappropriately placed from a fragmentation perspective.  This search is
      potentially expensive in a fast-path and undesirable.  Splitting the
      per-cpu list into multiple lists increases the size of a per-cpu structure
      and this was potentially a major problem at the time the search was
      introduced.  This problem has been mitigated as now only the necessary
      number of structures is allocated for the running system.
      
      This patch replaces a list search in the per-cpu allocator with one list
      per migrate type.  The potential snag with this approach is when bulk
      freeing pages.  We round-robin free pages based on migrate type which has
      little bearing on the cache hotness of the page and potentially checks
      empty lists repeatedly in the event the majority of PCP pages are of one
      type.
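      For illustration, the shape of the change to the per-cpu structure (a
      sketch with stand-in names; the real definition lives in
      include/linux/mmzone.h):

      #include <linux/list.h>

      #define EXAMPLE_MIGRATE_PCPTYPES 3      /* unmovable, reclaimable, movable */

      /* Previously a single "struct list_head list" that had to be searched
       * for a page of the wanted migratetype; now one list per type. */
      struct example_per_cpu_pages {
              int count;              /* pages across all the lists */
              int high;               /* high watermark, emptying needed */
              int batch;              /* chunk size for buddy add/remove */
              struct list_head lists[EXAMPLE_MIGRATE_PCPTYPES];
      };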
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f8dcc21
    • oom: move oom_adj value from task_struct to signal_struct · 28b83c51
      KOSAKI Motohiro authored
      Currently, the OOM logic call flow is as follows:
      
          __out_of_memory()
              select_bad_process()            for each task
                  badness()                   calculate badness of one task
                      oom_kill_process()      search child
                          oom_kill_task()     kill target task and mm shared tasks with it
      
      For example, process-A has two threads, thread-A and thread-B.  It uses a
      lot of memory, and each thread has the following oom_adj and oom_score:
      
           thread-A: oom_adj = OOM_DISABLE, oom_score = 0
           thread-B: oom_adj = 0,           oom_score = very-high
      
      Then select_bad_process() selects thread-B, but oom_kill_task() refuses to
      kill the task because thread-A has OOM_DISABLE.  Thus __out_of_memory()
      calls select_bad_process() again, but select_bad_process() selects the
      same task: the kernel falls into a livelock.
      
      The fact is, select_bad_process() must select a killable task, otherwise
      the OOM logic goes into a livelock.
      
      And the root cause is that oom_adj shouldn't be a per-thread value; it
      should be a per-process value because the OOM killer kills a process, not
      a thread.  Thus this patch moves oomkilladj (now more appropriately named
      oom_adj) from struct task_struct to struct signal_struct.  That naturally
      prevents select_bad_process() from choosing the wrong task.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28b83c51
    • mm: do batched scans for mem_cgroup · f8629631
      Wu Fengguang authored
      For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
      which case shrink_list() _still_ calls isolate_pages() with the much
      larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list scan
      rate by up to 32 times.
      
      For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
      So when shrink_zone() expects to scan 4 pages in the active/inactive list,
      the active list will be scanned 4 pages, while the inactive list will be
      (over) scanned SWAP_CLUSTER_MAX=32 pages in effect.  And that could break
      the balance between the two lists.
      
      It can further impact the scan of anon active list, due to the anon
      active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():
      
      inactive anon list over scanned => inactive_anon_is_low() == TRUE
                                      => shrink_active_list()
                                      => active anon list over scanned
      
      So the end result may be
      
      - anon inactive  => over scanned
      - anon active    => over scanned (maybe not as much)
      - file inactive  => over scanned
      - file active    => under scanned (relatively)
      
      The accesses to nr_saved_scan are not lock protected and so not 100%
      accurate; however, we can tolerate small errors and the resulting small
      imbalance in scan rates between zones.
      
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8629631
    • mm: also use alloc_large_system_hash() for the PID hash table · 2c85f51d
      Jan Beulich authored
      This is being done by allowing boot time allocations to specify that they
      may want a sub-page sized amount of memory.
      
      Overall this seems more consistent with the other hash table allocations,
      and allows making two supposedly mm-only variables really mm-only
      (nr_{kernel,all}_pages).
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c85f51d
    • mm: replace various uses of num_physpages by totalram_pages · 4481374c
      Jan Beulich authored
      Sizing of memory allocations shouldn't depend on the number of physical
      pages found in a system, as that generally includes (perhaps a huge amount
      of) non-RAM pages.  The amount of what actually is usable as storage
      should instead be used as a basis here.
      
      Some of the calculations (i.e.  those not intending to use high memory)
      should likely even use (totalram_pages - totalhigh_pages).
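      For illustration, the kind of substitution involved (a hypothetical
      sizing heuristic):

      #include <linux/mm.h>

      /* Hypothetical hash/cache sizing helper. */
      static unsigned long example_cache_entries(void)
      {
              /* old: num_physpages also counts (possibly huge) non-RAM ranges */
              /* return num_physpages / 64; */

              /* new: size from what is actually usable as storage */
              return totalram_pages / 64;
      }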
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4481374c
    • mm: return boolean from page_has_private() · edcf4748
      Johannes Weiner authored
      Make page_has_private() return a true boolean value and remove the double
      negations from the two callsites using it for arithmetic.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edcf4748
    • mm: return boolean from page_is_file_cache() · 6c0b1351
      Johannes Weiner authored
      page_is_file_cache() has been used for both boolean checks and LRU
      arithmetic, which was always a bit weird.
      
      Now that page_lru_base_type() exists for LRU arithmetic, make
      page_is_file_cache() a real predicate function and adjust the
      boolean-using callsites to drop those pesky double negations.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c0b1351
    • mm: introduce page_lru_base_type() · 401a8e1c
      Johannes Weiner authored
      Instead of abusing page_is_file_cache() for LRU list index arithmetic, add
      another helper with a more appropriate name and convert the non-boolean
      users of page_is_file_cache() accordingly.
      
      This new helper gives the LRU base type a page is supposed to live on,
      inactive anon or inactive file.
      
      [hugh.dickins@tiscali.co.uk: convert del_page_from_lru() also]
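      For illustration, a sketch of what the helper boils down to (the real one
      sits alongside page_is_file_cache()):

      #include <linux/mm_inline.h>

      /* The base LRU list - inactive file or inactive anon - for a page;
       * the active/unevictable variants are reached by offsetting from it. */
      static inline enum lru_list example_page_lru_base_type(struct page *page)
      {
              if (page_is_file_cache(page))
                      return LRU_INACTIVE_FILE;
              return LRU_INACTIVE_ANON;
      }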
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      401a8e1c
    • mm: remove broken 'kzalloc' mempool · bba78819
      Sage Weil authored
      The kzalloc mempool zeros items when they are initially allocated, but
      does not rezero used items that are returned to the pool.  Consequently
      mempool_alloc()s may return non-zeroed memory.
      
      Since there are/were only two in-tree users for
      mempool_create_kzalloc_pool(), and 'fixing' this in a way that will
      re-zero used (but not new) items before first use is non-trivial, just
      remove it.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bba78819
    • tracing, page-allocator: add trace event for page traffic related to the buddy lists · 0d3d062a
      Mel Gorman authored
      The page allocation trace event reports that a page was successfully
      allocated but it does not specify where it came from.  When analysing
      performance, it can be important to distinguish between pages coming from
      the per-cpu allocator and pages coming from the buddy lists as the latter
      requires the zone lock to be taken and more data structures to be
      examined.
      
      This patch adds a trace event for __rmqueue reporting when a page is being
      allocated from the buddy lists.  It distinguishes between being called to
      refill the per-cpu lists or whether it is a high-order allocation.
      Similarly, this patch adds an event to catch when the PCP lists are being
      drained a little and pages are going back to the buddy lists.
      
      This is trickier to draw conclusions from but high activity on those
      events could explain why there were a large number of cache misses on a
      page-allocator-intensive workload.  The coalescing and splitting of
      buddies involves a lot of writing of page metadata and cache line bounces
      not to mention the acquisition of an interrupt-safe lock necessary to
      enter this path.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d3d062a
    • tracing, page-allocator: add trace events for anti-fragmentation falling back to other migratetypes · e0fff1bd
      Mel Gorman authored
      Fragmentation avoidance depends on being able to use free pages from lists
      of the appropriate migrate type.  In the event this is not possible,
      __rmqueue_fallback() selects a different list and in some circumstances
      changes the migratetype of the pageblock.  Simplistically, the more times
      this event occurs, the more likely it is that fragmentation will be a
      problem later, for hugepage allocation at least, but there are other
      considerations such as the order of the page being split to satisfy the
      allocation.
      
      This patch adds a trace event for __rmqueue_fallback() that reports what
      page is being used for the fallback, the orders of relevant pages, the
      desired migratetype and the migratetype of the lists being used, whether
      the pageblock changed type and whether this event is important with
      respect to fragmentation avoidance or not.  This information can be used
      to help analyse fragmentation avoidance and help decide whether
      min_free_kbytes should be increased or not.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0fff1bd
    • tracing, page-allocator: add trace events for page allocation and page freeing · 4b4f278c
      Mel Gorman authored
      This patch adds trace events for the allocation and freeing of pages,
      including the freeing of pagevecs.  Using the events, it will be known
      what struct page and pfns are being allocated and freed and what the call
      site was in many cases.
      
      The page alloc tracepoints can be used as an indicator of whether the
      workload was heavily dependent on the page allocator or not.  You can make
      a guess based on vmstat but you can't get a per-process breakdown.
      Depending on the call path, the call_site for page allocation may be
      __get_free_pages() instead of a useful callsite.  Instead of passing down
      a return address similar to slab debugging, the user should enable the
      stacktrace and seg-addr options to get a proper stack trace.
      
      The pagevec free tracepoint has a different use case.  It can be used to
      get an idea of how many pages are being dumped off the LRU and whether it
      is kswapd doing the work or a process doing direct reclaim.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b4f278c
    • page-allocator: remove dead function free_cold_page() · 38a39857
      Mel Gorman authored
      The function free_cold_page() has no callers so delete it.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38a39857
    • ksm: unmerge is an origin of OOMs · 35451bee
      Hugh Dickins authored
      Just as the swapoff system call allocates many pages of RAM to various
      processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
      (unmerge) is liable to allocate many pages of RAM to various processes,
      perhaps triggering OOM; and each is normally run from a modest admin
      process (swapoff or shell), easily repeated until it succeeds.
      
      So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
      try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
      with that, to ask the OOM killer to kill them first, to prevent them from
      spawning more and more OOM kills.
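      For illustration, the bracketing pattern described above (a sketch; the
      KSM-internal function is declared here only to keep it self-contained):

      #include <linux/sched.h>

      int unmerge_and_remove_all_rmap_items(void);    /* KSM-internal */

      static int example_run_unmerge(void)
      {
              int err;

              current->flags |= PF_OOM_ORIGIN;        /* prefer us as OOM victim */
              err = unmerge_and_remove_all_rmap_items();
              current->flags &= ~PF_OOM_ORIGIN;
              return err;
      }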
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35451bee
    • ksm: clean up obsolete references · a913e182
      Hugh Dickins authored
      A few cleanups, given the munlock fix: the comment on ksm_test_exit() no
      longer applies, and it can be made private to ksm.c; there's no more
      reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a913e182
    • ksm: fix deadlock with munlock in exit_mmap · 1c2fb7a4
      Andrea Arcangeli authored
      Rawhide users have reported a hang at startup when cryptsetup is run: the
      same problem can be simply reproduced by running a program such as
      int main() { mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }
      
      The problem is that exit_mmap() applies munlock_vma_pages_all() to
      clean up VM_LOCKED areas, and its current implementation (stupidly)
      tries to fault in absent pages, for example where PROT_NONE prevented
      them being faulted in when mlocking.  Whereas the "ksm: fix oom
      deadlock" patch, knowing there's a race by which KSM might try to fault
      in pages after exit_mmap() had finally zapped the range, backs out of
      such faults doing nothing when its ksm_test_exit() notices mm_users 0.
      
      So revert that part of "ksm: fix oom deadlock" which moved the
      ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
      and remove those ksm_test_exit() checks from the page fault paths, so
      allowing the munlocking to proceed without interference.
      
      ksm_exit, if there are rmap_items still chained on this mm slot, takes
      mmap_sem write side: so preventing KSM from working on an mm while
      exit_mmap runs.  And KSM will bail out as soon as it notices that
      mm_users is already zero, thanks to its internal ksm_test_exit checks.
      So that when a task is killed by OOM killer or the user, KSM will not
      indefinitely prevent it from running exit_mmap to release its memory.
      
      This does break a part of what "ksm: fix oom deadlock" was trying to
      achieve.  When unmerging KSM (echo 2 >/sys/kernel/mm/ksm/run), and even
      when ksmd itself has to cancel a KSM page, it is possible that the
      first OOM-kill victim would be the KSM process being faulted: then its
      memory won't be freed until a second victim has been selected (freeing
      memory for the unmerging fault to complete).
      
      But the OOM killer is already liable to kill a second victim once the
      intended victim's p->mm goes to NULL: so there's not much point in
      rejecting this KSM patch before fixing that OOM behaviour.  It is very
      much more important to allow KSM users to boot up, than to haggle over
      an unlikely and poorly supported OOM case.
      
      We also intend to fix munlocking to not fault pages: at which point
      this patch _could_ be reverted; though that would be controversial, so
      we hope to find a better solution.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Justin M. Forbes <jforbes@redhat.com>
      Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c2fb7a4
    • ksm: fix oom deadlock · 9ba69294
      Hugh Dickins authored
      There's a now-obvious deadlock in KSM's out-of-memory handling:
      imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
      trying to allocate a page to break KSM in an mm which becomes the
      OOM victim (quite likely in the unmerge case): it's killed and goes
      to exit, and hangs there waiting to acquire ksm_thread_mutex.
      
      Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
      though that made everything else: perhaps use mmap_sem somehow?
      And part of the answer lies in the comments on unmerge_ksm_pages:
      __ksm_exit should also leave all the rmap_item removal to ksmd.
      
      But there's a fundamental problem, that KSM relies upon mmap_sem to
      guarantee the consistency of the mm it's dealing with, yet exit_mmap
      tears down an mm without taking mmap_sem.  And bumping mm_users won't
      help at all, that just ensures that the pages the OOM killer assumes
      are on their way to being freed will not be freed.
      
      The best answer seems to be, to move the ksm_exit callout from just
      before exit_mmap, to the middle of exit_mmap: after the mm's pages
      have been freed (if the mmu_gather is flushed), but before its page
      tables and vma structures have been freed; and down_write,up_write
      mmap_sem there to serialize with KSM's own reliance on mmap_sem.
      
      But KSM then needs to be careful, whenever it downs mmap_sem, to
      check that the mm is not already exiting: there's a danger of using
      find_vma on a layout that's being torn apart, or writing into page
      tables which have been freed for reuse; and even do_anonymous_page
      and __do_fault need to check they're not being called by break_ksm
      to reinstate a pte after zap_pte_range has zapped that page table.
      
      Though it might be clearer to add an exiting flag, set while holding
      mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
      a zapped pte.  All we need is to check whether mm_users is 0 - but
      must remember that ksmd may detect that before __ksm_exit is reached.
      So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.
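      The check itself boils down to something like this (a sketch):

      #include <linux/mm_types.h>
      #include <linux/sched.h>

      /* Has this mm already begun exit_mmap()?  mm_users drops to 0 first. */
      static inline bool ksm_test_exit(struct mm_struct *mm)
      {
              return atomic_read(&mm->mm_users) == 0;
      }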
      
      __ksm_exit now has to leave clearing up the rmap_items to ksmd, which
      needs ksm_thread_mutex; but shift the exiting mm just after the
      ksm_scan cursor so that it will soon be dealt with.  __ksm_enter raises
      mm_count to hold the mm_struct; ksmd's exit processing (exactly like
      its processing when it finds all VM_MERGEABLEs unmapped) mmdrops it,
      with a similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).
      
      But also give __ksm_exit a fast path: when there's no complication
      (no rmap_items attached to mm and it's not at the ksm_scan cursor),
      it can safely do all the exiting work itself.  This is not just an
      optimization: when ksmd is not running, the raised mm_count would
      otherwise leak mm_structs.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ba69294
    • ksm: identify PageKsm pages · 9a840895
      Hugh Dickins authored
      KSM will need to identify its kernel merged pages unambiguously, and
      /proc/kpageflags will probably like to do so too.
      
      Since KSM will only be substituting anonymous pages, statistics are best
      preserved by making a PageKsm page a special PageAnon page: one with no
      anon_vma.
      
      But KSM then needs its own page_add_ksm_rmap() - keep it in ksm.h near
      PageKsm; and do_wp_page() must COW them, unlike singly mapped PageAnons.
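      In code terms that amounts to roughly the following (a sketch):

      #include <linux/mm.h>

      /* A KSM page is a PageAnon page with no anon_vma attached: its mapping
       * field carries only the PAGE_MAPPING_ANON flag bit and nothing else. */
      static inline int example_PageKsm(struct page *page)
      {
              return (unsigned long)page->mapping == PAGE_MAPPING_ANON;
      }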
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a840895
    • ksm: no debug in page_dup_rmap() · 21333b2b
      Hugh Dickins authored
      page_dup_rmap(), used on each mapped page when forking, was originally
      just an inline atomic_inc of mapcount.  2.6.22 added CONFIG_DEBUG_VM
      out-of-line checks to it, which would need to be ever-so-slightly
      complicated to allow for the PageKsm() we're about to define.
      
      But I think these checks never caught anything.  And if it's coding errors
      we're worried about, such checks should be in page_remove_rmap() too, not
      just when forking; whereas if it's pagetable corruption we're worried
      about, then they shouldn't be limited to CONFIG_DEBUG_VM.
      
      Oh, just revert page_dup_rmap() to an inline atomic_inc of mapcount.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21333b2b
    • ksm: the mm interface to ksm · f8af4da3
      Hugh Dickins authored
      This patch presents the mm interface to a dummy version of ksm.c, for
      better scrutiny of that interface: the real ksm.c follows later.
      
      When CONFIG_KSM is not set, madvise(2) rejects MADV_MERGEABLE and
      MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
      pretending that they can be serviced.  But when CONFIG_KSM=y, accept them
      even if KSM is not currently running, and even on areas which KSM will not
      touch (e.g.  hugetlb or shared file or special driver mappings).
      
      Like other madvices, report ENOMEM despite success if any area in the
      range is unmapped, and use EAGAIN to report out of memory.
      
      Define vma flag VM_MERGEABLE to identify an area on which KSM may try
      merging pages: leave it to ksm_madvise() to decide whether to set it.
      Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
      VM_MERGEABLE areas, to minimize callouts when forking or exiting.
      
      Based upon earlier patches by Chris Wright and Izik Eidus.
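      For illustration, how an application would opt a region in and out
      (userspace sketch; the fallback MADV values are the asm-generic ones):

      #include <stdio.h>
      #include <sys/mman.h>

      #ifndef MADV_MERGEABLE
      #define MADV_MERGEABLE   12
      #define MADV_UNMERGEABLE 13
      #endif

      int main(void)
      {
              size_t len = 16 * 4096;
              void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

              if (buf == MAP_FAILED)
                      return 1;
              if (madvise(buf, len, MADV_MERGEABLE))          /* let KSM merge here */
                      perror("madvise(MADV_MERGEABLE)");      /* EINVAL if !CONFIG_KSM */
              /* ... use buf ... */
              madvise(buf, len, MADV_UNMERGEABLE);            /* undo the hint */
              munmap(buf, len);
              return 0;
      }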
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8af4da3
    • ksm: define MADV_MERGEABLE and MADV_UNMERGEABLE · d19f3524
      Hugh Dickins authored
      The out-of-tree KSM used ioctls on fds cloned from /dev/ksm to register a
      memory area for merging: we prefer now to use an madvise(2) interface.
      
      This patch just defines MADV_MERGEABLE (to tell KSM it may merge pages in
      this area found identical to pages in other mergeable areas) and
      MADV_UNMERGEABLE (to undo that).
      
      Most architectures use asm-generic, but alpha, mips, parisc, xtensa need
      their own definitions: included here for mmotm convenience, but we'll
      probably want to split this and feed pieces to arch maintainers.
      
      Based upon earlier patches by Chris Wright and Izik Eidus.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d19f3524
    • ksm: add mmu_notifier set_pte_at_notify() · 828502d3
      Izik Eidus authored
      KSM is a Linux driver that allows dynamically sharing identical memory
      pages between one or more processes.
      
      Unlike traditional page sharing that is done when the memory is allocated,
      KSM does it dynamically after the memory has been created.  Memory is
      periodically scanned; identical pages are identified and merged.
      
      The sharing is transparent to the processes that use it.
      
      KSM is highly important for hypervisors (KVM), where in production
      environments there might be many copies of the same data in host memory.
      This kind of data can be similar kernels, libraries, caches, and so on.
      
      Even though KSM was written for KVM, any userspace application that wants
      to use it to share its data can try it.
      
      KSM may be useful for any application that has similar (page aligned) data
      structures in memory: KSM will find this data and merge it into one copy,
      and even if it is later changed and therefore copied on write, KSM will
      merge it again as soon as it becomes identical again.
      
      Another reason to consider using KSM is that it can greatly simplify the
      userspace code of applications that want to share private data: instead of
      the application managing a shared area, KSM does it for the application,
      and even writes to this data are allowed without any synchronization on
      the application's part.
      
      KSM was designed to be a loadable module that doesn't change the VM code
      of Linux.
      
      This patch:
      
      The set_pte_at_notify() macro allows setting a pte in the shadow page
      table directly, instead of flushing the shadow page table entry and then
      getting vmexit to set it.  It uses a new change_pte() callback to do so.
      
      set_pte_at_notify() is an optimization for kvm, and other users of
      mmu_notifiers, for COW pages.  It is useful for kvm when ksm is used,
      because it allows kvm not to have to receive vmexit and only then map the
      ksm page into the shadow page table, but instead map it directly at the
      same time as Linux maps the page into the host page table.
      
      Users of mmu_notifiers who don't implement new mmu_notifier_change_pte()
      callback will just receive the mmu_notifier_invalidate_page() callback.
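      For illustration, the shape of the new hook a secondary-MMU user would
      implement (a sketch; the signature is an assumption and the handler body
      is hypothetical):

      #include <linux/mmu_notifier.h>

      /* Re-point the secondary (shadow) mapping at the new pte in place,
       * instead of tearing it down and waiting for the next fault/vmexit. */
      static void example_change_pte(struct mmu_notifier *mn, struct mm_struct *mm,
                                     unsigned long address, pte_t pte)
      {
              /* update the shadow page table entry for 'address' here */
      }

      static const struct mmu_notifier_ops example_mmu_notifier_ops = {
              /* other callbacks omitted; without .change_pte the core falls
               * back to mmu_notifier_invalidate_page() as described above. */
              .change_pte = example_change_pte,
      };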
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Signed-off-by: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      828502d3
    • mm: perform non-atomic test-clear of PG_mlocked on free · 451ea25d
      Johannes Weiner authored
      By the time PG_mlocked is cleared in the page freeing path, nobody else is
      looking at our page->flags anymore.
      
      It is thus safe to make the test-and-clear non-atomic and thereby remove
      an unnecessary and expensive operation from a hot path.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      451ea25d
    • mm: count only reclaimable lru pages · adea02a1
      Wu Fengguang authored
      global_lru_pages() / zone_lru_pages() can be used in two ways:
      - to estimate max reclaimable pages in determine_dirtyable_memory()
      - to calculate the slab scan ratio
      
      When swap is full or not present, the anon lru lists are not reclaimable
      and also won't be scanned.  So the anon pages shall not be counted in
      either usage scenario.  Also rename to _reclaimable_pages: now they count
      the possibly reclaimable lru pages.
      
      It can greatly (and correctly) increase the slab scan rate under high
      memory pressure (when most file pages have been reclaimed and swap is
      full/absent), thus reduce false OOM kills.
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Li, Ming Chun" <macli@brc.ubc.ca>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      adea02a1
    • mm: remove __{add,sub}_zone_page_state() · 5a2ae913
      KOSAKI Motohiro authored
      __add_zone_page_state() and __sub_zone_page_state() are unused.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a2ae913
    • mm: vmstat: add isolate pages · a731286d
      KOSAKI Motohiro authored
      If the system is running a heavy load of processes then concurrent reclaim
      can isolate a large number of pages from the LRU. /proc/vmstat and the
      output generated for an OOM do not show how many pages were isolated.
      
      This has been observed during process fork bomb testing (mstctl11 in LTP).
      
      This patch shows the information about isolated pages.
      
      Reproduced via:
      
      -----------------------
      % ./hackbench 140 process 1000
         => OOM occur
      
      active_anon:146 inactive_anon:0 isolated_anon:49245
       active_file:79 inactive_file:18 isolated_file:113
       unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
       free:370 slab_reclaimable:309 slab_unreclaimable:5492
       mapped:53 shmem:15 pagetables:28140 bounce:0
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a731286d