1. 11 Sep, 2013 3 commits
    • swap: make cluster allocation per-cpu · ebc2a1a6
      Shaohua Li authored
      Swap cluster allocation exists to get better request merging and thus
      better performance.  But the cluster is shared globally, so if multiple
      tasks are swapping, their disk accesses interleave.  Multiple tasks
      swapping at once is quite common: for example, each NUMA node has a
      kswapd thread doing swap, and multiple threads/processes do direct page
      reclaim.
      
      The I/O scheduler can't help much here, because the tasks don't send
      swapout IO down to the block layer at the same time.  The block layer
      does merge some IOs, but many are not merged, depending on how many
      tasks are doing swapout concurrently.  In practice, I've seen a lot of
      small IOs in swapout workloads.
      
      We make the cluster allocation per-cpu here.  The interleaved disk
      access issue goes away.  All tasks swap out to their own (per-cpu)
      cluster, so swapout becomes sequential and can easily be merged into
      large IOs.  If a CPU can't get a per-cpu cluster (for example, there
      are no free clusters left in the swap area), it falls back to scanning
      swap_map, so the CPU can still swap.  We don't need to recycle free
      swap entries of other CPUs.
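The per-cpu scheme above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel code: `NR_CLUSTERS`, `NR_CPUS`, `struct percpu_cluster`, `next_free_cluster` and `alloc_slot()` are hypothetical names, and the free cluster list is reduced to a simple counter.

```c
#include <assert.h>

#define CLUSTER_PAGES 256   /* cluster size used throughout this series */
#define NR_CLUSTERS   4     /* hypothetical, tiny swap area */
#define NR_CPUS       2     /* hypothetical CPU count */
#define NO_CLUSTER    (-1)

struct percpu_cluster {
    int cluster;   /* cluster currently owned by this CPU, or NO_CLUSTER */
    int next;      /* next unused page offset inside that cluster */
};

static struct percpu_cluster cpu_cluster[NR_CPUS] = {
    { NO_CLUSTER, 0 }, { NO_CLUSTER, 0 }
};
static int next_free_cluster;   /* stand-in for the free cluster list */

/*
 * Return a swap slot for `cpu`, or -1 when no free cluster is left and the
 * caller would have to fall back to scanning swap_map.
 */
static long alloc_slot(int cpu)
{
    struct percpu_cluster *pc;

    assert(cpu >= 0 && cpu < NR_CPUS);
    pc = &cpu_cluster[cpu];

    if (pc->cluster == NO_CLUSTER || pc->next == CLUSTER_PAGES) {
        if (next_free_cluster == NR_CLUSTERS) {
            pc->cluster = NO_CLUSTER;
            return -1;
        }
        pc->cluster = next_free_cluster++;
        pc->next = 0;
    }
    return (long)pc->cluster * CLUSTER_PAGES + pc->next++;
}
```

Interleaved calls from two CPUs yield two independent sequential runs (slots 0, 1, ... for one CPU and 256, 257, ... for the other) instead of one interleaved run.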
      
      In my test (swap to a 2-disk raid0 partition), this improves around 10%
      swapout throughput, and request size is increased significantly.
      
      How this impacts swap readahead is uncertain, though.  On one hand,
      page reclaim always isolates and swaps several adjacent pages, so page
      reclaim writes those pages sequentially, which benefits readahead.  On
      the other hand, when several CPUs write pages interleaved, the pages
      don't live _sequentially_ but relatively _near_ each other.  In the
      per-cpu allocation case, if adjacent pages are written by different
      CPUs, they will live relatively _far_ apart.  So how this impacts swap
      readahead depends on how many pages page reclaim isolates and swaps at
      one time.  If the number is big, this patch will benefit swap
      readahead.  Of course, this is about sequential access patterns.  The
      patch has no impact on random access patterns, because the new cluster
      allocation algorithm is just for SSD.
      
      An alternative solution is to organize the swap layout per-mm instead
      of per-cpu.  In the per-mm layout, we allocate a disk range for each
      mm, so pages of one mm live adjacently on the swap disk.  The per-mm
      layout has potential lock contention issues if multiple reclaimers are
      swapping pages from one mm.  For a sequential workload, the per-mm
      layout is better for swap readahead, because pages from the mm are
      adjacent on disk.  But the per-cpu layout isn't very bad for this
      workload, as page reclaim always isolates and swaps several pages at
      one time; such pages will still live on disk sequentially and
      readahead can utilize this.  For a random workload, the per-mm layout
      doesn't help request merging, because it's quite possible that pages
      from different mms are swapped out at the same time, and such IO can't
      be merged in the per-mm layout, while with the per-cpu layout we can
      merge requests from any mm.  Considering that random workloads are
      more common among workloads that swap (and the per-cpu approach isn't
      too bad for sequential workloads either), I'm choosing the per-cpu
      layout.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: make swap discard async · 815c2c54
      Shaohua Li authored
      Swap can do cluster discard for SSD, which is good, but there are some
      problems here:
      
      1. Swap does the discard just before page reclaim gets a swap entry and
         writes the disk sectors.  This is useless for high-end SSDs, because
         an overwrite of a sector implies a discard of the original sector
         too: discard + overwrite == overwrite.
      
      2. The purpose of doing discard is to improve SSD firmware garbage
         collection.  Ideally we should send discard as early as possible, so
         firmware can do something smart.  Sending discard just after a swap
         entry is freed is considered early compared to sending discard
         before write.  Of course, if the workload is already bound by gc
         speed, sending discard earlier or later doesn't make a difference.
      
      3. Block discard is a sync API, which will delay scan_swap_map()
         significantly.
      
      4. Write and discard commands can be executed in parallel on PCIe SSDs.
         Making swap discard async can make execution more efficient.
      
      This patch makes swap discard async and moves the discard to where a
      swap entry is freed.  Discard and write no longer depend on each other,
      so the above issues are avoided.  Ideally we should discard any freed
      sectors, but discard is very slow on some SSDs, so this patch still
      discards a whole cluster at a time.
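The deferred, whole-cluster discard described above might look roughly like the following user-space sketch. It is an illustration only: the arrays, `free_entry()` and `discard_worker()` are hypothetical names, the "worker" is called by hand rather than scheduled, and the actual block-layer discard is replaced by a counter.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CLUSTERS 4   /* hypothetical, tiny swap area */

static unsigned int used[NR_CLUSTERS];      /* in-use entries per cluster */
static bool pending_discard[NR_CLUSTERS];   /* queued for the worker */
static bool is_free[NR_CLUSTERS];           /* reusable by the allocator */
static int discards_issued;                 /* stand-in for real discards */

static void alloc_entry(int c)
{
    is_free[c] = false;
    used[c]++;
}

/* Free path: never waits for the device, it only marks the cluster. */
static void free_entry(int c)
{
    assert(used[c] > 0);
    if (--used[c] == 0)
        pending_discard[c] = true;   /* a kernel would schedule work here */
}

/* Deferred worker: one discard per whole cluster, then make it reusable. */
static void discard_worker(void)
{
    for (int c = 0; c < NR_CLUSTERS; c++) {
        if (!pending_discard[c])
            continue;
        discards_issued++;
        pending_discard[c] = false;
        is_free[c] = true;
    }
}
```

The point of the split is that the allocation path never issues a discard itself; a cluster becomes allocatable again only after the worker has processed it.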
      
      My test does several rounds of 'mmap, write, unmap', which triggers a
      lot of swap discard.  On a fusionio card, with this patch, the test
      runtime is reduced to 18% of the time without it, so around 5.5x faster.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: change block allocation algorithm for SSD · 2a8f9449
      Shaohua Li authored
      I'm using a fast SSD to do swap.  scan_swap_map() sometimes uses up to
      20~30% CPU time (when a cluster is hard to find, the CPU time can be up
      to 80%), which becomes a bottleneck.  scan_swap_map() scans a byte array
      to search for a 256-page cluster, which is very slow.
      
      Here I introduce a simple algorithm to search for a cluster.  Since we
      only care about 256-page clusters, we can just use a counter to track
      whether a cluster is free.  Every 256 pages use one int to store the
      counter.  If the counter of a cluster is 0, the cluster is free.  All
      free clusters are added to a list, so searching for a cluster is very
      efficient.  With this, the scan_swap_map() overhead disappears.
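As a rough illustration of the counter-plus-list idea, the following user-space sketch keeps one usage counter per 256-page cluster and a singly linked free list with tail insertion. All names (`count`, `next_in_list`, `init_clusters()`, `page_allocated()`, `page_freed()`) are hypothetical, and the data layout is an assumption rather than the kernel's.

```c
#include <assert.h>

#define CLUSTER_PAGES 256
#define NR_CLUSTERS   8     /* hypothetical, tiny swap area */
#define NIL           (-1)

static unsigned int count[NR_CLUSTERS];   /* used pages; 0 means free */
static int next_in_list[NR_CLUSTERS];     /* free-list links */
static int head = NIL, tail = NIL;

static void list_add_tail(int c)
{
    next_in_list[c] = NIL;
    if (tail == NIL)
        head = c;
    else
        next_in_list[tail] = c;
    tail = c;
}

/* O(1) "search": take the first free cluster, or NIL if none is left. */
static int list_pop_head(void)
{
    int c = head;

    if (c != NIL) {
        head = next_in_list[c];
        if (head == NIL)
            tail = NIL;
    }
    return c;
}

static void init_clusters(void)
{
    head = tail = NIL;
    for (int c = 0; c < NR_CLUSTERS; c++) {
        count[c] = 0;
        list_add_tail(c);
    }
}

static void page_allocated(unsigned long offset)
{
    count[offset / CLUSTER_PAGES]++;
}

static void page_freed(unsigned long offset)
{
    int c = (int)(offset / CLUSTER_PAGES);

    assert(count[c] > 0);
    if (--count[c] == 0)
        list_add_tail(c);   /* a fully free cluster goes back to the tail */
}
```

Finding a free cluster is a constant-time list pop instead of a scan over the byte array.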
      
      This might help low-end SD card swap too, because if the cluster is
      aligned, SD firmware can do flash erase more efficiently.
      
      We only enable the algorithm for SSD.  Hard disk swap isn't fast enough
      to need it, and the algorithm has downsides there which might introduce
      a regression (see below).
      
      The patch slightly changes which cluster is chosen.  It always adds a
      free cluster to the list tail.  This can help wear leveling for low-end
      SSDs too.  And if no cluster is found, scan_swap_map() will search from
      the end of the last free cluster, which is random.  For SSD, this isn't
      a problem at all.
      
      Another downside is that the cluster must be aligned to 256 pages, which
      reduces the chance of finding a cluster.  I expect this isn't a big
      problem for SSD because there is no seek penalty.  (And this is the
      reason I only enable the algorithm for SSD.)
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 03 Jul, 2013 2 commits
    • swap: discard while swapping only if SWAP_FLAG_DISCARD_PAGES · dcf6b7dd
      Rafael Aquini authored
      Considering the use cases where the swap device supports discard:
      a) and can do it quickly;
      b) but it's slow to do in small granularities (or concurrent with other
         I/O);
      c) but the implementation is so horrendous that you don't even want to
         send one down;
      
      And assuming that the sysadmin considers it useful to send the discards down
      at all, we would (probably) want the following solutions:
      
        i. do the fine-grained discards for freed swap pages, if device is
           capable of doing so optimally;
       ii. do single-time (batched) swap area discards, either at swapon
           or via something like fstrim (not implemented yet);
      iii. allow doing both single-time and fine-grained discards; or
       iv. turn it off completely (default behavior)
      
      As implemented today, one can only enable/disable discards for swap, but
      one cannot select, for instance, solution (ii) on a swap device like (b),
      even though the single-time discard is regarded as interesting, or
      necessary to the workload, because enabling it would imply (i), and the
      device is not capable of performing that optimally.
      
      This patch addresses the scenario depicted above by introducing a way to
      ensure the (probably) wanted solutions (i, ii, iii and iv) can be flexibly
      flagged through swapon(8), allowing a sysadmin to select the most
      suitable swap discard policy according to system constraints.
      
      This patch introduces the new flags SWAP_FLAG_DISCARD_PAGES and
      SWAP_FLAG_DISCARD_ONCE to allow more flexible swap discard policies to
      be flagged through swapon(8).  The default behavior when discards are
      enabled is to keep both single-time, or batched, area discards
      (SWAP_FLAG_DISCARD_ONCE) and fine-grained discards for page-clusters
      (SWAP_FLAG_DISCARD_PAGES) enabled, in order to keep consistency with
      older kernel behavior, as well as maintain compatibility with older
      swapon(8).  However, through the newly introduced flags the most
      suitable discard policy can be selected according to any given swap
      device constraint.
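One way to picture how the flags map onto policies (i) to (iv) is the sketch below. It is an assumption-laden illustration: the numeric flag values are given as I recall them from include/linux/swap.h and should be checked against the headers, the `POLICY_*` bits and `discard_policy()` are hypothetical names, and the precedence of ONCE over PAGES when both are passed is likewise an assumption.

```c
#include <assert.h>

/* swapon(2) flag values as recalled; verify against the kernel headers. */
#define SWAP_FLAG_DISCARD        0x10000   /* enable discard for swap */
#define SWAP_FLAG_DISCARD_ONCE   0x20000   /* single-time area discard */
#define SWAP_FLAG_DISCARD_PAGES  0x40000   /* discard freed page-clusters */

/* Hypothetical policy bits used only by this sketch. */
#define POLICY_AREA_DISCARD 0x1
#define POLICY_PAGE_DISCARD 0x2

static unsigned int discard_policy(unsigned int swap_flags)
{
    unsigned int policy;

    if (!(swap_flags & SWAP_FLAG_DISCARD))
        return 0;                                   /* (iv) off */

    /* (iii) both kinds, matching the older behavior */
    policy = POLICY_AREA_DISCARD | POLICY_PAGE_DISCARD;

    if (swap_flags & SWAP_FLAG_DISCARD_ONCE)
        policy &= ~POLICY_PAGE_DISCARD;             /* (ii) area only */
    else if (swap_flags & SWAP_FLAG_DISCARD_PAGES)
        policy &= ~POLICY_AREA_DISCARD;             /* (i) pages only */

    return policy;
}
```

In this sketch the two new flags only narrow the policy, so an older swapon(8) that passes SWAP_FLAG_DISCARD alone keeps getting both kinds of discard.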
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Karel Zak <kzak@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove lru parameter from __lru_cache_add and lru_cache_add_lru · c53954a0
      Mel Gorman authored
      Similar to __pagevec_lru_add, this patch removes the LRU parameter from
      __lru_cache_add and lru_cache_add_lru as the caller does not control the
      exact LRU the page gets added to.  lru_cache_add_lru gets renamed to
      lru_cache_add, as the name is silly without the lru parameter.  With the
      parameter removed, the caller is required to indicate whether the page
      should be added to the active or inactive list by setting or clearing
      PageActive respectively.
      
      [akpm@linux-foundation.org: Suggested the patch]
      [gang.chen@asianux.com: fix used-uninitialized warning]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 29 Apr, 2013 3 commits
  4. 23 Feb, 2013 5 commits
    • vmscan: change type of vm_total_pages to unsigned long · b21e0b90
      Zhang Yanfei authored
      This variable is calculated from nr_free_pagecache_pages so
      change its type to unsigned long.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix return type for functions nr_free_*_pages · ebec3862
      Zhang Yanfei authored
      Currently, the amount of RAM that functions nr_free_*_pages return is
      held in unsigned int.  But in machines with big memory (exceeding 16TB),
      the amount may be incorrect because of overflow, so fix it.
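The 16TB threshold follows from simple arithmetic: with 4 KiB pages, 2^32 pages cover exactly 16 TiB, so a 32-bit page count wraps at that size. A small sketch, assuming 4 KiB pages and a 32-bit unsigned int (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumes 4 KiB pages */

/* Page count squeezed into 32 bits, as with an unsigned int return type. */
static uint32_t pages_as_u32(uint64_t bytes)
{
    return (uint32_t)(bytes >> PAGE_SHIFT);
}

/* Page count kept in 64 bits, as with unsigned long on 64-bit kernels. */
static uint64_t pages_as_u64(uint64_t bytes)
{
    return bytes >> PAGE_SHIFT;
}
```

For 16 TiB of memory the 32-bit count wraps to 0, while one page less than 16 TiB still fits as 4294967295 pages.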
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: add per-partition lock for swapfile · ec8acf20
      Shaohua Li authored
      swap_lock is heavily contended when I test swap to 3 fast SSDs (it is
      even slightly slower than swap to 2 such SSDs).  The main contention
      comes from swap_info_get().  This patch tries to close the gap by
      adding a new per-partition lock.
      
      Global data like nr_swapfiles, total_swap_pages, least_priority and
      swap_list are still protected by swap_lock.
      
      nr_swap_pages is now an atomic, so it can be changed without swap_lock.
      In theory, it's possible that get_swap_page() finds no swap pages when
      there actually are free swap pages.  But that doesn't sound like a big
      problem.
      
      Accessing partition specific data (like scan_swap_map and so on) is only
      protected by swap_info_struct.lock.
      
      Changing swap_info_struct.flags requires holding both swap_lock and
      swap_info_struct.lock, because scan_swap_map() will check it.  Reading
      the flags is ok with either of the locks held.
      
      If both swap_lock and swap_info_struct.lock must be held, we always
      take the former first to avoid deadlock.
      
      swap_entry_free() can change swap_list.  To delete that code, we add a
      new highest_priority_index.  Whenever get_swap_page() is called, we
      check it.  If it's valid, we use it.
      
      It's a pity get_swap_page() still holds swap_lock.  But in practice,
      swap_lock isn't heavily contended in my test with this patch (or I can
      say there are other, much heavier bottlenecks like TLB flush).  And
      BTW, it looks like get_swap_page() doesn't really need the lock.  We
      never free swap_info[] and we check the SWP_WRITEOK flag.  The only
      risk without the lock is that we could swap out to some low-priority
      swap, but we can quickly recover after several rounds of swap, so it
      doesn't sound like a big deal to me.  But I'd prefer to fix this if
      it's a real problem.
      
      "swap: make each swap partition have one address_space" improved the
      swapout speed from 1.7G/s to 2G/s.  This patch further improves the
      speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
      so TLB flush isn't the biggest bottleneck before the patches.
      
      [arnd@arndb.de: fix it for nommu]
      [hughd@google.com: add missing unlock]
      [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: make each swap partition have one address_space · 33806f06
      Shaohua Li authored
      When I use several fast SSD to do swap, swapper_space.tree_lock is
      heavily contended.  This makes each swap partition have one
      address_space to reduce the lock contention.  There is an array of
      address_space for swap.  The swap entry type is the index to the array.
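The "array indexed by swap entry type" idea can be illustrated as follows. This is a sketch under assumptions: the bit split of the entry value, `MAX_SWAPFILES_SKETCH`, `struct address_space_sketch` and the helper names are all hypothetical and do not reproduce the kernel's actual encoding.

```c
#include <assert.h>

#define MAX_SWAPFILES_SKETCH 32   /* hypothetical number of swap areas */
#define SWP_OFFSET_BITS      27   /* hypothetical split of the entry value */

/* Stand-in for an address_space with its own lock and radix tree. */
struct address_space_sketch {
    int tree_lock_placeholder;
    unsigned long nrpages;
};

static struct address_space_sketch swapper_spaces[MAX_SWAPFILES_SKETCH];

static unsigned long make_entry(unsigned int type, unsigned long offset)
{
    return ((unsigned long)type << SWP_OFFSET_BITS) | offset;
}

static unsigned int entry_type(unsigned long entry)
{
    return (unsigned int)(entry >> SWP_OFFSET_BITS);
}

/* Each swap area gets its own address_space, selected by entry type. */
static struct address_space_sketch *swap_space_of(unsigned long entry)
{
    unsigned int type = entry_type(entry);

    assert(type < MAX_SWAPFILES_SKETCH);
    return &swapper_spaces[type];
}
```

Entries that belong to different swap areas resolve to different structures, so their locks no longer contend with each other.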
      
      In my test with 3 SSD, this increases the swapout throughput 20%.
      
      [akpm@linux-foundation.org: revert unneeded change to  __add_to_swap_cache]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: save work scanning (almost) empty LRU lists · d778df51
      Johannes Weiner authored
      In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
      amount of pages is scanned from the LRU lists on each iteration, to make
      progress.
      
      Do not make this minimum bigger than the respective LRU list size,
      however, and save some busy work trying to isolate and reclaim pages
      that are not there.
      
      Empty LRU lists are quite common with memory cgroups in NUMA
      environments because there exists a set of LRU lists for each zone for
      each memory cgroup, while the memory of a single cgroup is expected to
      stay on just one node.  The number of expected empty LRU lists is thus
      
        memcgs * (nodes - 1) * lru types
      
      Each attempt to reclaim from an empty LRU list does expensive size
      comparisons between lists, acquires the zone's lru lock etc.  Avoid
      that.
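Both points above reduce to small calculations, sketched here with hypothetical helper names: the minimum scan target is clamped to the list size, and the expected number of empty lists follows the formula quoted in the message.

```c
#include <assert.h>

/* Never ask to scan more pages than the LRU list actually holds. */
static unsigned long scan_target(unsigned long lru_size,
                                 unsigned long min_scan)
{
    return min_scan < lru_size ? min_scan : lru_size;
}

/* memcgs * (nodes - 1) * lru types, as given in the commit message. */
static unsigned long expected_empty_lists(unsigned long memcgs,
                                          unsigned long nodes,
                                          unsigned long lru_types)
{
    assert(nodes >= 1);
    return memcgs * (nodes - 1) * lru_types;
}
```

With an empty list the clamped target is 0, so no isolation work is attempted for it. As an example of the formula, 100 memcgs on a 4-node machine with 4 LRU types per set would give 1200 expected empty lists.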
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 09 Oct, 2012 1 commit
  6. 31 Jul, 2012 4 commits
    • mm: swap: implement generic handler for swap_activate · a509bc1a
      Mel Gorman authored
      The version of swap_activate introduced is sufficient for swap-over-NFS
      but would not provide enough information to implement a generic handler.
      This patch shuffles things slightly to ensure the same information is
      available for aops->swap_activate() as is available to the core.
      
      No functionality change.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages · 62c230bc
      Mel Gorman authored
      Currently swapfiles are managed entirely by the core VM by using ->bmap to
      allocate space and write to the blocks directly.  This effectively ensures
      that the underlying blocks are allocated and avoids the need for the swap
      subsystem to locate what physical blocks store offsets within a file.
      
      If the swap subsystem is to use the filesystem information to locate the
      blocks, it is critical that information such as block groups, block
      bitmaps and the block descriptor table that map the swap file is
      resident in memory.  This patch adds address_space_operations that the VM
      can call when activating or deactivating swap backed by a file.
      
        int swap_activate(struct file *);
        int swap_deactivate(struct file *);
      
      The ->swap_activate() method is used to communicate to the file that the
      VM relies on it, and the address_space should take adequate measures such
      as reserving space in the underlying device, reserving memory for mempools
      and pinning information such as the block descriptor table in memory.  The
      ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate()
      returned success.
      
      After a successful swapfile ->swap_activate, the swapfile is marked
      SWP_FILE and swapper_space.a_ops will proxy to
      sis->swap_file->f_mapping->a_ops, using ->direct_IO to write swapcache
      pages and ->readpage to read.
      
      It is perfectly possible for direct_IO to be used to read the swap pages
      but it is an unnecessary complication.  Similarly, it is possible for
      ->writepage to be used instead of direct_IO to write the pages, but
      filesystem developers have stated that calling writepage from the VM is
      undesirable for a variety of reasons, and using direct_IO opens up the
      possibility of writing back batches of swap pages in the future.
      
      [a.p.zijlstra@chello.nl: Original patch]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: methods for teaching filesystems about PG_swapcache pages · f981c595
      Mel Gorman authored
      In order to teach filesystems to handle swap cache pages, three new page
      functions are introduced:
      
        pgoff_t page_file_index(struct page *);
        loff_t page_file_offset(struct page *);
        struct address_space *page_file_mapping(struct page *);
      
      page_file_index() - gives the offset of this page in the file in
      PAGE_CACHE_SIZE blocks.  Like page->index is for mapped pages, this
      function also gives the correct index for PG_swapcache pages.
      
      page_file_offset() - uses page_file_index(), so that it will give the
      expected result, even for PG_swapcache pages.
      
      page_file_mapping() - gives the mapping backing the actual page; that is
      for swap cache pages it will give swap_file->f_mapping.
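The behavior of the three helpers can be approximated by the user-space sketch below. It is a simplification under stated assumptions: `struct page_sketch` and the `_sketch` functions are hypothetical stand-ins, the swap offset and swap file mapping are stored directly in the structure rather than being derived as the kernel would, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_CACHE_SHIFT 12   /* assumes 4 KiB pages */

struct mapping_sketch { int id; };

struct page_sketch {
    bool swapcache;                  /* stand-in for PG_swapcache */
    unsigned long index;             /* page->index for ordinary pages */
    unsigned long swap_offset;       /* position within the swap file */
    struct mapping_sketch *mapping;  /* page->mapping for ordinary pages */
    struct mapping_sketch *swap_file_mapping;  /* swap_file->f_mapping */
};

/* Index in page-sized blocks, valid for swap cache pages too. */
static unsigned long page_file_index_sketch(const struct page_sketch *p)
{
    return p->swapcache ? p->swap_offset : p->index;
}

/* Byte offset, built on the index helper so both cases stay consistent. */
static long long page_file_offset_sketch(const struct page_sketch *p)
{
    return (long long)page_file_index_sketch(p) << PAGE_CACHE_SHIFT;
}

/* Mapping that actually backs the page. */
static struct mapping_sketch *
page_file_mapping_sketch(const struct page_sketch *p)
{
    return p->swapcache ? p->swap_file_mapping : p->mapping;
}
```

A filesystem that uses such helpers instead of reading page->index and page->mapping directly gets sensible answers for both ordinary and swap cache pages.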
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: rename config variables · c255a458
      Andrew Morton authored
      Sanity:
      
      CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
      CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
      
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 29 May, 2012 5 commits
    • mm/memcg: apply add/del_page to lruvec · fa9add64
      Hugh Dickins authored
      Take lruvec further: pass it instead of zone to add_page_to_lru_list() and
      del_page_from_lru_list(); and have pagevec_lru_move_fn() pass lruvec down
      to its target functions.
      
      This cleanup eliminates a swathe of cruft in memcontrol.c, including
      mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
      mem_cgroup_lru_move_lists() - which never actually touched the lists.
      
      In their place, mem_cgroup_page_lruvec() to decide the lruvec, previously
      a side-effect of add, and mem_cgroup_update_lru_size() to maintain the
      lru_size stats.
      
      Whilst these are simplifications in their own right, the goal is to bring
      the evaluation of lruvec next to the spin_locking of the lrus, in
      preparation for a future patch.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove lru type checks from __isolate_lru_page() · f3fd4a61
      Konstantin Khlebnikov authored
      After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
      completely remove the anon/file and active/inactive lru type filters
      from __isolate_lru_page(), because isolation for 0-order reclaim always
      isolates pages from the right lru list.  And page isolation for lumpy
      shrink_inactive_list() or memory compaction is allowed to isolate
      pages from all evictable lru lists anyway.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix/change behavior of shared anon at moving task · 4b91355e
      KAMEZAWA Hiroyuki authored
      This patch changes memcg's behavior at task_move().
      
      At task_move(), the kernel scans a task's page table and moves the
      charges for mapped pages from the source cgroup to the target cgroup.
      There has been a bug in handling shared anonymous pages for a long time.
      
      Before patch:
        - The spec says 'shared anonymous pages are not moved.'
        - The implementation was 'shared anonymous pages may be moved'.
          If page_mapcount <= 2, the charges of shared anonymous pages were moved.
      
      After patch:
        - The spec says 'all anonymous pages are moved'.
        - The implementation is 'all anonymous pages are moved'.
      
      Considering usage of memcg, this will not affect users' experience.
      'Shared anonymous' pages only exist within a tree of processes which
      don't do exec().  Moving one such process without exec() does not seem
      sane.  For example, libcgroup will not be affected by this change.
      (Anyway, no one noticed the implementation for a long time...)
      
      Below is a discussion log:
      
       - The current spec/implementation are complex.
       - Shared file caches are now moved.
       - It adds an unclear check such as page_mapcount(). To check correctly,
         we should check swap users, etc.
       - No one noticed this implementation behavior. So, no one gets any
         benefit from the design.
       - In general, once a task is moved to a cgroup for running, it will not
         be moved....
       - Finally, we have a control knob, memory.move_charge_at_immigrate.
      
      Here is a patch to allow moving shared pages completely. This makes
      memcg simpler and fixes the currently broken code.
      Suggested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b91355e
    • Hugh Dickins's avatar
      shmem: replace page if mapping excludes its zone · bde05d1c
      Hugh Dickins authored
      The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the
      backing RAM has to be below 4GB.  Not a problem while the boards
      supported only 4GB: but now Intel's D2700MUD boards support 8GB, and
      their GMA3600 is managed by the GMA500 driver.
      
      shmem/tmpfs has never pretended to support hardware restrictions on the
      backing memory, but it might have appeared to do so before v3.1, and
      even now it works fine until a page is swapped out then back in.  When
      read_cache_page_gfp() supplied a freshly allocated page for copy, that
      compensated for whatever choice might have been made by earlier swapin
      readahead; but swapoff was likely to destroy the illusion.
      
      We'd like to continue to support GMA500, so now add a new
      shmem_should_replace_page() check on the zone when about to move a page
      from swapcache to filecache (in swapin and swapoff cases), with
      shmem_replace_page() to allocate and substitute a suitable page (given
      gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).
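      
      The zone check can be pictured roughly as follows.  This is a sketch
      under stated assumptions, not the shmem.c code: it assumes zones are
      ordered DMA < DMA32 < NORMAL, and enum zone_model and
      should_replace_page() are hypothetical names.  A mapping whose gfp mask
      includes __GFP_DMA32 is modelled as "highest allowed zone is DMA32".

```c
#include <assert.h>
#include <stdbool.h>

/* Zones in ascending order of the addresses they cover (model only). */
enum zone_model { ZONE_DMA_M, ZONE_DMA32_M, ZONE_NORMAL_M };

/* A page needs replacing when it sits in a higher zone than the mapping allows. */
static bool should_replace_page(enum zone_model page_zone,
                                enum zone_model highest_allowed)
{
    assert(page_zone <= ZONE_NORMAL_M && highest_allowed <= ZONE_NORMAL_M);
    return page_zone > highest_allowed;
}
```

      With this model, a page swapped back in to the NORMAL zone would be
      replaced for a DMA32-restricted mapping, while a page already in DMA or
      DMA32 would be kept.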
      
      This does involve a minor extension to mem_cgroup_replace_page_cache()
      (the page may or may not have already been charged); and I've removed a
      comment and call to mem_cgroup_uncharge_cache_page(), which in fact is
      always a no-op while PageSwapCache.
      
      Also removed optimization of an unlikely path in shmem_getpage_gfp(),
      now that we need to check PageSwapCache more carefully (a racing caller
      might already have made the copy).  And at one point shmem_unuse_inode()
      needs to use the hitherto private page_swapcount(), to guard against
      racing with inode eviction.
      
      It would make sense to extend shmem_should_replace_page(), to cover
      cpuset and NUMA mempolicy restrictions too, but set that aside for now:
      needs a cleanup of shmem mempolicy handling, and more testing, and ought
      to handle swap faults in do_swap_page() as well as shmem.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Stephane Marchesin <marcheu@chromium.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Rob Clark <rob.clark@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bde05d1c
    • Rik van Riel's avatar
      mm: remove swap token code · e709ffd6
      Rik van Riel authored
      The swap token code no longer fits in with the current VM model.  It
      does not play well with cgroups or the better NUMA placement code in
      development, since we have only one swap token globally.
      
      It also has the potential to mess with scalability of the system, by
      increasing the number of non-reclaimable pages on the active and
      inactive anon LRU lists.
      
      Last but not least, the swap token code has been broken for a year
      without complaints, as reported by Konstantin Khlebnikov.  This suggests
      we no longer have much use for it.
      
      The days of sub-1G memory systems with heavy use of swap are over.  If
      we ever need thrashing reducing code in the future, we will have to
      implement something that does scale.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Bob Picco <bpicco@meloft.net>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e709ffd6
  8. 15 May, 2012 1 commit
    • Dan Magenheimer's avatar
      mm: frontswap: core swap subsystem hooks and headers · 38b5faf4
      Dan Magenheimer authored
      This patch, 2 of 4, contains the changes to the core swap subsystem.
      This includes:
      
      (1) makes available core swap data structures (swap_lock, swap_list and
      swap_info) that are needed by frontswap.c.  We don't need to expose them
      to the dozens of files that include swap.h, so we create a new swapfile.h
      just to extern-ify these and modify their declarations to non-static.
      
      (2) adds frontswap-related elements to swap_info_struct.  Frontswap_map
      points to vzalloc'ed one-bit-per-swap-page metadata that indicates
      whether the swap page is in frontswap or in the device and frontswap_pages
      counts how many pages are in frontswap.
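      
      A one-bit-per-swap-page map with a page counter, as described in (2),
      might look like the user-space sketch below.  struct fs_model and the
      fs_* helpers are hypothetical names; calloc() stands in for vzalloc(),
      and the counter is a plain integer here, whereas the changelog notes it
      became an atomic_t in the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the frontswap fields of swap_info_struct. */
struct fs_model {
    unsigned long *map;   /* one bit per swap page: 1 = page is in frontswap */
    unsigned long  max;   /* number of swap pages covered by the map */
    unsigned long  pages; /* how many bits are currently set */
};

/* Allocate a zeroed bitmap; returns false if the allocation fails. */
static bool fs_init(struct fs_model *fs, unsigned long max)
{
    fs->map = calloc((max + BITS_PER_WORD - 1) / BITS_PER_WORD,
                     sizeof(unsigned long));
    fs->max = max;
    fs->pages = 0;
    return fs->map != NULL;
}

static bool fs_test(const struct fs_model *fs, unsigned long off)
{
    assert(off < fs->max);
    return (fs->map[off / BITS_PER_WORD] >> (off % BITS_PER_WORD)) & 1UL;
}

/* Mark a swap page as stored in frontswap; counts each page once. */
static void fs_set(struct fs_model *fs, unsigned long off)
{
    if (!fs_test(fs, off)) {
        fs->map[off / BITS_PER_WORD] |= 1UL << (off % BITS_PER_WORD);
        fs->pages++;
    }
}

/* Mark a swap page as living on the device again. */
static void fs_clear(struct fs_model *fs, unsigned long off)
{
    if (fs_test(fs, off)) {
        fs->map[off / BITS_PER_WORD] &= ~(1UL << (off % BITS_PER_WORD));
        fs->pages--;
    }
}
```

      The bitmap costs one bit per swap page, which is why a failed
      allocation can simply leave frontswap unused for that swap area.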
      
      (3) adds hooks in the swap subsystem and extends try_to_unuse so that
      frontswap_shrink can do a "partial swapoff".
      
      Note that a failed frontswap_map allocation is safe... failure is noted
      by lack of "FS" in the subsequent printk.
      
      ---
      
      [v14: rebase to 3.4-rc2]
      [v10: no change]
      [v9: akpm@linux-foundation.org: mark some statics __read_mostly]
      [v9: akpm@linux-foundation.org: add clarifying comments]
      [v9: akpm@linux-foundation.org: no need to loop repeating try_to_unuse]
      [v9: error27@gmail.com: remove superfluous check for NULL]
      [v8: rebase to 3.0-rc4]
      [v8: kamezawa.hiroyu@jp.fujitsu.com: change counter to atomic_t to avoid races]
      [v8: kamezawa.hiroyu@jp.fujitsu.com: comment to clarify informational counters]
      [v7: rebase to 3.0-rc3]
      [v7: JBeulich@novell.com: add new swap struct elements only if config'd]
      [v6: rebase to 3.0-rc1]
      [v6: lliubbo@gmail.com: fix null pointer deref if vzalloc fails]
      [v6: konrad.wilk@oracl.com: various checks and code clarifications/comments]
      [v5: no change from v4]
      [v4: rebase to 2.6.39]
      Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Jan Beulich <JBeulich@novell.com>
      Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Rik Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      [v11: Rebased, fixed mm/swapfile.c context change]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      38b5faf4
  9. 05 Apr, 2012 1 commit
    • Michal Hocko's avatar
      memcg swap: use mem_cgroup_uncharge_swap fix · dac23b0d
      Michal Hocko authored
      Although mem_cgroup_uncharge_swap has an empty placeholder for
      !CONFIG_CGROUP_MEM_RES_CTLR_SWAP the definition is placed in the
      CONFIG_SWAP ifdef block so we are missing the same definition for
      !CONFIG_SWAP which implies !CONFIG_CGROUP_MEM_RES_CTLR_SWAP.
      
      This has not been an issue before, because mem_cgroup_uncharge_swap was
      not called from !CONFIG_SWAP context.  But Hugh Dickins has a cleanup
      patch to call __mem_cgroup_commit_charge_swapin which is defined also
      for !CONFIG_SWAP.
      
      Let's move both the empty definition and declaration outside of the
      CONFIG_SWAP block to avoid the following compilation error:
      
        mm/memcontrol.c: In function '__mem_cgroup_commit_charge_swapin':
        mm/memcontrol.c:2837: error: implicit declaration of function 'mem_cgroup_uncharge_swap'
      
      if CONFIG_SWAP is disabled.
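      
      The header layout being described can be sketched like this.  It is a
      stand-alone model, not the real swap.h: the *_model names are
      hypothetical, and neither config symbol is defined, which models a
      !CONFIG_SWAP build.

```c
#include <assert.h>

/* Neither symbol is defined here, modelling a !CONFIG_SWAP build. */
/* #define CONFIG_SWAP */
/* #define CONFIG_CGROUP_MEM_RES_CTLR_SWAP */

typedef struct { unsigned long val; } swp_entry_model_t;

#ifdef CONFIG_SWAP
/* ... declarations that only make sense with swap enabled ... */
#endif

/* Kept outside the CONFIG_SWAP block so every configuration sees it. */
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern void uncharge_swap_model(swp_entry_model_t ent);
#else
static inline void uncharge_swap_model(swp_entry_model_t ent)
{
    (void)ent; /* empty placeholder: nothing to uncharge */
}
#endif

/* A caller that is built regardless of CONFIG_SWAP now compiles. */
static int commit_charge_swapin_model(swp_entry_model_t ent)
{
    uncharge_swap_model(ent);
    return 0;
}
```

      Had the #else stub stayed inside the CONFIG_SWAP block, the call in
      commit_charge_swapin_model() would hit the implicit-declaration error
      quoted above.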
      Reported-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dac23b0d
  10. 28 Mar, 2012 1 commit
  11. 21 Mar, 2012 2 commits
    • Konstantin Khlebnikov's avatar
      mm: drain percpu lru add/rotate page-vectors on cpu hot-unplug · f0cb3c76
      Konstantin Khlebnikov authored
      This cpu hotplug hook was accidentally removed in commit 00a62ce9
      ("mm: fix Committed_AS underflow on large NR_CPUS environment")
      
      The visible effect of this accident: some pages are borrowed in per-cpu
      page-vectors.  Truncate can deal with it, but these pages cannot be
      reused while this cpu is offline.  So this is like a temporary memory
      leak.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Eric B Munson <ebmunson@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0cb3c76
    • Rik van Riel's avatar
      mm: make swapin readahead skip over holes · 67f96aa2
      Rik van Riel authored
      Ever since abandoning the virtual scan of processes, for scalability
      reasons, swap space has been a little more fragmented than before.  This
      can lead to the situation where a large memory user is killed, swap space
      ends up full of "holes" and swapin readahead is totally ineffective.
      
      On my home system, after killing a leaky firefox it took over an hour to
      page just under 2GB of memory back in, slowing the virtual machines down
      to a crawl.
      
      This patch makes swapin readahead simply skip over holes, instead of
      stopping at them.  This allows the system to swap things back in at rates
      of several MB/second, instead of a few hundred kB/second.
      
      The checks done in valid_swaphandles are already done in
      read_swap_cache_async as well, allowing us to remove a fair amount of
      code.
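      
      The difference between stopping at a hole and skipping it can be seen
      in a small simulation.  This is a simplified user-space model, not the
      swap_state.c code: readahead_window() and its parameters are
      hypothetical, a zero entry in swap_map marks a hole, and the "old"
      mode simply ends the walk at the first hole in the aligned window.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk the aligned window of (1 << page_cluster) slots around offset and
 * count how many slots would be read.  With stop_at_hole set, the walk ends
 * at the first hole; otherwise holes are skipped and the walk continues.
 */
static size_t readahead_window(const unsigned char *swap_map, size_t nr_slots,
                               size_t offset, unsigned int page_cluster,
                               int stop_at_hole)
{
    assert(offset < nr_slots && page_cluster < sizeof(size_t) * 8);

    size_t mask = ((size_t)1 << page_cluster) - 1;
    size_t start = offset & ~mask;  /* first slot of the aligned window */
    size_t end = offset | mask;     /* last slot of the aligned window */
    size_t nr_read = 0;

    if (start == 0)
        start = 1;                  /* slot 0 holds the swap header */
    if (end >= nr_slots)
        end = nr_slots - 1;

    for (size_t i = start; i <= end; i++) {
        if (swap_map[i] == 0) {
            if (stop_at_hole)
                break;
            continue;               /* skip the hole, keep reading */
        }
        nr_read++;
    }
    return nr_read;
}
```

      On a fragmented map, the skipping mode still reads every used slot in
      the window, while the stopping mode gives up after the first gap.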
      
      [akpm@linux-foundation.org: fix it for page_cluster >= 32]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Adrian Drzewiecki <z@drze.net>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67f96aa2
  12. 23 Jan, 2012 1 commit
    • Hugh Dickins's avatar
      SHM_UNLOCK: fix Unevictable pages stranded after swap · 24513264
      Hugh Dickins authored
      Commit cc39c6a9 ("mm: account skipped entries to avoid looping in
      find_get_pages") correctly fixed an infinite loop; but left a problem
      that find_get_pages() on shmem would return 0 (appearing to callers to
      mean end of tree) when it meets a run of nr_pages swap entries.
      
      The only uses of find_get_pages() on shmem are via pagevec_lookup(),
      called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
      scan_mapping_unevictable_pages().  The first is already commented, and
      not worth worrying about; but the second can leave pages on the
      Unevictable list after an unusual sequence of swapping and locking.
      
      Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
      swap) instead of pagevec_lookup().
      
      But I don't want to contaminate vmscan.c with shmem internals, nor
      shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
      shmem.c, renaming it shmem_unlock_mapping(); and rename
      check_move_unevictable_page() to check_move_unevictable_pages(), looping
      down an array of pages, oftentimes under the same lock.
      
      Leave out the "rotate unevictable list" block: that's a leftover from
      when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
      handling involved looking at pages at tail of LRU.
      
      Was there significance to the sequence first ClearPageUnevictable, then
      test page_evictable, then SetPageUnevictable here? I think not, we're
      under LRU lock, and have no barriers between those.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24513264
  13. 10 Jan, 2012 1 commit
    • Johannes Weiner's avatar
      mm: exclude reserved pages from dirtyable memory · ab8fabd4
      Johannes Weiner authored
      Per-zone dirty limits try to distribute page cache pages allocated for
      writing across zones in proportion to the individual zone sizes, to reduce
      the likelihood of reclaim having to write back individual pages from the
      LRU lists in order to make progress.
      
      This patch:
      
      The amount of dirtyable pages should not include the full number of free
      pages: there is a number of reserved pages that the page allocator and
      kswapd always try to keep free.
      
      The closer (reclaimable pages - dirty pages) is to the number of reserved
      pages, the more likely it becomes for reclaim to run into dirty pages:
      
             +----------+ ---
             |   anon   |  |
             +----------+  |
             |          |  |
             |          |  -- dirty limit new    -- flusher new
             |   file   |  |                     |
             |          |  |                     |
             |          |  -- dirty limit old    -- flusher old
             |          |                        |
             +----------+                       --- reclaim
             | reserved |
             +----------+
             |  kernel  |
             +----------+
      
      This patch introduces a per-zone dirty reserve that takes both the lowmem
      reserve as well as the high watermark of the zone into account, and a
      global sum of those per-zone values that is subtracted from the global
      amount of dirtyable pages.  The lowmem reserve is unavailable to page
      cache allocations and kswapd tries to keep the high watermark free.  We
      don't want to end up in a situation where reclaim has to clean pages in
      order to balance zones.
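      
      The accounting described above can be sketched as follows.  This is a
      user-space model with hypothetical names (struct zone_model,
      zone_dirty_reserve(), global_dirtyable()); it only illustrates
      "reserve = lowmem reserve + high watermark, summed over zones and
      subtracted from the dirtyable total", not the actual page allocator
      code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone numbers, all in pages. */
struct zone_model {
    unsigned long lowmem_reserve; /* reserve unavailable to page cache */
    unsigned long high_wmark;     /* pages kswapd tries to keep free */
    unsigned long free;           /* currently free pages */
    unsigned long reclaimable;    /* reclaimable page cache pages */
};

/* Per-zone dirty reserve: lowmem reserve plus high watermark. */
static unsigned long zone_dirty_reserve(const struct zone_model *z)
{
    return z->lowmem_reserve + z->high_wmark;
}

/* Global dirtyable memory: free + reclaimable, minus the summed reserves. */
static unsigned long global_dirtyable(const struct zone_model *zones, size_t n)
{
    unsigned long total = 0, reserve = 0;

    assert(zones != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        total += zones[i].free + zones[i].reclaimable;
        reserve += zone_dirty_reserve(&zones[i]);
    }
    /* Clamp at zero so a large reserve cannot wrap the unsigned result. */
    return total > reserve ? total - reserve : 0;
}
```

      Because the dirty limits are derived from the dirtyable total, taking
      the reserve out of it moves the limits further away from the point
      where reclaim starts.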
      
      Not treating reserved pages as dirtyable on a global level is only a
      conceptual fix.  In reality, dirty pages are not distributed equally
      across zones and reclaim runs into dirty pages on a regular basis.
      
      But it is important to get this right before tackling the problem on a
      per-zone level, where the distance between reclaim and the dirty pages is
      mostly much smaller in absolute numbers.
      
      [akpm@linux-foundation.org: fix highmem build]
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab8fabd4
  14. 31 Oct, 2011 1 commit
    • Minchan Kim's avatar
      mm: change isolate mode from #define to bitwise type · 4356f21d
      Minchan Kim authored
      Change the ISOLATE_XXX macros to use a bitwise isolate_mode_t type.  Normally,
      macros aren't recommended as they are type-unsafe and make debugging harder,
      since the symbol cannot be passed through to the debugger.
      
      Quote from Johannes
      " Hmm, it would probably be cleaner to fully convert the isolation mode
      into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
      tri-state among flags, which is a bit ugly."
      
      This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h
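      
      The "independent flags" idea from the quote can be illustrated with a
      small sketch.  The flag values and mode_allows() are assumptions made
      for illustration and are not necessarily what this patch defines; in
      the kernel the typedef would also carry a sparse bitwise annotation,
      which plain C cannot express.

```c
#include <assert.h>
#include <stdbool.h>

/* Plain-C stand-in for the kernel's annotated bitwise type. */
typedef unsigned int isolate_mode_t;

#define ISOLATE_INACTIVE ((isolate_mode_t)0x1) /* isolate inactive pages */
#define ISOLATE_ACTIVE   ((isolate_mode_t)0x2) /* isolate active pages */

/* Decide whether a page with the given active state may be isolated. */
static bool mode_allows(isolate_mode_t mode, bool page_active)
{
    assert((mode & ~(ISOLATE_INACTIVE | ISOLATE_ACTIVE)) == 0);
    return (mode & (page_active ? ISOLATE_ACTIVE : ISOLATE_INACTIVE)) != 0;
}
```

      With independent bits, "both" needs no value of its own: it is just
      ISOLATE_INACTIVE | ISOLATE_ACTIVE.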
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4356f21d
  15. 14 Sep, 2011 1 commit
  16. 26 Jul, 2011 3 commits
    • Arun Sharma's avatar
      atomic: use <linux/atomic.h> · 60063497
      Arun Sharma authored
      This allows us to move duplicated code in <asm/atomic.h>
      (atomic_inc_not_zero() for now) to <linux/atomic.h>
      Signed-off-by: Arun Sharma <asharma@fb.com>
      Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60063497
    • KAMEZAWA Hiroyuki's avatar
      memcg: add memory.vmscan_stat · 82f9d486
      KAMEZAWA Hiroyuki authored
      The commit log of 0ae5e89c ("memcg: count the soft_limit reclaim
      in...") says it adds scanning stats to the memory.stat file.  But it doesn't,
      because we considered that we needed to reach a consensus on such new APIs.
      
      This patch is a trial to add memory.scan_stat. This shows
        - the number of scanned pages (total, anon, file)
        - the number of rotated pages (total, anon, file)
        - the number of freed pages (total, anon, file)
        - the elapsed time (including sleep/pause time)
      
        for both direct and soft reclaim.
      
      The biggest difference from Ying's original one is that this file
      can be reset by a write, as
      
        # echo 0 ...../memory.scan_stat
      
      Example of output is here. This is a result after make -j 6 kernel
      under 300M limit.
      
        [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
        [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
        scanned_pages_by_limit 9471864
        scanned_anon_pages_by_limit 6640629
        scanned_file_pages_by_limit 2831235
        rotated_pages_by_limit 4243974
        rotated_anon_pages_by_limit 3971968
        rotated_file_pages_by_limit 272006
        freed_pages_by_limit 2318492
        freed_anon_pages_by_limit 962052
        freed_file_pages_by_limit 1356440
        elapsed_ns_by_limit 351386416101
        scanned_pages_by_system 0
        scanned_anon_pages_by_system 0
        scanned_file_pages_by_system 0
        rotated_pages_by_system 0
        rotated_anon_pages_by_system 0
        rotated_file_pages_by_system 0
        freed_pages_by_system 0
        freed_anon_pages_by_system 0
        freed_file_pages_by_system 0
        elapsed_ns_by_system 0
        scanned_pages_by_limit_under_hierarchy 9471864
        scanned_anon_pages_by_limit_under_hierarchy 6640629
        scanned_file_pages_by_limit_under_hierarchy 2831235
        rotated_pages_by_limit_under_hierarchy 4243974
        rotated_anon_pages_by_limit_under_hierarchy 3971968
        rotated_file_pages_by_limit_under_hierarchy 272006
        freed_pages_by_limit_under_hierarchy 2318492
        freed_anon_pages_by_limit_under_hierarchy 962052
        freed_file_pages_by_limit_under_hierarchy 1356440
        elapsed_ns_by_limit_under_hierarchy 351386416101
        scanned_pages_by_system_under_hierarchy 0
        scanned_anon_pages_by_system_under_hierarchy 0
        scanned_file_pages_by_system_under_hierarchy 0
        rotated_pages_by_system_under_hierarchy 0
        rotated_anon_pages_by_system_under_hierarchy 0
        rotated_file_pages_by_system_under_hierarchy 0
        freed_pages_by_system_under_hierarchy 0
        freed_anon_pages_by_system_under_hierarchy 0
        freed_file_pages_by_system_under_hierarchy 0
        elapsed_ns_by_system_under_hierarchy 0
      
      total_xxxx is for hierarchy management.
      
      This will be useful for further memcg development and needs to be
      developed before we do some complicated rework on LRU/softlimit
      management.
      
      This patch adds a new struct memcg_scanrecord to the scan_control struct.
      sc->nr_scanned et al. are not designed for exporting information.  For
      example, nr_scanned is reset frequently and incremented by 2 when scanning
      mapped pages.
      
      To avoid complexity, I added a new param in scan_control which is for
      exporting the scanning score.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andrew Bresticker <abrestic@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82f9d486
    • KAMEZAWA Hiroyuki's avatar
      memcg: export memory cgroup's swappiness with mem_cgroup_swappiness() · 1f4c025b
      KAMEZAWA Hiroyuki authored
      Each memory cgroup has a 'swappiness' value which can be accessed by
      get_swappiness(memcg).  The major user is try_to_free_mem_cgroup_pages()
      and swappiness is passed by argument.  It's propagated by scan_control.
      
      get_swappiness() is a static function but some planned updates will need
      to get swappiness from files other than memcontrol.c.  This patch exports
      get_swappiness() as mem_cgroup_swappiness().  With this, we can remove the
      swappiness argument from try_to_free...  and drop swappiness from
      scan_control.  Only memcg uses it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f4c025b
  17. 27 Jun, 2011 1 commit
  18. 15 Jun, 2011 1 commit
  19. 26 May, 2011 1 commit
    • Ying Han's avatar
      memcg: count the soft_limit reclaim in global background reclaim · 0ae5e89c
      Ying Han authored
      The global kswapd scans per-zone LRU and reclaims pages regardless of the
      cgroup. It breaks memory isolation since one cgroup can end up reclaiming
      pages from another cgroup. Instead we should rely on memcg-aware target
      reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
      memory pressure.
      
      In the global background reclaim, we do soft reclaim before scanning the
      per-zone LRU. However, the return value is ignored. This patch is the first
      step to skip shrink_zone() if soft_limit reclaim does enough work.
      
      This is part of the effort to reduce reclaiming of pages from the global
      LRU in memcg. The per-memcg background reclaim patchset further enhances
      per-cgroup targeted reclaim; I should have V4 of it posted shortly.
      
      Try running multiple memory-intensive workloads within separate memcgs. Watch
      the soft_steal counters in memory.stat.
      
        $ cat /dev/cgroup/A/memory.stat | grep 'soft'
        soft_steal 240000
        soft_scan 240000
        total_soft_steal 240000
        total_soft_scan 240000
      
      This patch:
      
      In the global background reclaim, we do soft reclaim before scanning the
      per-zone LRU.  However, the return value is ignored.
      
      We would like to skip shrink_zone() if soft_limit reclaim does enough
      work.  Also, we need to make the memory pressure balanced across per-memcg
      zones, like the logic in vm-core.  This patch is the first step, where we
      start by counting the nr_scanned and nr_reclaimed from soft_limit
      reclaim into the global scan_control.
      Signed-off-by: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ae5e89c
  20. 22 Mar, 2011 2 commits
    • Mel Gorman's avatar
      mm: vmscan: kswapd should not free an excessive number of pages when balancing small zones · 8afdcece
      Mel Gorman authored
      When reclaiming for order-0 pages, kswapd requires that all zones be
      balanced.  Each cycle through balance_pgdat() does background ageing on
      all zones if necessary and applies equal pressure on the inactive zone
      unless a lot of pages are free already.
      
      A "lot of free pages" is defined as a "balance gap" above the high
      watermark which is currently 7*high_watermark.  Historically this was
      reasonable as min_free_kbytes was small.  However, on systems using huge
      pages, it is recommended that min_free_kbytes is higher and it is tuned
      with hugeadm --set-recommended-min_free_kbytes.  With the introduction of
      transparent huge page support, this recommended value is also applied.  On
      X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
      expect around 68M of memory to be free.  The Normal zone is approximately
      35000 pages so under even normal memory pressure such as copying a large
      file, it gets exhausted quickly.  As it is getting exhausted, kswapd
      applies pressure equally to all zones, including the DMA32 zone.  DMA32 is
      approximately 700,000 pages with a high watermark of around 23,000 pages.
      In this situation, kswapd will reclaim around (23000*8 where 8 is the high
      watermark + balance gap of 7 * high watermark) pages or 718M of pages
      before the zone is ignored.  What the user sees is free memory that is far
      higher than it should be.
      
      To avoid an excessive number of pages being reclaimed from the larger
      zones, explicitly define the "balance gap" to be either 1% of the zone
      or the low watermark for the zone, whichever is smaller.  While kswapd
      will check all zones to apply pressure, it'll ignore zones that meet the
      (high_wmark + balance_gap) watermark.
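      
      The arithmetic of the old and new rules can be checked with a short
      sketch.  The function names are hypothetical, and rounding the 1%
      figure up is an assumption of this model rather than something the
      description states.

```c
#include <assert.h>

/* Old rule: high watermark plus a balance gap of 7 * high watermark. */
static unsigned long old_target(unsigned long high_wmark)
{
    return high_wmark + 7 * high_wmark;
}

/* New rule: gap is 1% of the zone or the low watermark, whichever is smaller. */
static unsigned long new_balance_gap(unsigned long zone_pages,
                                     unsigned long low_wmark)
{
    unsigned long one_percent = (zone_pages + 99) / 100; /* rounded up */

    return low_wmark < one_percent ? low_wmark : one_percent;
}

/* Free pages a zone needs under the new rule before kswapd ignores it. */
static unsigned long new_target(unsigned long zone_pages,
                                unsigned long low_wmark,
                                unsigned long high_wmark)
{
    assert(low_wmark <= high_wmark);
    return high_wmark + new_balance_gap(zone_pages, low_wmark);
}
```

      With the DMA32 figures quoted above (700,000 pages, high watermark
      23,000), old_target() gives 184,000 pages, about 718M with 4K pages.
      The 1% cap is 7,000 pages, so the new target cannot exceed 30,000
      pages whatever the low watermark is.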
      
      To test this, 80G were copied from a partition and the amount of memory
      being used was recorded.  A comparison of a patched and an unpatched kernel
      can be seen at
      http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
      and shows that kswapd is not reclaiming as much memory with the patch
      applied.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: "Chen, Tim C" <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8afdcece
    • Minchan Kim's avatar
      mm: deactivate invalidated pages · 31560180
      Minchan Kim authored
      Recently, a thrashing problem was reported
      (http://marc.info/?l=rsync&m=128885034930933&w=2).  It is caused by backup
      workloads (e.g. nightly rsync).  Such a workload creates use-once pages
      but touches them twice.  That promotes the pages to the active list, which
      results in working set pages being evicted.
      
      Some app developers want support for POSIX_FADV_NOREUSE, but other OSes
      don't support it either
      (http://marc.info/?l=linux-mm&m=128928979512086&w=2).
      
      As another approach, app developers use POSIX_FADV_DONTNEED.  But it has a
      problem.  If the kernel meets a page that is being written during
      invalidate_mapping_pages, it can't discard it.  That makes it hard for
      application programmers to use, since they always have to sync data before
      calling fadvise(..POSIX_FADV_DONTNEED) to make sure the pages are
      discardable.  As a result, they can't use the kernel's deferred write, so
      they may see a performance loss.
      (http://insights.oetiker.ch/linux/fadvise.html)
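      
      The application-side workaround described above, syncing before the
      fadvise call so that the pages are discardable, might look like this
      minimal sketch.  sync_then_dontneed() is a hypothetical helper name;
      fdatasync() and posix_fadvise() are the standard POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush dirty data first, then ask the kernel to drop the cached pages.
 * Returns 0 on success or an error number on failure.
 */
static int sync_then_dontneed(int fd)
{
    assert(fd >= 0);

    /* Without this, pages still being written could not be dropped. */
    if (fdatasync(fd) != 0)
        return errno;

    /* Offset 0 with length 0 covers the whole file. */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

      The explicit sync is what costs the application the benefit of the
      kernel's deferred writeback, which is the performance loss mentioned
      above.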
      
      In fact, invalidation is a very big hint to the reclaimer.  It means we
      don't use the page any more.  So let's move the page being written to the
      inactive list's head if we can't truncate it right now.
      
      Why move the page to the head of the LRU in this patch?  A Dirty/Writeback
      page will be flushed sooner or later.  This can prevent writeout by pageout,
      which is less effective than the flusher's writeout.
      
      Originally, I reused Peter's lru_demote with some changes, so I added his
      Signed-off-by.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Reported-by: Ben Gamari <bgamari.foss@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31560180