    memcg: add page_cgroup flags for dirty page tracking · db16d5ec
    Greg Thelen authored
    
    
    This patchset provides the ability for each cgroup to have independent
    dirty page limits.
    
    Limiting dirty memory caps the amount of dirty (hard to reclaim) page
    cache that a cgroup can use.  With multiple cgroup writers, no cgroup can
    consume more than its designated share of dirty pages, and a cgroup that
    crosses its limit is forced to perform write-out.
    
    The patches are based on a series proposed by Andrea Righi in Mar 2010.
    
    Overview:
    
    - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
      unstable.
    
    - Extend mem_cgroup to record the total number of pages in each of the
      interesting dirty states (dirty, writeback, unstable_nfs).
    
    - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
      limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
      via cgroupfs control files.
    
    - Consider both system and per-memcg dirty limits in page writeback when
      deciding to queue background writeback or block for foreground writeback.
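
The last overview item can be pictured with a small user-space sketch. It assumes each scope (system-wide or one cgroup) has a background threshold and a foreground threshold in pages, and that the stricter of the two decisions wins. The type and function names below are hypothetical and are not taken from the kernel; the real writeback path is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dirty thresholds, in pages. Not kernel types. */
struct dirty_limits {
    unsigned long background; /* queue background writeback above this */
    unsigned long foreground; /* block the writer above this */
};

/* Ordered from least to most severe so decisions can be compared. */
enum wb_action { WB_NONE, WB_BACKGROUND, WB_BLOCK };

/* Decide for a single scope: the whole system or one cgroup. */
static enum wb_action check_scope(unsigned long dirty,
                                  const struct dirty_limits *lim)
{
    assert(lim != NULL);
    if (dirty > lim->foreground)
        return WB_BLOCK;
    if (dirty > lim->background)
        return WB_BACKGROUND;
    return WB_NONE;
}

/* Assumption: the stricter of the system and per-cgroup decisions applies. */
static enum wb_action writeback_action(unsigned long sys_dirty,
                                       const struct dirty_limits *sys_lim,
                                       unsigned long cg_dirty,
                                       const struct dirty_limits *cg_lim)
{
    enum wb_action s = check_scope(sys_dirty, sys_lim);
    enum wb_action c = check_scope(cg_dirty, cg_lim);

    return s > c ? s : c;
}
```

With this shape, a cgroup over its own foreground threshold is throttled even when the system as a whole is well under its limits.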
    
    Known shortcomings:
    
    - When a cgroup dirty limit is exceeded, bdi writeback is used to write
      back dirty inodes.  Bdi writeback considers inodes from any cgroup, not
      just inodes contributing dirty pages to the cgroup exceeding its limit.
    
    - When memory.use_hierarchy is set, dirty limits are disabled.  This is an
      implementation detail.  An enhanced implementation is needed to check the
      chain of parents to ensure that no dirty limit is exceeded.
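
The parent-chain check mentioned in the second shortcoming could look roughly like the sketch below. It is not part of this patchset. It assumes each cgroup node carries a parent pointer, a dirty page count that already includes its descendants, and a dirty limit; `struct cg_node` and `first_over_limit` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cgroup accounting; not the kernel's struct mem_cgroup. */
struct cg_node {
    struct cg_node *parent;    /* NULL for the root */
    unsigned long dirty_pages; /* dirty pages charged to this subtree */
    unsigned long dirty_limit; /* maximum dirty pages for this subtree */
};

/*
 * Walk from cg up to the root and return the first node that is over
 * its dirty limit, or NULL if every node in the chain is within limits.
 */
static const struct cg_node *first_over_limit(const struct cg_node *cg)
{
    assert(cg != NULL);
    for (; cg != NULL; cg = cg->parent)
        if (cg->dirty_pages > cg->dirty_limit)
            return cg;
    return NULL;
}
```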
    
    Performance data:
    - Performance was measured with a page fault microbenchmark, which can be
      run in read or write mode:
            f = open(foo.$cpu)
            truncate(f, 4096)
            alarm(60)
            while (1) {
                    p = mmap(f, 4096)
                    if (write)
                            *p = 1
                    else
                            x = *p
                    munmap(p)
            }
    
    - The workload was run at several points in the patch series in different
      modes:
      - s_read is a single threaded reader
      - s_write is a single threaded writer
      - p_read is a 16 thread reader, each operating on a different file
      - p_write is a 16 thread writer, each operating on a different file
    
    - Measurements were collected on a 16 core non-NUMA system using "perf stat
      --repeat 3".  The -a option was used for parallel (p_*) runs.
    
    - All numbers are page fault rate (M/sec).  Higher is better.
    
    - To compare the performance of a kernel without memcg, compare the first
      and last rows; neither has memcg configured.  The first row does not
      include any of these memcg patches.
    
    - To compare the performance of using memcg dirty limits, compare the baseline
      (2nd row titled "mmotm w/ memcg") with the patched code with memcg enabled
      (2nd to last row titled "all patches").
    
                                     root_cgroup                    child_cgroup
                           s_read s_write p_read p_write   s_read s_write p_read p_write
    mmotm w/o memcg         0.428  0.390   0.429  0.388
    mmotm w/ memcg          0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
    all patches             0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
    all patches w/o memcg   0.431  0.402   0.427  0.395
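
For readers who want to reproduce something similar, the microbenchmark pseudocode in the performance notes can be fleshed out in C roughly as follows. This is a sketch under stated assumptions, not the program used for the measurements: the file path is supplied by the caller instead of being derived from the CPU number, the run length is a parameter instead of a fixed 60 seconds, and `fault_loop` and `mfaults_per_sec` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_BYTES 4096 /* size used by the pseudocode */

static volatile sig_atomic_t stop; /* set by the SIGALRM handler */

static void on_alarm(int sig)
{
    (void)sig;
    stop = 1;
}

/*
 * Map, touch, and unmap one page of `path` until `seconds` have elapsed.
 * Returns the number of completed iterations (one page fault each),
 * or -1 on error.
 */
static long fault_loop(const char *path, int write_mode, unsigned int seconds)
{
    struct sigaction sa;
    long faults = 0;
    int fd;

    assert(path != NULL && seconds > 0);

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, PAGE_BYTES) != 0) {
        close(fd);
        return -1;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);
    stop = 0;
    alarm(seconds);

    while (!stop) {
        void *map = mmap(NULL, PAGE_BYTES, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        volatile char *p = map;

        if (map == MAP_FAILED) {
            faults = -1;
            break;
        }
        if (write_mode) {
            *p = 1;      /* write fault */
        } else {
            char x = *p; /* read fault */
            (void)x;
        }
        munmap(map, PAGE_BYTES);
        faults++;
    }

    alarm(0); /* cancel a pending alarm if the loop exited early */
    close(fd);
    return faults;
}

/* Convert an iteration count into the table's unit, M faults/sec. */
static double mfaults_per_sec(long faults, unsigned int seconds)
{
    return (double)faults / 1e6 / (double)seconds;
}
```

A parallel run would start one such loop per thread or process, each on its own file, as in the p_read and p_write modes.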
    
    This patch:
    
    Add additional flags to page_cgroup to track dirty pages within a
    mem_cgroup.
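
As an illustration of what per-page state flags of this kind look like, here is a minimal user-space sketch. The flag and helper names are hypothetical stand-ins chosen to mirror the three states listed in the overview (dirty, writeback, NFS unstable); they are not the identifiers added by the patch, and the sketch uses plain, non-atomic bit operations where kernel code would normally need atomic ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits mirroring the three dirty states. */
enum pc_flag {
    PC_FILE_DIRTY,        /* page is dirty */
    PC_FILE_WRITEBACK,    /* page is under writeback */
    PC_FILE_UNSTABLE_NFS, /* page is NFS unstable */
    PC_NR_FLAGS
};

/* Stand-in for the per-page descriptor; the real struct page_cgroup differs. */
struct page_desc {
    unsigned long flags;
};

/* Non-atomic set/clear/test helpers, one bit per state. */
static void pc_set(struct page_desc *pc, enum pc_flag f)
{
    assert(pc != NULL && f < PC_NR_FLAGS);
    pc->flags |= 1UL << f;
}

static void pc_clear(struct page_desc *pc, enum pc_flag f)
{
    assert(pc != NULL && f < PC_NR_FLAGS);
    pc->flags &= ~(1UL << f);
}

static bool pc_test(const struct page_desc *pc, enum pc_flag f)
{
    assert(pc != NULL && f < PC_NR_FLAGS);
    return (pc->flags >> f) & 1UL;
}
```

Keeping one bit per state lets the accounting code detect transitions (for example, clean to dirty) and adjust per-cgroup counters only when a bit actually changes.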
    
    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrea Righi <arighi@develer.com>
    Signed-off-by: Greg Thelen <gthelen@google.com>
    Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
    Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>