1. 12 Nov, 2013 27 commits
  2. 01 Nov, 2013 1 commit
      memcg: remove incorrect underflow check · 6920a1bd
      Greg Thelen authored
      When a memcg is deleted mem_cgroup_reparent_charges() moves charged
      memory to the parent memcg.  As of v3.11-9444-g3ea67d06 "memcg: add per
      cgroup writeback pages accounting" there's a bad pointer read.  The goal
      was to check for counter underflow.  The counter is a per cpu counter
      and there are two problems with the code:
      
       (1) per cpu access function isn't used, instead a naked pointer is used
           which easily causes oops.
       (2) the check doesn't sum all cpus
      
      Test:
        $ cd /sys/fs/cgroup/memory
        $ mkdir x
        $ echo 3 > /proc/sys/vm/drop_caches
        $ (echo $BASHPID >> x/tasks && exec cat) &
        [1] 7154
        $ grep ^mapped x/memory.stat
        mapped_file 53248
        $ echo 7154 > tasks
        $ rmdir x
        <OOPS>
      
      The fix is to remove the check.  It's currently dangerous, and it isn't
      worth fixing it to use something expensive, such as
      percpu_counter_sum(), for each reparented page.  __this_cpu_read() isn't
      enough to fix this because there are no guarantees about the current
      cpu's count.  The only guarantee is that the sum of all per-cpu counters
      is >= nr_pages.
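
      The last point can be illustrated with a small userspace model.  This
      is only a sketch, not kernel code: NR_CPUS, struct pcpu_stat,
      stat_add() and stat_sum() are hypothetical names invented for the
      illustration.  It shows that a single CPU's slot can legitimately be
      negative while the sum over all slots is still correct, so a check
      against one slot cannot detect underflow.

```c
#include <stdio.h>

#define NR_CPUS 4	/* hypothetical CPU count for this model */

/* Userspace model of a per-CPU statistic: one signed slot per CPU. */
struct pcpu_stat {
	long count[NR_CPUS];
};

/* Charge or uncharge pages on one CPU; only that CPU's slot changes. */
static void stat_add(struct pcpu_stat *s, int cpu, long delta)
{
	s->count[cpu] += delta;
}

/* The true value is the sum over all slots. */
static long stat_sum(const struct pcpu_stat *s)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += s->count[cpu];
	return sum;
}

/* Print each slot and the total, to inspect the model by hand. */
static void stat_dump(const struct pcpu_stat *s)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu%d: %ld\n", cpu, s->count[cpu]);
	printf("sum:  %ld\n", stat_sum(s));
}
```

      If 13 pages are charged on CPU 0 and 3 of them are later uncharged on
      CPU 1, CPU 1's slot holds -3 while the sum is 10.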
      
      Fixes: 3ea67d06 ("memcg: add per cgroup writeback pages accounting")
      Reported-and-tested-by: Flavio Leitner <fbl@redhat.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reviewed-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 31 Oct, 2013 3 commits
  4. 30 Oct, 2013 3 commits
      memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting · 5e8cfc3c
      Greg Thelen authored
      As of commit 3ea67d06 ("memcg: add per cgroup writeback pages
      accounting") memcg counter errors are possible when moving charged
      memory to a different memcg.  Charge movement occurs when processing
      writes to memory.force_empty, moving tasks to a memcg with
      memory.move_charge_at_immigrate=1, or memcg deletion.
      
      An example showing error after memory.force_empty:
      
        $ cd /sys/fs/cgroup/memory
        $ mkdir x
        $ rm /data/tmp/file
        $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
        [1] 13600
        $ grep ^mapped x/memory.stat
        mapped_file 1048576
        $ echo 13600 > tasks
        $ echo 1 > x/memory.force_empty
        $ grep ^mapped x/memory.stat
        mapped_file 4503599627370496
      
      mapped_file should end with 0.
        4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
        1048576          == 0x10,0000           == 0x100 pages
      
      This issue only affects the source memcg on 64 bit machines; the
      destination memcg counters are correct.  So the rmdir case is not too
      important because such counters are soon disappearing with the entire
      memcg.  But the memory.force_empty and memory.move_charge_at_immigrate=1
      cases are larger problems as the bogus counters are visible for the
      (possibly long) remaining life of the source memcg.
      
      The problem is due to memcg use of __this_cpu_add(.., -nr_pages), which
      is subtly wrong because it adds the negation of the unsigned int
      nr_pages (either 1 or 512 for THP) to a signed long percpu counter.
      When nr_pages=1, -nr_pages=0xffffffff.  On 64 bit machines
      stat->count[idx] is signed 64 bit.  So memcg's attempt to simply
      decrement a count (e.g. from 1 to 0) boils down to:
      
        long count = 1
        unsigned int nr_pages = 1
        count += -nr_pages  /* -nr_pages == 0xffff,ffff */
        count is now 0x1,0000,0000 instead of 0
      
      The fix is to subtract the unsigned page count rather than adding its
      negation.  This only works once "percpu: fix this_cpu_sub() subtrahend
      casting for unsigneds" is applied to fix this_cpu_sub().
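
      The arithmetic can be reproduced in userspace.  The following is a
      minimal sketch, not the kernel code: the function names are
      hypothetical, and int64_t/uint32_t stand in for the 64 bit signed
      counter and the unsigned int page count so that the result does not
      depend on the width of long.

```c
#include <stdint.h>

/* Buggy form: negate the unsigned count, then add it. */
static int64_t dec_by_adding_negation(int64_t count, uint32_t nr_pages)
{
	/*
	 * -nr_pages is computed as a 32 bit unsigned value, so 1 becomes
	 * 0xffffffff.  It is then zero-extended (not sign-extended) to
	 * 64 bits before the addition.
	 */
	count += -nr_pages;
	return count;
}

/* Fixed form: subtract the unsigned count directly. */
static int64_t dec_by_subtracting(int64_t count, uint32_t nr_pages)
{
	count -= nr_pages;
	return count;
}
```

      With count=1 and nr_pages=1 the first form yields 0x1,0000,0000 and the
      second yields 0.  Applying the first form once per page to a count of
      0x100 pages leaves 0x100,0000,0000, which is consistent with the bogus
      mapped_file value in the example above.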

      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/pagewalk.c: fix walk_page_range() access of wrong PTEs · 3017f079
      Chen LinX authored
      When walk_page_range() walks a memory map's page tables, it skips
      VM_PFNMAP areas and assigns vma->vm_end to the variable 'next', which
      may be larger than 'end'.  In the next loop iteration, 'addr' will be
      larger than 'next'.  Then, when /proc/XXXX/pagemap is read, 'addr'
      keeps growing forever in pagemap_pte_range() and
      pte_to_pagemap_entry() accesses the wrong pte.
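
      The failure mode can be modelled in userspace.  This is only a sketch
      under stated assumptions: walk_model() and its parameters are
      hypothetical, a fixed 'step' stands in for the jump to vma->vm_end,
      and the "less than" test is one way to bound such a loop rather than a
      quotation of the patch.  It shows why a loop that only stops when
      'addr' equals 'end' never terminates once a step overshoots 'end'.

```c
#include <stdbool.h>

/*
 * Model of a range walk over [addr, end) in which each step may overshoot
 * 'end'.  Returns the number of iterations performed, or -1 if max_iter
 * iterations ran without the loop condition ever becoming false.
 */
static int walk_model(unsigned long addr, unsigned long end,
		      unsigned long step, bool use_less_than, int max_iter)
{
	int iters = 0;

	do {
		if (iters == max_iter)
			return -1;	/* still walking, far past 'end' */
		iters++;
		addr += step;		/* stands in for 'addr = next' */
	} while (use_less_than ? addr < end : addr != end);

	return iters;
}
```

      With end=10 and step=4, the "not equal" test skips over 10 and keeps
      going, while the "less than" test stops after three iterations.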
      
        BUG: Bad page map in process procrank  pte:8437526f pmd:785de067
        addr:9108d000 vm_flags:00200073 anon_vma:f0d99020 mapping:  (null) index:9108d
        CPU: 1 PID: 4974 Comm: procrank Tainted: G    B   W  O 3.10.1+ #1
        Call Trace:
          dump_stack+0x16/0x18
          print_bad_pte+0x114/0x1b0
          vm_normal_page+0x56/0x60
          pagemap_pte_range+0x17a/0x1d0
          walk_page_range+0x19e/0x2c0
          pagemap_read+0x16e/0x200
          vfs_read+0x84/0x150
          SyS_read+0x4a/0x80
          syscall_call+0x7/0xb

      Signed-off-by: Liu ShuoX <shuox.liu@intel.com>
      Signed-off-by: Chen LinX <linx.z.chen@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>	[3.10.x+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: list_lru: fix almost infinite loop causing effective livelock · c56b097a
      Russell King authored
      I've seen a fair number of issues with kswapd and other processes
      appearing to get stuck in v3.12-rc.  Using sysrq-p many times seems to
      indicate that it gets stuck somewhere in list_lru_walk_node(), called
      from prune_icache_sb() and super_cache_scan().
      
      I never seem to be able to trigger a calltrace for functions above that
      point.
      
      So I decided to add the following to super_cache_scan():
      
          @@ -81,10 +81,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
                  inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
                  dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
                  total_objects = dentries + inodes + fs_objects + 1;
          +printk("%s:%u: %s: dentries %lu inodes %lu total %lu\n", current->comm, current->pid, __func__, dentries, inodes, total_objects);
      
                  /* proportion the scan between the caches */
                  dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
                  inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
          +printk("%s:%u: %s: dentries %lu inodes %lu\n", current->comm, current->pid, __func__, dentries, inodes);
          +BUG_ON(dentries == 0);
          +BUG_ON(inodes == 0);
      
                  /*
                   * prune the dcache first as the icache is pinned by it, then
          @@ -99,7 +103,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
                          freed += sb->s_op->free_cached_objects(sb, fs_objects,
                                                                 sc->nid);
                  }
          -
          +printk("%s:%u: %s: dentries %lu inodes %lu freed %lu\n", current->comm, current->pid, __func__, dentries, inodes, freed);
                  drop_super(sb);
                  return freed;
           }
      
      and shortly thereafter, having applied some pressure, I got this:
      
          update-apt-xapi:1616: super_cache_scan: dentries 25632 inodes 2 total 25635
          update-apt-xapi:1616: super_cache_scan: dentries 1023 inodes 0
          ------------[ cut here ]------------
          Kernel BUG at c0101994 [verbose debug info unavailable]
          Internal error: Oops - BUG: 0 [#3] SMP ARM
          Modules linked in: fuse rfcomm bnep bluetooth hid_cypress
          CPU: 0 PID: 1616 Comm: update-apt-xapi Tainted: G      D      3.12.0-rc7+ #154
          task: daea1200 ti: c3bf8000 task.ti: c3bf8000
          PC is at super_cache_scan+0x1c0/0x278
          LR is at trace_hardirqs_on+0x14/0x18
          Process update-apt-xapi (pid: 1616, stack limit = 0xc3bf8240)
          ...
          Backtrace:
            (super_cache_scan) from [<c00cd69c>] (shrink_slab+0x254/0x4c8)
            (shrink_slab) from [<c00d09a0>] (try_to_free_pages+0x3a0/0x5e0)
            (try_to_free_pages) from [<c00c59cc>] (__alloc_pages_nodemask+0x5)
            (__alloc_pages_nodemask) from [<c00e07c0>] (__pte_alloc+0x2c/0x13)
            (__pte_alloc) from [<c00e3a70>] (handle_mm_fault+0x84c/0x914)
            (handle_mm_fault) from [<c001a4cc>] (do_page_fault+0x1f0/0x3bc)
            (do_page_fault) from [<c001a7b0>] (do_translation_fault+0xac/0xb8)
            (do_translation_fault) from [<c000840c>] (do_DataAbort+0x38/0xa0)
            (do_DataAbort) from [<c00133f8>] (__dabt_usr+0x38/0x40)
      
      Notice that we had a very low number of inodes, which were reduced to
      zero by mult_frac().
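
      The rounding can be checked by hand.  The sketch below reimplements
      the arithmetic of mult_frac() as a plain function, as I understand the
      macro (x * numer / denom, computed without overflowing the intermediate
      product).  The name mult_frac_ul() is hypothetical, and the value 1024
      for sc->nr_to_scan is an assumption that happens to be consistent with
      the numbers printed above.

```c
#include <assert.h>

/*
 * Compute x * numer / denom, rounding down, by splitting x into a
 * quotient and remainder with respect to denom.
 */
static unsigned long mult_frac_ul(unsigned long x, unsigned long numer,
				  unsigned long denom)
{
	unsigned long quot, rem;

	assert(denom != 0);
	quot = x / denom;
	rem = x % denom;
	return quot * numer + (rem * numer) / denom;
}
```

      With 25632 dentries, 2 inodes and a total of 25635, a scan budget of
      1024 is split into 1023 dentries and 0 inodes, because 1024 * 2 / 25635
      rounds down to zero.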
      
      Now, prune_icache_sb() calls list_lru_walk_node() passing that number of
      inodes (0) into that as the number of objects to scan:
      
          long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
                               int nid)
          {
                  LIST_HEAD(freeable);
                  long freed;
      
                  freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
                                                 &freeable, &nr_to_scan);
      
      which does:
      
          unsigned long
          list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
                             void *cb_arg, unsigned long *nr_to_walk)
          {
      
                  struct list_lru_node    *nlru = &lru->node[nid];
                  struct list_head *item, *n;
                  unsigned long isolated = 0;
      
                  spin_lock(&nlru->lock);
          restart:
                  list_for_each_safe(item, n, &nlru->list) {
                          enum lru_status ret;
      
                          /*
                           * decrement nr_to_walk first so that we don't livelock if we
                         * get stuck on large numbers of LRU_RETRY items
                           */
                          if (--(*nr_to_walk) == 0)
                                  break;
      
      So, if *nr_to_walk was zero when this function was entered, that means
      we're wanting to operate on (~0UL)+1 objects - which might as well be
      infinite.
      
      Clearly this is not correct behaviour.  If we think about the behaviour
      of this function when *nr_to_walk is 1, then clearly it's wrong - we
      decrement first and then test for zero - which results in us doing
      nothing at all.  A post-decrement would give the desired behaviour -
      we'd try to walk one object and one object only if *nr_to_walk were one.
      
      It also gives the correct behaviour for zero - we exit at this point.
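
      The three orderings can be compared in a small userspace model.  This
      is a sketch only: walk_pre(), walk_post() and walk_checked() are
      hypothetical names, a plain index loop stands in for
      list_for_each_safe(), and the "checked" variant is one way to avoid
      underflowing the count, not a quotation of the final patch.

```c
/*
 * Each function visits up to n_items list entries with a budget of
 * *nr_to_walk and returns how many entries were visited.
 */

/* Pre-decrement, as in the code quoted above. */
static unsigned long walk_pre(unsigned long *nr_to_walk, unsigned long n_items)
{
	unsigned long visited = 0;

	for (unsigned long i = 0; i < n_items; i++) {
		if (--(*nr_to_walk) == 0)
			break;
		visited++;
	}
	return visited;
}

/* Post-decrement: right number of visits, but the count wraps on exit. */
static unsigned long walk_post(unsigned long *nr_to_walk, unsigned long n_items)
{
	unsigned long visited = 0;

	for (unsigned long i = 0; i < n_items; i++) {
		if ((*nr_to_walk)-- == 0)
			break;
		visited++;
	}
	return visited;
}

/* Test first, then decrement: the count never goes below zero. */
static unsigned long walk_checked(unsigned long *nr_to_walk,
				  unsigned long n_items)
{
	unsigned long visited = 0;

	for (unsigned long i = 0; i < n_items; i++) {
		if (*nr_to_walk == 0)
			break;
		--(*nr_to_walk);
		visited++;
	}
	return visited;
}
```

      On a five-entry list, walk_pre() visits nothing for a budget of 1 and
      everything for a budget of 0.  walk_post() visits 1 and 0 entries
      respectively but leaves the budget at ~0UL, which matters if the
      caller loops.  walk_checked() visits 1 and 0 entries and leaves the
      budget at 0.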
      
      Fixes: 5cedf721 ("list_lru: fix broken LRU_RETRY behaviour")
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      [ Modified to make sure we never underflow the count: this function gets
        called in a loop, so the 0 -> ~0ul transition is dangerous  - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 29 Oct, 2013 6 commits