1. 14 Jan, 2016 3 commits
  2. 22 Nov, 2015 1 commit
    • Jesper Dangaard Brouer's avatar
      slab/slub: adjust kmem_cache_alloc_bulk API · 865762a8
      Jesper Dangaard Brouer authored
      Adjust kmem_cache_alloc_bulk API before we have any real users.
      
      Adjust API to return type 'int' instead of previously type 'bool'.  This
      is done to allow future extension of the bulk alloc API.
      
      A future extension could be to allow SLUB to stop at a page boundary, when
      specified by a flag, and then return the number of objects.
      
      The advantage of this approach, would make it easier to make bulk alloc
      run without local IRQs disabled.  With an approach of cmpxchg "stealing"
      the entire c->freelist or page->freelist.  To avoid overshooting we would
      stop processing at a slab-page boundary.  Else we always end up returning
      some objects at the cost of another cmpxchg.
      
      To keep compatible with future users of this API linking against an older
      kernel when using the new flag, we need to return the number of allocated
      objects with this API change.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      865762a8
  3. 06 Nov, 2015 2 commits
    • Kirill A. Shutemov's avatar
      slab, slub: use page->rcu_head instead of page->lru plus cast · bc4f610d
      Kirill A. Shutemov authored
      We have properly typed page->rcu_head, no need to cast page->lru.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc4f610d
    • Mel Gorman's avatar
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman authored
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  4. 05 Nov, 2015 2 commits
    • Vladimir Davydov's avatar
      memcg: unify slab and other kmem pages charging · f3ccb2c4
      Vladimir Davydov authored
      We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
      uncharging kmem pages to memcg, but currently they are not used for
      charging slab pages (i.e.  they are only used for charging pages allocated
      with alloc_kmem_pages).  The only reason why the slab subsystem uses
      special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
      needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
      to the memcg that the current task belongs to.
      
      To remove this diversity, this patch adds an extra argument to
      __memcg_kmem_charge that can be a pointer to a memcg or NULL.  If it is
      not NULL, the function tries to charge to the memcg it points to,
      otherwise it charge to the current context.  Next, it makes the slab
      subsystem use this function to charge slab pages.
      
      Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
      in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined.  Since
      __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
      don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
      Besides, one can now detect which memcg a slab page belongs to by reading
      /proc/kpagecgroup.
      
      Note, this patch switches slab to charge-after-alloc design.  Since this
      design is already used for all other memcg charges, it should not make any
      difference.
      
      [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f3ccb2c4
    • Catalin Marinas's avatar
      mm: slab: only move management objects off-slab for sizes larger than KMALLOC_MIN_SIZE · d4322d88
      Catalin Marinas authored
      On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc
      configurations defining ARCH_DMA_MINALIGN to 128), the first
      kmalloc_caches[] entry to be initialised after slab_early_init = 0 is
      "kmalloc-128" with index 7.  Depending on the debug kernel configuration,
      sizeof(struct kmem_cache) can be larger than 128 resulting in an
      INDEX_NODE of 8.
      
      Commit 8fc9cf42 ("slab: make more slab management structure off the
      slab") enables off-slab management objects for sizes starting with
      PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation
      of the "kmalloc-128" cache would try to place the management objects
      off-slab.  However, since KMALLOC_MIN_SIZE is already 128 and
      freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size)
      returns NULL (kmalloc_caches[7] not populated yet).  This triggers the
      following bug on arm64:
      
        kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283!
        Internal error: Oops - BUG: 0 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540
        Hardware name: Juno (DT)
        PC is at __kmem_cache_create+0x21c/0x280
        LR is at __kmem_cache_create+0x210/0x280
        [...]
        Call trace:
          __kmem_cache_create+0x21c/0x280
          create_boot_cache+0x48/0x80
          create_kmalloc_cache+0x50/0x88
          create_kmalloc_caches+0x4c/0xf4
          kmem_cache_init+0x100/0x118
          start_kernel+0x214/0x33c
      
      This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab
      management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE.
      
      Fixes: 8fc9cf42 ("slab: make more slab management structure off the slab")
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Reported-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>	[3.15+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d4322d88
  5. 01 Oct, 2015 1 commit
    • Joonsoo Kim's avatar
      mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1) · 03a2d2a3
      Joonsoo Kim authored
      Commit description is copied from the original post of this bug:
      
        http://comments.gmane.org/gmane.linux.kernel.mm/135349
      
      Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next
      larger cache size than the size index INDEX_NODE mapping.  In kernels
      3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size.
      
      However, sometimes we can't get the right output we expected via
      kmalloc_size(INDEX_NODE + 1), causing a BUG().
      
      The mapping table in the latest kernel is like:
          index = {0,   1,  2 ,  3,  4,   5,   6,   n}
           size = {0,   96, 192, 8, 16,  32,  64,   2^n}
      The mapping table before 3.10 is like this:
          index = {0 , 1 , 2,   3,  4 ,  5 ,  6,   n}
          size  = {32, 64, 96, 128, 192, 256, 512, 2^(n+3)}
      
      The problem on my mips64 machine is as follows:
      
      (1) When configured DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC
          && DEBUG_SPINLOCK, the sizeof(struct kmem_cache_node) will be "150",
          and the macro INDEX_NODE turns out to be "2": #define INDEX_NODE
          kmalloc_index(sizeof(struct kmem_cache_node))
      
      (2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8.
      
      (3) Then "if(size >= kmalloc_size(INDEX_NODE + 1)" will lead to "size
          = PAGE_SIZE".
      
      (4) Then "if ((size >= (PAGE_SIZE >> 3))" test will be satisfied and
          "flags |= CFLGS_OFF_SLAB" will be covered.
      
      (5) if (flags & CFLGS_OFF_SLAB)" test will be satisfied and will go to
          "cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", and the result
          here may be NULL while kernel bootup.
      
      (6) Finally,"BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" causes the
          BUG info as the following shows (may be only mips64 has this problem):
      
      This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes
      the BUG by adding 'size >= 256' check to guarantee that all necessary
      small sized slabs are initialized regardless sequence of slab size in
      mapping table.
      
      Fixes: e3366016 ("slab: Use common kmalloc_index/kmalloc_size...")
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: default avatarLiuhailong <liu.hailong6@zte.com.cn>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03a2d2a3
  6. 08 Sep, 2015 1 commit
    • Vlastimil Babka's avatar
      mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Vlastimil Babka authored
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node(), that doesn't
      fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has lead to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has been also considered to really provide a convenience function for
      allocations restricted to a node, but the major opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would be previously
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarRobin Holt <robinmholt@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      96db800f
  7. 04 Sep, 2015 1 commit
    • Christoph Lameter's avatar
      slab: infrastructure for bulk object allocation and freeing · 484748f0
      Christoph Lameter authored
      Add the basic infrastructure for alloc/free operations on pointer arrays.
      It includes a generic function in the common slab code that is used in
      this infrastructure patch to create the unoptimized functionality for slab
      bulk operations.
      
      Allocators can then provide optimized allocation functions for situations
      in which large numbers of objects are needed.  These optimization may
      avoid taking locks repeatedly and bypass metadata creation if all objects
      in slab pages can be used to provide the objects required.
      
      Allocators can extend the skeletons provided and add their own code to the
      bulk alloc and free functions.  They can keep the generic allocation and
      freeing and just fall back to those if optimizations would not work (like
      for example when debugging is on).
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      484748f0
  8. 21 Aug, 2015 1 commit
    • Michal Hocko's avatar
      mm: make page pfmemalloc check more robust · 2f064f34
      Michal Hocko authored
      Commit c48a11c7 ("netvm: propagate page->pfmemalloc to skb") added
      checks for page->pfmemalloc to __skb_fill_page_desc():
      
              if (page->pfmemalloc && !page->mapping)
                      skb->pfmemalloc = true;
      
      It assumes page->mapping == NULL implies that page->pfmemalloc can be
      trusted.  However, __delete_from_page_cache() can set set page->mapping
      to NULL and leave page->index value alone.  Due to being in union, a
      non-zero page->index will be interpreted as true page->pfmemalloc.
      
      So the assumption is invalid if the networking code can see such a page.
      And it seems it can.  We have encountered this with a NFS over loopback
      setup when such a page is attached to a new skbuf.  There is no copying
      going on in this case so the page confuses __skb_fill_page_desc which
      interprets the index as pfmemalloc flag and the network stack drops
      packets that have been allocated using the reserves unless they are to
      be queued on sockets handling the swapping which is the case here and
      that leads to hangs when the nfs client waits for a response from the
      server which has been dropped and thus never arrive.
      
      The struct page is already heavily packed so rather than finding another
      hole to put it in, let's do a trick instead.  We can reuse the index
      again but define it to an impossible value (-1UL).  This is the page
      index so it should never see the value that large.  Replace all direct
      users of page->pfmemalloc by page_is_pfmemalloc which will hide this
      nastiness from unspoiled eyes.
      
      The information will get lost if somebody wants to use page->index
      obviously but that was the case before and the original code expected
      that the information should be persisted somewhere else if that is
      really needed (e.g.  what SLAB and SLUB do).
      
      [akpm@linux-foundation.org: fix blooper in slub]
      Fixes: c48a11c7 ("netvm: propagate page->pfmemalloc to skb")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Debugged-by: default avatarVlastimil Babka <vbabka@suse.com>
      Debugged-by: default avatarJiri Bohac <jbohac@suse.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.6+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2f064f34
  9. 24 Jun, 2015 1 commit
    • Daniel Sanders's avatar
      slab: correct size_index table before replacing the bootstrap kmem_cache_node · 34cc6990
      Daniel Sanders authored
      This patch moves the initialization of the size_index table slightly
      earlier so that the first few kmem_cache_node's can be safely allocated
      when KMALLOC_MIN_SIZE is large.
      
      There are currently two ways to generate indices into kmalloc_caches (via
      kmalloc_index() and via the size_index table in slab_common.c) and on some
      arches (possibly only MIPS) they potentially disagree with each other
      until create_kmalloc_caches() has been called.  It seems that the
      intention is that the size_index table is a fast equivalent to
      kmalloc_index() and that create_kmalloc_caches() patches the table to
      return the correct value for the cases where kmalloc_index()'s
      if-statements apply.
      
      The failing sequence was:
      * kmalloc_caches contains NULL elements
      * kmem_cache_init initialises the element that 'struct
        kmem_cache_node' will be allocated to. For 32-bit Mips, this is a
        56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7).
      * init_list is called which calls kmalloc_node to allocate a 'struct
        kmem_cache_node'.
      * kmalloc_slab selects the kmem_caches element using
        size_index[size_index_elem(size)]. For MIPS, size is 56, and the
        expression returns 6.
      * This element of kmalloc_caches is NULL and allocation fails.
      * If it had not already failed, it would have called
        create_kmalloc_caches() at this point which would have changed
        size_index[size_index_elem(size)] to 7.
      
      I don't believe the bug to be LLVM specific but GCC doesn't normally
      encounter the problem.  I haven't been able to identify exactly what GCC
      is doing better (probably inlining) but it seems that GCC is managing to
      optimize to the point that it eliminates the problematic allocations.
      This theory is supported by the fact that GCC can be made to fail in the
      same way by changing inline, __inline, __inline__, and __always_inline in
      include/linux/compiler-gcc.h such that they don't actually inline things.
      Signed-off-by: default avatarDaniel Sanders <daniel.sanders@imgtec.com>
      Acked-by: default avatarPekka Enberg <penberg@kernel.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      34cc6990
  10. 14 Apr, 2015 1 commit
    • David Rientjes's avatar
      mm: remove GFP_THISNODE · 4167e9b2
      David Rientjes authored
      NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.
      
      GFP_THISNODE is a secret combination of gfp bits that have different
      behavior than expected.  It is a combination of __GFP_THISNODE,
      __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page
      allocator slowpath to fail without trying reclaim even though it may be
      used in combination with __GFP_WAIT.
      
      An example of the problem this creates: commit e97ca8e5 ("mm: fix
      GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
      that really just wanted __GFP_THISNODE.  The problem doesn't end there,
      however, because even it was a no-op for alloc_misplaced_dst_page(),
      which also sets __GFP_NORETRY and __GFP_NOWARN, and
      migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
      is set in GFP_TRANSHUGE.  Converting GFP_THISNODE to __GFP_THISNODE is a
      no-op in these cases since the page allocator special-cases
      __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.
      
      It's time to just remove GFP_THISNODE entirely.  We leave __GFP_THISNODE
      to restrict an allocation to a local node, but remove GFP_THISNODE and
      its obscurity.  Instead, we require that a caller clear __GFP_WAIT if it
      wants to avoid reclaim.
      
      This allows the aforementioned functions to actually reclaim as they
      should.  It also enables any future callers that want to do
      __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim.  The
      rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.
      
      Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
      it is unchanged.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: default avatarPekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Jarno Rajahalme <jrajahalme@nicira.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4167e9b2
  11. 12 Feb, 2015 2 commits
    • Vladimir Davydov's avatar
      slub: make dead caches discard free slabs immediately · d6e0b7fa
      Vladimir Davydov authored
      To speed up further allocations SLUB may store empty slabs in per cpu/node
      partial lists instead of freeing them immediately.  This prevents per
      memcg caches destruction, because kmem caches created for a memory cgroup
      are only destroyed after the last page charged to the cgroup is freed.
      
      To fix this issue, this patch resurrects approach first proposed in [1].
      It forbids SLUB to cache empty slabs after the memory cgroup that the
      cache belongs to was destroyed.  It is achieved by setting kmem_cache's
      cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
      that it would drop frozen empty slabs immediately if cpu_partial = 0.
      
      The runtime overhead is minimal.  From all the hot functions, we only
      touch relatively cold put_cpu_partial(): we make it call
      unfreeze_partials() after freezing a slab that belongs to an offline
      memory cgroup.  Since slab freezing exists to avoid moving slabs from/to a
      partial list on free/alloc, and there can't be allocations from dead
      caches, it shouldn't cause any overhead.  We do have to disable preemption
      for put_cpu_partial() to achieve that though.
      
      The original patch was accepted well and even merged to the mm tree.
      However, I decided to withdraw it due to changes happening to the memcg
      core at that time.  I had an idea of introducing per-memcg shrinkers for
      kmem caches, but now, as memcg has finally settled down, I do not see it
      as an option, because SLUB shrinker would be too costly to call since SLUB
      does not keep free slabs on a separate list.  Besides, we currently do not
      even call per-memcg shrinkers for offline memcgs.  Overall, it would
      introduce much more complexity to both SLUB and memcg than this small
      patch.
      
      Regarding to SLAB, there's no problem with it, because it shrinks
      per-cpu/node caches periodically.  Thanks to list_lru reparenting, we no
      longer keep entries for offline cgroups in per-memcg arrays (such as
      memcg_cache_params->memcg_caches), so we do not have to bother if a
      per-memcg cache will be shrunk a bit later than it could be.
      
      [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d6e0b7fa
    • Vladimir Davydov's avatar
      slab: link memcg caches of the same kind into a list · 426589f5
      Vladimir Davydov authored
      Sometimes, we need to iterate over all memcg copies of a particular root
      kmem cache.  Currently, we use memcg_cache_params->memcg_caches array for
      that, because it contains all existing memcg caches.
      
      However, it's a bad practice to keep all caches, including those that
      belong to offline cgroups, in this array, because it will be growing
      beyond any bounds then.  I'm going to wipe away dead caches from it to
      save space.  To still be able to perform iterations over all memcg caches
      of the same kind, let us link them into a list.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      426589f5
  12. 13 Dec, 2014 2 commits
    • Vladimir Davydov's avatar
      slab: fix cpuset check in fallback_alloc · 061d7074
      Vladimir Davydov authored
      fallback_alloc is called on kmalloc if the preferred node doesn't have
      free or partial slabs and there's no pages on the node's free list
      (GFP_THISNODE allocations fail).  Before invoking the reclaimer it tries
      to locate a free or partial slab on other allowed nodes' lists.  While
      iterating over the preferred node's zonelist it skips those zones which
      hardwall cpuset check returns false for.  That means that for a task bound
      to a specific node using cpusets fallback_alloc will always ignore free
      slabs on other nodes and go directly to the reclaimer, which, however, may
      allocate from other nodes if cpuset.mem_hardwall is unset (default).  As a
      result, we may get lists of free slabs grow without bounds on other nodes,
      which is bad, because inactive slabs are only evicted by cache_reap at a
      very slow rate and cannot be dropped forcefully.
      
      To reproduce the issue, run a process that will walk over a directory tree
      with lots of files inside a cpuset bound to a node that constantly
      experiences memory pressure.  Look at num_slabs vs active_slabs growth as
      reported by /proc/slabinfo.
      
      To avoid this we should use softwall cpuset check in fallback_alloc.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarZefan Li <lizefan@huawei.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      061d7074
    • Vladimir Davydov's avatar
      memcg: fix possible use-after-free in memcg_kmem_get_cache() · 8135be5a
      Vladimir Davydov authored
      Suppose task @t that belongs to a memory cgroup @memcg is going to
      allocate an object from a kmem cache @c.  The copy of @c corresponding to
      @memcg, @mc, is empty.  Then if kmem_cache_alloc races with the memory
      cgroup destruction we can access the memory cgroup's copy of the cache
      after it was destroyed:
      
      CPU0				CPU1
      ----				----
      [ current=@t
        @mc->memcg_params->nr_pages=0 ]
      
      kmem_cache_alloc(@c):
        call memcg_kmem_get_cache(@c);
        proceed to allocation from @mc:
          alloc a page for @mc:
            ...
      
      				move @t from @memcg
      				destroy @memcg:
      				  mem_cgroup_css_offline(@memcg):
      				    memcg_unregister_all_caches(@memcg):
      				      kmem_cache_destroy(@mc)
      
          add page to @mc
      
      We could fix this issue by taking a reference to a per-memcg cache, but
      that would require adding a per-cpu reference counter to per-memcg caches,
      which would look cumbersome.
      
      Instead, let's take a reference to a memory cgroup, which already has a
      per-cpu reference counter, in the beginning of kmem_cache_alloc to be
      dropped in the end, and move per memcg caches destruction from css offline
      to css free.  As a side effect, per-memcg caches will be destroyed not one
      by one, but all at once when the last page accounted to the memory cgroup
      is freed.  This doesn't sound as a high price for code readability though.
      
      Note, this patch does add some overhead to the kmem_cache_alloc hot path,
      but it is pretty negligible - it's just a function call plus a per cpu
      counter decrement, which is comparable to what we already have in
      memcg_kmem_get_cache.  Besides, it's only relevant if there are memory
      cgroups with kmem accounting enabled.  I don't think we can find a way to
      handle this race w/o it, because alloc_page called from kmem_cache_alloc
      may sleep so we can't flush all pending kmallocs w/o reference counting.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8135be5a
  13. 10 Dec, 2014 3 commits
  14. 03 Dec, 2014 1 commit
    • Paul Mackerras's avatar
      slab: fix nodeid bounds check for non-contiguous node IDs · 7c3fbbdd
      Paul Mackerras authored
      The bounds check for nodeid in ____cache_alloc_node gives false
      positives on machines where the node IDs are not contiguous, leading to
      a panic at boot time.  For example, on a POWER8 machine the node IDs are
      typically 0, 1, 16 and 17.  This means that num_online_nodes() returns
      4, so when ____cache_alloc_node is called with nodeid = 16 the VM_BUG_ON
      triggers, like this:
      
        kernel BUG at /home/paulus/kernel/kvm/mm/slab.c:3079!
        Call Trace:
          .____cache_alloc_node+0x5c/0x270 (unreliable)
          .kmem_cache_alloc_node_trace+0xdc/0x360
          .init_list+0x3c/0x128
          .kmem_cache_init+0x1dc/0x258
          .start_kernel+0x2a0/0x568
          start_here_common+0x20/0xa8
      
      To fix this, we instead compare the nodeid with MAX_NUMNODES, and
      additionally make sure it isn't negative (since nodeid is an int).  The
      check is there mainly to protect the array dereference in the get_node()
      call in the next line, and the array being dereferenced is of size
      MAX_NUMNODES.  If the nodeid is in range but invalid (for example if the
      node is off-line), the BUG_ON in the next line will catch that.
      
      Fixes: 14e50c6a ("mm: slab: Verify the nodeid passed to ____cache_alloc_node")
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: default avatarPekka Enberg <penberg@kernel.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7c3fbbdd
  15. 27 Oct, 2014 1 commit
    • Vladimir Davydov's avatar
      cpuset: simplify cpuset_node_allowed API · 344736f2
      Vladimir Davydov authored
      Current cpuset API for checking if a zone/node is allowed to allocate
      from looks rather awkward. We have hardwall and softwall versions of
      cpuset_node_allowed with the softwall version doing literally the same
      as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
      If it isn't, the softwall version may check the given node against the
      enclosing hardwall cpuset, which it needs to take the callback lock to
      do.
      
      Such a distinction was introduced by commit 02a0e53d ("cpuset:
      rework cpuset_zone_allowed api"). Before, we had the only version with
      the __GFP_HARDWALL flag determining its behavior. The purpose of the
      commit was to avoid sleep-in-atomic bugs when someone would mistakenly
      call the function without the __GFP_HARDWALL flag for an atomic
      allocation. The suffixes introduced were intended to make the callers
      think before using the function.
      
      However, since the callback lock was converted from mutex to spinlock by
      the previous patch, the softwall check function cannot sleep, and these
      precautions are no longer necessary.
      
      So let's simplify the API back to the single check.
      Suggested-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarZefan Li <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      344736f2
  16. 13 Oct, 2014 1 commit
    • Joonsoo Kim's avatar
      mm/slab: fix unaligned access on sparc64 · 85c9f4b0
      Joonsoo Kim authored
      Commit bf0dea23 ("mm/slab: use percpu allocator for cpu cache")
      changed the allocation method for cpu cache array from slab allocator to
      percpu allocator.  Alignment should be provided for aligned memory in
      percpu allocator case, but, that commit mistakenly set this alignment to
      0.  So, percpu allocator returns unaligned memory address.  It doesn't
      cause any problem on x86 which permits unaligned access, but, it causes
      the problem on sparc64 which needs strong guarantee of alignment.
      
      Following bug report is reported from David Miller.
      
        I'm getting tons of the following on sparc64:
      
        [603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
        [603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
        ...
        [603970.554394] log_unaligned: 333 callbacks suppressed
        ...
      
      This patch provides a proper alignment parameter when allocating cpu
      cache to fix this unaligned memory access problem on sparc64.
      Reported-by: default avatarDavid Miller <davem@davemloft.net>
      Tested-by: default avatarDavid Miller <davem@davemloft.net>
      Tested-by: default avatarMeelis Roos <mroos@linux.ee>
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85c9f4b0
  17. 09 Oct, 2014 7 commits
    • Rob Jones's avatar
      mm/slab.c: use __seq_open_private() instead of seq_open() · b208ce32
      Rob Jones authored
      Using __seq_open_private() removes boilerplate code from slabstats_open()
      
      The resultant code is shorter and easier to follow.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarRob Jones <rob.jones@codethink.co.uk>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b208ce32
    • Joonsoo Kim's avatar
      mm/slab: use percpu allocator for cpu cache · bf0dea23
      Joonsoo Kim authored
      Because of chicken and egg problem, initialization of SLAB is really
      complicated.  We need to allocate cpu cache through SLAB to make the
      kmem_cache work, but before initialization of kmem_cache, allocation
      through SLAB is impossible.
      
      On the other hand, SLUB does initialization in a more simple way.  It uses
      percpu allocator to allocate cpu cache so there is no chicken and egg
      problem.
      
      So, this patch try to use percpu allocator in SLAB.  This simplifies the
      initialization step in SLAB so that we could maintain SLAB code more
      easily.
      
      In my testing there is no performance difference.
      
      This implementation relies on percpu allocator.  Because percpu allocator
      uses vmalloc address space, vmalloc address space could be exhausted by
      this change on many cpu system with *32 bit* kernel.  This implementation
      can cover 1024 cpus in worst case by following calculation.
      
      Worst: 1024 cpus * 4 bytes for pointer * 300 kmem_caches *
      	120 objects per cpu_cache = 140 MB
      Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches(slab merge) *
      	80 objects per cpu_cache = 46 MB
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jeremiah Mahler <jmmahler@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bf0dea23
    • Joonsoo Kim's avatar
      mm/slab: support slab merge · 12220dea
      Joonsoo Kim authored
      Slab merge is good feature to reduce fragmentation.  If new creating slab
      have similar size and property with exsitent slab, this feature reuse it
      rather than creating new one.  As a result, objects are packed into fewer
      slabs so that fragmentation is reduced.
      
      Below is result of my testing.
      
      * After boot, sleep 20; cat /proc/meminfo | grep Slab
      
      <Before>
      Slab: 25136 kB
      
      <After>
      Slab: 24364 kB
      
      We can save 3% memory used by slab.
      
      For supporting this feature in SLAB, we need to implement SLAB specific
      kmem_cache_flag() and __kmem_cache_alias(), because SLUB implements some
      SLUB specific processing related to debug flag and object size change on
      these functions.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      12220dea
    • Joonsoo Kim's avatar
      mm/slab: factor out unlikely part of cache_free_alien() · 25c4f304
      Joonsoo Kim authored
      cache_free_alien() is rarely used function when node mismatch.  But, it is
      defined with inline attribute so it is inlined to __cache_free() which is
      core free function of slab allocator.  It uselessly makes
      kmem_cache_free()/kfree() functions large.  What we really need to inline
      is just checking node match so this patch factor out other parts of
      cache_free_alien() to reduce code size of kmem_cache_free()/ kfree().
      
      <Before>
      nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
      00000000000011e0 0000000000000228 T kfree
      0000000000000670 0000000000000216 T kmem_cache_free
      
      <After>
      nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
      0000000000001110 00000000000001b5 T kfree
      0000000000000750 0000000000000181 T kmem_cache_free
      
      You can see slightly reduced size of text: 0x228->0x1b5, 0x216->0x181.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      25c4f304
    • Joonsoo Kim's avatar
      mm/slab: noinline __ac_put_obj() · d3aec344
      Joonsoo Kim authored
      Our intention of __ac_put_obj() is that it doesn't affect anything if
      sk_memalloc_socks() is disabled.  But, because __ac_put_obj() is too
      small, compiler inline it to ac_put_obj() and affect code size of free
      path.  This patch add noinline keyword for __ac_put_obj() not to distrupt
      normal free path at all.
      
      <Before>
      nm -S slab-orig.o |
      	grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"
      
      0000000000001e80 00000000000002f5 t cache_alloc_refill
      0000000000001230 0000000000000258 T kfree
      0000000000000690 000000000000024c T kmem_cache_free
      
      <After>
      nm -S slab-patched.o |
      	grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"
      
      0000000000001e00 00000000000002e5 t cache_alloc_refill
      00000000000011e0 0000000000000228 T kfree
      0000000000000670 0000000000000216 T kmem_cache_free
      
      cache_alloc_refill: 0x2f5->0x2e5
      kfree: 0x256->0x228
      kmem_cache_free: 0x24c->0x216
      
      code size of each function is reduced slightly.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d3aec344
    • Joonsoo Kim's avatar
      mm/slab: move cache_flusharray() out of unlikely.text section · 3d880194
      Joonsoo Kim authored
      Now, due to likely keyword, compiled code of cache_flusharray() is on
      unlikely.text section.  Although it is uncommon case compared to free to
      cpu cache case, it is common case than free_block().  But, free_block() is
      on normal text section.  This patch fix this odd situation to remove
      likely keyword.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d880194
    • Joonsoo Kim's avatar
      mm/sl[ao]b: always track caller in kmalloc_(node_)track_caller() · 61f47105
      Joonsoo Kim authored
      Now, we track caller if tracing or slab debugging is enabled.  If they are
      disabled, we could save one argument passing overhead by calling
      __kmalloc(_node)().  But, I think that it would be marginal.  Furthermore,
      default slab allocator, SLUB, doesn't use this technique so I think that
      it's okay to change this situation.
      
      After this change, we can turn on/off CONFIG_DEBUG_SLAB without full
      kernel build and remove some complicated '#if' defintion.  It looks more
      benefitial to me.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      61f47105
  18. 26 Sep, 2014 1 commit
  19. 24 Sep, 2014 1 commit
    • Zefan Li's avatar
      cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags · 2ad654bc
      Zefan Li authored
      When we change cpuset.memory_spread_{page,slab}, cpuset will flip
      PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
      This should be done using atomic bitops, but currently we don't,
      which is broken.
      
      Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
      when one thread tried to clear PF_USED_MATH while at the same time another
      thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
      the same task.
      
      Here's the full report:
      https://lkml.org/lkml/2014/9/19/230
      
      To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
      
      v4:
      - updated mm/slab.c. (Fengguang Wu)
      - updated Documentation.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Kees Cook <keescook@chromium.org>
      Fixes: 950592f7 ("cpusets: update tasks' page/slab spread flags in time")
      Cc: <stable@vger.kernel.org> # 2.6.31+
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarZefan Li <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      2ad654bc
  20. 08 Aug, 2014 1 commit
    • Joonsoo Kim's avatar
      Revert "slab: remove BAD_ALIEN_MAGIC" · edcad250
      Joonsoo Kim authored
      This reverts commit a6406168 ("slab: remove BAD_ALIEN_MAGIC").
      
      commit a6406168 ("slab: remove BAD_ALIEN_MAGIC") assumes that the
      system with !CONFIG_NUMA has only one memory node.  But, it turns out to
      be false by the report from Geert.  His system, m68k, has many memory
      nodes and is configured in !CONFIG_NUMA.  So it couldn't boot with above
      change.
      
      Here goes his failure report.
      
        With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM:
      
        enable_cpucache failed for radix_tree_node, error 12.
        kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522!
        *** TRAP #7 ***   FORMAT=0
        Current process id is 0
        BAD KERNEL TRAP: 00000000
        Modules linked in:
        PC: [<0039c92c>] kmem_cache_init_late+0x70/0x8c
        SR: 2200  SP: 00345f90  a2: 0034c2e8
        d0: 0000003d    d1: 00000000    d2: 00000000    d3: 003ac942
        d4: 00000000    d5: 00000000    a0: 0034f686    a1: 0034f682
        Process swapper (pid: 0, task=0034c2e8)
        Frame format=0
        Stack from 00345fc4:
                002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000
                00000000 00000000 00000000 00000000 00000000 003ac942 00000000
                003912d6
        Call Trace: [<000360fa>] parse_args+0x0/0x2ca
         [<0017d806>] strlen+0x0/0x1a
         [<003921d4>] start_kernel+0x23c/0x428
         [<003912d6>] _sinittext+0x2d6/0x95e
      
        Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 <4879> 0035 4b1c 61ff
        fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c
        Disabling lock debugging due to kernel taint
        Kernel panic - not syncing: Attempted to kill the idle task!
      
      Although there is a alternative way to fix this issue such as disabling
      use of alien cache on !CONFIG_NUMA, but, reverting issued commit is better
      to me in this time.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      edcad250
  21. 06 Aug, 2014 6 commits