    mm: memcontrol: enable kmem accounting for all cgroups in the legacy hierarchy · b313aeee
    Vladimir Davydov authored
    
    
    Workingset code was recently made memcg aware, but shadow node shrinker
    is still global.  As a result, one small cgroup can consume all memory
    available for shadow nodes, possibly hurting other cgroups by reclaiming
    their shadow nodes, even though reclaim distances stored in its shadow
    nodes have no effect.  To avoid this, we need to make shadow node
    shrinker memcg aware.
    
    The actual work is done in patch 6 of the series.  Patches 1 and 2
    prepare memcg/shrinker infrastructure for the change.  Patch 3 is just a
    collateral cleanup.  Patch 4 makes radix_tree_node accounted, which is
    necessary for making shadow node shrinker memcg aware.  Patch 5 reduces
    shadow node overhead when the workload mostly uses anonymous pages.
    
    This patch:
    
    Currently, in the legacy hierarchy kmem accounting is off for all
    cgroups by default and must be enabled explicitly by writing something
    to memory.kmem.limit_in_bytes.  Since we don't support reclaim on
    hitting kmem limit, nor do we have any plans to implement it, the value
    written is likely to be -1, just to enable kmem accounting and limit
    kernel memory consumption by memory.limit_in_bytes along with user
    memory.
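
    For illustration only, the explicit opt-in described above might look
    like the sketch below.  The mount point and the helper name are
    assumptions made for the example; they are not part of this patch.

```python
from pathlib import Path

# Assumed mount point of the legacy (v1) memory controller; adjust as needed.
LEGACY_MEMCG_ROOT = Path("/sys/fs/cgroup/memory")


def enable_kmem_accounting(cgroup_dir: Path) -> None:
    """Write -1 to memory.kmem.limit_in_bytes inside cgroup_dir.

    Hypothetical helper: per the text above, the write itself is what turns
    kmem accounting on; -1 is the value the text expects users to pick.
    """
    (cgroup_dir / "memory.kmem.limit_in_bytes").write_text("-1\n")
```

    A caller would pass the directory of an existing cgroup, e.g.
    enable_kmem_accounting(LEGACY_MEMCG_ROOT / "mygroup"), where "mygroup"
    is a placeholder name.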
    
    This user API was introduced when the implementation of kmem accounting
    lacked slab shrinker support and hence was useless in practice.  Things
    have changed since then - slab shrinkers were made memcg aware, the
    accounting overhead seems to be negligible, and a failure to charge a
    kmem allocation should not have critical consequences, because we only
    account those kernel objects that should be safe to fail.  That's why
    kmem accounting is enabled by default for all cgroups in the default
    hierarchy, which will eventually replace the legacy one.
    
    The ability to enable kmem accounting for some cgroups while keeping it
    disabled for others is getting difficult to maintain.  E.g., to make
    shadow node shrinker memcg aware (see mm/workingset.c), we need to know
    the relationship between the number of shadow nodes allocated for a
    cgroup and the size of its lru list.  If kmem accounting is enabled for
    all cgroups there is no problem, but what should we do if kmem
    accounting is enabled only for half of the cgroups?  We've no other choice
    but to use global lru stats while scanning root cgroup's shadow nodes, but
    that would be wrong if kmem accounting was enabled for all cgroups
    (which is the case if the unified hierarchy is used), in which case we
    should use lru stats of the root cgroup's lruvec.
    
    That being said, let's enable kmem accounting for all memory cgroups by
    default.  If one finds it unstable or too costly, it can always be
    disabled system-wide by passing cgroup.memory=nokmem to the kernel at
    boot time.
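
    As a rough sketch, one way to check from userspace whether a command
    line carries this option is shown below.  The function name is
    hypothetical, and treating the value as a comma-separated option list
    is an assumption about how the parameter is parsed.

```python
def kmem_disabled_on_cmdline(cmdline: str) -> bool:
    """Return True if cgroup.memory=nokmem appears in the given command line."""
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if key == "cgroup.memory" and sep:
            # Assumption: the value may list several comma-separated options.
            if "nokmem" in value.split(","):
                return True
    return False
```

    On a running system the string would typically come from /proc/cmdline,
    e.g. kmem_disabled_on_cmdline(open("/proc/cmdline").read()).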
    
    Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>