    memcg: prevent OOM with too many dirty pages · e62e384e
    Michal Hocko authored
    
    
    The current implementation of dirty pages throttling is not memcg aware
    which makes it easy to have memcg LRUs full of dirty pages.  Without
    throttling, these LRUs can be scanned faster than the rate of writeback,
    leading to memcg OOM conditions when the hard limit is small.
    
    This patch fixes the problem by throttling the allocating process
    (possibly a writer) during hard limit reclaim by waiting on
    PageReclaim pages.  We wait only for PageReclaim pages because
    those are the pages that have made one full round over the LRU, which
    means that writeback is much slower than scanning.
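
    The decision described above can be sketched as a small user-space
    model.  This is only an illustration, not the kernel code: struct
    model_page, enum scan_action and classify_page() are hypothetical names,
    and the real change lives in the reclaim path and operates on struct
    page flags.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for a page; not the kernel's struct page. */
struct model_page {
    bool writeback;   /* page is currently under writeback */
    bool reclaim;     /* models the PageReclaim flag */
};

/* What the scanner does with one page (names are illustrative only). */
enum scan_action {
    SCAN_PROCEED,        /* not under writeback: reclaim as usual */
    SCAN_MARK_AND_SKIP,  /* mark the page and keep it on the LRU */
    SCAN_WAIT,           /* throttle: wait for writeback to finish */
};

/*
 * Classify one page seen during reclaim.
 *
 * A page under writeback is marked with the reclaim flag the first time
 * it is seen.  If memcg (hard limit) reclaim meets it again while it is
 * still under writeback, the page has gone a full round over the LRU, so
 * the allocating process is made to wait.  Global reclaim never waits in
 * this model.
 */
static enum scan_action classify_page(struct model_page *page,
                                      bool memcg_reclaim)
{
    if (!page->writeback)
        return SCAN_PROCEED;

    if (memcg_reclaim && page->reclaim)
        return SCAN_WAIT;

    page->reclaim = true;
    return SCAN_MARK_AND_SKIP;
}
```

    In this model a page under writeback yields SCAN_MARK_AND_SKIP on the
    first memcg scan and SCAN_WAIT on the next one if writeback has still
    not completed, which is the point at which the allocator is throttled.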
    
    The solution is far from ideal (the long-term solution is memcg-aware
    dirty throttling) but it is meant as a band-aid until we have a real
    fix.  We are seeing this happen during nightly backups, which are placed
    into containers to prevent eviction of the real working set.
    
    The change affects only memcg reclaim, and only when we encounter
    PageReclaim pages, which is a signal that reclaim is not keeping up
    with the writers, so somebody should be throttled.  This is
    potentially unfair because somebody else from the group could be
    throttled on behalf of the writer, but as writers need to allocate as
    well, and they allocate at a higher rate, the probability that only
    innocent processes would be penalized is not that high.
    
    I have tested this change with a simple dd copying /dev/zero to tmpfs or
    ext3 running under a small memcg (1G copy under 5M, 60M, 300M and 2G
    containers).  Without the patch, dd got killed by the OOM killer under
    the smallest limits (5M on tmpfs; 5M and 60M on ext3).  With the patch I
    could run the dd with the same size under a 5M controller without any OOM.
    The issue is more visible with slower output devices.
    
    * With the patch
    ================
    * tmpfs size=2G
    ---------------
    $ vim cgroup_cache_oom_test.sh
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s
    
    * ext3
    ------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s
    
    * Without the patch
    ===================
    * tmpfs size=2G
    ---------------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    ./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s
    
    * ext3
    ------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    ./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    ./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s
    
    [akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
    [hughd@google.com: fix deadlock with loop driver]
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Ying Han <yinghan@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Reviewed-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
    Signed-off-by: Michal Hocko <mhocko@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>