    mm, page_alloc: delete the zonelist_cache · f77cf4e4
    Mel Gorman authored
    
    
    The zonelist cache (zlc) was introduced to skip over zones that were
    recently known to be full.  This avoided expensive operations such as the
    cpuset checks, watermark calculations and zone_reclaim.  The situation
    today is different and the complexity of zlc is harder to justify.
    
    1) The cpuset checks are no-ops unless a cpuset is active and in general
       are a lot cheaper.
    
    2) zone_reclaim is now disabled by default and I suspect that was a large
       source of the cost that zlc wanted to avoid. When it is enabled, it's
       known to be a major source of stalling when nodes fill up and it's
       unwise to hit every other user with the overhead.
    
    3) Watermark checks are expensive to calculate for high-order
       allocation requests. Later patches in this series will reduce the cost
       of the watermark checking.
    
    4) The most important issue is that in the current implementation it
       is possible for a failed THP allocation to mark a zone full for order-0
       allocations and cause a fallback to remote nodes.
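
To make issue 4 concrete, here is a toy userspace model. It is not the kernel code: toy_zone, toy_zlc and toy_alloc are hypothetical names, the "full" state is a single order-agnostic flag per zone, and pages are never actually consumed. It only shows how one failed order-9 (THP-sized) request can push a later order-0 request to the remote node.

```c
#include <stdbool.h>

#define NR_ZONES 2  /* zone 0 = local node, zone 1 = remote node */

struct toy_zone {
    unsigned long free_pages;     /* total free base pages */
    unsigned int  max_free_order; /* largest contiguous free block, as an order */
};

/* Hypothetical stand-in for the zlc "full" state: one flag per zone, no order. */
struct toy_zlc {
    bool full[NR_ZONES];
};

/* A request fits if enough pages are free and a block of that order exists. */
static bool toy_zone_can_alloc(const struct toy_zone *z, unsigned int order)
{
    return z->free_pages >= (1UL << order) && z->max_free_order >= order;
}

/*
 * Walk zones in preference order. With use_zlc set, zones flagged full are
 * skipped without being checked, and any failed check flags the zone full
 * regardless of the order that failed. Returns the chosen zone index or -1.
 */
static int toy_alloc(const struct toy_zone *zones, struct toy_zlc *zlc,
                     unsigned int order, bool use_zlc)
{
    for (int i = 0; i < NR_ZONES; i++) {
        if (use_zlc && zlc->full[i])
            continue;
        if (toy_zone_can_alloc(&zones[i], order))
            return i;
        if (use_zlc)
            zlc->full[i] = true;
    }
    return -1;
}
```

With a fragmented local zone (plenty of free pages, no order-9 block), the order-9 request falls back to zone 1 and flags zone 0 full. A following order-0 request with use_zlc set then also lands on zone 1, while the same request with use_zlc clear stays on zone 0.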
    
    The last issue could be addressed with additional complexity but as the
    benefit of zlc is questionable, it is better to remove it.  If stalls due
    to zone_reclaim are ever reported then an alternative would be to
    introduce deferring logic based on a timeout inside zone_reclaim itself
    and leave the page allocator fast paths alone.
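
A minimal sketch of what such timeout-based deferral could look like, as a userspace model rather than a kernel patch. The names toy_zone_reclaim, defer_until and TOY_RECLAIM_DEFER are assumptions made for illustration, time is a plain counter, and counter wraparound is ignored.

```c
#include <stdbool.h>

#define TOY_RECLAIM_DEFER 100UL /* hypothetical deferral window, in ticks */

struct toy_zone_reclaim {
    unsigned long defer_until; /* reclaim attempts before this time are skipped */
};

static int toy_reclaim_calls; /* counts how often the expensive path ran */

/* Stand-in for an expensive reclaim attempt that makes no progress. */
static bool toy_reclaim_fails(void)
{
    toy_reclaim_calls++;
    return false;
}

/*
 * Try to reclaim unless a recent attempt failed. On failure, later callers
 * are deferred for TOY_RECLAIM_DEFER ticks. Returns true if reclaim made
 * progress.
 */
static bool toy_zone_reclaim(struct toy_zone_reclaim *zr, unsigned long now,
                             bool (*try_reclaim)(void))
{
    if (now < zr->defer_until)
        return false; /* recently failed: skip the expensive work */
    if (try_reclaim())
        return true;
    zr->defer_until = now + TOY_RECLAIM_DEFER;
    return false;
}
```

The point of the design is that the deferral state lives next to the expensive operation, so callers that never reach zone_reclaim pay nothing for it.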
    
    The impact on page-allocator microbenchmarks is negligible as they don't
    hit the paths where the zlc comes into play.  Most page-reclaim related
    workloads showed no noticeable difference as a result of the removal.
    
    The impact was noticeable in a workload called "stutter".  One part uses a
    lot of anonymous memory, a second measures mmap latency and a third copies
    a large file.  In an ideal world the latency application would not notice
    the mmap latency.  On a 2-node machine the results of this patch are:
    
    stutter
                                 4.3.0-rc1             4.3.0-rc1
                                  baseline              nozlc-v4
    Min         mmap     20.9243 (  0.00%)     20.7716 (  0.73%)
    1st-qrtle   mmap     22.0612 (  0.00%)     22.0680 ( -0.03%)
    2nd-qrtle   mmap     22.3291 (  0.00%)     22.3809 ( -0.23%)
    3rd-qrtle   mmap     25.2244 (  0.00%)     25.2396 ( -0.06%)
    Max-90%     mmap     48.0995 (  0.00%)     28.3713 ( 41.02%)
    Max-93%     mmap     52.5557 (  0.00%)     36.0170 ( 31.47%)
    Max-95%     mmap     55.8173 (  0.00%)     47.3163 ( 15.23%)
    Max-99%     mmap     67.3781 (  0.00%)     70.1140 ( -4.06%)
    Max         mmap  24447.6375 (  0.00%)  12915.1356 ( 47.17%)
    Mean        mmap     33.7883 (  0.00%)     27.7944 ( 17.74%)
    Best99%Mean mmap     27.7825 (  0.00%)     25.2767 (  9.02%)
    Best95%Mean mmap     26.3912 (  0.00%)     23.7994 (  9.82%)
    Best90%Mean mmap     24.9886 (  0.00%)     23.2251 (  7.06%)
    Best50%Mean mmap     22.0157 (  0.00%)     22.0261 ( -0.05%)
    Best10%Mean mmap     21.6705 (  0.00%)     21.6083 (  0.29%)
    Best5%Mean  mmap     21.5581 (  0.00%)     21.4611 (  0.45%)
    Best1%Mean  mmap     21.3079 (  0.00%)     21.1631 (  0.68%)
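
The percentages in parentheses are consistent with a relative reduction against the baseline column, where lower latency is better and a positive value is an improvement. The helper below is only an illustration of that arithmetic; pct_gain is a hypothetical name and not part of the tool that produced the table.

```c
/* Relative reduction of `patched` against `baseline`, in percent. */
static double pct_gain(double baseline, double patched)
{
    return (baseline - patched) / baseline * 100.0;
}
```

For the Max-90% row, pct_gain(48.0995, 28.3713) is roughly 41.02, and for the Max-99% row, pct_gain(67.3781, 70.1140) is roughly -4.06, matching the table.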
    
    Note that the maximum stall latency went from 24 seconds to 12 which is
    still bad but an improvement.  The mileage varies considerably: a 2-node
    machine on an earlier test went from 494 seconds to 47 seconds and a
    4-node machine that tested an earlier version of this patch went from a
    worst case stall time of 6 seconds to 67ms.  The nature of the benchmark
    is inherently unpredictable as it is hammering the system and the mileage
    will vary between machines.
    
    There is a secondary impact with potentially more direct reclaim because
    zones are now being considered instead of being skipped by zlc.  In this
    particular test run it did not occur so will not be described.  However,
    in at least one test the following was observed:
    
    1. Direct reclaim rates were higher. This was likely due to direct reclaim
      being entered instead of the zlc disabling a zone and busy looping.
      Busy looping may have the effect of allowing kswapd to make more
      progress and in some cases may be better overall. If this is found then
      the correct action is to put direct reclaimers to sleep on a waitqueue
      and allow kswapd to make forward progress. Busy looping on the zlc is even
      worse than when the allocator used to blindly call congestion_wait().
    
    2. There was higher swap activity as direct reclaim was active.
    
    3. Direct reclaim efficiency was lower. This is related to 1 as more
      scanning activity also encountered more pages that could not be
      immediately reclaimed.
    
    In that case, the direct page scan and reclaim rates are noticeable but
    it is not considered a problem for a few reasons:
    
    1. The test is primarily concerned with latency. The mmap attempts are also
       faulted which means there are THP allocation requests. The ZLC could
       cause zones to be disabled causing the process to busy loop instead
       of reclaiming.  This looks like elevated direct reclaim activity but
       it's the correct action to take based on what processes requested.
    
    2. The test hammers reclaim and compaction heavily. The number of successful
       THP faults is highly variable but affects the reclaim stats. It's not a
       realistic or reasonable measure of page reclaim activity.
    
    3. No other page-reclaim intensive workload that was tested showed a problem.
    
    4. If a workload is identified that benefitted from the busy looping then it
       should be fixed by having direct reclaimers sleep on a wait queue until
       woken by kswapd instead of busy looping. We had this class of problem before
       when calling congestion_wait() with a fixed timeout was a brain damaged
       decision but happened to benefit some workloads.
    
    If a workload is identified that relied on the zlc to busy loop then it
    should be fixed correctly and have a direct reclaimer sleep on a waitqueue
    until woken by kswapd.
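
The sleep-until-woken pattern can be sketched in userspace with a pthread condition variable standing in for a kernel waitqueue. This is an analogy under stated assumptions, not the kernel API: toy_pgdat, toy_kswapd, toy_wait_for_kswapd and toy_run are hypothetical names, and "reclaim" is just a counter being incremented.

```c
#include <pthread.h>
#include <stddef.h>

struct toy_pgdat {
    pthread_mutex_t lock;
    pthread_cond_t  reclaim_wait; /* stand-in for a kernel waitqueue */
    unsigned long   free_pages;
    unsigned long   watermark;
};

/* "kswapd": reclaims until the watermark is met, then wakes all sleepers. */
static void *toy_kswapd(void *arg)
{
    struct toy_pgdat *p = arg;

    pthread_mutex_lock(&p->lock);
    while (p->free_pages < p->watermark)
        p->free_pages += 8; /* pretend each pass reclaims 8 pages */
    pthread_cond_broadcast(&p->reclaim_wait);
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Direct reclaimer: sleeps until woken instead of spinning on a "full" flag. */
static unsigned long toy_wait_for_kswapd(struct toy_pgdat *p)
{
    unsigned long seen;

    pthread_mutex_lock(&p->lock);
    while (p->free_pages < p->watermark)
        pthread_cond_wait(&p->reclaim_wait, &p->lock);
    seen = p->free_pages;
    pthread_mutex_unlock(&p->lock);
    return seen;
}

/* Start one kswapd thread and one waiter; return the free count the waiter saw. */
static unsigned long toy_run(void)
{
    struct toy_pgdat p = { .free_pages = 0, .watermark = 64 };
    pthread_t kswapd;
    unsigned long seen;

    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.reclaim_wait, NULL);
    pthread_create(&kswapd, NULL, toy_kswapd, &p);
    seen = toy_wait_for_kswapd(&p);
    pthread_join(kswapd, NULL);
    pthread_cond_destroy(&p.reclaim_wait);
    pthread_mutex_destroy(&p.lock);
    return seen;
}
```

The waiter rechecks the watermark in a loop around pthread_cond_wait(), so it consumes no CPU while kswapd works and still behaves correctly if kswapd finished before the waiter started.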
    
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Christoph Lameter <cl@linux.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>