    mm: page allocator: initialise ZLC for first zone eligible for zone_reclaim · cd38b115
    Mel Gorman authored
    
    
    There have been a small number of complaints about significant stalls
    while copying large amounts of data on NUMA machines reported on a
    distribution bugzilla.  In these cases, zone_reclaim was enabled by
    default due to large NUMA distances.  In general, the complaints have not
    been about the workload itself unless it was a file server (in which case
    the recommendation was to disable zone_reclaim).
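
    For context, the "enabled by default" behaviour comes from the zonelist
    build code turning zone_reclaim on when nodes are far apart.  A condensed,
    illustrative extract (shape as in 3.0-era mm/page_alloc.c, not a literal
    quote of the file):

    	/*
    	 * Illustrative extract: while building the zonelists, if any other
    	 * node is further away than RECLAIM_DISTANCE, prefer reclaiming
    	 * within a zone before going off-node.
    	 */
    	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
    		int distance = node_distance(local_node, node);

    		if (distance > RECLAIM_DISTANCE)
    			zone_reclaim_mode = 1;
    		/* ... zonelist building continues ... */
    	}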
    
    The stalls are mostly due to significant amounts of time spent scanning
    the preferred zone for pages to free.  After a failure, it might fall back
    to another node (as zonelists are often node-ordered rather than
    zone-ordered) but stall quickly again when the next allocation attempt
    occurs.  In bad cases, each page allocated results in a full scan of the
    preferred zone.
    
    Patch 1 checks the preferred zone for recent allocation failure
            which is particularly important if zone_reclaim has failed
            recently.  This avoids rescanning the zone in the near future and
            instead falling back to another node.  This may hurt node locality
            in some cases but a failure to zone_reclaim is more expensive than
            a remote access.
    
    Patch 2 clears the zlc information after direct reclaim.
            Otherwise, zone_reclaim can mark zones full, direct reclaim can
            reclaim enough pages but the zone is still not considered for
            allocation.
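
    A condensed sketch of the second change as applied in the follow-up patch
    (the helper name zlc_clear_zones_full() and the surrounding function shape
    follow 3.0-era mm/page_alloc.c conventions; this is illustrative, not the
    literal diff):

    	/*
    	 * Sketch: once direct reclaim has made progress, drop the ZLC
    	 * "zone full" hints so that zones zone_reclaim previously gave up
    	 * on are reconsidered by the retried zonelist walk.
    	 */
    	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask,
    						nodemask);
    	if (unlikely(!*did_some_progress))
    		return NULL;

    	/* After successful direct reclaim, reconsider all zones */
    	if (NUMA_BUILD)
    		zlc_clear_zones_full(zonelist);

    	return get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
    					high_zoneidx, alloc_flags,
    					preferred_zone, migratetype);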
    
    This was tested on a 24-thread 2-node x86_64 machine.  The tests were
    focused on large amounts of IO.  All tests were bound to the CPUs on
    node-0 to avoid disturbances due to processes being scheduled on different
    nodes.  The kernels tested are
    
    3.0-rc6-vanilla		Vanilla 3.0-rc6
    zlcfirst		Patch 1 applied
    zlcreconsider		Patches 1+2 applied
    
    FS-Mark
    ./fs_mark  -d  /tmp/fsmark-10813  -D  100  -N  5000  -n  208  -L  35  -t  24  -S0  -s  524288
                    fsmark-3.0-rc6          3.0-rc6          3.0-rc6
                           vanilla         zlcfirst    zlcreconsider
    Files/s  min          54.90 ( 0.00%)       49.80 (-10.24%)       49.10 (-11.81%)
    Files/s  mean        100.11 ( 0.00%)      135.17 (25.94%)      146.93 (31.87%)
    Files/s  stddev       57.51 ( 0.00%)      138.97 (58.62%)      158.69 (63.76%)
    Files/s  max         361.10 ( 0.00%)      834.40 (56.72%)      802.40 (55.00%)
    Overhead min       76704.00 ( 0.00%)    76501.00 ( 0.27%)    77784.00 (-1.39%)
    Overhead mean    1485356.51 ( 0.00%)  1035797.83 (43.40%)  1594680.26 (-6.86%)
    Overhead stddev  1848122.53 ( 0.00%)   881489.88 (109.66%)  1772354.90 ( 4.27%)
    Overhead max     7989060.00 ( 0.00%)  3369118.00 (137.13%) 10135324.00 (-21.18%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)        501.49    493.91    499.93
    Total Elapsed Time (seconds)               2451.57   2257.48   2215.92
    
    MMTests Statistics: vmstat
    Page Ins                                       46268       63840       66008
    Page Outs                                   90821596    90671128    88043732
    Swap Ins                                           0           0           0
    Swap Outs                                          0           0           0
    Direct pages scanned                        13091697     8966863     8971790
    Kswapd pages scanned                               0     1830011     1831116
    Kswapd pages reclaimed                             0     1829068     1829930
    Direct pages reclaimed                      13037777     8956828     8648314
    Kswapd efficiency                               100%         99%         99%
    Kswapd velocity                                0.000     810.643     826.346
    Direct efficiency                                99%         99%         96%
    Direct velocity                             5340.128    3972.068    4048.788
    Percentage direct scans                         100%         83%         83%
    Page writes by reclaim                             0           3           0
    Slabs scanned                                 796672      720640      720256
    Direct inode steals                          7422667     7160012     7088638
    Kswapd inode steals                                0     1736840     2021238
    
    Test completes far faster with a large increase in the number of files
    created per second.  Standard deviation is high as a small number of
    iterations were much higher than the mean.  The number of pages scanned by
    zone_reclaim is reduced and kswapd is used for more work.
    
    LARGE DD
                               3.0-rc6       3.0-rc6        3.0-rc6
                               vanilla      zlcfirst  zlcreconsider
    download tar           59 ( 0.00%)   59 ( 0.00%)   55 ( 7.27%)
    dd source files       527 ( 0.00%)  296 (78.04%)  320 (64.69%)
    delete source          36 ( 0.00%)   19 (89.47%)   20 (80.00%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)        125.03    118.98    122.01
    Total Elapsed Time (seconds)                624.56    375.02    398.06
    
    MMTests Statistics: vmstat
    Page Ins                                     3594216      439368      407032
    Page Outs                                   23380832    23380488    23377444
    Swap Ins                                           0           0           0
    Swap Outs                                          0         436         287
    Direct pages scanned                        17482342    69315973    82864918
    Kswapd pages scanned                               0      519123      575425
    Kswapd pages reclaimed                             0      466501      522487
    Direct pages reclaimed                       5858054     2732949     2712547
    Kswapd efficiency                               100%         89%         90%
    Kswapd velocity                                0.000    1384.254    1445.574
    Direct efficiency                                33%          3%          3%
    Direct velocity                            27991.453  184832.737  208171.929
    Percentage direct scans                         100%         99%         99%
    Page writes by reclaim                             0        5082       13917
    Slabs scanned                                  17280       29952       35328
    Direct inode steals                           115257     1431122      332201
    Kswapd inode steals                                0           0      979532
    
    This test downloads a large tarfile and copies it with dd a number of
    times - similar to the most recent bug report I've dealt with.  Time to
    completion is reduced.  The number of pages scanned directly is still
    disturbingly high with a low efficiency but this is likely due to the
    number of dirty pages encountered.  The figures could probably be improved
    with more work around how kswapd is used and how dirty pages are handled
    but that is separate work and this result is significant on its own.
    
    Streaming Mapped Writer
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)        124.47    111.67    112.64
    Total Elapsed Time (seconds)               2138.14   1816.30   1867.56
    
    MMTests Statistics: vmstat
    Page Ins                                       90760       89124       89516
    Page Outs                                  121028340   120199524   120736696
    Swap Ins                                           0          86          55
    Swap Outs                                          0           0           0
    Direct pages scanned                       114989363    96461439    96330619
    Kswapd pages scanned                        56430948    56965763    57075875
    Kswapd pages reclaimed                      27743219    27752044    27766606
    Direct pages reclaimed                         49777       46884       36655
    Kswapd efficiency                                49%         48%         48%
    Kswapd velocity                            26392.541   31363.631   30561.736
    Direct efficiency                                 0%          0%          0%
    Direct velocity                            53780.091   53108.759   51581.004
    Percentage direct scans                          67%         62%         62%
    Page writes by reclaim                           385         122        1513
    Slabs scanned                                  43008       39040       42112
    Direct inode steals                                0          10           8
    Kswapd inode steals                              733         534         477
    
    This test just creates a large file mapping and writes to it linearly.
    Time to completion is again reduced.
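
    The benchmark itself is not shown in this log; the following minimal
    userspace sketch only illustrates the access pattern being exercised
    (map a large file and dirty it linearly, page by page).  The file path
    and mapping size are arbitrary and it is not the MMTests workload.

    	/* Minimal streaming mapped writer sketch (illustrative only) */
    	#include <fcntl.h>
    	#include <stdio.h>
    	#include <stdlib.h>
    	#include <string.h>
    	#include <sys/mman.h>
    	#include <unistd.h>

    	int main(void)
    	{
    		const size_t size = 1UL << 30;	/* 1GB; adjust to pressure memory */
    		int fd = open("/tmp/mapped-writer.dat",
    				O_RDWR | O_CREAT | O_TRUNC, 0644);

    		if (fd < 0 || ftruncate(fd, size) < 0) {
    			perror("open/ftruncate");
    			return EXIT_FAILURE;
    		}

    		char *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
    					MAP_SHARED, fd, 0);
    		if (map == MAP_FAILED) {
    			perror("mmap");
    			return EXIT_FAILURE;
    		}

    		long page = sysconf(_SC_PAGESIZE);
    		for (size_t off = 0; off < size; off += page)
    			memset(map + off, 0xaa, page);	/* dirty each page in order */

    		munmap(map, size);
    		close(fd);
    		return EXIT_SUCCESS;
    	}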
    
    The gains are mostly down to two things.  In many cases, there is less
    scanning as zone_reclaim simply gives up faster due to recent failures.
    The second reason is that memory is used more efficiently.  Instead of
    scanning the preferred zone every time, the allocator falls back to
    another zone and uses it instead, improving overall memory utilisation.
    
    This patch: initialise ZLC for first zone eligible for zone_reclaim.
    
    The zonelist cache (ZLC) is used among other things to record if
    zone_reclaim() failed for a particular zone recently.  The intention is to
    avoid the high cost of scanning extremely long zonelists or scanning within the
    zone uselessly.
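
    For reference, the ZLC is roughly the following per-zonelist structure
    (shape as in 3.0-era include/linux/mmzone.h, reproduced here for
    illustration): a per-zone "full" bitmap plus a timestamp used to
    periodically forget the hints.

    	struct zonelist_cache {
    		unsigned short z_to_n[MAX_ZONES_PER_ZONELIST];	    /* zone->nid */
    		DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST);  /* zone full? */
    		unsigned long last_full_zap;	/* jiffies of last bitmap clear */
    	};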
    
    Currently the zonelist cache is set up only after the first zone has been
    considered and zone_reclaim() has been called.  The objective was to avoid
    a costly setup, but zone_reclaim is itself quite expensive.  If it is
    failing regularly, for example because the first eligible zone has mostly
    mapped pages, the cost in scanning and allocation stalls is far higher
    than the ZLC initialisation step.
    
    This patch initialises the ZLC before the first eligible zone calls
    zone_reclaim().  Once initialised, the ZLC is checked to see whether the
    zone failed zone_reclaim recently; if it has, the zone is skipped.  As the
    first zone is now being checked, additional care has to be taken about
    zones marked full.  A zone can be marked "full" merely because zone_reclaim
    judges it not to have enough unmapped pages to be worth scanning, which is
    excessive as direct reclaim or kswapd may succeed where zone_reclaim fails.
    Zones are therefore marked "full" after a zone_reclaim failure only when
    zone_reclaim actually scanned and failed to reclaim enough pages.
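
    A condensed sketch of the resulting zonelist walk (the control flow of
    get_page_from_freelist() around 3.0, compressed, with the setup of locals
    such as mark, classzone_idx and allowednodes elided; this is illustrative,
    not the literal diff):

    	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
    						nodemask) {
    		/* Skip zones the ZLC already knows are not worth trying */
    		if (NUMA_BUILD && zlc_active &&
    		    !zlc_zone_worth_trying(zonelist, z, allowednodes))
    			continue;

    		if (zone_watermark_ok(zone, order, mark, classzone_idx,
    					alloc_flags))
    			goto try_this_zone;

    		/* NEW: set up the ZLC before the first zone_reclaim() call */
    		if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
    			allowednodes = zlc_setup(zonelist, alloc_flags);
    			zlc_active = 1;
    			did_zlc_setup = 1;
    		}

    		/* NEW: recheck so a recent zone_reclaim failure skips the scan */
    		if (NUMA_BUILD && zlc_active &&
    		    !zlc_zone_worth_trying(zonelist, z, allowednodes))
    			continue;

    		switch (zone_reclaim(zone, gfp_mask, order)) {
    		case ZONE_RECLAIM_NOSCAN:	/* did not scan */
    		case ZONE_RECLAIM_FULL:		/* nothing zone_reclaim can usefully do */
    			continue;		/* do not mark the zone full */
    		default:
    			/* Scanned: mark full only if still below the watermark */
    			if (!zone_watermark_ok(zone, order, mark,
    						classzone_idx, alloc_flags))
    				goto this_zone_full;
    		}
    try_this_zone:
    		page = buffered_rmqueue(preferred_zone, zone, order,
    					gfp_mask, migratetype);
    		if (page)
    			break;
    this_zone_full:
    		if (NUMA_BUILD)
    			zlc_mark_zone_full(zonelist, z);
    	}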
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>