    mm: vmscan: remove lumpy reclaim · c53919ad
    Mel Gorman authored
    This series removes lumpy reclaim and some stalling logic that was
    unintentionally being used by memory compaction.  The end result is that
    stalling on dirty pages during page reclaim now depends on
    wait_iff_congested().
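
    As a rough illustration of the distinction, and not code from this
    series: wait_iff_congested() only sleeps when the zone's backing
    devices are actually congested, so reclaim can throttle on dirty pages
    without stalling unconditionally.  The helper name and the dirty-page
    threshold below are invented for the example.

      #include <linux/backing-dev.h>
      #include <linux/jiffies.h>
      #include <linux/mmzone.h>

      /* Illustrative sketch only, not code from this series. */
      static void throttle_reclaim_if_congested(struct zone *zone,
                                                unsigned long nr_dirty,
                                                unsigned long nr_taken)
      {
              /*
               * Back off only when a large share of the pages just
               * isolated were dirty AND the zone's backing devices are
               * congested.  wait_iff_congested() returns immediately
               * otherwise, unlike an unconditional congestion_wait().
               */
              if (nr_taken && nr_dirty >= nr_taken / 2)
                      wait_iff_congested(zone, BLK_RW_ASYNC, HZ / 10);
      }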
    
    Four kernels were compared
    
      3.3.0     vanilla
      3.4.0-rc2 vanilla
      3.4.0-rc2 lumpyremove-v2 is patch one from this series
      3.4.0-rc2 nosync-v2r3 is the full series
    
    Removing lumpy reclaim saves almost 900 bytes of text whereas the full
    series removes 1200 bytes.
    
         text     data      bss       dec     hex  filename
      6740375  1927944  2260992  10929311  a6c49f  vmlinux-3.4.0-rc2-vanilla
      6739479  1927944  2260992  10928415  a6c11f  vmlinux-3.4.0-rc2-lumpyremove-v2
      6739159  1927944  2260992  10928095  a6bfdf  vmlinux-3.4.0-rc2-nosync-v2
    
    There are behaviour changes in the series, so tests were run with
    monitoring of ftrace events.  This disrupts the results and distorts
    the performance figures, but the new behaviour should be clearer.
    
    fs-mark running in a threaded configuration showed little of interest as
    it did not push reclaim aggressively
    
      FS-Mark Multi Threaded
                              3.3.0-vanilla       rc2-vanilla       lumpyremove-v2r3       nosync-v2r3
      Files/s  min           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
      Files/s  mean          3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
      Files/s  stddev        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)
      Files/s  max           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
      Overhead min      508667.00 ( 0.00%)   521350.00 (-2.49%)   544292.00 (-7.00%)   547168.00 (-7.57%)
      Overhead mean     551185.00 ( 0.00%)   652690.73 (-18.42%)   991208.40 (-79.83%)   570130.53 (-3.44%)
      Overhead stddev    18200.69 ( 0.00%)   331958.29 (-1723.88%)  1579579.43 (-8578.68%)     9576.81 (47.38%)
      Overhead max      576775.00 ( 0.00%)  1846634.00 (-220.17%)  6901055.00 (-1096.49%)   585675.00 (-1.54%)
      MMTests Statistics: duration
      Sys Time Running Test (seconds)             309.90    300.95    307.33    298.95
      User+Sys Time Running Test (seconds)        319.32    309.67    315.69    307.51
      Total Elapsed Time (seconds)               1187.85   1193.09   1191.98   1193.73
    
      MMTests Statistics: vmstat
      Page Ins                                       80532       82212       81420       79480
      Page Outs                                  111434984   111456240   111437376   111582628
      Swap Ins                                           0           0           0           0
      Swap Outs                                          0           0           0           0
      Direct pages scanned                           44881       27889       27453       34843
      Kswapd pages scanned                        25841428    25860774    25861233    25843212
      Kswapd pages reclaimed                      25841393    25860741    25861199    25843179
      Direct pages reclaimed                         44881       27889       27453       34843
      Kswapd efficiency                                99%         99%         99%         99%
      Kswapd velocity                            21754.791   21675.460   21696.029   21649.127
      Direct efficiency                               100%        100%        100%        100%
      Direct velocity                               37.783      23.375      23.031      29.188
      Percentage direct scans                           0%          0%          0%          0%
    
    ftrace showed that there was no stalling on writeback or pages submitted
    for IO from reclaim context.
    
    postmark was similar and while it was more interesting, it also did not
    push reclaim heavily.
    
      POSTMARK
                                           3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
      Transactions per second:               16.00 ( 0.00%)    20.00 (25.00%)    18.00 (12.50%)    17.00 ( 6.25%)
      Data megabytes read per second:        18.80 ( 0.00%)    24.27 (29.10%)    22.26 (18.40%)    20.54 ( 9.26%)
      Data megabytes written per second:     35.83 ( 0.00%)    46.25 (29.08%)    42.42 (18.39%)    39.14 ( 9.24%)
      Files created alone per second:        28.00 ( 0.00%)    38.00 (35.71%)    34.00 (21.43%)    30.00 ( 7.14%)
      Files create/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
      Files deleted alone per second:       556.00 ( 0.00%)  1224.00 (120.14%)  3062.00 (450.72%)  6124.00 (1001.44%)
      Files delete/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
    
      MMTests Statistics: duration
      Sys Time Running Test (seconds)             113.34    107.99    109.73    108.72
      User+Sys Time Running Test (seconds)        145.51    139.81    143.32    143.55
      Total Elapsed Time (seconds)               1159.16    899.23    980.17   1062.27
    
      MMTests Statistics: vmstat
      Page Ins                                    13710192    13729032    13727944    13760136
      Page Outs                                   43071140    42987228    42733684    42931624
      Swap Ins                                           0           0           0           0
      Swap Outs                                          0           0           0           0
      Direct pages scanned                               0           0           0           0
      Kswapd pages scanned                         9941613     9937443     9939085     9929154
      Kswapd pages reclaimed                       9940926     9936751     9938397     9928465
      Direct pages reclaimed                             0           0           0           0
      Kswapd efficiency                                99%         99%         99%         99%
      Kswapd velocity                             8576.567   11051.058   10140.164    9347.109
      Direct efficiency                               100%        100%        100%        100%
      Direct velocity                                0.000       0.000       0.000       0.000
    
    It looks here as if the full series regresses performance, but as
    ftrace showed no usage of wait_iff_congested() or sync reclaim, I am
    assuming the difference is disruption due to monitoring.  Other data
    such as memory usage, page IO and swap IO all looked similar.
    
    Running a benchmark with a plain DD showed nothing very interesting.
    The full series stalled in wait_iff_congested() slightly less but stall
    times on vanilla kernels were marginal.
    
    Running a benchmark that hammered on file-backed mappings showed stalls
    due to congestion but not in sync writebacks
    
      MICRO
                                           3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
      MMTests Statistics: duration
      Sys Time Running Test (seconds)             308.13    294.50    298.75    299.53
      User+Sys Time Running Test (seconds)        330.45    316.28    318.93    320.79
      Total Elapsed Time (seconds)               1814.90   1833.88   1821.14   1832.91
    
      MMTests Statistics: vmstat
      Page Ins                                      108712      120708       97224      110344
      Page Outs                                  155514576   156017404   155813676   156193256
      Swap Ins                                           0           0           0           0
      Swap Outs                                          0           0           0           0
      Direct pages scanned                         2599253     1550480     2512822     2414760
      Kswapd pages scanned                        69742364    71150694    68839041    69692533
      Kswapd pages reclaimed                      34824488    34773341    34796602    34799396
      Direct pages reclaimed                         53693       94750       61792       75205
      Kswapd efficiency                                49%         48%         50%         49%
      Kswapd velocity                            38427.662   38797.901   37799.972   38022.889
      Direct efficiency                                 2%          6%          2%          3%
      Direct velocity                             1432.174     845.464    1379.807    1317.446
      Percentage direct scans                           3%          2%          3%          3%
      Page writes by reclaim                             0           0           0           0
      Page writes file                                   0           0           0           0
      Page writes anon                                   0           0           0           0
      Page reclaim immediate                             0           0           0        1218
      Page rescued immediate                             0           0           0           0
      Slabs scanned                                  15360       16384       13312       16384
      Direct inode steals                                0           0           0           0
      Kswapd inode steals                             4340        4327        1630        4323
    
      FTrace Reclaim Statistics: congestion_wait
      Direct number congest     waited                 0          0          0          0
      Direct time   congest     waited               0ms        0ms        0ms        0ms
      Direct full   congest     waited                 0          0          0          0
      Direct number conditional waited               900        870        754        789
      Direct time   conditional waited               0ms        0ms        0ms       20ms
      Direct full   conditional waited                 0          0          0          0
      KSwapd number congest     waited              2106       2308       2116       1915
      KSwapd time   congest     waited          139924ms   157832ms   125652ms   132516ms
      KSwapd full   congest     waited              1346       1530       1202       1278
      KSwapd number conditional waited             12922      16320      10943      14670
      KSwapd time   conditional waited               0ms        0ms        0ms        0ms
      KSwapd full   conditional waited                 0          0          0          0
    
    Reclaim statistics are not radically changed.  The stall times in
    kswapd are massive, but it is clear that they are due to calls to
    congestion_wait(), almost certainly the call in balance_pgdat().
    Otherwise, stalls due to dirty pages are non-existent.
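
    For reference, a minimal sketch of the two throttling primitives being
    distinguished here; the helpers are illustrative and are not the
    balance_pgdat() code itself.

      #include <linux/backing-dev.h>
      #include <linux/jiffies.h>
      #include <linux/mmzone.h>

      static void backoff_unconditional(void)
      {
              /* Sleeps for up to HZ/10 regardless of congestion; the
               * large kswapd stall times above come from a call of this
               * kind. */
              congestion_wait(BLK_RW_ASYNC, HZ / 10);
      }

      static void backoff_if_congested(struct zone *zone)
      {
              /* Returns immediately unless the zone's backing devices are
               * congested, which is why the conditional wait times above
               * are effectively 0ms. */
              wait_iff_congested(zone, BLK_RW_ASYNC, HZ / 10);
      }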
    
    I ran a benchmark that stressed high-order allocation.  This is a very
    artificial load but it was used in the past to evaluate lumpy reclaim
    and compaction.  Generally I look at allocation success rates and
    latency figures.
    
      STRESS-HIGHALLOC
                       3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
      Pass 1          81.00 ( 0.00%)    28.00 (-53.00%)    24.00 (-57.00%)    28.00 (-53.00%)
      Pass 2          82.00 ( 0.00%)    39.00 (-43.00%)    38.00 (-44.00%)    43.00 (-39.00%)
      while Rested    88.00 ( 0.00%)    87.00 (-1.00%)    88.00 ( 0.00%)    88.00 ( 0.00%)
    
      MMTests Statistics: duration
      Sys Time Running Test (seconds)             740.93    681.42    685.14    684.87
      User+Sys Time Running Test (seconds)       2922.65   3269.52   3281.35   3279.44
      Total Elapsed Time (seconds)               1161.73   1152.49   1159.55   1161.44
    
      MMTests Statistics: vmstat
      Page Ins                                     4486020     2807256     2855944     2876244
      Page Outs                                    7261600     7973688     7975320     7986120
      Swap Ins                                       31694           0           0           0
      Swap Outs                                      98179           0           0           0
      Direct pages scanned                           53494       57731       34406      113015
      Kswapd pages scanned                         6271173     1287481     1278174     1219095
      Kswapd pages reclaimed                       2029240     1281025     1260708     1201583
      Direct pages reclaimed                          1468       14564       16649       92456
      Kswapd efficiency                                32%         99%         98%         98%
      Kswapd velocity                             5398.133    1117.130    1102.302    1049.641
      Direct efficiency                                 2%         25%         48%         81%
      Direct velocity                               46.047      50.092      29.672      97.306
      Percentage direct scans                           0%          4%          2%          8%
      Page writes by reclaim                       1616049           0           0           0
      Page writes file                             1517870           0           0           0
      Page writes anon                               98179           0           0           0
      Page reclaim immediate                        103778       27339        9796       17831
      Page rescued immediate                             0           0           0           0
      Slabs scanned                                1096704      986112      980992      998400
      Direct inode steals                              223      215040      216736      247881
      Kswapd inode steals                           175331       61548       68444       63066
      Kswapd skipped wait                            21991           0           1           0
      THP fault alloc                                    1         135         125         134
      THP collapse alloc                               393         311         228         236
      THP splits                                        25          13           7           8
      THP fault fallback                                 0           0           0           0
      THP collapse fail                                  3           5           7           7
      Compaction stalls                                865        1270        1422        1518
      Compaction success                               370         401         353         383
      Compaction failures                              495         869        1069        1135
      Compaction pages moved                        870155     3828868     4036106     4423626
      Compaction move failure                        26429       23865       29742       27514
    
    Success rates are completely hosed for 3.4-rc2, which is almost
    certainly due to commit fe2c2a10 ("vmscan: reclaim at order 0 when
    compaction is enabled").  I expected this would happen for kswapd and
    impair allocation success rates (https://lkml.org/lkml/2012/1/25/166)
    but I did not anticipate this much of a difference: 80% less scanning,
    37% less reclaim by kswapd.
    
    In comparison, reclaim/compaction is not aggressive and gives up easily
    which is the intended behaviour.  hugetlbfs uses __GFP_REPEAT and would
    be much more aggressive about reclaim/compaction than THP allocations
    are.  The stress test above allocates like neither THP nor hugetlbfs
    but is much closer to THP.
    
    Mainline is now impaired in terms of high order allocation under heavy
    load although I do not know to what degree as I did not test with
    __GFP_REPEAT.  Keep this in mind for bugs related to hugepage pool
    resizing, THP allocation and high order atomic allocation failures from
    network devices.
    
    In terms of congestion throttling, I see the following for this test
    
      FTrace Reclaim Statistics: congestion_wait
      Direct number congest     waited                 3          0          0          0
      Direct time   congest     waited               0ms        0ms        0ms        0ms
      Direct full   congest     waited                 0          0          0          0
      Direct number conditional waited               957        512       1081       1075
      Direct time   conditional waited               0ms        0ms        0ms        0ms
      Direct full   conditional waited                 0          0          0          0
      KSwapd number congest     waited                36          4          3          5
      KSwapd time   congest     waited            3148ms      400ms      300ms      500ms
      KSwapd full   congest     waited                30          4          3          5
      KSwapd number conditional waited             88514        197        332        542
      KSwapd time   conditional waited            4980ms        0ms        0ms        0ms
      KSwapd full   conditional waited                49          0          0          0
    
    The "conditional waited" times are the most interesting as this is
    directly impacted by the number of dirty pages encountered during scan.
    As lumpy reclaim is no longer scanning contiguous ranges, it is finding
    fewer dirty pages.  This brings wait times from about 5 seconds to 0.
    kswapd itself is still calling congestion_wait() so it'll still stall but
    it's a lot less.
    
    In terms of the type of IO we were doing, I see this
    
      FTrace Reclaim Statistics: mm_vmscan_writepage
      Direct writes anon  sync                         0          0          0          0
      Direct writes anon  async                        0          0          0          0
      Direct writes file  sync                         0          0          0          0
      Direct writes file  async                        0          0          0          0
      Direct writes mixed sync                         0          0          0          0
      Direct writes mixed async                        0          0          0          0
      KSwapd writes anon  sync                         0          0          0          0
      KSwapd writes anon  async                    91682          0          0          0
      KSwapd writes file  sync                         0          0          0          0
      KSwapd writes file  async                   822629          0          0          0
      KSwapd writes mixed sync                         0          0          0          0
      KSwapd writes mixed async                        0          0          0          0
    
    In 3.3.0-vanilla, kswapd was doing a bunch of async writes of pages but
    reclaim/compaction was never reaching a point where it was doing sync
    IO.  This does not guarantee that reclaim/compaction was not calling
    wait_on_page_writeback() but I would consider it unlikely.  It indicates
    that merging patches 2 and 3 to stop reclaim/compaction calling
    wait_on_page_writeback() should be safe.
    
    This patch:
    
    Lumpy reclaim had a purpose but, in the mind of some, it was to kick
    the system so hard it thrashed.  For others the purpose was to
    complicate vmscan.c.  Over time it was given softer shoes and a nicer
    attitude, but memory compaction needs to step up and replace it, so
    this patch sends lumpy reclaim to the farm.
    
    The tracepoint format changes for isolating LRU pages with this patch
    applied.  Furthermore reclaim/compaction can no longer queue dirty pages
    in pageout() if the underlying BDI is congested.  Lumpy reclaim used
    this logic and reclaim/compaction was using it in error.
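
    For context, a sketch of the shape of the congestion gate involved; it
    is an assumption for illustration and the real vmscan.c helper differs
    in detail.

      #include <linux/backing-dev.h>
      #include <linux/fs.h>
      #include <linux/mm.h>

      /*
       * Sketch only: queue a dirty page for IO from reclaim only when the
       * page's backing device is not already write congested.  Lumpy
       * reclaim relied on a gate of this shape; with lumpy reclaim gone,
       * reclaim/compaction no longer queues dirty pages this way.
       */
      static bool may_queue_for_pageout(struct page *page)
      {
              struct backing_dev_info *bdi =
                      page->mapping->backing_dev_info;

              return !bdi_write_congested(bdi);
      }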
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ying Han <yinghan@google.com>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>