    tracing, vmscan: add trace events for LRU list shrinking · e11da5b4
    Mel Gorman authored
    There have been numerous reports of stalls that pointed at the problem
    being somewhere in the VM.  There are multiple root causes, which means
    that dealing with any one of them in isolation is tricky to justify on
    its own, and each fix would still need integration testing.  This patch
    series puts together two different patch sets which in combination
    should tackle some of the root causes of latency problems being reported.
    
    Patch 1 adds a tracepoint for shrink_inactive_list.  For this series, the
    most important result is being able to calculate the scanning/reclaim
    ratio as a measure of the amount of work being done by page reclaim.
    
    Patch 2 accounts for time spent in congestion_wait.
    
    Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
    this series.  It has been noted that lumpy reclaim is far too aggressive
    and trashes the system somewhat.  As SLUB uses high-order allocations, a
    large cost incurred by lumpy reclaim will be noticeable.  It was also
    reported during transparent hugepage support testing that lumpy reclaim
    was trashing the system and these patches should mitigate that problem
    without disabling lumpy reclaim.
    
    Patch 7 adds wait_iff_congested() and replaces some callers of
    congestion_wait().  wait_iff_congested() only sleeps if there is a BDI
    that is currently congested.  Patch 8 notes that any BDI being congested
    is not necessarily a problem because there could be multiple BDIs of
    varying speeds and numerous zones.  It attempts to track when a zone
    being reclaimed contains many pages backed by a congested BDI and if so,
    reclaimers wait on the congestion queue.
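
The difference between patches 7 and 8 can be sketched as a small decision
model.  This is not kernel code: the class, field and function names below are
hypothetical, and combining the two conditions with a logical AND is one
reading of the description above rather than a statement of the actual
implementation.

```python
from dataclasses import dataclass


@dataclass
class ReclaimState:
    # Hypothetical flags standing in for kernel state.
    any_bdi_congested: bool  # at least one BDI is currently congested
    zone_congested: bool     # zone holds many pages backed by a congested BDI


def sleeps_with_patch7(state: ReclaimState) -> bool:
    """Patch 7 model: sleep only if some BDI is congested."""
    return state.any_bdi_congested


def sleeps_with_patch8(state: ReclaimState) -> bool:
    """Patch 8 model: additionally require the zone itself to be congested."""
    return state.any_bdi_congested and state.zone_congested


# A slow, congested USB stick should not stall reclaim from a zone whose
# pages are mostly backed by a fast, uncongested disk.
state = ReclaimState(any_bdi_congested=True, zone_congested=False)
print(sleeps_with_patch7(state), sleeps_with_patch8(state))
```

In this model the reclaimer that would have slept under patch 7 keeps working
under patch 8, which matches the report's note that a congested BDI somewhere
in the system is not necessarily a problem for the zone being reclaimed.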
    
    I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
    machine had 3G of RAM and the CPUs were
    
    X86:    Intel P4 2-core
    X86-64: AMD Phenom 4-core
    PPC64:  PPC970MP
    
    Each used a single disk and the onboard IO controller.  Dirty ratio was
    left at 20.  I'm just going to report for X86-64 and PPC64 in a vague
    attempt to keep this report short.  Four kernels were tested each based on
    v2.6.36-rc4
    
    traceonly-v2r2:     Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
    lowlumpy-v2r3:      Patches 1-6 to test if lumpy reclaim is better
    waitcongest-v2r3:   Patches 1-7 to only wait on congestion
    waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
    
    nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
    nodirect-v1r5:  Patches 1-10 to disable filesystem writeback for better IO
    
    The tests run were as follows
    
    kernbench
    	compile-based benchmark. Smoke test performance
    
    sysbench
    	OLTP read-only benchmark. Will be re-run in the future as read-write
    
    micro-mapped-file-stream
    	This is a micro-benchmark from Johannes Weiner that accesses a
    	large sparse-file through mmap(). It was configured to run in only
    	single-CPU mode but can be indicative of how well page reclaim
    	identifies suitable pages.
    
    stress-highalloc
    	Tries to allocate huge pages under heavy load.
    
    kernbench, iozone and sysbench did not report any performance regression
    on any machine.  sysbench did pressure the system lightly and there was
    reclaim activity but there was no difference of major interest between
    the kernels.
    
    X86-64 micro-mapped-file-stream
    
                                          traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
    pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
    pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
    pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
    pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
    pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
    pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
    pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
    pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
    pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
    pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
    pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
    pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
    pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
    allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)
    
    These are based on the raw figures taken from /proc/vmstat.  It's a rough
    measure of reclaim activity.  Note that allocstall counts are higher
    because we are entering direct reclaim more often as a result of not
    sleeping in congestion.  In itself, it's not necessarily a bad thing.
    It's easier to get a view of what happened from the vmscan tracepoint
    report.
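
The percentage columns in the vmstat table appear to be computed relative to
the patched kernel's value rather than the baseline.  That is inferred from the
figures, not stated in the report, and the function name below is made up for
illustration.  A minimal sketch that reproduces a few of the entries:

```python
def relative_change(baseline: float, patched: float) -> float:
    """Percentage change expressed relative to the patched value.

    Inferred from the table: (patched - baseline) / patched * 100.
    A zero patched value is reported as 0.00%.
    """
    if patched == 0:
        return 0.0
    return (patched - baseline) / patched * 100.0


# allocstall, traceonly-v2r2 vs waitwriteback-v2r4
print(f"{relative_change(443, 1568):.2f}%")   # 71.75%
# pgalloc_dma, traceonly-v2r2 vs lowlumpy-v2r3
print(f"{relative_change(1639, 667):.2f}%")   # -145.73%
```

This explains why decreases can show magnitudes above 100%: a drop from 1639
to 667 is measured against 667, not against 1639.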
    
    FTrace Reclaim Statistics: vmscan
    
                                    traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
    Direct reclaims                                443        273        513       1568
    Direct reclaim pages scanned                305968     280402     600825     957933
    Direct reclaim pages reclaimed               43503      19005      30327     117191
    Direct reclaim write file async I/O              0          0          0          0
    Direct reclaim write anon async I/O              0          3          4         12
    Direct reclaim write file sync I/O               0          0          0          0
    Direct reclaim write anon sync I/O               0          0          0          0
    Wake kswapd requests                        187649     132338     191695     267701
    Kswapd wakeups                                   3          1          4          1
    Kswapd pages scanned                       4599269    4454162    4296815    3891906
    Kswapd pages reclaimed                     2295947    2428434    2399818    2319706
    Kswapd reclaim write file async I/O              1          0          1          1
    Kswapd reclaim write anon async I/O             59        187         41        222
    Kswapd reclaim write file sync I/O               0          0          0          0
    Kswapd reclaim write anon sync I/O               0          0          0          0
    Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96
    Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19
    
    Total pages scanned                        4905237   4734564   4897640   4849839
    Total pages reclaimed                      2339450   2447439   2430145   2436897
    %age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
    %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
    %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
    Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
    Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%
    
    What is interesting here for nocongest in particular is that while direct
    reclaim scans more pages, the overall number of pages scanned remains the
    same and the ratio of pages scanned to pages reclaimed is more or less the
    same.  In other words, while we are sleeping less, reclaim is not doing
    more work and as direct reclaim and kswapd are awake for less time, it
    would appear to be doing less work.
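
The totals and the scanned/reclaimed percentage in the table can be recomputed
from the direct reclaim and kswapd rows.  The sketch below uses the
traceonly-v2r2 and waitwriteback-v2r4 columns; the function name is
illustrative only.

```python
def reclaim_efficiency(direct_scanned, direct_reclaimed,
                       kswapd_scanned, kswapd_reclaimed):
    """Return (total scanned, total reclaimed, reclaimed as % of scanned)."""
    total_scanned = direct_scanned + kswapd_scanned
    total_reclaimed = direct_reclaimed + kswapd_reclaimed
    pct = total_reclaimed / total_scanned * 100.0
    return total_scanned, total_reclaimed, pct


# Figures copied from the X86-64 micro-mapped-file-stream table.
for name, row in {
    "traceonly-v2r2":     (305968, 43503, 4599269, 2295947),
    "waitwriteback-v2r4": (957933, 117191, 3891906, 2319706),
}.items():
    scanned, reclaimed, pct = reclaim_efficiency(*row)
    print(f"{name:20s} scanned={scanned} reclaimed={reclaimed} {pct:.2f}%")
```

Direct reclaim scans roughly three times as many pages with the full series,
yet the combined total and the percentage stay close to the baseline.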
    
    FTrace Reclaim Statistics: congestion_wait
    Direct number congest     waited                87        196         64          0
    Direct time   congest     waited            4604ms     4732ms     5420ms        0ms
    Direct full   congest     waited                72        145         53          0
    Direct number conditional waited                 0          0        324       1315
    Direct time   conditional waited               0ms        0ms        0ms        0ms
    Direct full   conditional waited                 0          0          0          0
    KSwapd number congest     waited                20         10         15          7
    KSwapd time   congest     waited            1264ms      536ms      884ms      284ms
    KSwapd full   congest     waited                10          4          6          2
    KSwapd number conditional waited                 0          0          0          0
    KSwapd time   conditional waited               0ms        0ms        0ms        0ms
    KSwapd full   conditional waited                 0          0          0          0
    
    The vanilla kernel spent about 4.6 seconds asleep in direct reclaim
    (4604ms in the table above) and no time at all asleep with the full series.
    
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
    Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76
    
    Overall, the tests completed faster. It is interesting to note that backing off further
    when a zone is congested and not just a BDI was more efficient overall.
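
The two "Percentage Time" rows in the vmscan table are consistent with the
formulas below.  These are inferred by matching the published figures, not
taken from the postprocessing script, and the function names are hypothetical.

```python
def pct_time_direct_reclaim(stalled_s: float, user_sys_s: float) -> float:
    """Stall time as a share of stall time plus user/sys CPU time."""
    return stalled_s / (stalled_s + user_sys_s) * 100.0


def pct_time_kswapd_awake(awake_s: float, elapsed_s: float) -> float:
    """kswapd awake time as a share of total elapsed time."""
    return awake_s / elapsed_s * 100.0


# X86-64 micro-mapped-file-stream, traceonly-v2r2 column.
print(f"{pct_time_direct_reclaim(4.34, 10.51):.2f}%")  # 29.23%
print(f"{pct_time_kswapd_awake(11.15, 14.19):.2f}%")   # 78.58%
```

Because the direct reclaim percentage is measured against CPU time rather than
elapsed time, it can move independently of how long the test took overall.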
    
    PPC64 micro-mapped-file-stream
    pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
    pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
    pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
    pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
    pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
    pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
    pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
    pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
    pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
    allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)
    
    Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
    
    FTrace Reclaim Statistics: vmscan
    Direct reclaims                                977       2709       2098       5136
    Direct reclaim pages scanned                629825     963814    1063938    1711935
    Direct reclaim pages reclaimed               75550     242538     150904     387647
    Direct reclaim write file async I/O              0          0          0          2
    Direct reclaim write anon async I/O              0         10          0          4
    Direct reclaim write file sync I/O               0          0          0          0
    Direct reclaim write anon sync I/O               0          0          0          0
    Wake kswapd requests                        392119    1201712     571935     571921
    Kswapd wakeups                                   3          2          3          3
    Kswapd pages scanned                       4601307    4128076    3912317    3377165
    Kswapd pages reclaimed                     2432523    2318797    2312673    2144616
    Kswapd reclaim write file async I/O             20          1          1          1
    Kswapd reclaim write anon async I/O             57        132         11        121
    Kswapd reclaim write file sync I/O               0          0          0          0
    Kswapd reclaim write anon sync I/O               0          0          0          0
    Time stalled direct reclaim (seconds)         6.19       7.30      13.04      10.88
    Time kswapd awake (seconds)                  21.73      26.51      25.55      23.90
    
    Total pages scanned                        5231132   5091890   4976255   5089100
    Total pages reclaimed                      2508073   2561335   2463577   2532263
    %age total pages scanned/reclaimed          47.95%    50.30%    49.51%    49.76%
    %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
    %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
    Percentage Time Spent Direct Reclaim        18.89%    20.65%    32.65%    27.65%
    Percentage Time kswapd Awake                72.39%    80.68%    78.21%    77.40%
    
    Again, a similar trend: the congestion_wait changes mean that direct
    reclaim scans more pages but the overall number of pages scanned, while
    slightly reduced, is very similar.  The ratio of scanning/reclaimed
    remains roughly similar.  The downside is that kswapd and direct reclaim
    were awake longer and for a larger percentage of the overall workload.
    It's possible there were big differences in the amount of time spent
    reclaiming slab pages between the different kernels, which is plausible
    considering that the micro test runs after fsmark and sysbench.
    
    FTrace Reclaim Statistics: congestion_wait
    Direct number congest     waited               845       1312        104          0
    Direct time   congest     waited           19416ms    26560ms     7544ms        0ms
    Direct full   congest     waited               745       1105         72          0
    Direct number conditional waited                 0          0       1322       2935
    Direct time   conditional waited               0ms        0ms       12ms      312ms
    Direct full   conditional waited                 0          0          0          3
    KSwapd number congest     waited                39        102         75         63
    KSwapd time   congest     waited            2484ms     6760ms     5756ms     3716ms
    KSwapd full   congest     waited                20         48         46         25
    KSwapd number conditional waited                 0          0          0          0
    KSwapd time   conditional waited               0ms        0ms        0ms        0ms
    KSwapd full   conditional waited                 0          0          0          0
    
    The vanilla kernel spent 20 seconds asleep in direct reclaim and only
    312ms asleep with the patches.  The time kswapd spent congest waited was
    also reduced by a large factor.
    
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)         26.58     28.05      26.9     28.47
    Total Elapsed Time (seconds)                 30.02     32.86     32.67     30.88
    
    With all patches applied, the completion times are very similar.
    
    X86-64 STRESS-HIGHALLOC
                    traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3  waitwriteback-v2r4
    Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
    Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
    At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)
    
    Success figures across the board are broadly similar.
    
                    traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3  waitwriteback-v2r4
    Direct reclaims                               1045        944        886        887
    Direct reclaim pages scanned                135091     119604     109382     101019
    Direct reclaim pages reclaimed               88599      47535      47863      46671
    Direct reclaim write file async I/O            494        283        465        280
    Direct reclaim write anon async I/O          29357      13710      16656      13462
    Direct reclaim write file sync I/O             154          2          2          3
    Direct reclaim write anon sync I/O           14594        571        509        561
    Wake kswapd requests                          7491        933        872        892
    Kswapd wakeups                                 814        778        731        780
    Kswapd pages scanned                       7290822   15341158   11916436   13703442
    Kswapd pages reclaimed                     3587336    3142496    3094392    3187151
    Kswapd reclaim write file async I/O          91975      32317      28022      29628
    Kswapd reclaim write anon async I/O        1992022     789307     829745     849769
    Kswapd reclaim write file sync I/O               0          0          0          0
    Kswapd reclaim write anon sync I/O               0          0          0          0
    Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07
    Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82
    
    Total pages scanned                        7425913  15460762  12025818  13804461
    Total pages reclaimed                      3675935   3190031   3142255   3233822
    %age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
    %age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
    %age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
    Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
    Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%
    
    Scanned/reclaimed ratios again look good with big improvements in
    efficiency.  The scanned/written ratios also look much improved.  With a
    better scanned/written ratio, there is an expectation that IO would be
    more efficient and indeed, the time spent in direct reclaim is much
    reduced by the full series and kswapd spends a little less time awake.
    
    Overall, indications here are that allocations were happening much faster
    and this can be seen with a graph of the latency figures as the
    allocations were taking place
    http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
    
    FTrace Reclaim Statistics: congestion_wait
    Direct number congest     waited              1333        204        169          4
    Direct time   congest     waited           78896ms     8288ms     7260ms      200ms
    Direct full   congest     waited               756         92         69          2
    Direct number conditional waited                 0          0         26        186
    Direct time   conditional waited               0ms        0ms        0ms     2504ms
    Direct full   conditional waited                 0          0          0         25
    KSwapd number congest     waited                 4        395        227        282
    KSwapd time   congest     waited             384ms    25136ms    10508ms    18380ms
    KSwapd full   congest     waited                 3        232         98        176
    KSwapd number conditional waited                 0          0          0          0
    KSwapd time   conditional waited               0ms        0ms        0ms        0ms
    KSwapd full   conditional waited                 0          0          0          0
    KSwapd full   conditional waited               318          0        312          9
    
    Overall, the time spent sleeping is reduced.  kswapd is still hitting
    congestion_wait() but that is because there are callers remaining where it
    wasn't clear in advance if they should be changed to wait_iff_congested()
    or not.  Overall the sleep times are reduced though - from 79ish seconds to
    about 19.
    
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)       3415.43   3386.65   3388.39    3377.5
    Total Elapsed Time (seconds)               5733.48   3660.33   3689.41   3765.39
    
    With the full series, the time to complete the tests is reduced by over 30%.
    
    PPC64 STRESS-HIGHALLOC
                    traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3  waitwriteback-v2r4
    Pass 1          17.00 ( 0.00%)    34.00 (17.00%)    38.00 (21.00%)    43.00 (26.00%)
    Pass 2          25.00 ( 0.00%)    37.00 (12.00%)    42.00 (17.00%)    46.00 (21.00%)
    At Rest         49.00 ( 0.00%)    43.00 (-6.00%)    45.00 (-4.00%)    51.00 ( 2.00%)
    
    Success rates there are *way* up particularly considering that the 16MB
    huge pages on PPC64 mean that it's always much harder to allocate them.
    
    FTrace Reclaim Statistics: vmscan
                  stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
                    traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3  waitwriteback-v2r4
    Direct reclaims                                499        505        564        509
    Direct reclaim pages scanned                223478      41898      51818      45605
    Direct reclaim pages reclaimed              137730      21148      27161      23455
    Direct reclaim write file async I/O            399        136        162        136
    Direct reclaim write anon async I/O          46977       2865       4686       3998
    Direct reclaim write file sync I/O              29          0          1          3
    Direct reclaim write anon sync I/O           31023        159        237        239
    Wake kswapd requests                           420        351        360        326
    Kswapd wakeups                                 185        294        249        277
    Kswapd pages scanned                      15703488   16392500   17821724   17598737
    Kswapd pages reclaimed                     5808466    2908858    3139386    3145435
    Kswapd reclaim write file async I/O         159938      18400      18717      13473
    Kswapd reclaim write anon async I/O        3467554     228957     322799     234278
    Kswapd reclaim write file sync I/O               0          0          0          0
    Kswapd reclaim write anon sync I/O               0          0          0          0
    Time stalled direct reclaim (seconds)      9665.35    1707.81    2374.32    1871.23
    Time kswapd awake (seconds)                9401.21    1367.86    1951.75    1328.88
    
    Total pages scanned                       15926966  16434398  17873542  17644342
    Total pages reclaimed                      5946196   2930006   3166547   3168890
    %age total pages scanned/reclaimed          37.33%    17.83%    17.72%    17.96%
    %age total pages scanned/written            23.27%     1.52%     1.94%     1.43%
    %age  file pages scanned/written             1.01%     0.11%     0.11%     0.08%
    Percentage Time Spent Direct Reclaim        44.55%    35.10%    41.42%    36.91%
    Percentage Time kswapd Awake                86.71%    43.58%    52.67%    41.14%
    
    While the scanning rates are slightly up, the scanned/reclaimed and
    scanned/written figures are much improved.  The time spent in direct
    reclaim and with kswapd are massively reduced, mostly by the lowlumpy
    patches.
    
    FTrace Reclaim Statistics: congestion_wait
    Direct number congest     waited               725        303        126          3
    Direct time   congest     waited           45524ms     9180ms     5936ms      300ms
    Direct full   congest     waited               487        190         52          3
    Direct number conditional waited                 0          0        200        301
    Direct time   conditional waited               0ms        0ms        0ms     1904ms
    Direct full   conditional waited                 0          0          0         19
    KSwapd number congest     waited                 0          2         23          4
    KSwapd time   congest     waited               0ms      200ms      420ms      404ms
    KSwapd full   congest     waited                 0          2          2          4
    KSwapd number conditional waited                 0          0          0          0
    KSwapd time   conditional waited               0ms        0ms        0ms        0ms
    KSwapd full   conditional waited                 0          0          0          0
    
    Not as dramatic a story here but the time spent asleep is reduced and we
    can still see that wait_iff_congested is going to sleep when necessary.
    
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)      12028.09   3157.17   3357.79   3199.16
    Total Elapsed Time (seconds)              10842.07   3138.72   3705.54   3229.85
    
    The time to complete this test goes way down.  With the full series, we
    are allocating over twice the number of huge pages in 30% of the time and
    there is a corresponding impact on the allocation latency graph available
    at:
    
    http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
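
As a rough check of the "over twice the number of huge pages in 30% of the
time" summary, the arithmetic can be redone from the PPC64 stress-highalloc
tables.  The sketch assumes the claim refers to the Pass 1 success rate; the
variable names are illustrative.

```python
# Figures copied from the PPC64 STRESS-HIGHALLOC tables.
baseline = {"pass1_success": 17.00, "elapsed_s": 10842.07}     # traceonly-v2r2
full_series = {"pass1_success": 43.00, "elapsed_s": 3229.85}   # waitwriteback-v2r4

success_ratio = full_series["pass1_success"] / baseline["pass1_success"]
time_fraction = full_series["elapsed_s"] / baseline["elapsed_s"] * 100.0

print(f"Pass 1 success ratio: {success_ratio:.2f}x")
print(f"Elapsed time relative to baseline: {time_fraction:.1f}%")
```

Pass 2 improves by a smaller factor (46 against 25), so the "over twice"
figure holds for the first pass in particular.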
    
    
    
    This patch:
    
    Add a trace event for shrink_inactive_list() and update the sample
    postprocessing script appropriately.  It can be used to determine how many
    pages were reclaimed and, for non-lumpy reclaim, where exactly the pages
    were reclaimed from.
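
A minimal sketch of the kind of postprocessing this enables is shown below.
It is not the script shipped with the kernel.  The event name, the field
layout of the sample lines, and the rule that tasks named "kswapd*" count as
kswapd while everything else counts as direct reclaim are all assumptions made
for illustration.

```python
import re
from collections import defaultdict

# Assumed ftrace text output; the exact format is hypothetical.
SAMPLE = """\
kswapd0-311   [000]   100.000001: mm_vmscan_lru_shrink_inactive: nid=0 zid=1 nr_scanned=32 nr_reclaimed=30 priority=12
cc1-4242      [001]   100.000420: mm_vmscan_lru_shrink_inactive: nid=0 zid=1 nr_scanned=32 nr_reclaimed=8 priority=10
kswapd0-311   [000]   100.000900: mm_vmscan_lru_shrink_inactive: nid=0 zid=1 nr_scanned=64 nr_reclaimed=64 priority=12
"""

EVENT = re.compile(
    r"^\s*(?P<comm>\S+)-(?P<pid>\d+)\s+\[\d+\].*?"
    r"mm_vmscan_lru_shrink_inactive:.*?"
    r"nr_scanned=(?P<scanned>\d+)\s+nr_reclaimed=(?P<reclaimed>\d+)"
)


def summarise(lines):
    """Sum scanned/reclaimed pages per reclaimer type (kswapd or direct)."""
    totals = defaultdict(lambda: {"scanned": 0, "reclaimed": 0})
    for line in lines:
        match = EVENT.search(line)
        if match is None:
            continue  # not a shrink_inactive_list event
        who = "kswapd" if match.group("comm").startswith("kswapd") else "direct"
        totals[who]["scanned"] += int(match.group("scanned"))
        totals[who]["reclaimed"] += int(match.group("reclaimed"))
    return dict(totals)


for who, counts in summarise(SAMPLE.splitlines()).items():
    ratio = counts["reclaimed"] / counts["scanned"] * 100.0
    print(f"{who}: scanned={counts['scanned']} "
          f"reclaimed={counts['reclaimed']} ({ratio:.2f}%)")
```

Per-reclaimer totals like these are what feed the "Direct reclaim pages
scanned" and "Kswapd pages scanned" style rows in the reports above.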
    
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>