Skip to content
  • Glauber Costa's avatar
    cfq: fix starvation of asynchronous writes · 9362516c
    Glauber Costa authored
    
    
    commit 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec upstream.
    
    While debugging timeouts happening in my application workload (ScyllaDB), I have
    observed calls to open() taking a long time, ranging everywhere from 2 seconds -
    the first ones that are enough to time out my application - to more than 30
    seconds.
    
    The problem seems to happen because XFS may block on pending metadata updates
    under certain circumnstances, and that's confirmed with the following backtrace
    taken by the offcputime tool (iovisor/bcc):
    
        ffffffffb90c57b1 finish_task_switch
        ffffffffb97dffb5 schedule
        ffffffffb97e310c schedule_timeout
        ffffffffb97e1f12 __down
        ffffffffb90ea821 down
        ffffffffc046a9dc xfs_buf_lock
        ffffffffc046abfb _xfs_buf_find
        ffffffffc046ae4a xfs_buf_get_map
        ffffffffc046babd xfs_buf_read_map
        ffffffffc0499931 xfs_trans_read_buf_map
        ffffffffc044a561 xfs_da_read_buf
        ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
        ffffffffc0452b90 xfs_dir2_leaf_lookup_int
        ffffffffc0452e0f xfs_dir2_leaf_lookup
        ffffffffc044d9d3 xfs_dir_lookup
        ffffffffc047d1d9 xfs_lookup
        ffffffffc0479e53 xfs_vn_lookup
        ffffffffb925347a path_openat
        ffffffffb9254a71 do_filp_open
        ffffffffb9242a94 do_sys_open
        ffffffffb9242b9e sys_open
        ffffffffb97e42b2 entry_SYSCALL_64_fastpath
        00007fb0698162ed [unknown]
    
    Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very
    high "Dispatch wait" times, on the dozens of seconds range and consistent with
    the open() times I have saw in that run.
    
    Still from the blktrace output, we can after searching a bit, identify the
    request that wasn't dispatched:
    
      8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
      8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
      8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
      8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
      <==== 'I' means Inserted (into the IO scheduler) ===================================>
      8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
      <==== Only 15s later the CFQ scheduler dispatches the request ======================>
    
    As we can see above, in this particular example CFQ took 15 seconds to dispatch
    this request. Going back to the full trace, we can see that the xfsaild queue
    had plenty of opportunity to run, and it was selected as the active queue many
    times. It would just always be preempted by something else (example):
    
      8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
      8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
      8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
      8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
      8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
      8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
      8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
    
    where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
    IO dispatchers.
    
    The requests preempting the xfsaild queue are synchronous requests. That's a
    characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
    While it can be argued that preempting ASYNC requests in favor of SYNC is part
    of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
    goal.
    
    Moreover, unless I am misunderstanding something, that breaks the expectation
    set by the "fifo_expire_async" tunable, which in my system is set to the
    default.
    
    Looking at the code, it seems to me that the issue is that after we make
    an async queue active, there is no guarantee that it will execute any request.
    
    When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC
    requests in flight. An incoming request from another queue can also preempt it
    in such situation before we have the chance to execute anything (as seen in the
    trace above).
    
    This patch sets the must_dispatch flag if we notice that we have requests
    that are already fifo_expired. This flag is always cleared after
    cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
    the queue for subsequent requests (unless they are themselves expired)
    
    Care is taken during preempt to still allow rt requests to preempt us
    regardless.
    
    Testing my workload with this patch applied produces much better results.
    From the application side I see no timeouts, and the open() latency histogram
    generated by systemtap looks much better, with the worst outlier at 131ms:
    
    Latency histogram of xfs_buf_lock acquisition (microseconds):
     value |-------------------------------------------------- count
         0 |                                                     11
         1 |@@@@                                                161
         2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
         4 |@                                                    54
         8 |                                                     36
        16 |                                                      7
        32 |                                                      0
        64 |                                                      0
           ~
      1024 |                                                      0
      2048 |                                                      0
      4096 |                                                      1
      8192 |                                                      1
     16384 |                                                      2
     32768 |                                                      0
     65536 |                                                      0
    131072 |                                                      1
    262144 |                                                      0
    524288 |                                                      0
    
    Signed-off-by: default avatarGlauber Costa <glauber@scylladb.com>
    CC: Jens Axboe <axboe@kernel.dk>
    CC: linux-block@vger.kernel.org
    CC: linux-kernel@vger.kernel.org
    Signed-off-by: default avatarGlauber Costa <glauber@scylladb.com>
    Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    9362516c