1. 12 Apr, 2016 2 commits
  2. 20 Mar, 2016 1 commit
  3. 09 Feb, 2016 1 commit
    • Keith Busch's avatar
      blk-mq: dynamic h/w context count · 868f2f0b
      Keith Busch authored
      The hardware's provided queue count may change at runtime with resource
      provisioning. This patch allows a block driver to alter the number of
      h/w queues available when its resource count changes.
      The main part is a new blk-mq API to request a new number of h/w queues
      for a given live tag set. The new API freezes all queues using that set,
      then adjusts the allocated count prior to remapping these to CPUs.
      The bulk of the rest just shifts where h/w contexts and all their
      artifacts are allocated and freed.
      The number of max h/w contexts is capped to the number of possible cpus
      since there is no use for more than that. As such, all pre-allocated
      memory for pointers need to account for the max possible rather than
      the initial number of queues.
      A side effect of this is that the blk-mq will proceed successfully as
      long as it can allocate at least one h/w context. Previously it would
      fail request queue initialization if less than the requested number
      was allocated.
      Signed-off-by: default avatarKeith Busch <keith.busch@intel.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarJon Derrick <jonathan.derrick@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  4. 01 Dec, 2015 1 commit
  5. 07 Nov, 2015 1 commit
    • Jens Axboe's avatar
      block: add block polling support · 05229bee
      Jens Axboe authored
      Add basic support for polling for specific IO to complete. This uses
      the cookie that blk-mq passes back, which enables the block layer
      to pass this cookie to the driver to spin for a specific request.
      This will be combined with request latency tracking, so we can make
      qualified decisions about when to poll and when not to. For now, for
      benchmark purposes, we add a sysfs file that controls whether polling
      is enabled or not.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Acked-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarKeith Busch <keith.busch@intel.com>
  6. 21 Oct, 2015 1 commit
    • Dan Williams's avatar
      block: generic request_queue reference counting · 3ef28e83
      Dan Williams authored
      Allow pmem, and other synchronous/bio-based block drivers, to fallback
      on a per-cpu reference count managed by the core for tracking queue
      live/dead state.
      The existing per-cpu reference count for the blk_mq case is promoted to
      be used in all block i/o scenarios.  This involves initializing it by
      default, waiting for it to drop to zero at exit, and holding a live
      reference over the invocation of q->make_request_fn() in
      generic_make_request().  The blk_mq code continues to take its own
      reference per blk_mq request and retains the ability to freeze the
      queue, but the check that the queue is frozen is moved to
      This fixes crash signatures like the following:
       BUG: unable to handle kernel paging request at ffff880140000000
       Call Trace:
        [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
        [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
        [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
        [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
        [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
        [<ffffffff8141f966>] submit_bio+0x76/0x170
        [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
        [<ffffffff81286e62>] submit_bh+0x12/0x20
        [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
        [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
        [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
        [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
        [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
        [<ffffffff81303494>] ext4_put_super+0x64/0x360
        [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  7. 01 Oct, 2015 2 commits
  8. 29 Sep, 2015 1 commit
    • Akinobu Mita's avatar
      blk-mq: fix sysfs registration/unregistration race · 4593fdbe
      Akinobu Mita authored
      There is a race between cpu hotplug handling and adding/deleting
      gendisk for blk-mq, where both are trying to register and unregister
      the same sysfs entries.
          --> blk_mq_init_queue
              --> blk_mq_init_allocated_queue
                  --> add to 'all_q_list' (*)
          --> add_disk
              --> blk_register_queue
                  --> blk_mq_register_disk (++)
          --> del_gendisk
              --> blk_unregister_queue
                  --> blk_mq_unregister_disk (--)
          --> blk_cleanup_queue
              --> blk_mq_free_queue
                  --> del from 'all_q_list' (*)
          --> blk_mq_sysfs_unregister (-)
          --> blk_mq_sysfs_register (+)
      While the request queue is added to 'all_q_list' (*),
      blk_mq_queue_reinit() can be called for the queue anytime by CPU
      hotplug callback.  But blk_mq_sysfs_unregister (-) and
      blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
      before blk_mq_register_disk (++) and after blk_mq_unregister_disk (--)
      is finished.  Because '/sys/block/*/mq/' is not exists.
      There has already been BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
      be used to track these sysfs stuff, but it is only fixing this issue
      In order to fix it completely, we just need per-queue flag instead of
      per-hctx flag with appropriate locking.  So this introduces
      q->mq_sysfs_init_done which is properly protected with all_q_mutex.
      Also, we need to ensure that blk_mq_map_swqueue() is called with
      all_q_mutex is held.  Since hctx->nr_ctx is reset temporarily and
      updated in blk_mq_map_swqueue(), so we should avoid
      blk_mq_register_hctx() seeing the temporary hctx->nr_ctx value
      in CPU hotplug handling or adding/deleting gendisk .
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: default avatarMing Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  9. 01 Jun, 2015 1 commit
    • Keith Busch's avatar
      blk-mq: Shared tag enhancements · f26cdc85
      Keith Busch authored
      Storage controllers may expose multiple block devices that share hardware
      resources managed by blk-mq. This patch enhances the shared tags so a
      low-level driver can access the shared resources not tied to the unshared
      h/w contexts. This way the LLD can dynamically add and delete disks and
      request queues without having to track all the request_queue hctx's to
      iterate outstanding tags.
      Signed-off-by: default avatarKeith Busch <keith.busch@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  10. 17 Apr, 2015 1 commit
  11. 09 Apr, 2015 1 commit
  12. 13 Mar, 2015 2 commits
    • Mike Snitzer's avatar
      blk-mq: export blk_mq_run_hw_queues · b94ec296
      Mike Snitzer authored
      Rename blk_mq_run_queues to blk_mq_run_hw_queues, add async argument,
      and export it.
      DM's suspend support must be able to run the queue without starting
      stopped hw queues.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Mike Snitzer's avatar
      blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk · b62c21b7
      Mike Snitzer authored
      Add a variant of blk_mq_init_queue that allows a previously allocated
      queue to be initialized.  blk_mq_init_allocated_queue models
      blk_init_allocated_queue -- which was also created for DM's use.
      DM's approach to device creation requires a placeholder request_queue be
      allocated for use with alloc_dev() but the decision about what type of
      request_queue will be ultimately created is deferred until all component
      devices referenced in the DM table are processed to determine the table
      type (request-based, blk-mq request-based, or bio-based).
      Also, because of DM's late finalization of the request_queue type
      the call to blk_mq_register_disk() doesn't happen during alloc_dev().
      Must export blk_mq_register_disk() so that DM can backfill the 'mq' dir
      once the blk-mq queue is fully allocated.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Reviewed-by: default avatarMing Lei <ming.lei@canonical.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  13. 10 Feb, 2015 1 commit
  14. 23 Jan, 2015 1 commit
    • Shaohua Li's avatar
      blk-mq: add tag allocation policy · 24391c0d
      Shaohua Li authored
      This is the blk-mq part to support tag allocation policy. The default
      allocation policy isn't changed (though it's not a strict FIFO). The new
      policy is round-robin for libata. But it's a try-best implementation. If
      multiple tasks are competing, the tags returned will be mixed (which is
      unavoidable even with !mq, as requests from different tasks can be
      mixed in queue)
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  15. 08 Jan, 2015 3 commits
  16. 07 Jan, 2015 1 commit
  17. 02 Jan, 2015 1 commit
  18. 20 Dec, 2014 1 commit
  19. 17 Nov, 2014 1 commit
    • Jens Axboe's avatar
      blk-mq: add blk_mq_free_hctx_request() · 7c7f2f2b
      Jens Axboe authored
      It's silly to use blk_mq_free_request() which in turn maps the
      request to the hardware queue, for places where we already know
      what the hardware queue is. This saves us an extra mapping of a
      hardware queue on request completion, if the caller knows this
      information already.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  20. 12 Nov, 2014 1 commit
  21. 29 Oct, 2014 2 commits
    • Jens Axboe's avatar
      blk-mq: add BLK_MQ_F_DEFER_ISSUE support flag · e167dfb5
      Jens Axboe authored
      Drivers can now tell blk-mq if they take advantage of the deferred
      issue through 'last' or not. If they do, don't do queue-direct
      for sync IO. This is a preparation patch for the nvme conversion.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Jens Axboe's avatar
      blk-mq: add a 'list' parameter to ->queue_rq() · 74c45052
      Jens Axboe authored
      Since we have the notion of a 'last' request in a chain, we can use
      this to have the hardware optimize the issuing of requests. Add
      a list_head parameter to queue_rq that the driver can use to
      temporarily store hw commands for issue when 'last' is true. If we
      are doing a chain of requests, pass in a NULL list for the first
      request to force issue of that immediately, then batch the remainder
      for deferred issue until the last request has been sent.
      Instead of adding yet another argument to the hot ->queue_rq path,
      encapsulate the passed arguments in a blk_mq_queue_data structure.
      This is passed as a constant, and has been tested as faster than
      passing 4 (or even 3) args through ->queue_rq. Update drivers for
      the new ->queue_rq() prototype. There are no functional changes
      in this patch for drivers - if they don't use the passed in list,
      then they will just queue requests individually like before.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  22. 25 Sep, 2014 1 commit
    • Ming Lei's avatar
      blk-mq: support per-distpatch_queue flush machinery · f70ced09
      Ming Lei authored
      This patch supports to run one single flush machinery for
      each blk-mq dispatch queue, so that:
      - current init_request and exit_request callbacks can
      cover flush request too, then the buggy copying way of
      initializing flush request's pdu can be fixed
      - flushing performance gets improved in case of multi hw-queue
      In fio sync write test over virtio-blk(4 hw queues, ioengine=sync,
      iodepth=64, numjobs=4, bs=4K), it is observed that througput gets
      increased a lot over my test environment:
      	- throughput: +70% in case of virtio-blk over null_blk
      	- throughput: +30% in case of virtio-blk over SSD image
      The multi virtqueue feature isn't merged to QEMU yet, and patches for
      the feature can be found in below tree:
      	git://kernel.ubuntu.com/ming/qemu.git  	v2.1.0-mq.4
      And simply passing 'num_queues=4 vectors=5' should be enough to
      enable multi queue(quad queue) feature for QEMU virtio-blk.
      Suggested-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  23. 24 Sep, 2014 1 commit
    • Tejun Heo's avatar
      blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode · 17497acb
      Tejun Heo authored
      blk-mq uses percpu_ref for its usage counter which tracks the number
      of in-flight commands and used to synchronously drain the queue on
      freeze.  percpu_ref shutdown takes measureable wallclock time as it
      involves a sched RCU grace period.  This means that draining a blk-mq
      takes measureable wallclock time.  One would think that this shouldn't
      matter as queue shutdown should be a rare event which takes place
      asynchronously w.r.t. userland.
      Unfortunately, SCSI probing involves synchronously setting up and then
      tearing down a lot of request_queues back-to-back for non-existent
      LUNs.  This means that SCSI probing may take above ten seconds when
      scsi-mq is used.
        [    0.949892] scsi host0: Virtio SCSI HBA
        [    1.007864] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
        [    1.021299] scsi 0:0:1:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
        [    1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz
        [   16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
        [   16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
        [   16.194099] osd: LOADED open-osd 0.2.1
        [   16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
        [   16.208478] sd 0:0:0:0: [sda] Write Protect is off
        [   16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
        [   16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
        [   16.223264] sd 0:0:1:0: [sdb] Write Protect is off
        [   16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      This is also the reason why request_queues start in bypass mode which
      is ended on blk_register_queue() as shutting down a fully functional
      queue also involves a RCU grace period and the queues for non-existent
      SCSI devices never reach registration.
      blk-mq basically needs to do the same thing - start the mq in a
      degraded mode which is faster to shut down and then make it fully
      functional only after the queue reaches registration.  percpu_ref
      recently grew facilities to force atomic operation until explicitly
      switched to percpu mode, which can be used for this purpose.  This
      patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
      switch it to percpu mode only once blk_register_queue() is reached.
      Note that this issue was previously worked around by 0a30288d
      ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during
      probe") for v3.17.  The temp fix was reverted in preparation of adding
      persistent atomic mode to percpu_ref by 9eca8046 ("Revert "blk-mq,
      percpu_ref: implement a kludge for SCSI blk-mq stall during probe"").
      This patch and the prerequisite percpu_ref changes will be merged
      during v3.18 devel cycle.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarChristoph Hellwig <hch@infradead.org>
      Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
      Fixes: add703fd ("blk-mq: use percpu_ref for mq usage count")
      Reviewed-by: default avatarKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
  24. 22 Sep, 2014 5 commits
  25. 15 Aug, 2014 1 commit
  26. 17 Jun, 2014 1 commit
  27. 06 Jun, 2014 1 commit
    • Jens Axboe's avatar
      blk-mq: bump max tag depth to 10K tags · a4391c64
      Jens Axboe authored
      For some scsi-mq cases, the tag map can be huge. So increase the
      max number of tags we support.
      Additionally, don't fail with EINVAL if a user requests too many
      tags. Warn that the tag depth has been adjusted down, and store
      the new value inside the tag_set passed in.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  28. 04 Jun, 2014 1 commit
  29. 30 May, 2014 2 commits