1. 04 Apr, 2016 2 commits
    • mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Kirill A. Shutemov authored
      Mostly direct substitution with occasional adjustment or removing
      outdated comments.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with chunks bigger than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely that it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  It is a constant source of confusion whether a
      PAGE_CACHE_* or a PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Switching globally to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
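
The first two substitutions are safe only because the shift amount is zero. A minimal userspace sketch of that reasoning, assuming 4 KiB pages (PAGE_SHIFT of 12 is an assumption for illustration) and assuming PAGE_CACHE_SHIFT equals PAGE_SHIFT, as the message above says the code already relied on. The two helper names are hypothetical and do not come from the kernel:

```c
#include <stdint.h>

/* Assumed for illustration: 4 KiB pages. The real value is per-architecture. */
#define PAGE_SHIFT        12
#define PAGE_SIZE         (1UL << PAGE_SHIFT)

/* Modeled here as plain aliases, matching the PAGE_CACHE_SIZE == PAGE_SIZE assumption. */
#define PAGE_CACHE_SHIFT  PAGE_SHIFT
#define PAGE_CACHE_SIZE   PAGE_SIZE

/* Old style (hypothetical helper): convert a page-cache index to a page offset. */
static inline uint64_t old_index_to_pgoff(uint64_t index)
{
    return index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); /* shift by 0 */
}

/* New style after the substitution: the shift disappears entirely. */
static inline uint64_t new_index_to_pgoff(uint64_t index)
{
    return index;
}
```

Both helpers return the same value for every index, which is what lets a mechanical rewrite drop the shift without changing behaviour.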
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header
      files.  I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 29 Mar, 2016 1 commit
  3. 20 Mar, 2016 1 commit
  4. 15 Mar, 2016 2 commits
  5. 14 Mar, 2016 4 commits
  6. 03 Mar, 2016 3 commits
  7. 27 Feb, 2016 1 commit
  8. 22 Feb, 2016 1 commit
    • dm: fix excessive dm-mq context switching · 6acfe68b
      Mike Snitzer authored
      Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
      than if an underlying null_blk device were used directly.  One of the
      reasons for this drop in performance is that blk_insert_clone_request()
      was calling blk_mq_insert_request() with @async=true.  This forced the
      use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
      which ushered in ping-ponging between process context (fio in this case)
      and kblockd's kworker to submit the cloned request.  The ftrace
      function_graph tracer showed:
      
        kworker-2013  =>   fio-12190
        fio-12190    =>  kworker-2013
        ...
        kworker-2013  =>   fio-12190
        fio-12190    =>  kworker-2013
        ...
      
      Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
      _not_ use kblockd to submit the cloned requests isn't enough to
      eliminate the observed context switches.
      
      In addition to this dm-mq specific blk-core fix, there are 2 DM core
      fixes to dm-mq that (when paired with the blk-core fix) completely
      eliminate the observed context switching:
      
      1)  don't blk_mq_run_hw_queues in blk-mq request completion
      
          Motivated by desire to reduce overhead of dm-mq, punting to kblockd
          just increases context switches.
      
          In my testing against a really fast null_blk device there was no benefit
          to running blk_mq_run_hw_queues() on completion (and no other blk-mq
          driver does this).  So hopefully this change doesn't induce the need for
          yet another revert like commit 621739b0 !
      
      2)  use blk_mq_complete_request() in dm_complete_request()
      
          blk_complete_request() doesn't offer the traditional q->mq_ops vs
          .request_fn branching pattern that other historic block interfaces
          do (e.g. blk_get_request).  Using blk_mq_complete_request() for
          blk-mq requests is important for performance.  It should be noted
          that, like blk_complete_request(), blk_mq_complete_request() doesn't
          natively handle partial completions -- but the request-based
          DM-multipath target does provide the required partial completion
          support by dm.c:end_clone_bio() triggering requeueing of the request
          via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.
      
      dm-mq fix #2 is _much_ more important than #1 for eliminating the
      context switches.
      Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
      After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472
      
      With these changes multithreaded async read IOPs improved from ~950K
      to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
      IOPs of the underlying null_blk device for the same workload is ~1950K.
      
      Fixes: 7fb4898e ("block: add blk-mq support to blk_insert_cloned_request()")
      Fixes: bfebd1cd ("dm: add full blk-mq support to request-based DM")
      Cc: stable@vger.kernel.org # 4.1+
      Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
  9. 19 Feb, 2016 1 commit
    • block: Add blk_set_runtime_active() · d07ab6d1
      Mika Westerberg authored
      If a block device is left runtime suspended during system suspend, the
      driver's resume hook typically sets the runtime PM status of the device
      back to "active" after it is resumed. However, this is not enough, as
      the queue's runtime PM status is still "suspended". As long as it is in
      this state, blk_pm_peek_request() returns NULL and thus prevents new
      requests from being processed.
      
      Add new function blk_set_runtime_active() that can be used to force the
      queue status back to "active" as needed.
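
The stuck-queue behaviour can be modelled in a few lines of userspace C. This is only a sketch: the struct, the enum, and both function names are hypothetical stand-ins, and the real kernel function works on a request queue under the queue lock, which is not modelled here.

```c
#include <stddef.h>

/* Simplified stand-ins for the runtime PM states the message refers to. */
enum rpm_status_model { MODEL_RPM_ACTIVE, MODEL_RPM_SUSPENDED };

/* Hypothetical, stripped-down request queue: only what this sketch needs. */
struct queue_model {
    enum rpm_status_model rpm_status;
    int *next_request;   /* NULL when the queue is empty */
};

/* Models the gate described above: a suspended queue hands out no requests. */
static int *peek_request_model(struct queue_model *q)
{
    if (q->rpm_status == MODEL_RPM_SUSPENDED)
        return NULL;
    return q->next_request;
}

/* Models the purpose of blk_set_runtime_active(): force the status to active. */
static void set_runtime_active_model(struct queue_model *q)
{
    q->rpm_status = MODEL_RPM_ACTIVE;
}
```

In this model a pending request stays invisible until the status is forced back to active, which mirrors the situation a driver's resume hook would need to correct.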
      Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  10. 17 Feb, 2016 1 commit
  11. 14 Feb, 2016 1 commit
    • blk-mq: mark request queue as mq asap · 66841672
      Ming Lei authored
      Currently q->mq_ops is used widely to decide whether the queue
      is mq or not, so we should set the 'flag' asap so that both
      block core and drivers can get the correct mq info.
      
      For example, commit 868f2f0b ("blk-mq: dynamic h/w context count")
      moves the hctx's initialization before setting q->mq_ops in
      blk_mq_init_allocated_queue(), which causes blk_alloc_flush_queue()
      to think the queue is non-mq and not allocate the command size
      for the per-hctx flush rq.
      
      This patch should fix the problem reported by Sasha.
      
      Cc: Keith Busch <keith.busch@intel.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Fixes: 868f2f0b ("blk-mq: dynamic h/w context count")
      Signed-off-by: Jens Axboe <axboe@fb.com>
  12. 12 Feb, 2016 1 commit
    • bio: return EINTR if copying to user space got interrupted · 2d99b55d
      Hannes Reinecke authored
      Commit 35dc2483 introduced a check for current->mm to see if we have a
      user space context, and only copies data if we do. Now, if an IO gets
      interrupted by a signal, data isn't copied into user space any more (as
      we don't have a user space context), but user space isn't notified
      about it.
      
      This patch modifies the behaviour to return -EINTR from bio_uncopy_user()
      to notify userland that a signal has interrupted the syscall, otherwise
      it could lead to a situation where the caller may get a buffer with
      no data returned.
      
      This can be reproduced by issuing SG_IO ioctl()s in one thread while
      constantly sending signals to it.
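
The behaviour change can be pictured with a small userspace model. This is a sketch under stated assumptions, not the kernel code: `has_user_mm` stands in for the current->mm check, the function name is hypothetical, and memcpy replaces the real copy-to-user path.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/*
 * Userspace model of the decision described above. has_user_mm stands in
 * for the kernel's current->mm check; every name here is hypothetical.
 */
static int uncopy_to_user_model(bool has_user_mm, char *user_buf,
                                const char *bounce_buf, size_t len)
{
    if (!has_user_mm)
        return -EINTR;  /* nothing was copied, so tell the caller */

    memcpy(user_buf, bounce_buf, len);
    return 0;
}
```

Before the patch, the equivalent of the first branch skipped the copy but still reported success, so the caller could not tell an untouched buffer from real data.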
      
      Fixes: 35dc2483 [SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Cc: stable@vger.kernel.org # v.3.11+
      Signed-off-by: Jens Axboe <axboe@fb.com>
  13. 11 Feb, 2016 3 commits
  14. 09 Feb, 2016 3 commits
    • blk-mq: dynamic h/w context count · 868f2f0b
      Keith Busch authored
      The hardware's provided queue count may change at runtime with resource
      provisioning. This patch allows a block driver to alter the number of
      h/w queues available when its resource count changes.
      
      The main part is a new blk-mq API to request a new number of h/w queues
      for a given live tag set. The new API freezes all queues using that set,
      then adjusts the allocated count prior to remapping these to CPUs.
      
      The bulk of the rest just shifts where h/w contexts and all their
      artifacts are allocated and freed.
      
      The maximum number of h/w contexts is capped to the number of possible
      cpus since there is no use for more than that. As such, all
      pre-allocated memory for pointers needs to account for the max possible
      rather than the initial number of queues.
      
      A side effect of this is that blk-mq will proceed successfully as
      long as it can allocate at least one h/w context. Previously it would
      fail request queue initialization if fewer than the requested number
      were allocated.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Jon Derrick <jonathan.derrick@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: fix module reference leak on put_disk() call for cgroups throttle · 39a169b6
      Roman Pen authored
      get_disk() and get_gendisk() calls have a non-obvious side effect: they
      increase the reference count on the disk owner module.
      
      The following is the correct sequence for getting a disk reference and
      putting it:
      
          disk = get_gendisk(...);
      
          /* use disk */
      
          owner = disk->fops->owner;
          put_disk(disk);
          module_put(owner);
      
      fs/block_dev.c is aware of this required module_put() call, but e.g.
      blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put
      the module reference.  To see the leak in action, the cgroups throttle
      config can be used.  In the following script I'm removing the throttle for
      /dev/ram0 (actually this is a NOP, because throttle was never set for this
      device):
      
          # lsmod | grep brd
          brd                     5175  0
          # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
              /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
          done
          # lsmod | grep brd
          brd                     5175  100
      
      Now the brd module has 100 references.
      
      The issue is fixed by calling module_put() right after put_disk().
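
The leak and the fix can be reproduced in a userspace model of the two reference counts. Everything below is a hypothetical stand-in (the `_model` types and functions are not kernel APIs); it only mirrors the get/put sequence quoted in the message.

```c
/* Hypothetical stand-ins for struct module and struct gendisk. */
struct module_model { int refcount; };
struct disk_model   { int refcount; struct module_model *owner; };

/* Like get_gendisk(): takes a disk reference AND a module reference. */
static struct disk_model *get_disk_model(struct disk_model *disk)
{
    disk->refcount++;
    disk->owner->refcount++;
    return disk;
}

/* Like put_disk(): drops only the disk reference. */
static void put_disk_model(struct disk_model *disk)
{
    disk->refcount--;
}

static void module_put_model(struct module_model *mod)
{
    mod->refcount--;
}

/* Buggy pattern: the module reference is never dropped. */
static void finish_leaky(struct disk_model *disk)
{
    put_disk_model(disk);
}

/* Fixed pattern: save the owner, put the disk, then put the module. */
static void finish_fixed(struct disk_model *disk)
{
    struct module_model *owner = disk->owner;

    put_disk_model(disk);
    module_put_model(owner);
}
```

Calling the leaky variant 100 times leaves the model's module count at 100, matching the lsmod output above, while the fixed variant returns it to zero.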
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Cc: Gi-Oh Kim <gi-oh.kim@profitbricks.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • kernel/fs: fix I/O wait not accounted for RW O_DSYNC · d57d6115
      Stephane Gasparini authored
      When a process is doing random writes with the O_DSYNC flag, the I/O
      wait is not accounted in the kernel (get_cpu_iowait_time_us). This
      prevents the governor or the cpufreq driver from accounting for I/O
      wait and thus from using the right pstate.
      Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>
      Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  15. 04 Feb, 2016 5 commits
    • block/sd: Return -EREMOTEIO when WRITE SAME and DISCARD are disabled · 0fb5b1fb
      Martin K. Petersen authored
      When a storage device rejects a WRITE SAME command we will disable write
      same functionality for the device and return -EREMOTEIO to the block
      layer. -EREMOTEIO will in turn prevent DM from retrying the I/O and/or
      failing the path.
      
      Yiwen Jiang discovered a small race where WRITE SAME requests issued
      simultaneously would cause -EIO to be returned. This happened because
      any requests being prepared after WRITE SAME had been disabled for the
      device caused us to return BLKPREP_KILL. The latter caused the block
      layer to return -EIO upon completion.
      
      To overcome this we introduce BLKPREP_INVALID which indicates that this
      is an invalid request for the device. blk_peek_request() is modified to
      return -EREMOTEIO in that case.
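
The error mapping this introduces can be sketched as a small lookup. This is an illustration only: the `PREP_*` enum is a hypothetical stand-in for the BLKPREP_* codes (its numeric values are not taken from the kernel), and the function name is made up for the example.

```c
#include <errno.h>

#ifndef EREMOTEIO
#define EREMOTEIO 121   /* Linux value; fallback for libcs that lack it */
#endif

/* Illustrative prep results; the numeric values are not the kernel's. */
enum prep_result { PREP_OK, PREP_KILL, PREP_DEFER, PREP_INVALID };

/* Maps a prep result to the errno used to complete the request. */
static int prep_result_to_errno(enum prep_result ret)
{
    switch (ret) {
    case PREP_INVALID:
        return -EREMOTEIO;  /* invalid for this device: DM should not retry */
    case PREP_KILL:
        return -EIO;        /* generic failure */
    default:
        return 0;           /* OK and DEFER are not completion errors */
    }
}
```

The point of the separate code is that a request rejected because the feature is disabled no longer collapses into the generic -EIO case.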
      Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Suggested-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Ewan Milne <emilne@redhat.com>
      Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    • cfq-iosched: Allow parent cgroup to preempt its child · 3984aa55
      Jan Kara authored
      Currently we don't allow the sync workload of one cgroup to preempt the
      sync workload of any other cgroup. This is because we want to achieve
      service separation between cgroups. However, in cases where the
      preempting cgroup is an ancestor of the current cgroup, there is no need
      for separation, and idling introduces unnecessary overhead. This hurts,
      for example, the case when a workload is isolated within a cgroup but
      journalling threads are in the root cgroup. A simple way to demonstrate
      the issue is using:
      
      dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1
      
      on an ext4 filesystem on a plain SATA drive (mounted with barrier=0 to
      make the difference more visible). When all processes are in the root
      cgroup, the reported throughput is 153.132 MB/sec. When the dbench
      process gets its own blkio cgroup, the reported throughput drops to
      26.1006 MB/sec.
      
      Fix the problem by making the check in cfq_should_preempt() more
      benevolent and allowing preemption by an ancestor cgroup. This improves
      the throughput reported by dbench4 to 48.9106 MB/sec.
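
The relaxed check boils down to an ancestry test on the cgroup hierarchy. A minimal sketch, assuming a simple parent-pointer tree; the struct and both function names are hypothetical and do not reflect how cfq actually stores its groups:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cgroup node; only the parent link matters here. */
struct cgroup_model {
    struct cgroup_model *parent;   /* NULL for the root cgroup */
};

/* Returns true if 'ancestor' is a strict ancestor of 'child'. */
static bool is_ancestor(const struct cgroup_model *ancestor,
                        const struct cgroup_model *child)
{
    const struct cgroup_model *cg;

    for (cg = child->parent; cg != NULL; cg = cg->parent)
        if (cg == ancestor)
            return true;
    return false;
}

/* Cross-cgroup preemption is allowed only when it comes from an ancestor. */
static bool may_preempt_across_cgroups(const struct cgroup_model *new_cg,
                                       const struct cgroup_model *cur_cg)
{
    return is_ancestor(new_cg, cur_cg);
}
```

With journalling threads in the root cgroup and dbench in a child, the root qualifies as an ancestor, so its sync requests may preempt; two sibling cgroups still may not preempt each other.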
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: Allow sync noidle workloads to preempt each other · a257ae3e
      Jan Kara authored
      The original idea with preemption of sync noidle queues (introduced in
      commit 718eee05 "cfq-iosched: fairness for sync no-idle queues") was
      that we service all sync noidle queues together, we don't idle on any of
      the queues individually and we idle only if there is no sync noidle
      queue to be served. This intention also matches the original test:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
      	   && new_cfqq->service_tree == cfqq->service_tree)
      		return true;
      
      However since at that time cfqq->service_tree was not set for idling
      queues, this test was unreliable and was replaced in commit e4a22919
      "cfq-iosched: fix no-idle preemption logic" by:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
      	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
      	    new_cfqq->service_tree->count == 1)
      		return true;
      
      That was a reliable test but was actually doing something different -
      now we preempt sync noidle queue only if the new queue is the only one
      busy in the service tree.
      
      These days cfq queue is kept in service tree even if it is idling and
      thus the original check would be safe again. But since we actually check
      that cfq queues are in the same cgroup, of the same priority class and
      workload type (sync noidle), we know that new_cfqq is fine to preempt
      cfqq. So just remove the service tree check.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: Reorder checks in cfq_should_preempt() · 6c80731c
      Jan Kara authored
      Move the check for preemption by rt class up. There is no functional
      change, but it makes reasoning about the conditions simpler since we can
      be sure both cfq queues are from the same ioprio class.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: Don't group_idle if cfqq has big thinktime · e795421e
      Jan Kara authored
      There is no point in idling on a cfq group if the only cfq queue in it
      has too big a thinktime.
      Signed-off-by: Jan Kara <jack@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  16. 01 Feb, 2016 1 commit
  17. 30 Jan, 2016 2 commits
  18. 22 Jan, 2016 2 commits
  19. 12 Jan, 2016 1 commit
    • block: split bios to max possible length · e36f6204
      Keith Busch authored
      This splits a bio in the middle of a vector to form the largest possible
      bio at the h/w's desired alignment, and guarantees the bio being split
      will have some data.
      
      The criterion for splitting is changed from the max sectors to the h/w's
      optimal sector alignment if it is provided. For h/w that advertises its
      block storage's underlying chunk size, it's a big performance win to not
      submit commands that cross chunk boundaries. If sector alignment is not
      provided, this patch uses the max sectors as before.
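
The boundary arithmetic behind such a split can be sketched as follows. This is a hypothetical helper, not the patch's code: it assumes the chunk size is either 0 (not provided) or a power of two, and it additionally clamps the result to max_sectors, which is an assumption made for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many sectors may a bio starting at 'offset'
 * carry? chunk_sectors is assumed to be 0 (not provided) or a power of two.
 */
static uint32_t max_sectors_at(uint64_t offset, uint32_t chunk_sectors,
                               uint32_t max_sectors)
{
    uint32_t to_boundary;

    if (chunk_sectors == 0)
        return max_sectors;     /* no alignment hint: use max sectors */

    /* Sectors left before the next chunk boundary. */
    to_boundary = chunk_sectors - (uint32_t)(offset & (chunk_sectors - 1));
    return to_boundary < max_sectors ? to_boundary : max_sectors;
}
```

For a 256-sector chunk, a bio starting at sector 200 would be limited to the 56 sectors that remain before the boundary, so the command never straddles two chunks.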
      
      This addresses the performance issue that commit d3805611 attempted to
      fix before it was reverted due to a splitting logic error.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: <stable@vger.kernel.org> # 4.4.x-
      Signed-off-by: Jens Axboe <axboe@fb.com>
  20. 09 Jan, 2016 4 commits