1. 01 Jul, 2016 1 commit
    • Omar Sandoval's avatar
      block: fix use-after-free in sys_ioprio_get() · 8ba86821
      Omar Sandoval authored
      get_task_ioprio() accesses the task->io_context without holding the task
      lock and thus can race with exit_io_context(), leading to a
      use-after-free. The reproducer below hits this within a few seconds on
      my 4-core QEMU VM:
      
      #define _GNU_SOURCE
      #include <assert.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <sys/wait.h>
      
      int main(int argc, char **argv)
      {
      	pid_t pid, child;
      	long nproc, i;
      
      	/* ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); */
      	syscall(SYS_ioprio_set, 1, 0, 0x6000);
      
      	nproc = sysconf(_SC_NPROCESSORS_ONLN);
      
      	for (i = 0; i < nproc; i++) {
      		pid = fork();
      		assert(pid != -1);
      		if (pid == 0) {
      			for (;;) {
      				pid = fork();
      				assert(pid != -1);
      				if (pid == 0) {
      					_exit(0);
      				} else {
      					child = wait(NULL);
      					assert(child == pid);
      				}
      			}
      		}
      
      		pid = fork();
      		assert(pid != -1);
      		if (pid == 0) {
      			for (;;) {
      				/* ioprio_get(IOPRIO_WHO_PGRP, 0); */
      				syscall(SYS_ioprio_get, 2, 0);
      			}
      		}
      	}
      
      	for (;;) {
      		/* ioprio_get(IOPRIO_WHO_PGRP, 0); */
      		syscall(SYS_ioprio_get, 2, 0);
      	}
      
      	return 0;
      }
      
      This gets us KASAN dumps like this:
      
      [   35.526914] ==================================================================
      [   35.530009] BUG: KASAN: out-of-bounds in get_task_ioprio+0x7b/0x90 at addr ffff880066f34e6c
      [   35.530009] Read of size 2 by task ioprio-gpf/363
      [   35.530009] =============================================================================
      [   35.530009] BUG blkdev_ioc (Not tainted): kasan: bad access detected
      [   35.530009] -----------------------------------------------------------------------------
      
      [   35.530009] Disabling lock debugging due to kernel taint
      [   35.530009] INFO: Allocated in create_task_io_context+0x2b/0x370 age=0 cpu=0 pid=360
      [   35.530009] 	___slab_alloc+0x55d/0x5a0
      [   35.530009] 	__slab_alloc.isra.20+0x2b/0x40
      [   35.530009] 	kmem_cache_alloc_node+0x84/0x200
      [   35.530009] 	create_task_io_context+0x2b/0x370
      [   35.530009] 	get_task_io_context+0x92/0xb0
      [   35.530009] 	copy_process.part.8+0x5029/0x5660
      [   35.530009] 	_do_fork+0x155/0x7e0
      [   35.530009] 	SyS_clone+0x19/0x20
      [   35.530009] 	do_syscall_64+0x195/0x3a0
      [   35.530009] 	return_from_SYSCALL_64+0x0/0x6a
      [   35.530009] INFO: Freed in put_io_context+0xe7/0x120 age=0 cpu=0 pid=1060
      [   35.530009] 	__slab_free+0x27b/0x3d0
      [   35.530009] 	kmem_cache_free+0x1fb/0x220
      [   35.530009] 	put_io_context+0xe7/0x120
      [   35.530009] 	put_io_context_active+0x238/0x380
      [   35.530009] 	exit_io_context+0x66/0x80
      [   35.530009] 	do_exit+0x158e/0x2b90
      [   35.530009] 	do_group_exit+0xe5/0x2b0
      [   35.530009] 	SyS_exit_group+0x1d/0x20
      [   35.530009] 	entry_SYSCALL_64_fastpath+0x1a/0xa4
      [   35.530009] INFO: Slab 0xffffea00019bcd00 objects=20 used=4 fp=0xffff880066f34ff0 flags=0x1fffe0000004080
      [   35.530009] INFO: Object 0xffff880066f34e58 @offset=3672 fp=0x0000000000000001
      [   35.530009] ==================================================================
      
      Fix it by grabbing the task lock while we poke at the io_context.
      
      Cc: stable@vger.kernel.org
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      8ba86821
  2. 07 Jun, 2016 1 commit
  3. 02 Jun, 2016 1 commit
  4. 26 May, 2016 1 commit
  5. 20 May, 2016 2 commits
  6. 17 May, 2016 1 commit
    • Toshi Kani's avatar
      block: Update blkdev_dax_capable() for consistency · a8078b1f
      Toshi Kani authored
      blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
      to remain as a separate interface for checking dax capability of
      a raw block device.
      
      Rename and relocate blkdev_dax_capable() to keep them maintained
      consistently, and call bdev_direct_access() for the dax capability
      check.
      
      There is no change in the behavior.
      
      Link: https://lkml.org/lkml/2016/5/9/950Signed-off-by: default avatarToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: default avatarVishal Verma <vishal.l.verma@intel.com>
      a8078b1f
  7. 16 May, 2016 1 commit
  8. 10 May, 2016 1 commit
  9. 05 May, 2016 2 commits
    • Mike Snitzer's avatar
      block: make bio_inc_remaining() interface accessible again · 0ef5a50c
      Mike Snitzer authored
      Commit 326e1dbb ("block: remove management of bi_remaining when
      restoring original bi_end_io") made bio_inc_remaining() private to bio.c
      because the only use-case that made sense was confined to the
      bio_chain() interface.
      
      Since that time DM thinp went on to use bio_chain() in its relatively
      complex implementation of async discard support.  That implementation,
      even when converted over to use the new async __blkdev_issue_discard()
      interface, depends on deferred completion of the original discard bio --
      which is most appropriately implemented using bio_inc_remaining().
      
      DM thinp foolishly duplicated bio_inc_remaining(), local to dm-thin.c as
      __bio_inc_remaining(), so re-exporting bio_inc_remaining() allows us to
      put an end to that foolishness.
      
      All said, bio_inc_remaining() should really only be used in conjunction
      with bio_chain().  It isn't intended for generic bio reference counting.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Acked-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      0ef5a50c
    • Mike Snitzer's avatar
      block: reinstate early return of -EOPNOTSUPP from blkdev_issue_discard · bbd848e0
      Mike Snitzer authored
      Commit 38f25255 ("block: add __blkdev_issue_discard") incorrectly
      disallowed the early return of -EOPNOTSUPP if the device doesn't support
      discard (or secure discard).  This early return of -EOPNOTSUPP has
      always been part of blkdev_issue_discard() interface so there isn't a
      good reason to break that behaviour -- especially when it can be easily
      reinstated.
      
      The nuance of allowing early return of -EOPNOTSUPP vs disallowing late
      return of -EOPNOTSUPP is: if the overall device never advertised support
      for discards and one is issued to the device it is beneficial to inform
      the caller that discards are not supported via -EOPNOTSUPP.  But if a
      device advertises discard support it means that at least a subset of the
      device does have discard support -- but it could be that discards issued
      to some regions of a stacked device will not be supported.  In that case
      the late return of -EOPNOTSUPP must be disallowed.
      
      Fixes: 38f25255 ("block: add __blkdev_issue_discard")
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      bbd848e0
  10. 03 May, 2016 1 commit
  11. 02 May, 2016 2 commits
  12. 18 Apr, 2016 1 commit
  13. 13 Apr, 2016 1 commit
  14. 12 Apr, 2016 5 commits
  15. 08 Apr, 2016 1 commit
  16. 04 Apr, 2016 2 commits
    • Kirill A. Shutemov's avatar
      mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Kirill A. Shutemov authored
      Mostly direct substitution with occasional adjustment or removing
      outdated comments.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ea1754a0
    • Kirill A. Shutemov's avatar
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  17. 29 Mar, 2016 1 commit
  18. 20 Mar, 2016 1 commit
  19. 15 Mar, 2016 2 commits
  20. 14 Mar, 2016 4 commits
  21. 03 Mar, 2016 3 commits
  22. 27 Feb, 2016 1 commit
  23. 22 Feb, 2016 1 commit
    • Mike Snitzer's avatar
      dm: fix excessive dm-mq context switching · 6acfe68b
      Mike Snitzer authored
      Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
      than if an underlying null_blk device were used directly.  One of the
      reasons for this drop in performance is that blk_insert_clone_request()
      was calling blk_mq_insert_request() with @async=true.  This forced the
      use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
      which ushered in ping-ponging between process context (fio in this case)
      and kblockd's kworker to submit the cloned request.  The ftrace
      function_graph tracer showed:
      
        kworker-2013  =>   fio-12190
        fio-12190    =>  kworker-2013
        ...
        kworker-2013  =>   fio-12190
        fio-12190    =>  kworker-2013
        ...
      
      Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
      _not_ use kblockd to submit the cloned requests isn't enough to
      eliminate the observed context switches.
      
      In addition to this dm-mq specific blk-core fix, there are 2 DM core
      fixes to dm-mq that (when paired with the blk-core fix) completely
      eliminate the observed context switching:
      
      1)  don't blk_mq_run_hw_queues in blk-mq request completion
      
          Motivated by desire to reduce overhead of dm-mq, punting to kblockd
          just increases context switches.
      
          In my testing against a really fast null_blk device there was no benefit
          to running blk_mq_run_hw_queues() on completion (and no other blk-mq
          driver does this).  So hopefully this change doesn't induce the need for
          yet another revert like commit 621739b0 !
      
      2)  use blk_mq_complete_request() in dm_complete_request()
      
          blk_complete_request() doesn't offer the traditional q->mq_ops vs
          .request_fn branching pattern that other historic block interfaces
          do (e.g. blk_get_request).  Using blk_mq_complete_request() for
          blk-mq requests is important for performance.  It should be noted
          that, like blk_complete_request(), blk_mq_complete_request() doesn't
          natively handle partial completions -- but the request-based
          DM-multipath target does provide the required partial completion
          support by dm.c:end_clone_bio() triggering requeueing of the request
          via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.
      
      dm-mq fix #2 is _much_ more important than #1 for eliminating the
      context switches.
      Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
      After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472
      
      With these changes multithreaded async read IOPs improved from ~950K
      to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
      IOPs of the underlying null_blk device for the same workload is ~1950K.
      
      Fixes: 7fb4898e ("block: add blk-mq support to blk_insert_cloned_request()")
      Fixes: bfebd1cd ("dm: add full blk-mq support to request-based DM")
      Cc: stable@vger.kernel.org # 4.1+
      Reported-by: default avatarSagi Grimberg <sagig@dev.mellanox.co.il>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Acked-by: default avatarJens Axboe <axboe@kernel.dk>
      6acfe68b
  24. 19 Feb, 2016 1 commit
    • Mika Westerberg's avatar
      block: Add blk_set_runtime_active() · d07ab6d1
      Mika Westerberg authored
      If block device is left runtime suspended during system suspend, resume
      hook of the driver typically corrects runtime PM status of the device back
      to "active" after it is resumed. However, this is not enough as queue's
      runtime PM status is still "suspended". As long as it is in this state
      blk_pm_peek_request() returns NULL and thus prevents new requests to be
      processed.
      
      Add new function blk_set_runtime_active() that can be used to force the
      queue status back to "active" as needed.
      Signed-off-by: default avatarMika Westerberg <mika.westerberg@linux.intel.com>
      Acked-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      d07ab6d1
  25. 17 Feb, 2016 1 commit
  26. 14 Feb, 2016 1 commit
    • Ming Lei's avatar
      blk-mq: mark request queue as mq asap · 66841672
      Ming Lei authored
      Currently q->mq_ops is used widely to decide if the queue
      is mq or not, so we should set the 'flag' asap so that both
      block core and drivers can get the correct mq info.
      
      For example, commit 868f2f0b(blk-mq: dynamic h/w context count)
      moves the hctx's initialization before setting q->mq_ops in
      blk_mq_init_allocated_queue(), then cause blk_alloc_flush_queue()
      to think the queue is non-mq and don't allocate command size
      for the per-hctx flush rq.
      
      This patches should fix the problem reported by Sasha.
      
      Cc: Keith Busch <keith.busch@intel.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarMing Lei <tom.leiming@gmail.com>
      Fixes: 868f2f0b ("blk-mq: dynamic h/w context count")
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      66841672