1. 31 Jan, 2013 1 commit
    • Mike Snitzer's avatar
      dm thin: fix queue limits stacking · 0f640dca
      Mike Snitzer authored
      thin_io_hints() is blindly copying the queue limits from the thin-pool
      which can lead to incorrect limits being set.  The fix here simply
      deletes the thin_io_hints() hook which leaves the existing stacking
      infrastructure to set the limits correctly.
      
      When a thin-pool uses an MD device for the data device a thin device
      from the thin-pool must respect MD's constraints about disallowing a bio
      from spanning multiple chunks.  Otherwise we can see problems.  If the raid0
      chunksize is 1152K and thin-pool chunksize is 256K I see the following
      md/raid0 error (with extra debug tracing added to thin_endio) when
      mkfs.xfs is executed against the thin device:
      
      md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127
      device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0
      
      This extra DM debugging shows that the failing bio is spanning across
      the first and second logical 1152K chunk (sector 2080 + 255 takes the
      bio beyond the first chunk's boundary of sector 2304).  So the bio
      splitting that DM is doing clearly isn't respecting the MD limits.
      
      max_hw_sectors_kb is 127 for both the thin-pool and thin device
      (queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of
      precision).  So this explains why bi_size is 130560.
      
      But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE) given
      that it doesn't have a .merge function (for bio_add_page to consult
      indirectly via dm_merge_bvec) yet the thin-pool does sit above an MD
      device that has a compulsory merge_bvec_fn.  This scenario is exactly
      why DM must resort to sending single PAGE_SIZE bios to the underlying
      layer. Some additional context for this is available in the header for
      commit 8cbeb67a
      
       ("dm: avoid unsupported spanning of md stripe boundaries").
      
      Long story short, the reason a thin device doesn't properly get
      configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that
      thin_io_hints() is blindly copying the queue limits from the thin-pool
      device directly to the thin device's queue limits.
      
      Fix this by eliminating thin_io_hints.  Doing so is safe because the
      block layer's queue limits stacking already enables the upper level thin
      device to inherit the thin-pool device's discard and minimum_io_size and
      optimal_io_size limits that get set in pool_io_hints.  But avoiding the
      queue limits copy allows the thin and thin-pool limits to be different
      where it is important, namely max_hw_sectors_kb.
      Reported-by: default avatarDaniel Browning <db@kavod.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      0f640dca
  2. 21 Dec, 2012 10 commits
    • Mikulas Patocka's avatar
      dm: remove map_info · 7de3ee57
      Mikulas Patocka authored
      
      
      This patch removes map_info from bio-based device mapper targets.
      map_info is still used for request-based targets.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      7de3ee57
    • Mikulas Patocka's avatar
      dm thin: dont use map_context · 59c3d2c6
      Mikulas Patocka authored
      
      
      This patch removes endio_hook_pool from dm-thin and uses per-bio data instead.
      
      This patch removes any use of map_info in preparation for the next patch
      that removes map_info from bio-based device mapper.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      59c3d2c6
    • Mike Snitzer's avatar
      dm kcopyd: add WRITE SAME support to dm_kcopyd_zero · 70d6c400
      Mike Snitzer authored
      
      
      Add WRITE SAME support to dm-io and make it accessible to
      dm_kcopyd_zero().  dm_kcopyd_zero() provides an asynchronous interface
      whereas the blkdev_issue_write_same() interface is synchronous.
      
      WRITE SAME is a SCSI command that can be leveraged for more efficient
      zeroing of a specified logical extent of a device which supports it.
      Only a single zeroed logical block is transfered to the target for each
      WRITE SAME and the target then writes that same block across the
      specified extent.
      
      The dm thin target uses this.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      70d6c400
    • Mike Snitzer's avatar
      dm thin: use DMERR_LIMIT for errors · c397741c
      Mike Snitzer authored
      
      
      Throttle all errors logged from the IO path by dm thin.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      c397741c
    • Joe Thornber's avatar
      dm thin: cleanup dead code · 2aab3850
      Joe Thornber authored
      
      
      Remove unused @data_block parameter from cell_defer.
      Change thin_bio_map to use many returns rather than setting a variable.
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      2aab3850
    • Joe Thornber's avatar
      dm thin: rename cell_defer_except to cell_defer_no_holder · f286ba0e
      Joe Thornber authored
      
      
      Rename cell_defer_except() to cell_defer_no_holder() which describes
      its function more clearly.
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      f286ba0e
    • Mike Snitzer's avatar
      dm thin: emit ignore_discard in status when discards disabled · 018debea
      Mike Snitzer authored
      
      
      If "ignore_discard" is specified when creating the thin pool device then
      discard support is disabled for that device.  The pool device's status
      should reflect this fact rather than stating "no_discard_passdown"
      (which implies discards are enabled but passdown is disabled).
      Reported-by: default avatarZdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      018debea
    • Joe Thornber's avatar
      dm thin: wake worker when discard is prepared · 563af186
      Joe Thornber authored
      
      
      When discards are prepared it is best to directly wake the worker that
      will process them.  The worker will be woken anyway, via periodic
      commit, but there is no reason to not wake_worker here.
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      563af186
    • Joe Thornber's avatar
      dm thin: fix race between simultaneous io and discards to same block · e8088073
      Joe Thornber authored
      
      
      There is a race when discard bios and non-discard bios are issued
      simultaneously to the same block.
      
      Discard support is expensive for all thin devices precisely because you
      have to be careful to quiesce the area you're discarding.  DM thin must
      handle this conflicting IO pattern (simultaneous non-discard vs discard)
      even though a sane application shouldn't be issuing such IO.
      
      The race manifests as follows:
      
      1. A non-discard bio is mapped in thin_bio_map.
         This doesn't lock out parallel activity to the same block.
      
      2. A discard bio is issued to the same block as the non-discard bio.
      
      3. The discard bio is locked in a dm_bio_prison_cell in process_discard
         to lock out parallel activity against the same block.
      
      4. The non-discard bio's mapping continues and its all_io_entry is
         incremented so the bio is accounted for in the thin pool's all_io_ds
         which is a dm_deferred_set used to track time locality of non-discard IO.
      
      5. The non-discard bio is finally locked in a dm_bio_prison_cell in
         process_bio.
      
      The race can result in deadlock, leaving the block layer hanging waiting
      for completion of a discard bio that never completes, e.g.:
      
      INFO: task ruby:15354 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      ruby            D ffffffff8160f0e0     0 15354  15314 0x00000000
       ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
       ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
       ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
      Call Trace:
       [<ffffffff814e5a19>] schedule+0x29/0x70
       [<ffffffff814e3d85>] schedule_timeout+0x195/0x220
       [<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
       [<ffffffff814e589e>] wait_for_common+0x11e/0x190
       [<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
       [<ffffffff814e59ed>] wait_for_completion+0x1d/0x20
       [<ffffffff81233289>] blkdev_issue_discard+0x219/0x260
       [<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
       [<ffffffff8119a65c>] block_ioctl+0x3c/0x40
       [<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
       [<ffffffff8119a547>] ? block_llseek+0x67/0xb0
       [<ffffffff811756f1>] sys_ioctl+0xa1/0xb0
       [<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
       [<ffffffff814ef099>] system_call_fastpath+0x16/0x1b
      
      The thinp-test-suite's test_discard_random_sectors reliably hits this
      deadlock on fast SSD storage.
      
      The fix for this race is that the all_io_entry for a bio must be
      incremented whilst the dm_bio_prison_cell is held for the bio's
      associated virtual and physical blocks.  That cell locking wasn't
      occurring early enough in thin_bio_map.  This patch fixes this.
      
      Care is taken to always call the new function inc_all_io_entry() with
      the relevant cells locked, but they are generally unlocked before
      calling issue() to try to avoid holding the cells locked across
      generic_submit_request.
      
      Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
      longer the only thread that will do so.  Because of this we must be sure
      to use cell_defer_except() to release all non-holder entries, that
      were added by the other thread, because they must be deferred.
      
      This patch depends on "dm thin: replace dm_cell_release_singleton with
      cell_defer_except".
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      Cc: stable@vger.kernel.org
      e8088073
    • Joe Thornber's avatar
      dm thin: replace dm_cell_release_singleton with cell_defer_except · b7ca9c92
      Joe Thornber authored
      
      
      Change existing users of the function dm_cell_release_singleton to share
      cell_defer_except instead, and then remove the now-unused function.
      
      Everywhere that calls dm_cell_release_singleton, the bio in question
      is the holder of the cell.
      
      If there are no non-holder entries in the cell then cell_defer_except
      behaves exactly like dm_cell_release_singleton.  Conversely, if there
      *are* non-holder entries then dm_cell_release_singleton must not be used
      because those entries would need to be deferred.
      
      Consequently, it is safe to replace use of dm_cell_release_singleton
      with cell_defer_except.
      
      This patch is a pre-requisite for "dm thin: fix race between
      simultaneous io and discards to same block".
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      b7ca9c92
  3. 12 Oct, 2012 3 commits
  4. 26 Sep, 2012 3 commits
    • Mike Snitzer's avatar
      dm thin: fix discard support for data devices · 0424caa1
      Mike Snitzer authored
      
      
      The discard limits that get established for a thin-pool or thin device
      may be incompatible with the pool's data device.  Avoid this by checking
      the discard limits of the pool's data device.  If an incompatibility is
      found then the pool's 'discard passdown' feature is disabled.
      
      Change thin_io_hints to ensure that a thin device always uses the same
      queue limits as its pool device.
      
      Introduce requested_pf to track whether or not the table line originally
      contained the no_discard_passdown flag and use this directly for table
      output.  We prepare the correct setting for discard_passdown directly in
      bind_control_target (called from pool_io_hints) and store it in
      adjusted_pf rather than waiting until we have access to pool->pf in
      pool_preresume.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      0424caa1
    • Mike Snitzer's avatar
      dm thin: tidy discard support · 9bc142dd
      Mike Snitzer authored
      
      
      A little thin discard code refactoring to make the next patch (dm thin:
      fix discard support for data devices) more readable.
      Pull out a couple of functions (and uses bools instead of unsigned for
      features).
      
      No functional changes.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      9bc142dd
    • Mike Snitzer's avatar
      dm thin: do not set discard_zeroes_data · 307615a2
      Mike Snitzer authored
      
      
      The dm thin pool target claims to support the zeroing of discarded
      data areas.  This turns out to be incorrect when processing discards
      that do not exactly cover a complete number of blocks, so the target
      must always set discard_zeroes_data_unsupported.
      
      The thin pool target will zero blocks when they are allocated if the
      skip_block_zeroing feature is not specified.  The block layer
      may send a discard that only partly covers a block.  If a thin pool
      block is partially discarded then there is no guarantee that the
      discarded data will get zeroed before it is accessed again.
      Due to this, thin devices cannot claim discards will always zero data.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org # 3.4+
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      307615a2
  5. 27 Jul, 2012 15 commits
  6. 20 Jul, 2012 1 commit
    • Mikulas Patocka's avatar
      dm thin: do not send discards to shared blocks · 650d2a06
      Mikulas Patocka authored
      When process_discard receives a partial discard that doesn't cover a
      full block, it sends this discard down to that block. Unfortunately, the
      block can be shared and the discard would corrupt the other snapshots
      sharing this block.
      
      This patch detects block sharing and ends the discard with success when
      sending it to the shared block.
      
      The above change means that if the device supports discard it can't be
      guaranteed that a discard request zeroes data. Therefore, we set
      ti->discard_zeroes_data_unsupported.
      
      Thin target discard support with this bug arrived in commit
      104655fd
      
       (dm thin: support discards).
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      650d2a06
  7. 03 Jul, 2012 1 commit
    • Joe Thornber's avatar
      dm thin: commit metadata before creating metadata snapshot · 0d200aef
      Joe Thornber authored
      Userland sometimes sees a corrupt metadata block if metadata is changing
      rapidly when a metadata snapshot is reserved for userland,  To make the
      problem go away, commit before we take the metadata snapshot (which is a
      sensible thing to do anyway).
      
      The checksums mean userland spots this corruption immediately so there's
      no risk of acting on incorrect data.  No corruption exists from the
      kernel's point of view, and thin_check passes after pool shutdown.
      
      I believe this is to do with shared blocks at the first level of the
      {device, mapping} btree.  Prior to the metadata-snap support no sharing
      at this level was possible, so this patch is only required after commit
      cc8394d8
      
       ("dm thin: provide userspace
      access to pool metadata").
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      0d200aef
  8. 02 Jun, 2012 2 commits
  9. 18 May, 2012 1 commit
    • Mike Snitzer's avatar
      dm thin: fix table output when pool target disables discard passdown internally · f402693d
      Mike Snitzer authored
      
      
      When the thin pool target clears the discard_passdown parameter
      internally, it incorrectly changes the table line reported to userspace.
      This breaks dumb string comparisons on these table lines in generic
      userspace device-mapper library code and leads to tables being reloaded
      repeatedly when nothing is actually meant to be changing.
      
      This patch corrects this by no longer changing the table line when
      discard passdown was disabled.
      
      We can still tell when discard passdown is overridden by looking for the
      message "Discard unsupported by data device (sdX): Disabling discard passdown."
      
      This automatic detection is also moved from the 'load' to the 'resume'
      so that it is re-evaluated should the properties of underlying devices
      change.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Acked-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      f402693d
  10. 11 May, 2012 3 commits
    • Alasdair G Kergon's avatar
      dm thin: correct module description · 7cab8bf1
      Alasdair G Kergon authored
      
      
      Remove duplicate copy of string "device-mapper" (DM_NAME) from
      MODULE_DESCRIPTION.
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      7cab8bf1
    • Mike Snitzer's avatar
      dm thin: fix unprotected use of prepared_discards list · c3a0ce2e
      Mike Snitzer authored
      Fix two places in commit 104655fd
      
       ("dm thin: support discards") that
      didn't use pool->lock to protect against concurrent changes to the
      prepared_discards list.
      
      Without this fix, thin_endio() can race with process_discard(), leading
      to concurrent list_add()s that result in the processes locking up with
      an error like the following:
      
      WARNING: at lib/list_debug.c:32 __list_add+0x8f/0xa0()
      ...
      list_add corruption. next->prev should be prev (ffff880323b96140), but was ffff8801d2c48440. (next=ffff8801d2c485c0).
      ...
      Pid: 17205, comm: kworker/u:1 Tainted: G        W  O 3.4.0-rc3.snitm+ #1
      Call Trace:
       [<ffffffff8103ca1f>] warn_slowpath_common+0x7f/0xc0
       [<ffffffff8103cb16>] warn_slowpath_fmt+0x46/0x50
       [<ffffffffa04f6ce6>] ? bio_detain+0xc6/0x210 [dm_thin_pool]
       [<ffffffff8124ff3f>] __list_add+0x8f/0xa0
       [<ffffffffa04f70d2>] process_discard+0x2a2/0x2d0 [dm_thin_pool]
       [<ffffffffa04f6a78>] ? remap_and_issue+0x38/0x50 [dm_thin_pool]
       [<ffffffffa04f7c3b>] process_deferred_bios+0x7b/0x230 [dm_thin_pool]
       [<ffffffffa04f7df0>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool]
       [<ffffffffa04f7e42>] do_worker+0x52/0x60 [dm_thin_pool]
       [<ffffffff81056fa9>] process_one_work+0x129/0x450
       [<ffffffff81059b9c>] worker_thread+0x17c/0x3c0
       [<ffffffff81059a20>] ? manage_workers+0x120/0x120
       [<ffffffff8105eabe>] kthread+0x9e/0xb0
       [<ffffffff814ceda4>] kernel_thread_helper+0x4/0x10
       [<ffffffff8105ea20>] ? kthread_freezable_should_stop+0x70/0x70
       [<ffffffff814ceda0>] ? gs_change+0x13/0x13
      ---[ end trace 7e0a523bc5e52692 ]---
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      c3a0ce2e
    • Mike Snitzer's avatar
      dm thin: reinstate missing mempool_free in cell_release_singleton · 03aaae7c
      Mike Snitzer authored
      Fix a significant memory leak inadvertently introduced during
      simplification of cell_release_singleton() in commit
      6f94a4c4
      
       ("dm thin: fix stacked bi_next
      usage").
      
      A cell's hlist_del() must be accompanied by a mempool_free().
      Use __cell_release() to do this, like before.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      03aaae7c