1. 27 Apr, 2015 1 commit
    • SCSI: add 1024 max sectors black list flag · 35e9a9f9
      Mike Christie authored
      This works around an issue with QNAP iSCSI targets not handling large IOs
      very well.
      
      The target returns:
      
      VPD INQUIRY: Block limits page (SBC)
        Maximum compare and write length: 1 blocks
        Optimal transfer length granularity: 1 blocks
        Maximum transfer length: 4294967295 blocks
        Optimal transfer length: 4294967295 blocks
        Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
        Maximum unmap LBA count: 8388607
        Maximum unmap block descriptor count: 1
        Optimal unmap granularity: 16383
        Unmap granularity alignment valid: 0
        Unmap granularity alignment: 0
        Maximum write same length: 0xffffffff blocks
        Maximum atomic transfer length: 0
        Atomic alignment: 0
        Atomic transfer length granularity: 0
      
      and it is *sometimes* able to handle at least one IO of size up to 8 MB.
      Traces show that it sometimes works, but at other times it fails, and it
      appears to return failures when we send multiple large IOs. It can also
      return two different errors: sometimes it sends iSCSI reject errors
      indicating out of resources, and sometimes it sends invalid CDB illegal
      request check conditions. When it sends iSCSI rejects, it does not seem
      to handle retries when there are command sequence holes, so I could not
      just add code to try to gracefully handle that error code.
      
      The problem is that we do not have a good contact for the company,
      so we are not able to determine under what conditions it returns
      which error and why it sometimes works.
      
      So, this patch just adds a new black list flag to set targets like this to
      the old max safe sectors of 1024. The max_hw_sectors changes added in 3.19
      caused this regression, so I am also ccing stable.
      Reported-by: Christian Hesse <list@eworm.de>
      Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
      Cc: stable@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: James Bottomley <JBottomley@Odin.com>
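The effect of such a flag can be illustrated with a minimal userspace sketch. This is not the patch itself: the flag name `SKETCH_BLIST_MAX_1024`, the constant `SKETCH_SAFE_MAX_SECTORS` and the helper `sketch_pick_max_sectors()` are hypothetical names chosen for illustration, and the sketch assumes the limit is expressed in 512-byte sectors, so 1024 sectors correspond to 512 KiB per request.

```c
#include <stdint.h>

/* Hypothetical flag bit for this sketch; not the kernel's definition. */
#define SKETCH_BLIST_MAX_1024 (1u << 0)

/* The old "safe" limit named in the commit message, in 512-byte sectors. */
#define SKETCH_SAFE_MAX_SECTORS 1024u

/*
 * Pick the per-request sector limit for a device.
 *   reported_max: limit derived from what the target advertises
 *   bflags:       black list flags matched for this device
 * A flagged device is capped at the safe limit; others keep their limit.
 */
static uint32_t sketch_pick_max_sectors(uint32_t reported_max, uint32_t bflags)
{
    if ((bflags & SKETCH_BLIST_MAX_1024) &&
        reported_max > SKETCH_SAFE_MAX_SECTORS)
        return SKETCH_SAFE_MAX_SECTORS;
    return reported_max;
}
```

With the target above advertising a maximum transfer length of 4294967295 blocks, a flagged device would be capped at 1024 sectors, while an unflagged device would keep the advertised value.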
  2. 26 Apr, 2015 1 commit
  3. 24 Apr, 2015 3 commits
  4. 23 Apr, 2015 13 commits
  5. 22 Apr, 2015 11 commits
  6. 21 Apr, 2015 11 commits
    • ACPI / EC: fix NULL pointer dereference in acpi_ec_remove_query_handler() · 6b5eab54
      Chris Bainbridge authored
      Use list_for_each_entry_safe for iterating because handler may be freed
      in the loop.
      
      BUG: unable to handle kernel NULL pointer dereference at 000000000000002c
      IP: [<ffffffff814d69c8>] acpi_ec_put_query_handler+0x7/0x1a
      Call Trace:
       acpi_ec_remove_query_handler+0x87/0x97
       acpi_smbus_hc_remove+0x2a/0x44 [sbshc]
       acpi_device_remove+0x7b/0x9a
       __device_release_driver+0x7e/0x110
       driver_detach+0xb0/0xc0
       bus_remove_driver+0x54/0xe0
       driver_unregister+0x2b/0x60
       acpi_bus_unregister_driver+0x10/0x12
       acpi_smb_hc_driver_exit+0x10/0x12 [sbshc]
       SyS_delete_module+0x1b8/0x210
       system_call_fastpath+0x12/0x6a
      Signed-off-by: Chris Bainbridge <chris.bainbridge@gmail.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
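The reason a "safe" iterator is needed can be shown with a small userspace sketch. It does not use the kernel's list.h; `struct sketch_handler` and the `sketch_*` helpers are hypothetical stand-ins. The point it illustrates is that the successor pointer is read before the current node can be freed, so the loop never touches freed memory.

```c
#include <stdlib.h>

/* Hypothetical handler node for this sketch (not the ACPI EC structure). */
struct sketch_handler {
    int query_bit;
    struct sketch_handler *next;
};

/* Push a handler on the front of the list; returns the new head,
 * or the unchanged head if allocation fails. */
static struct sketch_handler *sketch_add(struct sketch_handler *head,
                                         int query_bit)
{
    struct sketch_handler *h = malloc(sizeof(*h));

    if (!h)
        return head;
    h->query_bit = query_bit;
    h->next = head;
    return h;
}

/*
 * Remove and free every handler matching query_bit.
 * `next` is saved before `cur` may be freed, which is the idea behind
 * the _safe iteration variant. Returns the number of nodes removed.
 */
static int sketch_remove(struct sketch_handler **head, int query_bit)
{
    struct sketch_handler **link = head;
    struct sketch_handler *cur, *next;
    int removed = 0;

    for (cur = *head; cur; cur = next) {
        next = cur->next;              /* read before cur can be freed */
        if (cur->query_bit == query_bit) {
            *link = next;              /* unlink from the list */
            free(cur);
            removed++;
        } else {
            link = &cur->next;
        }
    }
    return removed;
}
```

A loop that instead advanced with `cur = cur->next` after `free(cur)` would read freed memory, which is the class of bug the commit addresses.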
    • md/raid5: don't do chunk aligned read on degraded array. · 9ffc8f7c
      Eric Mei authored
      When an array is degraded, reads that land on failed drives result in
      reading the rest of the data in the stripe. So a single sequential read
      results in the same data being read twice.
      
      This patch avoids chunk aligned reads for a degraded array. The
      downside is that it involves the stripe cache, which means associated
      CPU overhead and an extra memory copy.
      
      Test Results:
      The following tests were done on an enterprise storage node with Seagate
      6T SAS drives and a Xeon E5-2648L CPU (10 cores, 1.9 GHz), 10-disk MD
      RAID6 8+2, chunk size 128 KiB.
      
      I used FIO with direct IO, various bs sizes and enough queue depth, and
      tested sequential and 100% random reads against 3 array configurations:
       1) optimal, as baseline;
       2) degraded;
       3) degraded with this patch.
      Kernel version is 4.0-rc3.
      
      I ran each individual test only once, so there might be some variation,
      but we just focus on the overall trend.
      
      Sequential Read:
        bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch (MiB/s)
         1024       1608            656              995
          512       1624            710              956
          256       1635            728              980
          128       1636            771              983
           64       1612           1119             1000
           32       1580           1420             1004
           16       1368            688              986
            8        768            647              953
            4        411            413              850
      
      Random Read:
        bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch (IOPS)
         1024        163            160              156
          512        274            273              272
          256        426            428              424
          128        576            592              591
           64        726            724              726
           32        849            848              837
           16        900            970              971
            8        927            940              929
            4        948            940              955
      
      Some notes:
        * In sequential + optimal, as bs gets smaller, the FIO thread
          becomes CPU bound.
        * In sequential + degraded, there is a big increase when bs is 64K and
          32K; I don't have an explanation.
        * In sequential + degraded-with-patch, the MD thread mostly becomes CPU
          bound.
      
      If you want, we can discuss specific data points. But in general it
      seems that with this patch we get more predictable and, in most cases,
      significantly better sequential read performance when the array is
      degraded, with almost no noticeable impact on random reads.
      
      Performance is a complicated thing: the patch works well for this
      particular configuration, but the result may not be universal. For
      example, I imagine testing on an all-SSD array may give very different
      results. But I personally think that in most cases IO bandwidth is a
      scarcer resource than CPU.
      Signed-off-by: Eric Mei <eric.mei@seagate.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: allow the stripe_cache to grow and shrink. · edbe83ab
      NeilBrown authored
      The default setting of 256 stripe_heads is probably
      much too small for many configurations.  So it is best to make it
      auto-configure.
      
      Shrinking the cache under memory pressure is easy.  The only
      interesting part here is that we put a fairly high cost
      ('seeks') on shrinking the cache, as the cost is greater than
      just having to read more data: it also reduces parallelism.
      
      Growing the cache on demand needs to be done carefully.  If we allow
      fast growth, that can upset memory balance as lots of dirty memory can
      quickly turn into lots of memory queued in the stripe_cache.
      It is important for the raid5 block device to appear congested to
      allow write-throttling to work.
      
      So we only add stripes slowly. We set a flag when an allocation
      fails because all stripes are in use, allocate at a convenient
      time when that flag is set, and don't allow it to be set again
      until at least one stripe_head has been released for re-use.
      
      This means that a spurt of requests will only cause one stripe_head
      to be allocated, but a steady stream of requests will slowly
      increase the cache size - until memory pressure puts it back again.
      
      It could take hours to reach a steady state.
      
      The value written to, and displayed in, stripe_cache_size is
      used as a minimum.  The cache can grow above this and shrink back
      down to it.  The actual size is not directly visible, though it can
      be deduced to some extent by watching stripe_cache_active.
      Signed-off-by: NeilBrown <neilb@suse.de>
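The slow-growth rule described above can be sketched as a tiny state machine. This is a userspace illustration under stated assumptions, not the raid5 code: `struct sketch_cache`, its field names and the `sketch_*` helpers are all hypothetical. It shows the three pieces the message names: a flag set when allocation fails with every stripe busy, one stripe added at a convenient time when the flag is set, and no new request until a stripe has been released.

```c
#include <stdbool.h>

/* Hypothetical model of the cache state; field names are illustrative. */
struct sketch_cache {
    int nr_stripes;    /* current cache size */
    int in_use;        /* stripes currently handed out */
    bool grow_needed;  /* an allocation failed with all stripes busy */
    bool grow_armed;   /* grow_needed may be set; re-armed on release */
};

static void sketch_init(struct sketch_cache *c, int nr_stripes)
{
    c->nr_stripes = nr_stripes;
    c->in_use = 0;
    c->grow_needed = false;
    c->grow_armed = true;
}

/* Try to take a stripe. On failure, request growth at most once
 * until some stripe has been released again. */
static bool sketch_get(struct sketch_cache *c)
{
    if (c->in_use < c->nr_stripes) {
        c->in_use++;
        return true;
    }
    if (c->grow_armed) {
        c->grow_needed = true;
        c->grow_armed = false;
    }
    return false;
}

/* Release a stripe and re-arm the growth request. */
static void sketch_release(struct sketch_cache *c)
{
    if (c->in_use > 0) {
        c->in_use--;
        c->grow_armed = true;
    }
}

/* Called at a "convenient time": add at most one stripe if requested. */
static void sketch_maybe_grow(struct sketch_cache *c)
{
    if (c->grow_needed) {
        c->nr_stripes++;
        c->grow_needed = false;
    }
}
```

In this model a burst of failed allocations grows the cache by one stripe only, while repeated fail/release cycles grow it one stripe at a time, matching the "spurt" versus "steady stream" behaviour described in the message.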
    • md/raid5: change ->inactive_blocked to a bit-flag. · 5423399a
      NeilBrown authored
      This allows us to easily add more (atomic) flags.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe · 486f0644
      NeilBrown authored
      Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
      succeeds, do it inside the functions.
      
      Also choose the correct hash to handle next inside the functions.
      
      This removes duplication and will help with future new uses of
      {grow,drop}_one_stripe.
      
      This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
      message always said "0kB".
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: pass gfp_t arg to grow_one_stripe() · a9683a79
      NeilBrown authored
      This is needed for future improvement to stripe cache management.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: introduce configuration option rmw_level · d06f191f
      Markus Stockhausen authored
      Depending on the available coding we allow optimized rmw logic for write
      operations. To support easier testing, this patch allows manual control
      of the rmw/rcw decision through the interface /sys/block/mdX/md/rmw_level.
      
      The configuration can handle three levels of control.
      
      rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
      calculation has no implementation path yet to factor in/out chunks of
      a syndrome. Enforcing this level can be beneficial for slow CPUs with
      hardware syndrome support and fast SSDs.
      
      rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
      save IOs. This equals the "old" unpatched behaviour and will be the
      default.
      
      rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
      equal. We might have higher CPU consumption because of calculating the
      parity twice, but it can be beneficial otherwise, e.g. RAID4 with a fast
      dedicated parity disk/SSD. The option is implemented just to be
      forward-looking and will ONLY work with this patch!
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
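The three levels can be summarised as a small decision function. This is a sketch of the rule as worded in the message, not the kernel code: `sketch_choose()`, the enum and the parameter names are hypothetical, and the IO estimates are assumed to be supplied by the caller.

```c
/* Hypothetical names for this sketch. */
enum sketch_write_path {
    SKETCH_RCW,  /* reconstruct-write */
    SKETCH_RMW   /* read-modify-write */
};

/*
 * Choose the write path from the configured level and the estimated
 * number of IOs each path would need.
 *   level 0: never rmw
 *   level 1: rmw only if it saves IOs (strictly fewer than rcw)
 *   level 2: rmw also when both estimates are equal
 * Unknown levels are treated like the default, level 1.
 */
static enum sketch_write_path sketch_choose(int rmw_level,
                                            int rmw_ios, int rcw_ios)
{
    switch (rmw_level) {
    case 0:
        return SKETCH_RCW;
    case 2:
        return rmw_ios <= rcw_ios ? SKETCH_RMW : SKETCH_RCW;
    case 1:
    default:
        return rmw_ios < rcw_ios ? SKETCH_RMW : SKETCH_RCW;
    }
}
```

The only difference between levels 1 and 2 in this sketch is the tie case, which is how the message distinguishes them.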
    • md/raid5: activate raid6 rmw feature · 584acdd4
      Markus Stockhausen authored
      Glue it all together. The raid6 rmw path should work the same as the
      already existing raid5 logic. So emulate the prexor handling/flags
      and split functions as needed.
      
      1) Enable xor_syndrome() in the async layer.
      
      2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
      at the start of a rmw run as we did it before for the single parity.
      
      3) Take care of rmw run in ops_run_reconstruct6(). Again process only
      the changed pages to get syndrome back into sync.
      
      4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
      run. The lower layers will calculate start & end pages from that and
      call the xor_syndrome() correspondingly.
      
      5) Adapt the several places where we ignored Q handling up to now.
      
      Performance numbers for a single E5630 system with a mix of 10 7200k
      desktop/server disks. 300 seconds random write with 8 threads onto a
      3.2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
      
      bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
              skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
         4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
         8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
        16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
        32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
        64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
       128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
       256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
       512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
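Why a read-modify-write can keep both parities in sync using only the changed block follows from the linearity of the P/Q syndrome. The sketch below works on single bytes in userspace and is not the kernel's xor_syndrome(). The `sketch_*` names are hypothetical, and it assumes the usual RAID6 construction: P is the XOR of the data, and Q is the sum of 2^i * D_i in GF(2^8) with polynomial 0x11d. It folds the "xor out the old data, xor in the new data" steps into a single delta.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply by the generator 2 in GF(2^8), reduction polynomial 0x11d. */
static uint8_t sketch_gf_mul2(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0x00));
}

/* Multiply v by 2^exp in GF(2^8). */
static uint8_t sketch_gf_mul_pow2(uint8_t v, size_t exp)
{
    while (exp--)
        v = sketch_gf_mul2(v);
    return v;
}

/* Full recompute (rcw style): P = XOR of all data, Q = sum of 2^i * D_i. */
static void sketch_gen_syndrome(const uint8_t *d, size_t n,
                                uint8_t *p, uint8_t *q)
{
    uint8_t pp = 0, qq = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        pp ^= d[i];
        qq ^= sketch_gf_mul_pow2(d[i], i);
    }
    *p = pp;
    *q = qq;
}

/*
 * rmw-style update: only the old and new value of the block at
 * index idx are needed to bring P and Q back into sync.
 */
static void sketch_rmw_update(size_t idx, uint8_t old_d, uint8_t new_d,
                              uint8_t *p, uint8_t *q)
{
    uint8_t delta = old_d ^ new_d;

    *p ^= delta;
    *q ^= sketch_gf_mul_pow2(delta, idx);
}
```

Updating one block with `sketch_rmw_update()` gives the same P and Q as recomputing over all data with `sketch_gen_syndrome()`, which is the property the rmw path relies on.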
    • raid5: handle expansion/resync case with stripe batching · dabc4ec6
      shli@kernel.org authored
      Expansion/resync can grab a stripe while the stripe is in a batch list.
      Since all stripes in a batch list must be in the same state, we can't
      allow some stripes to run into expansion/resync. So we delay
      expansion/resync for stripes in a batch list.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: handle io error of batch list · 72ac7330
      shli@kernel.org authored
      If an IO error happens in any stripe of a batch list, the batch list is
      split, and then normal processing runs for the stripes in the list.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • RAID5: batch adjacent full stripe write · 59fc630b
      shli@kernel.org authored
      The stripe cache works in 4k units. Even adjacent full stripe writes are
      handled in 4k units. Ideally we should use a bigger size for adjacent full
      stripe writes. A bigger stripe cache unit means fewer stripes running in
      the state machine, which reduces CPU overhead. A bigger size also lets
      bigger IOs be dispatched to the underlying disks.
      
      With this patch, we automatically batch adjacent full stripe writes
      together. Such stripes are added to a batch list. Only the first stripe
      of the list is put on handle_list and so runs handle_stripe(). Some steps
      of handle_stripe() are extended to cover all stripes of the list, including
      ops_run_io, ops_run_biodrain and so on. With this patch, we have fewer
      stripes running in handle_stripe() and we send the IO of the whole stripe
      list together to increase IO size.
      
      Stripes added to a batch list have some limitations. A batch list can only
      include full stripe writes and can't cross a chunk boundary, to make sure
      the stripes have the same parity disks. Stripes in a batch list must be in
      the same state (no written, toread and so on). If a stripe is in a batch
      list, all new reads/writes passed to add_stripe_bio will be blocked as an
      overlap conflict until the batch list is handled. These limitations make
      sure the stripes in a batch list are in exactly the same state throughout
      their life cycle.
      
      I tested 160k randwrite on a RAID5 array with 32k chunk size and 6 PCIe
      SSDs. This patch improves performance by around 30%, and the IO size sent
      to the underlying disks is exactly 32k. I also ran a 4k randwrite test on
      the same array to make sure performance isn't changed by the patch.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
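The batching limitations can be illustrated with a small eligibility check. This is a userspace sketch, not the raid5 implementation: `struct sketch_stripe`, `sketch_can_batch()` and `SKETCH_STRIPE_SECTORS` are hypothetical names, and the sketch assumes a 4k stripe unit expressed as 8 sectors of 512 bytes and a per-device start sector for each stripe.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed 4k stripe unit, in 512-byte sectors. */
#define SKETCH_STRIPE_SECTORS 8u

/* Hypothetical, simplified view of a stripe for this sketch. */
struct sketch_stripe {
    uint64_t sector;         /* start sector on each member device */
    bool full_stripe_write;  /* every data block is being overwritten */
};

/*
 * Can `sh` join a batch that currently ends with `prev`?
 * Both must be full stripe writes, `sh` must directly follow `prev`,
 * and `sh` must not start a new chunk (same parity disk for the batch).
 */
static bool sketch_can_batch(const struct sketch_stripe *prev,
                             const struct sketch_stripe *sh,
                             uint32_t chunk_sectors)
{
    if (!prev->full_stripe_write || !sh->full_stripe_write)
        return false;
    if (sh->sector != prev->sector + SKETCH_STRIPE_SECTORS)
        return false;                     /* not adjacent */
    if (sh->sector % chunk_sectors == 0)
        return false;                     /* would cross a chunk boundary */
    return true;
}
```

Under these assumptions a 32k chunk (64 sectors) holds at most eight 4k stripes per batch, which is consistent with the 32k IO size reported for the underlying disks in the test above.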