1. 19 Jul, 2016 1 commit
    • raid10: improve random reads performance · 0e5313e2
      Tomasz Majchrzak authored
      RAID10 random read performance is lower than expected due to excessive spinlock
      utilisation, which is required mostly for rebuild/resync. Simplify allow_barrier,
      as it is in the IO path and encounters a lot of unnecessary congestion.
      As lower_barrier just takes a lock in order to decrement a counter, convert the
      counter (nr_pending) into an atomic variable and remove the spin lock. There is
      also congestion on wake_up (it uses a lock internally), so call it only when
      it is really needed. As wake_up is no longer called constantly, ensure that a
      process waiting to raise a barrier is notified when there are no more waiting IOs.
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
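The idea in the commit above can be illustrated with a minimal user-space sketch. It uses C11 atomics instead of the kernel's atomic_t and wait queues, and every name in it (sketch_conf, sketch_io_start, sketch_io_done, the wakeups counter) is hypothetical. It shows only the pattern of a lock-free decrement with a wake-up issued when the last pending request completes, not the actual patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-array state; not the kernel structure. */
struct sketch_conf {
    atomic_int nr_pending;  /* in-flight I/O requests */
    atomic_int wakeups;     /* counts how often the costly wake-up ran */
};

/* I/O path entry: count the request without taking a lock. */
static void sketch_io_start(struct sketch_conf *c)
{
    atomic_fetch_add(&c->nr_pending, 1);
}

/*
 * I/O path exit: decrement atomically and issue the wake-up only when
 * this was the last pending request. Returns true if a wake-up was issued.
 */
static bool sketch_io_done(struct sketch_conf *c)
{
    int before = atomic_fetch_sub(&c->nr_pending, 1);

    if (before == 1) {
        /* Stand-in for waking a task that waits to raise a barrier. */
        atomic_fetch_add(&c->wakeups, 1);
        return true;
    }
    return false;
}
```

With two requests in flight, the first completion skips the wake-up and the second one issues it, so the waiter is still notified once nothing is pending.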
  2. 31 Aug, 2015 1 commit
    • md/raid10: ensure device failure recorded before write request returns. · 95af587e
      NeilBrown authored
      When a write to one of the legs of a RAID10 fails, the failure is
      recorded in the metadata of the other legs so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
  3. 03 Feb, 2015 1 commit
    • md: make ->congested robust against personality changes. · 5c675f83
      NeilBrown authored
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central mddev_congested function.
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by including the whole call inside an rcu_read_lock() region.
      This requires that the congested functions for all subordinate devices
      can be run under rcu_read_lock().  Fortunately this is the case.
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 25 Feb, 2013 1 commit
    • MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1) · 475901af
      Jonathan Brassow authored
      The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe
      widths - copying them to a different location on the same devices after
      shifting the stripe.  An example layout of each follows below:
      	        "far" algorithm
      	dev1 dev2 dev3 dev4 dev5 dev6
      	==== ==== ==== ==== ==== ====
      	 A    B    C    D    E    F
      	 G    H    I    J    K    L
      	 F    A    B    C    D    E  --> Copy of stripe0, but shifted by 1
      	 L    G    H    I    J    K
      		"offset" algorithm
      	dev1 dev2 dev3 dev4 dev5 dev6
      	==== ==== ==== ==== ==== ====
      	 A    B    C    D    E    F
      	 F    A    B    C    D    E  --> Copy of stripe0, but shifted by 1
      	 G    H    I    J    K    L
      	 L    G    H    I    J    K
      Redundancy for these algorithms is gained by shifting the copied stripes
      one device to the right.  This patch proposes that array be divided into
      sets of adjacent devices and when the stripe copies are shifted, they wrap
      on set boundaries rather than the array size boundary.  That is, for the
      purposes of shifting, the copies are confined to their sets within the
      array.  The sets are 'near_copies * far_copies' in size.
      The above "far" algorithm example would change to:
      	        "far" algorithm
      	dev1 dev2 dev3 dev4 dev5 dev6
      	==== ==== ==== ==== ==== ====
      	 A    B    C    D    E    F
      	 G    H    I    J    K    L
      	 B    A    D    C    F    E  --> Copy of stripe0, shifted 1, 2-dev sets
      	 H    G    J    I    L    K      Dev sets are 1-2, 3-4, 5-6
      This has the effect of improving the redundancy of the array.  We can
      always sustain at least one failure, but sometimes more than one can
      be handled.  In the first examples, the pairs of devices that CANNOT fail
      together are:
      	(1,2) (2,3) (3,4) (4,5) (5,6) (1,6)  [40% of possible pairs]
      In the example where the copies are confined to sets, the pairs of
      devices that cannot fail together are:
      	(1,2) (3,4) (5,6)                    [20% of possible pairs]
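The two percentages can be checked with a small sketch. It assumes two copies of each chunk, a shift of one device, and a device count that is a multiple of the set size. The function names are hypothetical and devices are numbered from 0 rather than 1.

```c
/*
 * Device (0-based) that holds the shifted copy of the chunk whose first
 * copy lives on `dev`, when copies wrap within sets of `set_size` devices.
 * Passing set_size equal to the number of devices gives the old behaviour.
 */
static int copy_device(int dev, int set_size)
{
    int set_start = dev - dev % set_size;

    return set_start + (dev % set_size + 1) % set_size;
}

/* Count device pairs whose simultaneous failure loses both copies of a chunk. */
static int count_fatal_pairs(int ndevs, int set_size)
{
    int fatal = 0;

    for (int a = 0; a < ndevs; a++)
        for (int b = a + 1; b < ndevs; b++)
            if (copy_device(a, set_size) == b ||
                copy_device(b, set_size) == a)
                fatal++;
    return fatal;
}

/* Number of distinct device pairs in the array. */
static int total_pairs(int ndevs)
{
    return ndevs * (ndevs - 1) / 2;
}
```

For six devices there are 15 possible pairs. With the shift wrapping over the whole array, count_fatal_pairs(6, 6) gives 6 (40%). With 2-device sets, count_fatal_pairs(6, 2) gives 3 (20%).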
      We cannot simply replace the old algorithms, so the 17th bit of the 'layout'
      variable is used to indicate whether we use the old or new method of computing
      the shift.  (This is similar to the way the 16th bit indicates whether the
      "far" algorithm or the "offset" algorithm is being used.)
      This patch only handles the cases where the number of total raid disks is
      a multiple of 'far_copies'.  A follow-on patch addresses the condition where
      this is not true.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
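The use of the 'layout' bits described in this commit can be sketched as a decoder. It assumes that bits are counted from zero, so the "16th" and "17th" bits correspond to the masks 1 << 16 and 1 << 17, and that the two low bytes hold the near and far copy counts. The second assumption is not stated in the commit message. The struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical container for the decoded fields. */
struct layout_fields {
    int  near_copies;   /* assumed: bits 0-7 */
    int  far_copies;    /* assumed: bits 8-15 */
    bool far_offset;    /* bit 16: "offset" rather than "far" algorithm */
    bool use_far_sets;  /* bit 17: new set-confined shift computation */
};

/* Split a layout word into its fields under the assumptions above. */
static struct layout_fields decode_layout(unsigned int layout)
{
    struct layout_fields f;

    f.near_copies  = layout & 0xff;
    f.far_copies   = (layout >> 8) & 0xff;
    f.far_offset   = (layout & (1u << 16)) != 0;
    f.use_far_sets = (layout & (1u << 17)) != 0;
    return f;
}
```

Under these assumptions, a layout value of 0x20201 would decode to one near copy, two far copies, the "far" algorithm, and the new set-based shift. Existing arrays leave bit 17 clear and so keep the old shift.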
  5. 17 Aug, 2012 1 commit
    • md/raid10: fix problem with on-stack allocation of r10bio structure. · e0ee7785
      NeilBrown authored
      A 'struct r10bio' has an array of per-copy information at the end.
      This array is declared with size [0] and r10bio_pool_alloc allocates
      enough extra space to store the per-copy information depending on the
      number of copies needed.
      So declaring a 'struct r10bio' on the stack isn't going to work.  It
      won't allocate enough space, and memory corruption will ensue.
      So in the two places where this is done, declare a sufficiently large
      structure and use that instead.
      The two call-sites of this bug were introduced in 3.4 and 3.5
      so this is suitable for both those kernels.  The patch will have to
      be modified for 3.4 as it only has one bug.
      Cc: stable@vger.kernel.org
      Reported-by: Ivan Vasilyev <ivan.vasilyev@gmail.com>
      Tested-by: Ivan Vasilyev <ivan.vasilyev@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
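The problem and one possible remedy can be sketched in standard C. The types below are simplified, hypothetical stand-ins for the kernel structures, and the union is just one portable way to reserve enough stack space. It is not necessarily the form the actual patch uses.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the per-copy information. */
struct sketch_dev {
    int  devnum;
    long addr;
};

/* Trailing array with no size of its own (C99 spelling of the [0] idiom). */
struct sketch_r10bio {
    int sectors;
    struct sketch_dev devs[];
};

/* Heap allocation: the caller adds room for `copies` trailing entries. */
static struct sketch_r10bio *sketch_r10bio_alloc(size_t copies)
{
    return calloc(1, sizeof(struct sketch_r10bio) +
                     copies * sizeof(struct sketch_dev));
}

/*
 * A plain `struct sketch_r10bio x;` on the stack reserves no room for
 * devs[], so writing x.devs[0] would corrupt neighbouring memory.
 * This wrapper reserves space for up to SKETCH_MAX_COPIES entries.
 */
#define SKETCH_MAX_COPIES 4

union sketch_r10bio_on_stack {
    struct sketch_r10bio bio;
    unsigned char space[sizeof(struct sketch_r10bio) +
                        SKETCH_MAX_COPIES * sizeof(struct sketch_dev)];
};
```

A caller would declare a union sketch_r10bio_on_stack and pass &u.bio wherever a pointer to the structure is expected, using at most SKETCH_MAX_COPIES entries.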
  6. 30 Jul, 2012 3 commits
    • MD RAID10: Export md_raid10_congested · cc4d1efd
      Jonathan Brassow authored
      md/raid10: Export is_congested test.
      In similar fashion to commits
      we export the RAID10 congestion checking function so that dm-raid.c can
      make use of it and make use of the personality.  The 'queue' and 'gendisk'
      structures will not be available to the MD code when device-mapper sets
      up the device, so we conditionalize access to these fields also.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD: Move macros from raid1*.h to raid1*.c · 473e87ce
      Jonathan Brassow authored
      MD RAID1/RAID10: Move some macros from .h file to .c file
      There are three macros (IO_BLOCKED, IO_MADE_GOOD, BIO_SPECIAL) which are defined
      in both raid1.h and raid10.h.  They are only used in their respective .c files.
      However, if we wish to make RAID10 accessible to the device-mapper RAID
      target (dm-raid.c), then we need to move these macros into the .c files where
      they are used so that they do not conflict with each other.
      The macros from the two files are identical and could be moved into md.h, but
      I chose to leave the duplication and have them remain in the personality
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID10: rename mirror_info structure · dc280d98
      Jonathan Brassow authored
      MD RAID10: Rename the structure 'mirror_info' to 'raid10_info'
      The same structure name ('mirror_info') is used by raid1.  Each of these
      structures is defined in its respective header file.  If dm-raid is
      to support both RAID1 and RAID10, the header files will be included and
      the structure names must not collide.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 21 May, 2012 1 commit
    • md/raid10: add reshape support · 3ea7daa5
      NeilBrown authored
      A 'near' or 'offset' layout RAID10 array can be reshaped to a different
      'near' or 'offset' layout, a different chunk size, and a different
      number of devices.
      However the number of copies cannot change.
      Unlike RAID5/6, we do not support having user-space backup data that
      is being relocated during a 'critical section'.  Rather, the
      data_offset of each device must change so that when writing any block
      to a new location, it will not over-write any data that is still
      This means that RAID10 reshape is not supportable on v0.90 metadata.
      The difference between the old data_offset and the new_offset must be
      at least the larger of the chunksize multiplied by offset copies of
      each of the old and new layout. (For 'near' mode, offset_copies == 1.)
      A larger difference of around 64M seems useful for in-place reshapes
      as more data can be moved between metadata updates.
      Very large differences (e.g. 512M) seem to slow the process down due
      to lots of long seeks (on oldish consumer-grade devices at least).
      Metadata needs to be updated whenever the place we are about to write
      to is considered - by the current metadata - to still contain data in
      the old layout.
      [unbalanced locking fix from Dan Carpenter <dan.carpenter@oracle.com>]
      Signed-off-by: NeilBrown <neilb@suse.de>
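The minimum data_offset difference stated in this commit can be written as a short calculation. This is a sketch of the rule as worded in the message, with chunk sizes taken to be in sectors. The function and parameter names are hypothetical and it is not the kernel's own check.

```c
/*
 * Smallest allowed gap between the old and new data_offset:
 * the larger of (chunk size * offset copies) for the old and new layouts.
 * For a 'near' layout, pass offset_copies == 1.
 */
static long min_offset_diff(long old_chunk, int old_offset_copies,
                            long new_chunk, int new_offset_copies)
{
    long old_need = old_chunk * old_offset_copies;
    long new_need = new_chunk * new_offset_copies;

    return old_need > new_need ? old_need : new_need;
}
```

For example, going from a 'near' layout with 128-sector chunks to an 'offset' layout with two offset copies and 1024-sector chunks requires a gap of at least 2048 sectors. The commit notes that a much larger gap (around 64M) is useful in practice.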
  8. 20 May, 2012 2 commits
    • md/raid10: Introduce 'prev' geometry to support reshape. · f8c9e74f
      NeilBrown authored
      When RAID10 supports reshape it will need a 'previous' and a 'current'
      geometry, so introduce that here.
      Use the 'prev' geometry when before the reshape_position, and the
      current 'geo' when beyond it.  At other times, use both as
      For now, both are identical (and reshape_position is never set).
      When we use the 'prev' geometry, we must use the old data_offset.
      When we use the current geometry (and a reshape is happening), we must use
      the new_data_offset.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: collect some geometry fields into a dedicated structure. · 5cf00fcd
      NeilBrown authored
      We will shortly be adding reshape support for RAID10 which will
      require it having 2 concurrent geometries (before and after).
      To make that easier, collect most geometry fields into 'struct geom'
      and access them from there.  Then we will more easily be able to add
      a second set of fields.
      Note that 'copies' is not in this struct and so cannot be changed.
      There is little need to change this number and doing so is a lot
      more difficult as it requires reallocating more things.
      So leave it out for now.
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 22 Dec, 2011 1 commit
  10. 10 Oct, 2011 7 commits
  11. 27 Jul, 2011 3 commits
    • md/raid10: Handle write errors by updating badblock log. · bd870a16
      NeilBrown authored
      When we get a write error (in the data area, not in metadata),
      update the badblock log rather than failing the whole device.
      As the write may well be many blocks, we try writing each
      block individually and only log the ones which fail.
      Signed-off-by: NeilBrown <neilb@suse.de>
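The per-block retry described in this commit can be sketched as a loop over the blocks of the failed write. The callback type, the function names, and the fake device are all hypothetical and exist only to make the example self-contained. The kernel code works on bios and the md bad-block list instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: writes one block and returns true on success. */
typedef bool (*write_block_fn)(long block, void *ctx);

/*
 * Retry a failed multi-block write one block at a time.
 * Blocks that still fail are stored in bad[] while there is room.
 * Returns the total number of failing blocks; a result larger than
 * bad_cap means the log overflowed.
 */
static size_t sketch_retry_per_block(long first, size_t nblocks,
                                     write_block_fn write_block, void *ctx,
                                     long *bad, size_t bad_cap)
{
    size_t nbad = 0;

    for (size_t i = 0; i < nblocks; i++) {
        long block = first + (long)i;

        if (!write_block(block, ctx)) {
            if (nbad < bad_cap)
                bad[nbad] = block;
            nbad++;
        }
    }
    return nbad;
}

/* Fake device for demonstration: only the block that ctx points to fails. */
static bool sketch_fake_write(long block, void *ctx)
{
    return block != *(const long *)ctx;
}
```

If a five-block write starting at block 10 fails because of block 12, only block 12 ends up in the log and the device as a whole stays in service.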
    • md/raid10: clear bad-block record when write succeeds. · 749c55e9
      NeilBrown authored
      If we succeed in writing to a block that was recorded as
      being bad, we clear the bad-block record.
      This requires some delayed handling as the bad-block-list update has
      to happen in process-context.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: avoid reading from known bad blocks - part 1 · 856e08e2
      NeilBrown authored
      This patch just covers the basic read path:
       1/ read_balance needs to check for badblocks, and return not only
          the chosen slot, but also how many good blocks are available
       2/ read submission must be ready to issue multiple reads to
          different devices as different bad blocks on different devices
          could mean that a single large read cannot be served by any one
          device, but can still be served by the array.
          This requires keeping count of the number of outstanding requests
          per bio.  This count is stored in 'bi_phys_segments'.
      On read error we currently just fail the request if another target
      cannot handle the whole request.  Next patch refines that a bit.
      Signed-off-by: NeilBrown <neilb@suse.de>
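The two points of this commit can be sketched as follows. For simplicity, the sketch assumes each device has at most one bad range and that the device with the longest good run is chosen. The real read_balance weighs other factors as well, and all names here are hypothetical.

```c
/* One bad range per device; len == 0 means the device has no bad blocks. */
struct bad_range {
    long start;
    long len;
};

/* Good sectors available on one device starting at pos, capped at want. */
static long good_run(const struct bad_range *bb, long pos, long want)
{
    if (bb->len == 0 || pos + want <= bb->start || pos >= bb->start + bb->len)
        return want;            /* request does not touch the bad range */
    if (pos >= bb->start)
        return 0;               /* pos itself is a bad sector */
    return bb->start - pos;     /* good up to the start of the bad range */
}

/* Point 1: return the chosen slot (-1 if none) and how many sectors are good. */
static int sketch_read_balance(const struct bad_range *devs, int ndevs,
                               long pos, long want, long *good)
{
    int best = -1;
    long best_len = 0;

    for (int d = 0; d < ndevs; d++) {
        long len = good_run(&devs[d], pos, want);

        if (len > best_len) {
            best = d;
            best_len = len;
        }
    }
    *good = best_len;
    return best;
}

/*
 * Point 2: split one request into several reads. Returns the number of
 * reads needed, or -1 if some sector is bad on every device.
 */
static int sketch_count_reads(const struct bad_range *devs, int ndevs,
                              long pos, long want)
{
    int reads = 0;

    while (want > 0) {
        long good;

        if (sketch_read_balance(devs, ndevs, pos, want, &good) < 0)
            return -1;
        pos += good;
        want -= good;
        reads++;
    }
    return reads;
}
```

With sectors 4-7 bad on the first device and sectors 0-1 bad on the second, a 10-sector read from sector 0 cannot be served by either device alone but succeeds as two reads.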
  12. 26 Jul, 2011 1 commit
    • md/raid10: Make use of new recovery_disabled handling · 2bb77736
      NeilBrown authored
      When we get a read error during recovery, RAID10 previously
      arranged for the recovering device to appear to fail so that
      the recovery stops and doesn't restart.  This is misleading and wrong.
      Instead, make use of the new recovery_disabled handling and mark
      the target device as having recovery disabled.
      Add appropriate checks in add_disk and remove_disk so that devices
      are removed and not re-added when recovery is disabled.
      Signed-off-by: NeilBrown <neilb@suse.de>
  13. 31 Mar, 2011 1 commit
  14. 23 Jun, 2010 1 commit
    • md: fix handling of array level takeover that re-arranges devices. · e93f68a1
      NeilBrown authored
      Most array level changes leave the list of devices largely unchanged,
      possibly causing one at the end to become redundant.
      However conversions between RAID0 and RAID10 need to renumber
      all devices (except 0).
      This renumbering is currently being done in the ->run method when the
      new personality takes over.  However this is too late as the common
      code in md.c might already have invalidated some of the devices if
      they had a ->raid_disk number that appeared too high.
      Moving it into the ->takeover method is too early as the array is
      still active at that time and wrong ->raid_disk numbers could cause
      So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
      the new raid_disk number.
      Now the common code knows exactly which devices need to be renumbered,
      and which can be invalidated, and can do it all at a convenient time
      when the array is suspended.
      It can also update some symlinks in sysfs which previously were not being
      updated correctly.
      Reported-by: Maciej Trela <maciej.trela@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  15. 17 May, 2010 1 commit
  16. 16 Jun, 2009 1 commit
    • md: remove mddev_to_conf "helper" macro · 070ec55d
      NeilBrown authored
      Having a macro just to cast a void* isn't really helpful.
      I would much rather see that we are simply de-referencing ->private,
      than have to know what the macro does.
      So open code the macro everywhere and remove the pointless cast.
      Signed-off-by: NeilBrown <neilb@suse.de>
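The change can be illustrated with a before/after sketch. The types and the macro below are simplified, hypothetical stand-ins rather than the kernel definitions. The point is only that a void pointer converts implicitly in C, so neither the macro nor the cast adds anything.

```c
/* Simplified stand-ins for the personality state and the array object. */
struct sketch_conf {
    int copies;
};

struct sketch_mddev {
    void *private;   /* personality-specific data */
};

/* Old style: a macro whose only job is to cast ->private. */
#define sketch_mddev_to_conf(mddev) ((struct sketch_conf *)(mddev)->private)

static int copies_via_macro(struct sketch_mddev *mddev)
{
    struct sketch_conf *conf = sketch_mddev_to_conf(mddev);

    return conf->copies;
}

/* New style: open-coded dereference; void * converts without a cast. */
static int copies_open_coded(struct sketch_mddev *mddev)
{
    struct sketch_conf *conf = mddev->private;

    return conf->copies;
}
```

Both functions return the same value. The second makes it visible at the call site that the data simply comes from ->private.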
  17. 30 Mar, 2009 2 commits
  18. 03 Oct, 2006 1 commit
  19. 26 Jun, 2006 1 commit
  20. 06 Jan, 2006 4 commits
  21. 16 Apr, 2005 1 commit
    • Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds authored
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      Let it rip!