1. 06 Oct, 2011 3 commits
  2. 20 Sep, 2011 1 commit
    • md: Avoid waking up a thread after it has been freed. · 01f96c0a
      NeilBrown authored
      Two related problems:
      
      1/ some error paths call "md_unregister_thread(mddev->thread)"
         without subsequently clearing ->thread.  A subsequent call
         to mddev_unlock will try to wake the thread, and crash.
      
      2/ Most calls to md_wakeup_thread are protected against the thread
         disappearing either by:
            - holding the ->mutex
            - having an active request, so something else must be keeping
              the array active.
         However mddev_unlock calls md_wakeup_thread after dropping the
         mutex and without any certainty of an active request, so the
         ->thread could theoretically disappear.
         So we need a spinlock to provide some protection.
      
      So change md_unregister_thread to take a pointer to the thread
      pointer, and ensure that it always does the required locking, and
      clears the pointer properly.
      Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      cc: stable@kernel.org
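
      A rough sketch of the reworked helper (not the verbatim patch; shown with
      current md type names, and the choice of pers_lock as the spinlock is an
      assumption here):

              void md_unregister_thread(struct md_thread **threadp)
              {
                      struct md_thread *thread = *threadp;

                      if (!thread)
                              return;
                      /* Clear the caller's pointer under the lock so that
                       * mddev_unlock(), which may call md_wakeup_thread()
                       * after dropping the mutex, can never see a pointer
                       * to a thread that is about to be freed.
                       */
                      spin_lock(&pers_lock);
                      *threadp = NULL;
                      spin_unlock(&pers_lock);

                      kthread_stop(thread->tsk);
                      kfree(thread);
              }

      The matching change is to take the same lock around the md_wakeup_thread()
      call in mddev_unlock(), so the wakeup and the free can no longer race.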
  3. 30 Aug, 2011 1 commit
    • md/raid5: fix a hang on device failure. · 43220aa0
      NeilBrown authored
      Waiting for a 'blocked' rdev to become unblocked in the raid5d thread
      cannot work with internal metadata as it is the raid5d thread which
      will clear the blocked flag.
      This wasn't a problem in 3.0 and earlier, as we only set the blocked
      flag when external metadata was in use.
      However we now set it always, so we need to be more careful.
      Signed-off-by: NeilBrown <neilb@suse.de>
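
      Roughly, the fix is to block in the stripe-handling path only when somebody
      other than raid5d (i.e. a user-space manager of external metadata) will
      clear the flag; a hedged sketch of the relevant branch in handle_stripe(),
      with variable names assumed for illustration:

              if (unlikely(blocked_rdev)) {
                      if (conf->mddev->external)
                              /* external metadata: user space clears 'blocked' */
                              md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
                      else
                              /* Internal metadata is written by this same raid5d
                               * thread, so waiting here would deadlock; drop the
                               * reference and revisit the stripe later instead.
                               */
                              rdev_dec_pending(blocked_rdev, conf->mddev);
              }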
  4. 27 Jul, 2011 7 commits
    • md/raid5: Clear bad blocks on successful write. · b84db560
      NeilBrown authored
      On a successful write to a known bad block, flag the sh
      so that raid5d can remove the known bad block from the list.
      Signed-off-by: NeilBrown <neilb@suse.de>
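
      A minimal sketch of the idea (flag and helper names taken from the raid5
      code; treat exact signatures as illustrative): the write-completion path
      marks the device slot, and handle_stripe() later clears the record in
      process context:

              /* write completion: a write over a known bad range succeeded */
              if (is_badblock(rdev, sh->sector, STRIPE_SECTORS,
                              &first_bad, &bad_sectors))
                      set_bit(R5_MadeGood, &sh->dev[i].flags);

              /* later, in handle_stripe() / raid5d context */
              if (test_and_clear_bit(R5_MadeGood, &dev->flags))
                      rdev_clear_badblocks(rdev, sh->sector, STRIPE_SECTORS);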
    • md/raid5: Don't write to known bad block on doubtful devices. · 73e92e51
      NeilBrown authored
      If a device has seen write errors, don't write to any known
      bad blocks on that device.
      Signed-off-by: NeilBrown <neilb@suse.de>
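
      Sketched (not the exact upstream logic), the submission path consults the
      bad-block list before writing once a device carries the WriteErrorSeen flag:

              if ((rw & WRITE) && test_bit(WriteErrorSeen, &rdev->flags) &&
                  is_badblock(rdev, sh->sector, STRIPE_SECTORS,
                              &first_bad, &bad_sectors))
                      /* Don't touch the bad range: treat this device as having
                       * failed the write for this stripe instead. */
                      set_bit(R5_WriteError, &sh->dev[i].flags);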
    • md/raid5: write errors should be recorded as bad blocks if possible. · bc2607f3
      NeilBrown authored
      When a write error is detected, don't mark the device as failed
      immediately but rather record the fact for handle_stripe to deal with.
      
      Handle_stripe then attempts to record a bad block.  Only if that fails
      does the device get marked as faulty.
      Signed-off-by: NeilBrown <neilb@suse.de>
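
      A hedged sketch of that flow (flag and helper names as in the raid5/md code,
      details simplified): the interrupt-context completion handler only records
      the failure, and handle_stripe() does the rest in process context:

              /* raid5_end_write_request(): no md_error() here any more */
              if (!uptodate)
                      set_bit(R5_WriteError, &sh->dev[i].flags);

              /* handle_stripe() */
              if (test_and_clear_bit(R5_WriteError, &dev->flags)) {
                      if (!rdev_set_badblocks(rdev, sh->sector, STRIPE_SECTORS, 0))
                              /* could not record a bad block: only now do we
                               * fail the whole device */
                              md_error(conf->mddev, rdev);
              }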
    • md/raid5: use bad-block log to improve handling of uncorrectable read errors. · 7f0da59b
      NeilBrown authored
      If we get an uncorrectable read error - record a bad block rather than
      failing the device.
      And if these errors (which may be due to known bad blocks) cause
      recovery to be impossible, record a bad block on the recovering
      devices, or abort the recovery.
      
      As we might abort a recovery without failing a device, we need to teach
      RAID5 about recovery_disabled handling.
      Signed-off-by: NeilBrown <neilb@suse.de>
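
      In rough terms the uncorrectable-read path becomes (a sketch under the same
      names as above, omitting the retry and recovery_disabled details):

              /* raid5_end_read_request(): retries exhausted */
              clear_bit(R5_ReadError, &sh->dev[i].flags);
              clear_bit(R5_ReWrite, &sh->dev[i].flags);
              if (!rdev_set_badblocks(rdev, sh->sector, STRIPE_SECTORS, 0))
                      /* only if we cannot even record a bad block does the
                       * whole device get failed */
                      md_error(conf->mddev, rdev);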
    • md/raid5: avoid reading from known bad blocks. · 31c176ec
      NeilBrown authored
      There are two times that we might read in raid5:
      1/ when a read request fits within a chunk on a single
         working device.
         In this case, if there is any bad block in the range of
         the read, we simply fail the cache-bypass read and
         perform the read through the stripe cache.
      
      2/ when reading into the stripe cache.  In this case we
         mark as failed any device which has a bad block in that
         strip (1 page wide).
         Note that we will both avoid reading and avoid writing.
         This is correct (as we will never read from the block, there
         is no point writing), but not optimal (as writing could 'fix'
         the error) - that will be addressed later.
      
      If we have not seen any write errors on the device yet, we treat a bad
      block like a recent read error.  This will encourage an attempt to fix
      the read error which will either generate a write error, or will
      ensure good data is stored there.  We don't yet forget the bad block
      in that case.  That comes later.
      
      Now that we honour bad blocks when reading we can allow devices with
      bad blocks into the array.
      Signed-off-by: NeilBrown <neilb@suse.de>
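
      For the first case, a hedged sketch of the check in the cache-bypass read
      path (variable names assumed for illustration), using the is_badblock()
      helper from md.h:

              sector_t first_bad;
              int bad_sectors;

              if (is_badblock(rdev, r_sector, bio_sectors(raid_bio),
                              &first_bad, &bad_sectors)) {
                      /* any bad block in the range: give up on the bypass and
                       * let the caller read through the stripe cache instead */
                      rdev_dec_pending(rdev, mddev);
                      return 0;
              }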
    • md: make it easier to wait for bad blocks to be acknowledged. · de393cde
      NeilBrown authored
      It is only safe to choose not to write to a bad block if that bad
      block is safely recorded in metadata - i.e. if it has been
      'acknowledged'.
      
      If it hasn't we need to wait for the acknowledgement.
      
      We support that using the rdev->blocked wait mechanism and
      md_wait_for_blocked_rdev, by introducing a new device flag
      'BlockedBadBlocks'.
      
      This flag is only advisory.
      It is cleared whenever we acknowledge a bad block, so that a waiter
      can re-check the particular bad blocks that it is interested in.
      
      It should be set by a caller when they find they need to wait.
      This (set after test) is inherently racy, but as
      md_wait_for_blocked_rdev already has a timeout, losing the race will
      have minimal impact.
      
      When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
      was set incorrectly (see the race above).
      
      We also modify the way we manage 'Blocked' to fit better with the new
      handling of 'BlockedBadBlocks' and to make it consistent between
      externally managed and internally managed metadata.  This requires
      that each raidXd loop checks whether the metadata needs to be written
      and triggers a write (md_check_recovery) if needed.  Otherwise a queued
      write request might leave raidXd waiting for a metadata write that only
      the raidXd thread itself can perform.
      
      Before writing metadata, we set FaultRecorded for all devices that
      are Faulty, then after writing the metadata we clear Blocked for any
      device for which the Fault was certainly Recorded.
      
      The 'faulty' device flag now appears in sysfs if the device is faulty
      *or* it has unacknowledged bad blocks.  So user-space which does not
      understand bad blocks can continue to function correctly.
      User space which does, should not assume a device is faulty until it
      sees the 'faulty' flag, and then sees the list of unacknowledged bad
      blocks is empty.
      Signed-off-by: NeilBrown <neilb@suse.de>
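
      The intended calling pattern, sketched (set-after-test, as described, with
      md_wait_for_blocked_rdev()'s timeout covering the race; is_badblock()
      returns a negative value when a bad block in the range is not yet
      acknowledged):

              if (is_badblock(rdev, sector, sectors,
                              &first_bad, &bad_sectors) < 0) {
                      set_bit(BlockedBadBlocks, &rdev->flags);
                      atomic_inc(&rdev->nr_pending);  /* reference dropped by the wait */
                      md_wait_for_blocked_rdev(rdev, mddev);
                      /* re-check the bad blocks of interest after waking up */
              }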
    • md: don't allow arrays to contain devices with bad blocks. · 34b343cf
      NeilBrown authored
      As no personality understands bad block lists yet, we must
      reject any device that is known to contain bad blocks.
      As the personalities get taught, these tests can be removed.
      
      This only applies to raid1/raid5/raid10.
      For linear/raid0/multipath/faulty the whole concept of bad blocks
      doesn't mean anything so there is no point adding the checks.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
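
      The check itself is small; roughly, each affected personality's run() and
      hot-add paths gain something like this (a sketch, error handling elided):

              if (rdev->badblocks.count) {
                      printk(KERN_ERR "md: cannot handle device with bad blocks yet\n");
                      return -EINVAL;
              }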
  5. 26 Jul, 2011 13 commits
  6. 25 Jul, 2011 7 commits
  7. 18 Jul, 2011 2 commits
  8. 13 Jun, 2011 3 commits
  9. 08 Jun, 2011 1 commit
    • MD: raid5 do not set fullsync · d6b212f4
      Jonathan Brassow authored
      Add a check to determine whether a device needs a full resync or whether a partial resync will do.
      
      RAID 5 was assuming that if a device was not In_sync, it must undergo a full
      resync.  We add a check to see if 'saved_raid_disk' is the same as 'raid_disk'.
      If it is, we can safely skip the full resync and rely on the bitmap for
      partial recovery instead.  This is the legitimate purpose of 'saved_raid_disk',
      from md.h:
      int saved_raid_disk;            /* role that device used to have in the
                                       * array and could again if we did a partial
                                       * resync from the bitmap
                                       */
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
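
      Sketched, the raid5 setup path for a device that is not In_sync now compares
      'saved_raid_disk' with the slot the device is taking before forcing a full
      resync (variable names assumed for illustration):

              if (!test_bit(In_sync, &rdev->flags)) {
                      if (rdev->saved_raid_disk != raid_disk)
                              /* device last served a different role: the bitmap
                               * cannot be trusted, force a full resync */
                              conf->fullsync = 1;
                      /* otherwise the bitmap can drive a partial resync */
              }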
  10. 10 May, 2011 2 commits
    • md: allow resync_start to be set while an array is active. · b098636c
      NeilBrown authored
      The sysfs attribute 'resync_start' (known internally as recovery_cp),
      records where a resync is up to.  A value of 0 means the array is
      not known to be in-sync at all.  A value of MaxSector means the array
      is believed to be fully in-sync.
      
      When the size of member devices of an array (RAID1, RAID4/5/6) is
      increased, the array can be increased to match.  This process sets
      resync_start to the old end-of-device offset so that the new part of
      the array gets resynced.
      
      However with RAID1 (and RAID6) a resync is not technically necessary
      and may be undesirable.  So it would be good if the implied resync
      after the array is resized could be avoided.
      
      So: change 'resync_start' so the value can be changed while the array
      is active, and as a precaution only allow it to be changed while
      resync/recovery is 'frozen'.  Changing it once resync has started is
      not going to be useful anyway.
      
      This allows the array to be resized without a resync by:
        write 'frozen' to 'sync_action'
        write new size to 'component_size' (this will set resync_start)
        write 'none' to 'resync_start'
        write 'idle' to 'sync_action'.
      
      Also slightly improve some tests on recovery_cp when resizing
      raid1/raid5.  Now that an arbitrary value can be set, we should be
      more careful in our tests.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make error_handler functions more uniform and correct. · 6f8d0c77
      NeilBrown authored
      - there is no need to test_bit Faulty, as that was already done in
        md_error which is the only caller of these functions.
      - MD_CHANGE_DEVS should be set *after* faulty is set to ensure
        metadata is updated correctly.
      - spinlock should be held while updating ->degraded.
      Signed-off-by: NeilBrown <neilb@suse.de>
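
      Putting the three points together, a uniform error handler looks roughly
      like this (sketched with raid1-style names of the time; the redundant Faulty
      test is gone, ->degraded is updated under the lock, and MD_CHANGE_DEVS is
      set only after Faulty):

              static void error(mddev_t *mddev, mdk_rdev_t *rdev)
              {
                      conf_t *conf = mddev->private;
                      unsigned long flags;

                      spin_lock_irqsave(&conf->device_lock, flags);
                      if (test_and_clear_bit(In_sync, &rdev->flags))
                              mddev->degraded++;
                      set_bit(Faulty, &rdev->flags);
                      spin_unlock_irqrestore(&conf->device_lock, flags);
                      /* request a metadata update only after Faulty is visible,
                       * so the update cannot record a stale state */
                      set_bit(MD_CHANGE_DEVS, &mddev->flags);
              }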