1. 22 May, 2009 1 commit
  2. 06 May, 2009 4 commits
    • NeilBrown's avatar
      md: remove rd%d links immediately after stopping an array. · c4647292
      NeilBrown authored
      md maintains link in sys/mdXX/md/ to identify which device has
      which role in the array. e.g.
         rd2 -> dev-sda
      
      indicates that the device with role '2' in the array is sda.
      
      These links are only present when the array is active.  They are
      created immediately after ->run is called, and so should be removed
      immediately after ->stop is called.
      However they are currently removed a little bit later, and it is
      possible for ->run to be called again, thus adding these links, before
      they are removed.
      
      So move the removal earlier so they are consistently only present when
      the array is active.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      c4647292
    • NeilBrown's avatar
      md: remove ability to explicit set an inactive array to 'clean'. · 5bf29597
      NeilBrown authored
      Being able to write 'clean' to an 'array_state' of an inactive array
      to activate it in 'clean' mode is both unnecessary and inconvenient.
      
      It is unnecessary because the same can be achieved by writing
      'active'.  This activates and array, but it still remains 'clean'
      until the first write.
      
      It is inconvenient because writing 'clean' is more often used to
      cause an 'active' array to revert to 'clean' mode (thus blocking
      any writes until a 'write-pending' is promoted to 'active').
      
      Allowing 'clean' to both activate an array and mark an active array as
      clean can lead to races:  One program writes 'clean' to mark the
      active array as clean at the same time as another program writes
      'inactive' to deactivate (stop) and active array.  Depending on which
      writes first, the array could be deactivated and immediately
      reactivated which isn't what was desired.
      
      So just disable the use of 'clean' to activate an array.
      
      This avoids a race that can be triggered with mdadm-3.0 and external
      metadata, so it suitable for -stable.
      Reported-by: default avatarRafal Marszewski <rafal.marszewski@intel.com>
      Acked-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      5bf29597
    • Jan Engelhardt's avatar
      md: constify VFTs · 110518bc
      Jan Engelhardt authored
      Signed-off-by: default avatarJan Engelhardt <jengelh@medozas.de>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      110518bc
    • NeilBrown's avatar
      md: tidy up status_resync to handle large arrays. · dd71cf6b
      NeilBrown authored
      Two problems in status_resync.
      1/ It still used Kilobytes as the basic block unit, while most code
         now uses sectors uniformly.
      2/ It doesn't allow for the possibility that max_sectors exceeds
         the range of "unsigned long".
      
      So
       - change "max_blocks" to "max_sectors", and store sector numbers
         in there and in 'resync'
       - Make 'rt' a 'sector_t' so it can temporarily hold the number of
         remaining sectors.
       - use sector_div rather than normal division.
       - change the magic '100' used to preserve precision to '32'.
         + making it a power of 2 makes division easier
         + it doesn't need to be as large as it was chosen when we averaged
           speed over the entire run.  Now we average speed over the last 30
           seconds or so.
      Reported-by: default avatar"Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      dd71cf6b
  3. 16 Apr, 2009 1 commit
    • NeilBrown's avatar
      md: update sync_completed and reshape_position even more often. · c03f6a19
      NeilBrown authored
      There are circumstances when a user-space process might need to
      "oversee" a resync/reshape process.  For example when doing an
      in-place reshape of a raid5, it is prudent to take a backup of each
      section before reshaping it as this is the only way to provide
      safety against an unplanned shutdown (i.e. crash/power failure).
      
      The sync_max sysfs value can be used to stop the resync from
      advancing beyond a particular point.
      So user-space can:
        suspend IO to the first section and back it up
        set 'sync_max' to the end of the section
        wait for 'sync_completed' to reach that point
        resume IO on the first section and move on to the next section.
      
      However this process requires the kernel and user-space to run in
      lock-step which could introduce unnecessary delays.
      
      It would be better if a 'double buffered' approach could be used with
      userspace and kernel space working on different sections with the
      'next' section always ready when the 'current' section is finished.
      
      One problem with implementing this is that sync_completed is only
      guaranteed to be updated when the sync process reaches sync_max.
      (it is updated on a time basis at other times, but it is hard to rely
      on that).  This defeats some of the double buffering.
      
      With this patch, sync_completed (and reshape_position) get updated as
      the current position approaches sync_max, so there is room for
      userspace to advance sync_max early without losing updates.
      
      To be precise, sync_completed is updated when the current sync
      position reaches half way between the current value of sync_completed
      and the value of sync_max.  This will usually be a good time for user
      space to update sync_max.
      
      If sync_max does not get updated, the updates to sync_completed
      (together with associated metadata updates) will occur at an
      exponentially increasing frequency which will get unreasonably fast
      (one update every page) immediately before the process hits sync_max
      and stops.  So the update rate will be unreasonably fast only for an
      insignificant period of time.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      c03f6a19
  4. 14 Apr, 2009 1 commit
    • NeilBrown's avatar
      md: improve usefulness and accuracy of sysfs file md/sync_completed. · acb180b0
      NeilBrown authored
      The sync_completed file reports how much of a resync (or recovery or
      reshape) has been completed.
      However due to the possibility of out-of-order completion of writes,
      it is not certain to be accurate.
      
      We have an internal value - mddev->curr_resync_completed - which is an
      accurate value (though it might not always be quite so uptodate).
      
      So:
       - make curr_resync_completed be uptodate a little more often,
         particularly when raid5 reshape updates status in the metadata
       - report curr_resync_completed in the sysfs file
       - allow poll/select to report all updates to md/sync_completed.
      
      This makes sync_completed completed usable by any external metadata
      handler that wants to record this status information in its metadata.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      acb180b0
  5. 13 Apr, 2009 1 commit
  6. 30 Mar, 2009 20 commits
    • NeilBrown's avatar
      md: don't display meaningless values in sysfs files resync_start and sync_speed · d1a7c503
      NeilBrown authored
      When no resync if happening, both of these files currently have
      meaningless values (is slightly different ways).
      Change them to "none" in that case.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      d1a7c503
    • NeilBrown's avatar
      md: add explicit method to signal the end of a reshape. · cea9c228
      NeilBrown authored
      Currently raid5 (the only module that supports restriping)
      notices that the reshape has finished be sync_request being
      given a large value, and handles any cleanup them.
      
      This patch changes it so md_check_recovery calls into an
      explicit finish_reshape method as well.
      
      The clean-up from sync_request can do things that need to be
      done promptly, typically things local to the raid5_conf_t
      structure.
      
      The "finish_reshape" method is called under the mddev_lock
      so it can do things involving reconfiguring the device.
      
      This allows us to get rid of md_set_array_sectors_locked, which
      would have caused a deadlock if you tried to stop and array
      while a reshape was happening.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      cea9c228
    • Dan Williams's avatar
      md: 'array_size' sysfs attribute · b522adcd
      Dan Williams authored
      Allow userspace to set the size of the array according to the following
      semantics:
      
      1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
         a) If size is set before the array is running, do_md_run will fail
            if size is greater than the default size
         b) A reshape attempt that reduces the default size to less than the set
            array size should be blocked
      2/ once userspace sets the size the kernel will not change it
      3/ writing 'default' to this attribute returns control of the size to the
         kernel and reverts to the size reported by the personality
      
      Also, convert locations that need to know the default size from directly
      reading ->array_sectors to <pers>_size.  Resync/reshape operations
      always follow the default size.
      
      Finally, fixup other locations that read a number of 1k-blocks from
      userspace to use strict_blocks_to_sectors() which checks for unsigned
      long long to sector_t overflow and blocks to sectors overflow.
      Reviewed-by: default avatarAndre Noll <maan@systemlinux.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      b522adcd
    • Dan Williams's avatar
      md: centralize ->array_sectors modifications · 1f403624
      Dan Williams authored
      Get personalities out of the business of directly modifying
      ->array_sectors.  Lays groundwork to introduce policy on when
      ->array_sectors can be modified.
      Reviewed-by: default avatarAndre Noll <maan@systemlinux.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      1f403624
    • NeilBrown's avatar
      md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5. · b3546035
      NeilBrown authored
      2-drive raid5's aren't very interesting.  But if you are converting
      a raid1 into a raid5, you will at least temporarily have one.  And
      that it a good time to set the layout/chunksize for the new RAID5
      if you aren't happy with the defaults.
      
      layout and chunksize don't actually affect the placement of data
      on a 2-drive raid5, so we just do some internal book-keeping.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      b3546035
    • NeilBrown's avatar
      md: add ->takeover method to support changing the personality managing an array · 245f46c2
      NeilBrown authored
      Implement this for RAID6 to be able to 'takeover' a RAID5 array.  The
      new RAID6 will use a layout which places Q on the last device, and
      that device will be missing.
      If there are any available spares, one will immediately have Q
      recovered onto it.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      245f46c2
    • NeilBrown's avatar
      md: enable suspend/resume of md devices. · 409c57f3
      NeilBrown authored
      To be able to change the 'level' of an md/raid array, we need to
      suspend the device so that no requests are active - then move some
      pointers around etc.
      
      The code already keeps counts of active requests and the ->quiesce
      function can be used to wait until those counts hit zero.
      However the quiesce function blocks new requests once they are all
      ready 'inside' the personality module, and that is too late if we want
      to replace the personality modules.
      
      So make all md requests come in through a common md_make_request
      function that keeps track of how many requests have entered the
      modules but may not yet be on the internal reference counts.
      Allow md_make_request to be blocked when we want to suspend the
      device, and make it possible to wait for all those in-transit requests
      to be added to internal lists so that ->quiesce can wait for them.
      
      There is still a problem that when a request completes, we drop the
      ref count inside the personality code so there is a short time between
      when the refcount hits zero, and when the personality code is no
      longer being used.
      The personality code never blocks (schedule or spinlock) between
      dropping the refcount and exiting the routine, so this should be safe
      (as put_module calls synchronize_sched() before unmapping the module
      code).
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      409c57f3
    • NeilBrown's avatar
      md: md_unregister_thread should cope with being passed NULL · e0cf8f04
      NeilBrown authored
      Mostly md_unregister_thread is only called when we know that the
      thread is NULL, but sometimes we need to check first.  It is safer
      to put the check inside md_unregister_thread itself.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      e0cf8f04
    • NeilBrown's avatar
      md: make sure new_level, new_chunksize, new_layout always have sensible values. · 34817e8c
      NeilBrown authored
      When an md array is undergoing a change, we have new_* fields that
      show the new values.
      When no change is happening, it is least confusing if these have
      the same value as the normal fields.
      This is true in most cases, but not when the values are set via sysfs.
      
      So fix this up.
      
      A subsequent patch will BUG_ON if these things aren't consistent.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      34817e8c
    • Andre Noll's avatar
      md: Represent raid device size in sectors. · dd8ac336
      Andre Noll authored
      This patch renames the "size" field of struct mdk_rdev_s to
      "sectors" and changes this field to store sectors instead of
      blocks.
      
      All users of this field, linear.c, raid0.c and md.c, are fixed up
      accordingly which gets rid of many multiplications and divisions.
      Signed-off-by: default avatarAndre Noll <maan@systemlinux.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      dd8ac336
    • Andre Noll's avatar
      md: Make mddev->size sector-based. · 58c0fed4
      Andre Noll authored
      This patch renames the "size" field of struct mddev_s to "dev_sectors"
      and stores the number of 512-byte sectors instead of the number of
      1K-blocks in it.
      
      All users of that field, including raid levels 1,4-6,10, are adjusted
      accordingly. This simplifies the code a bit because it allows to get
      rid of a couple of divisions/multiplications by two.
      
      In order to make checkpatch happy, some minor coding style issues
      have also been addressed. In particular, size_store() now uses
      strict_strtoull() instead of simple_strtoull().
      Signed-off-by: default avatarAndre Noll <maan@systemlinux.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      58c0fed4
    • NeilBrown's avatar
      md: be more consistent about setting WriteMostly flag when adding a drive to an array · 575a80fa
      NeilBrown authored
      When a drive is added to an array using ADD_NEW_DISK, there are two
      places we can get certain flags from:  the metadata on the disk or the
      flags passed through the IOCTL.
      
      For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
      from either of those sources depending on if it is set (i.e. we
      effectively 'or' the two sources together).
      
      This makes it awkward to clear, and is at best inconsistent.
      
      As documented code (in mdadm) requires that setting
      MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the
      inconsistency by always using the value for this flag from the ioctl,
      and ignoring the value on disk.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      575a80fa
    • NeilBrown's avatar
      md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash · 97e4f42d
      NeilBrown authored
      Version 1.x metadata has the ability to record the status of a
      partially completed drive recovery.
      However we only update that record on a clean shutdown.
      It would be nice to update it on unclean shutdowns too, particularly
      when using a bitmap that removes much to the 'sync' effort after an
      unclean shutdown.
      
      One complication with checkpointing recovery is that we only know
      where we are up to in terms of IO requests started, not which ones
      have completed.  And we need to know what has completed to record
      how much is recovered.  So occasionally pause the recovery until all
      submitted requests are completed, then update the record of where
      we are up to.
      
      When we have a bitmap, we already do that pause occasionally to keep
      the bitmap up-to-date.  So enhance that code to record the recovery
      offset and schedule a superblock update.
      And when there is no bitmap, just pause 16 times during the resync to
      do a checkpoint.
      '16' is a fairly arbitrary number.  But we don't really have any good
      way to judge how often is acceptable, and it seems like a reasonable
      number for now.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      97e4f42d
    • NeilBrown's avatar
      md: move md_k.h from include/linux/raid/ to drivers/md/ · 43b2e5d8
      NeilBrown authored
      It really is nicer to keep related code together..
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      43b2e5d8
    • NeilBrown's avatar
      md: move lots of #include lines out of .h files and into .c · bff61975
      NeilBrown authored
      This makes the includes more explicit, and is preparation for moving
      md_k.h to drivers/md/md.h
      
      Remove include/raid/md.h as its only remaining use was to #include
      other files.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      bff61975
    • NeilBrown's avatar
      md: move LEVEL_* definition from md_k.h to md_u.h · 8b2b5c21
      NeilBrown authored
      .. as they are part of the user-space interface.
      Also move MdpMinorShift into there so we can remove duplication.
      
      Lastly move mdp_major in.  It is less obviously part of the user-space
      interface, but do_mounts_md.c uses it, and it is acting a bit like
      user-space.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      8b2b5c21
    • Christoph Hellwig's avatar
      md: move headers out of include/linux/raid/ · ef740c37
      Christoph Hellwig authored
      Move the headers with the local structures for the disciplines and
      bitmap.h into drivers/md/ so that they are more easily grepable for
      hacking and not far away.  md.h is left where it is for now as there
      are some uses from the outside.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      ef740c37
    • Christoph Hellwig's avatar
      md: stop defining MAJOR_NR · 3dbd8c2e
      Christoph Hellwig authored
      MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier
      kernels, so no need to keep it around.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      3dbd8c2e
    • Martin K. Petersen's avatar
      MD data integrity support · 3f9d99c1
      Martin K. Petersen authored
      md: Add support for data integrity to MD
      
      If all subdevices support the same protection format the MD device is
      flagged as integrity capable.
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      3f9d99c1
    • NeilBrown's avatar
      md: Fix is_mddev_idle test (again). · eea1bf38
      NeilBrown authored
      There are two problems with is_mddev_idle.
      
      1/ sync_io is 'atomic_t' and hence 'int'.  curr_events and all the
         rest are 'long'.
         So if sync_io were to wrap on a 64bit host, the value of
         curr_events would go very negative suddenly, and take a very
         long time to return to positive.
      
         So do all calculations as 'int'.  That gives us plenty of precision
         for what we need.
      
      2/ To initialise rdev->last_events we simply call is_mddev_idle, on
         the assumption that it will make sure that last_events is in a
         suitable range.  It used to do this, but now it does not.
         So now we need to be more explicit about initialisation.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      eea1bf38
  7. 04 Mar, 2009 1 commit
    • Dan Williams's avatar
      md: fix deadlock when stopping arrays · 5fd3a17e
      Dan Williams authored
      Resolve a deadlock when stopping redundant arrays, i.e. ones that
      require a call to sysfs_remove_group when shutdown.  The deadlock is
      summarized below:
      
      Thread1                Thread2
      -------                -------
      read sysfs attribute   stop array
                             take mddev lock
                             sysfs_remove_group
      sysfs_get_active
      wait for mddev lock
                             wait for active
      
      Sysrq-w:
      --------
      mdmon         S 00000017  2212  4163      1
        f1982ea8 00000046 2dcf6b85 00000017 c0b23100 f2f83ed0 c0b23100 f2f8413c
        c0b23100 c0b23100 c0b1fb98 f2f8413c 00000000 f2f8413c c0b23100 f2291ecc
        00000002 c0b23100 00000000 00000017 f2f83ed0 f1982eac 00000046 c044d9dd
      Call Trace:
        [<c044d9dd>] ? debug_mutex_add_waiter+0x1d/0x58
        [<c06ef451>] __mutex_lock_common+0x1d9/0x338
        [<c06ef451>] ? __mutex_lock_common+0x1d9/0x338
        [<c06ef5e3>] mutex_lock_interruptible_nested+0x33/0x3a
        [<c0634553>] ? mddev_lock+0x14/0x16
        [<c0634553>] mddev_lock+0x14/0x16
        [<c0634eda>] md_attr_show+0x2a/0x49
        [<c04e9997>] sysfs_read_file+0x93/0xf9
      mdadm         D 00000017  2812  4177      1
        f0401d78 00000046 430456f8 00000017 f0401d58 f0401d20 c0b23100 f2da2c4c
        c0b23100 c0b23100 c0b1fb98 f2da2c4c 0a10fc36 00000000 c0b23100 f0401d70
        00000003 c0b23100 00000000 00000017 f2da29e0 00000001 00000002 00000000
      Call Trace:
        [<c06eed1b>] schedule_timeout+0x1b/0x95
        [<c06eed1b>] ? schedule_timeout+0x1b/0x95
        [<c06eeb97>] ? wait_for_common+0x34/0xdc
        [<c044fa8a>] ? trace_hardirqs_on_caller+0x18/0x145
        [<c044fbc2>] ? trace_hardirqs_on+0xb/0xd
        [<c06eec03>] wait_for_common+0xa0/0xdc
        [<c0428c7c>] ? default_wake_function+0x0/0x12
        [<c06eeccc>] wait_for_completion+0x17/0x19
        [<c04ea620>] sysfs_addrm_finish+0x19f/0x1d1
        [<c04e920e>] sysfs_hash_and_remove+0x42/0x55
        [<c04eb4db>] sysfs_remove_group+0x57/0x86
        [<c0638086>] do_md_stop+0x13a/0x499
      
      This has been there for a while, but is easier to trigger now that mdmon
      is closely watching sysfs.
      
      Cc: <stable@kernel.org>
      Reported-by: default avatarJacek Danecki <jacek.danecki@intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      5fd3a17e
  8. 18 Feb, 2009 1 commit
  9. 06 Feb, 2009 1 commit
    • NeilBrown's avatar
      md: Ensure an md array never has too many devices. · de01dfad
      NeilBrown authored
      Each different metadata format supported by md supports a
      different maximum number of devices.
      We really should be enforcing this maximum in the kernel, but
      we aren't quite doing that properly.
      
      We currently only enforce it at the 'hot_add' point, which is an
      older interface which is not used by current userspace.
      
      We need to also enforce it at 'add_new_disk' time for active arrays
      and at 'do_md_run' time when starting a new array.
      
      So move the test from 'hot_add' into 'bind_rdev_to_array' which is
      called from both 'hot_add' and 'add_new_disk, and add a new
      test in 'analyse_sbs' which is called from 'do_md_run'.
      
      This bug (or missing feature) has been around "forever" and so
      the patch is suitable for any -stable that is currently maintained.
      
      Cc: stable@kernel.org
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      de01dfad
  10. 08 Jan, 2009 8 commits
    • NeilBrown's avatar
      md: don't retry recovery of raid1 that fails due to error on source drive. · 4044ba58
      NeilBrown authored
      If a raid1 has only one working drive and it has a sector which
      gives an error on read, then an attempt to recover onto a spare will
      fail, but as the single remaining drive is not removed from the
      array, the recovery will be immediately re-attempted, resulting
      in an infinite recovery loop.
      
      So detect this situation and don't retry recovery once an error
      on the lone remaining drive is detected.
      
      Allow recovery to be retried once every time a spare is added
      in case the problem wasn't actually a media error.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      4044ba58
    • NeilBrown's avatar
      md: Allow md devices to be created by name. · efeb53c0
      NeilBrown authored
      Using sequential numbers to identify md devices is somewhat artificial.
      Using names can be a lot more user-friendly.
      
      Also, creating md devices by opening the device special file is a bit
      awkward.
      
      So this patch provides a new option for creating and naming devices.
      
      Writing a name such as "md_home" to
          /sys/modules/md_mod/parameters/new_array
      will cause an array with that name to be created.  It will appear in
      /sys/block/ /proc/partitions and /proc/mdstat as 'md_home'.
      It will have an arbitrary minor number allocated.
      
      md devices that a created by an open are destroyed on the last
      close when the device is inactive.
      For named md devices, they will not be destroyed until the array
      is explicitly stopped, either with the STOP_ARRAY ioctl or by
      writing 'clear' to /sys/block/md_XXXX/md/array_state.
      
      The name of the array must start 'md_' to avoid conflict with
      other devices.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      efeb53c0
    • NeilBrown's avatar
      md: make devices disappear when they are no longer needed. · d3374825
      NeilBrown authored
      Currently md devices, once created, never disappear until the module
      is unloaded.  This is essentially because the gendisk holds a
      reference to the mddev, and the mddev holds a reference to the
      gendisk, this a circular reference.
      
      If we drop the reference from mddev to gendisk, then we need to ensure
      that the mddev is destroyed when the gendisk is destroyed.  However it
      is not possible to hook into the gendisk destruction process to enable
      this.
      
      So we drop the reference from the gendisk to the mddev and destroy the
      gendisk when the mddev gets destroyed.  However this has a
      complication.
      Between the call
         __blkdev_get->get_gendisk->kobj_lookup->md_probe
      and the call
         __blkdev_get->md_open
      
      there is no obvious way to hold a reference on the mddev any more, so
      unless something is done, it will disappear and gendisk will be
      destroyed prematurely.
      
      Also, once we decide to destroy the mddev, there will be an unlockable
      moment before the gendisk is unlinked (blk_unregister_region) during
      which a new reference to the gendisk can be created.  We need to
      ensure that this reference can not be used.  i.e. the ->open must
      fail.
      
      So:
       1/  in md_probe we set a flag in the mddev (hold_active) which
           indicates that the array should be treated as active, even
           though there are no references, and no appearance of activity.
           This is cleared by md_release when the device is closed if it
           is no longer needed.
           This ensures that the gendisk will survive between md_probe and
           md_open.
      
       2/  In md_open we check if the mddev we expect to open matches
           the gendisk that we did open.
           If there is a mismatch we return -ERESTARTSYS and modify
           __blkdev_get to retry from the top in that case.
           In the -ERESTARTSYS sys case we make sure to wait until
           the old gendisk (that we succeeded in opening) is really gone so
           we loop at most once.
      
      Some udev configurations will always open an md device when it first
      appears.   If we allow an md device that was just created by an open
      to disappear on an immediate close, then this can race with such udev
      configurations and result in an infinite loop the device being opened
      and closed, then re-open due to the 'ADD' even from the first open,
      and then close and so on.
      So we make sure an md device, once created by an open, remains active
      at least until some md 'ioctl' has been made on it.  This means that
      all normal usage of md devices will allow them to disappear promptly
      when not needed, but the worst that an incorrect usage will do it
      cause an inactive md device to be left in existence (it can easily be
      removed).
      
      As an array can be stopped by writing to a sysfs attribute
        echo clear > /sys/block/mdXXX/md/array_state
      we need to use scheduled work for deleting the gendisk and other
      kobjects.  This allows us to wait for any pending gendisk deletion to
      complete by simply calling flush_scheduled_work().
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      d3374825
    • NeilBrown's avatar
      md: centralise all freeing of an 'mddev' in 'md_free' · a21d1504
      NeilBrown authored
      md_free is the .release handler for the md kobj_type.
      So it makes sense to release all the objects referenced by
      the mddev in there, rather than just prior to calling kobject_put
      for what we think is the last time.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      a21d1504
    • NeilBrown's avatar
      md: move allocation of ->queue from mddev_find to md_probe · 8b765398
      NeilBrown authored
      It is more balanced to just do simple initialisation in mddev_find,
      which allocates and links a new md device, and leave all the
      more sophisticated allocation to md_probe (which calls mddev_find).
      md_probe already allocated the gendisk.  It should allocate the
      queue too.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      8b765398
    • Cheng Renquan's avatar
      md: need another print_sb for mdp_superblock_1 · cd2ac932
      Cheng Renquan authored
      md_print_devices is called in two code path: MD_BUG(...), and md_ioctl
      with PRINT_RAID_DEBUG.  it will dump out all in use md devices
      information;
      
      However, it wrongly processed two types of superblock in one:
      
      The header file <linux/raid/md_p.h> has defined two types of superblock,
      struct mdp_superblock_s (typedefed with mdp_super_t) according to md with
      metadata 0.90, and struct mdp_superblock_1 according to md with metadata
      1.0 and later,
      
      These two types of superblock are very different,
      
      The md_print_devices code processed them both in mdp_super_t, that would
      lead to wrong informaton dump like:
      
      	[ 6742.345877]
      	[ 6742.345887] md:	**********************************
      	[ 6742.345890] md:	* <COMPLETE RAID STATE PRINTOUT> *
      	[ 6742.345892] md:	**********************************
      	[ 6742.345896] md1: <ram7><ram6><ram5><ram4>
      	[ 6742.345907] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
      	[ 6742.345909] md: rdev superblock:
      	[ 6742.345914] md:  SB: (V:0.90.0) ID:<42ef13c7.598c059a.5f9f1645.801e9ee6> CT:4919856d
      	[ 6742.345918] md:     L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
      	[ 6742.345922] md:     UT:4919856d ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:b7992907 E:00000001
      	[ 6742.345924]      D  0:  DISK<N:0,(1,8),R:0,S:6>
      	[ 6742.345930]      D  1:  DISK<N:1,(1,10),R:1,S:6>
      	[ 6742.345933]      D  2:  DISK<N:2,(1,12),R:2,S:6>
      	[ 6742.345937]      D  3:  DISK<N:3,(1,14),R:3,S:6>
      	[ 6742.345942] md:     THIS:  DISK<N:3,(1,14),R:3,S:6>
      	...
      	[ 6742.346058] md0: <ram3><ram2><ram1><ram0>
      	[ 6742.346067] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
      	[ 6742.346070] md: rdev superblock:
      	[ 6742.346073] md:  SB: (V:1.0.0) ID:<369aad81.00000000.00000000.00000000> CT:9a322a9c
      	[ 6742.346077] md:     L-1507699579 S976570180 ND:48 RD:0 md0 LO:65536 CS:196610
      	[ 6742.346081] md:     UT:00000018 ST:0 AD:131048 WD:0 FD:8 SD:0 CSUM:00000000 E:00000000
      	[ 6742.346084]      D  0:  DISK<N:-1,(-1,-1),R:-1,S:-1>
      	[ 6742.346089]      D  1:  DISK<N:-1,(-1,-1),R:-1,S:-1>
      	[ 6742.346092]      D  2:  DISK<N:-1,(-1,-1),R:-1,S:-1>
      	[ 6742.346096]      D  3:  DISK<N:-1,(-1,-1),R:-1,S:-1>
      	[ 6742.346102] md:     THIS:  DISK<N:0,(0,0),R:0,S:0>
      	...
      	[ 6742.346219] md:	**********************************
      	[ 6742.346221]
      
      Here md1 is metadata 0.90.0, and md0 is metadata 1.2
      
      After some more code to distinguish these two types of superblock, in this patch,
      
      it will generate dump information like:
      
      	[ 7906.755790]
      	[ 7906.755799] md:	**********************************
      	[ 7906.755802] md:	* <COMPLETE RAID STATE PRINTOUT> *
      	[ 7906.755804] md:	**********************************
      	[ 7906.755808] md1: <ram7><ram6><ram5><ram4>
      	[ 7906.755819] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
      	[ 7906.755821] md: rdev superblock (MJ:0):
      	[ 7906.755826] md:  SB: (V:0.90.0) ID:<3fca7a0d.a612bfed.5f9f1645.801e9ee6> CT:491989f3
      	[ 7906.755830] md:     L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
      	[ 7906.755834] md:     UT:491989f3 ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:00fb52ad E:00000001
      	[ 7906.755836]      D  0:  DISK<N:0,(1,8),R:0,S:6>
      	[ 7906.755842]      D  1:  DISK<N:1,(1,10),R:1,S:6>
      	[ 7906.755845]      D  2:  DISK<N:2,(1,12),R:2,S:6>
      	[ 7906.755849]      D  3:  DISK<N:3,(1,14),R:3,S:6>
      	[ 7906.755855] md:     THIS:  DISK<N:3,(1,14),R:3,S:6>
      	...
      	[ 7906.755972] md0: <ram3><ram2><ram1><ram0>
      	[ 7906.755981] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
      	[ 7906.755984] md: rdev superblock (MJ:1):
      	[ 7906.755989] md:  SB: (V:1) (F:0) Array-ID:<5fbcf158:55aa:5fbe:9a79:1e939880dcbd>
      	[ 7906.755990] md:    Name: "DG5:0" CT:1226410480
      	[ 7906.755998] md:       L5 SZ130944 RD:4 LO:2 CS:128 DO:24 DS:131048 SO:8 RO:0
      	[ 7906.755999] md:     Dev:00000003 UUID: 9194d744:87f7:a448:85f2:7497b84ce30a
      	[ 7906.756001] md:       (F:0) UT:1226410480 Events:0 ResyncOffset:-1 CSUM:0dbcd829
      	[ 7906.756003] md:         (MaxDev:384)
      	...
      	[ 7906.756113] md:	**********************************
      	[ 7906.756116]
      
      this md0 (metadata 1.2) information dumping is exactly according to struct
      mdp_superblock_1.
      Signed-off-by: default avatarCheng Renquan <crquan@gmail.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Dan Williams <dan.j.williams@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      cd2ac932
    • Cheng Renquan's avatar
      md: use list_for_each_entry macro directly · 159ec1fc
      Cheng Renquan authored
      The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
      list_for_each_entry_safe, from <linux/list.h>, it should be defined to
      use list_for_each_entry_safe, instead of reinventing the wheel.
      
      But some calls to each_entry_safe don't really need a safe version,
      just a direct list_for_each_entry is enough, this could save a temp
      variable (tmp) in every function that used rdev_for_each.
      
      In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
      totally save many tmp vars; and only in the other situations that will call
      list_del to delete an entry, the safe version is used.
      Signed-off-by: default avatarCheng Renquan <crquan@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      159ec1fc
    • NeilBrown's avatar
      md: use sysfs_notify_dirent to notify changes to md/sync_action. · 0c3573f1
      NeilBrown authored
      There is no compelling need for this, but sysfs_notify_dirent is a
      nicer interface and the change is good for consistency.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      0c3573f1
  11. 05 Nov, 2008 1 commit
    • NeilBrown's avatar
      md: revert the recent addition of a call to the BLKRRPART ioctl. · cb3ac42b
      NeilBrown authored
      It turns out that it is only safe to call blkdev_ioctl when the device
      is actually open (as ->bd_disk is set to NULL on last close).  And it
      is quite possible for do_md_stop to be called when the device is not
      open.  So discard the call to blkdev_ioctl(BLKRRPART) which was
      added in
         commit 934d9c23
      
      It is just as easy to call this ioctl from userspace when needed (on
      mdadm -S) so leave it out of the kernel
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      cb3ac42b