1. 24 Jul, 2009 5 commits
    • Yan Zheng's avatar
      Btrfs: Fix ordering of key field checks in btrfs_previous_item · 0a4eefbb
      Yan Zheng authored
      Check objectid of item before checking the item type, otherwise we may return
      zero for a key that is actually too low.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Yan Zheng's avatar
      Btrfs: find_free_dev_extent doesn't handle holes at the start of the device · 1fcbac58
      Yan Zheng authored
      find_free_dev_extent does not properly handle the case where
      the device is not complete free, and there is a free extent
      at the beginning of the device.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Diego Calleja's avatar
      Btrfs: Remove code duplication in comp_keys · 20736aba
      Diego Calleja authored
      comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just
      call it.
      Signed-off-by: default avatarDiego Calleja <diegocg@gmail.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Josef Bacik's avatar
      Btrfs: async block group caching · 817d52f8
      Josef Bacik authored
      This patch moves the caching of the block group off to a kthread in order to
      allow people to allocate sooner.  Instead of blocking up behind the caching
      mutex, we instead kick of the caching kthread, and then attempt to make an
      allocation.  If we cannot, we wait on the block groups caching waitqueue, which
      the caching kthread will wake the waiting threads up everytime it finds 2 meg
      worth of space, and then again when its finished caching.  This is how I tested
      the speedup from this
      mkfs the disk
      mount the disk
      fill the disk up with fs_mark
      unmount the disk
      mount the disk
      time touch /mnt/foo
      Without my changes this took 11 seconds on my box, with these changes it now
      takes 1 second.
      Another change thats been put in place is we lock the super mirror's in the
      pinned extent map in order to keep us from adding that stuff as free space when
      caching the block group.  This doesn't really change anything else as far as the
      pinned extent map is concerned, since for actual pinned extents we use
      EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
      those extents to keep from leaking memory.
      I've also added a check where when we are reading block groups from disk, if the
      amount of space used == the size of the block group, we go ahead and mark the
      block group as cached.  This drastically reduces the amount of time it takes to
      cache the block groups.  Using the same test as above, except doing a dd to a
      file and then unmounting, it used to take 33 seconds to umount, now it takes 3
      This version uses the commit_root in the caching kthread, and then keeps track
      of how many async caching threads are running at any given time so if one of the
      async threads is still running as we cross transactions we can wait until its
      finished before handling the pinned extents.  Thank you,
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Josef Bacik's avatar
      Btrfs: use hybrid extents+bitmap rb tree for free space · 96303081
      Josef Bacik authored
      Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
      tracking free space.  As free space gets fragmented, we end up with thousands of
      entries on an rb-tree per block group, which usually spans 1 gig of area.  Since
      we currently don't ever flush free space cache back to disk this gets to be a
      bit unweildly on large fs's with lots of fragmentation.
      This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
      space cache.  Initially we calculate a threshold of extent entries we can
      handle, which is however many extent entries we can cram into 16k of ram.  The
      maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
      will be 32k of RAM, which scales much better than we did before.
      Once we pass the extent threshold, we start adding bitmaps and using those
      instead for tracking the free space.  This patch also makes it so that any free
      space thats less than 4 * sectorsize we go ahead and put into a bitmap.  This is
      nice since we try and allocate out of the front of a block group, so if the
      front of a block group is heavily fragmented and then has a huge chunk of free
      space at the end, we go ahead and add the fragmented areas to bitmaps and use a
      normal extent entry to track the big chunk at the back of the block group.
      I've also taken the opportunity to revamp how we search for free space.
      Previously we indexed free space via an offset indexed rb tree and a bytes
      indexed rb tree.  I've dropped the bytes indexed rb tree and use only the offset
      indexed rb tree.  This cuts the number of tree operations we were doing
      previously down by half, and gives us a little bit of a better allocation
      pattern since we will always start from a specific offset and search forward
      from there, instead of searching for the size we need and try and get it as
      close as possible to the offset we want.
      I've given this a healthy amount of testing pre-new format stuff, as well as
      post-new format stuff.  I've booted up my fedora box which is installed on btrfs
      with this patch and ran with it for a few days without issues.  I've not seen
      any performance regressions in any of my tests.
      Since the last patch Yan Zheng fixed a problem where we could have overlapping
      entries, so updating their offset inline would cause problems.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
  2. 22 Jul, 2009 10 commits
  3. 02 Jul, 2009 7 commits
  4. 15 Jun, 2009 1 commit
  5. 11 Jun, 2009 4 commits
  6. 10 Jun, 2009 13 commits
    • Shin Hong's avatar
      Btrfs: init worker struct fields before kthread-run · fd0fb038
      Shin Hong authored
      This patch fixes a bug which may result race condition
      between btrfs_start_workers() and worker_loop().
      btrfs_start_workers() executed in a parent thread writes
      on workers->worker and worker_loop() in a child thread
      reads workers->worker. However, there is no synchronization
      enforcing the order of two operations.
      This patch makes btrfs_start_workers() fill workers->worker
      before it starts a child thread with worker_loop()
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Hisashi Hifumi's avatar
      Btrfs: pin buffers during write_dev_supers · 4eedeb75
      Hisashi Hifumi authored
      write_dev_supers is called in sequence.  First is it called with wait == 0,
      which starts IO on all of the super blocks for a given device.  Then it is
      called with wait == 1 to make sure they all reach the disk.
      It doesn't currently pin the buffers between the two calls, and it also
      assumes the buffers won't go away between the two calls, leading to
      an oops if the VM manages to free the buffers in the middle of the sync.
      This fixes that assumption and updates the code to return an error if things
      are not up to date when the wait == 1 run is done.
      Signed-off-by: default avatarHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Chris Mason's avatar
      Btrfs: avoid races between super writeout and device list updates · e5e9a520
      Chris Mason authored
      On multi-device filesystems, btrfs writes supers to all of the devices
      before considering a sync complete.  There wasn't any additional
      locking between super writeout and the device list management code
      because device management was done inside a transaction and
      super writeout only happened  with no transation writers running.
      With the btrfs fsync log and other async transaction updates, this
      has been racey for some time.  This adds a mutex to protect
      the device list.  The existing volume mutex could not be reused due to
      transaction lock ordering requirements.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Al Viro's avatar
      Fix btrfs when ACLs are configured out · 7df336ec
      Al Viro authored
      ... otherwise generic_permission() will allow *anything* for all
      files you don't own and that have some group permissions.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Hisashi Hifumi's avatar
      Btrfs: fdatasync should skip metadata writeout · 524724ed
      Hisashi Hifumi authored
      In btrfs, fdatasync and fsync are identical, but
      fdatasync should skip committing transaction when
      inode->i_state is set just I_DIRTY_SYNC and this indicates
      only atime or/and mtime updates.
      Following patch improves fdatasync throughput.
      --file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
      --file-fsync-mode=fdatasync run
      Test execution summary:
          total time:                          1980.6540s
          total number of events:              10001
          total time taken by event execution: 1192.9804
          per-request statistics:
               min:                            0.0000s
               avg:                            0.1193s
               max:                            15.3720s
               approx.  95 percentile:         0.7257s
      Threads fairness:
          events (avg/stddev):           625.0625/151.32
          execution time (avg/stddev):   74.5613/9.46
      Test execution summary:
          total time:                          1695.9118s
          total number of events:              10000
          total time taken by event execution: 871.3214
          per-request statistics:
               min:                            0.0000s
               avg:                            0.0871s
               max:                            10.4644s
               approx.  95 percentile:         0.4787s
      Threads fairness:
          events (avg/stddev):           625.0000/131.86
          execution time (avg/stddev):   54.4576/8.98
      Signed-off-by: default avatarHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • David Woodhouse's avatar
      Btrfs: remove crc32c.h and use libcrc32c directly. · 163e783e
      David Woodhouse authored
      There's no need to preserve this abstraction; it used to let us use
      hardware crc32c support directly, but libcrc32c is already doing that for us
      through the crypto API -- so we're already using the Intel crc32c
      acceleration where appropriate.
      Signed-off-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Christoph Hellwig's avatar
      Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION · 6cbff00f
      Christoph Hellwig authored
      Add support for the standard attributes set via chattr and read via
      lsattr.  Currently we store the attributes in the flags value in
      the btrfs inode, but I wonder whether we should split it into two so
      that we don't have to keep converting between the two formats.
      Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
      as they were confusing the existing code and got in the way of the
      new additions.
      Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Chris Mason's avatar
      Btrfs: autodetect SSD devices · c289811c
      Chris Mason authored
      During mount, btrfs will check the queue nonrot flag
      for all the devices found in the FS.  If they are all
      non-rotating, SSD mode is enabled by default.
      If the FS was mounted with -o nossd, the non-rotating
      flag is ignored.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Chris Mason's avatar
      Btrfs: add mount -o ssd_spread to spread allocations out · 451d7585
      Chris Mason authored
      Some SSDs perform best when reusing block numbers often, while
      others perform much better when clustering strictly allocates
      big chunks of unused space.
      The default mount -o ssd will find rough groupings of blocks
      where there are a bunch of free blocks that might have some
      allocated blocks mixed in.
      mount -o ssd_spread will make sure there are no allocated blocks
      mixed in.  It should perform better on lower end SSDs.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Chris Mason's avatar
      Btrfs: avoid allocation clusters that are too spread out · c6044801
      Chris Mason authored
      In SSD mode for data, and all the time for metadata the allocator
      will try to find a cluster of nearby blocks for allocations.  This
      commit adds extra checks to make sure that each free block in the
      cluster is close to the last one.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Chris Mason's avatar
      Btrfs: Add mount -o nossd · 3b30c22f
      Chris Mason authored
      This allows you to turn off the ssd mode via remount.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Chris Mason's avatar
      Btrfs: avoid IO stalls behind congested devices in a multi-device FS · d644d8a1
      Chris Mason authored
      The btrfs IO submission threads try to service a bunch of devices with a small
      number of threads.  They do a congestion check to try and avoid waiting
      on requests for a busy device.
      The checks make sure we've sent a few requests down to a given device just so
      that we aren't bouncing between busy devices without actually sending down
      any IO.  The counter used to decide if we can switch to the next device
      is somewhat overloaded.  It is also being used to decide if we've done
      a good batch of requests between the WRITE_SYNC or regular priority lists.
      It may get reset to zero often, leaving us hammering on a busy device
      instead of moving on to another disk.
      This commit adds a new counter for the number of bios sent while
      servicing a device.  It doesn't get reset or fiddled with.  On
      multi-device filesystems, this fixes IO stalls in streaming
      write workloads.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
    • Chris Mason's avatar
      Btrfs: don't allow WRITE_SYNC bios to starve out regular writes · d84275c9
      Chris Mason authored
      Btrfs uses dedicated threads to submit bios when checksumming is on,
      which allows us to make sure the threads dedicated to checksumming don't get
      stuck waiting for requests.  For each btrfs device, there are
      two lists of bios.  One list is for WRITE_SYNC bios and the other
      is for regular priority bios.
      The IO submission threads used to process all of the WRITE_SYNC bios first and
      then switch to the regular bios.  This commit makes sure we don't completely
      starve the regular bios by rotating between the two lists.
      WRITE_SYNC bios are still favored 2:1 over the regular bios, and this tries
      to run in batches to avoid seeking.  Benchmarking shows this eliminates
      stalls during streaming buffered writes on both multi-device and
      single device filesystems.
      If the regular bios starve, the system can end up with a large amount of ram
      pinned down in writeback pages.  If we are a little more fair between the two
      classes, we're able to keep throughput up and make progress on the bulk of
      our dirty ram.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>