1. 21 Sep, 2009 11 commits
    • Josef Bacik's avatar
      Btrfs: account for space used by the super mirrors · 1b2da372
      Josef Bacik authored
      
      
      As we get closer to proper -ENOSPC handling in btrfs, we need more accurate
      space accounting for the space info's.  Currently we exclude the free space for
      the super mirrors, but the space they take up isn't accounted for in any of the
      counters.  This patch introduces bytes_super, which keeps track of the amount
      of bytes used for a super mirror in the block group cache and space info.  This
      makes sure that our free space caclucations will be completely accurate.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      1b2da372
    • Josef Bacik's avatar
      Btrfs: fix extent entry threshold calculation · 25891f79
      Josef Bacik authored
      
      
      There is a slight problem with the extent entry threshold calculation for the
      free space cache.  We only adjust the threshold down as we add bitmaps, but
      never actually adjust the threshold up as we add bitmaps.  This means we could
      fragment the free space so badly that we end up using all bitmaps to describe
      the free space, use all the free space which would result in the bitmaps being
      freed, but then go to add free space again as we delete things and immediately
      add bitmaps since the extent threshold would still be 0.  Now as we free
      bitmaps the extent threshold will be ratcheted up to allow more extent entries
      to be added.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      25891f79
    • Josef Bacik's avatar
      Btrfs: remove dead code · f61408b8
      Josef Bacik authored
      
      
      This patch removes a bunch of dead code from the snapshot removal stuff.  It
      was confusing me when doing the metadata ENOSPC stuff so I killed it.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      f61408b8
    • Josef Bacik's avatar
      Btrfs: fix bitmap size tracking · f019f426
      Josef Bacik authored
      
      
      When we first go to add free space, we allocate a new info and set the offset
      and bytes to the space we are adding.  This is fine, except we actually set the
      size of a bitmap as we set the bits in it, so if we add space to a bitmap, we'd
      end up counting the same space twice.  This isn't a huge deal, it just makes
      the allocator behave weirdly since it will think that a bitmap entry has more
      space than it ends up actually having.  I used a BUG_ON() to catch when this
      problem happened, and with this patch I no longer get the BUG_ON().
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      f019f426
    • Josef Bacik's avatar
      Btrfs: don't keep retrying a block group if we fail to allocate a cluster · 0a24325e
      Josef Bacik authored
      
      
      The box can get locked up in the allocator if we happen upon a block group
      under these conditions:
      
      1) During a commit, so caching threads cannot make progress
      2) Our block group currently is in the middle of being cached
      3) Our block group currently has plenty of free space in it
      4) Our block group is so fragmented that it ends up having no free space chunks
      larger than min_bytes calculated by btrfs_find_space_cluster.
      
      What happens is we try and do btrfs_find_space_cluster, which fails because it
      is unable to find enough free space chunks that are large than min_bytes and
      are close enough together.  Since the block group is not cached we do a
      wait_block_group_cache_progress, which waits for the number of bytes we need,
      except the block group already has _plenty_ of free space, its just severely
      fragmented, so we loop and try again, ad infinitum.  This patch keeps us from
      waiting on the block group to finish caching if we failed to find a free space
      cluster before.  It also makes sure that we don't even try to find a free space
      cluster if we are on our last loop in the allocator, since we will have tried
      everything at this point at it is futile.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      0a24325e
    • Josef Bacik's avatar
      Btrfs: make balance code choose more wisely when relocating · ba1bf481
      Josef Bacik authored
      
      
      Currently, we can panic the box if the first block group we go to move is of a
      type where there is no space left to move those extents.  For example, if we
      fill the disk up with data, and then we try to balance and we have no room to
      move the data nor room to allocate new chunks, we will panic.  Change this by
      checking to see if we have room to move this chunk around, and if not, return
      -ENOSPC and move on to the next chunk.  This will make sure we remove block
      groups that are moveable, like if we have alot of empty metadata block groups,
      and then that way we make room to be able to balance our data chunks as well.
      Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
      panics with this patch.
      
      V1->V2:
      -actually search for a free extent on the device to make sure we can allocate a
      chunk if need be.
      
      -fix btrfs_shrink_device to make sure we actually try to relocate all the
      chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
      we don't remove the device with data still on it.
      
      -check to make sure the block group we are going to relocate isn't the last one
      in that particular space
      
      -fix a bug in btrfs_shrink_device where we would change the device's size and
      not fix it if we fail to do our relocate
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      ba1bf481
    • Sage Weil's avatar
      Btrfs: fix arithmetic error in clone ioctl · 1fb58a60
      Sage Weil authored
      
      
      Fix an arithmetic error that was breaking extents cloned via the clone
      ioctl starting in the second half of a file.
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      1fb58a60
    • Yan, Zheng's avatar
      Btrfs: add snapshot/subvolume destroy ioctl · 76dda93c
      Yan, Zheng authored
      
      
      This patch adds snapshot/subvolume destroy ioctl.  A subvolume that isn't being
      used and doesn't contains links to other subvolumes can be destroyed.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      76dda93c
    • Yan, Zheng's avatar
      Btrfs: change how subvolumes are organized · 4df27c4d
      Yan, Zheng authored
      
      
      btrfs allows subvolumes and snapshots anywhere in the directory tree.
      If we snapshot a subvolume that contains a link to other subvolume
      called subvolA, subvolA can be accessed through both the original
      subvolume and the snapshot. This is similar to creating hard link to
      directory, and has the very similar problems.
      
      The aim of this patch is enforcing there is only one access point to
      each subvolume. Only the first directory entry (the one added when
      the subvolume/snapshot was created) is treated as valid access point.
      The first directory entry is distinguished by checking root forward
      reference. If the corresponding root forward reference is missing,
      we know the entry is not the first one.
      
      This patch also adds snapshot/subvolume rename support, the code
      allows rename subvolume link across subvolumes.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      4df27c4d
    • Yan, Zheng's avatar
      Btrfs: do not reuse objectid of deleted snapshot/subvol · 13a8a7c8
      Yan, Zheng authored
      
      
      The new back reference format does not allow reusing objectid of
      deleted snapshot/subvol. So we use ++highest_objectid to allocate
      objectid for new snapshot/subvol.
      
      Now we use ++highest_objectid to allocate objectid for both new inode
      and new snapshot/subvolume, so this patch removes 'find hole' code in
      btrfs_find_free_objectid.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      13a8a7c8
    • Yan, Zheng's avatar
      Btrfs: speed up snapshot dropping · 1c4850e2
      Yan, Zheng authored
      
      
      This patch contains two changes to avoid unnecessary tree block reads during
      snapshot dropping.
      
      First, check tree block's reference count and flags before reading the tree
      block. if reference count > 1 and there is no need to update backrefs, we can
      avoid reading the tree block.
      
      Second, save when snapshot was created in root_key.offset. we can compare block
      pointer's generation with snapshot's creation generation during updating
      backrefs. If a given block was created before snapshot was created, the
      snapshot can't be the tree block's owner. So we can avoid reading the block.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      1c4850e2
  2. 18 Sep, 2009 2 commits
    • Chris Mason's avatar
      Btrfs: search for an allocation hint while filling file COW · b917b7c3
      Chris Mason authored
      
      
      The allocator has some nice knobs for sending hints about where
      to try and allocate new blocks, but when we're doing file allocations
      we're not sending any hint at all.
      
      This commit adds a simple extent map search to see if we can
      quickly and easily find a hint for the allocator.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      b917b7c3
    • Chris Mason's avatar
      Btrfs: properly honor wbc->nr_to_write changes · f85d7d6c
      Chris Mason authored
      
      
      When btrfs fills a delayed allocation, it tries to increase
      the wbc nr_to_write to cover a big part of allocation.  The
      theory is that we're doing contiguous IO and writing a few
      more blocks will save seeks overall at a very low cost.
      
      The problem is that extent_write_cache_pages could ignore
      the new higher nr_to_write if nr_to_write had already gone
      down to zero.  We fix that by rechecking the nr_to_write
      for every page that is processed in the pagevec.
      
      This updates the math around bumping the nr_to_write value
      to make sure we don't leave a tiny amount of IO hanging
      around for the very end of a new extent.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      f85d7d6c
  3. 17 Sep, 2009 1 commit
    • Yan Zheng's avatar
      Btrfs: improve async block group caching · 11833d66
      Yan Zheng authored
      
      
      This patch gets rid of two limitations of async block group caching.
      The old code delays handling pinned extents when block group is in
      caching. To allocate logged file extents, the old code need wait
      until block group is fully cached. To get rid of the limitations,
      This patch introduces a data structure to track the progress of
      caching. Base on the caching progress, we know which extents should
      be added to the free space cache when handling the pinned extents.
      The logged file extents are also handled in a similar way.
      
      This patch also changes how pinned extents are tracked. The old
      code uses one tree to track pinned extents, and copy the pinned
      extents tree at transaction commit time. This patch makes it use
      two trees to track pinned extents. One tree for extents that are
      pinned in the running transaction, one tree for extents that can
      be unpinned. At transaction commit time, we swap the two trees.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      11833d66
  4. 15 Sep, 2009 3 commits
  5. 11 Sep, 2009 17 commits
    • Chris Mason's avatar
    • Chris Mason's avatar
      Btrfs: zero page past end of inline file items · 93c82d57
      Chris Mason authored
      
      
      When btrfs_get_extent is reading inline file items for readpage,
      it needs to copy the inline extent into the page.  If the
      inline extent doesn't cover all of the page, that means there
      is a hole in the file, or that our file is smaller than one
      page.
      
      readpage does zeroing for the case where the file is smaller than one
      page, but nobody is currently zeroing for the case where there is
      a hole after the inline item.
      
      This commit changes btrfs_get_extent to zero fill the page past
      the end of the inline item.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      93c82d57
    • Chris Mason's avatar
      Btrfs: fix btrfs page_mkwrite to return locked page · 50a9b214
      Chris Mason authored
      
      
      This closes a whole where the page may be written before
      the page_mkwrite caller has a chance to dirty it
      
      (thanks to Nick Piggin)
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      50a9b214
    • Chris Mason's avatar
      Btrfs: Fix extent replacment race · a1ed835e
      Chris Mason authored
      
      
      Data COW means that whenever we write to a file, we replace any old
      extent pointers with new ones.  There was a window where a readpage
      might find the old extent pointers on disk and cache them in the
      extent_map tree in ram in the middle of a given write replacing them.
      
      Even though both the readpage and the write had their respective bytes
      in the file locked, the extent readpage inserts may cover more bytes than
      it had locked down.
      
      This commit closes the race by keeping the new extent pinned in the extent
      map tree until after the on-disk btree is properly setup with the new
      extent pointers.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      a1ed835e
    • Chris Mason's avatar
      Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b
      Chris Mason authored
      
      
      Btrfs writes go through delalloc to the data=ordered code.  This
      makes sure that all of the data is on disk before the metadata
      that references it.  The tracking means that we have to make sure
      each page in an extent is fully written before we add that extent into
      the on-disk btree.
      
      This was done in the past by setting the EXTENT_ORDERED bit for the
      range of an extent when it was added to the data=ordered code, and then
      clearing the EXTENT_ORDERED bit in the extent state tree as each page
      finished IO.
      
      One of the reasons we had to do this was because sometimes pages are
      magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
      bit is checked at writepage time, and if it isn't there, our page become
      dirty without going through the proper path.
      
      These bit operations make for a number of rbtree searches for each page,
      and can cause considerable lock contention.
      
      This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
      As pages go into the ordered code, PagePrivate2 is set on each one.
      This is a cheap operation because we already have all the pages locked
      and ready to go.
      
      As IO finishes, the PagePrivate2 bit is cleared and the ordered
      accoutning is updated for each page.
      
      At writepage time, if the PagePrivate2 bit is missing, we go into the
      writepage fixup code to handle improperly dirtied pages.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      8b62b72b
    • Chris Mason's avatar
      Btrfs: use a cached state for extent state operations during delalloc · 9655d298
      Chris Mason authored
      
      
      This changes the btrfs code to find delalloc ranges in the extent state
      tree to use the new state caching code from set/test bit.  It reduces
      one of the biggest causes of rbtree searches in the writeback path.
      
      test_range_bit is also modified to take the cached state as a starting
      point while searching.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      9655d298
    • Chris Mason's avatar
      Btrfs: don't lock bits in the extent tree during writepage · d5550c63
      Chris Mason authored
      
      
      At writepage time, we have the page locked and we have the
      extent_map entry for this extent pinned in the extent_map tree.
      So, the page can't go away and its mapping can't change.
      
      There is no need for the extra extent_state lock bits during writepage.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      d5550c63
    • Chris Mason's avatar
      Btrfs: cache values for locking extents · 2c64c53d
      Chris Mason authored
      
      
      Many of the btrfs extent state tree users follow the same pattern.
      They lock an extent range in the tree, do some operation and then
      unlock.
      
      This translates to at least 2 rbtree searches, and maybe more if they
      are doing operations on the extent state tree.  A locked extent
      in the tree isn't going to be merged or changed, and so we can
      safely return the extent state structure as a cached handle.
      
      This changes set_extent_bit to give back a cached handle, and also
      changes both set_extent_bit and clear_extent_bit to use the cached
      handle if it is available.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      2c64c53d
    • Chris Mason's avatar
      Btrfs: reduce CPU usage in the extent_state tree · 1edbb734
      Chris Mason authored
      
      
      Btrfs is currently mirroring some of the page state bits into
      its extent state tree.  The goal behind this was to use it in supporting
      blocksizes other than the page size.
      
      But, we don't currently support that, and we're using quite a lot of CPU
      on the rb tree and its spin lock.  This commit starts a series of
      cleanups to reduce the amount of work done in the extent state tree as
      part of each IO.
      
      This commit:
      
      * Adds the ability to lock an extent in the state tree and also set
      other bits.  The idea is to do locking and delalloc in one call
      
      * Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits.  Btrfs is using
      a combination of the page bits and the ordered write code for this
      instead.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      1edbb734
    • Chris Mason's avatar
      Btrfs: Fix new state initialization order · e48c465b
      Chris Mason authored
      
      
      As the extent state tree is manipulated, there are call backs
      that are used to take extra actions when different state bits are set
      or cleared.  One example of this is a counter for the total number
      of delayed allocation bytes in a single inode and in the whole FS.
      
      When new states are inserted, this callback is being done before we
      properly setup the new state.  This hasn't caused problems before
      because the lock bit was always done first, and the existing call backs
      don't care about the lock bit.
      
      This patch makes sure the state is properly setup before using the
      callback, which is important for later optimizations that do more work
      without using the lock bit.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      e48c465b
    • Chris Mason's avatar
      Btrfs: switch extent_map to a rw lock · 890871be
      Chris Mason authored
      
      
      There are two main users of the extent_map tree.  The
      first is regular file inodes, where it is evenly spread
      between readers and writers.
      
      The second is the chunk allocation tree, which maps blocks from
      logical addresses to phyiscal ones, and it is 99.99% reads.
      
      The mapping tree is a point of lock contention during heavy IO
      workloads, so this commit switches things to a rw lock.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      890871be
    • Chris Mason's avatar
      Btrfs: tweak congestion backoff · 57fd5a5f
      Chris Mason authored
      
      
      The btrfs io submission thread tries to back off congested devices in
      favor of rotating off to another disk.
      
      But, it tries to make sure it submits at least some IO before rotating
      on (the others may be congested too), and so it has a magic number of
      requests it tries to write before it hops.
      
      This makes the magic number smaller.  Testing shows that we're spending
      too much time on congested devices and leaving the other devices idle.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      57fd5a5f
    • Chris Mason's avatar
      Btrfs: use larger nr_to_write for larger extents · a97adc9f
      Chris Mason authored
      
      
      When btrfs fills a large delayed allocation extent, it is a good idea
      to try and convince the write_cache_pages caller to go ahead and
      write a good chunk of that extent.  The extra IO is basically free
      because we know it is contiguous.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      a97adc9f
    • Chris Mason's avatar
      Btrfs: reduce worker thread spin_lock_irq hold times · 4f878e84
      Chris Mason authored
      
      
      This changes the btrfs worker threads to batch work items
      into a local list.  It allows us to pull work items in
      large chunks and significantly reduces the number of times we
      need to take the worker thread spinlock.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      4f878e84
    • Chris Mason's avatar
      Btrfs: keep irqs on more often in the worker threads · 4e3f9c50
      Chris Mason authored
      
      
      The btrfs worker thread spinlock was being used both for the
      queueing of IO and for the processing of ordered events.
      
      The ordered events never happen from end_io handlers, and so they
      don't need to use the _irq version of spinlocks.  This adds a
      dedicated lock to the ordered lists so they don't have to run
      with irqs off.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      4e3f9c50
    • Chris Mason's avatar
      Btrfs: optimize set extent bit · 40431d6c
      Chris Mason authored
      
      
      The Btrfs set_extent_bit call currently searches the rbtree
      every time it needs to find more extent_state objects to fill
      the requested operation.
      
      This adds a simple test with rb_next to see if the next object
      in the tree was adjacent to the one we just found.  If so,
      we skip the search and just use the next object.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      40431d6c
    • Chris Mason's avatar
      Btrfs: Allow worker threads to exit when idle · 9042846b
      Chris Mason authored
      
      
      The Btrfs worker threads don't currently die off after they have
      been idle for a while, leading to a lot of threads sitting around
      doing nothing for each mount.
      
      Also, they are unable to start atomically (from end_io hanlders).
      
      This commit reworks the worker threads so they can be started
      from end_io handlers (just setting a flag that asks for a thread
      to be added at a later date) and so they can exit if they
      have been idle for a long time.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      9042846b
  6. 09 Sep, 2009 2 commits
    • Linus Torvalds's avatar
      Linux 2.6.31 · 74fca6a4
      Linus Torvalds authored
      74fca6a4
    • Ed Cashin's avatar
      aoe: allocate unused request_queue for sysfs · 7135a71b
      Ed Cashin authored
      Andy Whitcroft reported an oops in aoe triggered by use of an
      incorrectly initialised request_queue object:
      
        [ 2645.959090] kobject '<NULL>' (ffff880059ca22c0): tried to add
      		an uninitialized object, something is seriously wrong.
        [ 2645.959104] Pid: 6, comm: events/0 Not tainted 2.6.31-5-generic #24-Ubuntu
        [ 2645.959107] Call Trace:
        [ 2645.959139] [<ffffffff8126ca2f>] kobject_add+0x5f/0x70
        [ 2645.959151] [<ffffffff8125b4ab>] blk_register_queue+0x8b/0xf0
        [ 2645.959155] [<ffffffff8126043f>] add_disk+0x8f/0x160
        [ 2645.959161] [<ffffffffa01673c4>] aoeblk_gdalloc+0x164/0x1c0 [aoe]
      
      The request queue of an aoe device is not used but can be allocated in
      code that does not sleep.
      
      Bruno bisected this regression down to
      
        cd43e26f
      
        block: Expose stacked device queues in sysfs
      
      "This seems to generate /sys/block/$device/queue and its contents for
       everyone who is using queues, not just for those queues that have a
       non-NULL queue->request_fn."
      
      Addresses http://bugs.launchpad.net/bugs/410198
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13942
      
      
      
      Note that embedding a queue inside another object has always been
      an illegal construct, since the queues are reference counted and
      must persist until the last reference is dropped. So aoe was
      always buggy in this respect (Jens).
      Signed-off-by: default avatarEd Cashin <ecashin@coraid.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Bruno Premont <bonbons@linux-vserver.org>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      7135a71b
  7. 08 Sep, 2009 2 commits
  8. 07 Sep, 2009 2 commits