1. 21 Sep, 2009 2 commits
    • Btrfs: do not reuse objectid of deleted snapshot/subvol · 13a8a7c8
      Yan, Zheng authored
      
      
      The new back reference format does not allow reusing objectid of
      deleted snapshot/subvol. So we use ++highest_objectid to allocate
      objectid for new snapshot/subvol.
      
      Now we use ++highest_objectid to allocate objectid for both new inode
      and new snapshot/subvolume, so this patch removes 'find hole' code in
      btrfs_find_free_objectid.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
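The allocation scheme described in this commit can be sketched as a monotonic counter. This is a minimal illustration, not the kernel code: `demo_root`, `demo_alloc_objectid` and `DEMO_LAST_FREE_OBJECTID` are hypothetical names, and the upper limit is an assumed value.

```c
#include <stdint.h>

/* Assumed upper bound for usable objectids (hypothetical value). */
#define DEMO_LAST_FREE_OBJECTID (UINT64_MAX - 255)

/* Hypothetical, simplified stand-in for a tree root. */
struct demo_root {
    uint64_t highest_objectid; /* largest objectid handed out so far */
};

/*
 * Allocate the next objectid by bumping the high-water mark.
 * Ids of deleted objects are never searched for or reused.
 * Returns 0 on success, -1 when the id space is exhausted.
 */
int demo_alloc_objectid(struct demo_root *root, uint64_t *objectid)
{
    if (root->highest_objectid >= DEMO_LAST_FREE_OBJECTID)
        return -1;
    *objectid = ++root->highest_objectid;
    return 0;
}
```

Because the counter only moves forward, an id freed by deleting a snapshot or subvolume is simply left unused, which is the property the new back reference format needs.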
    • Btrfs: speed up snapshot dropping · 1c4850e2
      Yan, Zheng authored
      
      
      This patch contains two changes to avoid unnecessary tree block reads during
      snapshot dropping.
      
      First, check the tree block's reference count and flags before reading the
      tree block. If the reference count is > 1 and there is no need to update
      backrefs, we can avoid reading the tree block.
      
      Second, save when the snapshot was created in root_key.offset. We can compare
      the block pointer's generation with the snapshot's creation generation while
      updating backrefs. If a given block was created before the snapshot was
      created, the snapshot can't be the tree block's owner, so we can avoid
      reading the block.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
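The two checks in this commit amount to a decision made before any disk read. The sketch below is a simplified model under stated assumptions: `demo_block_info` and `demo_must_read_block` are hypothetical names, and treating a block whose generation equals the snapshot's creation generation as "not older" is an assumption about the boundary case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of what is known about a child block before reading it. */
struct demo_block_info {
    uint64_t ptr_generation; /* generation stored in the parent's block pointer */
    uint64_t refs;           /* reference count looked up without reading the block */
};

/*
 * Decide whether the tree block has to be read from disk.
 * need_backref_update: true while the walk is updating back references.
 * snapshot_gen: generation at which the snapshot was created.
 */
bool demo_must_read_block(const struct demo_block_info *blk,
                          bool need_backref_update,
                          uint64_t snapshot_gen)
{
    if (!need_backref_update) {
        /* Shared block, nothing to update: dropping our reference is enough. */
        return blk->refs <= 1;
    }
    /* A block created before the snapshot cannot be owned by the snapshot. */
    if (blk->ptr_generation < snapshot_gen)
        return false;
    return true;
}
```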
  2. 18 Sep, 2009 2 commits
    • Btrfs: search for an allocation hint while filling file COW · b917b7c3
      Chris Mason authored
      
      
      The allocator has some nice knobs for sending hints about where
      to try and allocate new blocks, but when we're doing file allocations
      we're not sending any hint at all.
      
      This commit adds a simple extent map search to see if we can
      quickly and easily find a hint for the allocator.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: properly honor wbc->nr_to_write changes · f85d7d6c
      Chris Mason authored
      
      
      When btrfs fills a delayed allocation, it tries to increase
      the wbc nr_to_write to cover a big part of allocation.  The
      theory is that we're doing contiguous IO and writing a few
      more blocks will save seeks overall at a very low cost.
      
      The problem is that extent_write_cache_pages could ignore
      the new higher nr_to_write if nr_to_write had already gone
      down to zero.  We fix that by rechecking the nr_to_write
      for every page that is processed in the pagevec.
      
      This updates the math around bumping the nr_to_write value
      to make sure we don't leave a tiny amount of IO hanging
      around for the very end of a new extent.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
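The per-page recheck can be illustrated with a small model of the write loop. This is a sketch, not the kernel's extent_write_cache_pages: `demo_wbc`, `demo_write_pagevec` and `demo_writepage_bump_first` are hypothetical, and the order "charge the page, then write it" is an assumption chosen to show the budget touching zero before it is raised.

```c
/* Hypothetical stand-in for struct writeback_control. */
struct demo_wbc {
    long nr_to_write; /* remaining page budget */
};

typedef void (*demo_writepage_fn)(struct demo_wbc *wbc, int page_index);

/*
 * Write up to nr_pages pages. The budget is re-read for every page,
 * so an increase made by writepage() is honored even if the budget
 * had just dropped to zero.
 */
int demo_write_pagevec(struct demo_wbc *wbc, int nr_pages,
                       demo_writepage_fn writepage)
{
    int written = 0;

    for (int i = 0; i < nr_pages; i++) {
        if (wbc->nr_to_write <= 0) /* rechecked on each iteration */
            break;
        wbc->nr_to_write--;        /* charge this page */
        writepage(wbc, i);         /* may raise the budget */
        written++;
    }
    return written;
}

/* Example callback: filling delalloc on the first page raises the budget by 2. */
void demo_writepage_bump_first(struct demo_wbc *wbc, int page_index)
{
    if (page_index == 0)
        wbc->nr_to_write += 2;
}
```

With a starting budget of 1 and four pages, a loop that latched "done" when the budget first reached zero would stop after one page, while this loop goes on to write the two extra pages the callback asked for.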
  3. 17 Sep, 2009 1 commit
    • Btrfs: improve async block group caching · 11833d66
      Yan Zheng authored
      
      
      This patch gets rid of two limitations of async block group caching.
      The old code delays handling pinned extents while a block group is
      being cached. To allocate logged file extents, the old code needs to
      wait until the block group is fully cached. To get rid of these
      limitations, this patch introduces a data structure to track the
      progress of caching. Based on the caching progress, we know which
      extents should be added to the free space cache when handling the
      pinned extents. The logged file extents are handled in a similar way.
      
      This patch also changes how pinned extents are tracked. The old
      code uses one tree to track pinned extents and copies the pinned
      extents tree at transaction commit time. This patch makes it use
      two trees to track pinned extents: one tree for extents that are
      pinned in the running transaction, and one tree for extents that can
      be unpinned. At transaction commit time, we swap the two trees.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
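The two-tree scheme for pinned extents can be sketched as a pointer swap at commit time. This is an illustration only: small fixed arrays stand in for the trees, and all names (`demo_fs`, `demo_pin_extent`, `demo_commit_swap`, `demo_finish_unpin`) are hypothetical.

```c
#include <stdint.h>

#define DEMO_MAX_EXTENTS 16

/* A fixed array stands in for an extent tree in this sketch. */
struct demo_extent_set {
    uint64_t start[DEMO_MAX_EXTENTS];
    int count;
};

struct demo_fs {
    struct demo_extent_set freed[2];
    struct demo_extent_set *pinned; /* extents pinned by the running transaction */
    struct demo_extent_set *unpin;  /* extents that can now be unpinned */
};

void demo_fs_init(struct demo_fs *fs)
{
    fs->freed[0].count = 0;
    fs->freed[1].count = 0;
    fs->pinned = &fs->freed[0];
    fs->unpin = &fs->freed[1];
}

/* Record an extent as pinned in the running transaction. */
int demo_pin_extent(struct demo_fs *fs, uint64_t start)
{
    if (fs->pinned->count >= DEMO_MAX_EXTENTS)
        return -1;
    fs->pinned->start[fs->pinned->count++] = start;
    return 0;
}

/* At commit: swap roles instead of copying the pinned set. */
void demo_commit_swap(struct demo_fs *fs)
{
    struct demo_extent_set *tmp = fs->pinned;

    fs->pinned = fs->unpin;
    fs->unpin = tmp;
}

/* After commit: release everything in the unpin set; returns how many. */
int demo_finish_unpin(struct demo_fs *fs)
{
    int released = fs->unpin->count;

    fs->unpin->count = 0;
    return released;
}
```

The swap is constant time, and new pins made after the commit land in the other set, so they never mix with extents that are waiting to be released.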
  4. 15 Sep, 2009 3 commits
  5. 11 Sep, 2009 17 commits
    • Btrfs: zero page past end of inline file items · 93c82d57
      Chris Mason authored
      
      
      When btrfs_get_extent is reading inline file items for readpage,
      it needs to copy the inline extent into the page.  If the
      inline extent doesn't cover all of the page, that means there
      is a hole in the file, or that our file is smaller than one
      page.
      
      readpage does zeroing for the case where the file is smaller than one
      page, but nobody is currently zeroing for the case where there is
      a hole after the inline item.
      
      This commit changes btrfs_get_extent to zero fill the page past
      the end of the inline item.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
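The zero fill described here is essentially a copy followed by a memset of the remainder. A minimal user-space sketch, assuming a 4096-byte page; `DEMO_PAGE_SIZE` and `demo_copy_inline_extent` are hypothetical names and this is not the code in btrfs_get_extent:

```c
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096 /* assumed page size for the sketch */

/*
 * Copy an inline extent into a page-sized buffer and zero everything
 * past the end of the inline data, so stale bytes are never exposed.
 */
void demo_copy_inline_extent(unsigned char *page,
                             const unsigned char *inline_data,
                             size_t inline_len)
{
    size_t copy = inline_len < DEMO_PAGE_SIZE ? inline_len : DEMO_PAGE_SIZE;

    memcpy(page, inline_data, copy);
    if (copy < DEMO_PAGE_SIZE)
        memset(page + copy, 0, DEMO_PAGE_SIZE - copy);
}
```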
    • Btrfs: fix btrfs page_mkwrite to return locked page · 50a9b214
      Chris Mason authored
      
      
      This closes a hole where the page may be written before
      the page_mkwrite caller has a chance to dirty it.
      
      (thanks to Nick Piggin)
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix extent replacement race · a1ed835e
      Chris Mason authored
      
      
      Data COW means that whenever we write to a file, we replace any old
      extent pointers with new ones.  There was a window where a readpage
      might find the old extent pointers on disk and cache them in the
      extent_map tree in ram in the middle of a given write replacing them.
      
      Even though both the readpage and the write had their respective bytes
      in the file locked, the extent readpage inserts may cover more bytes than
      it had locked down.
      
      This commit closes the race by keeping the new extent pinned in the extent
      map tree until after the on-disk btree is properly set up with the new
      extent pointers.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b
      Chris Mason authored
      
      
      Btrfs writes go through delalloc to the data=ordered code.  This
      makes sure that all of the data is on disk before the metadata
      that references it.  The tracking means that we have to make sure
      each page in an extent is fully written before we add that extent into
      the on-disk btree.
      
      This was done in the past by setting the EXTENT_ORDERED bit for the
      range of an extent when it was added to the data=ordered code, and then
      clearing the EXTENT_ORDERED bit in the extent state tree as each page
      finished IO.
      
      One of the reasons we had to do this was because sometimes pages are
      magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
      bit is checked at writepage time, and if it isn't there, our page became
      dirty without going through the proper path.
      
      These bit operations make for a number of rbtree searches for each page,
      and can cause considerable lock contention.
      
      This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
      As pages go into the ordered code, PagePrivate2 is set on each one.
      This is a cheap operation because we already have all the pages locked
      and ready to go.
      
      As IO finishes, the PagePrivate2 bit is cleared and the ordered
      accounting is updated for each page.
      
      At writepage time, if the PagePrivate2 bit is missing, we go into the
      writepage fixup code to handle improperly dirtied pages.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: use a cached state for extent state operations during delalloc · 9655d298
      Chris Mason authored
      
      
      This changes the btrfs code that finds delalloc ranges in the extent state
      tree so that it uses the new state caching code from set/test bit.  It
      reduces one of the biggest causes of rbtree searches in the writeback path.
      
      test_range_bit is also modified to take the cached state as a starting
      point while searching.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: don't lock bits in the extent tree during writepage · d5550c63
      Chris Mason authored
      
      
      At writepage time, we have the page locked and we have the
      extent_map entry for this extent pinned in the extent_map tree.
      So, the page can't go away and its mapping can't change.
      
      There is no need for the extra extent_state lock bits during writepage.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: cache values for locking extents · 2c64c53d
      Chris Mason authored
      
      
      Many of the btrfs extent state tree users follow the same pattern.
      They lock an extent range in the tree, do some operation and then
      unlock.
      
      This translates to at least 2 rbtree searches, and maybe more if they
      are doing operations on the extent state tree.  A locked extent
      in the tree isn't going to be merged or changed, and so we can
      safely return the extent state structure as a cached handle.
      
      This changes set_extent_bit to give back a cached handle, and also
      changes both set_extent_bit and clear_extent_bit to use the cached
      handle if it is available.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
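The lock, operate, unlock pattern with a cached handle might look like the sketch below. It is a simplified model, not the set_extent_bit/clear_extent_bit interface: a small array replaces the rbtree, a counter records searches, and every name (`demo_tree`, `demo_lock_range`, `demo_unlock_range`) is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_state {
    uint64_t start, end; /* inclusive byte range */
    bool locked;
};

/* An array stands in for the rbtree; searches counts lookups. */
struct demo_tree {
    struct demo_state states[4];
    int count;
    int searches;
};

static struct demo_state *demo_search(struct demo_tree *tree, uint64_t offset)
{
    tree->searches++;
    for (int i = 0; i < tree->count; i++) {
        if (tree->states[i].start <= offset && offset <= tree->states[i].end)
            return &tree->states[i];
    }
    return NULL;
}

/* Lock the state containing offset and hand it back as a cached handle. */
struct demo_state *demo_lock_range(struct demo_tree *tree, uint64_t offset,
                                   struct demo_state **cached)
{
    struct demo_state *st = demo_search(tree, offset);

    if (st) {
        st->locked = true;
        if (cached)
            *cached = st;
    }
    return st;
}

/* Unlock, using the cached handle when it covers offset; else search again. */
void demo_unlock_range(struct demo_tree *tree, uint64_t offset,
                       struct demo_state **cached)
{
    struct demo_state *st;

    if (cached && *cached &&
        (*cached)->start <= offset && offset <= (*cached)->end)
        st = *cached; /* no tree search needed */
    else
        st = demo_search(tree, offset);

    if (st)
        st->locked = false;
    if (cached)
        *cached = NULL;
}
```

Reusing the handle is safe in this model for the reason the commit gives: a locked state is not merged or changed, so the pointer stays valid until the unlock.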
    • Btrfs: reduce CPU usage in the extent_state tree · 1edbb734
      Chris Mason authored
      
      
      Btrfs is currently mirroring some of the page state bits into
      its extent state tree.  The goal behind this was to use it in supporting
      blocksizes other than the page size.
      
      But, we don't currently support that, and we're using quite a lot of CPU
      on the rb tree and its spin lock.  This commit starts a series of
      cleanups to reduce the amount of work done in the extent state tree as
      part of each IO.
      
      This commit:
      
      * Adds the ability to lock an extent in the state tree and also set
      other bits.  The idea is to do locking and delalloc in one call
      
      * Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits.  Btrfs is using
      a combination of the page bits and the ordered write code for this
      instead.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix new state initialization order · e48c465b
      Chris Mason authored
      
      
      As the extent state tree is manipulated, there are callbacks
      that are used to take extra actions when different state bits are set
      or cleared.  One example of this is a counter for the total number
      of delayed allocation bytes in a single inode and in the whole FS.
      
      When new states are inserted, this callback is being done before we
      properly set up the new state.  This hasn't caused problems before
      because the lock bit was always done first, and the existing callbacks
      don't care about the lock bit.
      
      This patch makes sure the state is properly set up before using the
      callback, which is important for later optimizations that do more work
      without using the lock bit.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
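The ordering issue can be shown with a callback that reads the range of the state it is handed. This is a hypothetical sketch: `demo_insert_state`, `demo_count_delalloc` and `DEMO_EXTENT_DELALLOC` are illustrative names, and the real callbacks and bit values differ.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_EXTENT_DELALLOC 0x1u /* hypothetical bit value */

struct demo_state {
    uint64_t start, end; /* inclusive byte range */
    unsigned bits;
};

typedef void (*demo_set_bit_hook)(const struct demo_state *state,
                                  unsigned bits, void *ctx);

/*
 * Initialize a new state. The range is filled in before the hook runs,
 * so the hook can rely on start and end being valid.
 */
void demo_insert_state(struct demo_state *state, uint64_t start, uint64_t end,
                       unsigned bits, demo_set_bit_hook hook, void *ctx)
{
    state->start = start; /* set up the state first */
    state->end = end;
    state->bits = 0;
    if (hook)
        hook(state, bits, ctx); /* only then notify */
    state->bits |= bits;
}

/* Example hook: keep a running total of delayed allocation bytes. */
void demo_count_delalloc(const struct demo_state *state, unsigned bits,
                         void *ctx)
{
    uint64_t *total = ctx;

    if (bits & DEMO_EXTENT_DELALLOC)
        *total += state->end - state->start + 1;
}
```

If the hook ran before `start` and `end` were assigned, the byte count would be computed from uninitialized fields, which is the kind of problem the patch rules out.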
    • Btrfs: switch extent_map to a rw lock · 890871be
      Chris Mason authored
      
      
      There are two main users of the extent_map tree.  The
      first is regular file inodes, where it is evenly spread
      between readers and writers.
      
      The second is the chunk allocation tree, which maps blocks from
      logical addresses to physical ones, and it is 99.99% reads.
      
      The mapping tree is a point of lock contention during heavy IO
      workloads, so this commit switches things to a rw lock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: tweak congestion backoff · 57fd5a5f
      Chris Mason authored
      
      
      The btrfs io submission thread tries to back off congested devices in
      favor of rotating off to another disk.
      
      But, it tries to make sure it submits at least some IO before rotating
      on (the others may be congested too), and so it has a magic number of
      requests it tries to write before it hops.
      
      This makes the magic number smaller.  Testing shows that we're spending
      too much time on congested devices and leaving the other devices idle.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: use larger nr_to_write for larger extents · a97adc9f
      Chris Mason authored
      
      
      When btrfs fills a large delayed allocation extent, it is a good idea
      to try and convince the write_cache_pages caller to go ahead and
      write a good chunk of that extent.  The extra IO is basically free
      because we know it is contiguous.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: reduce worker thread spin_lock_irq hold times · 4f878e84
      Chris Mason authored
      
      
      This changes the btrfs worker threads to batch work items
      into a local list.  It allows us to pull work items in
      large chunks and significantly reduces the number of times we
      need to take the worker thread spinlock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
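Batching into a local list can be sketched as "take the lock once, detach everything, process unlocked". The code below is a single-threaded model: the lock is represented only by a counter of acquisitions, and the names (`demo_queue`, `demo_enqueue`, `demo_drain_batched`) are hypothetical rather than taken from the btrfs worker code.

```c
#include <stddef.h>

struct demo_work {
    int id;
    struct demo_work *next;
};

struct demo_queue {
    struct demo_work *head;
    struct demo_work *tail;
    int lock_acquisitions; /* models how often the spinlock is taken */
};

/* Stand-ins for spin_lock_irq()/spin_unlock_irq(); they only count. */
static void demo_lock(struct demo_queue *q) { q->lock_acquisitions++; }
static void demo_unlock(struct demo_queue *q) { (void)q; }

void demo_enqueue(struct demo_queue *q, struct demo_work *w)
{
    demo_lock(q);
    w->next = NULL;
    if (q->tail)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
    demo_unlock(q);
}

/*
 * Detach the whole pending list under one lock acquisition, then
 * process the items from the local list without holding the lock.
 * Returns the number of items processed and adds their ids to *id_sum.
 */
int demo_drain_batched(struct demo_queue *q, int *id_sum)
{
    struct demo_work *local;
    int processed = 0;

    demo_lock(q);
    local = q->head;
    q->head = NULL;
    q->tail = NULL;
    demo_unlock(q);

    while (local) {
        struct demo_work *next = local->next;

        *id_sum += local->id; /* "process" the item */
        processed++;
        local = next;
    }
    return processed;
}
```

Draining three queued items this way costs one lock acquisition instead of three, which is the effect the commit is after.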
    • Btrfs: keep irqs on more often in the worker threads · 4e3f9c50
      Chris Mason authored
      
      
      The btrfs worker thread spinlock was being used both for the
      queueing of IO and for the processing of ordered events.
      
      The ordered events never happen from end_io handlers, and so they
      don't need to use the _irq version of spinlocks.  This adds a
      dedicated lock to the ordered lists so they don't have to run
      with irqs off.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: optimize set extent bit · 40431d6c
      Chris Mason authored
      
      
      The Btrfs set_extent_bit call currently searches the rbtree
      every time it needs to find more extent_state objects to fill
      the requested operation.
      
      This adds a simple test with rb_next to see if the next object
      in the tree was adjacent to the one we just found.  If so,
      we skip the search and just use the next object.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
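The adjacency test can be modeled with a sorted array in place of the rbtree, where "index + 1" plays the role of rb_next. This sketch assumes the existing states already cover the requested range (no splitting or insertion), and all names (`demo_ext_tree`, `demo_find`, `demo_set_bits`) are hypothetical.

```c
#include <stdint.h>

struct demo_ext {
    uint64_t start, end; /* inclusive byte range */
    unsigned bits;
};

/* Sorted, non-overlapping array standing in for the rbtree. */
struct demo_ext_tree {
    struct demo_ext *ext;
    int count;
    int searches; /* number of full lookups performed */
};

/* Full lookup: first state whose range ends at or after offset. */
static int demo_find(struct demo_ext_tree *tree, uint64_t offset)
{
    tree->searches++;
    for (int i = 0; i < tree->count; i++) {
        if (tree->ext[i].end >= offset)
            return i;
    }
    return -1;
}

/* Set bits on every existing state overlapping [start, end]. */
void demo_set_bits(struct demo_ext_tree *tree, uint64_t start, uint64_t end,
                   unsigned bits)
{
    int i = demo_find(tree, start);

    while (i >= 0 && tree->ext[i].start <= end) {
        uint64_t last = tree->ext[i].end;
        int next = i + 1; /* stand-in for rb_next() */

        tree->ext[i].bits |= bits;
        if (last >= end)
            break;
        if (next < tree->count && tree->ext[next].start == last + 1)
            i = next;                      /* adjacent: skip the search */
        else
            i = demo_find(tree, last + 1); /* gap: fall back to a search */
    }
}
```

When the states covering a range sit back to back, the whole operation needs one lookup; a gap between states forces one additional search per gap.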
    • Btrfs: Allow worker threads to exit when idle · 9042846b
      Chris Mason authored
      
      
      The Btrfs worker threads don't currently die off after they have
      been idle for a while, leading to a lot of threads sitting around
      doing nothing for each mount.
      
      Also, they are unable to start atomically (from end_io handlers).
      
      This commit reworks the worker threads so they can be started
      from end_io handlers (just setting a flag that asks for a thread
      to be added at a later date) and so they can exit if they
      have been idle for a long time.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  6. 09 Sep, 2009 2 commits
    • Linux 2.6.31 · 74fca6a4
      Linus Torvalds authored
    • aoe: allocate unused request_queue for sysfs · 7135a71b
      Ed Cashin authored
      Andy Whitcroft reported an oops in aoe triggered by use of an
      incorrectly initialised request_queue object:
      
        [ 2645.959090] kobject '<NULL>' (ffff880059ca22c0): tried to add
      		an uninitialized object, something is seriously wrong.
        [ 2645.959104] Pid: 6, comm: events/0 Not tainted 2.6.31-5-generic #24-Ubuntu
        [ 2645.959107] Call Trace:
        [ 2645.959139] [<ffffffff8126ca2f>] kobject_add+0x5f/0x70
        [ 2645.959151] [<ffffffff8125b4ab>] blk_register_queue+0x8b/0xf0
        [ 2645.959155] [<ffffffff8126043f>] add_disk+0x8f/0x160
        [ 2645.959161] [<ffffffffa01673c4>] aoeblk_gdalloc+0x164/0x1c0 [aoe]
      
      The request queue of an aoe device is not used but can be allocated in
      code that does not sleep.
      
      Bruno bisected this regression down to
      
        cd43e26f
      
        block: Expose stacked device queues in sysfs
      
      "This seems to generate /sys/block/$device/queue and its contents for
       everyone who is using queues, not just for those queues that have a
       non-NULL queue->request_fn."
      
      Addresses http://bugs.launchpad.net/bugs/410198
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13942
      
      
      
      Note that embedding a queue inside another object has always been
      an illegal construct, since the queues are reference counted and
      must persist until the last reference is dropped. So aoe was
      always buggy in this respect (Jens).
      Signed-off-by: Ed Cashin <ecashin@coraid.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Bruno Premont <bonbons@linux-vserver.org>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  7. 08 Sep, 2009 2 commits
  8. 07 Sep, 2009 5 commits
  9. 06 Sep, 2009 3 commits
  10. 05 Sep, 2009 3 commits