1. 15 Jun, 2009 1 commit
  2. 10 Jun, 2009 1 commit
    • Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng authored
      
      
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in a subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits the old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
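
      The two cases above can be sketched in a few lines of Python. This is
      only an illustration of the counting rule, not the kernel code: Block,
      cow_block() and the freed list are made-up names, and reference counts
      are plain integers.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    refs: int = 1
    children: list = field(default_factory=list)


def cow_block(old, freed):
    """Copy `old` and return the copy; `freed` collects blocks released at once."""
    new = Block(old.name + "'", refs=1, children=list(old.children))
    if old.refs == 1:
        # Non-shared: the copy inherits the old block's references, so the
        # children's counts are left alone and the old block is freed now.
        old.refs = 0
        freed.append(old)
    else:
        # Shared: every extent the new block points to gains one reference,
        # and the old block loses the one reference the copy replaces.
        for child in new.children:
            child.refs += 1
        old.refs -= 1
    return new
```

      In the non-shared case no lower level reference counts change at all,
      which is what makes the later dead root walk unnecessary.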
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update the back refs for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about the pointer's key, level and in which
      tree the pointer lives. This information allows us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds a per-subvolume red-black tree to keep track of cached
      inodes. The red-black tree helps the balancing code to find cached
      inodes whose inode numbers are within a given range.
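
      As a rough illustration of the range lookup, the sketch below uses a
      sorted Python list with bisect in place of the red-black tree. InodeCache
      and its methods are hypothetical names, not the btrfs structures.

```python
import bisect


class InodeCache:
    """Ordered set of cached inode numbers for one subvolume (illustrative)."""

    def __init__(self):
        self._inos = []  # kept sorted, like an in-order walk of the tree

    def add(self, ino):
        i = bisect.bisect_left(self._inos, ino)
        if i == len(self._inos) or self._inos[i] != ino:
            self._inos.insert(i, ino)

    def in_range(self, lo, hi):
        """Return cached inode numbers with lo <= ino <= hi."""
        start = bisect.bisect_left(self._inos, lo)
        end = bisect.bisect_right(self._inos, hi)
        return self._inos[start:end]
```

      An ordered structure lets the lookup jump to the first candidate instead
      of scanning every cached inode.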
      
      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduces the overhead of checking back refs.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  3. 24 Apr, 2009 1 commit
    • Btrfs: fix deadlocks and stalls on dead root removal · 59bc5c75
      Chris Mason authored
      
      
      After a transaction commit, the old roots of the subvol btrees are sent through
      snapshot removal.  This is what actually frees up any blocks replaced by
      COW, and anything the old blocks pointed to.
      
      Snapshot deletion will pause when a transaction commit has started, which
      helps to avoid a huge amount of delayed reference count updates piling up
      as the transaction is trying to close.
      
      But, this pause happens after the snapshot deletion process has asked other
      procs on the system to throttle back a bit so that it can make progress.
      
      We don't want to throttle everyone while we're waiting for the transaction
      commit.  It leads to deadlocks in the user transaction ioctls used by Ceph
      and makes things slower in general.
      
      This patch changes things to avoid the throttling while we sleep.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 02 Apr, 2009 1 commit
    • Btrfs: add flushoncommit mount option · dccae999
      Sage Weil authored
      
      
      The 'flushoncommit' mount option forces any data dirtied by a write in a
      prior transaction to commit as part of the current commit.  This makes
      the committed state a fully consistent view of the file system from the
      application's perspective (i.e., it includes all completed file system
      operations).  This was previously the behavior only when a snapshot is
      created.
      
      This is used by Ceph to ensure that completed writes make it to the
      platter along with the metadata operations they are bound to (by
      BTRFS_IOC_TRANS_{START,END}).
      
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  5. 03 Apr, 2009 1 commit
    • Btrfs: rework allocation clustering · fa9c0d79
      Chris Mason authored
      
      
      Because btrfs is copy-on-write, we end up picking new locations for
      blocks very often.  This makes it fairly difficult to maintain perfect
      read patterns over time, but we can at least do some optimizations
      for writes.
      
      This is done today by remembering the last place we allocated and
      trying to find a free space hole big enough to hold more than just one
      allocation.  The end result is that we tend to write sequentially to
      the drive.
      
      This happens all the time for metadata and it happens for data
      when mounted -o ssd.  But, the way we record it is fairly racy
      and it tends to fragment the free space over time because we are trying
      to allocate fairly large areas at once.
      
      This commit gets rid of the races by adding a free space cluster object
      with dedicated locking to make sure that only one process at a time
      is out replacing the cluster.
      
      The free space fragmentation is somewhat solved by allowing a cluster
      to be comprised of smaller free space extents.  This part definitely
      adds some CPU time to the cluster allocations, but it allows the allocator
      to consume the small holes left behind by cow.
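
      A minimal sketch of the idea of building a cluster from several smaller
      free extents follows. It is not the btrfs allocator: build_cluster(), its
      parameters and the (start, length) tuples are assumptions for
      illustration, and it ignores locking and any limit on how far apart the
      chosen extents may be.

```python
def build_cluster(free_extents, want_bytes, min_extent):
    """Pick free extents, in address order, until they add up to want_bytes.

    free_extents: list of (start, length) tuples sorted by start.
    Returns the chosen extents, or None if the free space is not enough.
    """
    picked, total = [], 0
    for start, length in free_extents:
        if length < min_extent:
            continue  # too small to be worth adding to the cluster
        picked.append((start, length))
        total += length
        if total >= want_bytes:
            return picked
    return None
```

      The extra loop is where the added CPU time comes from, and the payoff is
      that holes smaller than the full request can still be used.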
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  6. 31 Mar, 2009 1 commit
    • Btrfs: add extra flushing for renames and truncates · 5a3f23d5
      Chris Mason authored
      
      
      Renames and truncates are both common ways to replace old data with new
      data.  The filesystem can make an effort to make sure the new data is
      on disk before actually replacing the old data.
      
      This is especially important for rename, which many applications use as
      though it were atomic for both the data and the metadata involved.  The
      current btrfs code will happily replace a file that is fully on disk
      with one that was just created and still has pending IO.
      
      If we crash after transaction commit but before the IO is done, we'll end
      up replacing a good file with a zero length file.  The solution used
      here is to create a list of inodes that need special ordering and force
      them to disk before the commit is done.  This is similar to the
      ext3 style data=ordering, except it is only done on selected files.
      
      Btrfs is able to get away with this because it does not wait on commits
      very often, even for fsync (which uses a sub-commit).
      
      For renames, we order the file when it wasn't already
      on disk and when it is replacing an existing file.  Larger files
      are sent to filemap_flush right away (before the transaction handle is
      opened).
      
      For truncates, we order if the file goes from non-zero size down to
      zero size.  This is a little different, because at the time of the
      truncate the file has no dirty bytes to order.  But, we flag the inode
      so that it is added to the ordered list on close (via release method).  We
      also immediately add it to the ordered list of the current transaction
      so that we can try to flush down any writes the application sneaks in
      before commit.
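
      The two ordering rules described above can be written as simple
      predicates. This is a sketch of the decision only; the function and
      parameter names are made up and nothing here touches real inodes.

```python
def rename_needs_ordering(source_fully_on_disk, replaces_existing_file):
    """Order the renamed file when it still has pending IO and it replaces
    a file that already exists."""
    return (not source_fully_on_disk) and replaces_existing_file


def truncate_needs_ordering(old_size, new_size):
    """Order the file when it goes from a non-zero size down to zero."""
    return old_size > 0 and new_size == 0
```

      Everything else is left unordered, which is why this is cheaper than
      ordering every file in the transaction.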
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  7. 24 Mar, 2009 5 commits
    • Btrfs: Only let very young transactions grow during commit · 89573b9c
      Chris Mason authored
      
      
      Commits are fairly expensive, and so btrfs has code to sit around for a while
      during the commit and let new writers come in.
      
      But, while we're sitting there, new delayed refs might be added, and those
      can be expensive to process as well.  Unless the transaction is very very
      young, it makes sense to go ahead and let the commit finish without hanging
      around.
      
      The commit grow loop isn't as important as it used to be, because the fsync
      logging code handles most performance critical syncs now.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: reduce stalls during transaction commit · b7ec40d7
      Chris Mason authored
      
      
      To avoid deadlocks and reduce latencies during some critical operations, some
      transaction writers are allowed to jump into the running transaction and make
      it run a little longer, while others sit around and wait for the commit to
      finish.
      
      This is a bit unfair, especially when the callers that jump in do a bunch
      of IO that makes all the other procs on the box wait.  This commit
      reduces the stalls this produces by pre-reading file extent pointers
      during btrfs_finish_ordered_io before the transaction is joined.
      
      It also tunes the drop_snapshot code to politely wait for transactions
      that have started writing out their delayed refs to finish.  This avoids
      new delayed refs being flooded into the queue while we're trying to
      close off the transaction.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: process the delayed reference queue in clusters · c3e69d58
      Chris Mason authored
      
      
      The delayed reference queue maintains pending operations that need to
      be done to the extent allocation tree.  These are processed one at a
      time, by finding records in the tree that are not currently being
      processed.
      
      This is slow because it uses lots of time searching through the rbtree
      and because it creates lock contention on the extent allocation tree
      when lots of different procs are running delayed refs at the same time.
      
      This commit changes things to grab a cluster of refs for processing,
      using a cursor into the rbtree as the starting point of the next search.
      This way we walk smoothly through the rbtree.
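
      The cursor idea can be sketched as follows. A sorted list of extent byte
      numbers stands in for the rbtree, and next_cluster() is a hypothetical
      helper, not a kernel function.

```python
import bisect


def next_cluster(pending, cursor, size):
    """Return (cluster, new_cursor) for a sorted list of pending byte numbers.

    The search starts at `cursor` instead of at the beginning, so repeated
    calls walk forward through the pending refs. When nothing is left at or
    after the cursor, an empty cluster is returned and the cursor resets to 0.
    """
    i = bisect.bisect_left(pending, cursor)
    cluster = pending[i:i + size]
    if not cluster:
        return [], 0
    return cluster, cluster[-1] + 1
```

      Each caller gets a distinct run of neighbouring refs, which is one way to
      picture how the contention described above is reduced.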
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason authored
      
      
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
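
      Benefits 1) and 2) can be illustrated with a small sketch. A dictionary
      keyed by (extent byte number, parent byte number) stands in for the
      rbtree; DelayedRefs and its methods are assumed names and do not mirror
      the kernel structures.

```python
from collections import defaultdict


class DelayedRefs:
    """Queue of pending reference count changes (illustrative only)."""

    def __init__(self):
        # (bytenr, parent) -> net change in reference count
        self._mods = defaultdict(int)

    def add(self, bytenr, parent, delta):
        # Updates to the same key are merged instead of queued separately.
        self._mods[(bytenr, parent)] += delta

    def run(self):
        """Return the net updates in (bytenr, parent) order and empty the queue."""
        updates = [(key, d) for key, d in sorted(self._mods.items()) if d != 0]
        self._mods.clear()
        return updates
```

      An increment followed by a matching decrement never reaches the extent
      allocation tree, and whatever remains is applied in sorted order.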
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: don't preallocate metadata blocks during btrfs_search_slot · 9fa8cfe7
      Chris Mason authored
      
      
      In order to avoid doing expensive extent management with tree locks held,
      btrfs_search_slot will preallocate tree blocks for use by COW without
      any tree locks held.
      
      A later commit moves all of the extent allocation work for COW into
      a delayed update mechanism, and this preallocation will no longer be
      required.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  8. 12 Feb, 2009 1 commit
  9. 21 Jan, 2009 1 commit
  10. 06 Jan, 2009 1 commit
    • Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation · 180591bc
      Yan Zheng authored
      
      
      Snapshot creation happens at a specific time during transaction commit.  We
      need to make sure the code called by snapshot creation doesn't wait
      for the running transaction to commit.
      
      This changes btrfs_delete_inode and finish_pending_snaps to use
      btrfs_join_transaction instead of btrfs_start_transaction to avoid deadlocks.
      
      It would be better if btrfs_delete_inode didn't use the join, but the
      call path that triggers it is:
      
      btrfs_commit_transaction->create_pending_snapshots->
      create_pending_snapshot->btrfs_lookup_dentry->
      fixup_tree_root_location->btrfs_read_fs_root->
      btrfs_read_fs_root_no_name->btrfs_orphan_cleanup->iput
      
      This will be fixed in a later patch by moving the orphan cleanup to the
      cleaner thread.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  11. 05 Jan, 2009 2 commits
  12. 11 Dec, 2008 1 commit
    • Btrfs: fix leaking block group on balance · d2fb3437
      Yan Zheng authored
      
      
      The block group structs are referenced in many different
      places, and it's not safe to free while balancing.  So, those block
      group structs were simply leaked instead.
      
      This patch replaces the block group pointer in the inode with the starting byte
      offset of the block group and adds reference counting to the block group
      struct.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  13. 08 Dec, 2008 1 commit
    • Btrfs: superblock duplication · a512bbf8
      Yan Zheng authored
      
      
      This patch implements superblock duplication. Superblocks
      are stored at offset 16K, 64M and 256G on every device.
      Space used by superblocks is preserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  14. 02 Dec, 2008 1 commit
  15. 18 Nov, 2008 1 commit
  16. 17 Nov, 2008 3 commits
    • Btrfs: Add backrefs and forward refs for subvols and snapshots · 0660b5af
      Chris Mason authored
      
      
      Subvols and snapshots can now be referenced from any point in the directory
      tree.  We need to maintain back refs for them so we can find lost
      subvols.
      
      Forward refs are added so that we know all of the subvols and
      snapshots referenced anywhere in the directory tree of a single subvol.  This
      can be used to do recursive snapshotting (but they aren't yet) and it is
      also used to detect and prevent directory loops when creating new snapshots.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Give each subvol and snapshot their own anonymous devid · 3394e160
      Chris Mason authored
      
      
      Each subvolume has its own private inode number space, and so we need
      to fill in different device numbers for each subvolume to avoid confusing
      applications.
      
      This commit puts a struct super_block into struct btrfs_root so it can
      call set_anon_super() and get a different device number generated for
      each root.
      
      btrfs_rename is changed to prevent renames across subvols.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Allow subvolumes and snapshots anywhere in the directory tree · 3de4586c
      Chris Mason authored
      
      
      Before, all snapshots and subvolumes lived in a single flat directory.  This
      was awkward and confusing because the single flat directory was only writable
      with the ioctls.
      
      This commit changes the ioctls to create subvols and snapshots at any
      point in the directory tree.  This requires making separate ioctls for
      snapshot and subvol creation instead of combining them into one.
      
      The subvol ioctl does:
      
      btrfsctl -S subvol_name parent_dir
      
      After the ioctl is done subvol_name lives inside parent_dir.
      
      The snapshot ioctl does:
      
      btrfsctl -s path_for_snapshot root_to_snapshot
      
      path_for_snapshot can be an absolute or relative path.  btrfsctl breaks it up
      into directory and basename components.
      
      root_to_snapshot can be any file or directory in the FS.  The snapshot
      is taken of the entire root where that file lives.
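
      The directory/basename split that btrfsctl performs on path_for_snapshot
      can be pictured with the sketch below. It only shows the string handling;
      split_snapshot_path() is a hypothetical helper, not btrfsctl code, and
      the example paths are made up.

```python
import posixpath


def split_snapshot_path(path):
    """Split a snapshot path into (directory, name).

    Trailing slashes are dropped first, and a bare name is treated as
    living in the current directory.
    """
    path = path.rstrip("/") or "/"
    directory, name = posixpath.split(path)
    return directory or ".", name
```

      Both absolute and relative paths end up as a directory to create the
      snapshot in plus the name of the new entry.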
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  17. 07 Nov, 2008 1 commit
    • Btrfs: Avoid unplug storms during commit · 5f2cc086
      Chris Mason authored
      
      
      While doing a commit, btrfs makes sure all the metadata blocks
      were properly written to disk, calling wait_on_page_writeback for
      each page.  This writeback happens after allowing another transaction
      to start, so it competes for the disk with other processes in the FS.
      
      If the page writeback bit is still set, each wait_on_page_writeback might
      trigger an unplug, even though the page might be waiting for checksumming
      to finish or might be waiting for the async work queue to submit the
      bio.
      
      This trades wait_on_page_writeback for waiting on the extent writeback
      bits.  It won't trigger any unplugs and substantially improves performance
      in a number of workloads.
      
      This also changes the async bio submission to avoid requeueing if there
      is only one device.  The requeue just wastes CPU time because there are
      no other devices to service.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  18. 30 Oct, 2008 2 commits
    • Btrfs: update nodatacow code v2 · 80ff3856
      Yan Zheng authored
      
      
      This patch simplifies the nodatacow checker. If all references
      were created after the latest snapshot, then we can avoid COW
      safely. This patch also updates run_delalloc_nocow to do more
      fine-grained checking.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: prevent looping forever in finish_current_insert and del_pending_extents · 87ef2bb4
      Chris Mason authored
      
      
      finish_current_insert and del_pending_extents process extent tree modifications
      that build up while we are changing the extent tree.  It is a confusing
      bit of code that prevents recursion.
      
      Both functions run through a list of pending operations and both funcs
      add to the list of pending operations.  If you have two procs in either
      one of them, they can end up looping forever making more work for each other.
      
      This patch makes them walk forward through the list of pending changes instead
      of always trying to process the entire list.  At transaction commit
      time, we catch any changes that were left over.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  19. 29 Oct, 2008 3 commits
    • Btrfs: Add root tree pointer transaction ids · 84234f3a
      Yan Zheng authored
      
      
      This patch adds transaction IDs to root tree pointers.
      Transaction IDs in tree pointers are compared with the
      generation numbers in block headers when reading root
      blocks of trees. This can detect some types of IO errors.
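
      The check amounts to comparing two numbers when a root block is read. A
      minimal sketch, with GenerationMismatch and verify_root_block() as
      made-up names rather than kernel symbols:

```python
class GenerationMismatch(Exception):
    """The block on disk is not the version the pointer expects."""


def verify_root_block(pointer_transid, header_generation):
    """Compare the transaction ID stored in the tree pointer with the
    generation number found in the header of the block that was read."""
    if header_generation != pointer_transid:
        raise GenerationMismatch(
            f"expected generation {pointer_transid}, found {header_generation}"
        )
    return True
```

      A block that was never written, or an older copy left at the same
      location, would carry a different generation and fail this comparison.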
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: nuke fs wide allocation mutex V2 · 25179201
      Josef Bacik authored
      
      
      This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
      of little locks.
      
      There is now a pinned_mutex, which is used when messing with the pinned_extents
      extent io tree, and the extent_ins_mutex which is used with the pending_del and
      extent_ins extent io trees.
      
      The locking for the extent tree stuff was inspired by a patch that Yan Zheng
      wrote to fix a race condition.  I cleaned it up some and changed the locking
      around a little bit, but the idea remains the same.  Basically instead of
      holding the extent_ins_mutex throughout the processing of an extent on the
      extent_ins or pending_del trees, we just hold it while we're searching and when
      we clear the bits on those trees, and lock the extent for the duration of the
      operations on the extent.
      
      Also to keep from getting hung up waiting to lock an extent, I've added a
      try_lock_extent so that if we cannot lock the extent, we move on to the next
      one in the tree and come back to that one later.  I have tested this heavily
      and it does not appear to break anything.  This has to be applied on top of my
      find_free_extent redo patch.
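
      The "try to lock, otherwise move on and come back" pattern looks roughly
      like this in user space. The sketch uses threading.Lock with a
      non-blocking acquire; process_extents() and its arguments are
      illustrative names, and it assumes every busy lock is eventually
      released by its holder, otherwise the loop would spin forever.

```python
import threading
from collections import deque


def process_extents(locks, work):
    """Process every named extent once, skipping ones that are locked.

    locks: dict mapping an extent name to its threading.Lock.
    work:  callable invoked with the extent name while its lock is held.
    Returns the names in the order they were actually processed.
    """
    queue = deque(locks)
    done = []
    while queue:
        name = queue.popleft()
        lock = locks[name]
        if not lock.acquire(blocking=False):
            queue.append(name)  # busy: revisit after the others
            continue
        try:
            work(name)
            done.append(name)
        finally:
            lock.release()
    return done
```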
      
      I tested this patch on top of Yan's space rebalancing code and it worked fine.
      The only thing that has changed since the last version is that I pulled out all
      my debugging stuff; apparently I forgot to run guilt refresh before I sent the
      last patch out.
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Btrfs: Improve space balancing code · f82d02d9
      Yan Zheng authored
      
      
      This patch improves the space balancing code to keep more sharing
      of tree blocks. The only case that breaks sharing of tree blocks is
      when data extents get fragmented during balancing. The main changes in
      this patch are:
      
      Add a 'drop sub-tree' function. This solves the problem in the old code
      where the BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree blocks.
      
      Remove relocation mapping tree. Relocation mappings are stored in
      struct btrfs_ref_path and updated dynamically during walking up/down
      the reference path. This reduces CPU usage and simplifies code.
      
      This patch also fixes a bug. Root items for reloc trees should be
      updated in btrfs_free_reloc_root.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  20. 03 Oct, 2008 1 commit
    • Btrfs: remove last_log_alloc allocator optimization · 30c43e24
      Chris Mason authored
      
      
      The tree logging code was trying to separate tree log allocations
      from normal metadata allocations to improve writeback patterns during
      an fsync.
      
      But, the code was not effective and ended up just mixing tree log
      blocks with regular metadata.  That seems to be working fairly well,
      so the last_log_alloc code can be removed.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  21. 29 Sep, 2008 1 commit
    • Btrfs: add and improve comments · d352ac68
      Chris Mason authored
      
      
      This improves the comments at the top of many functions.  It didn't
      dive into the guts of functions because I was trying to
      avoid merging problems with the new allocator and back reference work.
      
      extent-tree.c and volumes.c were both skipped, and there is definitely
      more work to do in cleaning and commenting the code.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  22. 26 Sep, 2008 3 commits
    • Btrfs: update space balancing code · 1a40e23b
      Zheng Yan authored
      
      
      This patch updates the space balancing code to utilize the new
      backref format.  Before, btrfs-vol -b would break any COW links
      on data blocks or metadata.  This was slow and caused the amount
      of space used to explode if a large number of snapshots were present.
      
      The new code can keep the sharing of all data extents and
      most of the tree blocks.
      
      To maintain the sharing of data extents, the space balance code uses
      a separate inode to hold data extent pointers, then updates the references
      to point to the new location.
      
      To maintain the sharing of tree blocks, the space balance code uses
      reloc trees to relocate tree blocks in reference counted roots.
      There is one reloc tree for each subvol, and all reloc trees share
      the same root key objectid. Reloc trees are snapshots of the latest
      committed roots of subvols (root->commit_root).
      
      To relocate a tree block referenced by a subvol, there are two steps:
      COW the block through the subvol's reloc tree, then update the block pointer
      in the subvol to point to the new block. Since all reloc trees share
      the same root key objectid, doing special handling for tree blocks
      owned by them is easy. Once a tree block has been COWed in one
      reloc tree, we can use the resulting new block directly when the
      same block is required to COW again through other reloc trees.
      In this way, relocated tree blocks are shared between reloc trees,
      so they are also shared between subvols.
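
      The reuse of an already relocated block across reloc trees can be
      sketched as a lookup table from old location to new location. Relocator,
      its fields and the tree names are assumptions for illustration, and block
      numbers are plain integers.

```python
import itertools


class Relocator:
    """Remembers where each relocated tree block went (illustrative only)."""

    def __init__(self, first_free_block):
        self._next_free = itertools.count(first_free_block)
        self._moved = {}    # old block number -> new block number
        self.pointers = {}  # (reloc tree, old block) -> new block
        self.copies = 0     # how many blocks were actually copied

    def cow(self, reloc_tree, old_block):
        # Copy the block only the first time any reloc tree asks for it.
        if old_block not in self._moved:
            self._moved[old_block] = next(self._next_free)
            self.copies += 1
        new_block = self._moved[old_block]
        self.pointers[(reloc_tree, old_block)] = new_block
        return new_block
```

      Two reloc trees that COW the same block end up pointing at one new copy,
      so the sharing between their subvols survives the relocation.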
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: extent_map and data=ordered fixes for space balancing · 5b21f2ed
      Zheng Yan authored
      
      
      * Add an EXTENT_BOUNDARY state bit to keep the writepage code
      from merging data extents that are in the process of being
      relocated.  This allows us to do accounting for them properly.
      
      * The balancing code relocates data extents independently of the underlying
      inode.  The extent_map code was modified to properly account for
      things moving around (invalidating extent_map caches in the inode).
      
      * Don't take the drop_mutex in the create_subvol ioctl.  It isn't
      required.
      
      * Fix walking of the ordered extent list to avoid races with sys_unlink
      
      * Change the lock ordering rules.  Transaction start goes outside
      the drop_mutex.  This allows btrfs_commit_transaction to directly
      drop the relocation trees.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Add shared reference cache · e4657689
      Zheng Yan authored
      
      
      Btrfs has a cache of reference counts in leaves, allowing it to
      avoid reading tree leaves while deleting snapshots.  To reduce
      contention with multiple subvolumes, this cache is private to each
      subvolume.
      
      This patch adds shared reference cache support. The new space
      balancing code plays with multiple subvols at the same time, so
      the old per-subvol reference cache is not well suited.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  23. 25 Sep, 2008 6 commits
    • Btrfs: Record dirty pages tree-log pages in an extent_io tree · d0c803c4
      Chris Mason authored
      
      
      This is the same way the transaction code makes sure that all the
      other tree blocks are safely on disk.  There's an extent_io tree
      for each root, and any blocks allocated to the tree logs are
      recorded in that tree.
      
      At tree-log sync, the extent_io tree is walked to flush down the
      dirty pages and wait for them.
      
      The main benefit is less time spent walking the tree log and skipping
      clean pages, and getting sequential IO down to the drive.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Tree logging fixes · 4bef0848
      Chris Mason authored
      
      
      * Pin down data blocks to prevent them from being reallocated like so:
      
      trans 1: allocate file extent
      trans 2: free file extent
      trans 3: free file extent during old snapshot deletion
      trans 3: allocate file extent to new file
      trans 3: fsync new file
      
      Before the tree logging code, this was legal because the fsync
      would commit the transaction that did the final data extent free
      and the transaction that allocated the extent to the new file
      at the same time.
      
      With the tree logging code, the tree log subtransaction can commit
      before the transaction that freed the extent.  If we crash,
      we're left with two different files using the extent.
      
      * Don't wait in start_transaction if log replay is going on.  This
      avoids deadlocks from iput while we're cleaning up link counts in the
      replay code.
      
      * Don't deadlock in replay_one_name by trying to read an inode off
      the disk while holding paths for the directory
      
      * Hold the buffer lock while we mark a buffer as written.  This
      closes a race where someone is changing a buffer while we write it.
      They are supposed to mark it dirty again after they change it, but
      this violates the cow rules.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Add a write ahead tree log to optimize synchronous operations · e02119d5
      Chris Mason authored
      
      
      File syncs and directory syncs are optimized by copying their
      items into a special (copy-on-write) log tree.  There is one log tree per
      subvolume and the btrfs super block points to a tree of log tree roots.
      
      After a crash, items are copied out of the log tree and back into the
      subvolume.  See tree-log.c for all the details.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Wait for async bio submissions to make some progress at queue time · b64a2851
      Chris Mason authored
      
      
      Before, the btrfs bdi congestion function was used to test for too many
      async bios.  This keeps that check to throttle pdflush, but also
      adds a check while queuing bios.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Transaction commit: don't use filemap_fdatawait · 777e6bd7
      Chris Mason authored
      
      
      After writing out all the remaining btree blocks in the transaction,
      the commit code would use filemap_fdatawait to make sure it was all
      on disk.  This means it would wait for blocks written by other procs
      as well.
      
      The new code walks the list of blocks for this transaction again
      and waits only for those required by this transaction.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Yan Zheng's avatar
      7ea394f1