1. 24 Mar, 2009 2 commits
    • Chris Mason's avatar
      Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason authored
      
      
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      56bec294
    • Chris Mason's avatar
      Btrfs: don't preallocate metadata blocks during btrfs_search_slot · 9fa8cfe7
      Chris Mason authored
      
      
      In order to avoid doing expensive extent management with tree locks held,
      btrfs_search_slot will preallocate tree blocks for use by COW without
      any tree locks held.
      
      A later commit moves all of the extent allocation work for COW into
      a delayed update mechanism, and this preallocation will no longer be
      required.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      9fa8cfe7
  2. 10 Mar, 2009 1 commit
    • Chris Mason's avatar
      Btrfs: Fix locking around adding new space_info · 4184ea7f
      Chris Mason authored
      
      
      Storage allocated to different raid levels in btrfs is tracked by
      a btrfs_space_info structure, and all of the current space_infos are
      collected into a list_head.
      
      Most filesystems have 3 or 4 of these structs total, and the list is
      only changed when new raid levels are added or at unmount time.
      
      This commit adds rcu locking on the list head, and properly frees
      things at unmount time.  It also clears the space_info->full flag
      whenever new space is added to the FS.
      
      The locking for the space info list goes like this:
      
      reads: protected by rcu_read_lock()
      writes: protected by the chunk_mutex
      
      At unmount time we don't need special locking because all the readers
      are gone.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      4184ea7f
  3. 20 Feb, 2009 1 commit
    • Josef Bacik's avatar
      Btrfs: add better -ENOSPC handling · 6a63209f
      Josef Bacik authored
      
      
      This is a step in the direction of better -ENOSPC handling.  Instead of
      checking the global bytes counter we check the space_info bytes counters to
      make sure we have enough space.
      
      If we don't we go ahead and try to allocate a new chunk, and then if that fails
      we return -ENOSPC.  This patch adds two counters to btrfs_space_info,
      bytes_delalloc and bytes_may_use.
      
      bytes_delalloc account for extents we've actually setup for delalloc and will
      be allocated at some point down the line. 
      
      bytes_may_use is to keep track of how many bytes we may use for delalloc at
      some point.  When we actually set the extent_bit for the delalloc bytes we
      subtract the reserved bytes from the bytes_may_use counter.  This keeps us from
      not actually being able to allocate space for any delalloc bytes.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      
      
      
      6a63209f
  4. 12 Feb, 2009 2 commits
    • Chris Mason's avatar
      Btrfs: make a lockdep class for the extent buffer locks · 4008c04a
      Chris Mason authored
      
      
      Btrfs is currently using spin_lock_nested with a nested value based
      on the tree depth of the block.  But, this doesn't quite work because
      the max tree depth is bigger than what spin_lock_nested can deal with,
      and because locks are sometimes taken before the level field is filled in.
      
      The solution here is to use lockdep_set_class_and_name instead, and to
      set the class before unlocking the pages when the block is read from the
      disk and just after init of a freshly allocated tree block.
      
      btrfs_clear_path_blocking is also changed to take the locks in the proper
      order, and it also makes sure all the locks currently held are properly
      set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
      gets upset about bad lock orderin.
      
      The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      4008c04a
    • Jeff Mahoney's avatar
      Btrfs: remove btrfs_init_path · e00f7308
      Jeff Mahoney authored
      
      
      btrfs_init_path was initially used when the path objects were on the
      stack.  Now all the work is done by btrfs_alloc_path and btrfs_init_path
      isn't required.
      
      This patch removes it, and just uses kmem_cache_zalloc to zero out the object.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      e00f7308
  5. 04 Feb, 2009 2 commits
    • Chris Mason's avatar
      Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Chris Mason authored
      
      
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop,
      most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks,
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      b4ce94de
    • Chris Mason's avatar
      Btrfs: hash_lock is no longer needed · c487685d
      Chris Mason authored
      
      
      Before metadata is written to disk, it is updated to reflect that writeout
      has begun.  Once this update is done, the block must be cow'd before it
      can be modified again.
      
      This update was originally synchronized by using a per-fs spinlock.  Today
      the buffers for the metadata blocks are locked before writeout begins,
      and everyone that tests the flag has the buffer locked as well.
      
      So, the per-fs spinlock (called hash_lock for no good reason) is no
      longer required.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      c487685d
  6. 21 Jan, 2009 2 commits
    • Yan Zheng's avatar
      Btrfs: fix tree logs parallel sync · 7237f183
      Yan Zheng authored
      
      
      To improve performance, btrfs_sync_log merges tree log sync
      requests. But it wrongly merges sync requests for different
      tree logs. If multiple tree logs are synced at the same time,
      only one of them actually gets synced.
      
      This patch has following changes to fix the bug:
      
      Move most tree log related fields in btrfs_fs_info to
      btrfs_root. This allows merging sync requests separately
      for each tree log.
      
      Don't insert root item into the log root tree immediately
      after log tree is allocated. Root item for log tree is
      inserted when log tree get synced for the first time. This
      allows syncing the log root tree without first syncing all
      log trees.
      
      At tree-log sync, btrfs_sync_log first sync the log tree;
      then updates corresponding root item in the log root tree;
      sync the log root tree; then update the super block.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      7237f183
    • Jan Engelhardt's avatar
      Btrfs: change/remove typedef · 95029d7d
      Jan Engelhardt authored
      
      
      Change one typedef to a regular enum, and remove an unused one.
      Signed-off-by: default avatarJan Engelhardt <jengelh@medozas.de>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      95029d7d
  7. 05 Jan, 2009 1 commit
  8. 17 Dec, 2008 1 commit
    • Chris Mason's avatar
      Btrfs: shift all end_io work to thread pools · cad321ad
      Chris Mason authored
      
      
      bio_end_io for reads without checksumming on and btree writes were
      happening without using async thread pools.  This means the extent_io.c
      code had to use spin_lock_irq and friends on the rb tree locks for
      extent state.
      
      There were some irq safe vs unsafe lock inversions between the delallock
      lock and the extent state locks.  This patch gets rid of them by moving
      all end_io code into the thread pools.
      
      To avoid contention and deadlocks between the data end_io processing and the
      metadata end_io processing yet another thread pool is added to finish
      off metadata writes.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      cad321ad
  9. 12 Dec, 2008 1 commit
    • Yan Zheng's avatar
      Btrfs: fix nodatasum handling in balancing code · 17d217fe
      Yan Zheng authored
      
      
      Checksums on data can be disabled by mount option, so it's
      possible some data extents don't have checksums or have
      invalid checksums. This causes trouble for data relocation.
      This patch contains following things to make data relocation
      work.
      
      1) make nodatasum/nodatacow mount option only affects new
      files. Checksums and COW on data are only controlled by the
      inode flags.
      
      2) check the existence of checksum in the nodatacow checker.
      If checksums exist, force COW the data extent. This ensure that
      checksum for a given block is either valid or does not exist.
      
      3) update data relocation code to properly handle the case
      of checksum missing.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      17d217fe
  10. 11 Dec, 2008 1 commit
    • Yan Zheng's avatar
      Btrfs: fix leaking block group on balance · d2fb3437
      Yan Zheng authored
      
      
      The block group structs are referenced in many different
      places, and it's not safe to free while balancing.  So, those block
      group structs were simply leaked instead.
      
      This patch replaces the block group pointer in the inode with the starting byte
      offset of the block group and adds reference counting to the block group
      struct.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      d2fb3437
  11. 10 Dec, 2008 1 commit
    • Chris Mason's avatar
      Btrfs: Delete csum items when freeing extents · 459931ec
      Chris Mason authored
      
      
      This finishes off the new checksumming code by removing csum items
      for extents that are no longer in use.
      
      The trick is doing it without racing because a single csum item may
      hold csums for more than one extent.  Extra checks are added to
      btrfs_csum_file_blocks to make sure that we are using the correct
      csum item after dropping locks.
      
      A new btrfs_split_item is added to split a single csum item so it
      can be split without dropping the leaf lock.  This is used to
      remove csum bytes from the middle of an item.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      459931ec
  12. 08 Dec, 2008 2 commits
    • Chris Mason's avatar
      Btrfs: Add inode sequence number for NFS and reserved space in a few structs · c3027eb5
      Chris Mason authored
      
      
      This adds a sequence number to the btrfs inode that is increased on
      every update.  NFS will be able to use that to detect when an inode has
      changed, without relying on inaccurate time fields.
      
      While we're here, this also:
      
      Puts reserved space into the super block and inode
      
      Adds a log root transid to the super so we can pick the newest super
      based on the fsync log as well as the main transaction ID.  For now
      the log root transid is always zero, but that'll get fixed.
      
      Adds a starting offset to the dev_item.  This will let us do better
      alignment calculations if we know the start of a partition on the disk.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      c3027eb5
    • Chris Mason's avatar
      Btrfs: move data checksumming into a dedicated tree · d20f7043
      Chris Mason authored
      
      
      Btrfs stores checksums for each data block.  Until now, they have
      been stored in the subvolume trees, indexed by the inode that is
      referencing the data block.  This means that when we read the inode,
      we've probably read in at least some checksums as well.
      
      But, this has a few problems:
      
      * The checksums are indexed by logical offset in the file.  When
      compression is on, this means we have to do the expensive checksumming
      on the uncompressed data.  It would be faster if we could checksum
      the compressed data instead.
      
      * If we implement encryption, we'll be checksumming the plain text and
      storing that on disk.  This is significantly less secure.
      
      * For either compression or encryption, we have to get the plain text
      back before we can verify the checksum as correct.  This makes the raid
      layer balancing and extent moving much more expensive.
      
      * It makes the front end caching code more complex, as we have touch
      the subvolume and inodes as we cache extents.
      
      * There is potentitally one copy of the checksum in each subvolume
      referencing an extent.
      
      The solution used here is to store the extent checksums in a dedicated
      tree.  This allows us to index the checksums by phyiscal extent
      start and length.  It means:
      
      * The checksum is against the data stored on disk, after any compression
      or encryption is done.
      
      * The checksum is stored in a central location, and can be verified without
      following back references, or reading inodes.
      
      This makes compression significantly faster by reducing the amount of
      data that needs to be checksummed.  It will also allow much faster
      raid management code in general.
      
      The checksums are indexed by a key with a fixed objectid (a magic value
      in ctree.h) and offset set to the starting byte of the extent.  This
      allows us to copy the checksum items into the fsync log tree directly (or
      any other tree), without having to invent a second format for them.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      d20f7043
  13. 02 Dec, 2008 3 commits
  14. 20 Nov, 2008 1 commit
  15. 18 Nov, 2008 1 commit
  16. 17 Nov, 2008 5 commits
    • Chris Mason's avatar
      Btrfs: prevent loops in the directory tree when creating snapshots · ea9e8b11
      Chris Mason authored
      
      
      For a directory tree:
      
      /mnt/subvolA/subvolB
      
      btrfsctl -s /mnt/subvolA/subvolB /mnt
      
      Will create a directory loop with subvolA under subvolB.  This
      commit uses the forward refs for each subvol and snapshot to error out
      before creating the loop.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      ea9e8b11
    • Chris Mason's avatar
      Btrfs: Add backrefs and forward refs for subvols and snapshots · 0660b5af
      Chris Mason authored
      
      
      Subvols and snapshots can now be referenced from any point in the directory
      tree.  We need to maintain back refs for them so we can find lost
      subvols.
      
      Forward refs are added so that we know all of the subvols and
      snapshots referenced anywhere in the directory tree of a single subvol.  This
      can be used to do recursive snapshotting (but they aren't yet) and it is
      also used to detect and prevent directory loops when creating new snapshots.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      0660b5af
    • Chris Mason's avatar
      Btrfs: Give each subvol and snapshot their own anonymous devid · 3394e160
      Chris Mason authored
      
      
      Each subvolume has its own private inode number space, and so we need
      to fill in different device numbers for each subvolume to avoid confusing
      applications.
      
      This commit puts a struct super_block into struct btrfs_root so it can
      call set_anon_super() and get a different device number generated for
      each root.
      
      btrfs_rename is changed to prevent renames across subvols.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      3394e160
    • Chris Mason's avatar
      Btrfs: Allow subvolumes and snapshots anywhere in the directory tree · 3de4586c
      Chris Mason authored
      
      
      Before, all snapshots and subvolumes lived in a single flat directory.  This
      was awkward and confusing because the single flat directory was only writable
      with the ioctls.
      
      This commit changes the ioctls to create subvols and snapshots at any
      point in the directory tree.  This requires making separate ioctls for
      snapshot and subvol creation instead of a combining them into one.
      
      The subvol ioctl does:
      
      btrfsctl -S subvol_name parent_dir
      
      After the ioctl is done subvol_name lives inside parent_dir.
      
      The snapshot ioctl does:
      
      btrfsctl -s path_for_snapshot root_to_snapshot
      
      path_for_snapshot can be an absolute or relative path.  btrfsctl breaks it up
      into directory and basename components.
      
      root_to_snapshot can be any file or directory in the FS.  The snapshot
      is taken of the entire root where that file lives.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      3de4586c
    • Yan Zheng's avatar
      Btrfs: Seed device support · 2b82032c
      Yan Zheng authored
      
      
      Seed device is a special btrfs with SEEDING super flag
      set and can only be mounted in read-only mode. Seed
      devices allow people to create new btrfs on top of it.
      
      The new FS contains the same contents as the seed device,
      but it can be mounted in read-write mode.
      
      This patch does the following:
      
      1) split code in btrfs_alloc_chunk into two parts. The first part does makes
      the newly allocated chunk usable, but does not do any operation that modifies
      the chunk tree. The second part does the the chunk tree modifications. This
      division is for the bootstrap step of adding storage to the seed device.
      
      2) Update device management code to handle seed device.
      The basic idea is: For an FS grown from seed devices, its
      seed devices are put into a list. Seed devices are
      opened on demand at mounting time. If any seed device is
      missing or has been changed, btrfs kernel module will
      refuse to mount the FS.
      
      3) make btrfs_find_block_group not return NULL when all
      block groups are read-only.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      2b82032c
  17. 12 Nov, 2008 2 commits
    • Yan Zheng's avatar
      Btrfs: mount ro and remount support · c146afad
      Yan Zheng authored
      
      
      This patch adds mount ro and remount support. The main
      changes in patch are: adding btrfs_remount and related
      helper function; splitting the transaction related code
      out of close_ctree into btrfs_commit_super; updating
      allocator to properly handle read only block group.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      c146afad
    • Josef Bacik's avatar
      Btrfs: batch extent inserts/updates/deletions on the extent root · f3465ca4
      Josef Bacik authored
      
      
      While profiling the allocator I noticed a good amount of time was being spent in
      finish_current_insert and del_pending_extents, and as the filesystem filled up
      more and more time was being spent in those functions.  This patch aims to try
      and reduce that problem.  This happens two ways
      
      1) track if we tried to delete an extent that we are going to update or insert.
      Once we get into finish_current_insert we discard any of the extents that were
      marked for deletion.  This saves us from doing unnecessary work almost every
      time finish_current_insert runs.
      
      2) Batch insertion/updates/deletions.  Instead of doing a btrfs_search_slot for
      each individual extent and doing the needed operation, we instead keep the leaf
      around and see if there is anything else we can do on that leaf.  On the insert
      case I introduced a btrfs_insert_some_items, which will take an array of keys
      with an array of data_sizes and try and squeeze in as many of those keys as
      possible, and then return how many keys it was able to insert.  In the update
      case we search for an extent ref, update the ref and then loop through the leaf
      to see if any of the other refs we are looking to update are on that leaf, and
      then once we are done we release the path and search for the next ref we need to
      update.  And finally for the deletion we try and delete the extent+ref in pairs,
      so we will try to find extent+ref pairs next to the extent we are trying to free
      and free them in bulk if possible.
      
      This along with the other cluster fix that Chris pushed out a bit ago helps make
      the allocator preform more uniformly as it fills up the disk.  There is still a
      slight drop as we fill up the disk since we start having to stick new blocks in
      odd places which results in more COW's than on a empty fs, but the drop is not
      nearly as severe as it was before.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      f3465ca4
  18. 06 Nov, 2008 1 commit
    • Chris Mason's avatar
      Btrfs: Optimize compressed writeback and reads · 771ed689
      Chris Mason authored
      
      
      When reading compressed extents, try to put pages into the page cache
      for any pages covered by the compressed extent that readpages didn't already
      preload.
      
      Add an async work queue to handle transformations at delayed allocation processing
      time.  Right now this is just compression.  The workflow is:
      
      1) Find offsets in the file marked for delayed allocation
      2) Lock the pages
      3) Lock the state bits
      4) Call the async delalloc code
      
      The async delalloc code clears the state lock bits and delalloc bits.  It is
      important this happens before the range goes into the work queue because
      otherwise it might deadlock with other work queue items that try to lock
      those extent bits.
      
      The file pages are compressed, and if the compression doesn't work the
      pages are written back directly.
      
      An ordered work queue is used to make sure the inodes are written in the same
      order that pdflush or writepages sent them down.
      
      This changes extent_write_cache_pages to let the writepage function
      update the wbc nr_written count.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      771ed689
  19. 31 Oct, 2008 1 commit
  20. 30 Oct, 2008 3 commits
    • Yan Zheng's avatar
      Btrfs: Add fallocate support v2 · d899e052
      Yan Zheng authored
      
      This patch updates btrfs-progs for fallocate support.
      
      fallocate is a little different in Btrfs because we need to tell the
      COW system that a given preallocated extent doesn't need to be
      cow'd as long as there are no snapshots of it.  This leverages the
      -o nodatacow checks.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      d899e052
    • Yan Zheng's avatar
      Btrfs: update nodatacow code v2 · 80ff3856
      Yan Zheng authored
      
      
      This patch simplifies the nodatacow checker. If all references
      were created after the latest snapshot, then we can avoid COW
      safely. This patch also updates run_delalloc_nocow to do more
      fine-grained checking.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      80ff3856
    • Yan Zheng's avatar
      Btrfs: update hole handling v2 · 9036c102
      Yan Zheng authored
      
      
      This patch splits the hole insertion code out of btrfs_setattr
      into btrfs_cont_expand and updates btrfs_get_extent to properly
      handle the case that file extent items are not continuous.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      9036c102
  21. 29 Oct, 2008 6 commits
    • Chris Mason's avatar
    • Yan Zheng's avatar
      Btrfs: Add root tree pointer transaction ids · 84234f3a
      Yan Zheng authored
      
      
      This patch adds transaction IDs to root tree pointers.
      Transaction IDs in tree pointers are compared with the
      generation numbers in block headers when reading root
      blocks of trees. This can detect some types of IO errors.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      
      84234f3a
    • Josef Bacik's avatar
      Btrfs: nuke fs wide allocation mutex V2 · 25179201
      Josef Bacik authored
      
      
      This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
      of little locks.
      
      There is now a pinned_mutex, which is used when messing with the pinned_extents
      extent io tree, and the extent_ins_mutex which is used with the pending_del and
      extent_ins extent io trees.
      
      The locking for the extent tree stuff was inspired by a patch that Yan Zheng
      wrote to fix a race condition, I cleaned it up some and changed the locking
      around a little bit, but the idea remains the same.  Basically instead of
      holding the extent_ins_mutex throughout the processing of an extent on the
      extent_ins or pending_del trees, we just hold it while we're searching and when
      we clear the bits on those trees, and lock the extent for the duration of the
      operations on the extent.
      
      Also to keep from getting hung up waiting to lock an extent, I've added a
      try_lock_extent so if we cannot lock the extent, move on to the next one in the
      tree and we'll come back to that one.  I have tested this heavily and it does
      not appear to break anything.  This has to be applied on top of my
      find_free_extent redo patch.
      
      I tested this patch on top of Yan's space reblancing code and it worked fine.
      The only thing that has changed since the last version is I pulled out all my
      debugging stuff, apparently I forgot to run guilt refresh before I sent the
      last patch out.  Thank you,
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      
      25179201
    • Josef Bacik's avatar
      Btrfs: fix enospc when there is plenty of space · 80eb234a
      Josef Bacik authored
      
      
      So there is an odd case where we can possibly return -ENOSPC when there is in
      fact space to be had.  It only happens with Metadata writes, and happens _very_
      infrequently.  What has to happen is we have to allocate have allocated out of
      the first logical byte on the disk, which would set last_alloc to
      first_logical_byte(root, 0), so search_start == orig_search_start.  We then
      need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
      BTRFS_BLOCK_GROUP_DUP.  We will do a block lookup for the given search_start,
      block_group_bits() won't match and we'll go to choose another block group.
      However because search_start matches orig_search_start we go to see if we can
      allocate a chunk.
      
      If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
      This is kind of a big flaw of the way find_free_extent works, as it along with
      find_free_space loop through _all_ of the block groups, not just the ones that
      we want to allocate out of.  This patch completely kills find_free_space and
      rolls it into find_free_extent.  I've introduced a sort of state machine into
      this, which will make it easier to get cache miss information out of the
      allocator, and will work well with my locking changes.
      
      The basic flow is this:  We have the variable loop which is 0, meaning we are
      in the hint phase.  We lookup the block group for the hint, and lookup the
      space_info for what we want to allocate out of.  If the block group we were
      pointed at by the hint either isn't of the correct type, or just doesn't have
      the space we need, we set head to space_info->block_groups, so we start at the
      beginning of the block groups for this particular space info, and loop through.
      
      This is also where we add the empty_cluster to total_needed.  At this point
      loop is set to 1 and we just loop through all of the block groups for this
      particular space_info looking for the space we need, just as find_free_space
      would have done, except we only hit the block groups we want and not _all_ of
      the block groups.  If we come full circle we see if we can allocate a chunk.
      If we cannot of course we exit with -ENOSPC and we are good.  If not we start
      over at space_info->block_groups and loop through again, with loop == 2.  If we
      come full circle and haven't found what we need then we exit with -ENOSPC.
      I've been running this for a couple of days now and it seems stable, and I
      haven't yet hit a -ENOSPC when there was plenty of space left.
      
      Also I've made a groups_sem to handle the group list for the space_info.  This
      is part of my locking changes, but is relatively safe and seems better than
      holding the space_info spinlock over that entire search time.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
       
      80eb234a
    • Yan Zheng's avatar
      Btrfs: Improve space balancing code · f82d02d9
      Yan Zheng authored
      
      
      This patch improves the space balancing code to keep more sharing
      of tree blocks. The only case that breaks sharing of tree blocks is
      data extents get fragmented during balancing. The main changes in
      this patch are:
      
      Add a 'drop sub-tree' function. This solves the problem in old code
      that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block.
      
      Remove relocation mapping tree. Relocation mappings are stored in
      struct btrfs_ref_path and updated dynamically during walking up/down
      the reference path. This reduces CPU usage and simplifies code.
      
      This patch also fixes a bug. Root items for reloc trees should be
      updated in btrfs_free_reloc_root.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      
      f82d02d9
    • Chris Mason's avatar
      Btrfs: Add zlib compression support · c8b97818
      Chris Mason authored
      
      
      This is a large change for adding compression on reading and writing,
      both for inline and regular extents.  It does some fairly large
      surgery to the writeback paths.
      
      Compression is off by default and enabled by mount -o compress.  Even
      when the -o compress mount option is not used, it is possible to read
      compressed extents off the disk.
      
      If compression for a given set of pages fails to make them smaller, the
      file is flagged to avoid future compression attempts later.
      
      * While finding delalloc extents, the pages are locked before being sent down
      to the delalloc handler.  This allows the delalloc handler to do complex things
      such as cleaning the pages, marking them writeback and starting IO on their
      behalf.
      
      * Inline extents are inserted at delalloc time now.  This allows us to compress
      the data before inserting the inline extent, and it allows us to insert
      an inline extent that spans multiple pages.
      
      * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
      are changed to record both an in-memory size and an on disk size, as well
      as a flag for compression.
      
      From a disk format point of view, the extent pointers in the file are changed
      to record the on disk size of a given extent and some encoding flags.
      Space in the disk format is allocated for compression encoding, as well
      as encryption and a generic 'other' field.  Neither the encryption or the
      'other' field are currently used.
      
      In order to limit the amount of data read for a single random read in the
      file, the size of a compressed extent is limited to 128k.  This is a
      software only limit, the disk format supports u64 sized compressed extents.
      
      In order to limit the ram consumed while processing extents, the uncompressed
      size of a compressed extent is limited to 256k.  This is a software only limit
      and will be subject to tuning later.
      
      Checksumming is still done on compressed extents, and it is done on the
      uncompressed version of the data.  This way additional encodings can be
      layered on without having to figure out which encoding to checksum.
      
      Compression happens at delalloc time, which is basically singled threaded because
      it is usually done by a single pdflush thread.  This makes it tricky to
      spread the compression load across all the cpus on the box.  We'll have to
      look at parallel pdflush walks of dirty inodes at a later time.
      
      Decompression is hooked into readpages and it does spread across CPUs nicely.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      c8b97818