1. 02 Jul, 2009 1 commit
    • Yan Zheng's avatar
      Btrfs: update backrefs while dropping snapshot · 2c47e605
      Yan Zheng authored
      
      
      The new backref format has restriction on type of backref item.  If a tree
      block isn't referenced by its owner tree, full backrefs must be used for the
      pointers in it. When a tree block loses its owner tree's reference, backrefs
      for the pointers in it should be updated to full backrefs. Current
      btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for
      general use.
      
      This patch adds backrefs update code to btrfs_drop_snapshot.  It isn't a
      problem in the restricted form btrfs_drop_snapshot is used today, but for
      general snapshot deletion this update is required.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      2c47e605
  2. 10 Jun, 2009 5 commits
    • Al Viro's avatar
      Fix btrfs when ACLs are configured out · 7df336ec
      Al Viro authored
      
      
      ... otherwise generic_permission() will allow *anything* for all
      files you don't own and that have some group permissions.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      7df336ec
    • Christoph Hellwig's avatar
      Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION · 6cbff00f
      Christoph Hellwig authored
      
      
      Add support for the standard attributes set via chattr and read via
      lsattr.  Currently we store the attributes in the flags value in
      the btrfs inode, but I wonder whether we should split it into two so
      that we don't have to keep converting between the two formats.
      
      Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
      as they were confusing the existing code and got in the way of the
      new additions.
      
      Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
      trivial.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      6cbff00f
    • Chris Mason's avatar
      Btrfs: autodetect SSD devices · c289811c
      Chris Mason authored
      
      
      During mount, btrfs will check the queue nonrot flag
      for all the devices found in the FS.  If they are all
      non-rotating, SSD mode is enabled by default.
      
      If the FS was mounted with -o nossd, the non-rotating
      flag is ignored.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      c289811c
    • Chris Mason's avatar
      Btrfs: add mount -o ssd_spread to spread allocations out · 451d7585
      Chris Mason authored
      
      
      Some SSDs perform best when reusing block numbers often, while
      others perform much better when clustering strictly allocates
      big chunks of unused space.
      
      The default mount -o ssd will find rough groupings of blocks
      where there are a bunch of free blocks that might have some
      allocated blocks mixed in.
      
      mount -o ssd_spread will make sure there are no allocated blocks
      mixed in.  It should perform better on lower end SSDs.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      451d7585
    • Yan Zheng's avatar
      Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng authored
      
      
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update back ref for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about pointer's key, level and in which
      tree the pointer lives. This information allow us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds per subvolume red-black tree to keep trace of cached
      inodes. The red-black tree helps the balancing code to find cached
      inodes whose inode numbers within a given range.
      
      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduce the overhead of checking back ref.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      5d4f98a2
  3. 24 Apr, 2009 2 commits
    • Chris Mason's avatar
      Btrfs: fix fallocate deadlock on inode extent lock · e980b50c
      Chris Mason authored
      
      
      The btrfs fallocate call takes an extent lock on the entire range
      being fallocated, and then runs through insert_reserved_extent on each
      extent as they are allocated.
      
      The problem with this is that btrfs_drop_extents may decide to try
      and take the same extent lock fallocate was already holding.  The solution
      used here is to push down knowledge of the range that is already locked
      going into btrfs_drop_extents.
      
      It turns out that at least one other caller had the same bug.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      e980b50c
    • Josef Bacik's avatar
      Btrfs: try to keep a healthy ratio of metadata vs data block groups · 97e728d4
      Josef Bacik authored
      
      
      This patch makes the chunk allocator keep a good ratio of metadata vs data
      block groups.  By default for every 8 data block groups, we'll allocate 1
      metadata chunk, or about 12% of the disk will be allocated for metadata.  This
      can be changed by specifying the metadata_ratio mount option.
      
      This is simply the number of data block groups that have to be allocated to
      force a metadata chunk allocation.  By making sure we allocate metadata chunks
      more often, we are less likely to get into situations where the whole disk
      has been allocated as data block groups.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      97e728d4
  4. 02 Apr, 2009 3 commits
  5. 03 Apr, 2009 3 commits
    • Chris Mason's avatar
      Btrfs: rework allocation clustering · fa9c0d79
      Chris Mason authored
      
      
      Because btrfs is copy-on-write, we end up picking new locations for
      blocks very often.  This makes it fairly difficult to maintain perfect
      read patterns over time, but we can at least do some optimizations
      for writes.
      
      This is done today by remembering the last place we allocated and
      trying to find a free space hole big enough to hold more than just one
      allocation.  The end result is that we tend to write sequentially to
      the drive.
      
      This happens all the time for metadata and it happens for data
      when mounted -o ssd.  But, the way we record it is fairly racey
      and it tends to fragment the free space over time because we are trying
      to allocate fairly large areas at once.
      
      This commit gets rid of the races by adding a free space cluster object
      with dedicated locking to make sure that only one process at a time
      is out replacing the cluster.
      
      The free space fragmentation is somewhat solved by allowing a cluster
      to be comprised of smaller free space extents.  This part definitely
      adds some CPU time to the cluster allocations, but it allows the allocator
      to consume the small holes left behind by cow.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      fa9c0d79
    • Josef Bacik's avatar
      Btrfs: kill the pinned_mutex · 04018de5
      Josef Bacik authored
      
      
      This patch removes the pinned_mutex.  The extent io map has an internal tree
      lock that protects the tree itself, and since we only copy the extent io map
      when we are committing the transaction we don't need it there.  We also don't
      need it when caching the block group since searching through the tree is also
      protected by the internal map spin lock.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      04018de5
    • Josef Bacik's avatar
      Btrfs: kill the block group alloc mutex · 6226cb0a
      Josef Bacik authored
      
      
      This patch removes the block group alloc mutex used to protect the free space
      tree for allocations and replaces it with a spin lock which is used only to
      protect the free space rb tree.  This means we only take the lock when we are
      directly manipulating the tree, which makes us a touch faster with
      multi-threaded workloads.
      
      This patch also gets rid of btrfs_find_free_space and replaces it with
      btrfs_find_space_for_alloc, which takes the number of bytes you want to
      allocate, and empty_size, which is used to indicate how much free space should
      be at the end of the allocation.
      
      It will return an offset for the allocator to use.  If we don't end up using it
      we _must_ call btrfs_add_free_space to put it back.  This is the tradeoff to
      kill the alloc_mutex, since we need to make sure nobody else comes along and
      takes our space.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      6226cb0a
  6. 01 Apr, 2009 1 commit
    • Nick Piggin's avatar
      mm: page_mkwrite change prototype to match fault · c2ec175c
      Nick Piggin authored
      
      
      Change the page_mkwrite prototype to take a struct vm_fault, and return
      VM_FAULT_xxx flags.  There should be no functional change.
      
      This makes it possible to return much more detailed error information to
      the VM (and also can provide more information eg.  virtual_address to the
      driver, which might be important in some special cases).
      
      This is required for a subsequent fix.  And will also make it easier to
      merge page_mkwrite() with fault() in future.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Artem Bityutskiy <dedekind@infradead.org>
      Cc: Felix Blyakher <felixb@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c2ec175c
  7. 31 Mar, 2009 1 commit
    • Chris Mason's avatar
      Btrfs: add extra flushing for renames and truncates · 5a3f23d5
      Chris Mason authored
      
      
      Renames and truncates are both common ways to replace old data with new
      data.  The filesystem can make an effort to make sure the new data is
      on disk before actually replacing the old data.
      
      This is especially important for rename, which many application use as
      though it were atomic for both the data and the metadata involved.  The
      current btrfs code will happily replace a file that is fully on disk
      with one that was just created and still has pending IO.
      
      If we crash after transaction commit but before the IO is done, we'll end
      up replacing a good file with a zero length file.  The solution used
      here is to create a list of inodes that need special ordering and force
      them to disk before the commit is done.  This is similar to the
      ext3 style data=ordering, except it is only done on selected files.
      
      Btrfs is able to get away with this because it does not wait on commits
      very often, even for fsync (which use a sub-commit).
      
      For renames, we order the file when it wasn't already
      on disk and when it is replacing an existing file.  Larger files
      are sent to filemap_flush right away (before the transaction handle is
      opened).
      
      For truncates, we order if the file goes from non-zero size down to
      zero size.  This is a little different, because at the time of the
      truncate the file has no dirty bytes to order.  But, we flag the inode
      so that it is added to the ordered list on close (via release method).  We
      also immediately add it to the ordered list of the current transaction
      so that we can try to flush down any writes the application sneaks in
      before commit.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      5a3f23d5
  8. 24 Mar, 2009 5 commits
    • Chris Mason's avatar
      Btrfs: tree logging unlink/rename fixes · 12fcfd22
      Chris Mason authored
      
      
      The tree logging code allows individual files or directories to be logged
      without including operations on other files and directories in the FS.
      It tries to commit the minimal set of changes to disk in order to
      fsync the single file or directory that was sent to fsync or O_SYNC.
      
      The tree logging code was allowing files and directories to be unlinked
      if they were part of a rename operation where only one directory
      in the rename was in the fsync log.  This patch adds a few new rules
      to the tree logging.
      
      1) on rename or unlink, if the inode being unlinked isn't in the fsync
      log, we must force a full commit before doing an fsync of the directory
      where the unlink was done.  The commit isn't done during the unlink,
      but it is forced the next time we try to log the parent directory.
      
      Solution: record transid of last unlink/rename per directory when the
      directory wasn't already logged.  For renames this is only done when
      renaming to a different directory.
      
      mkdir foo/some_dir
      normal commit
      rename foo/some_dir foo2/some_dir
      mkdir foo/some_dir
      fsync foo/some_dir/some_file
      
      The fsync above will unlink the original some_dir without recording
      it in its new location (foo2).  After a crash, some_dir will be gone
      unless the fsync of some_file forces a full commit
      
      2) we must log any new names for any file or dir that is in the fsync
      log.  This way we make sure not to lose files that are unlinked during
      the same transaction.
      
      2a) we must log any new names for any file or dir during rename
      when the directory they are being removed from was logged.
      
      2a is actually the more important variant.  Without the extra logging
      a crash might unlink the old name without recreating the new one
      
      3) after a crash, we must go through any directories with a link count
      of zero and redo the rm -rf
      
      mkdir f1/foo
      normal commit
      rm -rf f1/foo
      fsync(f1)
      
      The directory f1 was fully removed from the FS, but fsync was never
      called on f1, only its parent dir.  After a crash the rm -rf must
      be replayed.  This must be able to recurse down the entire
      directory tree.  The inode link count fixup code takes care of the
      ugly details.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      12fcfd22
    • Chris Mason's avatar
      Btrfs: leave btree locks spinning more often · b9473439
      Chris Mason authored
      
      
      btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
      for the buffers it was dirtying.  This may require a kmalloc and it
      was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
      set any btree locks they were holding to blocking first.
      
      This commit changes dirty tracking for extent buffers to just use a flag
      in the extent buffer.  Now that we have one and only one extent buffer
      per page, this can be safely done without losing dirty bits along the way.
      
      This also introduces a path->leave_spinning flag that callers of
      btrfs_search_slot can use to indicate they will properly deal with a
      path returned where all the locks are spinning instead of blocking.
      
      Many of the btree search callers now expect spinning paths,
      resulting in better btree concurrency overall.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      b9473439
    • Chris Mason's avatar
      Btrfs: process the delayed reference queue in clusters · c3e69d58
      Chris Mason authored
      
      
      The delayed reference queue maintains pending operations that need to
      be done to the extent allocation tree.  These are processed by
      finding records in the tree that are not currently being processed one at
      a time.
      
      This is slow because it uses lots of time searching through the rbtree
      and because it creates lock contention on the extent allocation tree
      when lots of different procs are running delayed refs at the same time.
      
      This commit changes things to grab a cluster of refs for processing,
      using a cursor into the rbtree as the starting point of the next search.
      This way we walk smoothly through the rbtree.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      c3e69d58
    • Chris Mason's avatar
      Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason authored
      
      
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      56bec294
    • Chris Mason's avatar
      Btrfs: don't preallocate metadata blocks during btrfs_search_slot · 9fa8cfe7
      Chris Mason authored
      
      
      In order to avoid doing expensive extent management with tree locks held,
      btrfs_search_slot will preallocate tree blocks for use by COW without
      any tree locks held.
      
      A later commit moves all of the extent allocation work for COW into
      a delayed update mechanism, and this preallocation will no longer be
      required.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      9fa8cfe7
  9. 10 Mar, 2009 1 commit
    • Chris Mason's avatar
      Btrfs: Fix locking around adding new space_info · 4184ea7f
      Chris Mason authored
      
      
      Storage allocated to different raid levels in btrfs is tracked by
      a btrfs_space_info structure, and all of the current space_infos are
      collected into a list_head.
      
      Most filesystems have 3 or 4 of these structs total, and the list is
      only changed when new raid levels are added or at unmount time.
      
      This commit adds rcu locking on the list head, and properly frees
      things at unmount time.  It also clears the space_info->full flag
      whenever new space is added to the FS.
      
      The locking for the space info list goes like this:
      
      reads: protected by rcu_read_lock()
      writes: protected by the chunk_mutex
      
      At unmount time we don't need special locking because all the readers
      are gone.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      4184ea7f
  10. 20 Feb, 2009 1 commit
    • Josef Bacik's avatar
      Btrfs: add better -ENOSPC handling · 6a63209f
      Josef Bacik authored
      
      
      This is a step in the direction of better -ENOSPC handling.  Instead of
      checking the global bytes counter we check the space_info bytes counters to
      make sure we have enough space.
      
      If we don't we go ahead and try to allocate a new chunk, and then if that fails
      we return -ENOSPC.  This patch adds two counters to btrfs_space_info,
      bytes_delalloc and bytes_may_use.
      
      bytes_delalloc account for extents we've actually setup for delalloc and will
      be allocated at some point down the line. 
      
      bytes_may_use is to keep track of how many bytes we may use for delalloc at
      some point.  When we actually set the extent_bit for the delalloc bytes we
      subtract the reserved bytes from the bytes_may_use counter.  This keeps us from
      not actually being able to allocate space for any delalloc bytes.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      
      
      
      6a63209f
  11. 12 Feb, 2009 2 commits
    • Chris Mason's avatar
      Btrfs: make a lockdep class for the extent buffer locks · 4008c04a
      Chris Mason authored
      
      
      Btrfs is currently using spin_lock_nested with a nested value based
      on the tree depth of the block.  But, this doesn't quite work because
      the max tree depth is bigger than what spin_lock_nested can deal with,
      and because locks are sometimes taken before the level field is filled in.
      
      The solution here is to use lockdep_set_class_and_name instead, and to
      set the class before unlocking the pages when the block is read from the
      disk and just after init of a freshly allocated tree block.
      
      btrfs_clear_path_blocking is also changed to take the locks in the proper
      order, and it also makes sure all the locks currently held are properly
      set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
      gets upset about bad lock orderin.
      
      The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      4008c04a
    • Jeff Mahoney's avatar
      Btrfs: remove btrfs_init_path · e00f7308
      Jeff Mahoney authored
      
      
      btrfs_init_path was initially used when the path objects were on the
      stack.  Now all the work is done by btrfs_alloc_path and btrfs_init_path
      isn't required.
      
      This patch removes it, and just uses kmem_cache_zalloc to zero out the object.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      e00f7308
  12. 04 Feb, 2009 2 commits
    • Chris Mason's avatar
      Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Chris Mason authored
      
      
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop,
      most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks,
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      b4ce94de
    • Chris Mason's avatar
      Btrfs: hash_lock is no longer needed · c487685d
      Chris Mason authored
      
      
      Before metadata is written to disk, it is updated to reflect that writeout
      has begun.  Once this update is done, the block must be cow'd before it
      can be modified again.
      
      This update was originally synchronized by using a per-fs spinlock.  Today
      the buffers for the metadata blocks are locked before writeout begins,
      and everyone that tests the flag has the buffer locked as well.
      
      So, the per-fs spinlock (called hash_lock for no good reason) is no
      longer required.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      c487685d
  13. 21 Jan, 2009 2 commits
    • Yan Zheng's avatar
      Btrfs: fix tree logs parallel sync · 7237f183
      Yan Zheng authored
      
      
      To improve performance, btrfs_sync_log merges tree log sync
      requests. But it wrongly merges sync requests for different
      tree logs. If multiple tree logs are synced at the same time,
      only one of them actually gets synced.
      
      This patch has following changes to fix the bug:
      
      Move most tree log related fields in btrfs_fs_info to
      btrfs_root. This allows merging sync requests separately
      for each tree log.
      
      Don't insert root item into the log root tree immediately
      after log tree is allocated. Root item for log tree is
      inserted when log tree get synced for the first time. This
      allows syncing the log root tree without first syncing all
      log trees.
      
      At tree-log sync, btrfs_sync_log first sync the log tree;
      then updates corresponding root item in the log root tree;
      sync the log root tree; then update the super block.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      7237f183
    • Jan Engelhardt's avatar
      Btrfs: change/remove typedef · 95029d7d
      Jan Engelhardt authored
      
      
      Change one typedef to a regular enum, and remove an unused one.
      Signed-off-by: default avatarJan Engelhardt <jengelh@medozas.de>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      95029d7d
  14. 05 Jan, 2009 1 commit
  15. 17 Dec, 2008 1 commit
    • Chris Mason's avatar
      Btrfs: shift all end_io work to thread pools · cad321ad
      Chris Mason authored
      
      
      bio_end_io for reads without checksumming on and btree writes were
      happening without using async thread pools.  This means the extent_io.c
      code had to use spin_lock_irq and friends on the rb tree locks for
      extent state.
      
      There were some irq safe vs unsafe lock inversions between the delallock
      lock and the extent state locks.  This patch gets rid of them by moving
      all end_io code into the thread pools.
      
      To avoid contention and deadlocks between the data end_io processing and the
      metadata end_io processing yet another thread pool is added to finish
      off metadata writes.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      cad321ad
  16. 12 Dec, 2008 1 commit
    • Yan Zheng's avatar
      Btrfs: fix nodatasum handling in balancing code · 17d217fe
      Yan Zheng authored
      
      
      Checksums on data can be disabled by mount option, so it's
      possible some data extents don't have checksums or have
      invalid checksums. This causes trouble for data relocation.
      This patch contains following things to make data relocation
      work.
      
      1) make nodatasum/nodatacow mount option only affects new
      files. Checksums and COW on data are only controlled by the
      inode flags.
      
      2) check the existence of checksum in the nodatacow checker.
      If checksums exist, force COW the data extent. This ensure that
      checksum for a given block is either valid or does not exist.
      
      3) update data relocation code to properly handle the case
      of checksum missing.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      17d217fe
  17. 11 Dec, 2008 1 commit
    • Yan Zheng's avatar
      Btrfs: fix leaking block group on balance · d2fb3437
      Yan Zheng authored
      
      
      The block group structs are referenced in many different
      places, and it's not safe to free while balancing.  So, those block
      group structs were simply leaked instead.
      
      This patch replaces the block group pointer in the inode with the starting byte
      offset of the block group and adds reference counting to the block group
      struct.
      Signed-off-by: default avatarYan Zheng <zheng.yan@oracle.com>
      d2fb3437
  18. 10 Dec, 2008 1 commit
    • Chris Mason's avatar
      Btrfs: Delete csum items when freeing extents · 459931ec
      Chris Mason authored
      
      
      This finishes off the new checksumming code by removing csum items
      for extents that are no longer in use.
      
      The trick is doing it without racing because a single csum item may
      hold csums for more than one extent.  Extra checks are added to
      btrfs_csum_file_blocks to make sure that we are using the correct
      csum item after dropping locks.
      
      A new btrfs_split_item is added to split a single csum item so it
      can be split without dropping the leaf lock.  This is used to
      remove csum bytes from the middle of an item.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      459931ec
  19. 08 Dec, 2008 2 commits
    • Chris Mason's avatar
      Btrfs: Add inode sequence number for NFS and reserved space in a few structs · c3027eb5
      Chris Mason authored
      
      
      This adds a sequence number to the btrfs inode that is increased on
      every update.  NFS will be able to use that to detect when an inode has
      changed, without relying on inaccurate time fields.
      
      While we're here, this also:
      
      Puts reserved space into the super block and inode
      
      Adds a log root transid to the super so we can pick the newest super
      based on the fsync log as well as the main transaction ID.  For now
      the log root transid is always zero, but that'll get fixed.
      
      Adds a starting offset to the dev_item.  This will let us do better
      alignment calculations if we know the start of a partition on the disk.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      c3027eb5
    • Chris Mason's avatar
      Btrfs: move data checksumming into a dedicated tree · d20f7043
      Chris Mason authored
      
      
      Btrfs stores checksums for each data block.  Until now, they have
      been stored in the subvolume trees, indexed by the inode that is
      referencing the data block.  This means that when we read the inode,
      we've probably read in at least some checksums as well.
      
      But, this has a few problems:
      
      * The checksums are indexed by logical offset in the file.  When
      compression is on, this means we have to do the expensive checksumming
      on the uncompressed data.  It would be faster if we could checksum
      the compressed data instead.
      
      * If we implement encryption, we'll be checksumming the plain text and
      storing that on disk.  This is significantly less secure.
      
      * For either compression or encryption, we have to get the plain text
      back before we can verify the checksum as correct.  This makes the raid
      layer balancing and extent moving much more expensive.
      
      * It makes the front end caching code more complex, as we have touch
      the subvolume and inodes as we cache extents.
      
      * There is potentitally one copy of the checksum in each subvolume
      referencing an extent.
      
      The solution used here is to store the extent checksums in a dedicated
      tree.  This allows us to index the checksums by phyiscal extent
      start and length.  It means:
      
      * The checksum is against the data stored on disk, after any compression
      or encryption is done.
      
      * The checksum is stored in a central location, and can be verified without
      following back references, or reading inodes.
      
      This makes compression significantly faster by reducing the amount of
      data that needs to be checksummed.  It will also allow much faster
      raid management code in general.
      
      The checksums are indexed by a key with a fixed objectid (a magic value
      in ctree.h) and offset set to the starting byte of the extent.  This
      allows us to copy the checksum items into the fsync log tree directly (or
      any other tree), without having to invent a second format for them.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      d20f7043
  20. 02 Dec, 2008 3 commits
  21. 20 Nov, 2008 1 commit