1. 02 Apr, 2009 1 commit
  2. 03 Apr, 2009 1 commit
    • Btrfs: kill the pinned_mutex · 04018de5
      Josef Bacik authored
      
      
      This patch removes the pinned_mutex.  The extent io map has an internal tree
      lock that protects the tree itself, and since we only copy the extent io map
      when we are committing the transaction we don't need it there.  We also don't
      need it when caching the block group since searching through the tree is also
      protected by the internal map spin lock.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
  3. 24 Mar, 2009 4 commits
    • Btrfs: optimize fsyncs on old files · af4176b4
      Chris Mason authored
      
      
      The fsync log has code to make sure all of the parents of a file are in the
      log along with the file.  It uses a minimal log of the parent directory
      inodes, just enough to get the parent directory on disk.
      
      If the transaction that originally created a file is fully on disk,
      and the file hasn't been renamed or linked into other directories, we
      can safely skip the parent directory walk.  We know the file is on disk
      somewhere and we can go ahead and just log that single file.
      
      This is more important now because unrelated unlinks in the parent directory
      might make us force a commit if we try to log the parent.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
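
The skip condition described in this commit can be sketched as a small predicate. This is a hedged model, not the kernel code: the `Inode` class and both of its field names are hypothetical, and "hasn't been renamed or linked" is interpreted here as "no rename or link since the last committed transaction".

```python
from dataclasses import dataclass


@dataclass
class Inode:
    # Both fields are hypothetical names used only in this sketch.
    created_transid: int               # transaction that created the file
    last_name_change_transid: int = 0  # last rename or extra link, 0 if none


def can_skip_parent_walk(inode: Inode, last_committed_transid: int) -> bool:
    """True when logging the single file is enough for fsync."""
    # The creating transaction is fully on disk.
    created_on_disk = inode.created_transid <= last_committed_transid
    # No rename or new link has happened since the last commit.
    names_on_disk = inode.last_name_change_transid <= last_committed_transid
    return created_on_disk and names_on_disk
```

An old file that has not changed names skips the parent walk; a file created or renamed in the running transaction does not.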
    • Btrfs: tree logging unlink/rename fixes · 12fcfd22
      Chris Mason authored
      
      
      The tree logging code allows individual files or directories to be logged
      without including operations on other files and directories in the FS.
      It tries to commit the minimal set of changes to disk in order to
      fsync the single file or directory that was sent to fsync or O_SYNC.
      
      The tree logging code was allowing files and directories to be unlinked
      if they were part of a rename operation where only one directory
      in the rename was in the fsync log.  This patch adds a few new rules
      to the tree logging.
      
      1) on rename or unlink, if the inode being unlinked isn't in the fsync
      log, we must force a full commit before doing an fsync of the directory
      where the unlink was done.  The commit isn't done during the unlink,
      but it is forced the next time we try to log the parent directory.
      
      Solution: record transid of last unlink/rename per directory when the
      directory wasn't already logged.  For renames this is only done when
      renaming to a different directory.
      
      mkdir foo/some_dir
      normal commit
      rename foo/some_dir foo2/some_dir
      mkdir foo/some_dir
      fsync foo/some_dir/some_file
      
      The fsync above will unlink the original some_dir without recording
      it in its new location (foo2).  After a crash, some_dir will be gone
      unless the fsync of some_file forces a full commit.
      
      2) we must log any new names for any file or dir that is in the fsync
      log.  This way we make sure not to lose files that are unlinked during
      the same transaction.
      
      2a) we must log any new names for any file or dir during rename
      when the directory they are being removed from was logged.
      
      2a is actually the more important variant.  Without the extra logging,
      a crash might unlink the old name without recreating the new one.
      
      3) after a crash, we must go through any directories with a link count
      of zero and redo the rm -rf.
      
      mkdir f1/foo
      normal commit
      rm -rf f1/foo
      fsync(f1)
      
      The directory f1/foo was fully removed from the FS, but fsync was never
      called on f1/foo, only on its parent dir f1.  After a crash the rm -rf must
      be replayed.  This must be able to recurse down the entire
      directory tree.  The inode link count fixup code takes care of the
      ugly details.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
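
Rule 1 and its solution can be modeled with two transaction ids per object. The sketch below is an illustration under assumptions: the `Node` class, its field names, and both helper functions are hypothetical, and only the bookkeeping described in the message is modeled (no log replay).

```python
class Node:
    """A file or directory with the two transids this sketch needs."""

    def __init__(self):
        self.logged_trans = 0       # last transaction in which it was logged
        self.last_unlink_trans = 0  # last unlink/rename the log does not cover


def record_unlink(directory, inode, transid):
    """Called on unlink, or on rename to a different directory."""
    if inode.logged_trans == transid:
        return  # the inode is in the fsync log, nothing to remember
    if directory.logged_trans == transid:
        return  # the directory was already logged in this transaction
    directory.last_unlink_trans = transid


def log_dir_needs_full_commit(directory, last_committed_transid):
    """Checked the next time something tries to log the directory."""
    return directory.last_unlink_trans > last_committed_transid
```

The commit is not forced at unlink time; the recorded transid only matters when the parent directory is logged before that transaction has committed.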
    • Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay · a74ac322
      Chris Mason authored
      
      
      During log replay, inodes are copied from the log to the main filesystem
      btrees.  Sometimes they have a zero link count in the log but they actually
      gain links during the replay or have some in the main btree.
      
      This patch updates the link count to be at least one after copying the
      inode out of the log.  This makes sure the inode is not deleted during an
      iput while the rest of the replay code is still working on it.
      
      The log replay has fixup code to make sure that link counts are correct
      at the end of the replay, so we could use any non-zero number here and
      it would work fine.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: leave btree locks spinning more often · b9473439
      Chris Mason authored
      
      
      btrfs_mark_buffer_dirty would set dirty bits in the extent_io tree
      for the buffers it was dirtying.  This may require a kmalloc and it
      was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
      set any btree locks they were holding to blocking first.
      
      This commit changes dirty tracking for extent buffers to just use a flag
      in the extent buffer.  Now that we have one and only one extent buffer
      per page, this can be safely done without losing dirty bits along the way.
      
      This also introduces a path->leave_spinning flag that callers of
      btrfs_search_slot can use to indicate they will properly deal with a
      path returned where all the locks are spinning instead of blocking.
      
      Many of the btree search callers now expect spinning paths,
      resulting in better btree concurrency overall.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 12 Feb, 2009 1 commit
  5. 04 Feb, 2009 1 commit
    • Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Chris Mason authored
      
      
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop.
      Most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able to get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks;
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
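
The lock protocol in this commit can be modeled in user space. The sketch below is only an analogy under assumptions: a `threading.Lock` stands in for the spinlock, a boolean for the EXTENT_BUFFER_BLOCKING bit, and a `threading.Condition` for the wait queue. The class and method names are hypothetical, and the adaptive spin against the blocking bit is not modeled.

```python
import threading


class TreeLock:
    def __init__(self):
        self._spin = threading.Lock()   # stands in for the spinlock
        self._blocking = False          # stands in for the blocking bit
        self._wq = threading.Condition(threading.Lock())  # the wait queue

    def tree_lock(self):
        """Return with the spin lock held and the blocking bit clear."""
        while True:
            self._spin.acquire()
            if not self._blocking:
                return
            # Someone holds the lock in blocking mode: drop the spin
            # lock and sleep until the blocking bit goes away.
            self._spin.release()
            with self._wq:
                self._wq.wait_for(lambda: not self._blocking)

    def set_lock_blocking(self):
        """Caller holds the spin lock; the buffer stays logically locked."""
        self._blocking = True
        self._spin.release()

    def clear_lock_blocking(self):
        """Go back to spinning mode; returns with the spin lock held."""
        self._spin.acquire()
        with self._wq:
            self._blocking = False
            self._wq.notify_all()

    def tree_unlock(self):
        """Works for both spinning and blocking holders."""
        if self._blocking:
            with self._wq:
                self._blocking = False
                self._wq.notify_all()
        else:
            self._spin.release()
```

A holder calls `set_lock_blocking()` before any operation that may sleep, and other lockers wait on the condition variable instead of spinning while the bit is set.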
  6. 21 Jan, 2009 1 commit
    • Btrfs: fix tree logs parallel sync · 7237f183
      Yan Zheng authored
      
      
      To improve performance, btrfs_sync_log merges tree log sync
      requests. But it wrongly merges sync requests for different
      tree logs. If multiple tree logs are synced at the same time,
      only one of them actually gets synced.
      
      This patch makes the following changes to fix the bug:
      
      Move most tree log related fields in btrfs_fs_info to
      btrfs_root. This allows merging sync requests separately
      for each tree log.
      
      Don't insert the root item into the log root tree immediately
      after the log tree is allocated. The root item for a log tree is
      inserted when the log tree gets synced for the first time. This
      allows syncing the log root tree without first syncing all
      log trees.
      
      At tree-log sync, btrfs_sync_log first syncs the log tree,
      then updates the corresponding root item in the log root tree,
      then syncs the log root tree, and finally updates the super block.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  7. 09 Jan, 2009 1 commit
    • Btrfs: explicitly mark the tree log root for writeback · e293e97e
      Chris Mason authored
      
      
      Each subvolume has an extent_state_tree used to mark metadata
      that needs to be sent to disk while syncing the tree.  This is
      used in addition to the dirty bits on the pages themselves so that
      a single subvolume can be sent to disk efficiently in disk order.
      
      Normally this marking happens in btrfs_alloc_free_block, which also does
      special recording of dirty tree blocks for the tree log roots.
      
      Yan Zheng noticed that when the root of the log tree is allocated, it is added
      to the wrong writeback list.  The fix used here is to explicitly set
      it dirty as part of tree log creation.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  8. 06 Jan, 2009 1 commit
    • Btrfs: tree logging checksum fixes · 07d400a6
      Yan Zheng authored
      
      
      This patch contains the following changes:
      
      1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE.  This
      struct is kmalloced so we want to keep it reasonable.
      
      2) Replace copy_extent_csums by btrfs_lookup_csums_range.  This was
      duplicated code in tree-log.c
      
      3) Remove replay_one_csum. csum items are replayed at the same time as
         replaying file extents. This guarantees we only replay useful csums.
      
      4) nbytes accounting fix.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  9. 05 Jan, 2009 2 commits
  10. 17 Dec, 2008 1 commit
  11. 08 Dec, 2008 3 commits
    • Btrfs: Fix compressed checksum fsync log copies · 580afd76
      Chris Mason authored
      
      
      The fsync logging code makes sure to only copy the relevant checksum for each
      extent based on the file extent pointers it finds.
      
      But for compressed extents, it needs to copy the checksum for the
      entire extent.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
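
The difference between the two cases can be sketched as a range computation. This is a minimal illustration, assuming checksums are looked up by on-disk byte range; the function and parameter names are hypothetical and only loosely mirror the file extent fields.

```python
def csum_copy_range(disk_bytenr, disk_num_bytes, extent_offset, num_bytes,
                    compressed):
    """Return the inclusive (first, last) disk byte whose checksums to copy."""
    if compressed:
        # A compressed extent can only be read as a whole, so the
        # checksums for the entire on-disk extent are needed.
        start, length = disk_bytenr, disk_num_bytes
    else:
        # A regular extent only needs the part the file actually references.
        start, length = disk_bytenr + extent_offset, num_bytes
    return start, start + length - 1
```

For a regular extent the range follows the file extent pointer; for a compressed extent the offset and length inside the extent are ignored.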
    • Btrfs: superblock duplication · a512bbf8
      Yan Zheng authored
      
      
      This patch implements superblock duplication. Superblocks
      are stored at offsets 16K, 64M and 256G on every device.
      Space used by superblocks is reserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
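
The three offsets named above are each 4096 times the previous one (16K shifted left by 12 and by 24 bits). The sketch below works out which copies fit on a device of a given size. It is a hedged illustration: the constant names, the helper name, and the 4096-byte superblock size are assumptions of this sketch.

```python
KIB = 1024

# Offsets taken from the commit message: 16K, 64M and 256G.
SUPER_OFFSETS = [16 * KIB, 64 * KIB * KIB, 256 * KIB * KIB * KIB]
SUPER_SIZE = 4096  # assumed size of one superblock copy


def superblock_copies(device_bytes):
    """Return the offsets of the superblock copies that fit on the device."""
    return [off for off in SUPER_OFFSETS if off + SUPER_SIZE <= device_bytes]
```

A small device only carries the first copy, while a device larger than 256G carries all three.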
    • Btrfs: move data checksumming into a dedicated tree · d20f7043
      Chris Mason authored
      
      
      Btrfs stores checksums for each data block.  Until now, they have
      been stored in the subvolume trees, indexed by the inode that is
      referencing the data block.  This means that when we read the inode,
      we've probably read in at least some checksums as well.
      
      But, this has a few problems:
      
      * The checksums are indexed by logical offset in the file.  When
      compression is on, this means we have to do the expensive checksumming
      on the uncompressed data.  It would be faster if we could checksum
      the compressed data instead.
      
      * If we implement encryption, we'll be checksumming the plain text and
      storing that on disk.  This is significantly less secure.
      
      * For either compression or encryption, we have to get the plain text
      back before we can verify the checksum as correct.  This makes the raid
      layer balancing and extent moving much more expensive.
      
      * It makes the front end caching code more complex, as we have to touch
      the subvolume and inodes as we cache extents.
      
      * There is potentially one copy of the checksum in each subvolume
      referencing an extent.
      
      The solution used here is to store the extent checksums in a dedicated
      tree.  This allows us to index the checksums by physical extent
      start and length.  It means:
      
      * The checksum is against the data stored on disk, after any compression
      or encryption is done.
      
      * The checksum is stored in a central location, and can be verified without
      following back references, or reading inodes.
      
      This makes compression significantly faster by reducing the amount of
      data that needs to be checksummed.  It will also allow much faster
      raid management code in general.
      
      The checksums are indexed by a key with a fixed objectid (a magic value
      in ctree.h) and offset set to the starting byte of the extent.  This
      allows us to copy the checksum items into the fsync log tree directly (or
      any other tree), without having to invent a second format for them.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
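
The effect of indexing checksums by on-disk bytes can be shown with a toy model. Everything below is an assumption of this sketch: `zlib.crc32` stands in for the real checksum function, the 4096-byte sector size and the `CSUM_OBJECTID` value are made up, and the item layout is flattened to one dictionary entry per sector.

```python
import zlib

SECTOR = 4096        # assumed checksum granularity
CSUM_OBJECTID = -10  # hypothetical stand-in for the fixed objectid


def csum_items(extent_start, stored_bytes):
    """Checksum the bytes as stored on disk, keyed by disk byte offset."""
    items = {}
    for i in range(0, len(stored_bytes), SECTOR):
        key = (CSUM_OBJECTID, extent_start + i)
        items[key] = zlib.crc32(stored_bytes[i:i + SECTOR])
    return items


def verify(items, extent_start, stored_bytes):
    """Verify stored bytes without decompressing or reading any inode."""
    return items == csum_items(extent_start, stored_bytes)
```

With 64K of highly compressible file data, checksumming the stored (compressed) bytes touches one sector instead of sixteen, and the same keys can be copied into another tree unchanged.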
  12. 02 Dec, 2008 2 commits
  13. 30 Oct, 2008 1 commit
    • Btrfs: Add fallocate support v2 · d899e052
      Yan Zheng authored
      
      This patch updates btrfs-progs for fallocate support.
      
      fallocate is a little different in Btrfs because we need to tell the
      COW system that a given preallocated extent doesn't need to be
      cow'd as long as there are no snapshots of it.  This leverages the
      -o nodatacow checks.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  14. 29 Oct, 2008 3 commits
    • Btrfs: Add root tree pointer transaction ids · 84234f3a
      Yan Zheng authored
      
      
      This patch adds transaction IDs to root tree pointers.
      Transaction IDs in tree pointers are compared with the
      generation numbers in block headers when reading root
      blocks of trees. This can detect some types of IO errors.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
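
The check described here amounts to comparing two numbers when a root block is read. A minimal sketch, with a hypothetical function name and plain integers in place of the on-disk structures:

```python
def root_block_matches(pointer_transid, header_generation):
    """True when the block read from disk is the one the pointer expects.

    A mismatch suggests the read returned a stale or misplaced block,
    which is one kind of IO error this comparison can catch.
    """
    return pointer_transid == header_generation
```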
    • Btrfs: nuke fs wide allocation mutex V2 · 25179201
      Josef Bacik authored
      
      
      This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
      of little locks.
      
      There is now a pinned_mutex, which is used when messing with the pinned_extents
      extent io tree, and the extent_ins_mutex which is used with the pending_del and
      extent_ins extent io trees.
      
      The locking for the extent tree stuff was inspired by a patch that Yan Zheng
      wrote to fix a race condition.  I cleaned it up some and changed the locking
      around a little bit, but the idea remains the same.  Basically instead of
      holding the extent_ins_mutex throughout the processing of an extent on the
      extent_ins or pending_del trees, we just hold it while we're searching and when
      we clear the bits on those trees, and lock the extent for the duration of the
      operations on the extent.
      
      Also, to keep from getting hung up waiting to lock an extent, I've added a
      try_lock_extent: if we cannot lock the extent, we move on to the next one in the
      tree and come back to that one later.  I have tested this heavily and it does
      not appear to break anything.  This has to be applied on top of my
      find_free_extent redo patch.
      
      I tested this patch on top of Yan's space rebalancing code and it worked fine.
      The only thing that has changed since the last version is that I pulled out all my
      debugging stuff; apparently I forgot to run guilt refresh before I sent the
      last patch out.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
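
The "skip it and come back" behavior of the trylock can be sketched as a work queue. This is an illustration only: `process_extents` and its callback parameters are hypothetical, and the sketch assumes every extent eventually becomes lockable (otherwise the loop would not terminate).

```python
from collections import deque


def process_extents(extents, try_lock, process, unlock):
    """Process every extent, deferring the ones that are currently locked."""
    pending = deque(extents)
    while pending:
        extent = pending.popleft()
        if not try_lock(extent):
            # Somebody else holds this extent: move on and retry it later
            # instead of sleeping on the lock.
            pending.append(extent)
            continue
        try:
            process(extent)
        finally:
            unlock(extent)
```

The extent lock is only held for the duration of the work on that one extent, which matches the idea of holding the tree-wide mutex just for searching and clearing bits.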
    • Btrfs: Add zlib compression support · c8b97818
      Chris Mason authored
      
      
      This is a large change for adding compression on reading and writing,
      both for inline and regular extents.  It does some fairly large
      surgery to the writeback paths.
      
      Compression is off by default and enabled by mount -o compress.  Even
      when the -o compress mount option is not used, it is possible to read
      compressed extents off the disk.
      
      If compression for a given set of pages fails to make them smaller, the
      file is flagged to avoid future compression attempts.
      
      * While finding delalloc extents, the pages are locked before being sent down
      to the delalloc handler.  This allows the delalloc handler to do complex things
      such as cleaning the pages, marking them writeback and starting IO on their
      behalf.
      
      * Inline extents are inserted at delalloc time now.  This allows us to compress
      the data before inserting the inline extent, and it allows us to insert
      an inline extent that spans multiple pages.
      
      * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
      are changed to record both an in-memory size and an on disk size, as well
      as a flag for compression.
      
      From a disk format point of view, the extent pointers in the file are changed
      to record the on disk size of a given extent and some encoding flags.
      Space in the disk format is allocated for compression encoding, as well
      as encryption and a generic 'other' field.  Neither the encryption nor the
      'other' field is currently used.
      
      In order to limit the amount of data read for a single random read in the
      file, the size of a compressed extent is limited to 128k.  This is a
      software only limit; the disk format supports u64 sized compressed extents.
      
      In order to limit the ram consumed while processing extents, the uncompressed
      size of a compressed extent is limited to 256k.  This is a software only limit
      and will be subject to tuning later.
      
      Checksumming is still done on compressed extents, and it is done on the
      uncompressed version of the data.  This way additional encodings can be
      layered on without having to figure out which encoding to checksum.
      
      Compression happens at delalloc time, which is basically single threaded because
      it is usually done by a single pdflush thread.  This makes it tricky to
      spread the compression load across all the cpus on the box.  We'll have to
      look at parallel pdflush walks of dirty inodes at a later time.
      
      Decompression is hooked into readpages and it does spread across CPUs nicely.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
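
The per-chunk policy described in this commit (compress, keep the result only if it is smaller, otherwise flag the file) can be sketched with Python's `zlib` module. This is a user-space illustration under assumptions: `store_chunk`, the `file_flags` set and the `"nocompress"` flag name are hypothetical, and the two size caps are the software limits quoted in the message.

```python
import zlib

MAX_UNCOMPRESSED = 256 * 1024  # cap on uncompressed bytes per compressed extent
MAX_COMPRESSED = 128 * 1024    # cap on compressed bytes per extent


def store_chunk(data, file_flags):
    """Return (bytes_to_store, is_compressed) for one chunk of file data."""
    if len(data) > MAX_UNCOMPRESSED:
        raise ValueError("split the range into chunks of at most 256k first")
    if "nocompress" in file_flags:
        return data, False  # an earlier attempt on this file did not help
    packed = zlib.compress(data)
    if len(packed) < len(data) and len(packed) <= MAX_COMPRESSED:
        return packed, True
    # Compression did not make the data smaller: remember that so later
    # writes to this file skip the attempt.
    file_flags.add("nocompress")
    return data, False
```

Reading works regardless of the flag, since each stored extent records whether it is compressed.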
  15. 09 Oct, 2008 2 commits
    • Btrfs: Remove offset field from struct btrfs_extent_ref · 3bb1a1bc
      Yan Zheng authored
      
      
      The offset field in struct btrfs_extent_ref records the position
      inside the file at which the file extent is referenced. In the new back
      reference system, tree leaves holding references to file extents
      are recorded explicitly. We can scan these tree leaves very quickly, so the
      offset field is not required.
      
      This patch also makes the back reference system check the objectid
      when extents are being deleted.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: Count space allocated to file in bytes · a76a3cd4
      Yan Zheng authored
      
      
      This patch makes btrfs count space allocated to file in bytes instead
      of 512 byte sectors.
      
      Everything else in btrfs uses a byte count instead of sector sizes or
      block sizes, so this fits better.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  16. 25 Sep, 2008 10 commits