1. 10 Jun, 2009 2 commits
    • Btrfs: fdatasync should skip metadata writeout · 524724ed
      Hisashi Hifumi authored
      
      
      In btrfs, fdatasync and fsync are identical, but
      fdatasync should skip committing the transaction when
      inode->i_state has only I_DIRTY_SYNC set, which indicates
      only atime and/or mtime updates.
      The following patch improves fdatasync throughput.
      
      --file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
      --file-fsync-mode=fdatasync run
      
      Results:
      -2.6.30-rc8
      Test execution summary:
          total time:                          1980.6540s
          total number of events:              10001
          total time taken by event execution: 1192.9804
          per-request statistics:
               min:                            0.0000s
               avg:                            0.1193s
               max:                            15.3720s
               approx.  95 percentile:         0.7257s
      
      Threads fairness:
          events (avg/stddev):           625.0625/151.32
          execution time (avg/stddev):   74.5613/9.46
      
      -2.6.30-rc8-patched
      Test execution summary:
          total time:                          1695.9118s
          total number of events:              10000
          total time taken by event execution: 871.3214
          per-request statistics:
               min:                            0.0000s
               avg:                            0.0871s
               max:                            10.4644s
               approx.  95 percentile:         0.4787s
      
      Threads fairness:
          events (avg/stddev):           625.0000/131.86
          execution time (avg/stddev):   54.4576/8.98
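
The skip rule described at the top of this message can be sketched in a few lines. This is a toy Python model rather than the kernel change itself: the flag values and the helper name `needs_commit` are assumptions made for illustration, and only the rule (a datasync request may skip the transaction commit when nothing but I_DIRTY_SYNC is set) comes from the commit text.

```python
# Illustrative flag values only; these are NOT the kernel's definitions.
I_DIRTY_SYNC = 1 << 0      # inode dirty, but only for timestamp-style updates
I_DIRTY_DATASYNC = 1 << 1  # metadata that fdatasync must also persist
I_DIRTY_PAGES = 1 << 2     # dirty data pages

DIRTY_MASK = I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES


def needs_commit(i_state: int, datasync: bool) -> bool:
    """Return False when a datasync call can skip the transaction commit.

    Hypothetical helper: models the rule from the commit message, nothing more.
    """
    dirty = i_state & DIRTY_MASK
    if datasync and dirty == I_DIRTY_SYNC:
        # Only atime/mtime changed, so fdatasync has nothing to commit.
        return False
    # Everything else takes the normal fsync path.
    return True
```

With this model, an fsync on the same inode state still commits; only the fdatasync variant takes the shortcut.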
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng authored
      
      
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
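
The two COW cases above can be modeled as follows. This is an illustrative Python sketch, not btrfs code: the dictionary layout, the integer block ids and the function name `cow_block` are all assumptions, and back ref updates are left out entirely.

```python
def cow_block(blocks, old_id):
    """Copy-on-write one tree block in a toy block table.

    blocks maps a block id to {"refs": int, "children": [block ids]}.
    Returns the id of the new copy. All names here are hypothetical.
    """
    old = blocks[old_id]
    new_id = max(blocks) + 1
    if old["refs"] == 1:
        # Non-shared block: free the old block at once. The new block
        # inherits its references, so child reference counts do not change.
        blocks[new_id] = {"refs": old["refs"], "children": list(old["children"])}
        del blocks[old_id]
    else:
        # Shared block: the new copy adds one reference to every extent it
        # points to, and the old block loses one reference.
        blocks[new_id] = {"refs": 1, "children": list(old["children"])}
        for child in old["children"]:
            blocks[child]["refs"] += 1
        old["refs"] -= 1
    return new_id
```

In the non-shared case no lower level reference count is touched, which is the saving the commit message describes.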
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update back ref for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about the pointer's key, level and in which
      tree the pointer lives. This information allows us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds a per-subvolume red-black tree to keep track of cached
      inodes. The red-black tree helps the balancing code find cached
      inodes whose inode numbers are within a given range.

      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduces the overhead of checking back refs.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  2. 27 Apr, 2009 1 commit
  3. 24 Apr, 2009 1 commit
    • Btrfs: fix fallocate deadlock on inode extent lock · e980b50c
      Chris Mason authored
      
      
      The btrfs fallocate call takes an extent lock on the entire range
      being fallocated, and then runs through insert_reserved_extent on each
      extent as they are allocated.
      
      The problem with this is that btrfs_drop_extents may decide to try
      and take the same extent lock fallocate was already holding.  The solution
      used here is to push down knowledge of the range that is already locked
      going into btrfs_drop_extents.
      
      It turns out that at least one other caller had the same bug.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 21 Apr, 2009 1 commit
    • Btrfs: fix btrfs fallocate oops and deadlock · 546888da
      Chris Mason authored
      
      
      Btrfs fallocate was incorrectly starting a transaction with a lock held
      on the extent_io tree for the file, which could deadlock.  Strictly
      speaking it was using join_transaction which would be safe, but it is better
      to move the transaction outside of the lock.
      
      When preallocated extents are overwritten, btrfs_mark_buffer_dirty was
      being called on an unlocked buffer.  This was triggering an assertion and
      oops because the lock is supposed to be held.
      
      The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had
      been run.  btrfs_del_item takes care of dirtying things, so the solution is a
      to skip the btrfs_mark_buffer_dirty call in this case.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      546888da
  5. 20 Apr, 2009 1 commit
    • Btrfs: add a priority queue to the async thread helpers · d313d7a3
      Chris Mason authored
      
      
      Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
      higher priority.  But, the checksumming helper threads prevent it
      from being fully effective.
      
      There are two problems.  First, a big queue of pending checksumming
      will delay the synchronous IO behind other lower priority writes.  Second,
      the checksumming uses an ordered async work queue.  The ordering makes sure
      that IOs are sent to the block layer in the same order they are sent
      to the checksumming threads.  Usually this gives us less seeky IO.
      
      But, when we start mixing IO priorities, the lower priority IO can delay
      the higher priority IO.
      
      This patch solves both problems by adding a high priority list to the async
      helper threads, and a new btrfs_set_work_high_prio(), which is used
      to put a new async work item onto the higher priority list.
      
      The ordering is still done on high priority IO, but all of the high
      priority bios are ordered separately from the low priority bios.  This
      ordering is purely an IO optimization, it is not involved in data
      or metadata integrity.
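
As a rough illustration of the two-list scheme, the sketch below keeps one FIFO per priority and always drains the high priority list first. It is a Python toy under stated assumptions: the class and method names are hypothetical, and `high_prio=True` merely stands in for what btrfs_set_work_high_prio() marks on a work item.

```python
from collections import deque


class ToyWorkQueue:
    """Two FIFO lists: high priority work is always picked before normal work."""

    def __init__(self):
        self.prio_pending = deque()  # high priority items, in submission order
        self.pending = deque()       # normal items, in submission order

    def queue_work(self, item, high_prio=False):
        # Each priority level keeps its own ordering.
        target = self.prio_pending if high_prio else self.pending
        target.append(item)

    def next_work(self):
        """Return the next item to run, or None if both lists are empty."""
        if self.prio_pending:
            return self.prio_pending.popleft()
        if self.pending:
            return self.pending.popleft()
        return None
```

Because each list is drained in FIFO order, high priority items stay ordered among themselves, and low priority items do the same, without one class delaying the other's ordering.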
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  6. 31 Mar, 2009 1 commit
    • Btrfs: add extra flushing for renames and truncates · 5a3f23d5
      Chris Mason authored
      
      
      Renames and truncates are both common ways to replace old data with new
      data.  The filesystem can make an effort to make sure the new data is
      on disk before actually replacing the old data.
      
      This is especially important for rename, which many applications use as
      though it were atomic for both the data and the metadata involved.  The
      current btrfs code will happily replace a file that is fully on disk
      with one that was just created and still has pending IO.
      
      If we crash after transaction commit but before the IO is done, we'll end
      up replacing a good file with a zero length file.  The solution used
      here is to create a list of inodes that need special ordering and force
      them to disk before the commit is done.  This is similar to the
      ext3 style data=ordered mode, except it is only done on selected files.

      Btrfs is able to get away with this because it does not wait on commits
      very often, even for fsync (which uses a sub-commit).
      
      For renames, we order the file when it wasn't already
      on disk and when it is replacing an existing file.  Larger files
      are sent to filemap_flush right away (before the transaction handle is
      opened).
      
      For truncates, we order if the file goes from non-zero size down to
      zero size.  This is a little different, because at the time of the
      truncate the file has no dirty bytes to order.  But, we flag the inode
      so that it is added to the ordered list on close (via release method).  We
      also immediately add it to the ordered list of the current transaction
      so that we can try to flush down any writes the application sneaks in
      before commit.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  7. 24 Mar, 2009 3 commits
    • Btrfs: tree logging unlink/rename fixes · 12fcfd22
      Chris Mason authored
      
      
      The tree logging code allows individual files or directories to be logged
      without including operations on other files and directories in the FS.
      It tries to commit the minimal set of changes to disk in order to
      fsync the single file or directory that was sent to fsync or O_SYNC.
      
      The tree logging code was allowing files and directories to be unlinked
      if they were part of a rename operation where only one directory
      in the rename was in the fsync log.  This patch adds a few new rules
      to the tree logging.
      
      1) on rename or unlink, if the inode being unlinked isn't in the fsync
      log, we must force a full commit before doing an fsync of the directory
      where the unlink was done.  The commit isn't done during the unlink,
      but it is forced the next time we try to log the parent directory.
      
      Solution: record transid of last unlink/rename per directory when the
      directory wasn't already logged.  For renames this is only done when
      renaming to a different directory.
      
      mkdir foo/some_dir
      normal commit
      rename foo/some_dir foo2/some_dir
      mkdir foo/some_dir
      fsync foo/some_dir/some_file
      
      The fsync above will unlink the original some_dir without recording
      it in its new location (foo2).  After a crash, some_dir will be gone
      unless the fsync of some_file forces a full commit
      
      2) we must log any new names for any file or dir that is in the fsync
      log.  This way we make sure not to lose files that are unlinked during
      the same transaction.
      
      2a) we must log any new names for any file or dir during rename
      when the directory they are being removed from was logged.
      
      2a is actually the more important variant.  Without the extra logging
      a crash might unlink the old name without recreating the new one
      
      3) after a crash, we must go through any directories with a link count
      of zero and redo the rm -rf
      
      mkdir f1/foo
      normal commit
      rm -rf f1/foo
      fsync(f1)
      
      The directory f1/foo was fully removed from the FS, but fsync was
      never called on f1/foo, only on its parent dir.  After a crash the
      rm -rf must be replayed.  This must be able to recurse down the entire
      directory tree.  The inode link count fixup code takes care of the
      ugly details.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: leave btree locks spinning more often · b9473439
      Chris Mason authored
      
      
      btrfs_mark_buffer_dirty would set dirty bits in the extent_io tree
      for the buffers it was dirtying.  This may require a kmalloc and it
      was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
      set any btree locks they were holding to blocking first.
      
      This commit changes dirty tracking for extent buffers to just use a flag
      in the extent buffer.  Now that we have one and only one extent buffer
      per page, this can be safely done without losing dirty bits along the way.
      
      This also introduces a path->leave_spinning flag that callers of
      btrfs_search_slot can use to indicate they will properly deal with a
      path returned where all the locks are spinning instead of blocking.
      
      Many of the btree search callers now expect spinning paths,
      resulting in better btree concurrency overall.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason authored
      
      
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
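
One way to picture the pending-update tree is sketched below. This is a Python toy, not the kernel data structure: a dict sorted at run time stands in for the rbtree, the names `DelayedRefs`, `add` and `run` are hypothetical, and cancelling a +1 against a later -1 on the same key is an assumed reading of how the number of modifications is reduced.

```python
class DelayedRefs:
    """Queue reference count changes and apply them later in key order."""

    def __init__(self):
        # (extent bytenr, parent bytenr) -> net reference count change
        self.pending = {}

    def add(self, bytenr, parent, delta):
        """Record a change now; nothing touches the extent tree yet."""
        key = (bytenr, parent)
        total = self.pending.get(key, 0) + delta
        if total == 0:
            # Opposite updates on the same back reference cancel out.
            self.pending.pop(key, None)
        else:
            self.pending[key] = total

    def run(self, apply):
        """Apply queued changes ordered by extent, then parent, byte number."""
        for bytenr, parent in sorted(self.pending):
            apply(bytenr, parent, self.pending[(bytenr, parent)])
        self.pending.clear()
```

Calling `run` before the transaction commits mirrors the requirement that the delayed updates are flushed at commit time.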
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  8. 20 Feb, 2009 2 commits
    • Btrfs: add better -ENOSPC handling · 6a63209f
      Josef Bacik authored
      
      
      This is a step in the direction of better -ENOSPC handling.  Instead of
      checking the global bytes counter we check the space_info bytes counters to
      make sure we have enough space.
      
      If we don't, we go ahead and try to allocate a new chunk, and if that fails
      we return -ENOSPC.  This patch adds two counters to btrfs_space_info,
      bytes_delalloc and bytes_may_use.

      bytes_delalloc accounts for extents we've actually set up for delalloc and
      that will be allocated at some point down the line.

      bytes_may_use keeps track of how many bytes we may use for delalloc at
      some point.  When we actually set the extent_bit for the delalloc bytes we
      subtract the reserved bytes from the bytes_may_use counter.  This keeps us
      from being unable to allocate space for delalloc bytes later.
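
The accounting above can be sketched as a small model. This is a Python illustration under assumptions, not the btrfs implementation: the class, the `reserve` and `mark_delalloc` method names, the `alloc_chunk` callback and the exact form of the free-space check are hypothetical; only the two counter names and the "try a new chunk, then return -ENOSPC" order come from the commit text.

```python
import errno


class ToySpaceInfo:
    """Toy model of per-space_info byte counters."""

    def __init__(self, total_bytes, bytes_used=0):
        self.total_bytes = total_bytes
        self.bytes_used = bytes_used
        self.bytes_may_use = 0   # reserved, may become delalloc later
        self.bytes_delalloc = 0  # set up for delalloc, allocated later

    def _accounted(self):
        return self.bytes_used + self.bytes_may_use + self.bytes_delalloc

    def reserve(self, num_bytes, alloc_chunk=lambda: 0):
        """Reserve space, growing by one chunk if needed; 0 or -ENOSPC."""
        if self._accounted() + num_bytes > self.total_bytes:
            # alloc_chunk returns the number of bytes added (0 on failure).
            self.total_bytes += alloc_chunk()
        if self._accounted() + num_bytes > self.total_bytes:
            return -errno.ENOSPC
        self.bytes_may_use += num_bytes
        return 0

    def mark_delalloc(self, num_bytes):
        """Move bytes from 'may use' to 'delalloc' when the extent bit is set."""
        self.bytes_may_use -= num_bytes
        self.bytes_delalloc += num_bytes
```

Checking the per-space_info counters, rather than a single global byte count, is what lets the reservation fail early instead of at allocation time.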
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Btrfs: check file pointer in btrfs_sync_file · 2cfbd50b
      Chris Mason authored
      
      
      fsync can be called by NFS with a null file pointer, and btrfs was
      oopsing in this case.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  9. 21 Jan, 2009 2 commits
    • Btrfs: fix tree logs parallel sync · 7237f183
      Yan Zheng authored
      
      
      To improve performance, btrfs_sync_log merges tree log sync
      requests. But it wrongly merges sync requests for different
      tree logs. If multiple tree logs are synced at the same time,
      only one of them actually gets synced.
      
      This patch makes the following changes to fix the bug:

      Move most tree log related fields in btrfs_fs_info to
      btrfs_root. This allows merging sync requests separately
      for each tree log.

      Don't insert the root item into the log root tree immediately
      after the log tree is allocated. The root item for a log tree is
      inserted when the log tree gets synced for the first time. This
      allows syncing the log root tree without first syncing all
      log trees.

      At tree-log sync, btrfs_sync_log first syncs the log tree,
      then updates the corresponding root item in the log root tree,
      then syncs the log root tree, and finally updates the super block.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: removed unused #include <version.h>'s · 7eaebe7d
      Huang Weiyi authored
      
      
      Removed unused #include <version.h>'s in btrfs
      Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  10. 06 Jan, 2009 1 commit
  11. 05 Jan, 2009 2 commits
  12. 12 Dec, 2008 1 commit
    • Btrfs: fix nodatasum handling in balancing code · 17d217fe
      Yan Zheng authored
      
      
      Checksums on data can be disabled by mount option, so it's
      possible some data extents don't have checksums or have
      invalid checksums. This causes trouble for data relocation.
      This patch makes the following changes so that data relocation
      works.

      1) make the nodatasum/nodatacow mount options affect only new
      files. Checksums and COW on data are controlled only by the
      inode flags.

      2) check for the existence of checksums in the nodatacow checker.
      If checksums exist, force COW of the data extent. This ensures that
      the checksum for a given block is either valid or does not exist.

      3) update the data relocation code to properly handle
      missing checksums.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  13. 08 Dec, 2008 2 commits
    • Btrfs: Fix compressed checksum fsync log copies · 580afd76
      Chris Mason authored
      
      
      The fsync logging code makes sure to only copy the relevant checksum for each
      extent based on the file extent pointers it finds.
      
      But for compressed extents, it needs to copy the checksum for the
      entire extent.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Add inode sequence number for NFS and reserved space in a few structs · c3027eb5
      Chris Mason authored
      
      
      This adds a sequence number to the btrfs inode that is increased on
      every update.  NFS will be able to use that to detect when an inode has
      changed, without relying on inaccurate time fields.
      
      While we're here, this also:
      
      Puts reserved space into the super block and inode
      
      Adds a log root transid to the super so we can pick the newest super
      based on the fsync log as well as the main transaction ID.  For now
      the log root transid is always zero, but that'll get fixed.
      
      Adds a starting offset to the dev_item.  This will let us do better
      alignment calculations if we know the start of a partition on the disk.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  14. 02 Dec, 2008 1 commit
  15. 12 Nov, 2008 1 commit
    • Btrfs: Fix race in btrfs_mark_extent_written · c36047d7
      Yan Zheng authored
      
      
      When an extent needs to be split, btrfs_mark_extent_written truncates the extent
      first, then inserts a new extent and increases the reference count.

      The race happens if someone else deletes the old extent before the new extent
      is inserted. The fix here is to increase the reference count in advance. This race
      is similar to the race in btrfs_drop_extents that was recently fixed.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  16. 11 Nov, 2008 1 commit
    • Btrfs: Fix starting search offset inside btrfs_drop_extents · 8247b41a
      Yan Zheng authored
      
      
      btrfs_drop_extents will drop paths and search again when it needs to
      force COW of higher nodes.  It was using the key it found during the last
      search as the offset for the next search.
      
      But, this wasn't always correct.  The key could be from before our desired
      range, and because we're dropping the path, it is possible for the file's items
      to change while we do the search again.
      
      The fix here is to make sure we don't search for something smaller than
      the offset btrfs_drop_extents was called with.
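
The fix amounts to clamping the restart offset. A minimal Python sketch, with a hypothetical helper name and plain integers standing in for key offsets:

```python
def next_search_offset(found_key_offset: int, start: int) -> int:
    """Pick the offset for the next search after the path was dropped.

    found_key_offset is the offset of the key found by the previous search;
    start is the offset the drop operation was originally called with.
    The result never goes below start, even if the found key lies before it.
    """
    return max(found_key_offset, start)
```

If the previous search landed on a key before the requested range, the next search begins at `start` instead of at that earlier key.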
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  17. 10 Nov, 2008 2 commits
  18. 06 Nov, 2008 1 commit
    • Btrfs: Optimize compressed writeback and reads · 771ed689
      Chris Mason authored
      
      
      When reading compressed extents, try to put pages into the page cache
      for any pages covered by the compressed extent that readpages didn't already
      preload.
      
      Add an async work queue to handle transformations at delayed allocation processing
      time.  Right now this is just compression.  The workflow is:
      
      1) Find offsets in the file marked for delayed allocation
      2) Lock the pages
      3) Lock the state bits
      4) Call the async delalloc code
      
      The async delalloc code clears the state lock bits and delalloc bits.  It is
      important this happens before the range goes into the work queue because
      otherwise it might deadlock with other work queue items that try to lock
      those extent bits.
      
      The file pages are compressed, and if the compression doesn't work the
      pages are written back directly.
      
      An ordered work queue is used to make sure the inodes are written in the same
      order that pdflush or writepages sent them down.
      
      This changes extent_write_cache_pages to let the writepage function
      update the wbc nr_written count.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  19. 31 Oct, 2008 1 commit
    • Btrfs: Compression corner fixes · 70b99e69
      Chris Mason authored
      
      
      Make sure we keep page->mapping NULL on the pages we're getting
      via alloc_page.  It gets set so a few of the callbacks can do the right
      thing, but in general these pages don't have a mapping.
      
      Don't try to truncate compressed inline items in btrfs_drop_extents.
      The whole compressed item must be preserved.
      
      Don't try to create multipage inline compressed items.  When we try to
      overwrite just the first page of the file, we would have to read in and recow
      all the pages after it in the same compressed inline items.  For now, only
      create single page inline items.
      
      Make sure we lock pages in the correct order during delalloc.  The
      search into the state tree for delalloc bytes can return bytes before
      the page we already have locked.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  20. 30 Oct, 2008 3 commits
    • Btrfs: Add fallocate support v2 · d899e052
      Yan Zheng authored
      
      This patch updates btrfs-progs for fallocate support.
      
      fallocate is a little different in Btrfs because we need to tell the
      COW system that a given preallocated extent doesn't need to be
      cow'd as long as there are no snapshots of it.  This leverages the
      -o nodatacow checks.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: Fix bookend extent race v2 · 6643558d
      Yan Zheng authored
      
      
      When dropping the middle part of an extent, btrfs_drop_extents first truncates
      the extent, then inserts a bookend extent.

      Since truncation and insertion can't be done atomically, there is a small
      window during which the bookend extent isn't in the tree. This causes problems
      for functions that search the tree for file extent items. The fix is to
      lock the range of the bookend extent before truncation.
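
The two-step split can be pictured with the toy function below. It is a Python sketch with hypothetical names, using half-open `(start, end)` byte ranges in place of file extent items, and it only computes the two resulting pieces; the locking itself is described in comments rather than modeled.

```python
def drop_middle(extent, drop_start, drop_end):
    """Split extent (start, end) around the dropped range [drop_start, drop_end).

    Returns (head, bookend), both as half-open (start, end) tuples.
    """
    start, end = extent
    if not (start < drop_start < drop_end < end):
        raise ValueError("dropped range must lie strictly inside the extent")
    # Step 1: truncate the original extent so it ends where the hole begins.
    head = (start, drop_start)
    # Between step 1 and step 2 nothing in the tree covers [drop_end, end).
    # The fix described above locks that range before step 1 is performed.
    # Step 2: insert the bookend extent covering the tail.
    bookend = (drop_end, end)
    return head, bookend
```

For example, dropping bytes 4096 to 8192 from a 12288-byte extent leaves a 4096-byte head and a 4096-byte bookend.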
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: update hole handling v2 · 9036c102
      Yan Zheng authored
      
      
      This patch splits the hole insertion code out of btrfs_setattr
      into btrfs_cont_expand and updates btrfs_get_extent to properly
      handle the case where file extent items are not contiguous.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  21. 29 Oct, 2008 1 commit
    • Btrfs: Add zlib compression support · c8b97818
      Chris Mason authored
      
      
      This is a large change for adding compression on reading and writing,
      both for inline and regular extents.  It does some fairly large
      surgery to the writeback paths.
      
      Compression is off by default and enabled by mount -o compress.  Even
      when the -o compress mount option is not used, it is possible to read
      compressed extents off the disk.
      
      If compression for a given set of pages fails to make them smaller, the
      file is flagged to avoid future compression attempts later.
      
      * While finding delalloc extents, the pages are locked before being sent down
      to the delalloc handler.  This allows the delalloc handler to do complex things
      such as cleaning the pages, marking them writeback and starting IO on their
      behalf.
      
      * Inline extents are inserted at delalloc time now.  This allows us to compress
      the data before inserting the inline extent, and it allows us to insert
      an inline extent that spans multiple pages.
      
      * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
      are changed to record both an in-memory size and an on disk size, as well
      as a flag for compression.
      
      From a disk format point of view, the extent pointers in the file are changed
      to record the on disk size of a given extent and some encoding flags.
      Space in the disk format is allocated for compression encoding, as well
      as encryption and a generic 'other' field.  Neither the encryption nor the
      'other' field is currently used.
      
      In order to limit the amount of data read for a single random read in the
      file, the size of a compressed extent is limited to 128k.  This is a
      software only limit, the disk format supports u64 sized compressed extents.
      
      In order to limit the ram consumed while processing extents, the uncompressed
      size of a compressed extent is limited to 256k.  This is a software only limit
      and will be subject to tuning later.
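
The size limits and the "give up if it does not get smaller" rule can be sketched with Python's standard `zlib` module. This is an illustration under assumptions, not the kernel writeback path: the function name, the return shape and the choice to fall back to the raw bytes of the same chunk are hypothetical, while the 128k and 256k figures are taken from the text above.

```python
import zlib

MAX_COMPRESSED = 128 * 1024    # limit on the compressed size of one extent
MAX_UNCOMPRESSED = 256 * 1024  # limit on the uncompressed size of one extent


def compress_range(data: bytes):
    """Try to compress the first chunk of data.

    Returns (compressed, payload). compressed is False when compression did
    not make the chunk smaller or exceeded the size limit, in which case
    payload is the raw chunk and the caller would write it uncompressed.
    """
    chunk = data[:MAX_UNCOMPRESSED]
    packed = zlib.compress(chunk)
    if len(packed) >= len(chunk) or len(packed) > MAX_COMPRESSED:
        return False, chunk
    return True, packed
```

A caller modeling the "flag the file" behavior would record the `False` result and skip later compression attempts for that file.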
      
      Checksumming is still done on compressed extents, and it is done on the
      uncompressed version of the data.  This way additional encodings can be
      layered on without having to figure out which encoding to checksum.
      
      Compression happens at delalloc time, which is basically single threaded because
      it is usually done by a single pdflush thread.  This makes it tricky to
      spread the compression load across all the cpus on the box.  We'll have to
      look at parallel pdflush walks of dirty inodes at a later time.
      
      Decompression is hooked into readpages and it does spread across CPUs nicely.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  22. 09 Oct, 2008 2 commits
    • Btrfs: Remove offset field from struct btrfs_extent_ref · 3bb1a1bc
      Yan Zheng authored
      
      
      The offset field in struct btrfs_extent_ref records the position
      inside the file at which the file extent is referenced. In the new back
      reference system, tree leaves holding references to file extents
      are recorded explicitly. We can scan these tree leaves very quickly, so the
      offset field is not required.

      This patch also makes the back reference system check the objectid
      when extents are being deleted.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: Count space allocated to file in bytes · a76a3cd4
      Yan Zheng authored
      
      
      This patch makes btrfs count space allocated to file in bytes instead
      of 512 byte sectors.
      
      Everything else in btrfs uses a byte count instead of sector sizes or
      blocks sizes, so this fits better.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  23. 03 Oct, 2008 1 commit
    • Btrfs: O_DIRECT writes via buffered writes + invalidate · cb843a6f
      Chris Mason authored
      
      
      This reworks the btrfs O_DIRECT write code a bit.  It had always fallen
      back to buffered IO and done an invalidate, but needed to be updated
      for the data=ordered code.  The invalidate wasn't actually removing pages
      because they were still inside an ordered extent.
      
      This also combines the O_DIRECT/O_SYNC paths where possible, and kicks
      off IO in the main btrfs_file_write loop to keep the pipe down to the
      disk full as we process long writes.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  24. 29 Sep, 2008 1 commit
    • Btrfs: add and improve comments · d352ac68
      Chris Mason authored
      
      
      This improves the comments at the top of many functions.  It didn't
      dive into the guts of functions because I was trying to
      avoid merging problems with the new allocator and back reference work.
      
      extent-tree.c and volumes.c were both skipped, and there is definitely
      more work to do in cleaning and commenting the code.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  25. 26 Sep, 2008 1 commit
    • Btrfs: extent_map and data=ordered fixes for space balancing · 5b21f2ed
      Zheng Yan authored
      
      
      * Add an EXTENT_BOUNDARY state bit to keep the writepage code
      from merging data extents that are in the process of being
      relocated.  This allows us to do accounting for them properly.
      
      * The balancing code relocates data extents independent of the underlying
      inode.  The extent_map code was modified to properly account for
      things moving around (invalidating extent_map caches in the inode).
      
      * Don't take the drop_mutex in the create_subvol ioctl.  It isn't
      required.
      
      * Fix walking of the ordered extent list to avoid races with sys_unlink
      
      * Change the lock ordering rules.  Transaction start goes outside
      the drop_mutex.  This allows btrfs_commit_transaction to directly
      drop the relocation trees.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  26. 25 Sep, 2008 4 commits