1. 25 Oct, 2012 1 commit
  2. 09 Oct, 2012 4 commits
    • Btrfs: cache extent state when writing out dirty metadata pages · e6138876
      Josef Bacik authored
      Every time we write out dirty pages we search for an offset in the tree,
      convert the bits in the state, and then when we wait we search for the
      offset again and clear the bits.  So for every dirty range in the io tree we
      are doing 4 rb searches, which is suboptimal.  With this patch we are only
      doing 2 searches for every cycle (modulo weird things happening).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
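A minimal userspace sketch of the caching idea in this commit, assuming a sorted array as a stand-in for the rb-tree. The names `struct state`, `tree_search`, and `lookup_cached` are hypothetical and are not the btrfs API; the point is only that a remembered record lets the second pass skip its search.

```c
#include <stddef.h>

/* Hypothetical stand-in for one extent state record in the io tree. */
struct state {
    unsigned long start;
    unsigned int bits;
};

/* Counts how many times the "tree" was searched. */
static int searches;

/* Binary search over a sorted array, standing in for an rb-tree search. */
static struct state *tree_search(struct state *tree, size_t n,
                                 unsigned long start)
{
    size_t lo = 0, hi = n;

    searches++;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tree[mid].start == start)
            return &tree[mid];
        if (tree[mid].start < start)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Reuse the cached record when it matches; otherwise search and cache it. */
static struct state *lookup_cached(struct state *tree, size_t n,
                                   unsigned long start,
                                   struct state **cached)
{
    struct state *s;

    if (*cached && (*cached)->start == start)
        return *cached;
    s = tree_search(tree, n, start);
    *cached = s;
    return s;
}
```

In this model, a "convert the bits" pass followed by a "clear the bits" pass on the same offset costs one search instead of two when the caller keeps the cached pointer between the passes.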
    • Btrfs: fix orphan transaction on the freezed filesystem · 354aa0fb
      Miao Xie authored
      With the following debug patch:
       static int btrfs_freeze(struct super_block *sb)
      + 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
      +	struct btrfs_transaction *trans;
      +	spin_lock(&fs_info->trans_lock);
      +	trans = fs_info->running_transaction;
      +	if (trans) {
      +		printk("Transid %llu, use_count %d, num_writer %d\n",
      +			trans->transid, atomic_read(&trans->use_count),
      +			atomic_read(&trans->num_writers));
      +	}
      +	spin_unlock(&fs_info->trans_lock);
       	return 0;
      I found there was an orphan transaction after the freeze operation was done.
      This is because the transaction may not be committed when the transaction handle
      ends, even though it is the last handle of the current transaction. This design
      avoids committing the transaction frequently, but also introduces the above problem.
      So I add btrfs_attach_transaction(), which can catch the current transaction
      and commit it. If there is no transaction, it will return ENOENT, and do not
      This function can also be used instead of btrfs_join_transaction_freeze()
      because it doesn't increase the writer counter and doesn't start a new transaction,
      so it can also fix the deadlock between sync and freeze.
      Besides that, it is used instead of btrfs_join_transaction() in
      transaction_kthread(), because if there is no transaction, the transaction
      kthread needn't do anything.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
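The difference between joining and attaching described above can be sketched in userspace C. This is a model under stated assumptions, not the kernel code: `struct txn_model`, `struct fs_state`, `model_join`, and `model_attach` are hypothetical names. Only the behaviour named in the commit message is modelled: attach returns ENOENT when nothing is running, and it neither starts a transaction nor bumps the writer counter.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a running transaction. */
struct txn_model {
    unsigned long long transid;
    int num_writers;
};

/* Hypothetical model of the per-filesystem transaction state. */
struct fs_state {
    struct txn_model *running;   /* NULL when no transaction is running */
    struct txn_model slot;       /* storage for the one running transaction */
    unsigned long long next_id;
};

/* Join-like: starts a transaction if none is running, and counts a writer. */
static struct txn_model *model_join(struct fs_state *fs)
{
    if (!fs->running) {
        fs->slot.transid = ++fs->next_id;
        fs->slot.num_writers = 0;
        fs->running = &fs->slot;
    }
    fs->running->num_writers++;
    return fs->running;
}

/* Attach-like: only catches an existing transaction, never creates one. */
static int model_attach(struct fs_state *fs, struct txn_model **out)
{
    if (!fs->running)
        return -ENOENT;
    *out = fs->running;          /* writer counter is left untouched */
    return 0;
}
```

A caller such as a periodic commit thread can treat `-ENOENT` as "nothing to do" and go back to sleep without creating an empty transaction.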
    • Btrfs: add a type field for the transaction handle · a698d075
      Miao Xie authored
      This patch adds a type field to the transaction handle structure.
      This way, we needn't implement various end-transaction functions
      and can make the code simpler and more readable.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix memory leak in start_transaction() · e8830e60
      Miao Xie authored
      This patch fixes a memory leak of the transaction handle, which happened
      when starting a transaction failed on a frozen fs.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
  3. 08 Oct, 2012 1 commit
  4. 04 Oct, 2012 3 commits
    • Btrfs: fix race with freeze and free space inodes · 98114659
      Josef Bacik authored
      So we start our freeze, somebody comes in and does an fsync() on a file
      where we have to commit a transaction for whatever reason, and we will
      deadlock because the freeze is waiting on FS_FREEZE people to stop writing
      to the file system, but the transaction is waiting for its free space inodes
      to be written out, which are in turn waiting on sb_start_intwrite while
      trying to write the file extents.  To fix this we'll just skip the
      sb_start_intwrite() if we are TRANS_JOIN_NOLOCK since we're being waited on by a
      transaction commit so we're safe wrt freeze and this will keep us from
      deadlocking.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: kill obsolete arguments in btrfs_wait_ordered_extents · 6bbe3a9c
      Liu Bo authored
      nocow_only is now an obsolete argument.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix race in sync and freeze again · 60376ce4
      Josef Bacik authored
      I screwed this up, there is a race between checking if there is a running
      transaction and actually starting a transaction in sync where we could race
      with a freezer and get ourselves into trouble.  To fix this we need to make
      a new join type to only do the try lock on the freeze stuff.  If it fails
      we'll return EPERM and just return from sync.  This fixes a hang Liu Bo
      reported when running xfstest 68 in a loop.  Thanks,
      Reported-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
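The try-lock join described in this commit can be illustrated with a small userspace model. Everything here is an assumption for illustration: `struct sb_model`, `try_start_intwrite`, `join_freeze`, and `sync_model` are hypothetical names, and a boolean flag stands in for the real freeze protection. It shows only the control flow from the message: the join never blocks on a frozen filesystem, fails with EPERM, and sync then simply returns.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a superblock's freeze state. */
struct sb_model {
    bool frozen;       /* a freezer currently holds the filesystem */
    int intwriters;    /* internal writers currently admitted */
};

/* Try-lock style admission: never blocks, fails if frozen. */
static bool try_start_intwrite(struct sb_model *sb)
{
    if (sb->frozen)
        return false;
    sb->intwriters++;
    return true;
}

/* Join variant that only try-locks the freeze protection. */
static int join_freeze(struct sb_model *sb)
{
    return try_start_intwrite(sb) ? 0 : -EPERM;
}

/* Sync treats -EPERM as "frozen, nothing to do" and returns quietly. */
static int sync_model(struct sb_model *sb)
{
    int ret = join_freeze(sb);

    if (ret == -EPERM)
        return 0;
    if (ret)
        return ret;
    /* ... a real sync would commit the transaction here ... */
    sb->intwriters--;   /* drop the admission taken by the join */
    return 0;
}
```

Because the admission check and the "start" happen in one non-blocking step, there is no window in which a freezer can slip in between checking for a running transaction and starting one.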
  5. 01 Oct, 2012 8 commits
    • Btrfs: delay block group item insertion · ea658bad
      Josef Bacik authored
      So we have lots of places where we try to preallocate chunks in order to
      make sure we have enough space as we make our allocations.  This has
      historically meant that we're constantly tweaking when we should allocate a
      new chunk, and historically we have gotten this horribly wrong so we way
      over allocate either metadata or data.  To try and keep this from happening
      we are going to make it so that the block group item insertion is done out
      of band at the end of a transaction.  This will allow us to create chunks
      even if we are trying to make an allocation for the extent tree.  With this
      patch my enospc tests run faster (didn't expect this) and more efficiently
      use the disk space (this is what I wanted).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: move the sb_end_intwrite until after the throttle logic · 6df7881a
      Josef Bacik authored
      Sage reported the following lockdep backtrace
      [ BUG: bad unlock balance detected! ]
      3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted
      btrfs-cleaner/7607 is trying to release lock (sb_internal) at:
      [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
      but there are no more locks to release!
      other info that might help us debug this:
      1 lock held by btrfs-cleaner/7607:
       #0:  (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs]
      stack backtrace:
      Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1
      Call Trace:
       [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110
       [<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310
       [<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160
       [<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs]
       [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff810b2a95>] lock_release+0xd5/0x220
       [<ffffffff81173071>] ? kmem_cache_free+0x151/0x160
       [<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90
       [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60
       [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
       [<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs]
       [<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs]
       [<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs]
       [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
       [<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs]
       [<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs]
       [<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs]
       [<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
       [<ffffffff810791ee>] kthread+0xae/0xc0
       [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
       [<ffffffff81635430>] ? retint_restore_args+0x13/0x13
       [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
       [<ffffffff8163e740>] ? gs_change+0x13/0x13
      This is because the throttle stuff can commit the transaction, which expects to
      be the one stopping the intwrite stuff, but we've already done it in the
      __btrfs_end_transaction.  Moving the sb_end_intwrite after this logic makes the
      lockdep go away.  Thanks,
      Tested-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
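The unlock imbalance reported above can be reproduced in a toy userspace model. This is a sketch under assumptions, not the kernel code path: `struct freeze_lock`, `end_intwrite`, `commit_model`, and `end_transaction_model` are hypothetical, and a counter stands in for the `sb_internal` lock. The `release_first` flag selects the old ordering (release before the throttle logic) or the fixed one (release after it).

```c
#include <stdbool.h>

/* Hypothetical model of the internal-write freeze protection. */
struct freeze_lock {
    int held;          /* how many times the protection is held */
    bool unbalanced;   /* set when a release finds nothing to release */
};

static void end_intwrite(struct freeze_lock *l)
{
    if (l->held == 0)
        l->unbalanced = true;   /* "bad unlock balance" in the model */
    else
        l->held--;
}

/* The commit path expects to be the one releasing the protection. */
static void commit_model(struct freeze_lock *l)
{
    end_intwrite(l);
}

static void end_transaction_model(struct freeze_lock *l,
                                  bool throttle_commits,
                                  bool release_first)
{
    if (release_first)
        end_intwrite(l);        /* old ordering: release too early */
    if (throttle_commits) {
        commit_model(l);        /* commit releases on our behalf */
        return;
    }
    if (!release_first)
        end_intwrite(l);        /* fixed ordering: release after throttle */
}
```

With `release_first` set and a throttle-triggered commit, the model releases twice and flags the imbalance; with the release moved after the throttle logic, exactly one release happens on either path.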
    • Btrfs: fix corrupted metadata in the snapshot · 8407aa46
      Miao Xie authored
      When we delete an inode, we remove all the delayed items, including the delayed
      inode update, and then truncate all the related metadata. If there is a lot of
      metadata, we end the current transaction and start a new transaction to
      truncate the remaining metadata. In this way, we leave an inode item whose
      link counter is > 0, and may also leave some directory index items in the fs/file tree
      after the current transaction ends. In other words, the metadata in this fs/file tree
      is inconsistent. If we create a snapshot of this tree now, we will find an inode with
      corrupted metadata in the new snapshot, and we won't continue to drop the remaining metadata,
      because its link counter is not 0.
      We fix this problem by updating the inode item before the current transaction ends.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix the snapshot that should not exist · 42874b3d
      Miao Xie authored
      The snapshot should be the image of the fs tree before it was created,
      so the metadata of the snapshot should not exist in its tree. But now, we
      found the directory item and directory name index are in both the snapshot tree
      and the fs tree. This introduces some problems and confuses users:
       # mkfs.btrfs /dev/sda1
       # mount /dev/sda1 /mnt
       # mkdir /mnt/1
       # cd /mnt/1
       # btrfs subvolume snapshot /mnt snap0
       # ls -a /mnt/1/snap0/1
       .	..	[no other file/dir]
       # ll /mnt/1/snap0/
       total 0
       drwxr-xr-x 1 root root 10 Jul 24 12:11 1
      			There is no file/dir in it, but its size is 10
       # cd /mnt/1/snap0/1/snap0
       [Entered a nonexistent directory successfully...]
      There is nothing in directory 1 in snap0, but btrfs reports that the length of
      this directory is 10. Besides that, we can enter a nonexistent directory, which is
      very strange to users.
       # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1
       # ll /mnt/1/snap0/1/
       total 0
       # ll /mnt/snap1/1/
       total 0
       drwxr-xr-x 1 root root 0 Jul 24 12:14 snap0
      The source of snap1 did not have any directory in directory 1, but snap1 has
      a snap0, so the source and the snapshot differ.
      So I think we should insert the directory item and directory name index and update
      the parent inode as the last step of snapshot creation, and not leave the
      useless metadata in the file tree.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix full backref problem when inserting shared block reference · 361048f5
      Miao Xie authored
      If we create several snapshots at the same time, the following BUG_ON() will be triggered:
      	kernel BUG at fs/btrfs/extent-tree.c:6047!
      Steps to reproduce:
       # mkfs.btrfs <partition>
       # mount <partition> <mnt>
       # cd <mnt>
       # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
       # for ((i=0; i<4; i++))
       > do
       > mkdir $i
       > for ((j=0; j<200; j++))
       > do
       > btrfs sub snap . $i/$j
       > done &
       > done
      The reason is:
      Before transaction commit, some operations changed the fs tree and new tree
      blocks were allocated because of COW. We used the implicit non-shared back
      reference for those newly allocated tree blocks because they were not shared by
      two or more trees.
      And then we created the first snapshot for the fs tree. According to the back
      reference rules, we also used implicit back refs for the child tree blocks of
      the root node of the fs tree. Now those child nodes/leaves were shared by two trees.
      Then we didn't deal with the delayed references, and continued to change the fs
      tree (created the second snapshot and inserted the dir item of the new snapshot
      into the fs tree). According to the rules of the back reference, we added full
      back refs for those tree blocks whose parents have been shared by two trees.
      Now some newly allocated tree blocks had two types of references.
      As we know, the delayed reference system handles these delayed references from
      back to front, and the full delayed reference is inserted after the implicit
      ones. So when we dealt with the back references of those newly allocated tree
      blocks, the full references were dealt with first. And if the first reference
      is a shared back reference and the tree block that the reference points to is
      newly allocated, it would be considered a tree block which was shared by two
      or more trees when it was allocated and should have a full back reference, not an
      implicit one; the flag of its reference should also be set to FULL_BACKREF.
      But in fact, it was a non-shared tree block with an implicit reference at the
      beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
      was triggered.
      We have several methods to fix this bug:
      1. deal with delayed references after the snapshot is created and before we
         change the source tree of the snapshot. This is the easiest and safest way.
      2. modify the sort method of the delayed reference tree, make the full delayed
         references be inserted before the implicit ones. It is also very easy, but
         I don't know if it will introduce some problems or not.
      3. modify select_delayed_ref() and make it select the implicit delayed reference
         first. This way is not so good because it may waste CPU time if we have
         lots of delayed references.
      4. set the flags to FULL_BACKREF. This method is a little complex compared with
         the 1st way.
      I chose the 1st way to fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix error path in create_pending_snapshot() · 6fa9700e
      Miao Xie authored
      This patch fixes the following problems:
      - If we failed to deal with the delayed dir items, we should abort the transaction,
        just as its comment said. Fix it.
      - If root reference or root back reference insertion failed, we should
        abort the transaction. Fix it.
      - Fix the double free problem of pending->inherit.
      - Do not restore trans->rsv if we didn't change it.
      - Make the error path clearer.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: set journal_info in async trans commit worker · e209db7a
      Sage Weil authored
      We expect current->journal_info to point to the trans handle we are
      Signed-off-by: Sage Weil <sage@inktank.com>
    • Btrfs: pass lockdep rwsem metadata to async commit transaction · 6fc4e354
      Sage Weil authored
      The freeze rwsem is taken by sb_start_intwrite() and dropped during the
      commit_ or end_transaction().  In the async case, that happens in a worker
      thread.  Tell lockdep the calling thread is releasing ownership of the
      rwsem and the async thread is picking it up.
      XFS plays the same trick in fs/xfs/xfs_aops.c.
      Signed-off-by: Sage Weil <sage@inktank.com>
  6. 28 Aug, 2012 2 commits
  7. 30 Jul, 2012 1 commit
    • btrfs: Convert to new freezing mechanism · b2b5ef5c
      Jan Kara authored
      We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
      freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
      the transaction mechanism to avoid starting transactions on frozen filesystem.
      At minimum this is necessary to stop iput() of unlinked file to change frozen
      filesystem during truncation.
      Checks in cleaner_kthread() and transaction_kthread() can be safely removed
      since btrfs_freeze() will lock the mutexes and thus block the threads (and they
      shouldn't have anything to do anyway).
      CC: linux-btrfs@vger.kernel.org
      CC: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  8. 25 Jul, 2012 1 commit
    • Btrfs: introduce subvol uuids and times · 8ea05e3a
      Alexander Block authored
      This patch introduces uuids for subvolumes. Each
      subvolume has its own uuid. In case it was snapshotted,
      it also contains parent_uuid. In case it was received,
      it also contains received_uuid.
      It also introduces subvolume ctime/otime/stime/rtime. The
      first two are comparable to the times found in inodes. otime
      is the origin/creation time and ctime is the change time.
      stime/rtime are only valid on received subvolumes.
      stime is the time of the subvolume when it was
      sent. rtime is the time of the subvolume when it was received.
      Additionally to the times, we have a transid for each
      time. They are updated at the same place as the times.
      btrfs receive uses stransid and rtransid to find out
      if a received subvolume changed in the meantime.
      If an older kernel mounts a filesystem with the
      extended fields, all fields become invalid. The next
      mount with a new kernel will detect this and reset the
      Signed-off-by: Alexander Block <ablock84@googlemail.com>
      Reviewed-by: David Sterba <dave@jikos.cz>
      Reviewed-by: Arne Jansen <sensille@gmx.net>
      Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
  9. 23 Jul, 2012 3 commits
    • Btrfs: change how we indicate we're adding csums · 0e721106
      Josef Bacik authored
      There is weird logic I had to put in place to make sure that when we were
      adding csums that we'd use the delalloc block rsv instead of the global
      block rsv.  Part of this meant that we had to free up our transaction
      reservation before we ran the delayed refs since csum deletion happens
      during the delayed ref work.  The problem with this is that when we release
      a reservation we will add it to the global reserve if it is not full in
      order to keep us going along longer before we have to force a transaction
      commit.  By releasing our reservation before we run delayed refs we don't
      get the opportunity to drain down the global reserve for the work we did, so
      we won't refill it as often.  This isn't a problem per se, it just results
      in us possibly committing transactions more and more often, and in rare
      cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
      out of space in our block rsv.
      This also helps us by holding onto space while the delayed refs run so we
      don't end up with as many people trying to do things at the same time, which
      again will help us not force commits or hit the use_block_rsv warnings.
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: small naming cleanup in join_transaction() · e4b50e14
      Dan Carpenter authored
      "root->fs_info" and "fs_info" are the same, but "fs_info" is preferred
      because it is shorter and that's what is used in the rest of the
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    • Btrfs: don't wait around for new log writers on an SSD · e39e64ac
      Chris Mason authored
      Waiting on spindles improves performance, but ssds want all the
      IO as quickly as we can push it down.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
  10. 12 Jul, 2012 5 commits
  11. 10 Jul, 2012 2 commits
  12. 14 Jun, 2012 2 commits
    • Btrfs: abort the transaction if the commit fails · 7b8b92af
      Josef Bacik authored
      If a transaction commit fails we don't abort it so we don't set an error on
      the file system.  This patch fixes that by actually calling the abort stuff
      and then adding a check for a fs error in the transaction start stuff to
      make sure it is caught properly.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
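The two halves of this fix (abort on commit failure, then refuse new transactions once an error is recorded) can be sketched as a userspace model. The names `struct fs_model`, `abort_transaction`, `commit_or_abort`, and `start_transaction_model` are hypothetical, and the use of `-EROFS` as the error returned to new callers is an assumption made for the example, not something the commit message states.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the filesystem-wide error state. */
struct fs_model {
    bool fs_error;   /* set once a transaction has been aborted */
};

/* Record the error on the filesystem so later callers can see it. */
static void abort_transaction(struct fs_model *fs)
{
    fs->fs_error = true;
}

/* A failed commit aborts the transaction instead of failing silently. */
static int commit_or_abort(struct fs_model *fs, int write_result)
{
    if (write_result) {
        abort_transaction(fs);
        return write_result;
    }
    return 0;
}

/* Starting a transaction checks for a recorded filesystem error first. */
static int start_transaction_model(const struct fs_model *fs)
{
    if (fs->fs_error)
        return -EROFS;   /* assumed error code for this sketch */
    return 0;
}
```

The design point is that the failure is made sticky: once a commit has failed, every later attempt to start a transaction sees the error instead of continuing to modify a filesystem in an unknown state.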
    • Btrfs: wake up transaction waiters when aborting a transaction · d7096fc3
      Josef Bacik authored
      I was getting lots of hung tasks and a NULL pointer dereference because we
      are not cleaning up the transaction properly when it aborts.  First we need
      to reset the running_transaction to NULL so we don't get a bad dereference
      for any start_transaction callers after this.  Also we cannot rely on
      waitqueue_active() since it's just a list_empty(), so just call wake_up()
      directly since that will do the barrier for us and such.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  13. 30 May, 2012 3 commits
  14. 18 Apr, 2012 1 commit
  15. 12 Apr, 2012 1 commit
  16. 29 Mar, 2012 1 commit
  17. 27 Mar, 2012 1 commit