1. 07 Jan, 2016 1 commit
    • David Sterba's avatar
      btrfs: allocate root item at snapshot ioctl time · b0c0ea63
      David Sterba authored
      
      
      The actual snapshot creation is delayed until transaction commit. If we
      cannot get enough memory for the root item there, we have to fail the
      whole transaction commit which is bad. So we'll allocate the memory at
      the ioctl call and pass it along with the pending_snapshot struct. The
      potential ENOMEM will be returned to the caller of snapshot ioctl.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b0c0ea63
  2. 10 Dec, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list · 348a0013
      Filipe Manana authored
      
      
      As of my previous change titled "Btrfs: fix scrub preventing unused block
      groups from being deleted", the following warning at
      extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a
      filesysten with "-o discard":
      
       10263  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
       10264  {
       (...)
       10405                  if (trimming) {
       10406                          WARN_ON(!list_empty(&block_group->bg_list));
       10407                          spin_lock(&trans->transaction->deleted_bgs_lock);
       10408                          list_move(&block_group->bg_list,
       10409                                    &trans->transaction->deleted_bgs);
       10410                          spin_unlock(&trans->transaction->deleted_bgs_lock);
       10411                          btrfs_get_block_group(block_group);
       10412                  }
       (...)
      
      This happens because scrub can now add back the block group to the list of
      unused block groups (fs_info->unused_bgs). This is dangerous because we
      are moving the block group from the unused block groups list to the list
      of deleted block groups without holding the lock that protects the source
      list (fs_info->unused_bgs_lock).
      
      The following diagram illustrates how this happens:
      
                  CPU 1                                     CPU 2
      
       cleaner_kthread()
         btrfs_delete_unused_bgs()
      
           sees bg X in list
            fs_info->unused_bgs
      
           deletes bg X from list
            fs_info->unused_bgs
      
                                                  scrub_enumerate_chunks()
      
                                                    searches device tree using
                                                    its commit root
      
                                                    finds device extent for
                                                    block group X
      
                                                    gets block group X from the tree
                                                    fs_info->block_group_cache_tree
                                                    (via btrfs_lookup_block_group())
      
                                                    sets bg X to RO (again)
      
                                                    scrub_chunk(bg X)
      
                                                    sets bg X back to RW mode
      
                                                    adds bg X to the list
                                                    fs_info->unused_bgs again,
                                                    since it's still unused and
                                                    currently not in that list
      
           sets bg X to RO mode
      
           btrfs_remove_chunk(bg X)
      
           --> discard is enabled and bg X
               is in the fs_info->unused_bgs
               list again so the warning is
               triggered
           --> we move it from that list into
               the transaction's delete_bgs
               list, but we can have another
               task currently manipulating
               the first list (fs_info->unused_bgs)
      
      Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
      the list of unused block groups and the list of deleted block groups. This
      makes it safe and there's not much worry for more lock contention, as this
      lock is seldom used and only the cleaner kthread adds elements to the list
      of deleted block groups. The warning goes away too, as this was previously
      an impossible case (and would have been better a BUG_ON/ASSERT) but it's
      not impossible anymore.
      Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      348a0013
  3. 25 Nov, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana authored
      
      
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Also unmounting and mounting again the filesystem didn't
      help. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8eab77ff
  4. 21 Oct, 2015 6 commits
  5. 10 Oct, 2015 2 commits
  6. 05 Oct, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: fix deadlock when finalizing block group creation · d9a0540a
      Filipe Manana authored
      
      
      Josef ran into a deadlock while a transaction handle was finalizing the
      creation of its block groups, which produced the following trace:
      
        [260445.593112] fio             D ffff88022a9df468     0  8924   4518 0x00000084
        [260445.593119]  ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
        [260445.593126]  ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
        [260445.593132]  ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
        [260445.593137] Call Trace:
        [260445.593145]  [<ffffffff8175a437>] schedule+0x37/0x80
        [260445.593189]  [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
        [260445.593197]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
        [260445.593225]  [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
        [260445.593253]  [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
        [260445.593295]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
        [260445.593324]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [260445.593351]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [260445.593394]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
        [260445.593427]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
        [260445.593459]  [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
        [260445.593491]  [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs]
        [260445.593524]  [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs]
        [260445.593532]  [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170
        [260445.593564]  [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
        [260445.593597]  [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
        [260445.593626]  [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
        [260445.593654]  [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
        [260445.593682]  [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
        [260445.593724]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
        [260445.593752]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [260445.593830]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [260445.593905]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
        [260445.593946]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
        [260445.593990]  [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
        [260445.594042]  [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
        [260445.594089]  [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs]
        [260445.594115]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
        [260445.594133]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
        [260445.594149]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
        [260445.594169]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
        [260445.594187]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
        [260445.594204]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      This happened because the same transaction handle created a large number
      of block groups and while finalizing their creation (inserting new items
      and updating existing items in the chunk and device trees) a new metadata
      extent had to be allocated and no free space was found in the current
      metadata block groups, which made find_free_extent() attempt to allocate
      a new block group via do_chunk_alloc(). However at do_chunk_alloc() we
      ended up allocating a new system chunk too and exceeded the threshold
      of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the
      final part of block group creation again (at
      btrfs_create_pending_block_groups()) and attempt to lock again the root
      of the chunk tree when it's already write locked by the same task.
      
      Similarly we can deadlock on extent tree nodes/leafs if while we are
      running delayed references we end up creating a new metadata block group
      in order to allocate a new node/leaf for the extent tree (as part of
      a CoW operation or growing the tree), as btrfs_create_pending_block_groups
      inserts items into the extent tree as well. In this case we get the
      following trace:
      
        [14242.773581] fio             D ffff880428ca3418     0  3615   3100 0x00000084
        [14242.773588]  ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
        [14242.773594]  ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
        [14242.773600]  ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
        [14242.773606] Call Trace:
        [14242.773613]  [<ffffffff8175a437>] schedule+0x37/0x80
        [14242.773656]  [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
        [14242.773664]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
        [14242.773692]  [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
        [14242.773720]  [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
        [14242.773750]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [14242.773758]  [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200
        [14242.773786]  [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs]
        [14242.773818]  [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
        [14242.773850]  [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
        [14242.773934]  [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs]
        [14242.773998]  [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
        [14242.774041]  [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
        [14242.774078]  [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
        [14242.774118]  [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
        [14242.774155]  [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
        [14242.774194]  [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
        [14242.774235]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [14242.774274]  [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [14242.774318]  [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
        [14242.774358]  [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
        [14242.774391]  [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
        [14242.774432]  [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs]
        [14242.774474]  [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
        [14242.774516]  [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
        [14242.774558]  [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs]
        [14242.774599]  [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
        [14242.774642]  [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs]
        [14242.774650]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
        [14242.774657]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
        [14242.774663]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
        [14242.774669]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
        [14242.774675]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
        [14242.774681]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      Fix this by never recursing into the finalization phase of block group
      creation and making sure we never trigger the finalization of block group
      creation while running delayed references.
      Reported-by: default avatarJosef Bacik <jbacik@fb.com>
      Fixes: 00d80e34
      
       ("Btrfs: fix quick exhaustion of the system array in the superblock")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      d9a0540a
  7. 29 Sep, 2015 1 commit
  8. 22 Sep, 2015 1 commit
    • Josef Bacik's avatar
      Btrfs: keep dropped roots in cache until transaction commit · 2b9dbef2
      Josef Bacik authored
      
      
      When dropping a snapshot we need to account for the qgroup changes.  If we drop
      the snapshot in all one go then the backref code will fail to find blocks from
      the snapshot we dropped since it won't be able to find the root in the fs root
      cache.  This can lead to us failing to find refs from other roots that pointed
      at blocks in the now deleted root.  To handle this we need to not remove the fs
      roots from the cache until after we process the qgroup operations.  Do this by
      adding dropped roots to a list on the transaction, and letting the transaction
      remove the roots at the same time it drops the commit roots.  This will keep all
      of the backref searching code in sync properly, and fixes a problem Mark was
      seeing with snapshot delete and qgroups.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Tested-by: default avatarHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2b9dbef2
  9. 19 Aug, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: check if previous transaction aborted to avoid fs corruption · 1f9b8c8f
      Filipe Manana authored
      
      
      While we are committing a transaction, it's possible the previous one is
      still finishing its commit and therefore we wait for it to finish first.
      However we were not checking if that previous transaction ended up getting
      aborted after we waited for it to commit, so we ended up committing the
      current transaction which can lead to fs corruption because the new
      superblock can point to trees that have had one or more nodes/leafs that
      were never durably persisted.
      The following sequence diagram exemplifies how this is possible:
      
                CPU 0                                                        CPU 1
      
        transaction N starts
      
        (...)
      
        btrfs_commit_transaction(N)
      
          cur_trans->state = TRANS_STATE_COMMIT_START;
          (...)
          cur_trans->state = TRANS_STATE_COMMIT_DOING;
          (...)
      
          cur_trans->state = TRANS_STATE_UNBLOCKED;
          root->fs_info->running_transaction = NULL;
      
                                                                    btrfs_start_transaction()
                                                                       --> starts transaction N + 1
      
          btrfs_write_and_wait_transaction(trans, root);
            --> starts writing all new or COWed ebs created
                at transaction N
      
                                                                    creates some new ebs, COWs some
                                                                    existing ebs but doesn't COW or
                                                                    deletes eb X
      
                                                                    btrfs_commit_transaction(N + 1)
                                                                      (...)
                                                                      cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                      (...)
                                                                      wait_for_commit(root, prev_trans);
                                                                        --> prev_trans == transaction N
      
          btrfs_write_and_wait_transaction() continues
          writing ebs
             --> fails writing eb X, we abort transaction N
                 and set bit BTRFS_FS_STATE_ERROR on
                 fs_info->fs_state, so no new transactions
                 can start after setting that bit
      
             cleanup_transaction()
               btrfs_cleanup_one_transaction()
                 wakes up task at CPU 1
      
                                                                      continues, doesn't abort because
                                                                      cur_trans->aborted (transaction N + 1)
                                                                      is zero, and no checks for bit
                                                                      BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                      are made
      
                                                                      btrfs_write_and_wait_transaction(trans, root);
                                                                        --> succeeds, no errors during writeback
      
                                                                      write_ctree_super(trans, root, 0);
                                                                        --> succeeds
                                                                        --> we have now a superblock that points us
                                                                            to some root that uses eb X, which was
                                                                            never written to disk
      
      In this scenario future attempts to read eb X from disk results in an
      error message like "parent transid verify failed on X wanted Y found Z".
      
      So fix this by aborting the current transaction if after waiting for the
      previous transaction we verify that it was aborted.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1f9b8c8f
  10. 15 Aug, 2015 1 commit
  11. 09 Aug, 2015 1 commit
  12. 29 Jul, 2015 1 commit
    • Jeff Mahoney's avatar
      btrfs: add missing discards when unpinning extents with -o discard · e33e17ee
      Jeff Mahoney authored
      
      
      When we clear the dirty bits in btrfs_delete_unused_bgs for extents
      in the empty block group, it results in btrfs_finish_extent_commit being
      unable to discard the freed extents.
      
      The block group removal patch added an alternate path to forget extents
      other than btrfs_finish_extent_commit.  As a result, any extents that
      would be freed when the block group is removed aren't discarded.  In my
      test run, with a large copy of mixed sized files followed by removal, it
      left nearly 2/3 of extents undiscarded.
      
      To clean up the block groups, we add the removed block group onto a list
      that will be discarded after transaction commit.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Tested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e33e17ee
  13. 22 Jul, 2015 1 commit
    • Zhao Lei's avatar
      btrfs: Fix lockdep warning of btrfs_run_delayed_iputs() · 8a733013
      Zhao Lei authored
      
      
      Liu Bo <bo.li.liu@oracle.com> reported a lockdep warning of
      delayed_iput_sem in xfstests generic/241:
        [ 2061.345955] =============================================
        [ 2061.346027] [ INFO: possible recursive locking detected ]
        [ 2061.346027] 4.1.0+ #268 Tainted: G        W
        [ 2061.346027] ---------------------------------------------
        [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock:
        [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at:
        [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
        [ 2061.346027] but task is already holding lock:
        [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
        [ 2061.346027] other info that might help us debug this:
        [ 2061.346027]  Possible unsafe locking scenario:
      
        [ 2061.346027]        CPU0
        [ 2061.346027]        ----
        [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
        [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
        [ 2061.346027]
         *** DEADLOCK ***
      It is rarely happened, about 1/400 in my test env.
      
      The reason is recursion of btrfs_run_delayed_iputs():
        cleaner_kthread
        -> btrfs_run_delayed_iputs() *1
        -> get delayed_iput_sem lock *2
        -> iput()
        -> ...
        -> btrfs_commit_transaction()
        -> btrfs_run_delayed_iputs() *1
        -> get delayed_iput_sem lock (dead lock) *2
        *1: recursion of btrfs_run_delayed_iputs()
        *2: warning of lockdep about delayed_iput_sem
      
      When fs is in high stress, new iputs may added into fs_info->delayed_iputs
      list when btrfs_run_delayed_iputs() is running, which cause
      second btrfs_run_delayed_iputs() run into down_read(&fs_info->delayed_iput_sem)
      again, and cause above lockdep warning.
      
      Actually, it will not cause real problem because both locks are read lock,
      but to avoid lockdep warning, we can do a fix.
      
      Fix:
        Don't do btrfs_run_delayed_iputs() in btrfs_commit_transaction() for
        cleaner_kthread thread to break above recursion path.
        cleaner_kthread is calling btrfs_run_delayed_iputs() explicitly in code,
        and don't need to call btrfs_run_delayed_iputs() again in
        btrfs_commit_transaction(), it also give us a bonus to avoid stack overflow.
      
      Test:
        No above lockdep warning after patch in 1200 generic/241 tests.
      Reported-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8a733013
  14. 11 Jul, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: fix list transaction->pending_ordered corruption · d3efe084
      Filipe Manana authored
      When we call btrfs_commit_transaction(), we splice the list "ordered"
      of our transaction handle into the transaction's "pending_ordered"
      list, but we don't re-initialize the "ordered" list of our transaction
      handle, this means it still points to the same elements it used to
      before the splice. Then we check if the current transaction's state is
      >= TRANS_STATE_COMMIT_START and if it is we end up calling
      btrfs_end_transaction() which simply splices again the "ordered" list
      of our handle into the transaction's "pending_ordered" list, leaving
      multiple pointers to the same ordered extents which results in list
      corruption when we are iterating, removing and freeing ordered extents
      at btrfs_wait_pending_ordered(), resulting in access to dangling
      pointers / use-after-free issues.
      Similarly, btrfs_end_transaction() can end up in some cases calling
      btrfs_commit_transaction(), and both did a list splice of the transaction
      handle's "ordered" list into the transaction's "pending_ordered" without
      re-initializing the handle's "ordered" list, resulting in exactly the
      same problem.
      
      This produces the following warning on a kernel with linked list
      debugging enabled:
      
      [109749.265416] ------------[ cut here ]------------
      [109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
      [109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d
      (...)
      [109749.287505] Call Trace:
      [109749.288135]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
      [109749.298080]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
      [109749.331605]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
      [109749.334849]  [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98
      [109749.337093]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
      [109749.337847]  [<ffffffff81260642>] __list_del_entry+0x5a/0x98
      [109749.338678]  [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
      [109749.340145]  [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
      [109749.348313]  [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
      [109749.349745]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
      [109749.350819]  [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs]
      [109749.351976]  [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e
      [109749.360341]  [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e
      [109749.368828]  [<ffffffff8118ee1d>] do_fsync+0x34/0x4e
      [109749.369790]  [<ffffffff8118f045>] SyS_fsync+0x10/0x14
      [109749.370925]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
      [109749.382274] ---[ end trace 48e0d07f7c03d95a ]---
      
      On a non-debug kernel this leads to invalid memory accesses, causing a
      crash. Fix this by using list_splice_init() instead of list_splice() in
      btrfs_commit_transaction() and btrfs_end_transaction().
      
      Cc: stable@vger.kernel.org
      Fixes: 50d9aa99
      
       ("Btrfs: make sure logged extents complete in the current transaction V3"
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      d3efe084
  15. 10 Jun, 2015 4 commits
  16. 03 Jun, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: fix -ENOSPC when finishing block group creation · 4fbcdf66
      Filipe Manana authored
      
      
      While creating a block group, we often end up getting ENOSPC while updating
      the chunk tree, which leads to a transaction abortion that produces a trace
      like the following:
      
      [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
      [30670.117777] BTRFS: Transaction aborted (error -28)
      (...)
      [30670.163567] Call Trace:
      [30670.163906]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [30670.164522]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [30670.165171]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [30670.166323]  [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.167213]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [30670.167862]  [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.169116]  [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
      [30670.170593]  [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [30670.171960]  [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [30670.174649]  [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [30670.176092]  [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
      [30670.177218]  [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
      [30670.178622]  [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
      [30670.179642]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
      [30670.180692]  [<ffffffff81152863>] SyS_fallocate+0x47/0x62
      [30670.186737]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [30670.187792] ---[ end trace 0373e6b491c4a8cc ]---
      
      This is because we don't do proper space reservation for the chunk block
      reserve when we have multiple tasks allocating chunks in parallel.
      
      So block group creation has 2 phases, and the first phase essentially
      checks if there is enough space in the system space_info, allocating a
      new system chunk if there isn't, while the second phase updates the
      device, extent and chunk trees. However, because the updates to the
      chunk tree happen in the second phase, if we have N tasks, each with
      its own transaction handle, allocating new chunks in parallel and if
      there is only enough space in the system space_info to allocate M chunks,
      where M < N, none of the tasks ends up allocating a new system chunk in
      the first phase and N - M tasks will get -ENOSPC when attempting to
      update the chunk tree in phase 2 if they need to COW any nodes/leafs
      from the chunk tree.
      
      Fix this by doing proper reservation in the chunk block reserve.
      
      The issue could be reproduced by running fstests generic/038 in a loop,
      which eventually triggered the problem.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      4fbcdf66
  17. 10 Apr, 2015 4 commits
    • Chris Mason's avatar
      Btrfs: allow block group cache writeout outside critical section in commit · 1bbc621e
      Chris Mason authored
      
      
      We loop through all of the dirty block groups during commit and write
      the free space cache.  In order to make sure the cache is currect, we do
      this while no other writers are allowed in the commit.
      
      If a large number of block groups are dirty, this can introduce long
      stalls during the final stages of the commit, which can block new procs
      trying to change the filesystem.
      
      This commit changes the block group cache writeout to take appropriate
      locks and allow it to run earlier in the commit.  We'll still have to
      redo some of the block groups, but it means we can get most of the work
      out of the way without blocking the entire FS.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1bbc621e
    • Josef Bacik's avatar
      Btrfs: reserve space for block groups · cb723e49
      Josef Bacik authored
      
      
      This changes our delayed refs calculations to include the space needed
      to write back dirty block groups.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      cb723e49
    • Josef Bacik's avatar
      Btrfs: account for crcs in delayed ref processing · 1262133b
      Josef Bacik authored
      
      
      As we delete large extents, we end up doing huge amounts of COW in order
      to delete the corresponding crcs.  This adds accounting so that we keep
      track of that space and flushing of delayed refs so that we don't build
      up too much delayed crc work.
      
      This helps limit the delayed work that must be done at commit time and
      tries to avoid ENOSPC aborts because the crcs eat all the global
      reserves.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1262133b
    • Chris Mason's avatar
      btrfs: actively run the delayed refs while deleting large files · 28ed1345
      Chris Mason authored
      
      
      When we are deleting large files with large extents, we are building up
      a huge set of delayed refs for processing.  Truncate isn't checking
      often enough to see if we need to back off and process those, or let
      a commit proceed.
      
      The end result is long stalls after the rm, and very long commit times.
      During the commits, other processes back up waiting to start new
      transactions and we get into trouble.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      28ed1345
  18. 17 Mar, 2015 1 commit
    • Josef Bacik's avatar
      Btrfs: prepare block group cache before writing · dcdf7f6d
      Josef Bacik authored
      
      
      Writing the block group cache will modify the extent tree quite a bit because it
      truncates the old space cache and pre-allocates new stuff.  To try and cut down
      on the churn lets do the setup dance first, then later on hopefully we can avoid
      looping with newly dirtied roots.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      dcdf7f6d
  19. 13 Mar, 2015 2 commits
    • Josef Bacik's avatar
      Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list) · ea526d18
      Josef Bacik authored
      
      
      Dave could hit this assert consistently running btrfs/078.  This is because
      when we update the block groups we could truncate the free space, which would
      try to delete the csums for that range and dirty the csum root.  For this to
      happen we have to have already written out the csum root so it's kind of hard to
      hit this case.  This patch fixes this by changing the logic to only write the
      dirty block groups if the dirty_cowonly_roots list is empty.  This will get us
      the same effect as before since we add the extent root last, and will cover the
      case that we dirty some other root again but not the extent root.  Thanks,
      Reported-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ea526d18
    • Liu Bo's avatar
      Btrfs: catch transaction abortion after waiting for it · b4924a0f
      Liu Bo authored
      This problem is uncovered by a test case: http://patchwork.ozlabs.org/patch/244297
      
      .
      
      Fsync() can report success when it actually doesn't.  When we
      have several threads running fsync() at the same tiem and in one fsync() we
      get a transaction abortion due to some problems(in the test case it's disk
      failures), and other fsync()s may return successfully which makes userspace
      programs think that data is now safely flushed into disk.
      
      It's because that after fsyncs() fail btrfs_sync_log() due to disk failures,
      they get to try btrfs_commit_transaction() where it finds that there is
      already a transaction being committed, and they'll just call wait_for_commit()
      and return.  Note that we actually check "trans->aborted" in btrfs_end_transaction,
      but it's likely that the error message is still not yet throwed out and only after
      wait_for_commit() we're sure whether the transaction is committed successfully.
      
      This add the necessary check and it now passes the test.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b4924a0f
  20. 05 Mar, 2015 1 commit
  21. 03 Mar, 2015 1 commit
  22. 16 Feb, 2015 1 commit
  23. 14 Feb, 2015 1 commit
    • Zhao Lei's avatar
      btrfs: Fix out-of-space bug · 13212b54
      Zhao Lei authored
      Btrfs will report NO_SPACE when we create and remove files for several times,
      and we can't write to filesystem until mount it again.
      
      Steps to reproduce:
       1: Create a single-dev btrfs fs with default option
       2: Write a file into it to take up most fs space
       3: Delete above file
       4: Wait about 100s to let chunk removed
       5: goto 2
      
      Script is like following:
       #!/bin/bash
      
       # Recommend 1.2G space, too large disk will make test slow
       DEV="/dev/sda16"
       MNT="/mnt/tmp"
      
       dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
       file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))
      
       echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"
      
       for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
       echo "mkfs $DEV"
       mkfs.btrfs -f "$DEV" >/dev/null || exit 2
       echo "mount $DEV $MNT"
       mount "$DEV" "$MNT" || exit 2
      
       for ((loop_i = 0; loop_i < 20; loop_i++)); do
           echo
           echo "loop $loop_i"
      
           echo "dd file..."
           cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
           "${cmd[@]}" 2>/dev/null || {
               # NO_SPACE error triggered
               echo "dd failed: ${cmd[*]}"
               exit 1
           }
      
           echo "rm file..."
           rm -f "$MNT"/file0 || exit 2
      
           for ((i = 0; i < 10; i++)); do
               df "$MNT" | tail -1
               sleep 10
           done
       done
      
      Reason:
       It is triggered by commit: 47ab2a6c
      
      
       which is used to remove empty block groups automatically, but the
       reason is not in that patch. Code before works well because btrfs
       don't need to create and delete chunks so many times with high
       complexity.
       Above bug is caused by many reason, any of them can trigger it.
      
      Reason1:
       When we remove some continuous chunks but leave other chunks after,
       these disk space should be used by chunk-recreating, but in current
       code, only first create will successed.
       Fixed by Forrest Liu <forrestl@synology.com> in:
       Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
      
      Reason2:
       contains_pending_extent() return wrong value in calculation.
       Fixed by Forrest Liu <forrestl@synology.com> in:
       Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
      
      Reason3:
       btrfs_check_data_free_space() try to commit transaction and retry
       allocating chunk when the first allocating failed, but space_info->full
       is set in first allocating, and prevent second allocating in retry.
       Fixed in this patch by clear space_info->full in commit transaction.
      
       Tested for severial times by above script.
      
      Changelog v3->v4:
       use light weight int instead of atomic_t to record have_remove_bgs in
       transaction, suggested by:
       Josef Bacik <jbacik@fb.com>
      
      Changelog v2->v3:
       v2 fixed the bug by adding more commit-transaction, but we
       only need to reclaim space when we are really have no space for
       new chunk, noticed by:
       Filipe David Manana <fdmanana@gmail.com>
      
       Actually, our code already have this type of commit-and-retry,
       we only need to make it working with removed-bgs.
       v3 fixed the bug with above way.
      
      Changelog v1->v2:
       v1 will introduce a new bug when delete and create chunk in same disk
       space in same transaction, noticed by:
       Filipe David Manana <fdmanana@gmail.com>
       V2 fix this bug by commit transaction after remove block grops.
      Reported-by: default avatarTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Suggested-by: default avatarFilipe David Manana <fdmanana@gmail.com>
      Suggested-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      13212b54
  24. 21 Jan, 2015 2 commits
    • Josef Bacik's avatar
      Btrfs: track dirty block groups on their own list · ce93ec54
      Josef Bacik authored
      
      
      Currently any time we try to update the block groups on disk we will walk _all_
      block groups and check for the ->dirty flag to see if it is set.  This function
      can get called several times during a commit.  So if you have several terabytes
      of data you will be a very sad panda as we will loop through _all_ of the block
      groups several times, which makes the commit take a while which slows down the
      rest of the file system operations.
      
      This patch introduces a dirty list for the block groups that we get added to
      when we dirty the block group for the first time.  Then we simply update any
      block groups that have been dirtied since the last time we called
      btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
      free space cache out so it is much cleaner.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ce93ec54
    • Josef Bacik's avatar
      Btrfs: change how we track dirty roots · e7070be1
      Josef Bacik authored
      
      
      I've been overloading root->dirty_list to keep track of dirty roots and which
      roots need to have their commit roots switched at transaction commit time.  This
      could cause us to lose an update to the root which could corrupt the file
      system.  To fix this use a state bit to know if the root is dirty, and if it
      isn't set we go ahead and move the root to the dirty list.  This way if we
      re-dirty the root after adding it to the switch_commit list we make sure to
      update it.  This also makes it so that the extent root is always the last root
      on the dirty list to try and keep the amount of churn down at this point in the
      commit.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e7070be1
  25. 20 Jan, 2015 1 commit
  26. 21 Nov, 2014 1 commit
    • Josef Bacik's avatar
      Btrfs: make sure logged extents complete in the current transaction V3 · 50d9aa99
      Josef Bacik authored
      
      
      Liu Bo pointed out that my previous fix would lose the generation update in the
      scenario I described.  It is actually much worse than that, we could lose the
      entire extent if we lose power right after the transaction commits.  Consider
      the following
      
      write extent 0-4k
      log extent in log tree
      commit transaction
      	< power fail happens here
      ordered extent completes
      
      We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
      the transaction commit will reset the log so it isn't replayed.  If we lose
      power before the transaction commit we are save, otherwise we are not.
      
      Fix this by keeping track of all extents we logged in this transaction.  Then
      when we go to commit the transaction make sure we wait for all of those ordered
      extents to complete before proceeding.  This will make sure that if we lose
      power after the transaction commit we still have our data.  This also fixes the
      problem of the improperly updated extent generation.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      50d9aa99