1. 17 Jun, 2016 2 commits
  2. 12 May, 2016 3 commits
    • Filipe Manana's avatar
      Btrfs: don't wait for unrelated IO to finish before relocation · 578def7c
      Filipe Manana authored
      
      
      Before the relocation process of a block group starts, it sets the block
      group to readonly mode, then flushes all delalloc writes and then finally
      it waits for all ordered extents to complete. This last step includes
      waiting for ordered extents destinated at extents allocated in other block
      groups, making us waste unecessary time.
      
      So improve this by waiting only for ordered extents that fall into the
      block group's range.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      578def7c
    • David Sterba's avatar
      btrfs: build fixup for qgroup_account_snapshot · 2c1984f2
      David Sterba authored
      
      
      The macro btrfs_std_error got renamed to btrfs_handle_fs_error in an
      independent branch for the same merge target (4.7). To make the code
      compilable for bisectability reasons, add a temporary stub.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2c1984f2
    • Qu Wenruo's avatar
      btrfs: qgroup: Fix qgroup accounting when creating snapshot · 6426c7ad
      Qu Wenruo authored
      
      
      Current btrfs qgroup design implies a requirement that after calling
      btrfs_qgroup_account_extents() there must be a commit root switch.
      
      Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
      inside btrfs_commit_transaction() just be commit_cowonly_roots().
      
      However there is a exception at create_pending_snapshot(), which will
      call btrfs_qgroup_account_extents() but no any commit root switch.
      
      In case of creating a snapshot whose parent root is itself (create a
      snapshot of fs tree), it will corrupt qgroup by the following trace:
      (skipped unrelated data)
      ======
      btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
      qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
      qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
      btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
      ======
      
      The problem here is in first qgroup_account_extent(), the
      nr_new_roots of the extent is 1, which means its reference got
      increased, and qgroup increased its rfer and excl.
      
      But at second qgroup_account_extent(), its reference got decreased, but
      between these two qgroup_account_extent(), there is no switch roots.
      This leads to the same nr_old_roots, and this extent just got ignored by
      qgroup, which means this extent is wrongly accounted.
      
      Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
      create_pending_snapshot(), with needed preparation.
      
      Mark: I added a check at the top of qgroup_account_snapshot() to skip this
      code if qgroups are turned off. xfstest btrfs/122 exposes this problem.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarMark Fasheh <mfasheh@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6426c7ad
  3. 29 Apr, 2016 1 commit
  4. 28 Apr, 2016 1 commit
  5. 18 Feb, 2016 2 commits
  6. 07 Jan, 2016 3 commits
  7. 17 Dec, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: fix memory leaks after transaction is aborted · 7785a663
      Filipe Manana authored
      
      
      When a transaction is aborted, or its commit fails before writing the new
      superblock and calling btrfs_finish_extent_commit(), we leak reference
      counts on the block groups attached to the transaction's delete_bgs list,
      because btrfs_finish_extent_commit() is never called for those two cases.
      Fix this by dropping their references at btrfs_put_transaction(), which
      is called when transactions are aborted (by making the transaction kthread
      commit the transaction) or if their commits fail.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      7785a663
  8. 10 Dec, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list · 348a0013
      Filipe Manana authored
      
      
      As of my previous change titled "Btrfs: fix scrub preventing unused block
      groups from being deleted", the following warning at
      extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a
      filesysten with "-o discard":
      
       10263  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
       10264  {
       (...)
       10405                  if (trimming) {
       10406                          WARN_ON(!list_empty(&block_group->bg_list));
       10407                          spin_lock(&trans->transaction->deleted_bgs_lock);
       10408                          list_move(&block_group->bg_list,
       10409                                    &trans->transaction->deleted_bgs);
       10410                          spin_unlock(&trans->transaction->deleted_bgs_lock);
       10411                          btrfs_get_block_group(block_group);
       10412                  }
       (...)
      
      This happens because scrub can now add back the block group to the list of
      unused block groups (fs_info->unused_bgs). This is dangerous because we
      are moving the block group from the unused block groups list to the list
      of deleted block groups without holding the lock that protects the source
      list (fs_info->unused_bgs_lock).
      
      The following diagram illustrates how this happens:
      
                  CPU 1                                     CPU 2
      
       cleaner_kthread()
         btrfs_delete_unused_bgs()
      
           sees bg X in list
            fs_info->unused_bgs
      
           deletes bg X from list
            fs_info->unused_bgs
      
                                                  scrub_enumerate_chunks()
      
                                                    searches device tree using
                                                    its commit root
      
                                                    finds device extent for
                                                    block group X
      
                                                    gets block group X from the tree
                                                    fs_info->block_group_cache_tree
                                                    (via btrfs_lookup_block_group())
      
                                                    sets bg X to RO (again)
      
                                                    scrub_chunk(bg X)
      
                                                    sets bg X back to RW mode
      
                                                    adds bg X to the list
                                                    fs_info->unused_bgs again,
                                                    since it's still unused and
                                                    currently not in that list
      
           sets bg X to RO mode
      
           btrfs_remove_chunk(bg X)
      
           --> discard is enabled and bg X
               is in the fs_info->unused_bgs
               list again so the warning is
               triggered
           --> we move it from that list into
               the transaction's delete_bgs
               list, but we can have another
               task currently manipulating
               the first list (fs_info->unused_bgs)
      
      Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
      the list of unused block groups and the list of deleted block groups. This
      makes it safe and there's not much worry for more lock contention, as this
      lock is seldom used and only the cleaner kthread adds elements to the list
      of deleted block groups. The warning goes away too, as this was previously
      an impossible case (and would have been better a BUG_ON/ASSERT) but it's
      not impossible anymore.
      Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      348a0013
  9. 25 Nov, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana authored
      
      
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Also unmounting and mounting again the filesystem didn't
      help. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8eab77ff
  10. 21 Oct, 2015 6 commits
  11. 10 Oct, 2015 2 commits
  12. 05 Oct, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: fix deadlock when finalizing block group creation · d9a0540a
      Filipe Manana authored
      
      
      Josef ran into a deadlock while a transaction handle was finalizing the
      creation of its block groups, which produced the following trace:
      
        [260445.593112] fio             D ffff88022a9df468     0  8924   4518 0x00000084
        [260445.593119]  ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
        [260445.593126]  ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
        [260445.593132]  ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
        [260445.593137] Call Trace:
        [260445.593145]  [<ffffffff8175a437>] schedule+0x37/0x80
        [260445.593189]  [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
        [260445.593197]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
        [260445.593225]  [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
        [260445.593253]  [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
        [260445.593295]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
        [260445.593324]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [260445.593351]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [260445.593394]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
        [260445.593427]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
        [260445.593459]  [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
        [260445.593491]  [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs]
        [260445.593524]  [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs]
        [260445.593532]  [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170
        [260445.593564]  [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
        [260445.593597]  [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
        [260445.593626]  [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
        [260445.593654]  [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
        [260445.593682]  [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
        [260445.593724]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
        [260445.593752]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [260445.593830]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [260445.593905]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
        [260445.593946]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
        [260445.593990]  [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
        [260445.594042]  [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
        [260445.594089]  [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs]
        [260445.594115]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
        [260445.594133]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
        [260445.594149]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
        [260445.594169]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
        [260445.594187]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
        [260445.594204]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      This happened because the same transaction handle created a large number
      of block groups and while finalizing their creation (inserting new items
      and updating existing items in the chunk and device trees) a new metadata
      extent had to be allocated and no free space was found in the current
      metadata block groups, which made find_free_extent() attempt to allocate
      a new block group via do_chunk_alloc(). However at do_chunk_alloc() we
      ended up allocating a new system chunk too and exceeded the threshold
      of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the
      final part of block group creation again (at
      btrfs_create_pending_block_groups()) and attempt to lock again the root
      of the chunk tree when it's already write locked by the same task.
      
      Similarly we can deadlock on extent tree nodes/leafs if while we are
      running delayed references we end up creating a new metadata block group
      in order to allocate a new node/leaf for the extent tree (as part of
      a CoW operation or growing the tree), as btrfs_create_pending_block_groups
      inserts items into the extent tree as well. In this case we get the
      following trace:
      
        [14242.773581] fio             D ffff880428ca3418     0  3615   3100 0x00000084
        [14242.773588]  ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
        [14242.773594]  ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
        [14242.773600]  ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
        [14242.773606] Call Trace:
        [14242.773613]  [<ffffffff8175a437>] schedule+0x37/0x80
        [14242.773656]  [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
        [14242.773664]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
        [14242.773692]  [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
        [14242.773720]  [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
        [14242.773750]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [14242.773758]  [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200
        [14242.773786]  [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs]
        [14242.773818]  [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
        [14242.773850]  [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
        [14242.773934]  [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs]
        [14242.773998]  [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
        [14242.774041]  [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
        [14242.774078]  [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
        [14242.774118]  [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
        [14242.774155]  [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
        [14242.774194]  [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
        [14242.774235]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [14242.774274]  [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [14242.774318]  [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
        [14242.774358]  [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
        [14242.774391]  [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
        [14242.774432]  [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs]
        [14242.774474]  [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
        [14242.774516]  [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
        [14242.774558]  [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs]
        [14242.774599]  [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
        [14242.774642]  [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs]
        [14242.774650]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
        [14242.774657]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
        [14242.774663]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
        [14242.774669]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
        [14242.774675]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
        [14242.774681]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      Fix this by never recursing into the finalization phase of block group
      creation and making sure we never trigger the finalization of block group
      creation while running delayed references.
      Reported-by: default avatarJosef Bacik <jbacik@fb.com>
      Fixes: 00d80e34
      
       ("Btrfs: fix quick exhaustion of the system array in the superblock")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      d9a0540a
  13. 29 Sep, 2015 1 commit
  14. 22 Sep, 2015 1 commit
    • Josef Bacik's avatar
      Btrfs: keep dropped roots in cache until transaction commit · 2b9dbef2
      Josef Bacik authored
      
      
      When dropping a snapshot we need to account for the qgroup changes.  If we drop
      the snapshot in all one go then the backref code will fail to find blocks from
      the snapshot we dropped since it won't be able to find the root in the fs root
      cache.  This can lead to us failing to find refs from other roots that pointed
      at blocks in the now deleted root.  To handle this we need to not remove the fs
      roots from the cache until after we process the qgroup operations.  Do this by
      adding dropped roots to a list on the transaction, and letting the transaction
      remove the roots at the same time it drops the commit roots.  This will keep all
      of the backref searching code in sync properly, and fixes a problem Mark was
      seeing with snapshot delete and qgroups.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Tested-by: default avatarHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2b9dbef2
  15. 19 Aug, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: check if previous transaction aborted to avoid fs corruption · 1f9b8c8f
      Filipe Manana authored
      
      
      While we are committing a transaction, it's possible the previous one is
      still finishing its commit and therefore we wait for it to finish first.
      However we were not checking if that previous transaction ended up getting
      aborted after we waited for it to commit, so we ended up committing the
      current transaction which can lead to fs corruption because the new
      superblock can point to trees that have had one or more nodes/leafs that
      were never durably persisted.
      The following sequence diagram exemplifies how this is possible:
      
                CPU 0                                                        CPU 1
      
        transaction N starts
      
        (...)
      
        btrfs_commit_transaction(N)
      
          cur_trans->state = TRANS_STATE_COMMIT_START;
          (...)
          cur_trans->state = TRANS_STATE_COMMIT_DOING;
          (...)
      
          cur_trans->state = TRANS_STATE_UNBLOCKED;
          root->fs_info->running_transaction = NULL;
      
                                                                    btrfs_start_transaction()
                                                                       --> starts transaction N + 1
      
          btrfs_write_and_wait_transaction(trans, root);
            --> starts writing all new or COWed ebs created
                at transaction N
      
                                                                    creates some new ebs, COWs some
                                                                    existing ebs but doesn't COW or
                                                                    deletes eb X
      
                                                                    btrfs_commit_transaction(N + 1)
                                                                      (...)
                                                                      cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                      (...)
                                                                      wait_for_commit(root, prev_trans);
                                                                        --> prev_trans == transaction N
      
          btrfs_write_and_wait_transaction() continues
          writing ebs
             --> fails writing eb X, we abort transaction N
                 and set bit BTRFS_FS_STATE_ERROR on
                 fs_info->fs_state, so no new transactions
                 can start after setting that bit
      
             cleanup_transaction()
               btrfs_cleanup_one_transaction()
                 wakes up task at CPU 1
      
                                                                      continues, doesn't abort because
                                                                      cur_trans->aborted (transaction N + 1)
                                                                      is zero, and no checks for bit
                                                                      BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                      are made
      
                                                                      btrfs_write_and_wait_transaction(trans, root);
                                                                        --> succeeds, no errors during writeback
      
                                                                      write_ctree_super(trans, root, 0);
                                                                        --> succeeds
                                                                        --> we have now a superblock that points us
                                                                            to some root that uses eb X, which was
                                                                            never written to disk
      
      In this scenario future attempts to read eb X from disk results in an
      error message like "parent transid verify failed on X wanted Y found Z".
      
      So fix this by aborting the current transaction if after waiting for the
      previous transaction we verify that it was aborted.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1f9b8c8f
  16. 15 Aug, 2015 1 commit
  17. 09 Aug, 2015 1 commit
  18. 29 Jul, 2015 1 commit
    • Jeff Mahoney's avatar
      btrfs: add missing discards when unpinning extents with -o discard · e33e17ee
      Jeff Mahoney authored
      
      
      When we clear the dirty bits in btrfs_delete_unused_bgs for extents
      in the empty block group, it results in btrfs_finish_extent_commit being
      unable to discard the freed extents.
      
      The block group removal patch added an alternate path to forget extents
      other than btrfs_finish_extent_commit.  As a result, any extents that
      would be freed when the block group is removed aren't discarded.  In my
      test run, with a large copy of mixed sized files followed by removal, it
      left nearly 2/3 of extents undiscarded.
      
      To clean up the block groups, we add the removed block group onto a list
      that will be discarded after transaction commit.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Tested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e33e17ee
  19. 22 Jul, 2015 1 commit
    • Zhao Lei's avatar
      btrfs: Fix lockdep warning of btrfs_run_delayed_iputs() · 8a733013
      Zhao Lei authored
      
      
      Liu Bo <bo.li.liu@oracle.com> reported a lockdep warning of
      delayed_iput_sem in xfstests generic/241:
        [ 2061.345955] =============================================
        [ 2061.346027] [ INFO: possible recursive locking detected ]
        [ 2061.346027] 4.1.0+ #268 Tainted: G        W
        [ 2061.346027] ---------------------------------------------
        [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock:
        [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at:
        [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
        [ 2061.346027] but task is already holding lock:
        [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
        [ 2061.346027] other info that might help us debug this:
        [ 2061.346027]  Possible unsafe locking scenario:
      
        [ 2061.346027]        CPU0
        [ 2061.346027]        ----
        [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
        [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
        [ 2061.346027]
         *** DEADLOCK ***
      It is rarely happened, about 1/400 in my test env.
      
      The reason is recursion of btrfs_run_delayed_iputs():
        cleaner_kthread
        -> btrfs_run_delayed_iputs() *1
        -> get delayed_iput_sem lock *2
        -> iput()
        -> ...
        -> btrfs_commit_transaction()
        -> btrfs_run_delayed_iputs() *1
        -> get delayed_iput_sem lock (dead lock) *2
        *1: recursion of btrfs_run_delayed_iputs()
        *2: warning of lockdep about delayed_iput_sem
      
      When fs is in high stress, new iputs may added into fs_info->delayed_iputs
      list when btrfs_run_delayed_iputs() is running, which cause
      second btrfs_run_delayed_iputs() run into down_read(&fs_info->delayed_iput_sem)
      again, and cause above lockdep warning.
      
      Actually, it will not cause real problem because both locks are read lock,
      but to avoid lockdep warning, we can do a fix.
      
      Fix:
        Don't do btrfs_run_delayed_iputs() in btrfs_commit_transaction() for
        cleaner_kthread thread to break above recursion path.
        cleaner_kthread is calling btrfs_run_delayed_iputs() explicitly in code,
        and don't need to call btrfs_run_delayed_iputs() again in
        btrfs_commit_transaction(), it also give us a bonus to avoid stack overflow.
      
      Test:
        No above lockdep warning after patch in 1200 generic/241 tests.
      Reported-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8a733013
  20. 11 Jul, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: fix list transaction->pending_ordered corruption · d3efe084
      Filipe Manana authored
      When we call btrfs_commit_transaction(), we splice the list "ordered"
      of our transaction handle into the transaction's "pending_ordered"
      list, but we don't re-initialize the "ordered" list of our transaction
      handle, this means it still points to the same elements it used to
      before the splice. Then we check if the current transaction's state is
      >= TRANS_STATE_COMMIT_START and if it is we end up calling
      btrfs_end_transaction() which simply splices again the "ordered" list
      of our handle into the transaction's "pending_ordered" list, leaving
      multiple pointers to the same ordered extents which results in list
      corruption when we are iterating, removing and freeing ordered extents
      at btrfs_wait_pending_ordered(), resulting in access to dangling
      pointers / use-after-free issues.
      Similarly, btrfs_end_transaction() can end up in some cases calling
      btrfs_commit_transaction(), and both did a list splice of the transaction
      handle's "ordered" list into the transaction's "pending_ordered" without
      re-initializing the handle's "ordered" list, resulting in exactly the
      same problem.
      
      This produces the following warning on a kernel with linked list
      debugging enabled:
      
      [109749.265416] ------------[ cut here ]------------
      [109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
      [109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d
      (...)
      [109749.287505] Call Trace:
      [109749.288135]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
      [109749.298080]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
      [109749.331605]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
      [109749.334849]  [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98
      [109749.337093]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
      [109749.337847]  [<ffffffff81260642>] __list_del_entry+0x5a/0x98
      [109749.338678]  [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
      [109749.340145]  [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
      [109749.348313]  [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
      [109749.349745]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
      [109749.350819]  [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs]
      [109749.351976]  [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e
      [109749.360341]  [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e
      [109749.368828]  [<ffffffff8118ee1d>] do_fsync+0x34/0x4e
      [109749.369790]  [<ffffffff8118f045>] SyS_fsync+0x10/0x14
      [109749.370925]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
      [109749.382274] ---[ end trace 48e0d07f7c03d95a ]---
      
      On a non-debug kernel this leads to invalid memory accesses, causing a
      crash. Fix this by using list_splice_init() instead of list_splice() in
      btrfs_commit_transaction() and btrfs_end_transaction().
      
      Cc: stable@vger.kernel.org
      Fixes: 50d9aa99
      
       ("Btrfs: make sure logged extents complete in the current transaction V3"
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      d3efe084
  21. 10 Jun, 2015 4 commits
  22. 03 Jun, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: fix -ENOSPC when finishing block group creation · 4fbcdf66
      Filipe Manana authored
      
      
      While creating a block group, we often end up getting ENOSPC while updating
      the chunk tree, which leads to a transaction abortion that produces a trace
      like the following:
      
      [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
      [30670.117777] BTRFS: Transaction aborted (error -28)
      (...)
      [30670.163567] Call Trace:
      [30670.163906]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [30670.164522]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [30670.165171]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [30670.166323]  [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.167213]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [30670.167862]  [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.169116]  [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
      [30670.170593]  [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [30670.171960]  [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [30670.174649]  [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [30670.176092]  [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
      [30670.177218]  [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
      [30670.178622]  [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
      [30670.179642]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
      [30670.180692]  [<ffffffff81152863>] SyS_fallocate+0x47/0x62
      [30670.186737]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [30670.187792] ---[ end trace 0373e6b491c4a8cc ]---
      
      This is because we don't do proper space reservation for the chunk block
      reserve when we have multiple tasks allocating chunks in parallel.
      
      So block group creation has 2 phases, and the first phase essentially
      checks if there is enough space in the system space_info, allocating a
      new system chunk if there isn't, while the second phase updates the
      device, extent and chunk trees. However, because the updates to the
      chunk tree happen in the second phase, if we have N tasks, each with
      its own transaction handle, allocating new chunks in parallel and if
      there is only enough space in the system space_info to allocate M chunks,
      where M < N, none of the tasks ends up allocating a new system chunk in
      the first phase and N - M tasks will get -ENOSPC when attempting to
      update the chunk tree in phase 2 if they need to COW any nodes/leafs
      from the chunk tree.
      
      Fix this by doing proper reservation in the chunk block reserve.
      
      The issue could be reproduced by running fstests generic/038 in a loop,
      which eventually triggered the problem.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      4fbcdf66
  23. 10 Apr, 2015 3 commits
    • Chris Mason's avatar
      Btrfs: allow block group cache writeout outside critical section in commit · 1bbc621e
      Chris Mason authored
      
      
      We loop through all of the dirty block groups during commit and write
      the free space cache.  In order to make sure the cache is currect, we do
      this while no other writers are allowed in the commit.
      
      If a large number of block groups are dirty, this can introduce long
      stalls during the final stages of the commit, which can block new procs
      trying to change the filesystem.
      
      This commit changes the block group cache writeout to take appropriate
      locks and allow it to run earlier in the commit.  We'll still have to
      redo some of the block groups, but it means we can get most of the work
      out of the way without blocking the entire FS.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1bbc621e
    • Josef Bacik's avatar
      Btrfs: reserve space for block groups · cb723e49
      Josef Bacik authored
      
      
      This changes our delayed refs calculations to include the space needed
      to write back dirty block groups.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      cb723e49
    • Josef Bacik's avatar
      Btrfs: account for crcs in delayed ref processing · 1262133b
      Josef Bacik authored
      
      
      As we delete large extents, we end up doing huge amounts of COW in order
      to delete the corresponding crcs.  This adds accounting so that we keep
      track of that space and flushing of delayed refs so that we don't build
      up too much delayed crc work.
      
      This helps limit the delayed work that must be done at commit time and
      tries to avoid ENOSPC aborts because the crcs eat all the global
      reserves.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1262133b