    Btrfs: fix race setting block group readonly during device replace · f0e9b7d6
    Filipe Manana authored
    During a device replace, for each device extent found on the source
    device we set the corresponding block group to readonly mode, to
    prevent writes into it while we copy the device extent from the
    source to the target device. However, just before we set the block
    group to readonly mode, a concurrent task might have already
    allocated an extent from it, or decided it could perform a nocow
    write into one of its extents. This can cause the device replace
    process to miss copying an extent, because it searches for extents
    using the extent tree's commit root, and only sets the left cursor to
    the logical end address of the block group once it has finished
    searching for all extents belonging to that block group. The problem
    occurs when the respective ordered extents finish while we are
    searching the extent tree's commit root and no transaction commit
    happens while we are iterating the tree: it is the delayed references
    created by the ordered extents (when they complete) that insert the
    extent items into the extent tree, and they do so through the
    non-commit root.
              CPU 1                                            CPU 2
           --> finds device extent belonging
               to block group X
                                   <transaction N starts>
                                                          starts buffered write
                                                          against some inode
                                                          writepages is run against
                                                          that inode forcing delalloc
                                                          to run
                                                                          --> allocates an extent
                                                                              from block group X
                                                                              (which is not yet
                                                                               in RO mode)
                                                                          --> creates ordered extent Y
                                                              --> bio against the extent from
                                                                  block group X is submitted
           btrfs_inc_block_group_ro(bg X)
             --> sets block group X to readonly
           scrub_chunk(bg X)
             scrub_stripe(device extent from srcdev)
               --> keeps searching for extent items
                   belonging to the block group using
                   the extent tree's commit root
               --> it never blocks due to
                   fs_info->scrub_pause_req as no
                   one tries to commit transaction N
               --> copies all extents found from the
                   source device into the target device
               --> finishes search loop
                                                            bio completes
                                                            ordered extent Y completes
                                                            and creates delayed data
                                                            reference which will add an
                                                            extent item to the extent
                                                            tree when run (typically
                                                            at transaction commit time)
                                                              --> so the task doing the
                                                                  scrub/device replace
                                                                  at CPU 1 misses this
                                                                  and does not copy this
                                                                  extent into the new/target
                                                                  device
           btrfs_dec_block_group_ro(bg X)
             --> turns block group X back to RW mode
           dev_replace->cursor_left is set to the
           logical end offset of block group X
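    The timeline above can be replayed deterministically in a small
    userspace model. This is only an illustrative sketch, not kernel
    code: every toy_* name is hypothetical, the extent tree is reduced
    to two arrays (a live root and a commit root snapshot), and the
    delayed reference is collapsed into a direct insert into the live
    root.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_EXTENTS 8

/* Toy extent tree: new items go into the live root; the commit root is
 * the snapshot taken at the last transaction commit. */
struct toy_extent_tree {
	unsigned long live[TOY_MAX_EXTENTS];
	size_t nr_live;
	unsigned long commit[TOY_MAX_EXTENTS];
	size_t nr_commit;
};

/* Stand-in for a delayed data reference being run: adds an extent item
 * to the live (non-commit) root only. */
static void toy_insert_extent(struct toy_extent_tree *t, unsigned long start)
{
	assert(t->nr_live < TOY_MAX_EXTENTS);
	t->live[t->nr_live++] = start;
}

/* Transaction commit: the live root becomes the new commit root. */
static void toy_commit_transaction(struct toy_extent_tree *t)
{
	memcpy(t->commit, t->live, sizeof(t->live));
	t->nr_commit = t->nr_live;
}

/* Scrub/replace side: only extents visible in the commit root are
 * found, so only those are copied. Returns the number copied. */
static size_t toy_copy_from_commit_root(const struct toy_extent_tree *t)
{
	return t->nr_commit;
}

/* Replays the race and returns how many extents were never copied. */
static size_t toy_replay_race(void)
{
	struct toy_extent_tree tree = {0};
	size_t copied;

	/* One extent already exists and was committed earlier. */
	toy_insert_extent(&tree, 0);
	toy_commit_transaction(&tree);

	/* CPU 2: allocates an extent at offset 4096 and creates ordered
	 * extent Y. No extent item exists yet in either root. */

	/* CPU 1: sets the block group RO and scans the commit root. */
	copied = toy_copy_from_commit_root(&tree);

	/* CPU 2: the bio and ordered extent Y complete; the item lands
	 * in the live root, but no commit has happened. */
	toy_insert_extent(&tree, 4096);

	/* CPU 1: moves the cursor past the block group. Anything in the
	 * live root that was not copied is lost on the target device. */
	return tree.nr_live - copied;
}
```

    In this model toy_replay_race() reports one missed extent, which
    corresponds to ordered extent Y in the diagram.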
    So fix this by waiting for all cow and nocow writes after setting a block
    group to readonly mode.
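    The ordering of the fix can be sketched in the same spirit. The
    model below is a hedged, single-threaded illustration with
    hypothetical toy_* names; it does not reproduce the kernel's
    helpers. Two simplifications are assumptions of the sketch rather
    than statements from this message: "waiting" is modeled by running
    the pending ordered extents to completion instead of sleeping, and
    a transaction commit is performed after the wait so that the commit
    root actually sees the new extent items.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy block group: tracks the RO flag, in-flight ordered extents, and
 * how many extent items each view of the extent tree can see. */
struct toy_block_group {
	bool ro;
	size_t pending_ordered;  /* ordered extents not yet completed */
	size_t live_items;       /* items in the live extent tree */
	size_t committed_items;  /* items visible via the commit root */
};

/* Writer side: allocation is refused once the block group is RO. */
static bool toy_alloc_extent(struct toy_block_group *bg)
{
	if (bg->ro)
		return false;
	bg->pending_ordered++;
	return true;
}

/* Ordered extent completion: the extent item reaches the live tree. */
static void toy_finish_ordered(struct toy_block_group *bg)
{
	assert(bg->pending_ordered > 0);
	bg->pending_ordered--;
	bg->live_items++;
}

/* Stand-in for sleeping until all earlier writers have finished. */
static void toy_wait_for_writers(struct toy_block_group *bg)
{
	while (bg->pending_ordered > 0)
		toy_finish_ordered(bg);
}

static void toy_commit_transaction(struct toy_block_group *bg)
{
	bg->committed_items = bg->live_items;
}

/* Fixed ordering: set RO first, then wait for writers that slipped in
 * before the flag was set, and only then scan the commit root.
 * Returns the number of extents copied to the target device. */
static size_t toy_replace_block_group(struct toy_block_group *bg)
{
	size_t copied;

	bg->ro = true;               /* no new allocations from here on */
	toy_wait_for_writers(bg);    /* drain writers that raced with RO */
	toy_commit_transaction(bg);  /* assumption: make items visible */
	copied = bg->committed_items;
	bg->ro = false;
	return copied;
}
```

    Setting the flag before waiting matters: once the block group is
    RO no new writer can start, so the set of writers to wait for is
    bounded.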
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: Josef Bacik <jbacik@fb.com>