Skip to content
  • Filipe Manana's avatar
    Btrfs: fix race between device replace and block group removal · 57ba4cb8
    Filipe Manana authored
    When it's finishing, the device replace code iterates all extent maps
    representing block group and for each one that has a stripe that refers
    to the source device, it replaces its device with the target device.
    However when it replaces the source device with the target device it,
    the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID),
    only after its ID is changed to match the one from the source device.
    This leads to races with the chunk removal code that can temporarly see
    a device with an ID of 0ULL and then attempt to use that ID to remove
    items from the device tree and fail, causing a transaction abort:
    
    [ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished
    [ 9238.594377] ------------[ cut here ]------------
    [ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs]
    [ 9238.594403] BTRFS: Transaction aborted (error 1)
    [ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se
    rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl
    oppy [last unloaded: btrfs]
    [ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1
    [ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [ 9238.594421]  0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0
    [ 9238.594422]  0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18
    [ 9238.594423]  0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0
    [ 9238.594424] Call Trace:
    [ 9238.594428]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
    [ 9238.594430]  [<ffffffff81052b14>] __warn+0xc2/0xdd
    [ 9238.594432]  [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53
    [ 9238.594434]  [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188
    [ 9238.594450]  [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
    [ 9238.594452]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
    [ 9238.594464]  [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
    [ 9238.594476]  [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs]
    [ 9238.594489]  [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs]
    [ 9238.594490]  [<ffffffff8106f403>] kthread+0xd4/0xdc
    [ 9238.594494]  [<ffffffff8149e242>] ret_from_fork+0x22/0x40
    [ 9238.594495]  [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
    [ 9238.594496] ---[ end trace 183efbe50275f059 ]---
    
    The sequence of steps leading to this is like the following:
    
                  CPU 1                                           CPU 2
    
     btrfs_dev_replace_finishing()
    
       at this point
       dev_replace->tgtdev->devid ==
       BTRFS_DEV_REPLACE_DEVID (0ULL)
    
       ...
    
       btrfs_start_transaction()
       btrfs_commit_transaction()
    
                                                         btrfs_delete_unused_bgs()
                                                           btrfs_remove_chunk()
    
                                                             looks up for the extent map
                                                             corresponding to the chunk
    
                                                             lock_chunks() (chunk_mutex)
                                                             check_system_chunk()
                                                             unlock_chunks() (chunk_mutex)
    
       locks fs_info->chunk_mutex
    
       btrfs_dev_replace_update_device_in_mapping_tree()
         --> iterates fs_info->mapping_tree and
             replaces the device in every extent
             map's map->stripes[] with
             dev_replace->tgtdev, which still has
             an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
    
                                                             iterates over all stripes from
                                                             the extent map
    
                                                               --> calls btrfs_free_dev_extent()
                                                                   passing it the target device
                                                                   that still has an ID of 0ULL
    
                                                               --> btrfs_free_dev_extent() fails
                                                                 --> aborts current transaction
    
       finishes setting up the target device,
       namely it sets tgtdev->devid to the value
       of srcdev->devid (which is necessarily > 0)
    
       frees the srcdev
    
       unlocks fs_info->chunk_mutex
    
    So fix this by taking the device list mutex while processing the stripes
    for the chunk's extent map. This is similar to the race between device
    replace and block group creation that was fixed by commit 50460e37
    
    
    ("Btrfs: fix race when finishing dev replace leading to transaction abort").
    
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
    57ba4cb8