1. 10 Apr, 2015 2 commits
    • Btrfs: account for crcs in delayed ref processing · 1262133b
      Josef Bacik authored
      
      
      As we delete large extents, we end up doing huge amounts of COW in order
      to delete the corresponding crcs.  This adds accounting so that we keep
      track of that space and flushing of delayed refs so that we don't build
      up too much delayed crc work.
      
      This helps limit the delayed work that must be done at commit time and
      tries to avoid ENOSPC aborts because the crcs eat all the global
      reserves.
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: actively run the delayed refs while deleting large files · 28ed1345
      Chris Mason authored
      
      
      When we are deleting large files with large extents, we are building up
      a huge set of delayed refs for processing.  Truncate isn't checking
      often enough to see if we need to back off and process those, or let
      a commit proceed.
      
      The end result is long stalls after the rm, and very long commit times.
      During the commits, other processes back up waiting to start new
      transactions and we get into trouble.
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 01 Apr, 2015 2 commits
    • fs: btrfs: Add missing include file · 4a3d1caf
      Guenter Roeck authored
      
      
      Building alpha:allmodconfig fails with
      
      fs/btrfs/inode.c: In function 'check_direct_IO':
      fs/btrfs/inode.c:8050:2: error: implicit declaration of function 'iov_iter_alignment'
      
      due to a missing include file.
      
      Fixes: 3737c63e1fb0 ("fs: move struct kiocb to fs.h")
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Acked-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: free and unlock our path before btrfs_free_and_pin_reserved_extent() · dd825259
      Chris Mason authored
      
      
      The error handling path for alloc_reserved_tree_block is calling
      btrfs_free_and_pin_reserved_extent with a spinning tree lock held.  This
      might sleep as we allocate extent_state objects:
      
       BUG: sleeping function called from invalid context at mm/slub.c:1268
       in_atomic(): 1, irqs_disabled(): 0, pid: 11093, name: kworker/u4:7
       5 locks held by kworker/u4:7/11093:
        #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
        #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
        #2:  (sb_internal){++++.+}, at: [<ffffffffa003a70e>] start_transaction+0x43e/0x590 [btrfs]
        #3:  (&head_ref->mutex){+.+...}, at: [<ffffffffa0089f8c>] btrfs_delayed_ref_lock+0x4c/0x240 [btrfs]
        #4:  (btrfs-extent-00){++++..}, at: [<ffffffffa007697b>] btrfs_clear_lock_blocking_rw+0x9b/0x150 [btrfs]
       CPU: 0 PID: 11093 Comm: kworker/u4:7 Tainted: G        W 4.0.0-rc6-default+ #246
       Hardware name: Intel Corporation Santa Rosa platform/Matanzas, BIOS TSRSCRB1.86C.0047.B00.0610170821 10/17/06
       Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
        00000000000004f4 ffff88006dd17848 ffffffff81ab0e3b ffff88006dd17848
        ffff88007a944760 ffff88006dd17868 ffffffff8109d516 ffff88006dd17898
        0000000000000000 ffff88006dd17898 ffffffff8109d5b2 ffffffff81aba2bb
       Call Trace:
        [<ffffffff81ab0e3b>] dump_stack+0x4f/0x6c
        [<ffffffff8109d516>] ___might_sleep+0xf6/0x140
        [<ffffffff8109d5b2>] __might_sleep+0x52/0x90
        [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
        [<ffffffff81196363>] kmem_cache_alloc+0x163/0x1b0
        [<ffffffffa0056f31>] ? alloc_extent_state+0x31/0x150 [btrfs]
        [<ffffffffa0056f20>] ? alloc_extent_state+0x20/0x150 [btrfs]
        [<ffffffffa0056f31>] alloc_extent_state+0x31/0x150 [btrfs]
        [<ffffffffa005805b>] __set_extent_bit+0x37b/0x5d0 [btrfs]
        [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
        [<ffffffffa005888d>] ? set_extent_bit+0xd/0x30 [btrfs]
        [<ffffffffa00588a3>] set_extent_bit+0x23/0x30 [btrfs]
        [<ffffffffa0058e80>] set_extent_dirty+0x20/0x30 [btrfs]
        [<ffffffffa00195ba>] pin_down_extent+0xaa/0x170 [btrfs]
        [<ffffffffa001d8ef>] __btrfs_free_reserved_extent+0xcf/0x160 [btrfs]
        [<ffffffffa0023856>] btrfs_free_and_pin_reserved_extent+0x16/0x20 [btrfs]
        [<ffffffffa002482a>] __btrfs_run_delayed_refs+0xfca/0x1290 [btrfs]
        [<ffffffffa0026eae>] btrfs_run_delayed_refs+0x6e/0x2e0 [btrfs]
        [<ffffffffa0027378>] delayed_ref_async_start+0x48/0xb0 [btrfs]
        [<ffffffffa006c883>] normal_work_helper+0x83/0x350 [btrfs]
        [<ffffffffa006cd79>] ? btrfs_extent_refs_helper+0x9/0x20 [btrfs]
        [<ffffffffa006cd82>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
        [<ffffffff81091dcb>] process_one_work+0x1cb/0x520
        [<ffffffff81091d51>] ? process_one_work+0x151/0x520
        [<ffffffff811c7abf>] ? seq_read+0x3f/0x400
        [<ffffffff8109260b>] worker_thread+0x5b/0x4e0
        [<ffffffff81097be2>] ? __kthread_parkme+0x12/0xa0
        [<ffffffff810925b0>] ? rescuer_thread+0x450/0x450
        [<ffffffff81098686>] kthread+0xf6/0x120
        [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
        [<ffffffff81ab8088>] ret_from_fork+0x58/0x90
        [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
       ------------[ cut here ]------------
      
      This changes things to free the path first, which will also unlock the
      extent buffer.
      Signed-off-by: Chris Mason <clm@fb.com>
      Reported-by: Dave Sterba <dsterba@suse.cz>
      Tested-by: Dave Sterba <dsterba@suse.cz>
  3. 26 Mar, 2015 14 commits
    • Btrfs: Remove the check for old-style mkfs · e56a951e
      Liu Bo authored
      This check was used to make sure that a fresh btrfs created by an older
      mkfs.btrfs could still be mounted, but it also allows us to mount a buggy
      btrfs if it has the right superblock head part but something wrong with
      the chunk tree part[1], and after that we can hit BUG_ON()s placed in the
      code to catch conditions assumed to be impossible.
      
      Since David has released "Btrfs progs v3.19-rc2", just remove the check;
      anyone who wants to make a fresh btrfs should use the latest release.
      
      [1]: http://www.spinics.net/lists/linux-btrfs/msg42358.html
      
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Omar Sandoval <osandov@osandov.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: cleanup orphans while looking up default subvolume · 727b9784
      Jeff Mahoney authored
      
      
      Orphans in the fs tree are cleaned up via open_ctree and subvolume
      orphans are cleaned via btrfs_lookup_dentry -- except when a default
      subvolume is in use.  The name for the default subvolume uses a manual
      lookup that doesn't trigger orphan cleanup and needs to trigger it
      manually as well. This doesn't apply to the remount case since the
      subvolumes are cleaned up by walking the root radix tree.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: explicitly set control file's private_data · d8620958
      Tom Van Braeckel authored
      The private_data member of the Btrfs control device file
      (/dev/btrfs-control) is used to hold the current transaction and needs
      to be initialized to NULL to signify that no transaction is in progress.
      
      We explicitly set the control file's private_data to NULL to be
      independent of whatever value the misc subsystem initializes it to.
      
      Backstory:
      ----------
      
      The misc subsystem (which is used by /dev/btrfs-control) initializes
      a file's private_data to point to the misc device when a driver has
      registered a custom open file operation and initializes it to NULL
      when a custom open file operation has *not* been provided.
      
      This subtle quirk is confusing, to the point where kernel code registers
      *empty* file open operations to have private_data point to the misc
      device structure.
      
      And it leads to bugs, where the addition or removal of a custom open
      file operation surprisingly changes the initial contents of a file's
      private_data structure.
      
      To simplify things in the misc subsystem, a patch [1] has been proposed
      to *always* set private_data to point to the misc device instead of
      only doing this when a custom open file operation has been registered.
      
      But before we can fix this in the misc subsystem itself, we need to
      modify the (few) drivers that rely on this very subtle behavior.
      
      [1] https://lkml.org/lkml/2014/12/4/939
      
      Signed-off-by: Martin Kepplinger <martink@posteo.de>
      Signed-off-by: Tom Van Braeckel <tomvanbraeckel@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: incorrect handling for fiemap_fill_next_extent return · 26e726af
      Chengyu Song authored
      
      
      fiemap_fill_next_extent returns 0 on success, -errno on error, and 1 if this
      was the last extent that will fit in the user array. If 1 is returned, that
      value may eventually be returned to user space, which should not happen
      according to the ioctl man page.
      Signed-off-by: Chengyu Song <csong84@gatech.edu>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
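The return-value rule described in this commit message can be sketched in user-space C. This is only an illustration, not the btrfs code: `fill_next_extent_stub` is a hypothetical stand-in that mimics the three documented return values of `fiemap_fill_next_extent`, and `emit_extents` is an assumed caller that stops on 1 but never passes it on.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for fiemap_fill_next_extent(): returns 0 on success,
 * -errno on error, and 1 when the extent just stored was the last one that
 * fits in the user array. */
static int fill_next_extent_stub(size_t *used, size_t capacity)
{
	if (capacity == 0)
		return -EINVAL;
	(*used)++;
	return (*used == capacity) ? 1 : 0;
}

/* Illustrative caller: stop on 1, but never hand the 1 back to the caller. */
static int emit_extents(size_t nr_extents, size_t capacity, size_t *used)
{
	int ret = 0;
	size_t i;

	*used = 0;
	for (i = 0; i < nr_extents; i++) {
		ret = fill_next_extent_stub(used, capacity);
		if (ret)
			break;	/* error (< 0) or "array full" (1) */
	}
	if (ret > 0)
		ret = 0;	/* "array full" is not an error for user space */
	return ret;
}
```

With five extents and room for two, `emit_extents` stores two extents and returns 0 rather than 1; a negative errno is passed through unchanged.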
    • btrfs: don't accept bare namespace as a valid xattr · 3c3b04d1
      David Sterba authored
      Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly
      works:
      
       $ touch file
       $ setfattr -n user. -v 1 file
       $ getfattr -d file
      user.="1"
      
      i.e. the attribute name after the namespace is missing.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291
      
      Reported-by: William Douglas <william.douglas@intel.com>
      CC: <stable@vger.kernel.org> # 2.6.29+
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
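The kind of check this commit message calls for can be sketched as follows. This is not the body of `btrfs_is_valid_xattr`: the helper name `xattr_name_is_valid` and the prefix table are assumptions made for illustration. The point is only that matching a namespace prefix is not enough, and the name must continue past the prefix.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative namespace prefixes; the list is an assumption of this sketch. */
static const char *const xattr_prefixes[] = {
	"user.", "trusted.", "security.", "system.",
};

/* Hypothetical helper: accept a name only if it starts with a known
 * namespace prefix AND has at least one character after the prefix. */
static bool xattr_name_is_valid(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(xattr_prefixes) / sizeof(xattr_prefixes[0]); i++) {
		size_t len = strlen(xattr_prefixes[i]);

		if (strncmp(name, xattr_prefixes[i], len) == 0)
			return strlen(name) > len;
	}
	return false;
}
```

Under this sketch `user.attr1` is accepted, while the bare namespace `user.` from the reproducer is rejected.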
    • Btrfs: fix log tree corruption when fs mounted with -o discard · dcc82f47
      Filipe Manana authored
      While committing a transaction we free the log roots before we write the
      new super block. Freeing the log roots implies marking the disk location
      of every node/leaf (metadata extent) as pinned before the new super block
      is written. This is to prevent the disk location of log metadata extents
      from being reused before the new super block is written, otherwise we
      would have a corrupted log tree if before the new super block is written
      a crash/reboot happens and the location of any log tree metadata extent
      ended up being reused and rewritten.
      
      Even though we pinned the log tree's metadata extents, we were issuing a
      discard against them if the fs was mounted with the -o discard option,
      resulting in corruption of the log tree if a crash/reboot happened before
      writing the new super block - the next time the fs was mounted, during
      the log replay process we would find nodes/leaves of the log btree
      full of zeroes, causing the process to fail and require the
      use of the tool btrfs-zero-log to wipe out the log tree (with all data
      previously fsynced being lost forever).
      
      Fix this by not doing a discard when pinning an extent. The discard will
      be done later when it's safe (after the new super block is committed) at
      extent-tree.c:btrfs_finish_extent_commit().
      
      Fixes: e688b725 (Btrfs: fix extent pinning bugs in the tree log)
      CC: <stable@vger.kernel.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix metadata inconsistencies after directory fsync · 2f2ff0ee
      Filipe Manana authored
      
      
      We can get into inconsistency between inodes and directory entries
      after fsyncing a directory. The issue is that while a directory gets
      the new dentries persisted in the fsync log and replayed at mount time,
      the link count of the inode that directory entries point to doesn't
      get updated, staying with an incorrect link count (smaller than the
      correct value). This later leads to stale file handle errors when
      accessing (including attempt to delete) some of the links if all the
      other ones are removed, which also implies impossibility to delete the
      parent directories, since the dentries can not be removed.
      
      Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
      when fsyncing a directory, new files aren't logged (their metadata and
      dentries) nor any child directories. So this patch fixes this issue too,
      since it has the same resolution as the incorrect inode link count issue
      mentioned before.
      
      This is very easy to reproduce, and the following excerpt from my test
      case for xfstests shows how:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our main test file and directory.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
        mkdir $SCRATCH_MNT/mydir
      
        # Make sure all metadata and data are durably persisted.
        sync
      
        # Add a hard link to 'foo' inside our test directory and fsync only the
        # directory. The btrfs fsync implementation had a bug that caused the new
        # directory entry to be visible after the fsync log replay but, the inode
        # of our file remained with a link count of 1.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2
      
        # Add a few more links and new files.
        # This is just to verify nothing breaks or gives incorrect results after the
        # fsync log is replayed.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
        $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
        ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2
      
        # Add some subdirectories and new files and links to them. This is to verify
        # that after fsyncing our top level directory 'mydir', all the subdirectories
        # and their files/links are registered in the fsync log and exist after the
        # fsync log is replayed.
        mkdir -p $SCRATCH_MNT/mydir/x/y/z
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
        touch $SCRATCH_MNT/mydir/x/y/z/qwerty
      
        # Now fsync only our top directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir
      
        # And fsync now our new file named 'hello', just to verify later that it has
        # the expected content and that the previous fsync on the directory 'mydir' had
        # no bad influence on this fsync.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
        # all with the value 0xaa.
        echo "File 'foo' content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Remove the first name of our inode. Because of the directory fsync bug, the
        # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
        # deleting the inode and the other names became stale directory entries (still
        # visible to applications). Attempting to remove or access the remaining
        # dentries pointing to that inode resulted in stale file handle errors and
        # made it impossible to remove the parent directories since it was impossible
        # for them to become empty.
        echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
        rm -f $SCRATCH_MNT/foo
      
        # Now verify that all files, links and directories created before fsyncing our
        # directory exist after the fsync log was replayed.
        [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
        [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
        [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
        [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
            echo "Link mydir/x/y/foo_y_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
            echo "Link mydir/x/y/z/foo_z_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
            echo "File mydir/x/y/z/qwerty is missing"
      
        # We expect our file here to have a size of 64Kb and all the bytes having the
        # value 0xff.
        echo "file 'hello' content after log replay:"
        od -t x1 $SCRATCH_MNT/hello
      
        # Now remove all files/links, under our test directory 'mydir', and verify we
        # can remove all the directories.
        rm -f $SCRATCH_MNT/mydir/x/y/z/*
        rmdir $SCRATCH_MNT/mydir/x/y/z
        rm -f $SCRATCH_MNT/mydir/x/y/*
        rmdir $SCRATCH_MNT/mydir/x/y
        rmdir $SCRATCH_MNT/mydir/x
        rm -f $SCRATCH_MNT/mydir/*
        rmdir $SCRATCH_MNT/mydir
      
        # An fsck, run by the fstests framework everytime a test finishes, also detected
        # the inconsistency and printed the following error message:
        #
        # root 5 inode 257 errors 2001, no inode item, link count wrong
        #    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
        #    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
        status=0
        exit
      
      The expected golden output for the test is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 5
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
      
      Which is the output after this patch and when running the test against
      ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
      output is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 1
        Link mydir/foo_2 is missing
        Link mydir/foo_3 is missing
        Link mydir/x/y/foo_y_link is missing
        Link mydir/x/y/z/foo_z_link is missing
        File mydir/x/y/z/qwerty is missing
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty
      
      Fsck, without this fix, also complains about the wrong link count:
      
        root 5 inode 257 errors 2001, no inode item, link count wrong
            unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
            unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
      So fix this by logging the inodes that the dentries point to when
      fsyncing a directory.
      
      A test case for xfstests follows.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: change the insertion criteria for the qgroup operations rbtree · bf691960
      Filipe Manana authored
      
      
      After looking at Liu Bo's recent patch (titled
      "Btrfs: fix comp_oper to get right order") I realized the search made by
      qgroup_oper_exists() was buggy because its rbtree navigation comparison
      function, comp_oper_exist(), only looks at the fields bytenr and ref_root
      of a tree node, ignoring the seq field completely. This was wrong because
      when we insert a node into the rbtree we use comp_oper(), which takes a
      decision based first on bytenr, then on seq and then on the ref_root field.
      That means qgroup_oper_exists() could miss the fact that at least one
      operation with given bytenr and ref_root exists.
      
      Consider the following simple example of a 3 nodes qgroup operations
      rbtree (created using comp_oper before this patch), where each node's key
      is a tuple with the shape (bytenr, seq, ref_root, op):
      
                                [ (4096, 2, 20, op X) ]
                               /                       \
                              /                         \
         [ (4096, 1, 5, op Y) ]                         [ (4096, 3, 10, op Z) ]
      
      qgroup_oper_exists() when called to search for an existing operation for
      bytenr 4096 and ref root 10 wouldn't find anything because it would go to
      the left subtree instead of the right subtree, since comp_oper_exist()
      ignores the seq field completely.
      
      Fix this by changing the insertion navigation function to use the ref_root
      field right after using the bytenr field and before using the seq field,
      so that qgroup_oper_exists() / comp_oper_exist() work as expected.
      
      This patch applies on top of the patch mentioned above from Liu.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
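The three-node example in this commit message can be replayed with a small user-space sketch. Several things here are assumptions: the names `cmp_insert_old`, `cmp_insert_new`, `cmp_exist`, `tree_insert` and `tree_exists` are hypothetical stand-ins (the message names `comp_oper()` and `comp_oper_exist()`), and a plain unbalanced binary search tree replaces the kernel rbtree, which is enough for three nodes inserted in the order X, Y, Z.

```c
#include <stddef.h>
#include <stdint.h>

struct oper {
	uint64_t bytenr, seq, ref_root;
	struct oper *left, *right;
};

typedef int (*cmp_fn)(const struct oper *a, const struct oper *b);

static int cmp_u64(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

/* Old insertion order: bytenr, then seq, then ref_root. */
static int cmp_insert_old(const struct oper *a, const struct oper *b)
{
	int c = cmp_u64(a->bytenr, b->bytenr);

	if (!c)
		c = cmp_u64(a->seq, b->seq);
	if (!c)
		c = cmp_u64(a->ref_root, b->ref_root);
	return c;
}

/* New insertion order: bytenr, then ref_root, then seq. */
static int cmp_insert_new(const struct oper *a, const struct oper *b)
{
	int c = cmp_u64(a->bytenr, b->bytenr);

	if (!c)
		c = cmp_u64(a->ref_root, b->ref_root);
	if (!c)
		c = cmp_u64(a->seq, b->seq);
	return c;
}

/* Lookup order: bytenr and ref_root only; seq is ignored. */
static int cmp_exist(const struct oper *a, const struct oper *b)
{
	int c = cmp_u64(a->bytenr, b->bytenr);

	if (!c)
		c = cmp_u64(a->ref_root, b->ref_root);
	return c;
}

/* Plain (unbalanced) binary search tree insert; enough to show ordering. */
static void tree_insert(struct oper **root, struct oper *node, cmp_fn cmp)
{
	while (*root)
		root = (cmp(node, *root) < 0) ? &(*root)->left : &(*root)->right;
	node->left = node->right = NULL;
	*root = node;
}

/* Search using the lookup comparator; returns 1 if a match exists. */
static int tree_exists(const struct oper *root, uint64_t bytenr,
		       uint64_t ref_root)
{
	struct oper key = { bytenr, 0, ref_root, NULL, NULL };

	while (root) {
		int c = cmp_exist(&key, root);

		if (!c)
			return 1;
		root = (c < 0) ? root->left : root->right;
	}
	return 0;
}

/* Builds the example tree (X, Y, Z) and searches for (4096, ref_root 10). */
static int example_finds_ref_root_10(cmp_fn insert_cmp)
{
	struct oper nodes[3] = {
		{ 4096, 2, 20, NULL, NULL },	/* op X */
		{ 4096, 1, 5, NULL, NULL },	/* op Y */
		{ 4096, 3, 10, NULL, NULL },	/* op Z */
	};
	struct oper *root = NULL;
	int i;

	for (i = 0; i < 3; i++)
		tree_insert(&root, &nodes[i], insert_cmp);
	return tree_exists(root, 4096, 10);
}
```

With the old insertion order the lookup walks left from X, then right from Y, and misses Z. With the new order, insertion and lookup agree on the first two keys, so Z is found.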
    • Btrfs: add missing inode item update in fallocate() · 3d850dd4
      Filipe Manana authored
      
      
      If we fallocate(), without the keep size flag, into an area already covered
      by an extent previously fallocated, we were updating the inode's i_size but
      we weren't updating the inode item in the fs/subvol tree. A following umount
      + mount would result in a loss of the inode's size (and an fsync would also
      miss the fact that the inode changed).
      
      Reproducer:
      
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdd /mnt
        $ fallocate -n -l 1M /mnt/foobar
        $ fallocate -l 512K /mnt/foobar
        $ umount /mnt
        $ mount /dev/sdd /mnt
        $ od -t x1 /mnt/foobar
        0000000
      
      The expected result is:
      
        $ od -t x1 /mnt/foobar
        0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        2000000
      
      A test case for fstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: incremental send, remove dead code · 5f806c3a
      Filipe Manana authored
      The logic to detect path loops when attempting to apply a pending
      directory rename, introduced in commit f959492f (Btrfs: send, fix more
      issues related to directory renames), is no longer needed, and the
      respective fstests test case for that commit, btrfs/045, now passes
      without this code (as well as all the other test cases for send/receive).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: incremental send, clear name from cache after orphanization · 8996a48c
      Filipe Manana authored
      
      
      If a directory's reference ends up being orphanized, because the inode
      currently being processed has a new path that matches that directory's
      path, make sure we evict the name of the directory from the name cache.
      This is because there might be descendent inodes (either directories or
      regular files) that will be orphanized later too, and therefore the
      orphan name of the ancestor must be used, otherwise we issue rename
      operations with a wrong path in the send stream.
      
      Reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir -p /mnt/data/n1/n2/p1/p2
        $ mkdir /mnt/data/n4
        $ mkdir -p /mnt/data/p1/p2
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
        $ mv /mnt/data/p1/p2 /mnt/data
        $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/p1
        $ mv /mnt/data/p2 /mnt/data/n1/n2/p1
        $ mv /mnt/data/n1/n2 /mnt/data/p1
        $ mv /mnt/data/p1 /mnt/data/n4
        $ mv /mnt/data/n4/p1/n2/p1 /mnt/data
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
        $ btrfs send /mnt/snap1 -f /tmp/1.send
        $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt2
        $ btrfs receive /mnt2 -f /tmp/1.send
        $ btrfs receive /mnt2 -f /tmp/2.send
        ERROR: rename data/p1/p2 -> data/n4/p1/p2 failed. no such file or directory
      
      Directories data/p1 (inode 263) and data/p1/p2 (inode 264) in the parent
      snapshot are both orphanized during the incremental send, and as soon as
      data/p1 is orphanized, we must make sure that when orphanizing data/p1/p2
      we use a source path of o263-6-o/p2 for the rename operation instead of
      the old path data/p1/p2 (the one before the orphanization of inode 263).
      
      A test case for xfstests follows soon.
      Reported-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: send, don't leave without decrementing clone root's send_progress · 2f1f465a
      Filipe Manana authored
      
      
      If the clone root was not readonly or the dead flag was set on it, we were
      leaving without decrementing the root's send_in_progress counter (which we
      had just incremented). If a concurrent snapshot deletion was in progress
      and ended up being aborted, it would be impossible to later attempt to
      delete the snapshot again, since the root's send_in_progress counter could
      never go back to 0.
      
      We were also setting clone_sources_to_rollback to i + 1 too early - if we
      bailed out because the clone root we got is not readonly or flagged as dead
      we ended up later dereferencing a null pointer because we didn't assign
      the clone root to sctx->clone_roots[i].root:
      
      		for (i = 0; sctx && i < clone_sources_to_rollback; i++)
      			btrfs_root_dec_send_in_progress(
      					sctx->clone_roots[i].root);
      
      So just don't increment the send_in_progress counter if the root is not
      readonly or is flagged as dead.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
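The ordering problem described in this commit message (counting a slot for rollback before it is assigned, and incrementing before checking) can be sketched in user-space C. Everything here is a simplified assumption: `struct root` and `acquire_clone_roots` are hypothetical and only mirror the idea of `send_in_progress` and `clone_sources_to_rollback`, not the kernel structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a root; not the kernel structure. */
struct root {
	bool readonly;
	bool dead;
	int send_in_progress;
};

/* Take a reference on each clone root. On failure, undo only the references
 * that were actually taken, so no counter is left elevated and no unassigned
 * slot is touched. Returns 0 on success, -1 on failure. */
static int acquire_clone_roots(struct root **sources, size_t n,
			       struct root **taken)
{
	size_t to_rollback = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		struct root *r = sources[i];

		/* Check first: a bad root is never counted. */
		if (!r->readonly || r->dead)
			goto rollback;
		r->send_in_progress++;
		taken[i] = r;
		/* Record the slot only after it has been assigned. */
		to_rollback = i + 1;
	}
	return 0;

rollback:
	for (i = 0; i < to_rollback; i++)
		taken[i]->send_in_progress--;
	return -1;
}
```

If the second of three roots is not readonly, the call fails, the first root's counter returns to 0, and the rollback loop never reads the unassigned second slot.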
    • Btrfs: send, add missing check for dead clone root · 5cc2b17e
      Filipe Manana authored
      
      
      After we locked the root's root item, a concurrent snapshot deletion
      call might have set the dead flag on it. So check if the dead flag
      is set and abort if it is, just like we do for the parent root.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: remove deleted xattrs on fsync log replay · 4f764e51
      Filipe Manana authored
      
      
      If we deleted xattrs from a file and fsynced the file, after a log replay
      the xattrs would remain associated with the file. This was unexpected
      behaviour and differs from what other filesystems, such as xfs and
      ext3/4, do.
      
      Fix this by checking, on fsync log replay, whether every xattr in the
      fs/subvol tree (that belongs to a logged inode) has a matching xattr in the
      log, and deleting it from the fs/subvol tree if it does not. This is similar
      to what we do for dentries when we replay a directory from the fsync log.
      
      This issue is trivial to reproduce, and the following excerpt from my
      test for xfstests triggers the issue:
      
        _crash_and_mount()
        {
             # Simulate a crash/power loss.
             _load_flakey_table $FLAKEY_DROP_WRITES
             _unmount_flakey
             _load_flakey_table $FLAKEY_ALLOW_WRITES
             _mount_flakey
        }
      
        rm -f $seqres.full
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file and add 3 xattrs to it.
        touch $SCRATCH_MNT/foobar
        $SETFATTR_PROG -n user.attr1 -v val1 $SCRATCH_MNT/foobar
        $SETFATTR_PROG -n user.attr2 -v val2 $SCRATCH_MNT/foobar
        $SETFATTR_PROG -n user.attr3 -v val3 $SCRATCH_MNT/foobar
      
        # Make sure everything is durably persisted.
        sync
      
        # Now delete the second xattr and fsync the inode.
        $SETFATTR_PROG -x user.attr2 $SCRATCH_MNT/foobar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
        _crash_and_mount
      
        # After the fsync log is replayed, the file should have only 2 xattrs, the ones
        # named user.attr1 and user.attr3. The btrfs fsync log replay bug left the file
        # with the 3 xattrs that we had before deleting the second one and fsyncing the
        # file.
        echo "xattr names and values after first fsync log replay:"
        $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch
      
        # Now write some data to our file, fsync it, remove the first xattr, add a new
        # hard link to our file and commit the fsync log by fsyncing some other new
        # file. This is to verify that after log replay our first xattr does not exist
        # anymore.
        echo "hello world!" >> $SCRATCH_MNT/foobar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
        $SETFATTR_PROG -x user.attr1 $SCRATCH_MNT/foobar
        ln $SCRATCH_MNT/foobar $SCRATCH_MNT/foobar_link
        touch $SCRATCH_MNT/qwerty
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/qwerty
      
        _crash_and_mount
      
        # Now only the xattr with name user.attr3 should be set in our file.
        echo "xattr names and values after second fsync log replay:"
        $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch
      
        status=0
        exit
      
      The expected golden output, which is produced with this patch applied or
      when testing against xfs or ext3/4, is:
      
        xattr names and values after first fsync log replay:
        # file: SCRATCH_MNT/foobar
        user.attr1="val1"
        user.attr3="val3"
      
        xattr names and values after second fsync log replay:
        # file: SCRATCH_MNT/foobar
        user.attr3="val3"
      
      Without this patch applied, the output is:
      
        xattr names and values after first fsync log replay:
        # file: SCRATCH_MNT/foobar
        user.attr1="val1"
        user.attr2="val2"
        user.attr3="val3"
      
        xattr names and values after second fsync log replay:
        # file: SCRATCH_MNT/foobar
        user.attr1="val1"
        user.attr2="val2"
        user.attr3="val3"
      
      A patch with a test case for xfstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      4f764e51
  4. 25 Mar, 2015 3 commits
  5. 22 Mar, 2015 7 commits
  6. 21 Mar, 2015 11 commits
    • Linus Torvalds's avatar
      Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma · f8975224
      Linus Torvalds authored
      Pull slave dmaengine fixes from Vinod Koul:
       "Four fixes for dw, pl08x, imx-sdma and at_hdmac driver.  Nothing
        unusual here, simple fixes to these drivers"
      
      * 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
        dmaengine: pl08x: Define capabilities for generic capabilities reporting
        dmaengine: dw: append MODULE_ALIAS for platform driver
        dmaengine: imx-sdma: switch to dynamic context mode after script loaded
        dmaengine: at_hdmac: Fix calculation of the residual bytes
      f8975224
    • Linus Torvalds's avatar
      Merge tag 'pm+acpi-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 3d7a6db5
      Linus Torvalds authored
      Pull power management and ACPI fixes from Rafael Wysocki:
       "These are fixes for recent regressions (PCI/ACPI resources and at91
        RTC locking), a stable-candidate powercap RAPL driver fix and two ARM
        cpuidle fixes (one stable-candidate too).
      
        Specifics:
      
         - Revert a recent PCI commit related to IRQ resources management that
           introduced a regression for drivers attempting to bind to devices
           whose previous drivers did not balance pci_enable_device() and
           pci_disable_device() as expected (Rafael J Wysocki).
      
         - Fix a deadlock in at91_rtc_interrupt() introduced by a typo in a
           recent commit related to wakeup interrupt handling (Dan Carpenter).
      
         - Allow the power capping RAPL (Running-Average Power Limit) driver
           to use different energy units for domains within one CPU package
           which is necessary to handle Intel Haswell EP processors correctly
           (Jacob Pan).
      
         - Improve the cpuidle mvebu driver's handling of Armada XP SoCs by
           updating the target residency and exit latency numbers for those
           chips (Sebastien Rannou).
      
         - Prevent the cpuidle mvebu driver from calling cpu_pm_enter() twice
           in a row before cpu_pm_exit() is called on the same CPU which
           breaks the core's assumptions regarding the usage of those
           functions (Gregory Clement)"
      
      * tag 'pm+acpi-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        Revert "x86/PCI: Refine the way to release PCI IRQ resources"
        rtc: at91rm9200: double locking bug in at91_rtc_interrupt()
        powercap / RAPL: handle domains with different energy units
        cpuidle: mvebu: Update cpuidle thresholds for Armada XP SOCs
        cpuidle: mvebu: Fix the CPU PM notifier usage
      3d7a6db5
    • Linus Torvalds's avatar
      Merge git://people.freedesktop.org/~airlied/linux · 97448d5b
      Linus Torvalds authored
      Pull drm updates from Dave Airlie:
       "A bunch of fixes across drivers:
      
        radeon:
           disable two ended allocation for now, it breaks some stuff
      
        amdkfd:
           misc fixes
      
        nouveau:
           fix irq loop problem, add basic support for GM206 (new hw)
      
        i915:
           fix some WARNs people were seeing
      
        exynos:
           fix some iommu interactions causing boot failures"
      
      * git://people.freedesktop.org/~airlied/linux:
        drm/radeon: drop ttm two ended allocation
        drm/exynos: fix the initialization order in FIMD
        drm/exynos: fix typo config name correctly.
        drm/exynos: Check for NULL dereference of crtc
        drm/exynos: IS_ERR() vs NULL bug
        drm/exynos: remove unused files
        drm/i915: Make sure the primary plane is enabled before reading out the fb state
        drm/nouveau/bios: fix i2c table parsing for dcb 4.1
        drm/nouveau/device/gm100: Basic GM206 bring up (as copy of GM204)
        drm/nouveau/device: post write to NV_PMC_BOOT_1 when flipping endian switch
        drm/nouveau/gr/gf100: fix some accidental or'ing of buffer addresses
        drm/nouveau/fifo/nv04: remove the loop from the interrupt handler
        drm/radeon: Changing number of compute pipe lines
        drm/amdkfd: Fix SDMA queue init. in non-HWS mode
        drm/amdkfd: destroy mqd when destroying kernel queue
        drm/i915: Ensure plane->state->fb stays in sync with plane->fb
      97448d5b
    • Linus Torvalds's avatar
      Merge tag 'devicetree-fixes-for-4.0-part2' of... · bb8ef2fb
      Linus Torvalds authored
      Merge tag 'devicetree-fixes-for-4.0-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
      
      Pull more DeviceTree fixes from Rob Herring:
      
       - revert setting stdout-path as preferred console.  This caused
         regressions in PowerMACs and other systems.
      
       - yet another fix for stdout-path option parsing.
      
       - fix error path handling in of_irq_parse_one
      
      * tag 'devicetree-fixes-for-4.0-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        Revert "of: Fix premature bootconsole disable with 'stdout-path'"
        of: handle both '/' and ':' in path strings
        of: unittest: Add option string test case with longer path
        of/irq: Fix of_irq_parse_one() returned error codes
      bb8ef2fb
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · e477f3e0
      Linus Torvalds authored
      Pull SCSI target fixes from Nicholas Bellinger:
       "Here are current target-pending fixes for v4.0-rc5 code that have made
        their way into the queue over the last weeks.
      
        The fixes this round include:
      
         - Fix long-standing iser-target logout bug related to early
           conn_logout_comp completion, resulting in iscsi_conn use-after-free
           OOpsen.  (Sagi + nab)
      
         - Fix long-standing tcm_fc bug in ft_invl_hw_context() failure
           handling for DDP hw offload.  (DanC)
      
         - Fix incorrect use of unprotected __transport_register_session() in
           tcm_qla2xxx + other single local se_node_acl fabrics.  (Bart)
      
         - Fix reference leak in target_submit_cmd() -> target_get_sess_cmd()
           for ack_kref=1 failure path.  (Bart)
      
         - Fix pSCSI backend ->get_device_type() statistics OOPs with
           un-configured device.  (Olaf + nab)
      
         - Fix virtual LUN=0 target_configure_device failure OOPs at modprobe
           time.  (Claudio + nab)
      
         - Fix FUA write false positive failure regression in v4.0-rc1 code.
           (Christophe Vu-Brugier + HCH)"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        target: do not reject FUA CDBs when write cache is enabled but emulate_write_cache is 0
        target: Fix virtual LUN=0 target_configure_device failure OOPs
        target/pscsi: Fix NULL pointer dereference in get_device_type
        tcm_fc: missing curly braces in ft_invl_hw_context()
        target: Fix reference leak in target_get_sess_cmd() error path
        loop/usb/vhost-scsi/xen-scsiback: Fix use of __transport_register_session
        tcm_qla2xxx: Fix incorrect use of __transport_register_session
        iscsi-target: Avoid early conn_logout_comp for iser connections
        Revert "iscsi-target: Avoid IN_LOGOUT failure case for iser-target"
        target: Disallow changing of WRITE cache/FUA attrs after export
      e477f3e0
    • Linus Torvalds's avatar
      Merge tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · da6b9a20
      Linus Torvalds authored
      Pull devicemapper fixes from Mike Snitzer:
       "A handful of stable fixes for DM:
         - fix thin target to always zero-fill reads to unprovisioned blocks
         - fix to interlock device destruction's suspend from internal
           suspends
         - fix 2 snapshot exception store handover bugs
         - fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing"
      
      * tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME
        dm snapshot: suspend merging snapshot when doing exception handover
        dm snapshot: suspend origin when doing exception handover
        dm: hold suspend_lock while suspending device during device deletion
        dm thin: fix to consistently zero-fill reads to unprovisioned blocks
      da6b9a20
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 521d4746
      Linus Torvalds authored
      Pull btrfs fixes from Chris Mason:
       "Most of these are fixing extent reservation accounting, or corners
        with tree writeback during commit.
      
        Josef's set does add a test, which isn't strictly a fix, but it'll
        keep us from making this same mistake again"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: fix outstanding_extents accounting in DIO
        Btrfs: add sanity test for outstanding_extents accounting
        Btrfs: just free dummy extent buffers
        Btrfs: account merges/splits properly
        Btrfs: prepare block group cache before writing
        Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
        Btrfs: account for the correct number of extents for delalloc reservations
        Btrfs: fix merge delalloc logic
        Btrfs: fix comp_oper to get right order
        Btrfs: catch transaction abortion after waiting for it
        btrfs: fix sizeof format specifier in btrfs_check_super_valid()
      521d4746
    • Linus Torvalds's avatar
      Merge branch 'for-4.0' of git://linux-nfs.org/~bfields/linux · 0d122f74
      Linus Torvalds authored
      Pull nfsd bugfix from Bruce Fields:
       "This is a fix for a crash easily triggered by 4.1 activity to a server
        built with CONFIG_NFSD_PNFS.
      
        There are some more bugfixes queued up that I intend to pass along
        next week, but this is the most critical"
      
      * 'for-4.0' of git://linux-nfs.org/~bfields/linux:
        Subject: nfsd: don't recursively call nfsd4_cb_layout_fail
      0d122f74
    • Linus Torvalds's avatar
      Merge tag 'upstream-4.0-rc5' of git://git.infradead.org/linux-ubifs · c6ef8145
      Linus Torvalds authored
      Pull UBI fix from Artem Bityutskiy:
       "This fixes a bug introduced during the v4.0 merge window where we
        forgot to put braces where they should be"
      
      * tag 'upstream-4.0-rc5' of git://git.infradead.org/linux-ubifs:
        UBI: fix missing brace control flow
      c6ef8145
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 60ed380e
      Linus Torvalds authored
      Pull arm64 fixes from Catalin Marinas:
      
       - mm switching fix where the kernel pgd ends up in the user TTBR0 after
         returning from an EFI run-time services call
      
       - fix __GFP_ZERO handling for atomic pool and CMA DMA allocations (the
         generic code does get the gfp flags, so it's left with the arch code
         to memzero accordingly)
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: Honor __GFP_ZERO in dma allocations
        arm64: efi: don't restore TTBR0 if active_mm points at init_mm
      60ed380e
    • Linus Torvalds's avatar
      Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm · 62a202d7
      Linus Torvalds authored
      Pull ARM fixes from Russell King:
       "Another few ARM fixes.  Fabrice fixed the L2 cache DT parsing to allow
        prefetch configuration to be specified even when the cache size
        parsing fails.
      
        Laura noticed that the setting of page attributes wasn't working for
        modules due to is_module_addr() always returning false.
      
        Marc Gonzalez (aka Mason) noticed a potential latent bug with the way
        we read one of the CPUID registers (where we could attempt to read a
        non-present CPUID register which may fault.)
      
        I've fixed an issue where 32-bit DMA masks were failing with memory
        which extended to the top of physical address space, and I've also
        added debugging output of the page tables when we hit a data access
        exception which we don't specifically handle - prompted by the lack of
        information in a bug report"
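
      To illustrate the kind of off-by-one behind the 32-bit DMA mask
      failure, here is a small arithmetic sketch in Python. It is not the
      kernel's __dma_supported() code: the helper names, the 4 KiB page size
      and the exact form of the comparison are assumptions chosen to show why
      comparing against "one past the last page frame" rejects a mask that
      actually covers all of memory.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def mask_ok_buggy(dma_mask, max_pfn):
    # Hypothetical buggy check: max_pfn is one past the last valid page
    # frame, so requiring the mask to reach it asks for one frame too many.
    return (dma_mask >> PAGE_SHIFT) >= max_pfn


def mask_ok_fixed(dma_mask, max_pfn):
    # Hypothetical fixed check: compare against the last valid page frame.
    return (dma_mask >> PAGE_SHIFT) >= max_pfn - 1


# Memory that extends to the top of a 32-bit physical address space (4 GiB).
max_pfn = (4 << 30) >> PAGE_SHIFT  # 0x100000, one past the last frame
mask32 = 0xFFFFFFFF                # 32-bit DMA mask, last frame 0xFFFFF

print(mask_ok_buggy(mask32, max_pfn))  # False
print(mask_ok_fixed(mask32, max_pfn))  # True
```

      A narrower mask, such as a 24-bit one, is still rejected by the fixed
      comparison in this model, so the correction only affects the boundary
      case.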
      
      * 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
        ARM: 8313/1: Use read_cpuid_ext() macro instead of inline asm
        ARM: 8311/1: Don't use is_module_addr in setting page attributes
        ARM: 8310/1: l2c: Fix prefetch settings dt parsing
        ARM: dump pgd, pmd and pte states on unhandled data abort faults
        ARM: dma-api: fix off-by-one error in __dma_supported()
      62a202d7
  7. 20 Mar, 2015 1 commit
    • Rafael J. Wysocki's avatar
      Merge branches 'pm-cpuidle', 'powercap', 'irq-pm' and 'acpi-resources' · 9c86286a
      Rafael J. Wysocki authored
      * pm-cpuidle:
        cpuidle: mvebu: Update cpuidle thresholds for Armada XP SOCs
        cpuidle: mvebu: Fix the CPU PM notifier usage
      
      * powercap:
        powercap / RAPL: handle domains with different energy units
      
      * irq-pm:
        rtc: at91rm9200: double locking bug in at91_rtc_interrupt()
      
      * acpi-resources:
        Revert "x86/PCI: Refine the way to release PCI IRQ resources"
      9c86286a