1. 27 May, 2010 2 commits
  2. 26 May, 2010 6 commits
  3. 25 May, 2010 18 commits
    • Btrfs: rework O_DIRECT enospc handling · 4845e44f
      Chris Mason authored
      
      
      This changes O_DIRECT write code to mark extents as delalloc
      while it is processing them.  Yan Zheng has reworked the
      enospc accounting based on tracking delalloc extents and
      this makes it much easier to track enospc in the O_DIRECT code.
      
      There are a few special cases with the O_DIRECT code though:
      it only sets the EXTENT_DELALLOC bits, instead of doing
      EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because
      we don't want to mess with clearing the dirty and uptodate
      bits when things go wrong.  This is important because there
      are no pages in the page cache, so any extent state structs
      that we put in the tree won't get freed by releasepage.  We have
      to clear them ourselves as the DIO ends.
      
      With this commit, we reserve space in btrfs_file_aio_write,
      and then as each btrfs_direct_IO call progresses it sets
      EXTENT_DELALLOC on the range.
      
      btrfs_get_blocks_direct is responsible for clearing the delalloc
      at the same time it drops the extent lock.
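
To illustrate why setting only the delalloc bit keeps cleanup simple, here is a minimal userspace sketch. It is not the Btrfs code: the `toy_` names and the bit values are assumptions made up for this example, and only the bit names mirror the message above. The point is that a record carrying just the delalloc bit is empty again once that bit is cleared, while a record marked with all three bits is not.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; only the names mirror the commit message. */
#define TOY_EXTENT_DIRTY    (1u << 0)
#define TOY_EXTENT_UPTODATE (1u << 1)
#define TOY_EXTENT_DELALLOC (1u << 2)

/* Toy stand-in for one extent state record covering a file range. */
struct toy_extent_state {
    uint32_t bits;
};

/* Buffered-style marking: all three bits are set on the range. */
static void toy_mark_buffered(struct toy_extent_state *es)
{
    es->bits |= TOY_EXTENT_DELALLOC | TOY_EXTENT_DIRTY | TOY_EXTENT_UPTODATE;
}

/* O_DIRECT-style marking: only the delalloc bit is set. */
static void toy_mark_dio(struct toy_extent_state *es)
{
    es->bits |= TOY_EXTENT_DELALLOC;
}

/* Called when the DIO ends or fails: drop the delalloc bit. */
static void toy_clear_delalloc(struct toy_extent_state *es)
{
    es->bits &= ~TOY_EXTENT_DELALLOC;
}

/* A record with no bits left can be dropped from the tree. */
static bool toy_state_is_empty(const struct toy_extent_state *es)
{
    return es->bits == 0;
}
```

Calling `toy_mark_dio()` followed by `toy_clear_delalloc()` leaves an empty record, whereas `toy_mark_buffered()` followed by the same clear still has the dirty and uptodate bits set.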
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • driver core: add devname module aliases to allow module on-demand auto-loading · 578454ff
      Kay Sievers authored
      This adds:
        alias: devname:<name>
      to some common kernel modules, which will allow the on-demand loading
      of the kernel module when the device node is accessed.
      
      Ideally all these modules would be compiled-in, but distros seem so
      much in love with their modularization that we need to cover the common
      cases with this new facility. It will allow us to remove a bunch of pretty
      useless init scripts and modprobes from init scripts.
      
      The static device node aliases will be carried in the module itself. The
      program depmod will extract this information to a file in the module directory:
        $ cat /lib/modules/2.6.34-00650-g537b60d1-dirty/modules.devname
        # Device nodes to trigger on-demand module loading.
        microcode cpu/microcode c10:184
        fuse fuse c10:229
        ppp_generic ppp c108:0
        tun net/tun c10:200
        dm_mod mapper/control c10:235
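
As a rough illustration of the file format shown above, the following sketch parses one line into its module name, device node path, type and major:minor numbers. It is not the depmod or udev code; the struct, the function name and the buffer sizes are assumptions made for this example, and it assumes the type letter is 'c' or 'b' as with mknod.

```c
#include <stdio.h>
#include <string.h>

/* One parsed entry; field sizes are arbitrary choices for this sketch. */
struct devname_entry {
    char module[64];
    char node[64];      /* path relative to /dev */
    char type;          /* 'c' or 'b' (assumed, as with mknod) */
    unsigned int major;
    unsigned int minor;
};

/* Returns 0 on success, -1 for comments, blank or malformed lines. */
static int parse_devname_line(const char *line, struct devname_entry *e)
{
    /* Skip leading indentation. */
    line += strspn(line, " \t");
    if (*line == '#' || *line == '\0' || *line == '\n')
        return -1;

    /* Expected shape: "<module> <node> <type><major>:<minor>" */
    if (sscanf(line, "%63s %63s %c%u:%u",
               e->module, e->node, &e->type, &e->major, &e->minor) != 5)
        return -1;

    if (e->type != 'c' && e->type != 'b')
        return -1;
    return 0;
}
```

Feeding it the line `tun net/tun c10:200` fills in module `tun`, node `net/tun`, type `c`, major 10 and minor 200; the leading comment line is rejected.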
      
      Udev will pick up the depmod created file on startup and create all the
      static device nodes which the kernel modules specify, so that these modules
      get automatically loaded when the device node is accessed:
        $ /sbin/udevd --debug
        ...
        static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
        static_dev_create_from_modules: mknod '/dev/fuse' c10:229
        static_dev_create_from_modules: mknod '/dev/ppp' c108:0
        static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
        static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
        udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
        udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666
      
      A few device nodes are switched to statically allocated numbers, to allow
      the static nodes to work. This might also be useful for systems which still run
      a plain static /dev, which is completely unsafe to use with any dynamic minor
      numbers.
      
      Note:
      The devname aliases must be limited to the *common* and *single-instance*
      device nodes, like the misc devices, and never be used for conceptually limited
      systems like the loop devices, which should rather get fixed properly and get a
      control node for losetup to talk to, instead of creating a random number of
      device nodes in advance, regardless if they are ever used.
      
      This facility is to hide the mess distros are creating with too modularized
      kernels, and just to hide that these modules are not compiled-in, and not to
      paper over broken concepts. Thanks! :)
      
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
      Cc: Ian Kent <raven@themaw.net>
      Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Btrfs: use async helpers for DIO write checksumming · eaf25d93
      Chris Mason authored
      
      
      The async helper threads offload crc work onto all the
      CPUs, and make streaming writes much faster.  This
      changes the O_DIRECT write code to use them.  The only
      small complication was that we need to pass in the
      logical offset in the file for each bio, because we can't
      find it in the bio's pages.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: don't walk around with task->state != TASK_RUNNING · ed3b3d31
      Chris Mason authored

      Yan Zheng noticed two places we were doing a lot of work
      without task->state set to TASK_RUNNING.  This sets the state
      properly after we get ready to sleep but decide not to.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: do aio_write instead of write · 11c65dcc
      Josef Bacik authored

      In order for AIO to work, we need to implement aio_write.  This patch converts
      our btrfs_file_write to btrfs_aio_write.  I've tested this with xfstests and
      nothing broke, and the AIO stuff magically started working.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: add basic DIO read/write support · 4b46fce2
      Josef Bacik authored
      
      
      This provides basic DIO support for reading and writing.  It does not do the
      work to recover from mismatching checksums; that will come later.  A few design
      changes have been made from Jim's code (sorry Jim!):

      1) Use the generic direct-io code.  Jim originally re-wrote all the generic DIO
      code in order to account for all of BTRFS's oddities, but thanks to that work it
      seems like the best bet is to just ignore compression and such and just opt to
      fall back on buffered IO.

      2) Fall back on buffered IO for compressed or inline extents.  Jim's code did
      its own buffering to make dio with compressed extents work.  Now we just
      fall back onto normal buffered IO.
      
      3) Use ordered extents for the writes so that all of the
      
      lock_extent()
      lookup_ordered()
      
      type checks continue to work.
      
      4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
      DIO writes.
      
      I've tested this with fsx and everything works great.  This patch depends on my
      dio and filemap.c patches to work.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Metadata ENOSPC handling for balance · 3fd0a558
      Yan, Zheng authored

      This patch adds metadata ENOSPC handling for the balance code.
      It consists of the following major changes:

      1. Avoid COWing tree leaves in the tree-merging phase.

      2. Handle interaction with snapshot creation.

      3. Make the backref cache able to live across transactions.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Pre-allocate space for data relocation · efa56464
      Yan, Zheng authored

      Pre-allocate space for data relocation. This can detect an ENOSPC
      condition caused by fragmentation of free space.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Metadata ENOSPC handling for tree log · 4a500fd1
      Yan, Zheng authored

      Previous patches make the allocator return -ENOSPC if there is no
      unreserved free metadata space. This patch updates the tree log code
      and various other places to propagate/handle the ENOSPC error.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Metadata reservation for orphan inodes · d68fc57b
      Yan, Zheng authored

      Reserve metadata space for handling orphan inodes.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Introduce global metadata reservation · 8929ecfa
      Yan, Zheng authored

      Reserve metadata space for extent tree, checksum tree and root tree
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Update metadata reservation for delayed allocation · 0ca1f7ce
      Yan, Zheng authored

      Introduce a metadata reservation context for delayed allocation
      and update various related functions.

      This patch also introduces the EXTENT_FIRST_DELALLOC control bit for
      set/clear_extent_bit. It tells set/clear_bit_hook whether they
      are processing the first extent_state with the EXTENT_DELALLOC bit
      set. This change is important if set/clear_extent_bit involves
      multiple extent_states.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Integrate metadata reservation with start_transaction · a22285a6
      Yan, Zheng authored

      Besides simplifying the code, this change makes sure all metadata
      reservations for normal metadata operations are released after
      committing the transaction.

      Changes since V1:

      Add code that checks whether unlink and rmdir will free space.

      Add ENOSPC handling for the clone ioctl.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Introduce contexts for metadata reservation · f0486c68
      Yan, Zheng authored

      Introducing metadata reservation contexts has two major advantages.
      First, it makes metadata reservation more traceable. Second, a context
      can reclaim freed space and re-add it to itself after the transaction
      is committed.

      Besides adding the btrfs_block_rsv structure and related helper functions,
      this patch contains the following changes:

      Move code that decides if a freed tree block should be pinned into
      btrfs_free_tree_block().

      Make space accounting more accurate, mainly for handling read only
      block groups.
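
The idea of a reservation context that refills itself from freed space can be sketched in userspace as follows. This is a toy model, not the btrfs_block_rsv implementation: every `toy_` name, the two-field layout and the refill policy are assumptions chosen only to illustrate the two advantages named above.

```c
#include <errno.h>
#include <stdint.h>

/* Toy pool of unreserved metadata bytes. */
struct toy_space_pool {
    uint64_t free_bytes;
};

/* Toy reservation context: 'size' is the target, 'reserved' what it holds. */
struct toy_block_rsv {
    uint64_t size;
    uint64_t reserved;
};

/* Grow the context by 'bytes', taking them from the pool or failing. */
static int toy_rsv_add(struct toy_space_pool *pool,
                       struct toy_block_rsv *rsv, uint64_t bytes)
{
    if (pool->free_bytes < bytes)
        return -ENOSPC;
    pool->free_bytes -= bytes;
    rsv->size += bytes;
    rsv->reserved += bytes;
    return 0;
}

/* Consume reserved bytes for an allocation made under this context. */
static int toy_rsv_use(struct toy_block_rsv *rsv, uint64_t bytes)
{
    if (rsv->reserved < bytes)
        return -ENOSPC;
    rsv->reserved -= bytes;
    return 0;
}

/* After a commit, freed bytes first refill the context; the rest
 * goes back to the shared pool. */
static void toy_rsv_reclaim(struct toy_space_pool *pool,
                            struct toy_block_rsv *rsv, uint64_t freed)
{
    uint64_t want = rsv->size - rsv->reserved;
    uint64_t take = freed < want ? freed : want;

    rsv->reserved += take;
    pool->free_bytes += freed - take;
}
```

Because every reservation goes through one context, its `size` and `reserved` fields show at any time how much was promised and how much is left, which is the traceability the message refers to.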
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Kill init_btrfs_i() · 2ead6ae7
      Yan, Zheng authored

      All code in init_btrfs_i can be moved into btrfs_alloc_inode()
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Shrink delay allocated space in a synchronized · 5da9d01b
      Yan, Zheng authored

      Shrinking delayed allocation space in a synchronized manner is more
      controllable than flushing all delay-allocated space in an async
      thread.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Kill allocate_wait in space_info · 424499db
      Yan, Zheng authored

      We already have fs_info->chunk_mutex to avoid concurrent
      chunk creation.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Link block groups of different raid types · b742bb82
      Yan, Zheng authored

      The size of reserved space is stored in space_info. If block groups
      of different raid types are linked to separate space_infos, changing
      the allocation profile will corrupt reserved space accounting.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 21 May, 2010 2 commits
  5. 15 May, 2010 1 commit
  6. 28 Apr, 2010 1 commit
  7. 26 Apr, 2010 1 commit
  8. 06 Apr, 2010 1 commit
  9. 05 Apr, 2010 5 commits
    • Btrfs: fix data enospc check overflow · ab6e2410
      Josef Bacik authored

      Because we account for reserved space we get from the allocator before we
      actually account for allocating delalloc space, we can have a small window where
      the amount of "used" space in a space_info is more than the total amount of
      space in the space_info.  This will cause an overflow in our check, so it will
      seem like we have _tons_ of free space, and we'll allow reservations to occur
      that will end up larger than the amount of space we have.  I've seen users
      report ENOSPC panics in cow_file_range a few times recently, so I tried to
      reproduce this problem and found I could reproduce it if I ran one of my tests
      in a loop for about 20 minutes.  With this patch my test ran all night without
      issues.  Thanks,
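
The failure mode described above is the usual unsigned wraparound: subtracting a larger "used" value from a smaller "total" yields a huge number rather than a negative one. The sketch below reproduces it in userspace. It is not the actual patch; the struct and function names are hypothetical, and the guarded variant is just one plausible way to avoid the wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the counters kept in a space_info. */
struct toy_space_info {
    uint64_t total_bytes;
    uint64_t bytes_used;    /* may briefly exceed total_bytes */
};

/* Unguarded check: total - used wraps around when used > total,
 * so the result looks like an enormous amount of free space. */
static bool toy_can_reserve_unguarded(const struct toy_space_info *s,
                                      uint64_t bytes)
{
    return s->total_bytes - s->bytes_used >= bytes;
}

/* Guarded check: treat "used >= total" as no free space at all. */
static bool toy_can_reserve_guarded(const struct toy_space_info *s,
                                    uint64_t bytes)
{
    if (s->bytes_used >= s->total_bytes)
        return false;
    return s->total_bytes - s->bytes_used >= bytes;
}
```

With `total_bytes = 100` and `bytes_used = 104`, the unguarded check accepts a 1000-byte reservation while the guarded one rejects it.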
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: add check for changed leaves in setup_leaf_for_split · 109f6aef
      Chris Mason authored
      
      
      setup_leaf_for_split needs to drop the path and search again, and has
      checks to see if the item we want to split changed size.  But, it misses
      the case where the leaf changed and now has enough room for the item
      we want to insert.
      
      This adds an extra check to make sure the leaf really needs splitting
      before we call btrfs_split_leaf(), which keeps us from trying to split
      a leaf with a single item.
      
      btrfs_split_leaf() will blindly split the single item leaf, leaving us
      with one good leaf and one empty leaf and then a crash.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: create snapshot references in same commit as snapshot · 6bdb72de
      Sage Weil authored
      
      
      This creates the reference to a new snapshot in the same commit as the
      snapshot itself.  This avoids the need for a second commit in order for a
      snapshot to be persistent, and also avoids the problem of "leaking" a
      new snapshot tree root if the host crashes before the second commit takes
      place.
      
      It is not at all clear to me why it wasn't always done this way.  If there
      is still a reason for the two-stage {create,finish}_pending_snapshots()
      approach I'm missing something!  :)
      
      I've been running this for a couple weeks under pretty heavy usage (a few
      snapshots per minute) without obvious problems.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix small race with delalloc flushing waitqueue's · b5cb1600
      Josef Bacik authored

      Every time we start a new flushing thread, we init the waitqueue if there isn't a
      flushing thread running.  The problem with this is we check
      space_info->flushing, which we clear right before doing a wake_up on the
      flushing waitqueue, which causes problems if we init the waitqueue in the middle
      of clearing the flushing flag and calling wake_up.  This is hard to hit, but
      the code is wrong anyway, so init the flushing/allocating waitqueue when
      creating the space info and let it be.  I haven't seen the panic since I've been
      using this patch.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: use add_to_page_cache_lru, use __page_cache_alloc · 28ecb609
      Nick Piggin authored

      Pagecache pages should be allocated with __page_cache_alloc, so they
      obey pagecache memory policies.

      add_to_page_cache_lru is exported, so it should be used. Benefits over
      using a private pagevec: neater code, 128 fewer bytes of stack used, percpu
      lru ordering is preserved, and finally we don't need to flush the pagevec
      before returning, so batching may be shared with other LRU insertions.

      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  10. 30 Mar, 2010 3 commits
    • Btrfs: fix chunk allocate size calculation · 0cad8a11
      Josef Bacik authored

      If the amount of free space left in a device is less than what we think should
      be the minimum size, just ignore the minimum size and use the amount we have.  I
      ran into this running tests on a 600MB volume: the chunk allocator wouldn't let
      me allocate the last 52MB of the disk for data because we want to have at least
      64MB chunks for data.  This patch fixes that problem.  Thanks,
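
The sizing rule can be sketched as below. This is not the Btrfs chunk allocator; both function names and the exact clamping order are assumptions for illustration. The strict variant models the behaviour described as the problem, the relaxed one models the behaviour described as the fix.

```c
#include <stdint.h>

#define TOY_MIB (1024ULL * 1024ULL)

/* Strict variant: refuse to allocate anything below the minimum size. */
static uint64_t toy_chunk_size_strict(uint64_t wanted, uint64_t min_size,
                                      uint64_t dev_free)
{
    uint64_t size = wanted < min_size ? min_size : wanted;

    if (size > dev_free)
        size = dev_free;
    /* 0 means "no chunk can be allocated". */
    return size < min_size ? 0 : size;
}

/* Relaxed variant: if the device has less than the minimum left,
 * ignore the minimum and use whatever is there. */
static uint64_t toy_chunk_size_relaxed(uint64_t wanted, uint64_t min_size,
                                       uint64_t dev_free)
{
    uint64_t size = wanted < min_size ? min_size : wanted;

    if (size > dev_free)
        size = dev_free;
    return size;
}
```

With 52 MiB free and a 64 MiB minimum, the strict variant returns 0 while the relaxed variant returns the remaining 52 MiB.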
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: kill max_extent mount option · 287a0ab9
      Josef Bacik authored

      As Yan pointed out, there's not much reason for all this complicated math to
      account for file extents being split up into max_extent chunks, since they are
      likely to all end up in the same leaf anyway.  Since there isn't much reason to
      use max_extent, just remove the option altogether so we have one less thing we
      need to test.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fail to mount if we have problems reading the block groups · 1b1d1f66
      Josef Bacik authored

      We don't actually check the return value of btrfs_read_block_groups, so we can
      possibly succeed in mounting, but then fail to, say, read the superblock xattr for
      selinux, which will cause the vfs code to deactivate the super.

      This is a problem because in find_free_extent we just assume that we
      will find the right space_info for the allocation we want.  But if we
      failed to read the block groups, we won't have set up any space_infos,
      and we'll hit a NULL pointer deref in find_free_extent.

      This patch fixes that problem by checking the return value of
      btrfs_read_block_groups, and failing out properly.  I've also added a
      check in find_free_extent so if for some reason we don't find an
      appropriate space_info, we just return -ENOSPC.
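
Both halves of the fix follow a common pattern: propagate the error at mount time, and fail the allocation cleanly when the expected structure is missing. The userspace sketch below shows that pattern only. It is not the Btrfs code; every `toy_` name, the `fail` switch and the use of -EIO as the read error are assumptions made for this example.

```c
#include <errno.h>
#include <stddef.h>

struct toy_space_info {
    unsigned int flags;     /* hypothetical allocation type */
};

struct toy_fs_info {
    struct toy_space_info *space_info;  /* NULL until block groups are read */
};

static struct toy_space_info toy_data_space = { 1u };

/* Stand-in for reading the block groups; 'fail' simulates an I/O error. */
static int toy_read_block_groups(struct toy_fs_info *fs, int fail)
{
    if (fail)
        return -EIO;
    fs->space_info = &toy_data_space;
    return 0;
}

/* Mount path: propagate the error instead of ignoring it. */
static int toy_mount(struct toy_fs_info *fs, int fail)
{
    int ret = toy_read_block_groups(fs, fail);

    if (ret)
        return ret;
    return 0;
}

/* Allocation path: a missing space_info means no space, not a crash. */
static int toy_find_free_extent(const struct toy_fs_info *fs,
                                unsigned int flags)
{
    const struct toy_space_info *si = fs->space_info;

    if (si == NULL || si->flags != flags)
        return -ENOSPC;
    return 0;
}
```

Either check alone avoids the NULL dereference in this toy model; having both means a failed read stops the mount early and a later lookup miss still degrades to an ordinary error.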
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>