1. 23 Jul, 2012 5 commits
    • Btrfs: improve multi-thread buffer read · 67c9684f
      Liu Bo authored
      
      
      While testing with my buffer read fio jobs[1], I find that btrfs does not
      perform well enough.
      
      Here is a scenario in fio jobs:
      
      We have 4 threads, "t1 t2 t3 t4", starting to buffer read the same file.
      All of them race on add_to_page_cache_lru(), and whichever thread
      successfully puts its page into the page cache takes responsibility
      for reading that page's data.
      
      What's more, reading a page takes a while to finish, during which
      other threads can slide in and process the remaining pages:
      
           t1          t2          t3          t4
         add Page1
         read Page1  add Page2
           |         read Page2  add Page3
           |            |        read Page3  add Page4
           |            |           |        read Page4
      -----|------------|-----------|-----------|--------
           v            v           v           v
          bio          bio         bio         bio
      
      Now we have four bios, each of which holds only one page since we need to
      maintain consecutive pages in bio.  Thus, we can end up with far more bios
      than we need.
      
      Here we're going to
      a) delay the real read-page section and
      b) try to put more pages into page cache.
      
      With that said, we can make each bio hold more pages and reduce the number
      of bios we need.
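
The effect of batching can be modelled in a few lines of C. This is only a sketch under stated assumptions: count_bios() and its batch parameter are hypothetical names, and the model keeps just one rule from the text above, namely that a bio can only hold consecutive pages.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model, not btrfs code: a "bio" can only grow while the
 * next page index directly follows the previous one.  Pages are handed
 * to the read path in groups of `batch`; returns how many bios result.
 */
size_t count_bios(const unsigned long *pages, size_t n, size_t batch)
{
    size_t bios = 0;

    assert(batch > 0);
    for (size_t start = 0; start < n; start += batch) {
        size_t end = (n - start < batch) ? n : start + batch;

        bios++;                         /* first page of a group opens a bio */
        for (size_t i = start + 1; i < end; i++)
            if (pages[i] != pages[i - 1] + 1)
                bios++;                 /* gap: a new bio is needed */
    }
    return bios;
}
```

With pages 1..4 and batch = 1 (each thread reads its page as soon as it is added) the model needs four bios; with batch = 4 (the read is delayed until more pages are in the page cache) it needs one.
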
      
      Here are some numbers taken from fio results:
               w/o patch      change      w/ patch
             -------------  --------  ---------------
      READ:    745MB/s        +25%       934MB/s
      
      [1]:
      [global]
      group_reporting
      thread
      numjobs=4
      bs=32k
      rw=read
      ioengine=sync
      directory=/mnt/btrfs/
      
      [READ]
      filename=foobar
      size=2000M
      invalidate=1
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: lock the transition from dirty to writeback for an eb · 51561ffe
      Josef Bacik authored
      
      
      There is a small window where an eb can have no IO bits set on it, which
      could potentially result in extent_buffer_under_io() returning false when we
      want it to return true, which could result in not fun things happening.  So
      in order to protect this case we need to hold the refs_lock when we make
      this transition to make sure we get reliable results out of
      extent_buffer_under_io().  Thanks,
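
A minimal userspace sketch of the idea, assuming a simplified model: struct eb_model, the two flag bits and the spin helpers are hypothetical stand-ins, and the reader taking the same lock is an assumption of this sketch, not something the message states.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits standing in for the eb's IO state bits. */
enum { EB_DIRTY = 1u << 0, EB_WRITEBACK = 1u << 1 };

struct eb_model {
    atomic_flag refs_lock;      /* stand-in for the kernel spinlock */
    unsigned int bflags;
};

void eb_init(struct eb_model *eb, unsigned int flags)
{
    atomic_flag_clear(&eb->refs_lock);
    eb->bflags = flags;
}

static void eb_lock(struct eb_model *eb)
{
    while (atomic_flag_test_and_set_explicit(&eb->refs_lock,
                                             memory_order_acquire))
        ;                       /* spin until the lock is free */
}

static void eb_unlock(struct eb_model *eb)
{
    atomic_flag_clear_explicit(&eb->refs_lock, memory_order_release);
}

/* Reader: sees either the old or the new state, never the gap. */
bool eb_under_io(struct eb_model *eb)
{
    eb_lock(eb);
    bool busy = (eb->bflags & (EB_DIRTY | EB_WRITEBACK)) != 0;
    eb_unlock(eb);
    return busy;
}

/* Writer: the dirty -> writeback transition happens under the lock. */
void eb_start_writeback(struct eb_model *eb)
{
    eb_lock(eb);
    assert(eb->bflags & EB_DIRTY);  /* only dirty ebs are written back */
    eb->bflags &= ~EB_DIRTY;        /* no IO bit is set at this instant */
    eb->bflags |= EB_WRITEBACK;     /* gap closed before the unlock */
    eb_unlock(eb);
}
```
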
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix potential race in extent buffer freeing · 594831c4
      Josef Bacik authored
      
      
      This sounds sort of impossible but it is the only thing I can think of and
      at the very least it is theoretically possible so here it goes.
      
      If we are in try_release_extent_buffer we will check that the ref count on
      the extent buffer is 1 and not under IO, and then go down and clear the tree
      ref.  If between this check and clearing the tree ref somebody else comes in
      and grabs a ref on the eb and then marks it dirty before
      try_release_extent_buffer() does its tree ref clear we can end up with a
      dirty eb that will be freed while it is still dirty which will result in a
      panic.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: don't return true in releasepage unless we actually freed the eb · e64860aa
      Josef Bacik authored
      
      
      I noticed while looking at an extent_buffer race that we will
      unconditionally return 1 if we get down to release_extent_buffer after
      clearing the tree ref.  However we can easily race in here and get a ref on
      the eb and not actually free the eb.  So make release_extent_buffer return 1
      if it free'd the eb and 0 if not so we can be a little kinder to the vm.
      Thanks,
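
The return-value contract can be illustrated with a plain reference counter. This is a sketch only: struct eb_model and eb_release() are hypothetical, and the real function does more than drop a count.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of an eb that is freed when refs reaches 0. */
struct eb_model {
    atomic_int refs;
};

/*
 * Drop one reference.  Returns 1 only if this call dropped the last
 * reference (the point where the eb would be freed), 0 if somebody
 * else still holds a ref and the eb stays alive.
 */
int eb_release(struct eb_model *eb)
{
    int old = atomic_fetch_sub(&eb->refs, 1);

    assert(old > 0);            /* dropping a ref nobody holds is a bug */
    return old == 1;
}
```

If another task took a ref in the meantime, the caller gets 0 and the vm is told the truth: the page was not released.
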
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • btrfs read error corrected message floods the console during recovery · d5b025d5
      Anand Jain authored
      
      
      Changing printk_in_rcu to printk_ratelimited_in_rcu will suffice.
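
For readers unfamiliar with rate limiting, a generic window-and-burst limiter looks roughly like this. The message does not describe how printk_ratelimited_in_rcu works internally, so struct ratelimit and ratelimit_allow() are hypothetical and only show the general technique.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limiter: at most `burst` messages per `interval` ticks. */
struct ratelimit {
    long interval;      /* window length, in caller-defined ticks */
    int burst;          /* messages allowed per window */
    long begin;         /* start of the current window */
    int printed;        /* messages let through in this window */
};

/* Returns true if a message may be printed at time `now`. */
bool ratelimit_allow(struct ratelimit *rl, long now)
{
    assert(rl->interval > 0 && rl->burst > 0);

    if (now - rl->begin >= rl->interval) {
        rl->begin = now;        /* window expired: start a new one */
        rl->printed = 0;
    }
    if (rl->printed >= rl->burst)
        return false;           /* budget used up: drop the message */
    rl->printed++;
    return true;
}
```
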
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  2. 02 Jul, 2012 1 commit
    • Btrfs: hold a ref on the inode during writepages · 7fd1a3f7
      Josef Bacik authored
      
      
      We can race with unlink and not actually be able to do our igrab in
      btrfs_add_ordered_extent.  This will result in all sorts of problems.
      Instead of doing the complicated work to try and handle returning an error
      properly from btrfs_add_ordered_extent, just hold a ref to the inode during
      writepages.  If we cannot grab a ref we know we're freeing this inode anyway
      and can just drop the dirty pages on the floor, because screw them we're
      going to invalidate them anyway.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  3. 14 Jun, 2012 1 commit
    • Btrfs: use rcu to protect device->name · 606686ee
      Josef Bacik authored
      
      
      Al pointed out that we can just toss out the old name on a device and add a
      new one arbitrarily, so anybody who uses device->name in printk could
      possibly use free'd memory.  Instead of adding locking around all of this he
      suggested doing it with RCU, so I've introduced a struct rcu_string that
      does just that and have gone through and protected all accesses to
      device->name that aren't under the uuid_mutex with rcu_read_lock().  This
      protects us and I will use it for dealing with removing the device that we
      used to mount the file system in a later patch.  Thanks,
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <josef@redhat.com>
  4. 30 May, 2012 4 commits
    • Btrfs: add device counters for detected IO and checksum errors · 442a4f63
      Stefan Behrens authored
      
      
      The goal is to detect when drives start to show an increased error rate,
      i.e. when drives should be replaced soon. Therefore statistic counters are
      added that count IO errors (read, write and flush). Additionally,
      software-detected errors such as checksum errors and corrupted blocks are
      counted.
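
A per-device counter set of this kind might be laid out as below. This is a sketch, assuming one counter per error class named in the text; the enum values, struct dev_model and the helper names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical error classes, one per kind of error named above. */
enum dev_stat {
    DEV_STAT_READ_ERRS,
    DEV_STAT_WRITE_ERRS,
    DEV_STAT_FLUSH_ERRS,
    DEV_STAT_CHECKSUM_ERRS,
    DEV_STAT_CORRUPTION_ERRS,
    DEV_STAT_MAX
};

struct dev_model {
    atomic_uint stats[DEV_STAT_MAX];
};

void dev_stats_init(struct dev_model *dev)
{
    for (int i = 0; i < DEV_STAT_MAX; i++)
        atomic_init(&dev->stats[i], 0);
}

/* Called from an error path; atomic so concurrent IO can count safely. */
void dev_stat_inc(struct dev_model *dev, enum dev_stat which)
{
    assert((int)which >= 0 && (int)which < DEV_STAT_MAX);
    atomic_fetch_add(&dev->stats[which], 1);
}

unsigned int dev_stat_read(struct dev_model *dev, enum dev_stat which)
{
    assert((int)which >= 0 && (int)which < DEV_STAT_MAX);
    return atomic_load(&dev->stats[which]);
}
```
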
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
    • Btrfs: use fastpath in extent state ops as much as possible · d1ac6e41
      Liu Bo authored
      
      
      Fully utilize our extent state's new helper functions to use
      fastpath as much as possible.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@redhat.com>
    • Btrfs: finish ordered extents in their own thread · 5fd02043
      Josef Bacik authored
      
      
      We noticed that the ordered extent completion doesn't really rely on having
      a page and that it could be done independently of ending the writeback on a
      page.  This patch makes us not do the threaded endio stuff for normal
      buffered writes and direct writes so we can end page writeback as soon as
      possible (in irq context) and only start threads to do the ordered work when
      it is actually done.  Compression needs to be reworked some to take
      advantage of this as well, but atm it has to do a find_get_page in its endio
      handler so it must be done in its own thread.  This makes direct writes
      quite a bit faster.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: fix compile warnings in extent_io.c · d7dbe9e7
      Josef Bacik authored
      
      
      These warnings are bogus since we will always have at least one page in an
      eb, but to make the compiler happy just set ret = 0 in these two cases.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  5. 26 May, 2012 1 commit
  6. 11 May, 2012 4 commits
  7. 04 May, 2012 1 commit
    • Btrfs: fix page leak when allocing extent buffers · 17de39ac
      Josef Bacik authored
      
      
      If we happen to alloc an extent buffer and then alloc a page and notice that
      page is already attached to an extent buffer, we will only unlock it and
      free our existing eb.  Any pages currently attached to that eb will be
      properly freed, but we don't do the page_cache_release() on the page where
      we noticed the other extent buffer which can cause us to leak pages and I
      hope cause the weird issues we've been seeing in this area.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  8. 18 Apr, 2012 3 commits
    • Btrfs: always store the mirror we read the eb from · 5cf1ab56
      Josef Bacik authored
      
      
      A user reported a panic where we were trying to fix a bad mirror but the
      mirror number we were giving was 0, which is invalid.  This is because we
      don't do the transid verification until after the read, so as far as the
      read code is concerned the read was a success.  So instead store the mirror
      we read from so that if there is some failure post read we know which mirror
      to try next and which mirror needs to be fixed if we find a good copy of the
      block.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: avoid possible use-after-free in clear_extent_bit() · cdc6a395
      Li Zefan authored
      
      
      clear_extent_bit()
      {
          next_node = rb_next(&state->rb_node);
          ...
          clear_state_bit(state);  <-- this may free next_node
          if (next_node) {
              state = rb_entry(next_node);
              ...
          }
      }
      
      clear_state_bit() calls merge_state() which may free the next node
      of the passed-in extent_state, so clear_extent_bit() may end up
      referencing freed memory.
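
The hazard and one way around it can be reproduced with a small linked list. This is a sketch under assumptions: the list stands in for the rb-tree, clear_and_merge() is a hypothetical stand-in for clear_state_bit() plus merge_state(), and re-reading the successor after the call is just one possible fix, not necessarily the one the patch uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an extent_state chained to its successor. */
struct state {
    unsigned int bits;
    struct state *next;
};

/* Allocate a node in front of `next`; returns NULL if malloc fails. */
struct state *state_new(unsigned int bits, struct state *next)
{
    struct state *s = malloc(sizeof(*s));

    if (s) {
        s->bits = bits;
        s->next = next;
    }
    return s;
}

/* Clear `bits` on s, then merge with successors that now carry the
 * same bits.  Like the real helper, this may free s->next. */
static void clear_and_merge(struct state *s, unsigned int bits)
{
    assert(s != NULL);
    s->bits &= ~bits;
    while (s->next && s->next->bits == s->bits) {
        struct state *dead = s->next;

        s->next = dead->next;
        free(dead);
    }
}

/* Safe walk: look up the successor only AFTER the call that may free
 * it, instead of caching it beforehand. */
void clear_range(struct state *head, unsigned int bits)
{
    struct state *s = head;

    while (s) {
        clear_and_merge(s, bits);
        s = s->next;            /* re-read: the old successor may be gone */
    }
}

size_t list_len(const struct state *s)
{
    size_t n = 0;

    for (; s; s = s->next)
        n++;
    return n;
}

void list_free(struct state *s)
{
    while (s) {
        struct state *next = s->next;

        free(s);
        s = next;
    }
}
```
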
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    • Btrfs: return void from clear_state_bit · 8e52acf7
      Li Zefan authored
      
      
      Currently it returns a set of bits that were cleared, but this return
      value is not used at all.
      
      Moreover it doesn't seem to be useful, because we may clear the bits
      of a few extent_states, but only the cleared bits of the last one are
      returned.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
  9. 12 Apr, 2012 2 commits
  10. 26 Mar, 2012 8 commits
    • Btrfs: deal with read errors on extent buffers differently · ea466794
      Josef Bacik authored
      
      
      Since we need to read and write extent buffers in their entirety we can't use
      the normal bio_readpage_error stuff since it only works on a per page basis.  So
      instead make it so that if we see an io error in endio we just mark the eb as
      having an IO error and then in btree_read_extent_buffer_pages we will manually
      try other mirrors and then overwrite the bad mirror if we find a good copy.
      This works with larger than page size blocks.  Thanks,
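
The retry-and-repair loop can be sketched as follows. Everything here is a simplified model: struct block_model, MAX_MIRRORS and read_with_repair() are hypothetical, a "read" is reduced to checking a per-mirror flag, and mirrors are numbered from 1.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_MIRRORS 4           /* hypothetical upper bound for the model */

struct block_model {
    int num_copies;
    bool good[MAX_MIRRORS + 1]; /* index 0 unused: mirrors start at 1 */
};

/*
 * Try each mirror in turn.  On the first good copy, overwrite the
 * mirrors that failed before it and return the mirror that was read.
 * Returns 0 (used here as "no valid mirror") if every copy is bad.
 */
int read_with_repair(struct block_model *blk)
{
    assert(blk->num_copies >= 1 && blk->num_copies <= MAX_MIRRORS);

    for (int mirror = 1; mirror <= blk->num_copies; mirror++) {
        if (!blk->good[mirror])
            continue;               /* eb marked as having an IO error */
        for (int bad = 1; bad < mirror; bad++)
            blk->good[bad] = true;  /* rewrite bad mirror from good copy */
        return mirror;
    }
    return 0;
}
```
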
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: loop waiting on writeback · a098d8e8
      Chris Mason authored
      
      
      lock_extent_buffer_for_io needs to loop around and make sure the
      writeback bits are not set.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: ensure an entire eb is written at once · 0b32f4bb
      Josef Bacik authored
      
      
      This patch simplifies how we track our extent buffers.  Previously we could exit
      writepages with only having written half of an extent buffer, which meant we had
      to track the state of the pages and the state of the extent buffers differently.
      Now we only read in entire extent buffers and write out entire extent buffers.
      This allows us to simply set bits in our bflags to indicate the state of the eb
      and we no longer have to do things like track uptodate with our iotree.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: introduce mark_extent_buffer_accessed · 5df4235e
      Josef Bacik authored
      
      
      Because an eb can have multiple pages we need to make sure that all pages within
      the eb are marked as accessed, since releasepage can be called against any page
      in the eb.  This will keep us from possibly evicting hot eb's when we're doing
      larger than pagesize eb's.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: introduce free_extent_buffer_stale · 3083ee2e
      Josef Bacik authored
      
      
      Because btrfs cow's we can end up with extent buffers that are no longer
      necessary just sitting around in memory.  So instead of evicting these pages, we
      could end up evicting things we actually care about.  Thus we have
      free_extent_buffer_stale for use when we are freeing tree blocks.  This will
      make it so that the ref for the eb being in the radix tree is dropped as soon as
      possible and then is freed when the refcount hits 0 instead of waiting to be
      released by releasepage.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: only use the existing eb if its count isn't 0 · 115391d2
      Josef Bacik authored
      
      
      We can run into a problem where we find an eb for our existing page already on
      the radix tree but it has a ref count of 0.  It hasn't been removed by RCU
      yet so this can cause issues where we will use the EB after free.  So do
      atomic_inc_not_zero on the exists->refs and if it is zero just do
      synchronize_rcu() and try again.  We won't have to worry about new allocators
      coming in since they will block on the page lock at this point.  Thanks,
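
The semantics of atomic_inc_not_zero can be shown with C11 atomics. This is a userspace sketch, not the kernel implementation; ref_inc_not_zero() is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the object is still alive.  Returns false
 * when the count is already 0, i.e. the object is on its way to being
 * freed and must not be used.
 */
bool ref_inc_not_zero(atomic_int *refs)
{
    int old = atomic_load(refs);

    while (old != 0) {
        assert(old > 0);
        /* On failure the CAS reloads `old`, so the loop re-checks it. */
        if (atomic_compare_exchange_weak(refs, &old, old + 1))
            return true;
    }
    return false;
}
```

A caller that gets false would, as the message describes, wait for the stale entry to go away and then retry the lookup.
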
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: set page->private to the eb · 4f2de97a
      Josef Bacik authored
      
      
      We spend a lot of time looking up extent buffers from pages when we could just
      store the pointer to the eb the page is associated with in page->private.  This
      patch does just that, and it makes things a little simpler and reduces a bit of
      CPU overhead involved with doing metadata IO.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: allow metadata blocks larger than the page size · 727011e0
      Chris Mason authored
      
      
      A few years ago the btrfs code to support blocks larger than
      the page size was disabled to fix a few corner cases in the
      page cache handling.  This fixes the code to properly support
      large metadata blocks again.
      
      Since current kernels will crash early and often with larger
      metadata blocks, this adds an incompat bit so that older kernels
      can't mount it.
      
      This also does away with different blocksizes for nodes and leaves.
      You get a single block size for all tree blocks.
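
How an incompat bit keeps older kernels away can be sketched like this. The flag name, its bit position and both helpers are hypothetical; only the idea (refuse to mount when the filesystem carries a bit the kernel does not know) comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag; the bit position is arbitrary in this sketch. */
#define INCOMPAT_LARGE_BLOCKS (UINT64_C(1) << 0)

/* Set the flag once the filesystem uses blocks larger than a page. */
uint64_t mark_large_blocks(uint64_t fs_incompat, uint32_t blocksize,
                           uint32_t pagesize)
{
    assert(pagesize > 0);
    if (blocksize > pagesize)
        fs_incompat |= INCOMPAT_LARGE_BLOCKS;
    return fs_incompat;
}

/* A kernel may mount only if it understands every incompat bit set. */
bool can_mount(uint64_t fs_incompat, uint64_t supported)
{
    return (fs_incompat & ~supported) == 0;
}
```
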
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  11. 22 Mar, 2012 1 commit
  12. 21 Mar, 2012 7 commits
  13. 20 Mar, 2012 1 commit
  14. 23 Feb, 2012 1 commit
    • Btrfs: clear the extent uptodate bits during parent transid failures · 50653190
      Chris Mason authored
      
      
      If btrfs reads a block and finds a parent transid mismatch, it clears
      the uptodate flags on the extent buffer, and the pages inside it.  But
      we only clear the uptodate bits in the state tree if the block straddles
      more than one page.
      
      This is from an old optimization to reduce contention on the extent
      state tree.  But it is buggy because the code that retries a read from
      a different copy of the block is going to find the uptodate state bits
      set and skip the IO.
      
      The end result of the bug is that we'll never actually read the good
      copy (if there is one).
      
      The fix here is to always clear the uptodate state bits, which is safe
      because this code is only called when the parent transid fails.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>