1. 08 Jun, 2012 1 commit
    • GFS2: Use lvbs for storing rgrp information with mount option · 90306c41
      Benjamin Marzinski authored
      
      
      Instead of reading in the resource groups when gfs2 is checking
      for free space to allocate from, gfs2 can store the necessary information
      in the resource group's lvb.  Also, instead of searching for unlinked
      inodes in every resource group that's checked for free space, gfs2 can
      store the number of unlinked inodes in the lvb, and only search for
      unlinked inodes when that count is nonzero.
      
      The first time a resource group is locked, the lvb must be initialized.
      Since this involves counting the unlinked inodes in the resource group,
      this takes a little extra time.  But after that, if the resource group
      is locked with GL_SKIP, the buffer head won't be read in unless it's
      actually needed.
      
      Enabling the resource group lvbs is done via the rgrplvb mount option.  If
      this option isn't set, the lvbs will still be set and updated, but they won't
      be verified or used by the filesystem.  To safely turn on this option, all of
      the nodes mounting the filesystem must be running code with this patch, and
      the filesystem must have been completely unmounted since they were updated.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
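The decision described in the commit above can be sketched in plain C. This is only an illustration under stated assumptions, not the kernel code: struct rgrp_lvb_sketch, must_read_bitmaps and should_scan_unlinked are hypothetical names, and the real lvb layout, GL_SKIP handling and locking are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the counters a resource group could keep in
 * its lock value block (lvb). Field names are illustrative only. */
struct rgrp_lvb_sketch {
    bool     valid;     /* false until the first locker initializes it */
    uint32_t free;      /* free blocks in the resource group */
    uint32_t unlinked;  /* unlinked inodes waiting to be reclaimed */
};

/* Decide whether the bitmap buffers need to be read for an allocation
 * that wants `wanted` blocks. With a valid lvb, a resource group that is
 * too full can be rejected without reading it in. */
static bool must_read_bitmaps(const struct rgrp_lvb_sketch *lvb, uint32_t wanted)
{
    assert(lvb);
    if (!lvb->valid)
        return true;             /* first lock: counters must be built */
    return lvb->free >= wanted;  /* only read groups that can satisfy us */
}

/* Only scan for unlinked inodes when the lvb says that some exist. */
static bool should_scan_unlinked(const struct rgrp_lvb_sketch *lvb)
{
    assert(lvb);
    return !lvb->valid || lvb->unlinked > 0;
}
```

In this sketch an uninitialized lvb always forces a read, which mirrors the extra cost the commit message attributes to the first lock of a resource group.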
  2. 06 Jun, 2012 2 commits
  3. 11 May, 2012 1 commit
    • GFS2: Add rgrp information to block_alloc trace point · 41db1ab9
      Bob Peterson authored
      
      
      This is a second attempt at a patch that adds rgrp information to the
      block allocation trace point for GFS2. As suggested, the patch was
      modified to list the rgrp information _after_ the fields that exist today.
      
      Again, the reason for this patch is to allow us to trace and debug
      problems with the block reservations patch, which is still in the works.
      We can debug problems with reservations if we can see what block allocations
      result from the block reservations. It may also be handy in figuring out
      if there are problems in rgrp free space accounting. In other words,
      we can use it to track the rgrp and its free space alongside the allocations
      that are taking place.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  4. 27 Apr, 2012 1 commit
  5. 24 Apr, 2012 5 commits
  6. 05 Apr, 2012 1 commit
  7. 26 Mar, 2012 1 commit
  8. 05 Mar, 2012 2 commits
    • GFS2: make sure rgrps are up to date in func gfs2_blk2rgrpd · 58884c4d
      Bob Peterson authored
      
      
      This patch adds a call to gfs2_rindex_update from function gfs2_blk2rgrpd
      and removes calls to it that are made redundant by it. The problem is
      that a gfs2_grow can add rgrps to the rindex, then put those rgrps into
      use, thus rendering the rindex we read in at mount time incomplete.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Eliminate sd_rindex_mutex · 6aad1c3d
      Bob Peterson authored
      
      
      Over time, we've slowly eliminated the use of sd_rindex_mutex.
      Up to this point, it was only used in two places: function
      gfs2_ri_total (which totals the file system size by reading
      and parsing the rindex file) and function gfs2_rindex_update
      which updates the rgrps in memory. Both of these functions have
      the rindex glock to protect them, so the mutex is unnecessary.
      Since gfs2_grow writes to the rindex via the meta_fs, the mutex
      is in the wrong order according to the normal rules. This patch
      eliminates the mutex entirely to avoid the problem.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  9. 01 Mar, 2012 1 commit
  10. 28 Feb, 2012 2 commits
    • GFS2: FITRIM ioctl support · 66fc061b
      Steven Whitehouse authored
      
      
      The FITRIM ioctl provides an alternative way to send discard requests to
      the underlying device. Using the discard mount option results in every
      freed block generating a discard request to the block device. This can
      be slow, since many block devices can only process discard requests of
      larger sizes, and the discard operations themselves can be time consuming.
      
      Rather than relying on the discard mount option, FITRIM allows an
      occasional sweep of the filesystem, and optionally avoids sending
      discard requests for smaller regions.
      
      In GFS2, FITRIM works at resource group granularity. Each resource
      group has a flag which records whether that resource group has
      been trimmed. This flag is reset whenever a deallocation occurs in the
      resource group, and set whenever a successful FITRIM of that resource
      group has taken place. This helps to reduce repeated discard requests
      for the same block ranges, again improving performance.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
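The per-resource-group flag described above behaves like a small state machine, which can be sketched as follows. This is a simplified model under assumptions, not the GFS2 implementation: struct rgrp_sketch, rgrp_dealloc and fitrim_sweep are hypothetical names, each resource group is reduced to a single free region, and the choice to mark a group trimmed even when its region is below the minimum length is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rgrp_sketch {
    uint64_t free_blocks;  /* simplification: one contiguous free region */
    bool     trimmed;      /* set after a successful trim, reset on free */
};

/* A deallocation creates newly discardable space, so the flag is reset. */
static void rgrp_dealloc(struct rgrp_sketch *rgd, uint64_t blocks)
{
    assert(rgd);
    rgd->free_blocks += blocks;
    rgd->trimmed = false;
}

/* Sweep every resource group, skipping groups that are already trimmed
 * and regions shorter than minlen. Returns the number of blocks for which
 * a discard would have been issued. */
static uint64_t fitrim_sweep(struct rgrp_sketch *rgds, size_t n, uint64_t minlen)
{
    uint64_t discarded = 0;

    assert(rgds || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (rgds[i].trimmed)
            continue;                       /* nothing freed since last trim */
        if (rgds[i].free_blocks >= minlen)
            discarded += rgds[i].free_blocks;
        rgds[i].trimmed = true;             /* assumption: visited counts as trimmed */
    }
    return discarded;
}
```

A second sweep with no intervening deallocation issues no discards in this model, which is the repeated-request saving the commit message refers to.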
    • GFS2: Read resource groups on mount · a365fbf3
      Steven Whitehouse authored
      
      
      This makes mount take slightly longer, but in return the first
      write to the filesystem will be faster. It also means that if there
      is a problem in the resource index, then we can refuse to mount rather
      than having to report it when the first write occurs.
      
      In addition, to avoid recursive locking, we have to take account of
      instances when the rindex glock may already be held when we are
      trying to update the rbtree of resource groups.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  11. 11 Jan, 2012 1 commit
  12. 22 Nov, 2011 2 commits
    • GFS2: Fix multi-block allocation · 6a8099ed
      Steven Whitehouse authored
      
      
      Clean up gfs2_alloc_blocks so that it takes the full extent length
      rather than just the number of non-inode blocks as an argument. That
      will only make a difference in the inode allocation case for now.
      
      Also, this fixes the extent length handling around gfs2_alloc_extent() so
      that multi-block allocations will work again.
      
      The rd_last_alloc block is set to the final block in the allocated
      extent (as per the update to i_goal, but referenced to a different
      start point).
      
      This also removes the dinode argument to rgblk_search() which is no
      longer used.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
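The relationship between the goal block and rd_last_alloc mentioned above is a matter of choosing the reference point. A minimal worked sketch, assuming the goal is an absolute block number and the last-allocated value is counted from the resource group's first data block (struct extent_result and finish_extent are hypothetical names, not kernel functions):

```c
#include <assert.h>
#include <stdint.h>

struct extent_result {
    uint64_t goal;        /* absolute number of the extent's final block */
    uint64_t last_alloc;  /* the same block, counted from the rgrp's data start */
};

/* Compute both values for an extent of `len` blocks starting at absolute
 * block `start`, inside a resource group whose data begins at `rgrp_data0`. */
static struct extent_result finish_extent(uint64_t rgrp_data0, uint64_t start, uint32_t len)
{
    struct extent_result r;

    assert(len > 0 && start >= rgrp_data0);
    r.goal = start + len - 1;            /* end of the extent, not its start */
    r.last_alloc = r.goal - rgrp_data0;  /* same block, different origin */
    return r;
}
```

For example, a 4-block extent starting at block 1010 in a resource group whose data starts at block 1000 gives a goal of 1013 and a relative value of 13.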
    • GFS2: decouple quota allocations from block allocations · 564e12b1
      Bob Peterson authored
      
      
      This patch separates the code pertaining to allocations into two
      parts: quota-related information and block reservations.
      This patch also moves all the block reservation structure allocations to
      function gfs2_inplace_reserve to simplify the code, and moves
      the frees to function gfs2_inplace_release.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  13. 21 Nov, 2011 3 commits
  14. 18 Nov, 2011 1 commit
  15. 15 Nov, 2011 2 commits
  16. 21 Oct, 2011 8 commits
    • GFS2: Remove two unused variables · 9ae32429
      Steven Whitehouse authored
      
      
      The two variables being initialised in gfs2_inplace_reserve
      to track the file & line number of the caller are never
      used, so we might as well remove them.
      
      If something does go wrong, then a stack trace is probably
      more useful anyway.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Fix off-by-one in gfs2_blk2rgrpd · f75bbfb4
      Steven Whitehouse authored
      
      
      Bob reported:
      
      I found an off-by-one problem with how I coded this section:
      It should be:
      
      + else if (blk >= cur->rd_data0 + cur->rd_data)
      
      In fact, cur->rd_data0 + cur->rd_data is the start of the next
      rgrp (the next ri_addr), so without the "=" check it can land on
      the wrong rgrp.
      
      In all normal cases, this won't be a problem: you're searching
      for a block _within_ the rgrp, which will pass the test properly.
      Where it gets into trouble is if you search the rgrps for the
      block exactly equal to ri_addr.  I don't think anything in the
      kernel does this, but I found a place in gfs2-utils gfs2_edit
      where it does.  So I definitely need to fix it in libgfs2.  I'd
      like to suggest we fix it in the kernel as well for the sake of
      keeping the functions similar.
      
      So this patch fixes the above-mentioned off-by-one error as well
      as removing the unused parent pointer.
      Reported-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
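The boundary condition Bob describes can be illustrated with a small search over sorted ranges. This sketch swaps the kernel's rbtree for a binary search over an array to stay self-contained; struct rgrp_range and find_rgrp are hypothetical names, and the fields only mimic rd_addr, rd_data0 and rd_data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rgrp_range {
    uint64_t addr;   /* first block of the rgrp (header), like ri_addr */
    uint64_t data0;  /* first data block, like rd_data0 */
    uint64_t data;   /* number of data blocks, like rd_data */
};

/* Binary search over rgrps sorted by addr. Returns the index of the rgrp
 * containing blk, or -1 if no rgrp contains it. */
static int find_rgrp(const struct rgrp_range *r, size_t n, uint64_t blk)
{
    size_t lo = 0, hi = n;

    assert(r || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (blk < r[mid].addr)
            hi = mid;
        else if (blk >= r[mid].data0 + r[mid].data)  /* ">=": the end is exclusive */
            lo = mid + 1;
        else
            return (int)mid;
    }
    return -1;
}
```

With ">" in place of ">=", a lookup for the block exactly equal to the next rgrp's start address would be attributed to the preceding rgrp whenever the search visited that one first, which is the case the report describes.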
    • GFS2: Correctly set goal block after allocation · ccad4e14
      Steven Whitehouse authored
      
      
      The new goal block should be set to the end of the newly
      allocated extent, not the start of it.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Use cached rgrp in gfs2_rlist_add() · 70b0c365
      Steven Whitehouse authored
      
      
      Each block which is deallocated requires a call to gfs2_rlist_add()
      and each of those calls was calling gfs2_blk2rgrpd() in order to
      figure out which rgrp the block belonged in. This can be sped up
      by making use of the rgrp cached in the inode. We also reset this
      cached rgrp in case the block has changed rgrp. This should provide
      a big reduction in gfs2_blk2rgrpd() calls during deallocation.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Remove obsolete assert · 534029e2
      Steven Whitehouse authored
      
      
      Given that a resource group has been locked, there is no reason why
      we should not be able to allocate as many blocks as are free. The
      al_requested parameter should really be considered as a minimum
      number of blocks to be available. Should this limit be overshot,
      there are other mechanisms which will prevent over allocation.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Cache the most recently used resource group in the inode · 54335b1f
      Steven Whitehouse authored
      
      
      This means that after the initial allocation for any inode, the
      last used resource group is cached in the inode for future use.
      This drastically reduces the number of lookups of resource
      groups in the common case, and thus the contention on that
      data structure.
      
      The allocation algorithm is the same as previously, except that we
      always check to see if the goal block is within the cached rgrp
      first before going to the rbtree to look one up.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
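The check-the-cache-first pattern described above can be sketched as below. This is an illustration under assumptions rather than the kernel code: a linear scan stands in for the rbtree lookup, and struct rgrp_span, struct inode_sketch and rgrp_for_goal are hypothetical names. The tree_lookups counter exists only to make the saving visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rgrp_span {
    uint64_t first;  /* first block covered */
    uint64_t count;  /* number of blocks covered */
};

struct inode_sketch {
    const struct rgrp_span *cached;  /* last rgrp used, or NULL */
    unsigned tree_lookups;           /* how often the slow path ran */
};

static bool span_contains(const struct rgrp_span *s, uint64_t blk)
{
    return blk >= s->first && blk < s->first + s->count;
}

/* Slow path: a linear scan standing in for the shared rbtree lookup. */
static const struct rgrp_span *slow_lookup(const struct rgrp_span *all,
                                           size_t n, uint64_t blk)
{
    for (size_t i = 0; i < n; i++)
        if (span_contains(&all[i], blk))
            return &all[i];
    return NULL;
}

/* Try the rgrp cached in the inode first; fall back to the shared
 * structure only when the goal block lies outside it. */
static const struct rgrp_span *rgrp_for_goal(struct inode_sketch *ip,
                                             const struct rgrp_span *all,
                                             size_t n, uint64_t goal)
{
    assert(ip);
    if (ip->cached && span_contains(ip->cached, goal))
        return ip->cached;
    ip->tree_lookups++;
    ip->cached = slow_lookup(all, n, goal);
    return ip->cached;
}
```

Repeated allocations near the same goal block hit the cached pointer, so the shared structure is consulted only when the goal moves to a different resource group.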
    • GFS2: Make resource groups "append only" during life of fs · 8339ee54
      Steven Whitehouse authored
      
      
      Since we have ruled out supporting online filesystem shrink,
      it is possible to make the resource group list append only
      during the life of a super block. This gives several benefits:
      
      Firstly, we only need to read new rindex elements as they are added
      rather than needing to reread the whole rindex file each time one
      element is added.
      
      Secondly, the rindex glock can be held for much shorter periods of
      time, and is completely removed from the fast path for allocations.
      The lock is taken in shared mode only when updating the resource
      groups when the first allocation occurs, and after a grow has
      taken place.
      
      Thirdly, this results in a reduction in code size, and everything
      gets a lot simpler to understand in this area.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
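The first benefit listed above, reading only the newly added rindex elements, follows directly from the append-only rule. A minimal sketch, assuming the rindex can be modeled as an array that only ever grows (struct rindex_entry, struct rgrp_table, MAX_RGRPS and rindex_update are hypothetical names, and no locking is shown):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RGRPS 64

struct rindex_entry {
    uint64_t addr;  /* start block of the resource group */
};

struct rgrp_table {
    struct rindex_entry loaded[MAX_RGRPS];
    size_t count;   /* entries already read; never decreases */
};

/* Read only the entries appended since the previous update. Returns the
 * number of new entries read. Because existing entries never change or
 * disappear, everything below t->count can be left untouched. */
static size_t rindex_update(struct rgrp_table *t,
                            const struct rindex_entry *file,
                            size_t file_entries)
{
    size_t added = 0;

    assert(t && file_entries <= MAX_RGRPS);
    while (t->count < file_entries) {
        t->loaded[t->count] = file[t->count];
        t->count++;
        added++;
    }
    return added;
}
```

After a grow adds one element, the next update in this model reads exactly one entry instead of rereading the whole index.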
    • GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme · 7c9ca621
      Bob Peterson authored
      
      
      Here is an update of Bob's original rbtree patch which, in addition, also
      resolves the rather strange ref counting that was being done relating to
      the bitmap blocks.
      
      Originally we had a dual system for journaling resource groups. The metadata
      blocks were journaled and also the rgrp itself was added to a list. The reason
      for adding the rgrp to the list in the journal was so that the "repolish
      clones" code could be run to update the free space, and potentially send any
      discard requests when the log was flushed. This was done by comparing the
      "cloned" bitmap with what had been written back on disk during the transaction
      commit.
      
      Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
      until the journal had been flushed. For that reason, there was a rather
      complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
      both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
      count on the buffers.
      
      However, the journal maintains a reference count on the buffers anyway, since
      they are being journaled as metadata buffers. So by moving the code which deals
      with the post-journal accounting for bitmap blocks to the metadata journaling
      code, we can entirely dispense with the rather strange buffer ref counting
      scheme and also the requirement to journal the rgrps.
      
      The net result of all this is that the ->sd_rindex_spin is left to do exactly
      one job, and that is to look after the rbtree of rgrps.
      
      This patch is designed to be a stepping stone towards using RCU for the rbtree
      of resource groups, however the reduction in the number of uses of the
      ->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
      anyway.
      
      The patch retains ->go_lock and ->go_unlock for rgrps, however these may also
      be removed in future in favour of calling the functions directly where required
      in the code. That will allow locking of resource groups without needing to
      actually read them in - something that could be useful in speeding up statfs.
      
      In the meantime, though, it is valid to dereference ->bi_bh only when the rgrp
      is locked. This is basically the same rule as before, modulo the references not
      being valid until the following journal flush.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Cc: Benjamin Marzinski <bmarzins@redhat.com>
  17. 15 Jul, 2011 1 commit
  18. 21 May, 2011 1 commit
    • GFS2: Wipe directory hash table metadata when deallocating a directory · 6d3117b4
      Steven Whitehouse authored
      
      
      The deallocation code for directories in GFS2 is largely divided into
      two parts. The first part deallocates any directory leaf blocks and
      marks the directory as being a regular file when that is complete. The
      second stage was identical to deallocating regular files.
      
      Regular files have their data blocks in a different
      address space to directories, and thus what would have been normal data
      blocks in a regular file (the hash table in a GFS2 directory) were
      deallocated correctly. However, a reference to these blocks was left in the
      journal (assuming of course that some previous activity had resulted in
      those blocks being in the journal or ail list).
      
      This patch uses the i_depth as a test of whether the inode is an
      exhash directory (we cannot test the inode type as that has already
      been changed to a regular file at this stage in deallocation).
      
      The original issue was reported by Chris Hertel, who encountered it
      while running bonnie++.
      Reported-by: Christopher R. Hertel <crh@samba.org>
      Cc: Abhijith Das <adas@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  19. 20 Apr, 2011 2 commits
  20. 18 Apr, 2011 1 commit
    • GFS2: filesystem hang caused by incorrect lock order · 44ad37d6
      Bob Peterson authored
      
      
      This patch fixes a deadlock in GFS2 where two processes are trying
      to reclaim an unlinked dinode:
      One holds the inode glock and calls gfs2_lookup_by_inum trying to look
      up the inode, which it can't, due to I_FREEING.  The other has set
      I_FREEING from vfs and is at the beginning of gfs2_delete_inode
      waiting for the glock, which is held by the first.  The solution is to
      add a new non_block parameter to the gfs2_iget function that causes it
      to return -ENOENT if the inode is being freed.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
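The effect of the non_block parameter can be sketched as a simple decision. This is a model of the behaviour described in the commit message, not the gfs2_iget code: struct inode_state, enum lookup_result and iget_sketch are hypothetical names, and waiting is represented by a return value rather than by actually blocking.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct inode_state {
    bool freeing;  /* stands in for I_FREEING being set by the VFS */
};

enum lookup_result {
    LOOKUP_OK = 0,         /* inode is usable */
    LOOKUP_MUST_WAIT = 1   /* caller would have to wait for the free */
};

/* Returns -ENOENT when the inode is being freed and the caller asked not
 * to block. A caller that holds the inode glock should use non_block,
 * since waiting would deadlock against the process freeing the inode. */
static int iget_sketch(const struct inode_state *inode, bool non_block)
{
    assert(inode);
    if (!inode->freeing)
        return LOOKUP_OK;
    if (non_block)
        return -ENOENT;
    return LOOKUP_MUST_WAIT;
}
```

The point of the sketch is that the process already holding the glock gets an error back immediately instead of waiting on a process that is itself waiting for that glock.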
  21. 24 Feb, 2011 1 commit
    • GFS2: deallocation performance patch · 4c16c36a
      Bob Peterson authored
      
      
      This patch is a performance improvement to GFS2's dealloc code.
      Rather than update the quota file and statfs file for every
      single block that's stripped off in unlink function do_strip,
      this patch keeps track and updates them once for every layer
      that's stripped.  This is done entirely inside the existing
      transaction, so there should be no risk of corruption.
      The other functions that deallocate blocks will be unaffected
      because they are using wrapper functions that do the same
      thing that they do today.
      
      I tested this code on my roth cluster by creating 200
      files in a directory, each of which is 100MB, then on
      four nodes, I simultaneously deleted the files, thus competing
      for GFS2 resources (but different files).  The commands
      I used were:
      
      [root@roth-01]# time for i in `seq 1 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      [root@roth-02]# time for i in `seq 2 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      [root@roth-03]# time for i in `seq 3 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      [root@roth-05]# time for i in `seq 4 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      
      The performance increase was significant:
      
                   roth-01     roth-02     roth-03     roth-05
                   ---------   ---------   ---------   ---------
      old: real    0m34.027s   0m25.021s   0m23.906s   0m35.646s
      new: real    0m22.379s   0m24.362s   0m24.133s   0m18.562s
      
      Total time spent deleting:
      old: 118.6s
      new:  89.4s
      
      For this particular case, this showed a 25% performance increase for
      GFS2 unlinks.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
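The batching idea behind this patch can be shown with a counter. This is a sketch under assumptions and does not reproduce do_strip: struct accounting, account_update, strip_per_block and strip_per_layer are hypothetical names, and a single counter stands in for both the quota file and the statfs file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct accounting {
    int64_t  freed_blocks;  /* running total reported to quota/statfs */
    unsigned updates;       /* how many times the shared files were touched */
};

static void account_update(struct accounting *a, int64_t delta)
{
    assert(a);
    a->freed_blocks += delta;
    a->updates++;
}

/* Old behaviour: one accounting update for every block stripped. */
static void strip_per_block(struct accounting *a, const uint32_t *layers, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (uint32_t b = 0; b < layers[i]; b++)
            account_update(a, 1);
}

/* New behaviour: count the blocks in a layer, then update once. */
static void strip_per_layer(struct accounting *a, const uint32_t *layers, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (layers[i] > 0)
            account_update(a, layers[i]);
}
```

Both variants arrive at the same total, but for layers of 4, 500 and 2 blocks the per-layer variant performs 3 updates where the per-block variant performs 506.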