1. 14 May, 2012 2 commits
    • Dave Chinner's avatar
      xfs: limit specualtive delalloc to maxioffset · 3ed9116e
      Dave Chinner authored
      Speculative delayed allocation beyond EOF near the maximum supported
      file offset can result in creating delalloc extents beyond
      mp->m_maxioffset (8EB). These can never be trimmed during
      xfs_free_eof_blocks() because they are beyond mp->m_maxioffset, and
      that results in assert failures in xfs_fs_destroy_inode() due to
      delalloc blocks still being present. xfstests 071 exposes this
      Limit speculative delalloc to mp->m_maxioffset to avoid this
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
    • Dave Chinner's avatar
      xfs: use shared ilock mode for direct IO writes by default · 507630b2
      Dave Chinner authored
      For the direct IO write path, we only really need the ilock to be taken in
      exclusive mode during IO submission if we need to do extent allocation
      instead of all the time.
      Change the block mapping code to take the ilock in shared mode for the
      initial block mapping, and only retake it exclusively when we actually
      have to perform extent allocations.  We were already dropping the ilock
      for the transaction allocation, so this doesn't introduce new race windows.
      Based on an earlier patch from Dave Chinner.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
  2. 05 Mar, 2012 1 commit
  3. 17 Jan, 2012 1 commit
    • Christoph Hellwig's avatar
      xfs: remove the i_size field in struct xfs_inode · ce7ae151
      Christoph Hellwig authored
      There is no fundamental need to keep an in-memory inode size copy in the XFS
      inode.  We already have the on-disk value in the dinode, and the separate
      in-memory copy that we need for regular files only in the XFS inode.
      Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
      VFS inode i_size field for regular files.  Switch code that was directly
      accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where
      we are limited to regular files direct access of the VFS inode i_size field.
      This also allows dropping some fairly complicated code in the write path
      which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
      that is getting updated inside ->write_end.
      Note that we do not bother resetting the VFS i_size when truncating a file
      that gets freed to zero as there is no point in doing so because the VFS inode
      is no longer in use at this point.  Just relax the assert in xfs_ifree to
      only check the on-disk size instead.
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
  4. 13 Jan, 2012 1 commit
  5. 11 Oct, 2011 4 commits
    • Christoph Hellwig's avatar
      xfs: simplify xfs_trans_ijoin* again · ddc3415a
      Christoph Hellwig authored
      There is no reason to keep a reference to the inode even if we unlock
      it during transaction commit because we never drop a reference between
      the ijoin and commit.  Also use this fact to merge xfs_trans_ijoin_ref
      back into xfs_trans_ijoin - the third argument decides if an unlock
      is needed now.
      I'm actually starting to wonder if allowing inodes to be unlocked
      at transaction commit really is worth the effort.  The only real
      benefit is that they can be unlocked earlier when commiting a
      synchronous transactions, but that could be solved by doing the
      log force manually after the unlock, too.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAlex Elder <aelder@sgi.com>
    • Dave Chinner's avatar
      xfs: rename xfs_bmapi to xfs_bmapi_write · c0dc7828
      Dave Chinner authored
      Now that all the read-only users of xfs_bmapi have been converted to
      use xfs_bmapi_read(), we can remove all the read-only handling cases
      from xfs_bmapi().
      Once this is done, rename xfs_bmapi to xfs_bmapi_write to reflect
      the fact it is for allocation only. This enables us to kill the
      XFS_BMAPI_WRITE flag as well.
      Also clean up xfs_bmapi_write to the style used in the newly added
      xfs_bmapi_read/delay functions.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAlex Elder <aelder@sgi.com>
    • Christoph Hellwig's avatar
      xfs: introduce xfs_bmapi_delay() · 4403280a
      Christoph Hellwig authored
      Delalloc reservations are much simpler than allocations, so give
      them a separate bmapi-level interface.  Using the previously added
      xfs_bmapi_reserve_delalloc we get a function that is only minimally
      more complicated than xfs_bmapi_read, which is far from the complexity
      in xfs_bmapi.  Also remove the XFS_BMAPI_DELAY code after switching
      over the only user to xfs_bmapi_delay.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAlex Elder <aelder@sgi.com>
    • Dave Chinner's avatar
      xfs: introduce xfs_bmapi_read() · 5c8ed202
      Dave Chinner authored
      xfs_bmapi() currently handles both extent map reading and
      allocation. As a result, the code is littered with "if (wr)"
      branches to conditionally do allocation operations if required.
      This makes the code much harder to follow and causes significant
      indent issues with the code.
      Given that read mapping is much simpler than allocation, we can
      split out read mapping from xfs_bmapi() and reuse the logic that
      we have already factored out do do all the hard work of handling the
      extent map manipulations. The results in a much simpler function for
      the common extent read operations, and will allow the allocation
      code to be simplified in another commit.
      Once xfs_bmapi_read() is implemented, convert all the callers of
      xfs_bmapi() that are only reading extents to use the new function.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAlex Elder <aelder@sgi.com>
  6. 11 Jul, 2011 1 commit
  7. 08 Jul, 2011 1 commit
  8. 06 Mar, 2011 2 commits
  9. 28 Jan, 2011 1 commit
  10. 03 Jan, 2011 1 commit
    • Dave Chinner's avatar
      xfs: dynamic speculative EOF preallocation · 055388a3
      Dave Chinner authored
      Currently the size of the speculative preallocation during delayed
      allocation is fixed by either the allocsize mount option of a
      default size. We are seeing a lot of cases where we need to
      recommend using the allocsize mount option to prevent fragmentation
      when buffered writes land in the same AG.
      Rather than using a fixed preallocation size by default (up to 64k),
      make it dynamic by basing it on the current inode size. That way the
      EOF preallocation will increase as the file size increases.  Hence
      for streaming writes we are much more likely to get large
      preallocations exactly when we need it to reduce fragementation.
      For default settings, the size of the initial extents is determined
      by the number of parallel writers and the amount of memory in the
      machine. For 4GB RAM and 4 concurrent 32GB file writes:
      EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
         0: [0..1048575]:         1048672..2097247      0 (1048672..2097247)      1048576
         1: [1048576..2097151]:   5242976..6291551      0 (5242976..6291551)      1048576
         2: [2097152..4194303]:   12583008..14680159    0 (12583008..14680159)    2097152
         3: [4194304..8388607]:   25165920..29360223    0 (25165920..29360223)    4194304
         4: [8388608..16777215]:  58720352..67108959    0 (58720352..67108959)    8388608
         5: [16777216..33554423]: 117440584..134217791  0 (117440584..134217791) 16777208
         6: [33554424..50331511]: 184549056..201326143  0 (184549056..201326143) 16777088
         7: [50331512..67108599]: 251657408..268434495  0 (251657408..268434495) 16777088
      and for 16 concurrent 16GB file writes:
       EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
         0: [0..262143]:          2490472..2752615      0 (2490472..2752615)       262144
         1: [262144..524287]:     6291560..6553703      0 (6291560..6553703)       262144
         2: [524288..1048575]:    13631592..14155879    0 (13631592..14155879)     524288
         3: [1048576..2097151]:   30408808..31457383    0 (30408808..31457383)    1048576
         4: [2097152..4194303]:   52428904..54526055    0 (52428904..54526055)    2097152
         5: [4194304..8388607]:   104857704..109052007  0 (104857704..109052007)  4194304
         6: [8388608..16777215]:  209715304..218103911  0 (209715304..218103911)  8388608
         7: [16777216..33554423]: 452984848..469762055  0 (452984848..469762055) 16777208
      Because it is hard to take back specualtive preallocation, cases
      where there are large slow growing log files on a nearly full
      filesystem may cause premature ENOSPC. Hence as the filesystem nears
      full, the maximum dynamic prealloc size іs reduced according to this
      table (based on 4k block size):
      freespace       max prealloc size
        >5%             full extent (8GB)
        4-5%             2GB (8GB >> 2)
        3-4%             1GB (8GB >> 3)
        2-3%           512MB (8GB >> 4)
        1-2%           256MB (8GB >> 5)
        <1%            128MB (8GB >> 6)
      This should reduce the amount of space held in speculative
      preallocation for such cases.
      The allocsize mount option turns off the dynamic behaviour and fixes
      the prealloc size to whatever the mount option specifies. i.e. the
      behaviour is unchanged.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
  11. 16 Dec, 2010 2 commits
  12. 26 Jul, 2010 6 commits
  13. 19 May, 2010 7 commits
  14. 14 Dec, 2009 1 commit
    • Christoph Hellwig's avatar
      xfs: event tracing support · 0b1b213f
      Christoph Hellwig authored
      Convert the old xfs tracing support that could only be used with the
      out of tree kdb and xfsidbg patches to use the generic event tracer.
      To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
      all xfs trace channels by:
         echo 1 > /sys/kernel/debug/tracing/events/xfs/enable
      or alternatively enable single events by just doing the same in one
      event subdirectory, e.g.
         echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable
      or set more complex filters, etc. In Documentation/trace/events.txt
      all this is desctribed in more detail.  To reads the events do a
         cat /sys/kernel/debug/tracing/trace
      Compared to the last posting this patch converts the tracing mostly to
      the one tracepoint per callsite model that other users of the new
      tracing facility also employ.  This allows a very fine-grained control
      of the tracing, a cleaner output of the traces and also enables the
      perf tool to use each tracepoint as a virtual performance counter,
           allowing us to e.g. count how often certain workloads git various
           spots in XFS.  Take a look at
      for some examples.
      Also the btree tracing isn't included at all yet, as it will require
      additional core tracing features not in mainline yet, I plan to
      deliver it later.
      And the really nice thing about this patch is that it actually removes
      many lines of code while adding this nice functionality:
       fs/xfs/Makefile                |    8
       fs/xfs/linux-2.6/xfs_acl.c     |    1
       fs/xfs/linux-2.6/xfs_aops.c    |   52 -
       fs/xfs/linux-2.6/xfs_aops.h    |    2
       fs/xfs/linux-2.6/xfs_buf.c     |  117 +--
       fs/xfs/linux-2.6/xfs_buf.h     |   33
       fs/xfs/linux-2.6/xfs_fs_subr.c |    3
       fs/xfs/linux-2.6/xfs_ioctl.c   |    1
       fs/xfs/linux-2.6/xfs_ioctl32.c |    1
       fs/xfs/linux-2.6/xfs_iops.c    |    1
       fs/xfs/linux-2.6/xfs_linux.h   |    1
       fs/xfs/linux-2.6/xfs_lrw.c     |   87 --
       fs/xfs/linux-2.6/xfs_lrw.h     |   45 -
       fs/xfs/linux-2.6/xfs_super.c   |  104 ---
       fs/xfs/linux-2.6/xfs_super.h   |    7
       fs/xfs/linux-2.6/xfs_sync.c    |    1
       fs/xfs/linux-2.6/xfs_trace.c   |   75 ++
       fs/xfs/linux-2.6/xfs_trace.h   | 1369 +++++++++++++++++++++++++++++++++++++++++
       fs/xfs/linux-2.6/xfs_vnode.h   |    4
       fs/xfs/quota/xfs_dquot.c       |  110 ---
       fs/xfs/quota/xfs_dquot.h       |   21
       fs/xfs/quota/xfs_qm.c          |   40 -
       fs/xfs/quota/xfs_qm_syscalls.c |    4
       fs/xfs/support/ktrace.c        |  323 ---------
       fs/xfs/support/ktrace.h        |   85 --
       fs/xfs/xfs.h                   |   16
       fs/xfs/xfs_ag.h                |   14
       fs/xfs/xfs_alloc.c             |  230 +-----
       fs/xfs/xfs_alloc.h             |   27
       fs/xfs/xfs_alloc_btree.c       |    1
       fs/xfs/xfs_attr.c              |  107 ---
       fs/xfs/xfs_attr.h              |   10
       fs/xfs/xfs_attr_leaf.c         |   14
       fs/xfs/xfs_attr_sf.h           |   40 -
       fs/xfs/xfs_bmap.c              |  507 +++------------
       fs/xfs/xfs_bmap.h              |   49 -
       fs/xfs/xfs_bmap_btree.c        |    6
       fs/xfs/xfs_btree.c             |    5
       fs/xfs/xfs_btree_trace.h       |   17
       fs/xfs/xfs_buf_item.c          |   87 --
       fs/xfs/xfs_buf_item.h          |   20
       fs/xfs/xfs_da_btree.c          |    3
       fs/xfs/xfs_da_btree.h          |    7
       fs/xfs/xfs_dfrag.c             |    2
       fs/xfs/xfs_dir2.c              |    8
       fs/xfs/xfs_dir2_block.c        |   20
       fs/xfs/xfs_dir2_leaf.c         |   21
       fs/xfs/xfs_dir2_node.c         |   27
       fs/xfs/xfs_dir2_sf.c           |   26
       fs/xfs/xfs_dir2_trace.c        |  216 ------
       fs/xfs/xfs_dir2_trace.h        |   72 --
       fs/xfs/xfs_filestream.c        |    8
       fs/xfs/xfs_fsops.c             |    2
       fs/xfs/xfs_iget.c              |  111 ---
       fs/xfs/xfs_inode.c             |   67 --
       fs/xfs/xfs_inode.h             |   76 --
       fs/xfs/xfs_inode_item.c        |    5
       fs/xfs/xfs_iomap.c             |   85 --
       fs/xfs/xfs_iomap.h             |    8
       fs/xfs/xfs_log.c               |  181 +----
       fs/xfs/xfs_log_priv.h          |   20
       fs/xfs/xfs_log_recover.c       |    1
       fs/xfs/xfs_mount.c             |    2
       fs/xfs/xfs_quota.h             |    8
       fs/xfs/xfs_rename.c            |    1
       fs/xfs/xfs_rtalloc.c           |    1
       fs/xfs/xfs_rw.c                |    3
       fs/xfs/xfs_trans.h             |   47 +
       fs/xfs/xfs_trans_buf.c         |   62 -
       fs/xfs/xfs_vnodeops.c          |    8
       70 files changed, 2151 insertions(+), 2592 deletions(-)
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAlex Elder <aelder@sgi.com>
  15. 11 Dec, 2009 1 commit
  16. 10 Jun, 2009 1 commit
  17. 08 Jun, 2009 1 commit
    • Christoph Hellwig's avatar
      xfs: kill xfs_qmops · 7d095257
      Christoph Hellwig authored
      Kill the quota ops function vector and replace it with direct calls or
      stubs in the CONFIG_XFS_QUOTA=n case.
      Make sure we check XFS_IS_QUOTA_RUNNING in the right spots.  We can remove
      the number of those checks because the XFS_TRANS_DQ_DIRTY flag can't be set
      This brings us back closer to the way this code worked in IRIX and earlier
      Linux versions, but we keep a lot of the more useful factoring of common
      Eventually we should also kill xfs_qm_bhv.c, but that's left for a later
      Reduces the size of the source code by about 250 lines and the size of
      XFS module by about 1.5 kilobytes with quotas enabled:
         text	   data	    bss	    dec	    hex	filename
       615957	   2960	   3848	 622765	  980ad	fs/xfs/xfs.o
       617231	   3152	   3848	 624231	  98667	fs/xfs/xfs.o.old
       - xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects
         the inode locked and xfs_qm_dqattach which does the locking around it,
         thus removing XFS_QMOPT_ILOCKED.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarEric Sandeen <sandeen@sandeen.net>
  18. 06 Apr, 2009 3 commits
    • Dave Chinner's avatar
      xfs: remove xfs_flush_space · 8de2bf93
      Dave Chinner authored
      The only thing we need to do now when we get an ENOSPC condition during delayed
      allocation reservation is flush all the other inodes with delalloc blocks on
      them and retry without EOF preallocation. Remove the unneeded mess that is
      xfs_flush_space() and just call xfs_flush_inodes() directly from
      Also, change the location of the retry label to avoid trying to do EOF
      preallocation because we don't want to do that at ENOSPC. This enables us to
      remove the BMAPI_SYNC flag as it is no longer used.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
    • Dave Chinner's avatar
      xfs: make inode flush at ENOSPC synchronous · 5825294e
      Dave Chinner authored
      When we are writing to a single file and hit ENOSPC, we trigger a background
      flush of the inode and try again.  Because we hold page locks and the iolock,
      the flush won't proceed until after we release these locks. This occurs once
      we've given up and ENOSPC has been reported. Hence if this one is the only
      dirty inode in the system, we'll get an ENOSPC prematurely.
      To fix this, remove the async flush from the allocation routines and move
      it to the top of the write path where we can do a synchronous flush
      and retry the write again. Only retry once as a second ENOSPC indicates
      that we really are ENOSPC.
      This avoids a page cache deadlock when trying to do this flush synchronously
      in the allocation layer that was identified by Mikulas Patocka.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
    • Dave Chinner's avatar
      xfs: use xfs_sync_inodes() for device flushing · a8d770d9
      Dave Chinner authored
      Currently xfs_device_flush calls sync_blockdev() which is
      a no-op for XFS as all it's metadata is held in a different
      address to the one sync_blockdev() works on.
      Call xfs_sync_inodes() instead to flush all the delayed
      allocation blocks out. To do this as efficiently as possible,
      do it via two passes - one to do an async flush of all the
      dirty blocks and a second to wait for all the IO to complete.
      This requires some modification to the xfs-sync_inodes_ag()
      flush code to do efficiently.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
  19. 18 Jan, 2009 1 commit
  20. 15 Jan, 2009 1 commit
  21. 21 Dec, 2008 1 commit
    • Lachlan McIlroy's avatar
      [XFS] Fix speculative allocation beyond eof · 9f6c92b9
      Lachlan McIlroy authored
      Speculative allocation beyond eof doesn't work properly.  It was
      broken some time ago after a code cleanup that moved what is now
      xfs_iomap_eof_align_last_fsb() and xfs_iomap_eof_want_preallocate()
      out of xfs_iomap_write_delay() into separate functions.  The code
      used to use the current file size in various checks but got changed
      to be max(file_size, i_new_size).  Since i_new_size is the result
      of 'offset + count' then in xfs_iomap_eof_want_preallocate() the
      check for '(offset + count) <= isize' will always be true.
      ie if 'offset + count' is > ip->i_size then isize will be i_new_size
      and equal to 'offset + count'.
      This change fixes all the places that used to use the current file
      Reviewed-by: default avatarChristoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarLachlan McIlroy <lachlan@sgi.com>