1. 20 Jul, 2011 3 commits
    • fs: always maintain i_dio_count · df2d6f26
      Christoph Hellwig authored
      Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
      This allows these filesystems to also protect truncate against direct I/O requests
      by using common code.  Right now the only non-DIO_LOCKING filesystem that
      appears to do so is XFS, which uses an opencoded variant of the i_dio_count
      scheme.
      
      Behaviour doesn't change for filesystems never calling inode_dio_wait.
      For ext4 behaviour changes when using the dioread_nonlock option, which
      previously was missing any protection between truncate and direct I/O reads.
      For ocfs2 the handcrafted i_dio_count manipulations are replaced with
      the common code now enabled.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      df2d6f26
    • fs: move inode_dio_wait calls into ->setattr · 562c72aa
      Christoph Hellwig authored
      Let filesystems handle waiting for direct I/O requests themselves instead
      of doing it beforehand.  This means filesystem-specific locks to prevent
      new dio references from appearing can be held.  This is important to allow
      generalizing i_dio_count to non-DIO_LOCKING filesystems.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      562c72aa
    • fs: kill i_alloc_sem · bd5fe6c5
      Christoph Hellwig authored
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and its write side is always mirrored by
      real exclusion.  Its intended use is to wait for all pending direct I/O
      requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall away
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to read zero because the direct I/O code always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
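
The counting scheme described above can be sketched in userspace as follows. This is only an analogy under stated assumptions: the kernel uses an atomic counter and wake_up_bit on a bit in i_flags, while this sketch substitutes a pthread mutex and condition variable. All demo_* names are hypothetical, and the exclusion that i_mutex provides against new requests is not modelled.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userspace stand-in for the relevant part of struct inode. */
struct demo_inode {
    pthread_mutex_t lock;      /* protects dio_count, pairs with the condvar */
    pthread_cond_t  dio_done;  /* plays the role of wake_up_bit() */
    int             dio_count; /* direct I/O requests currently in flight */
};

#define DEMO_INODE_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Called when a direct I/O request is submitted. */
static void demo_dio_begin(struct demo_inode *inode)
{
    pthread_mutex_lock(&inode->lock);
    inode->dio_count++;
    pthread_mutex_unlock(&inode->lock);
}

/* Called on completion; the last completion wakes any waiting truncate. */
static void demo_dio_done(struct demo_inode *inode)
{
    pthread_mutex_lock(&inode->lock);
    assert(inode->dio_count > 0);
    if (--inode->dio_count == 0)
        pthread_cond_broadcast(&inode->dio_done);
    pthread_mutex_unlock(&inode->lock);
}

/* Truncate side: block until no direct I/O request is pending. */
static void demo_dio_wait(struct demo_inode *inode)
{
    pthread_mutex_lock(&inode->lock);
    while (inode->dio_count != 0)
        pthread_cond_wait(&inode->dio_done, &inode->lock);
    pthread_mutex_unlock(&inode->lock);
}
```

A truncate path in this sketch would call demo_dio_wait() while holding whatever lock keeps new requests from starting, which is the role the message assigns to i_mutex.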
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      bd5fe6c5
  2. 19 Jul, 2011 3 commits
  3. 25 May, 2011 1 commit
  4. 13 May, 2011 1 commit
  5. 22 Feb, 2011 1 commit
  6. 07 Mar, 2011 1 commit
    • ocfs2: Remove EXIT from masklog. · c1e8d35e
      Tao Ma authored
      mlog_exit is used to record the exit status of a function.
      But because it is added in so many functions, if we enable it,
      the system logs get filled up quickly and cause too much I/O.
      So actually no one can enable it for a production system or even
      for a test.
      
      This patch just tries to remove it or change it. So:
      1. if all the error paths already use mlog_errno, it is just removed.
         Otherwise, it will be replaced by mlog_errno.
      2. if it is used to print some return value, it is replaced with
         mlog(0,...).
      mlog_exit_ptr is changed to mlog(0,...).
      All those mlog(0,...) will be replaced with trace events later.
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      c1e8d35e
  7. 20 Feb, 2011 1 commit
    • ocfs2: Remove ENTRY from masklog. · ef6b689b
      Tao Ma authored
      ENTRY is used to record the entry of a function.
      But because it is added in so many functions, if we enable it,
      the system logs get filled up quickly and cause too much I/O.
      So actually no one can enable it for a production system or even
      for a test.
      
      So for mlog_entry_void, we just remove it.
      For mlog_entry(...), we replace it with mlog(0,...), and these
      will be replaced by trace events later.
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      ef6b689b
  8. 17 Jan, 2011 2 commits
    • fallocate should be a file operation · 2fe17c10
      Christoph Hellwig authored
      Currently all filesystems except XFS implement fallocate asynchronously,
      while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
      I/O we really want our allocation on disk, especially for the !KEEP_SIZE
      case where we actually grow the file with user-visible zeroes.  On the
      other hand always committing the transaction is a bad idea for fast-path
      uses of fallocate like for example in recent Samba versions.   Given
      that block allocation is a data plane operation anyway change it from
      an inode operation to a file operation so that we have the file structure
      available that lets us check for O_SYNC.
      
      This also includes moving the code around for a few of the filesystems,
      and removes the already unneeded S_ISDIR checks given that we only wire
      up fallocate for regular files.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2fe17c10
    • make the feature checks in ->fallocate future proof · 64c23e86
      Christoph Hellwig authored
      Instead of various home grown checks that might need updates for new
      flags just check for any bit outside the mask of the features supported
      by the filesystem.  This makes the check future proof for any newly
      added flag.
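
The mask-based check described above might look roughly like this. It is a minimal sketch, not the code from the commit: the DEMO_* names and the choice of a filesystem that supports only KEEP_SIZE are assumptions, and the two flag values mirror FALLOC_FL_KEEP_SIZE and FALLOC_FL_PUNCH_HOLE from linux/falloc.h.

```c
#include <assert.h>
#include <errno.h>

/* Values mirror linux/falloc.h so the sketch builds without kernel headers. */
#define DEMO_FALLOC_FL_KEEP_SIZE  0x01
#define DEMO_FALLOC_FL_PUNCH_HOLE 0x02
#define DEMO_FALLOC_FL_KNOWN \
    (DEMO_FALLOC_FL_KEEP_SIZE | DEMO_FALLOC_FL_PUNCH_HOLE)

/* Hypothetical filesystem that implements only KEEP_SIZE. */
#define DEMO_FS_SUPPORTED (DEMO_FALLOC_FL_KEEP_SIZE)

/* The supported mask should only ever name flags that exist. */
static_assert((DEMO_FS_SUPPORTED & ~DEMO_FALLOC_FL_KNOWN) == 0,
              "supported mask contains an unknown flag");

/*
 * Reject any bit outside the supported mask.  A flag added to the
 * interface later is refused automatically, with no per-flag check.
 */
static int demo_check_fallocate_mode(int mode)
{
    if (mode & ~DEMO_FS_SUPPORTED)
        return -EOPNOTSUPP;
    return 0;
}
```

With this shape, supporting a new flag means adding it to DEMO_FS_SUPPORTED; every other flag, including ones that do not exist yet, keeps returning -EOPNOTSUPP.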
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      64c23e86
  9. 12 Jan, 2011 1 commit
  10. 06 Jan, 2011 1 commit
  11. 09 Dec, 2010 1 commit
  12. 25 Oct, 2010 1 commit
  13. 22 Oct, 2010 1 commit
  14. 11 Oct, 2010 1 commit
    • ocfs2: Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes. · 7bdb0d18
      Tristan Ye authored
      Currently, the default behavior of O_DIRECT writes was allowing
      concurrent writing among nodes to the same file, with no cluster
      coherency guaranteed (no EX lock held).  This can leave stale data in
      the cache for buffered reads on other nodes.
      
      The new mount option introduces a choice between two different
      behaviors for O_DIRECT writes:
      
          * coherency=full, as the default value, will disallow
                            concurrent O_DIRECT writes by taking
                            EX locks.
      
          * coherency=buffered, allow concurrent O_DIRECT writes
                                without EX lock among nodes, which
                                gains high performance at risk of
                                getting stale data on other nodes.
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      7bdb0d18
  15. 15 Sep, 2010 1 commit
  16. 10 Sep, 2010 2 commits
  17. 08 Sep, 2010 3 commits
  18. 11 Aug, 2010 2 commits
  19. 09 Aug, 2010 2 commits
    • check ATTR_SIZE constraints in inode_change_ok · 2c27c65e
      Christoph Hellwig authored
      Make sure we check the truncate constraints early on in ->setattr by adding
      those checks to inode_change_ok.  Also clean up and document inode_change_ok
      to make this obvious.
      
      As a fallout we don't have to call inode_newsize_ok from simple_setsize and
      simplify it down to a truncate_setsize which doesn't return an error.  This
      simplifies a lot of setattr implementations and means we use truncate_setsize
      almost everywhere.  Get rid of fat_setsize now that it's trivial and mark
      ext2_setsize static to make the calling convention obvious.
      
      Keep the inode_newsize_ok in vmtruncate for now as all callers need an
      audit for its removal anyway.
      
      Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
      needs a deeper audit, but that is left for later.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2c27c65e
    • remove inode_setattr · 1025774c
      Christoph Hellwig authored
      Replace inode_setattr with opencoded variants of it in all callers.  This
      moves the remaining call to vmtruncate into the filesystem methods where it
      can be replaced with the proper truncate sequence.
      
      In a few cases it was obvious that we would never end up calling vmtruncate
      so it was left out in the opencoded variant:
      
       spufs: explicitly checks for ATTR_SIZE earlier
       btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
       ufs: contains an opencoded simple_setattr + truncate that sets the filesize just above
      
      In addition to that ncpfs called inode_setattr with handcrafted iattrs,
      which allowed trimming down the opencoded variant.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1025774c
  20. 16 Jul, 2010 1 commit
  21. 08 Jul, 2010 2 commits
    • ocfs2: Zero the tail cluster when extending past i_size. · 5693486b
      Joel Becker authored
      ocfs2's allocation unit is the cluster.  This can be larger than a block
      or even a memory page.  This means that a file may have many blocks in
      its last extent that are beyond the block containing i_size.  There also
      may be more unwritten extents after that.
      
      When ocfs2 grows a file, it zeros the entire cluster in order to ensure
      future i_size growth will see cleared blocks.  Unfortunately,
      block_write_full_page() drops the pages past i_size.  This means that
      ocfs2 is actually leaking garbage data into the tail end of that last
      cluster.  This is a bug.
      
      We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
      when a write or truncate is past i_size.  They will use
      ocfs2_zero_extend() to ensure the data is properly zeroed.
      
      Older versions of ocfs2_zero_extend() simply zeroed every block between
      i_size and the zeroing position.  This presumes three things:
      
      1) There is allocation for all of these blocks.
      2) The extents are not unwritten.
      3) The extents are not refcounted.
      
      (1) and (2) hold true for non-sparse filesystems, which used to be the
      only users of ocfs2_zero_extend().  (3) is another bug.
      
      Since we're now using ocfs2_zero_extend() for sparse filesystems as
      well, we teach ocfs2_zero_extend() to check every extent between
      i_size and the zeroing position.  If the extent is unwritten, it is
      ignored.  If it is refcounted, it is CoWed.  Then it is zeroed.
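
The per-extent decision described above can be sketched as a small classifier. This is an illustration only: the flag names, the demo_extent structure, and the helper functions are hypothetical and do not correspond to the actual ocfs2 data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent flags; the real ocfs2 names and values differ. */
#define DEMO_EXT_UNWRITTEN  0x1u
#define DEMO_EXT_REFCOUNTED 0x2u

enum demo_action { DEMO_SKIP, DEMO_ZERO, DEMO_COW_THEN_ZERO };

struct demo_extent {
    unsigned int flags;
};

/* Decide what zero-extend does with a single extent. */
static enum demo_action demo_zero_action(const struct demo_extent *ext)
{
    if (ext->flags & DEMO_EXT_UNWRITTEN)
        return DEMO_SKIP;          /* unwritten extents are ignored */
    if (ext->flags & DEMO_EXT_REFCOUNTED)
        return DEMO_COW_THEN_ZERO; /* shared extents are CoWed first */
    return DEMO_ZERO;              /* ordinary extents are zeroed in place */
}

/*
 * Classify every extent between i_size and the zeroing position.
 * Fills plan[] and returns how many extents end up being zeroed.
 */
static size_t demo_plan_zero_extend(const struct demo_extent *exts, size_t n,
                                    enum demo_action *plan)
{
    size_t to_zero = 0;

    assert(exts != NULL && plan != NULL);
    for (size_t i = 0; i < n; i++) {
        plan[i] = demo_zero_action(&exts[i]);
        if (plan[i] != DEMO_SKIP)
            to_zero++;
    }
    return to_zero;
}
```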
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      Cc: stable@kernel.org
      5693486b
    • ocfs2: When zero extending, do it by page. · a4bfb4cf
      Joel Becker authored
      ocfs2_zero_extend() does its zeroing block by block, but it calls a
      function named ocfs2_write_zero_page().  Let's have
      ocfs2_write_zero_page() handle the page level.  From
      ocfs2_zero_extend()'s perspective, it is now page-at-a-time.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      Cc: stable@kernel.org
      a4bfb4cf
  22. 27 May, 2010 2 commits
  23. 21 May, 2010 2 commits
  24. 18 May, 2010 3 commits
    • Ocfs2: Optimize punching-hole code. · c1631d4a
      Tristan Ye authored
      This patch simplifies the logic of handling existing holes and
      skipping extent blocks and removes some confusing comments.
      
      The patch survived the fill_verify_holes testcase in ocfs2-test.
      It also passed my manual sanity check and stress tests with enormous
      extent records.
      
      Currently punching a hole on a file with 3+ extent tree depth was
      really a performance disaster.  It can even take several hours,
      though we may not hit this in real life with such a huge extent
      number.
      
      One simple way to improve the performance is quite straightforward.
      From the logic of truncate, we can punch the hole from hole_end to
      hole_start, which reduces the overhead of btree operations in a
      significant way, such as tree rotation and moving.
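
As a rough analogy for why the direction matters, consider records kept in a flat sorted array instead of a btree. This is an assumption made purely for illustration, and the demo_* functions are hypothetical. Deleting from the front shifts every remaining record each time, while deleting from the back shifts nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Delete all records starting at the front.  Each deletion shifts the
 * remaining records left by one slot.  Returns the number of records moved.
 */
static size_t demo_delete_front_to_back(int *recs, size_t n)
{
    size_t moved = 0;

    assert(recs != NULL);
    while (n > 0) {
        memmove(recs, recs + 1, (n - 1) * sizeof(recs[0]));
        moved += n - 1;
        n--;
    }
    return moved;
}

/*
 * Delete all records starting at the back.  The last record is dropped
 * in place, so nothing is ever shifted.
 */
static size_t demo_delete_back_to_front(int *recs, size_t n)
{
    size_t moved = 0;

    assert(recs != NULL);
    while (n > 0)
        n--;
    return moved;
}
```

For n records the front-first order moves n(n-1)/2 records in total and the back-first order moves none. A real extent btree behaves differently in detail, but the rotation and moving costs named in the message grow in a similar way when removal starts at the low end.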
      
      Following is the test result when punching a hole from 0 to the end of the
      file on a 1G file.  The 1G file consists of 256k extent records, each record
      covering 4k of data (just one cluster; the cluster size is 4k):
      
      ===========================================================================
       * Original punching-hole mechanism:
      ===========================================================================
      
         I waited 1 hour for its completion, unfortunately it's still ongoing.
      
      ===========================================================================
       * Patched punching-hole mechanism:
      ===========================================================================
      
         real 0m2.518s
         user 0m0.000s
         sys  0m2.445s
      
      That means we've gained an improvement of up to 1000 times in performance in
      this case, whee! It's fairly cool, and it looks like the performance gain will
      grow as the number of extent records grows.
      
      This patch is based on my two previous patches, which optimized the
      truncate code and fixed hole punching to handle CoW.
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      c1631d4a
    • Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing. · e8aec068
      Tristan Ye authored
      Based on the previous patch optimizing truncate, the bugfix for
      refcount trees when punching holes is fairly easy and straightforward,
      since most of the refcounting work we need to account for is already
      done in ocfs2_remove_btree_range().
      
      This patch performs CoW for refcounted extents when a hole is being punched
      whose start or end offset is in the middle of a cluster, which means
      the cluster will be partially zeroed soon.
      
      The patch has been tested and fixes the following bug:
      
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1216
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      e8aec068
    • Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead. · 78f94673
      Tristan Ye authored
      Truncate is just a special case of punching holes (from new i_size to
      end), so we can take advantage of the existing
      ocfs2_remove_btree_range() to reduce the complexity and redundancy in
      alloc.c.  The goal here is to make truncate more generic and
      straightforward.
      
      Several functions only used by ocfs2_commit_truncate() will simply be
      removed.
      
      ocfs2_remove_btree_range() was originally used by the hole punching
      code, which didn't take refcount trees into account (definitely a bug).
      We therefore need to change that func a bit to handle refcount trees.
      It must take the refcount lock, calculate and reserve blocks for
      refcount tree changes, and decrease refcounts at the end.  We replace 
      ocfs2_lock_allocators() here by adding a new func
      ocfs2_reserve_blocks_for_rec_trunc() which accepts some extra blocks to
      reserve.  This will not hurt any other code using
      ocfs2_remove_btree_range() (such as dir truncate and hole punching).
      
      I merged the following steps into one patch since they are
      logically doing one thing, though I know that makes it a little fat
      to review.
      
      1). Remove redundant code used by ocfs2_commit_truncate(), since we're
          moving to ocfs2_remove_btree_range anyway.
      
      2). Add a new func ocfs2_reserve_blocks_for_rec_trunc() for purpose of
          accepting some extra blocks to reserve.
      
      3). Change ocfs2_prepare_refcount_change_for_del() a bit to fit our
          needs.  It's safe to do this since it's only being called by
          truncate.
      
      4). Change ocfs2_remove_btree_range() a bit to take refcount case into
          account.
      
      5). Finally, we change ocfs2_commit_truncate() to call
          ocfs2_remove_btree_range() in a proper way.
      
      The patch has passed normal sanity checks; stress tests
      with heavier workloads are still expected.
      
      Based on this patch, fixing the punching holes bug will be fairly easy.
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      78f94673
  25. 05 May, 2010 1 commit