1. 20 Jul, 2011 3 commits
    • Christoph Hellwig's avatar
      fs: always maintain i_dio_count · df2d6f26
      Christoph Hellwig authored
      Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
      This these filesystems to also protect truncate against direct I/O requests
      by using common code.  Right now the only non-DIO_LOCKING filesystem that
      appears to do so is XFS, which uses an opencoded variant of the i_dio_count
      scheme.
      
      Behaviour doesn't change for filesystems never calling inode_dio_wait.
      For ext4 behaviour changes when using the dioread_nonlock option, which
      previously was missing any protection between truncate and direct I/O reads.
      For ocfs2 that handcrafted i_dio_count manipulations are replaced with
      the common code now enable.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      df2d6f26
    • Christoph Hellwig's avatar
      fs: kill i_alloc_sem · bd5fe6c5
      Christoph Hellwig authored
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and it's write side is always mirrored by
      real exclusion.  It's intended use it to wait for all pending direct I/O
      requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall way
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches non-zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to read zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      bd5fe6c5
    • Christoph Hellwig's avatar
      fs: simplify handling of zero sized reads in __blockdev_direct_IO · f9b5570d
      Christoph Hellwig authored
      Reject zero sized reads as soon as we know our I/O length, and don't
      borther with locks or allocations that might have to be cleaned up
      otherwise.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      f9b5570d
  2. 10 Mar, 2011 2 commits
    • Jens Axboe's avatar
      block: kill off REQ_UNPLUG · 721a9602
      Jens Axboe authored
      With the plugging now being explicitly controlled by the
      submitter, callers need not pass down unplugging hints
      to the block layer. If they want to unplug, it's because they
      manually plugged on their own - in which case, they should just
      unplug at will.
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      721a9602
    • Jens Axboe's avatar
      block: remove per-queue plugging · 7eaceacc
      Jens Axboe authored
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So lets kill off the old plugging along with aops->sync_page().
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      7eaceacc
  3. 20 Jan, 2011 1 commit
  4. 19 Jan, 2011 1 commit
  5. 26 Oct, 2010 1 commit
  6. 09 Sep, 2010 1 commit
    • Jeff Moyer's avatar
      O_DIRECT: fix the splitting up of contiguous I/O · 7a801ac6
      Jeff Moyer authored
      commit c2c6ca41 (direct-io: do not merge logically non-contiguous requests)
      introduced a bug whereby all O_DIRECT I/Os were submitted a page at a time
      to the block layer.  The problem is that the code expected
      dio->block_in_file to correspond to the current page in the dio.  In fact,
      it corresponds to the previous page submitted via submit_page_section.
      This was purely an oversight, as the dio->cur_page_fs_offset field was
      introduced for just this purpose.  This patch simply uses the correct
      variable when calculating whether there is a mismatch between contiguous
      logical blocks and contiguous physical blocks (as described in the
      comments).
      
      I also switched the if conditional following this check to an else if, to
      ensure that we never call dio_bio_submit twice for the same dio (in
      theory, this should not happen, anyway).
      
      I've tested this by running blktrace and verifying that a 64KB I/O was
      submitted as a single I/O.  I also ran the patched kernel through
      xfstests' aio tests using xfs, ext4 (with 1k and 4k block sizes) and btrfs
      and verified that there were no regressions as compared to an unpatched
      kernel.
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarJosef Bacik <jbacik@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: <stable@kernel.org>		[2.6.35.x]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7a801ac6
  7. 09 Aug, 2010 1 commit
    • Christoph Hellwig's avatar
      sort out blockdev_direct_IO variants · eafdc7d1
      Christoph Hellwig authored
      Move the call to vmtruncate to get rid of accessive blocks to the callers
      in prepearation of the new truncate calling sequence.  This was only done
      for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
      was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
      its _newtrunc variant while at it as just opencoding the two additional
      paramters is shorted than the name suffix.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      eafdc7d1
  8. 27 Jul, 2010 1 commit
    • Christoph Hellwig's avatar
      direct-io: move aio_complete into ->end_io · 552ef802
      Christoph Hellwig authored
      Filesystems with unwritten extent support must not complete an AIO request
      until the transaction to convert the extent has been commited.  That means
      the aio_complete calls needs to be moved into the ->end_io callback so
      that the filesystem can control when to call it exactly.
      
      This makes a bit of a mess out of dio_complete and the ->end_io callback
      prototype even more complicated. 
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz> 
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      552ef802
  9. 26 Jul, 2010 1 commit
  10. 27 May, 2010 1 commit
    • npiggin@suse.de's avatar
      fs: introduce new truncate sequence · 7bb46a67
      npiggin@suse.de authored
      Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
      setattr > vmtruncate > truncate, have filesystems call their truncate sequence
      from ->setattr if filesystem specific operations are required. vmtruncate is
      deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
      previously should be used.
      
      simple_setattr is introduced for simple in-ram filesystems to implement
      the new truncate sequence. Eventually all filesystems should be converted
      to implement a setattr, and the default code in notify_change should go
      away.
      
      simple_setsize is also introduced to perform just the ATTR_SIZE portion
      of simple_setattr (ie. changing i_size and trimming pagecache).
      
      To implement the new truncate sequence:
      - filesystem specific manipulations (eg freeing blocks) must be done in
        the setattr method rather than ->truncate.
      - vmtruncate can not be used by core code to trim blocks past i_size in
        the event of write failure after allocation, so this must be performed
        in the fs code.
      - convert usage of helpers block_write_begin, nobh_write_begin,
        cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
        variants. These avoid calling vmtruncate to trim blocks (see previous).
      - inode_setattr should not be used. generic_setattr is a new function
        to be used to copy simple attributes into the generic inode.
      - make use of the better opportunity to handle errors with the new sequence.
      
      Big problem with the previous calling sequence: the filesystem is not called
      until i_size has already changed.  This means it is not allowed to fail the
      call, and also it does not know what the previous i_size was. Also, generic
      code calling vmtruncate to truncate allocated blocks in case of error had
      no good way to return a meaningful error (or, for example, atomically handle
      block deallocation).
      
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      7bb46a67
  11. 25 May, 2010 2 commits
    • Josef Bacik's avatar
      direct-io: do not merge logically non-contiguous requests · c2c6ca41
      Josef Bacik authored
      Btrfs cannot handle having logically non-contiguous requests submitted.  For
      example if you have
      
      Logical:  [0-4095][HOLE][8192-12287]
      Physical: [0-4095]      [4096-8191]
      
      Normally the DIO code would put these into the same BIO's.  The problem is we
      need to know exactly what offset is associated with what BIO so we can do our
      checksumming and unlocking properly, so putting them in the same BIO doesn't
      work.  So add another check where we submit the current BIO if the physical
      blocks are not contigous OR the logical blocks are not contiguous.
      Signed-off-by: default avatarJosef Bacik <josef@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      c2c6ca41
    • Josef Bacik's avatar
      direct-io: add a hook for the fs to provide its own submit_bio function · facd07b0
      Josef Bacik authored
      Because BTRFS can do RAID and such, we need our own submit hook so we can setup
      the bio's in the correct fashion, and handle checksum errors properly.  So there
      are a few changes here
      
      1) The submit_io hook.  This is straightforward, just call this instead of
      submit_bio.
      
      2) Allow the fs to return -ENOTBLK for reads.  Usually this has only worked for
      writes, since writes can fallback onto buffered IO.  But BTRFS needs the option
      of falling back on buffered IO if it encounters a compressed extent, since we
      need to read the entire extent in and decompress it.  So if we get -ENOTBLK back
      from get_block we'll return back and fallback on buffered just like the write
      case.
      
      I've tested these changes with fsx and everything seems to work.  Thanks,
      Signed-off-by: default avatarJosef Bacik <josef@redhat.com>
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      facd07b0
  12. 17 Dec, 2009 1 commit
  13. 16 Dec, 2009 3 commits
    • Christoph Hellwig's avatar
      cleanup blockdev_direct_IO locking · 1e431f5c
      Christoph Hellwig authored
      Currently the locking in blockdev_direct_IO is a mess, we have three different
      locking types and very confusing checks for some of them.  The most
      complicated one is DIO_OWN_LOCKING for reads, which happens to not actually be
      used.
      
      This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read case
      is unused anyway, and the write side is almost identical to DIO_NO_LOCKING.
      The difference is that DIO_NO_LOCKING always sets the create argument for
      the get_blocks callback to zero, but we can easily move that to the actual
      get_blocks callbacks.  There are four users of the DIO_NO_LOCKING mode:
      gfs already ignores the create argument and thus is fine with the new
      version, ocfs2 only errors out if create were ever set, and we can remove
      this dead code now, the block device code only ever uses create for an
      error message if we are fully beyond the device which can never happen,
      and last but not least XFS will need the new behavour for writes.
      
      Now we can replace the lock_type variable with a flags one, where no flag
      means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
      flag.  Separate out the check for not allowing to fill holes into a separate
      flag, although for now both flags always get set at the same time.
      
      Also revamp the documentation of the locking scheme to actually make sense.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      1e431f5c
    • Christoph Hellwig's avatar
      direct-io: cleanup blockdev_direct_IO locking · 5fe878ae
      Christoph Hellwig authored
      Currently the locking in blockdev_direct_IO is a mess, we have three
      different locking types and very confusing checks for some of them.  The
      most complicated one is DIO_OWN_LOCKING for reads, which happens to not
      actually be used.
      
      This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read
      case is unused anyway, and the write side is almost identical to
      DIO_NO_LOCKING.  The difference is that DIO_NO_LOCKING always sets the
      create argument for the get_blocks callback to zero, but we can easily
      move that to the actual get_blocks callbacks.  There are four users of the
      DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is
      fine with the new version, ocfs2 only errors out if create were ever set,
      and we can remove this dead code now, the block device code only ever uses
      create for an error message if we are fully beyond the device which can
      never happen, and last but not least XFS will need the new behavour for
      writes.
      
      Now we can replace the lock_type variable with a flags one, where no flag
      means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
      flag.  Separate out the check for not allowing to fill holes into a
      separate flag, although for now both flags always get set at the same
      time.
      
      Also revamp the documentation of the locking scheme to actually make
      sense.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5fe878ae
    • Jeff Moyer's avatar
      dio: don't zero out the pages array inside struct dio · 23aee091
      Jeff Moyer authored
      Intel reported a performance regression caused by the following commit:
      
      commit 848c4dd5
      Author: Zach Brown <zach.brown@oracle.com>
      Date:   Mon Aug 20 17:12:01 2007 -0700
      
          dio: zero struct dio with kzalloc instead of manually
      
          This patch uses kzalloc to zero all of struct dio rather than
          manually trying to track which fields we rely on being zero.  It
          passed aio+dio stress testing and some bug regression testing on
          ext3.
      
          This patch was introduced by Linus in the conversation that lead up
          to Badari's minimal fix to manually zero .map_bh.b_state in commit:
      
            6a648fa7
      
          It makes the code a bit smaller.  Maybe a couple fewer cachelines to
          load, if we're lucky:
      
             text    data     bss     dec     hex filename
          3285925  568506 1304616 5159047  4eb887 vmlinux
          3285797  568506 1304616 5158919  4eb807 vmlinux.patched
      
          I was unable to measure a stable difference in the number of cpu
          cycles spent in blockdev_direct_IO() when pushing aio+dio 256K reads
          at ~340MB/s.
      
          So the resulting intent of the patch isn't a performance gain but to
          avoid exposing ourselves to the risk of finding another field like
          .map_bh.b_state where we rely on zeroing but don't enforce it in the
          code.
      
      Zach surmised that zeroing out the page array was what caused most of
      the problem, and suggested the approach taken in the attached patch for
      resolving the issue.  Intel re-tested with this patch and saw a 0.6%
      performance gain (the original regression was 0.5%).
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarZach Brown <zach.brown@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      23aee091
  14. 26 Nov, 2009 1 commit
    • Vivek Goyal's avatar
      Fix regression in direct writes performance due to WRITE_ODIRECT flag removal · d9449ce3
      Vivek Goyal authored
      There seems to be a regression in direct write path due to following
      commit in for-2.6.33 branch of block tree.
      
      commit 1af60fbd
      Author: Jeff Moyer <jmoyer@redhat.com>
      Date:   Fri Oct 2 18:56:53 2009 -0400
      
          block: get rid of the WRITE_ODIRECT flag
      
      Marking direct writes as WRITE_SYNC_PLUG instead of WRITE_ODIRECT, sets
      the NOIDLE flag in bio and hence in request. This tells CFQ to not expect
      more request from the queue and not idle on it (despite the fact that
      queue's think time is less and it is not seeky).
      
      So direct writers lose big time when competing with sequential readers.
      
      Using fio, I have run one direct writer and two sequential readers and
      following are the results with 2.6.32-rc7 kernel and with for-2.6.33
      branch.
      
      Test
      ====
      1 direct writer and 2 sequential reader running simultaneously.
      
      [global]
      directory=/mnt/sdc/fio/
      runtime=10
      
      [seqwrite]
      rw=write
      size=4G
      direct=1
      
      [seqread]
      rw=read
      size=2G
      numjobs=2
      
      2.6.32-rc7
      ==========
      direct writes: aggrb=2,968KB/s
      readers	     : aggrb=101MB/s
      
      for-2.6.33 branch
      =================
      direct write: aggrb=19KB/s
      readers	      aggrb=137MB/s
      
      This patch brings back the WRITE_ODIRECT flag, with the difference that we
      don't set the BIO_RW_UNPLUG flag so that device is not unplugged after
      submission of request and an explicit unplug from submitter is required.
      
      That way we fix the jeff's issue of not enough merging taking place in aio
      path as well as make sure direct writes get their fair share.
      
      After the fix
      =============
      for-2.6.33 + fix
      ----------------
      direct writes: aggrb=2,728KB/s
      reads: aggrb=103MB/s
      
      Thanks
      Vivek
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      d9449ce3
  15. 28 Oct, 2009 2 commits
    • Jeff Moyer's avatar
      aio: implement request batching · cfb1e33e
      Jeff Moyer authored
      Hi,
      
      Some workloads issue batches of small I/O, and the performance is poor
      due to the call to blk_run_address_space for every single iocb.  Nathan
      Roberts pointed this out, and suggested that by deferring this call
      until all I/Os in the iocb array are submitted to the block layer, we
      can realize some impressive performance gains (up to 30% for sequential
      4k reads in batches of 16).
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      cfb1e33e
    • Jeff Moyer's avatar
      block: get rid of the WRITE_ODIRECT flag · 1af60fbd
      Jeff Moyer authored
      Hi,
      
      The WRITE_ODIRECT flag is only used in one place, and that code path
      happens to also call blk_run_address_space.  The introduction of this
      flag, then, could result in the device being unplugged twice for every
      I/O.
      
      Further, with the batching changes in the next patch, we don't want an
      O_DIRECT write to imply a queue unplug.
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      1af60fbd
  16. 22 May, 2009 1 commit
  17. 15 Apr, 2009 1 commit
  18. 06 Apr, 2009 1 commit
  19. 06 Jan, 2009 1 commit
    • Dmitri Monakhov's avatar
      fs: truncate blocks outside i_size after O_DIRECT write error · 0f64415d
      Dmitri Monakhov authored
      In case of error extending write may have instantiated a few blocks
      outside i_size.  We need to trim these blocks.  We have to do it
      *regardless* to blocksize.  At least ext2, ext3 and reiserfs interpret
      (i_size < biggest block) condition as error.  Fsck will complain about
      wrong i_size.  Then fsck will fix the error by changing i_size according
      to the biggest block.  This is bad because this blocks contain garbage
      from previous write attempt.  And result in data corruption.
      
      ####TESTCASE_BEGIN
      $touch /mnt/test/BIG_FILE
      ## at this moment /mnt/test/BIG_FILE size and blocks equal to zero
      open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
      write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device)
      ## size and block sould't be changed because write op failed.
      $stat /mnt/test/BIG_FILE
      File: `/mnt/test/BIG_FILE'
      Size: 0 Blocks: 110896 IO Block: 1024 regular empty file
      <<<<<<<<^^^^^^^^^^^^^^^^^^^^^^^^^^^^^file size is less than biggest block idx
      Device: fe07h/65031d Inode: 14 Links: 1
      Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
      Access: 2007-01-24 20:03:38.000000000 +0300
      Modify: 2007-01-24 20:03:38.000000000 +0300
      Change: 2007-01-24 20:03:39.000000000 +0300
      
      #fsck.ext3 -f /dev/VG/test
      e2fsck 1.39 (29-May-2006)
      Pass 1: Checking inodes, blocks, and sizes
      Inode 14, i_size is 0, should be 56556544. Fix<y>? yes
      Pass 2: Checking directory structure
      ....
      #####TESTCASE_ENDdiff --git a/fs/direct-io.c b/fs/direct-io.c
      index af0558d..4e88bea 100644
      
      [akpm@linux-foundation.org: use i_size_read()]
      Signed-off-by: default avatarDmitri Monakhov <dmonakhov@openvz.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0f64415d
  20. 16 Oct, 2008 1 commit
  21. 26 Jul, 2008 1 commit
  22. 05 Feb, 2008 1 commit
    • Christoph Lameter's avatar
      Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user · eebd2aa3
      Christoph Lameter authored
      Simplify page cache zeroing of segments of pages through 3 functions
      
      zero_user_segments(page, start1, end1, start2, end2)
      
              Zeros two segments of the page. It takes the position where to
              start and end the zeroing which avoids length calculations and
      	makes code clearer.
      
      zero_user_segment(page, start, end)
      
              Same for a single segment.
      
      zero_user(page, start, length)
      
              Length variant for the case where we know the length.
      
      We remove the zero_user_page macro. Issues:
      
      1. Its a macro. Inline functions are preferable.
      
      2. The KM_USER0 macro is only defined for HIGHMEM.
      
         Having to treat this special case everywhere makes the
         code needlessly complex. The parameter for zeroing is always
         KM_USER0 except in one single case that we open code.
      
      Avoiding KM_USER0 makes a lot of code not having to be dealing
      with the special casing for HIGHMEM anymore. Dealing with
      kmap is only necessary for HIGHMEM configurations. In those
      configurations we use KM_USER0 like we do for a series of other
      functions defined in highmem.h.
      
      Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
      function could not be a macro. zero_user_* functions introduced
      here can be be inline because that constant is not used when these
      functions are called.
      
      Also extract the flushing of the caches to be outside of the kmap.
      
      [akpm@linux-foundation.org: fix nfs and ntfs build]
      [akpm@linux-foundation.org: fix ntfs build some more]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: David Chinner <dgc@sgi.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eebd2aa3
  23. 16 Oct, 2007 1 commit
    • Nick Piggin's avatar
      remove ZERO_PAGE · 557ed1fa
      Nick Piggin authored
      The commit b5810039 contains the note
      
        A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
        (and thus mapcounted and count towards shared rss).  These writes to
        the struct page could cause excessive cacheline bouncing on big
        systems.  There are a number of ways this could be addressed if it is
        an issue.
      
      And indeed this cacheline bouncing has shown up on large SGI systems.
      There was a situation where an Altix system was essentially livelocked
      tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
      This situation can be avoided in userspace, but it does highlight the
      potential scalability problem with refcounting ZERO_PAGE, and corner
      cases where it can really hurt (we don't want the system to livelock!).
      
      There are several broad ways to fix this problem:
      1. add back some special casing to avoid refcounting ZERO_PAGE
      2. per-node or per-cpu ZERO_PAGES
      3. remove the ZERO_PAGE completely
      
      I will argue for 3. The others should also fix the problem, but they
      result in more complex code than does 3, with little or no real benefit
      that I can see.
      
      Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
      false optimisation: if an application is performance critical, it would
      not be doing many read faults of new memory, or at least it could be
      expected to write to that memory soon afterwards. If cache or memory use
      is critical, it should not be working with a significant number of
      ZERO_PAGEs anyway (a more compact representation of zeroes should be
      used).
      
      As a sanity check -- mesuring on my desktop system, there are never many
      mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
      increase much without it.
      
      When running a make -j4 kernel compile on my dual core system, there are
      about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
      ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
      is torn down without being COWed). So removing ZERO_PAGE will save 1,000
      page faults per second when running kbuild, while keeping it only saves
      less than 1 page clearing operation per second. 1 page clear is cheaper
      than a thousand faults, presumably, so there isn't an obvious loss.
      
      Neither the logical argument nor these basic tests give a guarantee of no
      regressions. However, this is a reasonable opportunity to try to remove
      the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
      we can reintroduce it and just avoid refcounting it.
      
      The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
      much use to them except on benchmarks.  All other users of ZERO_PAGE are
      converted just to use ZERO_PAGE(0) for simplicity. We can look at
      replacing them all and maybe ripping out ZERO_PAGE completely when we are
      more satisfied with this solution.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus "snif" Torvalds <torvalds@linux-foundation.org>
      557ed1fa
  24. 10 Oct, 2007 1 commit
  25. 20 Aug, 2007 1 commit
    • Zach Brown's avatar
      dio: zero struct dio with kzalloc instead of manually · 848c4dd5
      Zach Brown authored
      This patch uses kzalloc to zero all of struct dio rather than manually
      trying to track which fields we rely on being zero.  It passed aio+dio
      stress testing and some bug regression testing on ext3.
      
      This patch was introduced by Linus in the conversation that lead up to
      Badari's minimal fix to manually zero .map_bh.b_state in commit:
      
        6a648fa7
      
      It makes the code a bit smaller.  Maybe a couple fewer cachelines to
      load, if we're lucky:
      
         text    data     bss     dec     hex filename
      3285925  568506 1304616 5159047  4eb887 vmlinux
      3285797  568506 1304616 5158919  4eb807 vmlinux.patched
      
      I was unable to measure a stable difference in the number of cpu cycles
      spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
      ~340MB/s.
      
      So the resulting intent of the patch isn't a performance gain but to
      avoid exposing ourselves to the risk of finding another field like
      .map_bh.b_state where we rely on zeroing but don't enforce it in the
      code.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      848c4dd5
  26. 11 Aug, 2007 1 commit
  27. 03 Jul, 2007 1 commit
    • Zach Brown's avatar
      dio: remove bogus refcounting BUG_ON · fcb82f88
      Zach Brown authored
      Badari Pulavarty reported a case of this BUG_ON is triggering during
      testing.  It's completely bogus and should be removed.
      
      It's trying to notice if we left references to the dio hanging around in
      the sync case.  They should have been dropped as IO completed while this
      path was in dio_await_completion().  This condition will also be
      checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
      lines lower.  So to start this BUG_ON() is redundant.
      
      More fatally, it's dereferencing dio-> after having dropped its
      reference.  It's only safe to dereference the dio after releasing the
      lock if the final reference was just dropped.  Another CPU might free
      the dio in bio completion and reuse the memory after this path drops the
      dio lock but before the BUG_ON() is evaluated.
      
      This patch passed aio+dio regression unit tests and aio-stress on ext3.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fcb82f88
  28. 09 May, 2007 1 commit
    • Nate Diller's avatar
      fs: convert core functions to zero_user_page · 01f2705d
      Nate Diller authored
      It's very common for file systems to need to zero part or all of a page,
      the simplist way is just to use kmap_atomic() and memset().  There's
      actually a library function in include/linux/highmem.h that does exactly
      that, but it's confusingly named memclear_highpage_flush(), which is
      descriptive of *how* it does the work rather than what the *purpose* is.
      So this patchset renames the function to zero_user_page(), and calls it
      from the various places that currently open code it.
      
      This first patch introduces the new function call, and converts all the
      core kernel callsites, both the open-coded ones and the old
      memclear_highpage_flush() ones.  Following this patch is a series of
      conversions for each file system individually, per AKPM, and finally a
      patch deprecating the old call.  The diffstat below shows the entire
      patchset.
      
      [akpm@linux-foundation.org: fix a few things]
      Signed-off-by: default avatarNate Diller <nate.diller@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01f2705d
  29. 08 May, 2007 1 commit
  30. 10 Dec, 2006 4 commits
    • Zach Brown's avatar
      [PATCH] dio: lock refcount operations · 5eb6c7a2
      Zach Brown authored
      The wait_for_more_bios() function name was poorly chosen.  While looking to
      clean it up it I noticed that the dio struct refcounting between the bio
      completion and dio submission paths was racey.
      
      The bio submission path was simply freeing the dio struct if
      atomic_dec_and_test() indicated that it dropped the final reference.
      
      The aio bio completion path was dereferencing its dio struct pointer *after
      dropping its reference* based on the remaining number of references.
      
      These two paths could race and result in the aio bio completion path
      dereferencing a freed dio, though this was not observed in the wild.
      
      This moves the refcount under the bio lock so that bio completion can drop
      its reference and decide to wake all in one atomic step.
      
      Once testing and waking is locked dio_await_one() can test its sleeping
      condition and mark itself uninterruptible under the lock.  It gets simpler
      and wait_for_more_bios() disappears.
      
      The addition of the interrupt masking spin lock acquiry in dio_bio_submit()
      looks alarming.  This lock acquiry existed in that path before the recent
      dio completion patch set.  We shouldn't expect significant performance
      regression from returning to the behaviour that existed before the
      completion clean up work.
      
      This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      5eb6c7a2
    • Zach Brown's avatar
      [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED · 8459d86a
      Zach Brown authored
      The only time it is safe to call aio_complete() is when the ->ki_retry
      function returns -EIOCBQUEUED to the AIO core.  direct_io_worker() has
      historically done this by relying on its caller to translate positive return
      codes into -EIOCBQUEUED for the aio case.  It did this by trying to keep
      conditionals in sync.  direct_io_worker() knew when finished_one_bio() was
      going to call aio_complete().  It would reverse the test and wait and free the
      dio in the cases it thought that finished_one_bio() wasn't going to.
      
      Not surprisingly, it ended up getting it wrong.  'ret' could be a negative
      errno from the submission path but it failed to communicate this to
      finished_one_bio().  direct_io_worker() would return < 0, it's callers
      wouldn't raise -EIOCBQUEUED, and aio_complete() would be called.  In the
      future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
      would be called for a second time which can manifest as an oops.
      
      The previous cleanups have whittled the sync and async completion paths down
      to the point where we can collapse them and clearly reassert the invariant
      that we must only call aio_complete() after returning -EIOCBQUEUED.
      direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
      drop the dio refcount and the aio bio completion path will only call
      aio_complete() when it is the last to drop the dio refcount.
      direct_io_worker() can ensure that it is the last to drop the reference count
      by waiting for bios to drain.  It does this for sync ops, of course, and for
      partial dio writes that must fall back to buffered and for aio ops that saw
      errors during submission.
      
      This means that operations that end up waiting, even if they were issued as
      aio ops, will not call aio_complete() from dio.  Instead we return the return
      code of the operation and let the aio core call aio_complete().  This is
      purposely done to fix a bug where AIO DIO file extensions would call
      aio_complete() before their callers have a chance to update i_size.
      
      Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
      no longer have to translate for it.  XFS needs to be careful not to free
      resources that will be used during AIO completion if -EIOCBQUEUED is returned.
       We maintain the previous behaviour of trying to write fs metadata for O_SYNC
      aio+dio writes.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      8459d86a
    • Zach Brown's avatar
      [PATCH] dio: remove duplicate bio wait code · 20258b2b
      Zach Brown authored
      Now that we have a single refcount and waiting path we can reuse it in the
      async 'should_wait' path.  It continues to rely on the fragile link between
      the conditional in dio_complete_aio() which decides to complete the AIO and
      the conditional in direct_io_worker() which decides to wait and free.
      
      By waiting before dropping the reference we stop dio_bio_end_aio() from
      calling dio_complete_aio() which used to wake up the waiter after seeing the
      reference count drop to 0.  We hoist this wake up into dio_bio_end_aio() which
      now notices when it's left a single remaining reference that is held by the
      waiter.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      20258b2b
    • Zach Brown's avatar
      [PATCH] dio: formalize bio counters as a dio reference count · 0273201e
      Zach Brown authored
      Previously we had two confusing counts of bio progress.  'bio_count' was
      decremented as bios were processed and freed by the dio core.  It was used to
      indicate final completion of the dio operation.  'bios_in_flight' reflected
      how many bios were between submit_bio() and bio->end_io.  It was used by the
      sync path to decide when to wake up and finish completing bios and was ignored
      by the async path.
      
      This patch collapses the two notions into one notion of a dio reference count.
       bios hold a dio reference when they're between submit_bio and bio->end_io.
      
      Since bios_in_flight was only used in the sync path it is now equivalent to
      dio->refcount - 1 which accounts for direct_io_worker() holding a reference
      for the duration of the operation.
      
      dio_bio_complete() -> finished_one_bio() was called from the sync path after
      finding bios on the list that the bio->end_io function had deposited.
      finished_one_bio() can not drop the dio reference on behalf of these bios now
      because bio->end_io already has.  The is_async test in finished_one_bio()
      meant that it never actually did anything other than drop the bio_count for
      sync callers.  So we remove its refcount decrement, don't call it from
      dio_bio_complete(), and hoist its call up into the async dio_bio_complete()
      caller after an explicit refcount decrement.  It is renamed dio_complete_aio()
      to reflect the remaining work it actually does.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      0273201e