1. 08 May, 2007 1 commit
  2. 10 Dec, 2006 7 commits
    • Zach Brown's avatar
      [PATCH] dio: lock refcount operations · 5eb6c7a2
      Zach Brown authored
      The wait_for_more_bios() function name was poorly chosen.  While looking to
      clean it up it I noticed that the dio struct refcounting between the bio
      completion and dio submission paths was racey.
      The bio submission path was simply freeing the dio struct if
      atomic_dec_and_test() indicated that it dropped the final reference.
      The aio bio completion path was dereferencing its dio struct pointer *after
      dropping its reference* based on the remaining number of references.
      These two paths could race and result in the aio bio completion path
      dereferencing a freed dio, though this was not observed in the wild.
      This moves the refcount under the bio lock so that bio completion can drop
      its reference and decide to wake all in one atomic step.
      Once testing and waking is locked dio_await_one() can test its sleeping
      condition and mark itself uninterruptible under the lock.  It gets simpler
      and wait_for_more_bios() disappears.
      The addition of the interrupt masking spin lock acquiry in dio_bio_submit()
      looks alarming.  This lock acquiry existed in that path before the recent
      dio completion patch set.  We shouldn't expect significant performance
      regression from returning to the behaviour that existed before the
      completion clean up work.
      This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Zach Brown's avatar
      [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED · 8459d86a
      Zach Brown authored
      The only time it is safe to call aio_complete() is when the ->ki_retry
      function returns -EIOCBQUEUED to the AIO core.  direct_io_worker() has
      historically done this by relying on its caller to translate positive return
      codes into -EIOCBQUEUED for the aio case.  It did this by trying to keep
      conditionals in sync.  direct_io_worker() knew when finished_one_bio() was
      going to call aio_complete().  It would reverse the test and wait and free the
      dio in the cases it thought that finished_one_bio() wasn't going to.
      Not surprisingly, it ended up getting it wrong.  'ret' could be a negative
      errno from the submission path but it failed to communicate this to
      finished_one_bio().  direct_io_worker() would return < 0, it's callers
      wouldn't raise -EIOCBQUEUED, and aio_complete() would be called.  In the
      future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
      would be called for a second time which can manifest as an oops.
      The previous cleanups have whittled the sync and async completion paths down
      to the point where we can collapse them and clearly reassert the invariant
      that we must only call aio_complete() after returning -EIOCBQUEUED.
      direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
      drop the dio refcount and the aio bio completion path will only call
      aio_complete() when it is the last to drop the dio refcount.
      direct_io_worker() can ensure that it is the last to drop the reference count
      by waiting for bios to drain.  It does this for sync ops, of course, and for
      partial dio writes that must fall back to buffered and for aio ops that saw
      errors during submission.
      This means that operations that end up waiting, even if they were issued as
      aio ops, will not call aio_complete() from dio.  Instead we return the return
      code of the operation and let the aio core call aio_complete().  This is
      purposely done to fix a bug where AIO DIO file extensions would call
      aio_complete() before their callers have a chance to update i_size.
      Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
      no longer have to translate for it.  XFS needs to be careful not to free
      resources that will be used during AIO completion if -EIOCBQUEUED is returned.
       We maintain the previous behaviour of trying to write fs metadata for O_SYNC
      aio+dio writes.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Zach Brown's avatar
      [PATCH] dio: remove duplicate bio wait code · 20258b2b
      Zach Brown authored
      Now that we have a single refcount and waiting path we can reuse it in the
      async 'should_wait' path.  It continues to rely on the fragile link between
      the conditional in dio_complete_aio() which decides to complete the AIO and
      the conditional in direct_io_worker() which decides to wait and free.
      By waiting before dropping the reference we stop dio_bio_end_aio() from
      calling dio_complete_aio() which used to wake up the waiter after seeing the
      reference count drop to 0.  We hoist this wake up into dio_bio_end_aio() which
      now notices when it's left a single remaining reference that is held by the
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Zach Brown's avatar
      [PATCH] dio: formalize bio counters as a dio reference count · 0273201e
      Zach Brown authored
      Previously we had two confusing counts of bio progress.  'bio_count' was
      decremented as bios were processed and freed by the dio core.  It was used to
      indicate final completion of the dio operation.  'bios_in_flight' reflected
      how many bios were between submit_bio() and bio->end_io.  It was used by the
      sync path to decide when to wake up and finish completing bios and was ignored
      by the async path.
      This patch collapses the two notions into one notion of a dio reference count.
       bios hold a dio reference when they're between submit_bio and bio->end_io.
      Since bios_in_flight was only used in the sync path it is now equivalent to
      dio->refcount - 1 which accounts for direct_io_worker() holding a reference
      for the duration of the operation.
      dio_bio_complete() -> finished_one_bio() was called from the sync path after
      finding bios on the list that the bio->end_io function had deposited.
      finished_one_bio() can not drop the dio reference on behalf of these bios now
      because bio->end_io already has.  The is_async test in finished_one_bio()
      meant that it never actually did anything other than drop the bio_count for
      sync callers.  So we remove its refcount decrement, don't call it from
      dio_bio_complete(), and hoist its call up into the async dio_bio_complete()
      caller after an explicit refcount decrement.  It is renamed dio_complete_aio()
      to reflect the remaining work it actually does.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Zach Brown's avatar
      [PATCH] dio: call blk_run_address_space() once per op · 17a7b1d7
      Zach Brown authored
      We only need to call blk_run_address_space() once after all the bios for the
      direct IO op have been submitted.  This removes the chance of calling
      blk_run_address_space() after spurious wake ups as the sync path waits for
      bios to drain.  It's also one less difference betwen the sync and async paths.
      In the process we remove a redundant dio_bio_submit() that its caller had
      already performed.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Zach Brown's avatar
      [PATCH] dio: centralize completion in dio_complete() · 6d544bb4
      Zach Brown authored
      There have been a lot of bugs recently due to the way direct_io_worker() tries
      to decide how to finish direct IO operations.  In the worst examples it has
      failed to call aio_complete() at all (hang) or called it too many times
      This set of patches cleans up the completion phase with the goal of removing
      the complexity that lead to these bugs.  We end up with one path that
      calculates the result of the operation after all off the bios have completed.
      We decide when to generate a result of the operation using that path based on
      the final release of a refcount on the dio structure.
      I tried to progress towards the final state in steps that were relatively easy
      to understand.  Each step should compile but I only tested the final result of
      having all the patches applied.
      I've tested these on low end PC drives with aio-stress, the direct IO tests I
      could manage to get running in LTP, orasim, and some home-brew functional
      In http://lkml.org/lkml/2006/9/21/103
       IBM reports success with ext2 and ext3
      running DIO LTP tests.  They found that XFS bug which has since been addressed
      in the patch series.
      This patch:
      The mechanics which decide the result of a direct IO operation were duplicated
      in the sync and async paths.
      The async path didn't check page_errors which can manifest as silently
      returning success when the final pointer in an operation faults and its
      matching file region is filled with zeros.
      The sync path and async path differed in whether they passed errors to the
      caller's dio->end_io operation.  The async path was passing errors to it which
      trips an assertion in XFS, though it is apparently harmless.
      This centralizes the completion phase of dio ops in one place.  AIO will now
      return EFAULT consistently and all paths fall back to the previously sync
      behaviour of passing the number of bytes 'transferred' to the dio->end_io
      callback, regardless of errors.
      dio_await_completion() doesn't have to propogate EIO from non-uptodate bios
      now that it's being propogated through dio_complete() via dio->io_error.  This
      lets it return void which simplifies its sole caller.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Andrew Morton's avatar
      [PATCH] io-accounting: direct-io · 98c4d57d
      Andrew Morton authored
      Account for direct-io.
      Cc: Jay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: David Wright <daw@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  3. 03 Jul, 2006 1 commit
  4. 23 Jun, 2006 1 commit
    • Jens Axboe's avatar
      [PATCH] Kill PF_SYNCWRITE flag · b31dc66a
      Jens Axboe authored
      A process flag to indicate whether we are doing sync io is incredibly
      ugly. It also causes performance problems when one does a lot of async
      io and then proceeds to sync it. Part of the io will go out as async,
      and the other part as sync. This causes a disconnect between the
      previously submitted io and the synced io. For io schedulers such as CFQ,
      this will cause us lost merges and suboptimal behaviour in scheduling.
      Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
      the O_DIRECT path just directly indicate that the writes are sync
      by using WRITE_SYNC instead.
      Signed-off-by: default avatarJens Axboe <axboe@suse.de>
  5. 31 Mar, 2006 1 commit
  6. 28 Mar, 2006 1 commit
  7. 26 Mar, 2006 1 commit
  8. 25 Mar, 2006 1 commit
    • Chen, Kenneth W's avatar
      [PATCH] direct-io: bug fix in dio handling write error · 174e27c6
      Chen, Kenneth W authored
      There is a bug in direct-io on propagating write error up to the higher I/O
      layer.  When performing an async ODIRECT write to a block device, if a
      device error occurred (like media error or disk is pulled), the error code
      is only propagated from device driver to the DIO layer.  The error code
      stops at finished_one_bio().  The aysnc write, however, is supposedly have
      a corresponding AIO event with appropriate return code (in this case -EIO).
       Application which waits on the async write event, will hang forever since
      such AIO event is lost forever (if such app did not use the timeout option
      in io_getevents call.  Regardless, an AIO event is lost).
      The discovery of above bug leads to another discovery of potential race
      window with dio->result.  The fundamental problem is that dio->result is
      overloaded with dual use: an indicator of fall back path for partial dio
      write, and an error indicator used in the I/O completion path.  In the
      event of device error, the setting of -EIO to dio->result clashes with
      value used to track partial write that activates the fall back path.
      It was also pointed out that it is impossible to use dio->result to track
      partial write and at the same time to track error returned from device
      driver.  Because direct_io_work can only determines whether it is a partial
      write at the end of io submission and in mid stream of those io submission,
      a return code could be coming back from the driver.  Thus messing up all
      the subsequent logic.
      Proposed fix is to separating out error code returned by the IO completion
      path from partial IO submit tracking.  A new variable is added to dio
      structure specifically to track io error returned in the completion path.
      Signed-off-by: default avatarKen Chen <kenneth.w.chen@intel.com>
      Acked-by: default avatarZach Brown <zach.brown@oracle.com>
      Acked-by: default avatarSuparna Bhattacharya <suparna@in.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  9. 14 Mar, 2006 1 commit
  10. 03 Feb, 2006 1 commit
    • Jeff Moyer's avatar
      [PATCH] fix O_DIRECT read of last block in a sparse file · 35dc8161
      Jeff Moyer authored
      Currently, if you open a file O_DIRECT, truncate it to a size that is not a
      multiple of the disk block size, and then try to read the last block in the
      file, the read will return 0.  The problem is in do_direct_IO, here:
              /* Handle holes */
              if (!buffer_mapped(map_bh)) {
                      char *kaddr;
                      if (dio->block_in_file >=
                              i_size_read(dio->inode)>>blkbits) {
                              /* We hit eof */
                              goto out;
      We shift off any remaining bytes in the final block of the I/O, resulting
      in a 0-sized read.  I've attached a patch that fixes this.  I'm not happy
      about how ugly the math is getting, so suggestions are more than welcome.
      I've tested this with a simple program that performs the steps outlined for
      reproducing the problem above.  Without the patch, we get a 0-sized result
      from read.  With the patch, we get the correct return value from the short
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Mingming Cao <cmm@us.ibm.com>
      Cc: Joel Becker <Joel.Becker@oracle.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  11. 09 Jan, 2006 1 commit
  12. 29 Oct, 2005 1 commit
    • Nick Piggin's avatar
      [PATCH] core remove PageReserved · b5810039
      Nick Piggin authored
      Remove PageReserved() calls from core code by tightening VM_RESERVED
      handling in mm/ to cover PageReserved functionality.
      PageReserved special casing is removed from get_page and put_page.
      All setting and clearing of PageReserved is retained, and it is now flagged
      in the page_alloc checks to help ensure we don't introduce any refcount
      based freeing of Reserved pages.
      MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
      deprecated.  We never completely handled it correctly anyway, and is be
      reintroduced in future if required (Hugh has a proof of concept).
      Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
      arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
      be trivially removed.
      Last real user of PageReserved is swsusp, which uses PageReserved to
      determine whether a struct page points to valid memory or not.  This still
      needs to be addressed (a generic page_is_ram() should work).
      A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
      thus mapcounted and count towards shared rss).  These writes to the struct
      page could cause excessive cacheline bouncing on big systems.  There are a
      number of ways this could be addressed if it is an issue.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Refcount bug fix for filemap_xip.c
      Signed-off-by: default avatarCarsten Otte <cotte@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  13. 24 Jun, 2005 1 commit
  14. 16 Apr, 2005 2 commits
    • Daniel McNeil's avatar
      [PATCH] Direct IO async short read fix · 29504ff3
      Daniel McNeil authored
      The direct I/O code is mapping the read request to the file system block.  If
      the file size was not on a block boundary, the result would show the the read
      reading past EOF.  This was only happening for the AIO case.  The non-AIO case
      truncates the result to match file size (in direct_io_worker).  This patch
      does the same thing for the AIO case, it truncates the result to match the
      file size if the read reads past EOF.
      When I/O completes the result can be truncated to match the file size
      without using i_size_read(), thus the aio result now matches the number of
      bytes read to the end of file.
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Linus Torvalds's avatar
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds authored
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      Let it rip!