  1. 09 May, 2007 2 commits
  2. 08 May, 2007 2 commits
  3. 07 May, 2007 4 commits
  4. 16 Mar, 2007 1 commit
    • [PATCH] dio: invalidate clean pages before dio write · 65b8291c
      Zach Brown authored
      This patch fixes a user-triggerable oops that was reported by Leonid
      Ananiev as archived at http://lkml.org/lkml/2007/2/8/337
      dio writes invalidate clean pages that intersect the written region so that
      subsequent buffered reads go to disk to read the new data.  If this fails
      the interface tries to tell the caller that the cache is inconsistent by
      returning EIO.
      Before this patch we had the problem where this invalidation failure would
      clobber -EIOCBQUEUED as it made its way from fs/direct-io.c to fs/aio.c.
      Both fs/aio.c and bio completion call aio_complete() and we reference freed
      memory, usually oopsing.
      This patch addresses this problem by invalidating before the write so that
      we can cleanly return -EIO before ->direct_IO() has had a chance to return
      -EIOCBQUEUED.
      There is a compromise here.  During the dio write we can fault in mmap()ed
      pages which intersect the written range with get_user_pages() if the user
      provided them for the source buffer.  This is a crazy thing to do, but we
      can make it mostly work in most cases by trying the invalidation again.
      The compromise is that we won't return an error if this second invalidation
      fails if it's an AIO write and we have -EIOCBQUEUED.
      This was tested by having two processes race performing large O_DIRECT and
      buffered ordered writes.  Within minutes ext3 would see a race between
      ext3_releasepage() and jbd holding a reference on ordered data buffers and
      would cause invalidation to fail, panicking the box.  The test can be found
      in the 'aio_dio_bugs' test group in test.kernel.org/autotest.  After this
      patch the test passes.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Leonid Ananiev <leonid.i.ananiev@linux.intel.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
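The ordering described in this commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `struct dio_model`, its fields and `model_dio_write()` are hypothetical names, the two invalidation passes are reduced to boolean flags, and `EIOCBQUEUED` is defined locally because it is a kernel-internal code that userspace `errno.h` does not provide.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Kernel-internal "request queued" code; defined here for the model only. */
#ifndef EIOCBQUEUED
#define EIOCBQUEUED 529
#endif

/* Hypothetical stand-in for the page cache state and ->direct_IO(). */
struct dio_model {
	bool pre_invalidate_fails;   /* invalidation before the write fails */
	bool post_invalidate_fails;  /* invalidation after the write fails */
	long direct_io_result;       /* bytes written, or -EIOCBQUEUED for AIO */
	int  direct_io_calls;        /* how many times the write was issued */
};

static long model_dio_write(struct dio_model *m)
{
	assert(m != NULL);

	/* Invalidate first: failing here is safe because nothing is queued yet. */
	if (m->pre_invalidate_fails)
		return -EIO;

	m->direct_io_calls++;
	long written = m->direct_io_result;

	/*
	 * Second pass, for pages faulted back in during the write. Its failure
	 * is reported only when the result is not already a negative code, so
	 * -EIOCBQUEUED is never clobbered (the compromise described above).
	 */
	if (m->post_invalidate_fails && written >= 0)
		return -EIO;

	return written;
}
```

In this model a failed first invalidation returns `-EIO` without ever issuing the write, while a failed second invalidation leaves a queued AIO result untouched.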
  5. 16 Feb, 2007 1 commit
    • [PATCH] knfsd: stop NFSD writes from being broken into lots of little writes to filesystem · 29dbb3fc
      NeilBrown authored
      When NFSD receives a write request, the data is typically in a number of
      1448 byte segments and writev is used to collect them together.
      Unfortunately, generic_file_buffered_write passes these to the filesystem
      one at a time, so that e.g. a 32K overwrite becomes a series of partial-page
      writes to each page, causing the filesystem to have to pre-read those pages
      - wasted effort.
      generic_file_buffered_write handles one segment of the vector at a time as
      it has to pre-fault in each segment to avoid deadlocks.  When writing from
      kernel-space (and nfsd does) this is not an issue, so
      generic_file_buffered_write does not need to break an iovec from nfsd into
      little pieces.
      This patch avoids the splitting when get_fs is KERNEL_DS, as it is
      from NFSd.
      This issue was introduced by commit 6527c2bd.
      Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Norman Weathers <norman.r.weathers@conocophillips.com>
      Cc: Vladimir V. Saveliev <vs@namesys.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
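To get a feel for the cost described in this commit message, the following user-space sketch counts how many separate copies a page-at-a-time write loop performs for a 32K write made of 1448-byte segments. It is a model under assumptions, not the kernel loop: `count_copy_chunks()` and `MODEL_PAGE_SIZE` are hypothetical names, the page size is assumed to be 4096 bytes, the write is assumed to start page-aligned, and the `kernel_source` flag stands in for the `get_fs()` being `KERNEL_DS` check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u  /* assumed page size */

/*
 * Count the copies made by a write loop that handles at most one page per
 * iteration, starting at a page-aligned offset.
 *
 * seg_len:       length of every iovec segment (the last may be shorter)
 * total:         total number of bytes to write
 * kernel_source: when false, each copy is also clamped to the current
 *                segment, as needed when user pages must be prefaulted
 */
static size_t count_copy_chunks(size_t seg_len, size_t total, bool kernel_source)
{
	size_t pos = 0, chunks = 0;

	assert(seg_len > 0);
	while (pos < total) {
		/* Never cross a page boundary in one copy. */
		size_t bytes = MODEL_PAGE_SIZE - (pos % MODEL_PAGE_SIZE);

		if (bytes > total - pos)
			bytes = total - pos;
		if (!kernel_source) {
			/* User buffers: stop at the end of the current segment. */
			size_t seg_left = seg_len - (pos % seg_len);

			if (bytes > seg_left)
				bytes = seg_left;
		}
		pos += bytes;
		chunks++;
	}
	return chunks;
}
```

With these assumptions, `count_copy_chunks(1448, 32768, false)` performs 30 copies, most of them partial-page, whereas `count_copy_chunks(1448, 32768, true)` performs 8 full-page copies.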
  6. 11 Feb, 2007 1 commit
  7. 09 Feb, 2007 1 commit
  8. 10 Dec, 2006 1 commit
    • [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED · 8459d86a
      Zach Brown authored
      The only time it is safe to call aio_complete() is when the ->ki_retry
      function returns -EIOCBQUEUED to the AIO core.  direct_io_worker() has
      historically done this by relying on its caller to translate positive return
      codes into -EIOCBQUEUED for the aio case.  It did this by trying to keep
      conditionals in sync.  direct_io_worker() knew when finished_one_bio() was
      going to call aio_complete().  It would reverse the test and wait and free the
      dio in the cases it thought that finished_one_bio() wasn't going to.
      Not surprisingly, it ended up getting it wrong.  'ret' could be a negative
      errno from the submission path but it failed to communicate this to
      finished_one_bio().  direct_io_worker() would return < 0, its callers
      wouldn't raise -EIOCBQUEUED, and aio_complete() would be called.  In the
      future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
      would be called for a second time which can manifest as an oops.
      The previous cleanups have whittled the sync and async completion paths down
      to the point where we can collapse them and clearly reassert the invariant
      that we must only call aio_complete() after returning -EIOCBQUEUED.
      direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
      drop the dio refcount and the aio bio completion path will only call
      aio_complete() when it is the last to drop the dio refcount.
      direct_io_worker() can ensure that it is the last to drop the reference count
      by waiting for bios to drain.  It does this for sync ops, of course, and for
      partial dio writes that must fall back to buffered and for aio ops that saw
      errors during submission.
      This means that operations that end up waiting, even if they were issued as
      aio ops, will not call aio_complete() from dio.  Instead we return the return
      code of the operation and let the aio core call aio_complete().  This is
      purposely done to fix a bug where AIO DIO file extensions would call
      aio_complete() before their callers have a chance to update i_size.
      Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
      no longer have to translate for it.  XFS needs to be careful not to free
      resources that will be used during AIO completion if -EIOCBQUEUED is returned.
      We maintain the previous behaviour of trying to write fs metadata for O_SYNC
      aio+dio writes.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
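The invariant in this commit message (whoever drops the last reference finishes the request, and `-EIOCBQUEUED` is returned only when the submitter is not the last) can be sketched as a single-threaded user-space model. Everything here is illustrative: `struct dio_model`, `dio_put()`, `model_bio_end_io()` and `model_direct_io_worker()` are hypothetical names, the real code is concurrent and uses proper locking, and `EIOCBQUEUED` is defined locally because it is kernel-internal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Kernel-internal "request queued" code; defined here for the model only. */
#ifndef EIOCBQUEUED
#define EIOCBQUEUED 529
#endif

struct dio_model {
	int  refcount;         /* one ref for the submitter, one per in-flight bio */
	long result;           /* return code of the operation */
	int  aio_completions;  /* times the model "called aio_complete()" */
};

/* Drop one reference; true means the caller held the last one. */
static bool dio_put(struct dio_model *dio)
{
	assert(dio != NULL && dio->refcount > 0);
	return --dio->refcount == 0;
}

/* Bio completion path: completes the AIO only when it drops the last ref. */
static void model_bio_end_io(struct dio_model *dio)
{
	if (dio_put(dio))
		dio->aio_completions++;
}

/*
 * Submission path. must_wait models the cases listed above: sync ops,
 * partial writes falling back to buffered, and submission errors.
 */
static long model_direct_io_worker(struct dio_model *dio, bool must_wait)
{
	if (must_wait) {
		/* Drain the bios; the submitter's ref keeps them from being last. */
		while (dio->refcount > 1)
			model_bio_end_io(dio);
	}
	if (dio_put(dio))
		return dio->result;   /* last ref: the caller completes the AIO */
	return -EIOCBQUEUED;          /* bios in flight will complete it later */
}
```

In the waiting case the model never counts a completion from the bio path and simply returns the result; in the queued case exactly one completion is recorded, by whichever bio finishes last.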
  9. 08 Dec, 2006 1 commit
  10. 07 Dec, 2006 1 commit
  11. 01 Dec, 2006 1 commit
  12. 28 Oct, 2006 1 commit
  13. 20 Oct, 2006 2 commits
  14. 19 Oct, 2006 1 commit
  15. 04 Oct, 2006 1 commit
  16. 01 Oct, 2006 3 commits
  17. 30 Sep, 2006 2 commits
    • [PATCH] BLOCK: Make it possible to disable the block layer [try #6] · 9361401e
      David Howells authored
      Make it possible to disable the block layer.  Not all embedded devices require
      it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
      the block layer to be present.
      This patch does the following:
       (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
           support.
       (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
           an item that uses the block layer.  This includes:
           (*) Block I/O tracing.
           (*) Disk partition code.
           (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
           (*) The SCSI layer.  As far as I can tell, even SCSI chardevs use the
               block layer to do scheduling.  Some drivers that use SCSI facilities -
               such as USB storage - end up disabled indirectly from this.
           (*) Various block-based device drivers, such as IDE and the old CDROM
               drivers.
           (*) MTD blockdev handling and FTL.
           (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
               taking a leaf out of JFFS2's book.
       (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
           linux/elevator.h contingent on CONFIG_BLOCK being set.  sector_div() is,
           however, still used in places, and so is still available.
       (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
           parts of linux/fs.h.
       (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
       (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
       (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
           is not enabled.
       (*) fs/no-block.c is created to hold out-of-line stubs and things that are
           required when CONFIG_BLOCK is not set:
           (*) Default blockdev file operations (to give error ENODEV on opening).
       (*) Makes some /proc changes:
           (*) /proc/devices does not list any blockdevs.
           (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
       (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
       (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
           given command other than Q_SYNC or if a special device is specified.
       (*) In init/do_mounts.c, no reference is made to the blockdev routines if
           CONFIG_BLOCK is not defined.  This does not prohibit NFS roots or JFFS2.
       (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
           error ENOSYS by way of cond_syscall if so).
       (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
           CONFIG_BLOCK is not set, since they can't then happen.
      Signed-Off-By: David Howells <dhowells@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
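The stub pattern this commit message describes (real code when CONFIG_BLOCK is set, small stubs returning an error otherwise) might look roughly like the sketch below. The function names `model_blkdev_open()` and `model_quotactl()` are hypothetical and the "real" branches are placeholders. Only the error behaviour without CONFIG_BLOCK follows the description above: `-ENODEV` on opening a blockdev, and `-ENODEV` from the quota call unless the command is Q_SYNC with no special device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Build with -DCONFIG_BLOCK to select the placeholder "real" paths. */
#ifdef CONFIG_BLOCK

static int model_blkdev_open(int minor)
{
	assert(minor >= 0);
	return 0;   /* placeholder for the real open path */
}

static int model_quotactl(bool cmd_is_q_sync, const char *special)
{
	(void)cmd_is_q_sync;
	(void)special;
	return 0;   /* placeholder for the real quota handling */
}

#else /* !CONFIG_BLOCK */

/* Out-of-line style stub: there are no block devices to open. */
static inline int model_blkdev_open(int minor)
{
	(void)minor;
	return -ENODEV;
}

/* Only a Q_SYNC command with no special device can still succeed. */
static inline int model_quotactl(bool cmd_is_q_sync, const char *special)
{
	if (!cmd_is_q_sync || special != NULL)
		return -ENODEV;
	return 0;
}

#endif /* CONFIG_BLOCK */
```

Keeping the same signatures in both branches lets callers compile unchanged whichever way the option is set.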
    • [PATCH] BLOCK: Move functions out of buffer code [try #6] · cf9a2ae8
      David Howells authored
      Move some functions out of the buffering code that aren't strictly buffering
      specific.  This is a precursor to being able to disable the block layer.
       (*) Moved some stuff out of fs/buffer.c:
           (*) The file sync and general sync stuff moved to fs/sync.c.
           (*) The superblock sync stuff moved to fs/super.c.
           (*) do_invalidatepage() moved to mm/truncate.c.
           (*) try_to_release_page() moved to mm/filemap.c.
       (*) Moved some related declarations between header files:
           (*) declarations for do_invalidatepage() and try_to_release_page() moved
               to linux/mm.h.
           (*) __set_page_dirty_buffers() moved to linux/buffer_head.h.
      Signed-Off-By: David Howells <dhowells@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  18. 29 Sep, 2006 1 commit
  19. 27 Sep, 2006 2 commits
  20. 26 Sep, 2006 2 commits
  21. 29 Jul, 2006 1 commit
  22. 25 Jul, 2006 1 commit
    • [GFS2] Alter direct I/O path · a9e5f4d0
      Steven Whitehouse authored
      As per comments received, alter the GFS2 direct I/O path so that
      it uses the standard read functions "out of the box". Needs a
      small change to one of the VFS functions. This reduces the size
      of the code quite a lot and also removes the need for one new export.
      Some more work remains to be done, but this is the bones of the
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  23. 30 Jun, 2006 3 commits
  24. 29 Jun, 2006 1 commit
    • [PATCH] generic_file_buffered_write(): handle zero-length iovec segments · 81b0c871
      Andrew Morton authored
      The recent generic_file_write() deadlock fix caused
      generic_file_buffered_write() to loop infinitely when presented with a
      zero-length iovec segment.  Fix.
      Note that this fix deliberately avoids calling ->prepare_write(),
      ->commit_write() etc with a zero-length write.  This is because I don't trust
      all filesystems to get that right.
      This is a cautious approach, for 2.6.17.x.  For 2.6.18 we should just go ahead
      and call ->prepare_write() and ->commit_write() with the zero length and fix
      any broken filesystems.  So I'll make that change once this code is stabilised
      and backported into 2.6.17.x.
      The reason for preferring to call ->prepare_write() and ->commit_write() with
      the zero-length segment: a zero-length segment _should_ be sufficiently
      uncommon that this is the correct way of handling it.  We don't want to
      optimise for poorly-written userspace at the expense of well-written
      userspace.
      Cc: "Vladimir V. Saveliev" <vs@namesys.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: <stable@kernel.org>
      Cc: walt <wa1ter@myrealbox.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
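The cautious approach in this commit message, stepping past a zero-length segment without issuing a zero-length write, can be sketched in user space as follows. This is an illustration only: `write_iovec()` is a hypothetical helper, a plain `memcpy()` stands in for the prepare/copy/commit sequence, and the caller is assumed to supply a destination buffer of at least `count` bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy up to count bytes from an iovec array into dst and return the number
 * of bytes copied. The loop advances by the bytes copied each iteration, so
 * an empty segment must be skipped explicitly or it would never progress.
 */
static size_t write_iovec(char *dst, const struct iovec *iov, int nr_segs,
			  size_t count)
{
	size_t written = 0, off = 0;
	int seg = 0;

	assert(dst != NULL && iov != NULL);
	while (count > 0 && seg < nr_segs) {
		size_t bytes = iov[seg].iov_len - off;

		if (bytes == 0) {
			/* Empty or exhausted segment: move on, write nothing. */
			seg++;
			off = 0;
			continue;
		}
		if (bytes > count)
			bytes = count;
		memcpy(dst + written, (const char *)iov[seg].iov_base + off, bytes);
		written += bytes;
		off += bytes;
		count -= bytes;
	}
	return written;
}
```

Without the explicit skip, a loop of this shape would keep computing a zero-byte copy for the empty segment and never reach the next one.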
  25. 28 Jun, 2006 1 commit
  26. 27 Jun, 2006 1 commit
    • [PATCH] generic_file_buffered_write(): deadlock on vectored write · 6527c2bd
      Vladimir V. Saveliev authored
      generic_file_buffered_write() prefaults user pages in order to avoid a
      deadlock when copying from the same page that the write goes to.
      However, there appears to be a problem when the write is vectored:
      fault_in_pages_readable() brings in the current segment or part of it
      (maxlen).  OTOH, filemap_copy_from_user_iovec() is called to copy a number
      of bytes (bytes) which may exceed the current segment, so
      filemap_copy_from_user_iovec() switches to the next segment, which has not
      been brought in yet, and a page fault is generated.  That causes the
      deadlock if the page fault is for the same page the write goes to: the
      page being written is locked and not uptodate, so the page fault
      deadlocks trying to lock the already locked page.
      [akpm@osdl.org: somewhat rewritten]
      Cc: Neil Brown <neilb@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  27. 25 Jun, 2006 1 commit
    • [PATCH] readahead: backoff on I/O error · 76d42bd9
      Wu Fengguang authored
      Backoff readahead size exponentially on I/O error.
      Michael Tokarev <mjt@tls.msk.ru> described the problem as:
      Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
      In order to "fix" it, one has to read it and write to another CD-rom,
      or something.. or just ignore the error (if it's just a skip in a video
      stream).  Let's assume the unreadable block is number U.
      But current behavior is just insane.  An application requests block
      number N, which is before U. Kernel tries to read-ahead blocks N..U.
      Cdrom drive tries to read it, re-read it.. for some time.  Finally,
      when all the N..U-1 blocks are read, kernel returns block number N
      (as requested) to an application, successfully.
      Now an app requests block number N+1, and kernel tries to read
      blocks N+1..U+1.  Retrying again as in previous step.
      And so on, up to when an app requests block number U-1.  And when,
      finally, it requests block U, it receives read error.
      So, kernel currently tries to re-read the same failing block as
      many times as the current readahead value (256 (times?) by default).
      This whole process already killed my cdrom drive (I posted about it
      to LKML several months ago) - literally, the drive has fried, and
      does not work anymore.  Of course that problem was a bug in firmware
      (or whatever) of the drive *too*, but.. main problem with that is
      current readahead logic as described above.
      Which was confirmed by Jens Axboe <axboe@suse.de>:
      For ide-cd, it tends to only end the first part of the request on a
      medium error. So you may see a lot of repeats :/
      With this patch, retries are expected to be reduced from, say, 256, to 5.
      [akpm@osdl.org: cleanups]
      Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
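The effect of exponential backoff on the retry count can be checked with a few lines of arithmetic. The sketch below is a model, not the patch itself: the function names are hypothetical, and the shrink factor of 4 is an assumption chosen because it takes a 256-page window to zero in five steps, in line with the "from, say, 256, to 5" estimate above.

```c
#include <assert.h>

/* Shrink the readahead window after an I/O error (factor 4 is an assumption). */
static unsigned long shrink_ra_on_error(unsigned long ra_pages)
{
	return ra_pages / 4;
}

/* Count the failing readahead attempts until the window collapses to zero. */
static unsigned int failures_until_disabled(unsigned long ra_pages)
{
	unsigned int failures = 0;

	while (ra_pages > 0) {
		failures++;
		ra_pages = shrink_ra_on_error(ra_pages);
	}
	return failures;
}
```

Starting from 256 pages the window goes 64, 16, 4, 1, 0, so the failing block is hit a handful of times rather than once per page of readahead.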