1. 10 Oct, 2007 3 commits
    • New function blk_rq_append_bio · 3001ca77
      NeilBrown authored
      
      
      ll_back_merge_fn is currently exported to SCSI where it is used,
      together with blk_rq_bio_prep, in exactly the same way these
      functions are used in __blk_rq_map_user.
      
      So move the common code into a new function (blk_rq_append_bio), and
      don't export ll_back_merge_fn any longer.
      Signed-off-by: Neil Brown <neilb@suse.de>
      
      diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
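      The shape of the refactor can be shown with a small stand-alone sketch:
      an "append bio" helper that preps an empty request from the bio and
      otherwise attempts a back merge. All types and helpers below are
      simplified stand-ins for illustration, not the real block-layer API.

        #include <stdio.h>
        #include <stdbool.h>

        struct bio { unsigned int size; };
        struct request { struct bio *bio; unsigned int data_len; };

        /* stand-in for the merge checks done by ll_back_merge_fn() */
        static bool try_back_merge(struct request *rq, struct bio *bio)
        {
                return rq->data_len + bio->size <= 65536;   /* pretend limit */
        }

        /* the common code both callers now share (blk_rq_append_bio analogue) */
        static int rq_append_bio(struct request *rq, struct bio *bio)
        {
                if (!rq->bio) {                 /* empty request: prep from bio */
                        rq->bio = bio;
                        rq->data_len = bio->size;
                        return 0;
                }
                if (!try_back_merge(rq, bio))
                        return -1;              /* caller must handle failure */
                rq->data_len += bio->size;      /* bio merged onto the back */
                return 0;
        }

        int main(void)
        {
                struct bio a = { 512 }, b = { 1024 };
                struct request rq = { NULL, 0 };

                rq_append_bio(&rq, &a);
                rq_append_bio(&rq, &b);
                printf("request carries %u bytes\n", rq.data_len);
                return 0;
        }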
    • Introduce rq_for_each_segment replacing rq_for_each_bio · 5705f702
      NeilBrown authored
      
      
      Every usage of rq_for_each_bio wraps a usage of
      bio_for_each_segment, so these can be combined into
      rq_for_each_segment.
      
      We define "struct req_iterator" to hold the 'bio' and 'index' that
      are needed for the double iteration.
      Signed-off-by: Neil Brown <neilb@suse.de>
      
      Various compile fixes by me...
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
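      A rough user-space sketch of the combined iterator follows; the struct
      and macro names mirror the commit, but the bodies are simplified
      stand-ins rather than the kernel's actual definitions.

        #include <stdio.h>

        struct bio_vec { unsigned int bv_len; };
        struct bio {
                struct bio *bi_next;
                unsigned short bi_vcnt;             /* entries used in bi_io_vec */
                struct bio_vec bi_io_vec[4];
        };
        struct request { struct bio *bio; };

        struct req_iterator {
                int i;                              /* index within the current bio */
                struct bio *bio;                    /* current bio of the request   */
        };

        #define rq_for_each_bio(_bio, rq) \
                for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)

        #define bio_for_each_segment(bvl, _bio, _i) \
                for (_i = 0, bvl = &(_bio)->bi_io_vec[0]; \
                     _i < (_bio)->bi_vcnt; _i++, bvl++)

        /* the double loop every caller used to spell out, folded into one macro */
        #define rq_for_each_segment(bvl, rq, iter) \
                rq_for_each_bio(iter.bio, rq) \
                        bio_for_each_segment(bvl, iter.bio, iter.i)

        int main(void)
        {
                struct bio b2 = { NULL, 1, { { 4096 } } };
                struct bio b1 = { &b2, 2, { { 512 }, { 1024 } } };
                struct request rq = { &b1 };
                struct req_iterator iter;
                struct bio_vec *bvec;
                unsigned int total = 0;

                rq_for_each_segment(bvec, &rq, iter)
                        total += bvec->bv_len;

                printf("request spans %u bytes across all segments\n", total);
                return 0;
        }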
    • Merge blk_recount_segments into blk_recalc_rq_segments · 9dfa5283
      NeilBrown authored
      
      
      blk_recalc_rq_segments calls blk_recount_segments on each bio,
      then does some extra calculations to handle segments that overlap
      two bios.
      
      If we merge the code from blk_recount_segments into
      blk_recalc_rq_segments, we can process the whole request one bio_vec
      at a time, and not need the messy cross-bio calculations.
      
      Then blk_recount_segments can be implemented by calling
      blk_recalc_rq_segments, passing it a simple on-stack request which
      stores just the bio.
      Signed-off-by: Neil Brown <neilb@suse.de>
      
      diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
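      The on-stack-request trick can be sketched in isolation: the per-bio
      recount builds a throwaway request around one bio and reuses the
      whole-request recalculation. Simplified types only, not the real code.

        #include <stdio.h>

        struct bio {
                struct bio *bi_next;
                unsigned short bi_vcnt;          /* bio_vecs in this bio */
                unsigned short bi_hw_segments;
        };
        struct request { struct bio *bio; };

        /* walks every bio_vec of every bio in the request, one element at a
         * time; this is where cross-bio merging decisions would live */
        static unsigned short recalc_rq_segments(struct request *rq)
        {
                unsigned short nr = 0;
                struct bio *bio;

                for (bio = rq->bio; bio; bio = bio->bi_next)
                        nr += bio->bi_vcnt;
                return nr;
        }

        /* per-bio recount implemented on top of the request-wide routine */
        static void recount_segments(struct bio *bio)
        {
                struct bio *nxt = bio->bi_next;
                struct request rq = { bio };     /* the simple on-stack request */

                bio->bi_next = NULL;             /* count this bio in isolation */
                bio->bi_hw_segments = recalc_rq_segments(&rq);
                bio->bi_next = nxt;              /* restore the chain */
        }

        int main(void)
        {
                struct bio b = { NULL, 3, 0 };

                recount_segments(&b);
                printf("bio has %d segments\n", b.bi_hw_segments);
                return 0;
        }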
  2. 14 Sep, 2007 1 commit
  3. 13 Sep, 2007 1 commit
    • Fix race with shared tag queue maps · f3da54ba
      Jens Axboe authored
      There's a race condition in blk_queue_end_tag() for shared tag maps,
      users include stex (promise supertrak thingy) and qla2xxx.  The former
      at least has reported bugs in this area, not sure why we haven't seen
      any for the latter.  It could be because the window is narrow and that
      other conditions in the qla2xxx code hide this.  It's a real bug,
      though, as the stex smp users can attest.
      
      We need to ensure two things - the tag bit clearing needs to happen
      AFTER we cleared the tag pointer, as the tag bit clearing/setting is
      what protects this map.  Secondly, we need to ensure that the visibility
      of the tag pointer and tag bit clear are ordered properly.
      
      [ I removed the SMP barriers - "test_and_clear_bit()" already implies
        all the required barriers.  -- Linus ]
      
      Also see http://bugzilla.kernel.org/show_bug.cgi?id=7842
      
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
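      The ordering the fix establishes can be modelled in a few lines of
      user-space C: the per-tag request pointer is dropped before the tag bit
      is returned to the shared map, because a rival CPU may re-win the tag
      the moment the bit is clear. C11 atomics stand in for the kernel's
      bitops; the types are simplified stand-ins.

        #include <stdatomic.h>
        #include <stdio.h>

        struct request { int tag; };

        struct tag_map {
                struct request *tag_index[64];      /* tag -> owning request     */
                atomic_ulong bits;                  /* one bit per in-flight tag */
        };

        static void end_tag(struct tag_map *map, int tag)
        {
                /* 1. drop the reference to the request first ... */
                map->tag_index[tag] = NULL;

                /* 2. ... and only then give the tag back.  The atomic RMW
                 * (like test_and_clear_bit() in the kernel) orders step 1
                 * before any other CPU can win the tag and reuse the slot. */
                atomic_fetch_and(&map->bits, ~(1UL << tag));
        }

        int main(void)
        {
                static struct tag_map map;
                struct request rq = { 3 };

                map.tag_index[rq.tag] = &rq;                /* tag 3 in flight */
                atomic_fetch_or(&map.bits, 1UL << rq.tag);
                end_tag(&map, rq.tag);
                printf("tag %d released, slot now %p\n",
                       rq.tag, (void *)map.tag_index[rq.tag]);
                return 0;
        }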
  4. 11 Aug, 2007 1 commit
  5. 24 Jul, 2007 1 commit
  6. 19 Jul, 2007 1 commit
    • mm: Remove slab destructors from kmem_cache_create(). · 20c2df83
      Paul Mundt authored
      Slab destructors were no longer supported after Christoph's c59def9f
      change.  They've been BUGs for both slab and slub, and slob never
      supported them either.
      
      This rips out support for the dtor pointer from kmem_cache_create()
      completely and fixes up every single callsite in the kernel (there were
      about 224, not including the slab allocator definitions themselves,
      or the documentation references).
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
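      The callsite fixups all follow the same pattern: the trailing destructor
      argument, which was almost always NULL, simply disappears. A
      representative before/after is sketched below; "my_cache", "my_obj" and
      "my_ctor" are hypothetical placeholder names.

        /* before: the sixth argument is the never-used destructor */
        cachep = kmem_cache_create("my_cache", sizeof(struct my_obj),
                                   0, SLAB_HWCACHE_ALIGN, my_ctor, NULL);

        /* after: the dtor parameter is gone from kmem_cache_create() */
        cachep = kmem_cache_create("my_cache", sizeof(struct my_obj),
                                   0, SLAB_HWCACHE_ALIGN, my_ctor);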
  7. 17 Jul, 2007 1 commit
  8. 16 Jul, 2007 4 commits
  9. 10 Jul, 2007 2 commits
    • [BLOCK] drop unnecessary bvec rewinding from flush_dry_bio_endio · f4b09303
      Tejun Heo authored
      
      
      Barrier bios are completed twice - once after the barrier write itself
      is done and again after the whole sequence is complete.
      flush_dry_bio_endio() is for the first completion.  It doesn't really
      complete the bio.  It rewinds bvec and resets bio so that it can be
      completed again when the whole barrier sequence is complete.
      
      The bvec rewinding code has the following problems.
      
      1. The rewinding code is wrong because filesystems may pass a bvec
         with a non-zero bv_offset.
      
      2. The block layer doesn't guarantee anything about the state of
         bvec array on request completion.  bv_offset and len are updated
         iff __end_that_request_first() completes the bvec partially.
      
      Because of #2, #1 doesn't really matter (nobody cares whether the bvec
      is rewound correctly or not).  Then again, by not rewinding at all we
      always give back the same bvec to the caller, since full bvec completion
      doesn't alter bvecs and the final completion is always a full
      completion.
      
      Drop unnecessary rewinding code.
      
      This was spotted by Neil Brown.
      Signed-off-by: Tejun Heo <htejun@gmail.com>
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • blk_hw_contig_segment(): bad segment size checks · 32eef964
      Jens Axboe authored
      
      
      Two bugs in there:
      
      - The virt oversize check should use the current bio hardware back
        size and the next bio front size, not the same bio. Spotted by
        Neil Brown.
      
      - The segment size check should add hw front sizes, not total bio
        sizes. Spotted by James Bottomley.
      Acked-by: James Bottomley <James.Bottomley@SteelEye.com>
      Acked-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
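      Read loosely, the corrected checks look like the fragment below: both the
      virtual-merge oversize test and the segment-size test are fed the sum of
      this bio's hardware back size and the next bio's hardware front size.
      This is only a sketch of the described fix, not the exact patch.

        if (BIOVEC_VIRT_OVERSIZE(bio->bi_hw_back_size + nxt->bi_hw_front_size))
                return 0;       /* virtually merged segment would be too large */
        if (bio->bi_hw_back_size + nxt->bi_hw_front_size > q->max_segment_size)
                return 0;       /* combined hw segment exceeds the queue limit */
        return 1;               /* bios are hw-contiguous and may be merged */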
  10. 15 Jun, 2007 1 commit
    • block: always requeue !fs requests at the front · bc90ba09
      Tejun Heo authored
      SCSI marks internal commands with REQ_PREEMPT and pushes them to the
      front of the request queue using blk_execute_rq().  When entering suspended
      or frozen state, SCSI devices are quiesced using
      scsi_device_quiesce().  In quiesced state, only REQ_PREEMPT requests
      are processed.  This is how SCSI blocks other requests out while
      suspending and resuming.  As all internal commands are pushed at the
      front of the queue, this usually works.
      
      Unfortunately, this interacts badly with ordered requeueing.  To
      preserve request order on requeueing (due to busy device, active EH or
      other failures), requests are sorted according to the ordered sequence on
      requeue if an IO barrier is in progress.
      
      The following sequence deadlocks.
      
      1. An IO barrier sequence is issued.
      
      2. Suspend requested.  Queue is quiesced with part or all of IO
         barrier sequence at the front.
      
      3. During suspending or resuming, SCSI issues internal command which
         gets deferred and requeued for some reason.  As the command is
         issued after the IO barrier in #1, ordered requeueing code puts the
         request after IO barrier sequence.
      
      4. The device is ready to process requests again but is still in
         quiesced state, and the first request of the queue isn't
         REQ_PREEMPT, so command processing deadlocks:
         suspending/resuming waits for the issued request to complete, while
         the request can't be processed until the device is put back into
         the running state by resuming.
      
      This can be fixed by always putting !fs requests at the front when
      requeueing.
      
      The following thread reports this deadlock.
      
        http://thread.gmane.org/gmane.linux.kernel/537473
      
      Signed-off-by: Tejun Heo <htejun@gmail.com>
      Acked-by: David Greaves <david@dgreaves.com>
      Acked-by: Jeff Garzik <jeff@garzik.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
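      A loose pseudocode rendering of the fix's requeue decision; blk_fs_request()
      is the real predicate of that era, while the two requeue helpers named here
      are illustrative, not the block layer's actual functions.

        /* on requeue: only filesystem requests take part in the
         * barrier-preserving ordered insertion; everything else, including
         * SCSI's REQ_PREEMPT internal commands, goes straight back to the
         * head of the queue so a quiesced device still sees it first */
        if (blk_fs_request(rq) && ordered_sequence_in_progress(q))
                requeue_in_ordered_position(q, rq);     /* illustrative */
        else
                requeue_at_front(q, rq);                /* illustrative */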
  11. 15 May, 2007 1 commit
  12. 11 May, 2007 1 commit
    • When stacked block devices are in-use (e.g. md or dm), the recursive calls · d89d8796
      Neil Brown authored
      
      
      to generic_make_request can use up a lot of space, and we would rather they
      didn't.
      
      As generic_make_request is a void function, and as it is generally not
      expected that it will have any effect immediately, it is safe to delay any
      call to generic_make_request until there is sufficient stack space
      available.
      
      As ->bi_next is reserved for the driver to use, it can have no valid value
      when generic_make_request is called, and as __make_request implicitly
      assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
      certain that all callers set it to NULL.  We can therefore safely use
      bi_next to link pending requests together, providing we clear it before
      making the real call.
      
      So, we choose to allow each thread to only be active in one
      generic_make_request at a time.  If a subsequent (recursive) call is made,
      the bio is linked into a per-thread list, and is handled when the active
      call completes.
      
      As the list of pending bios is per-thread, there are no locking issues to
      worry about.
      
      I say above that it is "safe to delay any call...".  There are, however,
      some behaviours of a make_request_fn which would make it unsafe.  These
      include any behaviour that assumes anything will have changed after a
      recursive call to generic_make_request.
      
      These could include:
       - waiting for that call to finish and call its bi_end_io function.
         md used to sometimes do this (marking the superblock dirty before
         completing a write) but doesn't any more
       - inspecting the bio for fields that generic_make_request might
         change, such as bi_sector or bi_bdev.  It is hard to see a good
         reason for this, and I don't think anyone actually does it.
       - inspecting the queue to see if, e.g., it is 'full' yet.  Again, I
         think this is very unlikely to be useful, or to be done.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <dm-devel@redhat.com>
      
      Alasdair G Kergon <agk@redhat.com> said:
      
       I can see nothing wrong with this in principle.
      
       For device-mapper at the moment though it's essential that, while the bio
       mappings may now get delayed, they still get processed in exactly
       the same order as they were passed to generic_make_request().
      
       My main concern is whether the timing changes implicit in this patch
       will make the rare data-corrupting races in the existing snapshot code
       more likely. (I'm working on a fix for these races, but the unfinished
       patch is already several hundred lines long.)
      
       It would be helpful if some people on this mailing list would test
       this patch in various scenarios and report back.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
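      A minimal user-space model of the technique: a recursive submission is
      appended to a per-thread FIFO list and handled when the outermost call
      drains it, so stacking depth no longer grows the stack. The real patch
      threads the list through the submitting task; everything here is a
      simplified stand-in.

        #include <stdio.h>

        struct bio {
                int sector;
                struct bio *bi_next;      /* unused on entry, so reusable as a link */
        };

        /* non-NULL tail means a make_request call is already active here */
        static _Thread_local struct bio *pending_head;
        static _Thread_local struct bio **pending_tail;

        static void remap_or_issue(struct bio *bio);

        static void make_request(struct bio *bio)
        {
                if (pending_tail) {
                        /* recursive call: queue at the tail (FIFO keeps the
                         * original submission order) and return at once */
                        bio->bi_next = NULL;
                        *pending_tail = bio;
                        pending_tail = &bio->bi_next;
                        return;
                }

                /* outermost call: mark ourselves active and drain the list */
                pending_head = NULL;
                pending_tail = &pending_head;
                do {
                        remap_or_issue(bio);      /* may call make_request() */
                        bio = pending_head;
                        if (bio) {
                                pending_head = bio->bi_next;
                                if (!pending_head)
                                        pending_tail = &pending_head;
                                bio->bi_next = NULL;
                        }
                } while (bio);
                pending_tail = NULL;              /* no longer active */
        }

        /* a toy "stacked driver": sectors >= 100 are remapped and resubmitted,
         * which in the old scheme would have nested another stack frame */
        static void remap_or_issue(struct bio *bio)
        {
                if (bio->sector >= 100) {
                        bio->sector -= 100;
                        make_request(bio);
                } else {
                        printf("issued bio at sector %d\n", bio->sector);
                }
        }

        int main(void)
        {
                struct bio b = { 250, NULL };

                make_request(&b);                 /* ends up issued at sector 50 */
                return 0;
        }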
  13. 09 May, 2007 4 commits
  14. 08 May, 2007 1 commit
    • [PATCH] ll_rw_blk: fix missing bounce in blk_rq_map_kern() · 821de3a2
      Mike Christie authored
      
      
      I think we might just need the blk_rq_map_kern users now. For the async
      execute I added the bounce code already, and the block SG_IO has it
      already. I think the blk_rq_map_kern bounce code got dropped because we
      thought the correct gfp_t would be passed in. But I think all we need is
      the patch below and all the paths are taken care of. The patch is not
      tested. Patch was made against scsi-misc.
      
      The last place that is sending non-sg commands may just be md/dm-emc.c,
      but that is just waiting on Alasdair to take some patches that fix
      that and a bunch of junk in there, including adding bounce support. If
      the patch below is ok though and dm-emc finally gets converted, then it
      will have sg and bounce buffer support.
      Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
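      In terms of code the change boils down to one extra call on the
      kernel-buffer mapping path, roughly as below; placement and error
      handling are simplified, so treat this as a sketch of the idea rather
      than the exact patch.

        bio = bio_map_kern(q, kbuf, len, gfp_mask);
        if (IS_ERR(bio))
                return PTR_ERR(bio);

        blk_rq_bio_prep(q, rq, bio);
        blk_queue_bounce(q, &rq->bio);   /* the previously missing bounce step */
        rq->buffer = rq->data = NULL;
        return 0;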
  15. 30 Apr, 2007 1 commit
  16. 17 Apr, 2007 1 commit
    • [SCSI] sg: cap reserved_size values at max_sectors · 44ec9542
      Alan Stern authored
      
      
      This patch (as857) modifies the SG_GET_RESERVED_SIZE and
      SG_SET_RESERVED_SIZE ioctls in the sg driver, capping the values at
      the device's request_queue's max_sectors value.  This will permit
      cdrecord to obtain a legal value for the maximum transfer length,
      fixing Bugzilla #7026.
      
      The patch also caps the initial reserved_size value.  There's no
      reason to have a reserved buffer larger than max_sectors, since it
      would be impossible to use the extra space.
      
      The corresponding ioctls in the block layer are modified similarly,
      and the initial value for the reserved_size is set as large as
      possible.  This will effectively make it default to max_sectors.
      Note that the actual value is meaningless anyway, since block devices
      don't have a reserved buffer.
      
      Finally, the BLKSECTGET ioctl is added to sg, so that there will be a
      uniform way for users to determine the actual max_sectors value for
      any raw SCSI transport.
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Acked-by: Jens Axboe <jens.axboe@oracle.com>
      Acked-by: Douglas Gilbert <dougg@torque.net>
      Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
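      The essence of the cap is simple arithmetic: a requested reserve is
      clamped to what the queue can transfer in one command. A tiny
      illustrative user-space sketch, not the sg driver's exact code:

        #include <stdio.h>

        /* clamp a requested reserve to the queue's max transfer */
        static unsigned int cap_reserved_size(unsigned int requested,
                                              unsigned int max_sectors)
        {
                unsigned int max_bytes = max_sectors * 512;   /* 512-byte sectors */
                return requested < max_bytes ? requested : max_bytes;
        }

        int main(void)
        {
                /* e.g. cdrecord asking for a 4 MB reserve on a queue limited
                 * to 2048 sectors (1 MB) gets 1 MB back */
                printf("%u\n", cap_reserved_size(4 * 1024 * 1024, 2048));
                return 0;
        }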
  17. 27 Mar, 2007 1 commit
    • block: blk_max_pfn is sometimes wrong · f772b3d9
      Vasily Tarasov authored
      
      
      There is a small problem in handling page bounce.
      
      At the moment blk_max_pfn equals max_pfn, which is in fact not the maximum
      possible _number_ of a page frame, but the _amount_ of page frames.  For
      example, for a 32-bit x86 node with 4GB RAM, max_pfn = 0x100000, not
      0xFFFFF.
      
      The request_queue structure has a member q->bounce_pfn, and the queue needs
      bounce pages for pages _above_ this limit.  This is handled by
      blk_queue_bounce(), where the following check is made:
      
      	if (q->bounce_pfn >= blk_max_pfn)
      		return;
      
      Assume that a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn
      equals 0x10000.  In such a situation the check above fails, and for each bio
      we always fall through to iterating over the pages tied to the bio.
      
      I want to notice, that for quite a big range of device drivers (ide, md,
      ...) such problem doesn't happen because they use BLK_BOUNCE_ANY for
      bounce_pfn.  BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, and
      then the check above doesn't fail.  But for other drivers, which obtain the
      required value from the device, it fails.  For example, sata_nv uses
      ATA_DMA_MASK or dev->dma_mask.
      
      I propose to use (max_pfn - 1) for blk_max_pfn.  And the same for
      blk_max_low_pfn.  The patch also cleans up some checks related to
      bounce_pfn.
      Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
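      The off-by-one can be demonstrated with a toy calculation: max_pfn counts
      frames, so the highest valid frame number is max_pfn - 1, and only the
      corrected limit lets a driver covering all of memory skip bouncing.

        #include <stdio.h>

        int main(void)
        {
                unsigned long max_pfn = 0x100000;          /* 4 GB of 4 KB pages  */
                unsigned long driver_bounce_pfn = 0xFFFFF; /* highest valid frame */

                unsigned long blk_max_pfn_old = max_pfn;       /* before the patch */
                unsigned long blk_max_pfn_new = max_pfn - 1;   /* after the patch  */

                /* blk_queue_bounce() bails out early when bounce_pfn covers
                 * all of memory; only the corrected limit triggers that here */
                printf("old check skips bouncing: %d\n",
                       driver_bounce_pfn >= blk_max_pfn_old);  /* 0: always walks pages */
                printf("new check skips bouncing: %d\n",
                       driver_bounce_pfn >= blk_max_pfn_new);  /* 1: early return */
                return 0;
        }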
  18. 09 Feb, 2007 1 commit
    • [PATCH] md: fix various bugs with aligned reads in RAID5 · 387bb173
      Neil Brown authored
      
      
      It is possible for raid5 to be sent a bio that is too big for an underlying
      device.  So if it is a READ that we pass straight down to a device, it will
      fail and confuse RAID5.
      
      So in 'chunk_aligned_read' we check that the bio fits within the parameters
      for the target device and if it doesn't fit, fall back on reading through
      the stripe cache and making lots of one-page requests.
      
      Note that this is the earliest time we can check against the device because
      earlier we don't have a lock on the device, so it could change underneath
      us.
      
      Also, the code for handling a retry through the cache when a read fails has
      not been tested and was badly broken.  This patch fixes that code.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Cc: "Kai" <epimetreus@fastmail.fm>
      Cc: <stable@suse.de>
      Cc: <org@suse.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
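      The fit check being described amounts to comparing the bio against the
      target queue's limits before bypassing the stripe cache. A rough sketch
      follows; the queue fields mirror the limits of that era, but the helper
      itself (and using bi_vcnt as a stand-in for the segment counts) is
      illustrative, not the exact patch.

        static int bio_fits_device(struct request_queue *q, struct bio *bio)
        {
                if ((bio->bi_size >> 9) > q->max_sectors)
                        return 0;       /* larger than one request may be */
                if (bio->bi_vcnt > q->max_phys_segments ||
                    bio->bi_vcnt > q->max_hw_segments)
                        return 0;       /* too many segments for the device */
                return 1;               /* safe to pass straight through */
        }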
  19. 22 Dec, 2006 1 commit
  20. 19 Dec, 2006 5 commits
  21. 13 Dec, 2006 1 commit
  22. 12 Dec, 2006 1 commit
    • [PATCH] remove blk_queue_activity_fn · 2b02a179
      Boaz Harrosh authored
      
      
      While working on bidi support at struct request level
      I have found that blk_queue_activity_fn is actually never used.
      The only user is in ide-probe.c with this code:
      
      	/* enable led activity for disk drives only */
      	if (drive->media == ide_disk && hwif->led_act)
      		blk_queue_activity_fn(q, hwif->led_act, drive);
      
      And led_act is never initialized anywhere.
      (Looking back at older kernels it was used in the PPC arch, but was removed around 2.6.18)
      Unless it is all for future use, of course.
      (This patch is against linux-2.6-block.git as of 2006/12/4.)
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  23. 10 Dec, 2006 1 commit
  24. 08 Dec, 2006 1 commit
    • [PATCH] fault-injection capability for disk IO · c17bb495
      Akinobu Mita authored
      
      
      This patch provides fault-injection capability for disk IO.
      
      Boot option:
      
      fail_make_request=<interval>,<probability>,<space>,<times>
      
      	<interval> -- specifies the interval of failures.
      
      	<probability> -- specifies how often it should fail in percent.
      
      	<space> -- specifies the size of free space where disk IO can be issued
      		   safely in bytes.
      
      	<times> -- specifies how many times failures may happen at most.
      
      Debugfs:
      
      /debug/fail_make_request/interval
      /debug/fail_make_request/probability
      /debug/fail_make_request/space
      /debug/fail_make_request/times
      
      Example:
      
      	fail_make_request=10,100,0,-1
      	echo 1 > /sys/block/hda/hda1/make-it-fail
      
      generic_make_request() on /dev/hda1 fails once per 10 times.
      
      Cc: Jens Axboe <axboe@suse.de>
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
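      How the interval/probability/times knobs interact can be illustrated with
      a toy user-space gate; the real logic lives in the kernel's
      fault-injection framework, and the struct and function below are
      simplified stand-ins only.

        #include <stdio.h>
        #include <stdlib.h>

        struct fault_attr {
                unsigned long interval;     /* fail at most every Nth call   */
                unsigned long probability;  /* chance of failing, in percent */
                long times;                 /* failures allowed, -1 = no cap */
                unsigned long count;        /* calls seen so far             */
        };

        static int should_fail(struct fault_attr *attr)
        {
                attr->count++;
                if (attr->times == 0)
                        return 0;                         /* budget exhausted */
                if (attr->interval > 1 && attr->count % attr->interval)
                        return 0;                         /* not the Nth call */
                if (attr->probability <= (unsigned long)(rand() % 100))
                        return 0;                         /* dice said no */
                if (attr->times > 0)
                        attr->times--;
                return 1;
        }

        int main(void)
        {
                /* roughly "fail_make_request=10,100,0,-1": every 10th request
                 * fails with 100% probability, with no overall limit */
                struct fault_attr attr = { 10, 100, -1, 0 };
                int i, failed = 0;

                for (i = 0; i < 100; i++)
                        failed += should_fail(&attr);
                printf("%d of 100 simulated requests failed\n", failed);
                return 0;
        }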
  25. 07 Dec, 2006 2 commits
  26. 01 Dec, 2006 1 commit