1. 21 Dec, 2012 25 commits
    • dm: introduce per_bio_data · c0820cf5
      Mikulas Patocka authored
      
      
      Introduce a field per_bio_data_size in struct dm_target.
      
      Targets can set this field in the constructor. If a target sets this
      field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
      are allocated for each bio submitted to the target. These data can be
      used for any purpose by the target and help us improve performance by
      removing some per-target mempools.
      
      Per-bio data is accessed with dm_per_bio_data. The
      argument data_size must be the same as the value per_bio_data_size in
      dm_target.
      
      If the target has a pointer to per_bio_data, it can get a pointer to
      the bio with dm_bio_from_per_bio_data() function (data_size must be the
      same as the value passed to dm_per_bio_data).
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
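The pointer arithmetic behind per-bio data can be modelled in user space. The following is a minimal sketch, not the kernel implementation: `struct sketch_bio` and every `sketch_*` function are hypothetical names, and it assumes the auxiliary data sits immediately before the bio in a single allocation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct bio; only the layout matters here. */
struct sketch_bio {
    unsigned long sector;
};

/* Round the data size up so the bio that follows stays suitably aligned. */
static size_t sketch_data_offset(size_t data_size)
{
    size_t align = _Alignof(max_align_t);

    return (data_size + align - 1) / align * align;
}

/* Allocate the auxiliary data and the bio as one block: [data][bio]. */
struct sketch_bio *sketch_bio_alloc(size_t data_size)
{
    size_t off = sketch_data_offset(data_size);
    char *base = calloc(1, off + sizeof(struct sketch_bio));

    return base ? (struct sketch_bio *)(base + off) : NULL;
}

/* Model of dm_per_bio_data(): step back from the bio to its data. */
void *sketch_per_bio_data(struct sketch_bio *bio, size_t data_size)
{
    return (char *)bio - sketch_data_offset(data_size);
}

/* Model of dm_bio_from_per_bio_data(): step forward from data to bio. */
struct sketch_bio *sketch_bio_from_data(void *data, size_t data_size)
{
    return (struct sketch_bio *)((char *)data + sketch_data_offset(data_size));
}

/* The block starts at the data pointer, so that is what gets freed. */
void sketch_bio_free(struct sketch_bio *bio, size_t data_size)
{
    free(sketch_per_bio_data(bio, data_size));
}
```

Both conversions only agree when they are given the same `data_size`, which is why the commit message insists that the value match `per_bio_data_size`.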
    • dm kcopyd: add WRITE SAME support to dm_kcopyd_zero · 70d6c400
      Mike Snitzer authored
      
      
      Add WRITE SAME support to dm-io and make it accessible to
      dm_kcopyd_zero().  dm_kcopyd_zero() provides an asynchronous interface
      whereas the blkdev_issue_write_same() interface is synchronous.
      
      WRITE SAME is a SCSI command that can be leveraged for more efficient
      zeroing of a specified logical extent of a device which supports it.
      Only a single zeroed logical block is transferred to the target for each
      WRITE SAME and the target then writes that same block across the
      specified extent.
      
      The dm thin target uses this.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm linear: add WRITE SAME support · 4f0b70b0
      Mike Snitzer authored
      
      
      The linear target can already support WRITE SAME requests so signal
      this by setting num_write_same_requests to 1.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: add WRITE SAME support · 23508a96
      Mike Snitzer authored
      
      
      WRITE SAME bios have a payload that contains a single page.  When
      cloning WRITE SAME bios DM has no need to modify the bi_io_vec
      attributes (and doing so would be detrimental).  DM need only alter the
      start and end of the WRITE SAME bio accordingly.
      
      Rather than duplicate __clone_and_map_discard, factor out a common
      function that is also used by __clone_and_map_write_same.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: prepare to support WRITE SAME · d54eaa5a
      Mike Snitzer authored
      
      
      Allow targets to opt in to WRITE SAME support by setting
      'num_write_same_requests' in the dm_target structure.
      
      A dm device will only advertise WRITE SAME support if all its
      targets and all its underlying devices support it.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
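The "all targets and all underlying devices" rule amounts to a conjunction over the table. A minimal sketch, assuming a simplified model in which each target has exactly one underlying device; `struct sketch_target`, its `dev_max_write_same_sectors` field and the function name are hypothetical and do not mirror the real dm structures:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one target and its underlying device. */
struct sketch_target {
    unsigned num_write_same_requests;    /* non-zero: target opted in */
    unsigned dev_max_write_same_sectors; /* zero: device lacks WRITE SAME */
};

/* Advertise WRITE SAME only when every target and device supports it. */
bool sketch_table_supports_write_same(const struct sketch_target *t, size_t n)
{
    if (n == 0)
        return false; /* assumption: an empty table advertises nothing */

    for (size_t i = 0; i < n; i++) {
        if (t[i].num_write_same_requests == 0)
            return false;
        if (t[i].dev_max_write_same_sectors == 0)
            return false;
    }
    return true;
}
```

A single target that has not opted in, or a single device without support, is enough to withhold the capability for the whole dm device.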
    • dm ioctl: use kmalloc if possible · 9c5091f2
      Mikulas Patocka authored
      
      
      If the parameter buffer is small enough, try to allocate it with kmalloc()
      rather than vmalloc().
      
      vmalloc is noticeably slower than kmalloc because it has to manipulate
      page tables.
      
      In my tests, on PA-RISC this patch speeds up activation 13 times.
      On Opteron this patch speeds up activation by 5%.
      
      This patch introduces a new function free_params() to free the
      parameters and this uses new flags that record whether or not vmalloc()
      was used and whether or not the input buffer must be wiped after use.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm ioctl: remove PF_MEMALLOC · 5023e5cf
      Mikulas Patocka authored
      
      
      When allocating memory for the userspace ioctl data, set some
      appropriate GFP flags directly instead of using PF_MEMALLOC.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm persistent data: improve space map block alloc failure message · 7960123f
      Joe Thornber authored
      
      
      Improve space map error message when unable to allocate a new
      metadata block.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: use DMERR_LIMIT for errors · c397741c
      Mike Snitzer authored
      
      
      Throttle all errors logged from the IO path by dm thin.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm persistent data: use DMERR_LIMIT for errors · 89ddeb8c
      Mike Snitzer authored
      
      
      Nearly all of persistent-data is in the IO path so throttle error
      messages with DMERR_LIMIT to limit the amount logged when
      something has gone wrong.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm block manager: reinstate message when validator fails · a5bd968a
      Mike Snitzer authored
      
      
      Reinstate a useful error message when the block manager buffer validator fails.
      This was mistakenly eliminated when the block manager was converted to use
      dm-bufio.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid: round region_size to power of two · 3a0f9aae
      Jonathan Brassow authored
      
      
      If the user does not supply a bitmap region_size to the dm raid target,
      a reasonable size is computed automatically.  If this is not a power of 2,
      the md code will report an error later.
      
      This patch catches the problem early and rounds the region_size to the
      next power of two.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
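Rounding up to the next power of two can be illustrated with a short loop. This is a generic user-space sketch and not the code from the patch; `sketch_round_up_pow2` is a hypothetical name, and the handling of 0 and of overflow is an assumption made for the example.

```c
#include <limits.h>

/*
 * Round n up to the next power of two; a value that already is a power
 * of two is returned unchanged. Returns 0 when the result would not fit
 * in an unsigned long. In this sketch 0 rounds up to 1.
 */
unsigned long sketch_round_up_pow2(unsigned long n)
{
    unsigned long p = 1;

    if (n > (ULONG_MAX >> 1) + 1)
        return 0;
    while (p < n)
        p <<= 1;
    return p;
}
```

Doing the rounding at the point where the default is computed means the later power-of-two check never sees an invalid value.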
    • dm thin: cleanup dead code · 2aab3850
      Joe Thornber authored
      
      
      Remove unused @data_block parameter from cell_defer.
      Change thin_bio_map to use many returns rather than setting a variable.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: rename cell_defer_except to cell_defer_no_holder · f286ba0e
      Joe Thornber authored
      
      
      Rename cell_defer_except() to cell_defer_no_holder() which describes
      its function more clearly.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm snapshot: optimize track_chunk · 9aa0c0e6
      Mikulas Patocka authored
      
      
      track_chunk is always called with interrupts enabled. Consequently, we
      do not need to save and restore interrupt state in a "flags" variable.
      This patch changes spin_lock_irqsave to spin_lock_irq and
      spin_unlock_irqrestore to spin_unlock_irq.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid: use DM_ENDIO_INCOMPLETE · 19cbbc60
      Mikulas Patocka authored
      
      
      Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid1: remove impossible mempool_alloc error test · 7c27213b
      Mikulas Patocka authored
      
      
      mempool_alloc can't fail if __GFP_WAIT is specified, so the condition
      that tests if read_record is non-NULL is always true.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: emit ignore_discard in status when discards disabled · 018debea
      Mike Snitzer authored
      
      
      If "ignore_discard" is specified when creating the thin pool device then
      discard support is disabled for that device.  The pool device's status
      should reflect this fact rather than stating "no_discard_passdown"
      (which implies discards are enabled but passdown is disabled).
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm persistent data: fix nested btree deletion · e3cbf945
      Joe Thornber authored
      
      
      When deleting nested btrees, the code forgets to delete the innermost
      btree.  The thin-metadata code serendipitously compensates for this by
      claiming there is one extra layer in the tree.
      
      This patch corrects both problems.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: wake worker when discard is prepared · 563af186
      Joe Thornber authored
      
      
      When discards are prepared it is best to directly wake the worker that
      will process them.  The worker will be woken anyway, via periodic
      commit, but there is no reason to not wake_worker here.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: fix race between simultaneous io and discards to same block · e8088073
      Joe Thornber authored
      
      
      There is a race when discard bios and non-discard bios are issued
      simultaneously to the same block.
      
      Discard support is expensive for all thin devices precisely because you
      have to be careful to quiesce the area you're discarding.  DM thin must
      handle this conflicting IO pattern (simultaneous non-discard vs discard)
      even though a sane application shouldn't be issuing such IO.
      
      The race manifests as follows:
      
      1. A non-discard bio is mapped in thin_bio_map.
         This doesn't lock out parallel activity to the same block.
      
      2. A discard bio is issued to the same block as the non-discard bio.
      
      3. The discard bio is locked in a dm_bio_prison_cell in process_discard
         to lock out parallel activity against the same block.
      
      4. The non-discard bio's mapping continues and its all_io_entry is
         incremented so the bio is accounted for in the thin pool's all_io_ds
         which is a dm_deferred_set used to track time locality of non-discard IO.
      
      5. The non-discard bio is finally locked in a dm_bio_prison_cell in
         process_bio.
      
      The race can result in deadlock, leaving the block layer hanging waiting
      for completion of a discard bio that never completes, e.g.:
      
      INFO: task ruby:15354 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      ruby            D ffffffff8160f0e0     0 15354  15314 0x00000000
       ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
       ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
       ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
      Call Trace:
       [<ffffffff814e5a19>] schedule+0x29/0x70
       [<ffffffff814e3d85>] schedule_timeout+0x195/0x220
       [<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
       [<ffffffff814e589e>] wait_for_common+0x11e/0x190
       [<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
       [<ffffffff814e59ed>] wait_for_completion+0x1d/0x20
       [<ffffffff81233289>] blkdev_issue_discard+0x219/0x260
       [<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
       [<ffffffff8119a65c>] block_ioctl+0x3c/0x40
       [<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
       [<ffffffff8119a547>] ? block_llseek+0x67/0xb0
       [<ffffffff811756f1>] sys_ioctl+0xa1/0xb0
       [<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
       [<ffffffff814ef099>] system_call_fastpath+0x16/0x1b
      
      The thinp-test-suite's test_discard_random_sectors reliably hits this
      deadlock on fast SSD storage.
      
      The fix for this race is that the all_io_entry for a bio must be
      incremented whilst the dm_bio_prison_cell is held for the bio's
      associated virtual and physical blocks.  That cell locking wasn't
      occurring early enough in thin_bio_map.  This patch fixes this.
      
      Care is taken to always call the new function inc_all_io_entry() with
      the relevant cells locked, but they are generally unlocked before
      calling issue() to try to avoid holding the cells locked across
      generic_submit_request.
      
      Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
      longer the only thread that will do so.  Because of this we must be sure
      to use cell_defer_except() to release all non-holder entries, that
      were added by the other thread, because they must be deferred.
      
      This patch depends on "dm thin: replace dm_cell_release_singleton with
      cell_defer_except".
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@vger.kernel.org
    • dm thin: replace dm_cell_release_singleton with cell_defer_except · b7ca9c92
      Joe Thornber authored
      
      
      Change existing users of the function dm_cell_release_singleton to share
      cell_defer_except instead, and then remove the now-unused function.
      
      Everywhere that calls dm_cell_release_singleton, the bio in question
      is the holder of the cell.
      
      If there are no non-holder entries in the cell then cell_defer_except
      behaves exactly like dm_cell_release_singleton.  Conversely, if there
      *are* non-holder entries then dm_cell_release_singleton must not be used
      because those entries would need to be deferred.
      
      Consequently, it is safe to replace use of dm_cell_release_singleton
      with cell_defer_except.
      
      This patch is a pre-requisite for "dm thin: fix race between
      simultaneous io and discards to same block".
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: disable WRITE SAME · c1a94672
      Mike Snitzer authored
      
      
      WRITE SAME bios are not yet handled correctly by device-mapper so
      disable their use on device-mapper devices by setting
      max_write_same_sectors to zero.
      
      As an example, a ciphertext device is incompatible because the data
      gets changed according to the location at which it is written and so the
      dm crypt target cannot support it.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Cc: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm ioctl: prevent unsafe change to dm_ioctl data_size · e910d7eb
      Alasdair G Kergon authored
      
      
      Abort dm ioctl processing if userspace changes the data_size parameter
      after we validated it but before we finished copying the data buffer
      from userspace.
      
      The dm ioctl parameters are processed in the following sequence:
       1. ctl_ioctl() calls copy_params();
       2. copy_params() makes a first copy of the fixed-sized portion of the
          userspace parameters into the local variable "tmp";
       3. copy_params() then validates tmp.data_size and allocates a new
          structure big enough to hold the complete data and copies the whole
          userspace buffer there;
       4. ctl_ioctl() reads userspace data the second time and copies the whole
          buffer into the pointer "param";
       5. ctl_ioctl() reads param->data_size without any validation and stores it
          in the variable "input_param_size";
       6. "input_param_size" is further used as the authoritative size of the
          kernel buffer.
      
      The problem is that userspace code could change the contents of user
      memory between steps 2 and 4.  In particular, the data_size parameter
      can be changed to an invalid value after the kernel has validated it.
      This lets userspace force the kernel to access invalid kernel memory.
      
      The fix is to ensure that the size has not changed at step 4.
      
      This patch shouldn't have a security impact because CAP_SYS_ADMIN is
      required to run this code, but it should be fixed anyway.
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
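The double-fetch check described in this commit can be sketched in user space. This is an illustration under stated assumptions, not the dm ioctl code: `memcpy()` stands in for `copy_from_user()`, the `hook` parameter simulates a racing userspace thread, all `sketch_*` names are hypothetical, and the `-EINVAL` return value is an assumption.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-size header at the start of the user buffer. */
struct sketch_hdr {
    uint32_t data_size; /* total size of the buffer, header included */
};

/* Test hook standing in for a racing userspace thread. */
typedef void (*sketch_race_hook)(void *user);

/* Example hook: enlarge data_size after it has been validated. */
void sketch_grow_size(void *user)
{
    uint32_t big = UINT32_C(1) << 20;

    memcpy(user, &big, sizeof(big));
}

/*
 * Copy a variable-sized parameter block, rejecting it if data_size
 * changed between the first and second fetch. memcpy() stands in for
 * copy_from_user(). Returns NULL and sets *err on failure.
 */
struct sketch_hdr *sketch_copy_params(void *user, uint32_t max_size,
                                      sketch_race_hook hook, int *err)
{
    struct sketch_hdr tmp, *param;

    memcpy(&tmp, user, sizeof(tmp)); /* first fetch: header only */
    if (tmp.data_size < sizeof(tmp) || tmp.data_size > max_size) {
        *err = -EINVAL;
        return NULL;
    }

    param = malloc(tmp.data_size);
    if (!param) {
        *err = -ENOMEM;
        return NULL;
    }

    if (hook)
        hook(user); /* simulated concurrent modification */

    memcpy(param, user, tmp.data_size); /* second fetch: whole buffer */
    if (param->data_size != tmp.data_size) {
        free(param); /* the size moved under us: abort */
        *err = -EINVAL;
        return NULL;
    }

    *err = 0;
    return param;
}
```

The key point is that the size used to allocate and copy is the validated `tmp.data_size`; the copied value is only compared against it and never trusted on its own.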
    • dm persistent data: rename node to btree_node · 550929fa
      Mikulas Patocka authored
      
      
      This patch fixes a compilation failure on sparc32 by renaming struct node.
      
      struct node is already defined in include/linux/node.h. On sparc32, it
      happens to be included through other dependencies and persistent-data
      doesn't compile because of conflicting declarations.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
  2. 10 Dec, 2012 6 commits
    • Linux 3.7 · 29594404
      Linus Torvalds authored
    • Input: matrix-keymap - provide proper module license · 55220bb3
      Florian Fainelli authored
      The matrix-keymap module currently lacks a proper module license.
      Add one so that this module does not taint the entire kernel.  This
      issue has been present since commit 1932811f ("Input: matrix-keymap
      - uninline and prepare for device tree support").
      Signed-off-by: Florian Fainelli <florian@openwrt.org>
      CC: stable@vger.kernel.org # v3.5+
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 2c68bc72
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Netlink socket dumping had several missing verifications and checks.
      
          In particular, address comparisons in the request byte code
          interpreter could access past the end of the address in the
          inet_request_sock.
      
          Also, address family and address prefix lengths were not validated
          properly at all.
      
          This means arbitrary applications can read past the end of certain
          kernel data structures.
      
          Fixes from Neal Cardwell.
      
       2) ip_check_defrag() operates in contexts where we're in the process
          of, or about to, input the packet into the real protocols
          (specifically macvlan and AF_PACKET snooping).
      
          Unfortunately, it does a pskb_may_pull() which can modify the
          backing packet data which is not legal if the SKB is shared.  It
          very much can be shared in this context.
      
          Deal with the possibility that the SKB is segmented by using
          skb_copy_bits().
      
          Fix from Johannes Berg based upon a report by Eric Leblond.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        ipv4: ip_check_defrag must not modify skb before unsharing
        inet_diag: validate port comparison byte code to prevent unsafe reads
        inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
        inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
        inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
    • Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damage · caf49191
      Linus Torvalds authored
      This reverts commits a5091539 and d7c3b937.
      
      This is a revert of a revert of a revert.  In addition, it reverts the
      even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the
      original commits in linux-next.
      
      It turns out that the original patch really was bogus, and that the
      original revert was the correct thing to do after all.  We thought we
      had fixed the problem, and then reverted the revert, but the problem
      really is fundamental: waking up kswapd simply isn't the right thing to
      do, and direct reclaim sometimes simply _is_ the right thing to do.
      
      When certain allocations fail, we simply should try some direct reclaim,
      and if that fails, fail the allocation.  That's the right thing to do
      for THP allocations, which can easily fail, and the GPU allocations want
      to do that too.
      
      So starting kswapd is sometimes simply wrong, and removing the flag that
      said "don't start kswapd" was a mistake.  Let's hope we never revisit
      this mistake again - and certainly not this many times ;)
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipv4: ip_check_defrag must not modify skb before unsharing · 1bf3751e
      Johannes Berg authored
      
      
      ip_check_defrag() might be called from af_packet within the
      RX path where shared SKBs are used, so it must not modify
      the input SKB before it has unshared it for defragmentation.
      Use skb_copy_bits() to get the IP header and only pull in
      everything later.
      
      The same is true for the other caller in macvlan as it is
      called from dev->rx_handler which can also get a shared SKB.
      Reported-by: Eric Leblond <eric@regit.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "mm: avoid waking kswapd for THP allocations when compaction is deferred or contended" · 31f8d42d
      Linus Torvalds authored
      This reverts commit 782fd304.
      
      We are going to reinstate the __GFP_NO_KSWAPD flag that has been
      removed, the removal reverted, and then removed again.  That makes this
      commit a pointless fixup for a problem that was caused by the removal of
      the __GFP_NO_KSWAPD flag.
      
      The thing is, we really don't want to wake up kswapd for THP allocations
      (because they fail quite commonly under any kind of memory pressure,
      including when there is tons of memory free), and these patches were
      just trying to fix up the underlying bug: the original removal of
      __GFP_NO_KSWAPD in commit c6543459 ("mm: remove __GFP_NO_KSWAPD")
      was simply bogus.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 09 Dec, 2012 4 commits
    • inet_diag: validate port comparison byte code to prevent unsafe reads · 5e1f5420
      Neal Cardwell authored
      
      
      Add logic to verify that a port comparison byte code operation
      actually has the second inet_diag_bc_op from which we read the port
      for such operations.
      
      Previously the code blindly referenced op[1] without first checking
      whether a second inet_diag_bc_op struct could fit there. So a
      malicious user could make the kernel read 4 bytes beyond the end of
      the bytecode array by claiming to have a whole port comparison byte
      code (2 inet_diag_bc_op structs) when in fact the bytecode was not
      long enough to hold both.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
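The length check this commit describes can be sketched as follows. It is a simplified model rather than the kernel function: `struct sketch_bc_op` is a hypothetical 4-byte stand-in modelled on `inet_diag_bc_op`, and `sketch_port_cmp_fits` is an invented name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-byte bytecode op, modelled on inet_diag_bc_op. */
struct sketch_bc_op {
    uint8_t  code;
    uint8_t  yes;
    uint16_t no;
};

/*
 * A port comparison occupies two consecutive ops: the comparison itself
 * and a follow-on op that carries the port. It is only safe to read
 * op[1] when both fit in the bytes that remain.
 */
bool sketch_port_cmp_fits(size_t remaining_len)
{
    return remaining_len >= 2 * sizeof(struct sketch_bc_op);
}
```

With only one op's worth of bytes left, reading `op[1]` would touch the 4 bytes past the end of the array, which is the unsafe read the commit message refers to.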
    • inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run() · f67caec9
      Neal Cardwell authored
      
      
      Add logic to check the address family of the user-supplied conditional
      and the address family of the connection entry. We now do not do
      prefix matching of addresses from different address families (AF_INET
      vs AF_INET6), except for the previously existing support for having an
      IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
      maintains as-is).
      
      This change is needed for two reasons:
      
      (1) The addresses are different lengths, so comparing a 128-bit IPv6
      prefix match condition to a 32-bit IPv4 connection address can cause
      us to unwittingly walk off the end of the IPv4 address and read
      garbage or oops.
      
      (2) The IPv4 and IPv6 address spaces are semantically distinct, so a
      simple bit-wise comparison of the prefixes is not meaningful, and
      would lead to bogus results (except for the IPv4-mapped IPv6 case,
      which this commit maintains).
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet_diag: validate byte code to prevent oops in inet_diag_bc_run() · 405c0059
      Neal Cardwell authored
      
      
      Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
      operations.
      
      Previously we did not validate the inet_diag_hostcond, address family,
      address length, and prefix length. So a malicious user could make the
      kernel read beyond the end of the bytecode array by claiming to have a
      whole inet_diag_hostcond when the bytecode was not long enough to
      contain a whole inet_diag_hostcond of the given address family. Or
      they could make the kernel read up to about 27 bytes beyond the end of
      a connection address by passing a prefix length that exceeded the
      length of addresses of the given family.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
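The prefix-length part of this validation can be illustrated with a small check. This is a sketch under assumptions, not the kernel's validator: `sketch_prefix_len_ok` is a hypothetical name, and rejecting every family other than AF_INET and AF_INET6 is a simplification made for the example.

```c
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Accept a prefix length only if it does not exceed the number of bits
 * in an address of the given family (32 for IPv4, 128 for IPv6).
 */
bool sketch_prefix_len_ok(int family, unsigned prefix_len)
{
    unsigned addr_bytes;

    switch (family) {
    case AF_INET:
        addr_bytes = 4;
        break;
    case AF_INET6:
        addr_bytes = 16;
        break;
    default:
        return false; /* assumption: unknown families are rejected */
    }
    return prefix_len <= 8 * addr_bytes;
}
```

Rejecting an oversized prefix before the comparison runs keeps the match from reading past the end of the connection address.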
    • inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state · 1c95df85
      Neal Cardwell authored
      
      
      Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
      instantiated for IPv4 traffic and in the SYN-RECV state were actually
      created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
      means that for such connections inet6_rsk(req) returns a pointer to a
      random spot in memory up to roughly 64KB beyond the end of the
      request_sock.
      
      With this bug, for a server using AF_INET6 TCP sockets and serving
      IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
      inet_diag_fill_req() causing an oops or the export to user space of 16
      bytes of kernel memory as a garbage IPv6 address, depending on where
      the garbage inet6_rsk(req) pointed.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 08 Dec, 2012 2 commits
    • mm: vmscan: fix inappropriate zone congestion clearing · ed23ec4f
      Johannes Weiner authored
      Commit c702418f ("mm: vmscan: do not keep kswapd looping forever due
      to individual uncompactable zones") removed zone watermark checks from
      the compaction code in kswapd but left in the zone congestion clearing,
      which now happens unconditionally on higher order reclaim.
      
      This messes up the reclaim throttling logic for zones with
      dirty/writeback pages, where zones should only lose their congestion
      status when their watermarks have been restored.
      
      Remove the clearing from the zone compaction section entirely.  The
      preliminary zone check and the reclaim loop in kswapd will clear it if
      the zone is considered balanced.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: fix O_DIRECT read past end of block device · 684c9aae
      Linus Torvalds authored
      The direct-IO write path already had the i_size checks in mm/filemap.c,
      but it turns out the read path did not, and removing the block size
      checks in fs/block_dev.c (commit bbec0270: "blkdev_max_block: make
      private to fs/buffer.c") removed the magic "shrink IO to past the end of
      the device" code there.
      
      Fix it by truncating the IO to the size of the block device, like the
      write path already does.
      
      NOTE! I suspect the write path would be *much* better off doing it this
      way in fs/block_dev.c, rather than hidden deep in mm/filemap.c.  The
      mm/filemap.c code is extremely hard to follow, and has various
      conditionals on the target being a block device (ie the flag passed in
      to 'generic_write_checks()', along with a conditional update of the
      inode timestamp etc).
      
      It is also quite possible that we should treat this whole block device
      size as a "s_maxbytes" issue, and try to make the logic even more
      generic.  However, in the meantime this is the fairly minimal targeted
      fix.
      
      Noted by Milan Broz thanks to a regression test for the cryptsetup
      reencrypt tool.
      Reported-and-tested-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
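Truncating a read to the size of the device boils down to a clamp on the requested length. A minimal sketch with a hypothetical helper name, showing only the arithmetic and none of the iovec handling the real read path needs:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many of the requested len bytes may be read starting at
 * pos on a device of dev_size bytes. Reads at or past the end get 0.
 */
size_t sketch_clamp_read(uint64_t pos, size_t len, uint64_t dev_size)
{
    uint64_t left;

    if (pos >= dev_size)
        return 0;
    left = dev_size - pos;
    return len < left ? len : (size_t)left;
}
```

A read that straddles the end of the device is shortened to the bytes that exist, and a read that starts at or beyond the end returns nothing.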
  5. 07 Dec, 2012 3 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 1b3c393c
      Linus Torvalds authored
      Pull networking fixes from David Miller:
       "Two stragglers:
      
         1) The new code that adds new flushing semantics to GRO can cause SKB
            pointer list corruption.  Manage the lists differently to avoid the
            OOPS.  Fix from Eric Dumazet.
      
         2) When TCP fast open does a retransmit of data in a SYN-ACK or
            similar, we update retransmit state that we shouldn't, triggering a
            WARN_ON later.  Fix from Yuchung Cheng."
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net: gro: fix possible panic in skb_gro_receive()
        tcp: bug fix Fast Open client retransmission
    • net: gro: fix possible panic in skb_gro_receive() · c3c7c254
      Eric Dumazet authored
      Commit 2e71a6f8 ("net: gro: selective flush of packets") introduced
      a bug for skbs using frag_list. This part of the GRO stack is rarely
      used, as it requires skbs that do not use a page fragment for their skb->head.
      
      Most drivers do use a page fragment, but some of them use GFP_KERNEL
      allocations for the initial fill of their RX ring buffer.
      
      napi_gro_flush() overwrites skb->prev that was used for these skb to
      point to the last skb in frag_list.
      
      Fix this using a separate field in struct napi_gro_cb to point to the
      last fragment.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: bug fix Fast Open client retransmission · 93b174ad
      Yuchung Cheng authored
      
      
      If SYN-ACK partially acks SYN-data, the client retransmits the
      remaining data by tcp_retransmit_skb(). This increments loss recovery
      state variables like tp->retrans_out in Open state. If loss recovery
      happens before the retransmission is acked, it triggers the WARN_ON
      check in tcp_fastretrans_alert(). For example: the client sends
      SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends
      another 4 data packets and gets 3 dupacks.
      
      Since the retransmission is not caused by network drop it should not
      update the recovery state variables. Further, the server may return a
      smaller MSS than the cached MSS used for SYN-data, so the retransmission
      needs a loop. Otherwise some data will not be retransmitted until timeout
      or other loss recovery events.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>