1. 01 Jul, 2016 1 commit
    • David S. Miller's avatar
      packet: Use symmetric hash for PACKET_FANOUT_HASH. · eb70db87
      David S. Miller authored
      People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that
      they want packets going in both directions on a flow to hash to the
      same bucket.
      The core kernel SKB hash became non-symmetric when the ipv6 flow label
      and other entities were incorporated into the standard flow hash order
      to increase entropy.
      But there are no users of PACKET_FANOUT_HASH who want an assymetric
      hash, they all want a symmetric one.
      Therefore, use the flow dissector to compute a flat symmetric hash
      over only the protocol, addresses and ports.  This hash does not get
      installed into and override the normal skb hash, so this change has
      no effect whatsoever on the rest of the stack.
      Reported-by: default avatarEric Leblond <eric@regit.org>
      Tested-by: default avatarEric Leblond <eric@regit.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  2. 20 May, 2016 2 commits
    • Neil Horman's avatar
      net: suppress warnings on dev_alloc_skb · 95829b3a
      Neil Horman authored
      Noticed an allocation failure in a network driver the other day on a 32 bit
      DMA-API: debugging out of memory - disabling
      bnx2fc: adapter_lookup: hba NULL
      lldpad: page allocation failure. order:0, mode:0x4120
      Pid: 4556, comm: lldpad Not tainted 2.6.32-639.el6.i686.debug #1
      Call Trace:
       [<c08a4086>] ? printk+0x19/0x23
       [<c05166a4>] ? __alloc_pages_nodemask+0x664/0x830
       [<c0649d02>] ? free_object+0x82/0xa0
       [<fb4e2c9b>] ? ixgbe_alloc_rx_buffers+0x10b/0x1d0 [ixgbe]
       [<fb4e2fff>] ? ixgbe_configure_rx_ring+0x29f/0x420 [ixgbe]
       [<fb4e228c>] ? ixgbe_configure_tx_ring+0x15c/0x220 [ixgbe]
       [<fb4e3709>] ? ixgbe_configure+0x589/0xc00 [ixgbe]
       [<fb4e7be7>] ? ixgbe_open+0xa7/0x5c0 [ixgbe]
       [<fb503ce6>] ? ixgbe_init_interrupt_scheme+0x5b6/0x970 [ixgbe]
       [<fb4e8e54>] ? ixgbe_setup_tc+0x1a4/0x260 [ixgbe]
       [<fb505a9f>] ? ixgbe_dcbnl_set_state+0x7f/0x90 [ixgbe]
       [<c088d80d>] ? dcb_doit+0x10ed/0x16d0
      Thought that perhaps the big splat in the logs wasn't really necessecary, as
      all call sites for dev_alloc_skb:
      a) check the return code for the function
      b) either print their own error message or have a recovery path that makes the
      warning moot.
      Fix it by modifying dev_alloc_pages to pass __GFP_NOWARN as a gfp flag to
      suppress the warning
      applies to the net tree
      Signed-off-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Tom Herbert's avatar
      net: define gso types for IPx over IPv4 and IPv6 · 7e13318d
      Tom Herbert authored
      This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
      SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
      NETIF_F_GSO_IPXIP6. These are used to described IP in IP
      tunnel and what the outer protocol is. The inner protocol
      can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
      SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
      are removed (these are both instances of SKB_GSO_IPXIP4).
      SKB_GSO_IPXIP6 will be used when support for GSO with IP
      encapsulation over IPv6 is added.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Acked-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  3. 02 May, 2016 1 commit
    • Eric Dumazet's avatar
      net: relax expensive skb_unclone() in iptunnel_handle_offloads() · 9580bf2e
      Eric Dumazet authored
      Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
      tunnel have to go through an expensive skb_unclone()
      Reallocating skb->head is a lot of work.
      Test should really check if a 'real clone' of the packet was done.
      TCP does not care if the original gso_type is changed while the packet
      travels in the stack.
      This adds skb_header_unclone() which is a variant of skb_clone()
      using skb_header_cloned() check instead of skb_cloned().
      This variant can probably be used from other points.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  4. 28 Apr, 2016 1 commit
  5. 25 Apr, 2016 1 commit
    • Sowmini Varadhan's avatar
      skbuff: Add pskb_extract() helper function · 6fa01ccd
      Sowmini Varadhan authored
      A pattern of skb usage seen in modules such as RDS-TCP is to
      extract `to_copy' bytes from the received TCP segment, starting
      at some offset `off' into a new skb `clone'. This is done in
      the ->data_ready callback, where the clone skb is queued up for rx on
      the PF_RDS socket, while the parent TCP segment is returned unchanged
      back to the TCP engine.
      The existing code uses the sequence
      	clone = skb_clone(..);
      	pskb_pull(clone, off, ..);
      	pskb_trim(clone, to_copy, ..);
      with the intention of discarding the first `off' bytes. However,
      skb_clone() + pskb_pull() implies pksb_expand_head(), which ends
      up doing a redundant memcpy of bytes that will then get discarded
      in __pskb_pull_tail().
      To avoid this inefficiency, this commit adds pskb_extract() that
      creates the clone, and memcpy's only the relevant header/frag/frag_list
      to the start of `clone'. pskb_trim() is then invoked to trim clone
      down to the requested to_copy bytes.
      Signed-off-by: default avatarSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  6. 14 Apr, 2016 2 commits
    • Alexander Duyck's avatar
      GSO: Support partial segmentation offload · 802ab55a
      Alexander Duyck authored
      This patch adds support for something I am referring to as GSO partial.
      The basic idea is that we can support a broader range of devices for
      segmentation if we use fixed outer headers and have the hardware only
      really deal with segmenting the inner header.  The idea behind the naming
      is due to the fact that everything before csum_start will be fixed headers,
      and everything after will be the region that is handled by hardware.
      With the current implementation it allows us to add support for the
      following GSO types with an inner TSO_MANGLEID or TSO6 offload:
      In the case of hardware that already supports tunneling we may be able to
      extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
      hardware can support updating inner IPv4 headers.
      Signed-off-by: default avatarAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Alexander Duyck's avatar
      GSO: Add GSO type for fixed IPv4 ID · cbc53e08
      Alexander Duyck authored
      This patch adds support for TSO using IPv4 headers with a fixed IP ID
      field.  This is meant to allow us to do a lossless GRO in the case of TCP
      flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
      In addition I am adding a feature that for now I am referring to TSO with
      IP ID mangling.  Basically when this flag is enabled the device has the
      option to either output the flow with incrementing IP IDs or with a fixed
      IP ID regardless of what the original IP ID ordering was.  This is useful
      in cases where the DF bit is set and we do not care if the original IP ID
      value is maintained.
      Signed-off-by: default avatarAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  7. 05 Apr, 2016 1 commit
    • samanthakumar's avatar
      udp: enable MSG_PEEK at non-zero offset · 627d2d6b
      samanthakumar authored
      Enable peeking at UDP datagrams at the offset specified with socket
      option SOL_SOCKET/SO_PEEK_OFF. Peek at any datagram in the queue, up
      to the end of the given datagram.
      Implement the SO_PEEK_OFF semantics introduced in commit ef64a54f
      ("sock: Introduce the SO_PEEK_OFF sock option"). Increase the offset
      on peek, decrease it on regular reads.
      When peeking, always checksum the packet immediately, to avoid
      recomputation on subsequent peeks and final read.
      The socket lock is not held for the duration of udp_recvmsg, so
      peek and read operations can run concurrently. Only the last store
      to sk_peek_off is preserved.
      Signed-off-by: default avatarSam Kumar <samanthakumar@google.com>
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  8. 03 Mar, 2016 1 commit
    • Benjamin Poirier's avatar
      mld, igmp: Fix reserved tailroom calculation · 1837b2e2
      Benjamin Poirier authored
      The current reserved_tailroom calculation fails to take hlen and tlen into
      ^                                               ^
      head                                            skb_end_offset
      In this representation, hlen + data + tlen is the size passed to alloc_skb.
      "extra" is the extra space made available in __alloc_skb because of
      rounding up by kmalloc. We can reorder the representation like so:
      ^                                               ^
      head                                            skb_end_offset
      The maximum space available for ip headers and payload without
      fragmentation is min(mtu, data + extra). Therefore,
      = data + extra + tlen - min(mtu, data + extra)
      = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
      = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)
      Compare the second line to the current expression:
      reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
      and we can see that hlen and tlen are not taken into account.
      The min() in the third line can be expanded into:
      if mtu < skb_tailroom - tlen:
      	reserved_tailroom = skb_tailroom - mtu
      	reserved_tailroom = tlen
      Depending on hlen, tlen, mtu and the number of multicast address records,
      the current code may output skbs that have less tailroom than
      dev->needed_tailroom or it may output more skbs than needed because not all
      space available is used.
      Fixes: 4c672e4b
       ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
      Signed-off-by: default avatarBenjamin Poirier <bpoirier@suse.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  9. 01 Mar, 2016 1 commit
  10. 21 Feb, 2016 1 commit
  11. 18 Feb, 2016 1 commit
    • Alexander Duyck's avatar
      net: Optimize local checksum offload · 9e74a6da
      Alexander Duyck authored
      This patch takes advantage of several assumptions we can make about the
      headers of the frame in order to reduce overall processing overhead for
      computing the outer header checksum.
      First we can assume the entire header is in the region pointed to by
      skb->head as this is what csum_start is based on.
      Second, as a result of our first assumption, we can just call csum_partial
      instead of making a call to skb_checksum which would end up having to
      configure things so that we could walk through the frags list.
      Signed-off-by: default avatarAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  12. 12 Feb, 2016 2 commits
  13. 11 Feb, 2016 4 commits
    • Jesper Dangaard Brouer's avatar
      net: bulk free SKBs that were delay free'ed due to IRQ context · 15fad714
      Jesper Dangaard Brouer authored
      The network stack defers SKBs free, in-case free happens in IRQ or
      when IRQs are disabled. This happens in __dev_kfree_skb_irq() that
      writes SKBs that were free'ed during IRQ to the softirq completion
      queue (softnet_data.completion_queue).
      These SKBs are naturally delayed, and cleaned up during NET_TX_SOFTIRQ
      in function net_tx_action().  Take advantage of this a use the skb
      defer and flush API, as we are already in softirq context.
      For modern drivers this rarely happens. Although most drivers do call
      dev_kfree_skb_any(), which detects the situation and calls
      __dev_kfree_skb_irq() when needed.  This due to netpoll can call from
      IRQ context.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Jesper Dangaard Brouer's avatar
      net: bulk free infrastructure for NAPI context, use napi_consume_skb · 795bb1c0
      Jesper Dangaard Brouer authored
      Discovered that network stack were hitting the kmem_cache/SLUB
      slowpath when freeing SKBs.  Doing bulk free with kmem_cache_free_bulk
      can speedup this slowpath.
      NAPI context is a bit special, lets take advantage of that for bulk
      free'ing SKBs.
      In NAPI context we are running in softirq, which gives us certain
      protection.  A softirq can run on several CPUs at once.  BUT the
      important part is a softirq will never preempt another softirq running
      on the same CPU.  This gives us the opportunity to access per-cpu
      variables in softirq context.
      Extend napi_alloc_cache (before only contained page_frag_cache) to be
      a struct with a small array based stack for holding SKBs.  Introduce a
      SKB defer and flush API for accessing this.
      Introduce napi_consume_skb() as replacement for e.g. dev_consume_skb_any()
      when running in NAPI context.  A small trick to handle/detect if we
      are called from netpoll is to see if budget is 0.  In that case, we
      need to invoke dev_consume_skb_irq().
      Joint work with Alexander Duyck.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Alexander Duyck's avatar
      net: Store checksum result for offloaded GSO checksums · 08b64fcc
      Alexander Duyck authored
      This patch makes it so that we can offload the checksums for a packet up
      to a certain point and then begin computing the checksums via software.
      Setting this up is fairly straight forward as all we need to do is reset
      the values stored in csum and csum_start for the GSO context block.
      One complication for this is remote checksum offload.  In order to allow
      the inner checksums to be offloaded while computing the outer checksum
      manually we needed to have some way of indicating that the offload wasn't
      real.  In order to do that I replaced CHECKSUM_PARTIAL with
      CHECKSUM_UNNECESSARY in the case of us computing checksums for the outer
      header while skipping computing checksums for the inner headers.  We clean
      up the ip_summed flag and set it to either CHECKSUM_PARTIAL or
      CHECKSUM_NONE once we hand the packet off to the next lower level.
      Signed-off-by: default avatarAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Alexander Duyck's avatar
      net: Move GSO csum into SKB_GSO_CB · 76443456
      Alexander Duyck authored
      This patch moves the checksum maintained by GSO out of skb->csum and into
      the GSO context block in order to allow for us to work on outer checksums
      while maintaining the inner checksum offsets in the case of the inner
      checksum being offloaded, while the outer checksums will be computed.
      While updating the code I also did a minor cleanu-up on gso_make_checksum.
      The change is mostly to make it so that we store the values and compute the
      checksum instead of computing the checksum and then storing the values we
      needed to update.
      Signed-off-by: default avatarAlexander Duyck <aduyck@mirantis.com>
      Acked-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  14. 09 Feb, 2016 1 commit
  15. 15 Jan, 2016 1 commit
  16. 10 Jan, 2016 1 commit
    • Daniel Borkmann's avatar
      bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions · f8ffad69
      Daniel Borkmann authored
      Add a small helper skb_postpush_rcsum() and fix up redirect locations
      that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
      a proper csum that covers also Ethernet header, f.e. since 2c26d34b
      ("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
      also do skb_postpull_rcsum() after pulling Ethernet header off via
      When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
      I can trigger the following csum issue with an IPv6 setup:
        [  505.144065] dummy1: hw csum failure
        [  505.144108] Call Trace:
        [  505.144112]  <IRQ>  [<ffffffff81372f08>] dump_stack+0x44/0x5c
        [  505.144134]  [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
        [  505.144142]  [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
        [  505.144149]  [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
        [  505.144161]  [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
        [  505.144170]  [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
        [  505.144177]  [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
        [  505.144189]  [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
        [  505.144196]  [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
        [  505.144204]  [<ffffffff8164385d>] nf_iterate+0x5d/0x70
        [  505.144210]  [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
        [  505.144218]  [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
        [  505.144225]  [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
        [  505.144232]  [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
        [  505.144239]  [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
        [  505.144245]  [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
        [  505.144252]  [<ffffffff8160ccff>] process_backlog+0x9f/0x140
        [  505.144259]  [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
      What happens is that on ingress, we push Ethernet header back in, either
      from cls_bpf or right before skb_do_redirect(), but without updating csum.
      The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
      helper for the dev_forward_skb() case to correct the csum diff again.
      Thanks to Hannes Frederic Sowa for the csum_partial() idea!
      Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
      Fixes: 27b29f63
       ("bpf: add bpf_redirect() helper")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  17. 15 Dec, 2015 2 commits
  18. 14 Dec, 2015 1 commit
  19. 06 Dec, 2015 1 commit
    • Rainer Weikusat's avatar
      core: enable more fine-grained datagram reception control · ea3793ee
      Rainer Weikusat authored
      The __skb_recv_datagram routine in core/ datagram.c provides a general
      skb reception factility supposed to be utilized by protocol modules
      providing datagram sockets. It encompasses both the actual recvmsg code
      and a surrounding 'sleep until data is available' loop. This is
      inconvenient if a protocol module has to use additional locking in order
      to maintain some per-socket state the generic datagram socket code is
      unaware of (as the af_unix code does). The patch below moves the recvmsg
      proper code into a new __skb_try_recv_datagram routine which doesn't
      sleep and renames wait_for_more_packets to
      __skb_wait_for_more_packets, both routines being exported interfaces. The
      original __skb_recv_datagram routine is reimplemented on top of these
      two functions such that its user-visible behaviour remains unchanged.
      Signed-off-by: default avatarRainer Weikusat <rweikusat@mobileactivedefense.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  20. 18 Nov, 2015 1 commit
  21. 06 Nov, 2015 1 commit
    • Mel Gorman's avatar
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman authored
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      This patch then converts a number of sites
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  22. 21 Oct, 2015 1 commit
  23. 29 Sep, 2015 1 commit
  24. 24 Sep, 2015 1 commit
    • Pravin B Shelar's avatar
      skbuff: Fix skb checksum flag on skb pull · 6ae459bd
      Pravin B Shelar authored
      VXLAN device can receive skb with checksum partial. But the checksum
      offset could be in outer header which is pulled on receive. This results
      in negative checksum offset for the skb. Such skb can cause the assert
      failure in skb_checksum_help(). Following patch fixes the bug by setting
      checksum-none while pulling outer header.
      Following is the kernel panic msg from old kernel hitting the bug.
      ------------[ cut here ]------------
      kernel BUG at net/core/dev.c:1906!
      RIP: 0010:[<ffffffff81518034>] skb_checksum_help+0x144/0x150
      Call Trace:
      [<ffffffffa0164c28>] queue_userspace_packet+0x408/0x470 [openvswitch]
      [<ffffffffa016614d>] ovs_dp_upcall+0x5d/0x60 [openvswitch]
      [<ffffffffa0166236>] ovs_dp_process_packet_with_key+0xe6/0x100 [openvswitch]
      [<ffffffffa016629b>] ovs_dp_process_received_packet+0x4b/0x80 [openvswitch]
      [<ffffffffa016c51a>] ovs_vport_receive+0x2a/0x30 [openvswitch]
      [<ffffffffa0171383>] vxlan_rcv+0x53/0x60 [openvswitch]
      [<ffffffffa01734cb>] vxlan_udp_encap_recv+0x8b/0xf0 [openvswitch]
      [<ffffffff8157addc>] udp_queue_rcv_skb+0x2dc/0x3b0
      [<ffffffff8157b56f>] __udp4_lib_rcv+0x1cf/0x6c0
      [<ffffffff8157ba7a>] udp_rcv+0x1a/0x20
      [<ffffffff8154fdbd>] ip_local_deliver_finish+0xdd/0x280
      [<ffffffff81550128>] ip_local_deliver+0x88/0x90
      [<ffffffff8154fa7d>] ip_rcv_finish+0x10d/0x370
      [<ffffffff81550365>] ip_rcv+0x235/0x300
      [<ffffffff8151ba1d>] __netif_receive_skb+0x55d/0x620
      [<ffffffff8151c360>] netif_receive_skb+0x80/0x90
      [<ffffffff81459935>] virtnet_poll+0x555/0x6f0
      [<ffffffff8151cd04>] net_rx_action+0x134/0x290
      [<ffffffff810683d8>] __do_softirq+0xa8/0x210
      [<ffffffff8162fe6c>] call_softirq+0x1c/0x30
      [<ffffffff810161a5>] do_softirq+0x65/0xa0
      [<ffffffff810687be>] irq_exit+0x8e/0xb0
      [<ffffffff81630733>] do_IRQ+0x63/0xe0
      [<ffffffff81625f2e>] common_interrupt+0x6e/0x6e
      Reported-by: default avatarAnupam Chanda <achanda@vmware.com>
      Signed-off-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Acked-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  25. 14 Sep, 2015 1 commit
  26. 01 Sep, 2015 6 commits
  27. 21 Aug, 2015 1 commit
    • Michal Hocko's avatar
      mm: make page pfmemalloc check more robust · 2f064f34
      Michal Hocko authored
      Commit c48a11c7 ("netvm: propagate page->pfmemalloc to skb") added
      checks for page->pfmemalloc to __skb_fill_page_desc():
              if (page->pfmemalloc && !page->mapping)
                      skb->pfmemalloc = true;
      It assumes page->mapping == NULL implies that page->pfmemalloc can be
      trusted.  However, __delete_from_page_cache() can set set page->mapping
      to NULL and leave page->index value alone.  Due to being in union, a
      non-zero page->index will be interpreted as true page->pfmemalloc.
      So the assumption is invalid if the networking code can see such a page.
      And it seems it can.  We have encountered this with a NFS over loopback
      setup when such a page is attached to a new skbuf.  There is no copying
      going on in this case so the page confuses __skb_fill_page_desc which
      interprets the index as pfmemalloc flag and the network stack drops
      packets that have been allocated using the reserves unless they are to
      be queued on sockets handling the swapping which is the case here and
      that leads to hangs when the nfs client waits for a response from the
      server which has been dropped and thus never arrive.
      The struct page is already heavily packed so rather than finding another
      hole to put it in, let's do a trick instead.  We can reuse the index
      again but define it to an impossible value (-1UL).  This is the page
      index so it should never see the value that large.  Replace all direct
      users of page->pfmemalloc by page_is_pfmemalloc which will hide this
      nastiness from unspoiled eyes.
      The information will get lost if somebody wants to use page->index
      obviously but that was the case before and the original code expected
      that the information should be persisted somewhere else if that is
      really needed (e.g.  what SLAB and SLUB do).
      [akpm@linux-foundation.org: fix blooper in slub]
      Fixes: c48a11c7
       ("netvm: propagate page->pfmemalloc to skb")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Debugged-by: default avatarVlastimil Babka <vbabka@suse.com>
      Debugged-by: default avatarJiri Bohac <jbohac@suse.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.6+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  28. 10 Aug, 2015 1 commit