1. 11 Jan, 2016 1 commit
  2. 10 Jan, 2016 1 commit
    • Hannes Frederic Sowa's avatar
      udp: restrict offloads to one namespace · 787d7ac3
      Hannes Frederic Sowa authored
      udp tunnel offloads tend to aggregate datagrams based on inner
      headers. gro engine gets notified by tunnel implementations about
      possible offloads. The match is solely based on the port number.
      Imagine a tunnel bound to port 53, the offloading will look into all
      DNS packets and tries to aggregate them based on the inner data found
      within. This could lead to data corruption and malformed DNS packets.
      While this patch minimizes the problem and helps an administrator to find
      the issue by querying ip tunnel/fou, a better way would be to match on
      the specific destination ip address so if a user space socket is bound
      to the same address it will conflict.
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  3. 06 Jan, 2016 1 commit
    • Yuchung Cheng's avatar
      tcp: fix zero cwnd in tcp_cwnd_reduction · 8b8a321f
      Yuchung Cheng authored
      Patch 3759824d ("tcp: PRR uses CRB mode by default and SS mode
      conditionally") introduced a bug that cwnd may become 0 when both
      inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead
      to a div-by-zero if the connection starts another cwnd reduction
      phase by setting tp->prior_cwnd to the current cwnd (0) in
      To prevent this we skip PRR operation when nothing is acked or
      sacked. Then cwnd must be positive in all cases as long as ssthresh
      is positive:
      1) The proportional reduction mode
         inflight > ssthresh > 0
      2) The reduction bound mode
        a) inflight == ssthresh > 0
        b) inflight < ssthresh
           sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh
      Therefore in all cases inflight and sndcnt can not both be 0.
      We check invalid tp->prior_cwnd to avoid potential div0 bugs.
      In reality this bug is triggered only with a sequence of less common
      events.  For example, the connection is terminating an ECN-triggered
      cwnd reduction with an inflight 0, then it receives reordered/old
      ACKs or DSACKs from prior transmission (which acks nothing). Or the
      connection is in fast recovery stage that marks everything lost,
      but fails to retransmit due to local issues, then receives data
      packets from other end which acks nothing.
      Fixes: 3759824d
       ("tcp: PRR uses CRB mode by default and SS mode conditionally")
      Reported-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  4. 04 Jan, 2016 1 commit
    • David Ahern's avatar
      net: Propagate lookup failure in l3mdev_get_saddr to caller · b5bdacf3
      David Ahern authored
      Commands run in a vrf context are not failing as expected on a route lookup:
          root@kenny:~# ip ro ls table vrf-red
          unreachable default
          root@kenny:~# ping -I vrf-red -c1 -w1
          ping: Warning: source address might be selected on device other than vrf-red.
          PING ( from vrf-red: 56(84) bytes of data.
          --- ping statistics ---
          2 packets transmitted, 0 received, 100% packet loss, time 999ms
      Since the vrf table does not have a route for the ping
      should have failed. The saddr lookup causes a full VRF table lookup.
      Propogating a lookup failure to the user allows the command to fail as
          root@kenny:~# ping -I vrf-red -c1 -w1
          connect: No route to host
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  5. 18 Dec, 2015 1 commit
  6. 17 Dec, 2015 1 commit
  7. 16 Dec, 2015 1 commit
  8. 14 Dec, 2015 2 commits
    • Eric Dumazet's avatar
      net: fix IP early demux races · 5037e9ef
      Eric Dumazet authored
      David Wilder reported crashes caused by dst reuse.
      <quote David>
        I am seeing a crash on a distro V4.2.3 kernel caused by a double
        release of a dst_entry.  In ipv4_dst_destroy() the call to
        list_empty() finds a poisoned next pointer, indicating the dst_entry
        has already been removed from the list and freed. The crash occurs
        18 to 24 hours into a run of a network stress exerciser.
      Thanks to his detailed report and analysis, we were able to understand
      the core issue.
      IP early demux can associate a dst to skb, after a lookup in TCP/UDP
      When socket cache is not properly set, we want to store into
      sk->sk_dst_cache the dst for future IP early demux lookups,
      by acquiring a stable refcount on the dst.
      Problem is this acquisition is simply using an atomic_inc(),
      which works well, unless the dst was queued for destruction from
      dst_release() noticing dst refcount went to zero, if DST_NOCACHE
      was set on dst.
      We need to make sure current refcount is not zero before incrementing
      it, or risk double free as David reported.
      This patch, being a stable candidate, adds two new helpers, and use
      them only from IP early demux problematic paths.
      It might be possible to merge in net-next skb_dst_force() and
      skb_dst_force_safe(), but I prefer having the smallest patch for stable
      kernels : Maybe some skb_dst_force() callers do not expect skb->dst
      can suddenly be cleared.
      Can probably be backported back to linux-3.6 kernels
      Reported-by: default avatarDavid J. Wilder <dwilder@us.ibm.com>
      Tested-by: default avatarDavid J. Wilder <dwilder@us.ibm.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Hannes Frederic Sowa's avatar
      net: add validation for the socket syscall protocol argument · 79462ad0
      Hannes Frederic Sowa authored
      郭永刚 reported that one could simply crash the kernel as root by
      using a simple program:
      	int socket_fd;
      	struct sockaddr_in addr;
      	addr.sin_port = 0;
      	addr.sin_addr.s_addr = INADDR_ANY;
      	addr.sin_family = 10;
      	socket_fd = socket(10,3,0x40000000);
      	connect(socket_fd , &addr,16);
      AF_INET, AF_INET6 sockets actually only support 8-bit protocol
      identifiers. inet_sock's skc_protocol field thus is sized accordingly,
      thus larger protocol identifiers simply cut off the higher bits and
      store a zero in the protocol fields.
      This could lead to e.g. NULL function pointer because as a result of
      the cut off inet_num is zero and we call down to inet_autobind, which
      is NULL for raw sockets.
      kernel: Call Trace:
      kernel:  [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70
      kernel:  [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80
      kernel:  [<ffffffff81645069>] SYSC_connect+0xd9/0x110
      kernel:  [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80
      kernel:  [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200
      kernel:  [<ffffffff81645e0e>] SyS_connect+0xe/0x10
      kernel:  [<ffffffff81779515>] tracesys_phase2+0x84/0x89
      I found no particular commit which introduced this problem.
      CVE: CVE-2015-8543
      Cc: Cong Wang <cwang@twopensource.com>
      Reported-by: default avatar郭永刚 <guoyonggang@360.cn>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  9. 13 Dec, 2015 1 commit
  10. 10 Dec, 2015 1 commit
  11. 03 Dec, 2015 1 commit
    • Andrew Lunn's avatar
      ipv4: igmp: Allow removing groups from a removed interface · 4eba7bb1
      Andrew Lunn authored
      When a multicast group is joined on a socket, a struct ip_mc_socklist
      is appended to the sockets mc_list containing information about the
      joined group.
      If the interface is hot unplugged, this entry becomes stale. Prior to
      commit 52ad353a
       ("igmp: fix the problem when mc leave group") it
      was possible to remove the stale entry by performing a
      IP_DROP_MEMBERSHIP, passing either the old ifindex or ip address on
      the interface. However, this fix enforces that the interface must
      still exist. Thus with time, the number of stale entries grows, until
      sysctl_igmp_max_memberships is reached and then it is not possible to
      join and more groups.
      The previous patch fixes an issue where a IP_DROP_MEMBERSHIP is
      performed without specifying the interface, either by ifindex or ip
      address. However here we do supply one of these. So loosen the
      restriction on device existence to only apply when the interface has
      not been specified. This then restores the ability to clean up the
      stale entries.
      Signed-off-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Fixes: 52ad353a
       "(igmp: fix the problem when mc leave group")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  12. 01 Dec, 2015 1 commit
    • Eric Dumazet's avatar
      net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Eric Dumazet authored
      This patch is a cleanup to make following patch easier to
      from (struct socket)->flags to a (struct socket_wq)->flags
      to benefit from RCU protection in sock_wake_async()
      To ease backports, we rename both constants.
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int net, struct sock *sk) are added so that
      following patch can change their implementation.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  13. 30 Nov, 2015 1 commit
  14. 24 Nov, 2015 1 commit
  15. 22 Nov, 2015 1 commit
    • Nikolay Aleksandrov's avatar
      net: ipmr: fix static mfc/dev leaks on table destruction · 0e615e96
      Nikolay Aleksandrov authored
      When destroying an mrt table the static mfc entries and the static
      devices are kept, which leads to devices that can never be destroyed
      (because of refcnt taken) and leaked memory, for example:
      unreferenced object 0xffff880034c144c0 (size 192):
        comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s)
        hex dump (first 32 bytes):
          98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff  .S.4.....S.4....
          ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00  ................
          [<ffffffff815c1b9e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811ea6e0>] kmem_cache_alloc+0x190/0x300
          [<ffffffff815931cb>] ip_mroute_setsockopt+0x5cb/0x910
          [<ffffffff8153d575>] do_ip_setsockopt.isra.11+0x105/0xff0
          [<ffffffff8153e490>] ip_setsockopt+0x30/0xa0
          [<ffffffff81564e13>] raw_setsockopt+0x33/0x90
          [<ffffffff814d1e14>] sock_common_setsockopt+0x14/0x20
          [<ffffffff814d0b51>] SyS_setsockopt+0x71/0xc0
          [<ffffffff815cdbf6>] entry_SYSCALL_64_fastpath+0x16/0x7a
          [<ffffffffffffffff>] 0xffffffffffffffff
      Make sure that everything is cleaned on netns destruction.
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  16. 20 Nov, 2015 3 commits
  17. 18 Nov, 2015 2 commits
    • Eric Dumazet's avatar
      tcp: md5: fix lockdep annotation · 1b8e6a01
      Eric Dumazet authored
      When a passive TCP is created, we eventually call tcp_md5_do_add()
      with sk pointing to the child. It is not owner by the user yet (we
      will add this socket into listener accept queue a bit later anyway)
      But we do own the spinlock, so amend the lockdep annotation to avoid
      following splat :
      [ 8451.090932] net/ipv4/tcp_ipv4.c:923 suspicious rcu_dereference_protected() usage!
      [ 8451.090932]
      [ 8451.090932] other info that might help us debug this:
      [ 8451.090932]
      [ 8451.090934]
      [ 8451.090934] rcu_scheduler_active = 1, debug_locks = 1
      [ 8451.090936] 3 locks held by socket_sockopt_/214795:
      [ 8451.090936]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff855c6ac1>] __netif_receive_skb_core+0x151/0xe90
      [ 8451.090947]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff85618143>] ip_local_deliver_finish+0x43/0x2b0
      [ 8451.090952]  #2:  (slock-AF_INET){+.-...}, at: [<ffffffff855acda5>] sk_clone_lock+0x1c5/0x500
      [ 8451.090958]
      [ 8451.090958] stack backtrace:
      [ 8451.090960] CPU: 7 PID: 214795 Comm: socket_sockopt_
      [ 8451.091215] Call Trace:
      [ 8451.091216]  <IRQ>  [<ffffffff856fb29c>] dump_stack+0x55/0x76
      [ 8451.091229]  [<ffffffff85123b5b>] lockdep_rcu_suspicious+0xeb/0x110
      [ 8451.091235]  [<ffffffff8564544f>] tcp_md5_do_add+0x1bf/0x1e0
      [ 8451.091239]  [<ffffffff85645751>] tcp_v4_syn_recv_sock+0x1f1/0x4c0
      [ 8451.091242]  [<ffffffff85642b27>] ? tcp_v4_md5_hash_skb+0x167/0x190
      [ 8451.091246]  [<ffffffff85647c78>] tcp_check_req+0x3c8/0x500
      [ 8451.091249]  [<ffffffff856451ae>] ? tcp_v4_inbound_md5_hash+0x11e/0x190
      [ 8451.091253]  [<ffffffff85647170>] tcp_v4_rcv+0x3c0/0x9f0
      [ 8451.091256]  [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
      [ 8451.091260]  [<ffffffff856181b6>] ip_local_deliver_finish+0xb6/0x2b0
      [ 8451.091263]  [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
      [ 8451.091267]  [<ffffffff85618d38>] ip_local_deliver+0x48/0x80
      [ 8451.091270]  [<ffffffff85618510>] ip_rcv_finish+0x160/0x700
      [ 8451.091273]  [<ffffffff8561900e>] ip_rcv+0x29e/0x3d0
      [ 8451.091277]  [<ffffffff855c74b7>] __netif_receive_skb_core+0xb47/0xe90
      Fixes: a8afca03
       ("tcp: md5: protects md5sig_info with RCU")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • stephen hemminger's avatar
  18. 16 Nov, 2015 1 commit
  19. 15 Nov, 2015 1 commit
  20. 09 Nov, 2015 1 commit
  21. 05 Nov, 2015 3 commits
  22. 04 Nov, 2015 3 commits
    • David Ahern's avatar
      net: Fix prefsrc lookups · e1b8d903
      David Ahern authored
      A bug report (https://bugzilla.kernel.org/show_bug.cgi?id=107071) noted
      that the follwoing ip command is failing with v4.3:
          $ ip route add dev bond0.250 table vlan_250 src
          RTNETLINK answers: Invalid argument
      021dd3b8 changed the lookup of the given preferred source address to
      use the table id passed in, but this assumes the local entries are in the
      given table which is not necessarily true for non-VRF use cases. When
      validating the preferred source fallback to the local table on failure.
      Fixes: 021dd3b8
       ("net: Add routes to the table associated with the device")
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • WANG Cong's avatar
      ipv4: fix a potential deadlock in mcast getsockopt() path · 87e9f031
      WANG Cong authored
      Sasha reported the following lockdep warning:
        Possible unsafe locking scenario:
              CPU0                    CPU1
              ----                    ----
      This is due to that for IP_MSFILTER and MCAST_MSFILTER, we take
      rtnl lock before the socket lock in setsockopt() path, but take
      the socket lock before rtnl lock in getsockopt() path. All the
      rest optnames are setsockopt()-only.
      Fix this by aligning the getsockopt() path with the setsockopt()
      path, so that all mcast socket path would be locked in the same
      Note, IPv6 part is different where rtnl lock is not held.
      Fixes: 54ff9ef3
       ("ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}")
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • WANG Cong's avatar
      ipv4: disable BH when changing ip local port range · 4ee3bd4a
      WANG Cong authored
      This fixes the following lockdep warning:
       [ INFO: inconsistent lock state ]
       4.3.0-rc7+ #1197 Not tainted
       inconsistent {IN-SOFTIRQ-R} -> {SOFTIRQ-ON-W} usage.
       sysctl/1019 [HC0[0]:SC0[0]:HE1:SE1] takes:
        (&(&net->ipv4.ip_local_ports.lock)->seqcount){+.+-..}, at: [<ffffffff81921de7>] ipv4_local_port_range+0xb4/0x12a
       {IN-SOFTIRQ-R} state was registered at:
         [<ffffffff810bd682>] __lock_acquire+0x2f6/0xdf0
         [<ffffffff810be6d5>] lock_acquire+0x11c/0x1a4
         [<ffffffff818e599c>] inet_get_local_port_range+0x4e/0xae
         [<ffffffff8166e8e3>] udp_flow_src_port.constprop.40+0x23/0x116
         [<ffffffff81671cb9>] vxlan_xmit_one+0x219/0xa6a
         [<ffffffff81672f75>] vxlan_xmit+0xa6b/0xaa5
         [<ffffffff817f2deb>] dev_hard_start_xmit+0x2ae/0x465
         [<ffffffff817f35ed>] __dev_queue_xmit+0x531/0x633
         [<ffffffff817f3702>] dev_queue_xmit_sk+0x13/0x15
         [<ffffffff818004a5>] neigh_resolve_output+0x12f/0x14d
         [<ffffffff81959cfa>] ip6_finish_output2+0x344/0x39f
         [<ffffffff8195bf58>] ip6_finish_output+0x88/0x8e
         [<ffffffff8195bfef>] ip6_output+0x91/0xe5
         [<ffffffff819792ae>] dst_output_sk+0x47/0x4c
         [<ffffffff81979392>] NF_HOOK_THRESH.constprop.30+0x38/0x82
         [<ffffffff8197981e>] mld_sendpack+0x189/0x266
         [<ffffffff8197b28b>] mld_ifc_timer_expire+0x1ef/0x223
         [<ffffffff810de581>] call_timer_fn+0xfb/0x28c
         [<ffffffff810ded1e>] run_timer_softirq+0x1c7/0x1f1
      Fixes: b8f1a556
       ("udp: Add function to make source port for UDP tunnels")
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  23. 03 Nov, 2015 1 commit
    • Dan Streetman's avatar
      xfrm: dst_entries_init() per-net dst_ops · a8a572a6
      Dan Streetman authored
      Remove the dst_entries_init/destroy calls for xfrm4 and xfrm6 dst_ops
      templates; their dst_entries counters will never be used.  Move the
      xfrm dst_ops initialization from the common xfrm/xfrm_policy.c to
      xfrm4/xfrm4_policy.c and xfrm6/xfrm6_policy.c, and call dst_entries_init
      and dst_entries_destroy for each net namespace.
      The ipv4 and ipv6 xfrms each create dst_ops template, and perform
      dst_entries_init on the templates.  The template values are copied to each
      net namespace's xfrm.xfrm*_dst_ops.  The problem there is the dst_ops
      pcpuc_entries field is a percpu counter and cannot be used correctly by
      simply copying it to another object.
      The result of this is a very subtle bug; changes to the dst entries
      counter from one net namespace may sometimes get applied to a different
      net namespace dst entries counter.  This is because of how the percpu
      counter works; it has a main count field as well as a pointer to the
      percpu variables.  Each net namespace maintains its own main count
      variable, but all point to one set of percpu variables.  When any net
      namespace happens to change one of the percpu variables to outside its
      small batch range, its count is moved to the net namespace's main count
      variable.  So with multiple net namespaces operating concurrently, the
      dst_ops entries counter can stray from the actual value that it should
      be; if counts are consistently moved from one net namespace to another
      (which my testing showed is likely), then one net namespace winds up
      with a negative dst_ops count while another winds up with a continually
      increasing count, eventually reaching its gc_thresh limit, which causes
      all new traffic on the net namespace to fail with -ENOBUFS.
      Signed-off-by: default avatarDan Streetman <dan.streetman@canonical.com>
      Signed-off-by: default avatarDan Streetman <ddstreet@ieee.org>
      Signed-off-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
  24. 02 Nov, 2015 4 commits
    • Eric Dumazet's avatar
      net: fix percpu memory leaks · 1d6119ba
      Eric Dumazet authored
      This patch fixes following problems :
      1) percpu_counter_init() can return an error, therefore
        init_frag_mem_limit() must propagate this error so that
        inet_frags_init_net() can do the same up to its callers.
      2) If ip[46]_frags_ns_ctl_register() fail, we must unwind
         properly and free the percpu_counter.
      Without this fix, we leave freed object in percpu_counters
      global list (if CONFIG_HOTPLUG_CPU) leading to crashes.
      This bug was detected by KASAN and syzkaller tool
      Fixes: 6d7b857d
       ("net: use lib/percpu_counter API for fragmentation mem accounting")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      net: make skb_set_owner_w() more robust · 9e17f8a4
      Eric Dumazet authored
      skb_set_owner_w() is called from various places that assume
      skb->sk always point to a full blown socket (as it changes
      We'd like to attach skb to request sockets, and in the future
      to timewait sockets as well. For these kind of pseudo sockets,
      we need to take a traditional refcount and use sock_edemux()
      as the destructor.
      It is now time to un-inline skb_set_owner_w(), being too big.
      Fixes: ca6fb065
       ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Bisected-by: default avatarHaiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Ani Sinha's avatar
      ipmr: fix possible race resulting from improper usage of IP_INC_STATS_BH() in preemptible context. · 44f49dd8
      Ani Sinha authored
      Fixes the following kernel BUG :
      BUG: using __this_cpu_add() in preemptible [00000000] code: bash/2758
      caller is __this_cpu_preempt_check+0x13/0x15
      CPU: 0 PID: 2758 Comm: bash Tainted: P           O   3.18.19 #2
       ffffffff8170eaca ffff880110d1b788 ffffffff81482b2a 0000000000000000
       0000000000000000 ffff880110d1b7b8 ffffffff812010ae ffff880007cab800
       ffff88001a060800 ffff88013a899108 ffff880108b84240 ffff880110d1b7c8
      Call Trace:
      [<ffffffff81482b2a>] dump_stack+0x52/0x80
      [<ffffffff812010ae>] check_preemption_disabled+0xce/0xe1
      [<ffffffff812010d4>] __this_cpu_preempt_check+0x13/0x15
      [<ffffffff81419d60>] ipmr_queue_xmit+0x647/0x70c
      [<ffffffff8141a154>] ip_mr_forward+0x32f/0x34e
      [<ffffffff8141af76>] ip_mroute_setsockopt+0xe03/0x108c
      [<ffffffff810553fc>] ? get_parent_ip+0x11/0x42
      [<ffffffff810e6974>] ? pollwake+0x4d/0x51
      [<ffffffff81058ac0>] ? default_wake_function+0x0/0xf
      [<ffffffff810553fc>] ? get_parent_ip+0x11/0x42
      [<ffffffff810613d9>] ? __wake_up_common+0x45/0x77
      [<ffffffff81486ea9>] ? _raw_spin_unlock_irqrestore+0x1d/0x32
      [<ffffffff810618bc>] ? __wake_up_sync_key+0x4a/0x53
      [<ffffffff8139a519>] ? sock_def_readable+0x71/0x75
      [<ffffffff813dd226>] do_ip_setsockopt+0x9d/0xb55
      [<ffffffff81429818>] ? unix_seqpacket_sendmsg+0x3f/0x41
      [<ffffffff813963fe>] ? sock_sendmsg+0x6d/0x86
      [<ffffffff813959d4>] ? sockfd_lookup_light+0x12/0x5d
      [<ffffffff8139650a>] ? SyS_sendto+0xf3/0x11b
      [<ffffffff810d5738>] ? new_sync_read+0x82/0xaa
      [<ffffffff813ddd19>] compat_ip_setsockopt+0x3b/0x99
      [<ffffffff813fb24a>] compat_raw_setsockopt+0x11/0x32
      [<ffffffff81399052>] compat_sock_common_setsockopt+0x18/0x1f
      [<ffffffff813c4d05>] compat_SyS_setsockopt+0x1a9/0x1cf
      [<ffffffff813c4149>] compat_SyS_socketcall+0x180/0x1e3
      [<ffffffff81488ea1>] cstar_dispatch+0x7/0x1e
      Signed-off-by: default avatarAni Sinha <ani@arista.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Paolo Abeni's avatar
      ipv4: use l4 hash for locally generated multipath flows · 9920e48b
      Paolo Abeni authored
      This patch changes how the multipath hash is computed for locally
      generated flows: now the hash comprises l4 information.
      This allows better utilization of the available paths when the existing
      flows have the same source IP and the same destination IP: with l3 hash,
      even when multiple connections are in place simultaneously, a single path
      will be used, while with l4 hash we can use all the available paths.
      v2 changes:
      - use get_hash_from_flowi4() instead of implementing just another l4 hash
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  25. 01 Nov, 2015 4 commits
    • Julian Anastasov's avatar
      ipv4: update RTNH_F_LINKDOWN flag on UP event · c9b3292e
      Julian Anastasov authored
      When nexthop is part of multipath route we should clear the
      LINKDOWN flag when link goes UP or when first address is added.
      This is needed because we always set LINKDOWN flag when DEAD flag
      was set but now on UP the nexthop is not dead anymore. Examples when
      LINKDOWN bit can be forgotten when no NETDEV_CHANGE is delivered:
      - link goes down (LINKDOWN is set), then link goes UP and device
      shows carrier OK but LINKDOWN remains set
      - last address is deleted (LINKDOWN is set), then address is
      added and device shows carrier OK but LINKDOWN remains set
      Steps to reproduce:
      modprobe dummy
      ifconfig dummy0 up
      here add a multipath route where one nexthop is for dummy0:
      ip route add nexthop dummy0 nexthop SOME_OTHER_DEVICE
      ifconfig dummy0 down
      ifconfig dummy0 up
      now ip route shows nexthop that is not dead. Now set the sysctl var:
      echo 1 > /proc/sys/net/ipv4/conf/dummy0/ignore_routes_with_linkdown
      now ip route will show a dead nexthop because the forgotten
      RTNH_F_LINKDOWN is propagated as RTNH_F_DEAD.
      Fixes: 8a3d0316
       ("net: track link-status of ipv4 nexthops")
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Julian Anastasov's avatar
      ipv4: fix to not remove local route on link down · 4f823def
      Julian Anastasov authored
      When fib_netdev_event calls fib_disable_ip on NETDEV_DOWN event
      we should not delete the local routes if the local address
      is still present. The confusion comes from the fact that both
      fib_netdev_event and fib_inetaddr_event use the NETDEV_DOWN
      constant. Fix it by returning back the variable 'force'.
      Steps to reproduce:
      modprobe dummy
      ifconfig dummy0 up
      ifconfig dummy0 down
      ip route list table local | grep dummy | grep host
      local dev dummy0  proto kernel  scope host  src
      Fixes: 8a3d0316
       ("net: track link-status of ipv4 nexthops")
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Hannes Frederic Sowa's avatar
      ipv4: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment · dbd3393c
      Hannes Frederic Sowa authored
      CHECKSUM_PARTIAL skbs should never arrive in ip_fragment. If we get one
      of those warn about them once and handle them gracefully by recalculating
      the checksum.
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Benjamin Coddington <bcodding@redhat.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Hannes Frederic Sowa's avatar
      ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets · d749c9cb
      Hannes Frederic Sowa authored
      We cannot reliable calculate packet size on MSG_MORE corked sockets
      and thus cannot decide if they are going to be fragmented later on,
      so better not use CHECKSUM_PARTIAL in the first place.
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Benjamin Coddington <bcodding@redhat.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  26. 27 Oct, 2015 1 commit