1. 12 Oct, 2015 1 commit
    • Paolo Abeni's avatar
      ipv4/icmp: redirect messages can use the ingress daddr as source · e2ca690b
      Paolo Abeni authored
      This patch allows configuring how the source address of ICMP
      redirect messages is selected; by default the old behaviour is
      retained, while setting icmp_redirects_use_orig_daddr force the
      usage of the destination address of the packet that caused the
      The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
      following scenario:
      Two machines are set up with VRRP to act as routers out of a subnet,
      they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
      If a host in said subnet needs to get an ICMP redirect from the VRRP
      router, i.e. to reach a destination behind a different gateway, the
      source IP in the ICMP redirect is chosen as the primary IP on the
      interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
      The host will then ignore said redirect, due to RFC 1122 section,
      and will continue to use the wrong next-op.
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  2. 05 Oct, 2015 1 commit
  3. 29 Sep, 2015 1 commit
  4. 25 Sep, 2015 1 commit
    • David Ahern's avatar
      net: Fix panic in icmp_route_lookup · bdb06cbf
      David Ahern authored
      Andrey reported a panic:
      [ 7249.865507] BUG: unable to handle kernel pointer dereference at 000000b4
      [ 7249.865559] IP: [<c16afeca>] icmp_route_lookup+0xaa/0x320
      [ 7249.865598] *pdpt = 0000000030f7f001 *pde = 0000000000000000
      [ 7249.865637] Oops: 0000 [#1]
      [ 7249.866811] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
      4.3.0-999-generic #201509220155
      [ 7249.866876] Hardware name: MSI MS-7250/MS-7250, BIOS 080014  08/02/2006
      [ 7249.866916] task: c1a5ab00 ti: c1a52000 task.ti: c1a52000
      [ 7249.866949] EIP: 0060:[<c16afeca>] EFLAGS: 00210246 CPU: 0
      [ 7249.866981] EIP is at icmp_route_lookup+0xaa/0x320
      [ 7249.867012] EAX: 00000000 EBX: f483ba48 ECX: 00000000 EDX: f2e18a00
      [ 7249.867045] ESI: 000000c0 EDI: f483ba70 EBP: f483b9ec ESP: f483b974
      [ 7249.867077]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      [ 7249.867108] CR0: 8005003b CR2: 000000b4 CR3: 36ee07c0 CR4: 000006f0
      [ 7249.867141] Stack:
      [ 7249.867165]  320310ee 00000000 00000042 320310ee 00000000 c1aeca00
      f3920240 f0c69180
      [ 7249.867268]  f483ba04 f855058b a89b66cd f483ba44 f8962f4b 00000000
      e659266c f483ba54
      [ 7249.867361]  8004753c f483ba5c f8962f4b f2031140 000003c1 ffbd8fa0
      c16b0e00 00000064
      [ 7249.867448] Call Trace:
      [ 7249.867494]  [<f855058b>] ? e1000_xmit_frame+0x87b/0xdc0 [e1000e]
      [ 7249.867534]  [<f8962f4b>] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
      [ 7249.867576]  [<f8962f4b>] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
      [ 7249.867615]  [<c16b0e00>] ? icmp_send+0xa0/0x380
      [ 7249.867648]  [<c16b102f>] icmp_send+0x2cf/0x380
      [ 7249.867681]  [<f89c8126>] nf_send_unreach+0xa6/0xc0 [nf_reject_ipv4]
      [ 7249.867714]  [<f89cd0da>] reject_tg+0x7a/0x9f [ipt_REJECT]
      [ 7249.867746]  [<f88c29a7>] ipt_do_table+0x317/0x70c [ip_tables]
      [ 7249.867780]  [<f895e0a6>] ? __nf_conntrack_find_get+0x166/0x3b0
      [ 7249.867838]  [<f895eea8>] ? nf_conntrack_in+0x398/0x600 [nf_conntrack]
      [ 7249.867889]  [<f84c0035>] iptable_filter_hook+0x35/0x80 [iptable_filter]
      [ 7249.867933]  [<c16776a1>] nf_iterate+0x71/0x80
      [ 7249.867970]  [<c1677715>] nf_hook_slow+0x65/0xc0
      [ 7249.868002]  [<c1681811>] __ip_local_out_sk+0xc1/0xd0
      [ 7249.868034]  [<c1680f30>] ? ip_forward_options+0x1a0/0x1a0
      [ 7249.868066]  [<c1681836>] ip_local_out_sk+0x16/0x30
      [ 7249.868097]  [<c1684054>] ip_send_skb+0x14/0x80
      [ 7249.868129]  [<c16840f4>] ip_push_pending_frames+0x34/0x40
      [ 7249.868163]  [<c16844a2>] ip_send_unicast_reply+0x282/0x310
      [ 7249.868196]  [<c16a0863>] tcp_v4_send_reset+0x1b3/0x380
      [ 7249.868227]  [<c16a1b63>] tcp_v4_rcv+0x323/0x990
      [ 7249.868257]  [<c16776a1>] ? nf_iterate+0x71/0x80
      [ 7249.868289]  [<c167dc2b>] ip_local_deliver_finish+0x8b/0x230
      [ 7249.868322]  [<c167df4c>] ip_local_deliver+0x4c/0xa0
      [ 7249.868353]  [<c167dba0>] ? ip_rcv_finish+0x390/0x390
      [ 7249.868384]  [<c167d88c>] ip_rcv_finish+0x7c/0x390
      [ 7249.868415]  [<c167e280>] ip_rcv+0x2e0/0x420
      Prior to the VRF change the oif was not set in the flow struct, so the
      VRF support should really have only added the vrf_master_ifindex lookup.
      Fixes: 613d09b3 ("net: Use VRF device index for lookups on TX")
      Cc: Andrey Melnikov <temnota.am@gmail.com>
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  5. 28 Aug, 2015 1 commit
  6. 19 Aug, 2015 1 commit
    • Nikolay Aleksandrov's avatar
      vrf: vrf_master_ifindex_rcu is not always called with rcu read lock · 18041e31
      Nikolay Aleksandrov authored
      While running net-next I hit this:
      [  634.073119] ===============================
      [  634.073150] [ INFO: suspicious RCU usage. ]
      [  634.073182] 4.2.0-rc6+ #45 Not tainted
      [  634.073213] -------------------------------
      [  634.073244] include/net/vrf.h:38 suspicious rcu_dereference_check()
      [  634.073274]
                     other info that might help us debug this:
      [  634.073307]
                     rcu_scheduler_active = 1, debug_locks = 1
      [  634.073338] 2 locks held by swapper/0/0:
      [  634.073369]  #0:  (((&n->timer))){+.-...}, at: [<ffffffff8112bc35>]
      [  634.073412]  #1:  (slock-AF_INET){+.-...}, at: [<ffffffff8174f0f5>]
      [  634.073450]
                     stack backtrace:
      [  634.073483] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6+ #45
      [  634.073514] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
      VirtualBox 12/01/2006
      [  634.073545]  0000000000000000 0593ba8242d9ace4 ffff88002fc03b48
      [  634.073612]  0000000000000000 ffffffff81e12500 ffff88002fc03b78
      [  634.073642]  0000000000000000 ffff88002ec4e600 ffffffff81f00f80
      [  634.073669] Call Trace:
      [  634.073694]  <IRQ>  [<ffffffff81803f1b>] dump_stack+0x4c/0x65
      [  634.073728]  [<ffffffff811003c5>] lockdep_rcu_suspicious+0xc5/0x100
      [  634.073763]  [<ffffffff8174eb56>] icmp_route_lookup+0x176/0x5c0
      [  634.073793]  [<ffffffff8174f2fb>] ? icmp_send+0x35b/0x5f0
      [  634.073818]  [<ffffffff8174f274>] ? icmp_send+0x2d4/0x5f0
      [  634.073844]  [<ffffffff8174f3ce>] icmp_send+0x42e/0x5f0
      [  634.073873]  [<ffffffff8170b662>] ipv4_link_failure+0x22/0xa0
      [  634.073899]  [<ffffffff8174bdda>] arp_error_report+0x3a/0x80
      [  634.073926]  [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
      [  634.073952]  [<ffffffff816d396e>] neigh_invalidate+0x8e/0x110
      [  634.073984]  [<ffffffff816d62ae>] neigh_timer_handler+0x1ae/0x290
      [  634.074013]  [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
      [  634.074013]  [<ffffffff8112bce3>] call_timer_fn+0xb3/0x480
      [  634.074013]  [<ffffffff8112bc35>] ? call_timer_fn+0x5/0x480
      [  634.074013]  [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
      [  634.074013]  [<ffffffff8112c2bc>] run_timer_softirq+0x20c/0x430
      [  634.074013]  [<ffffffff810af50e>] __do_softirq+0xde/0x630
      [  634.074013]  [<ffffffff810afc97>] irq_exit+0x117/0x120
      [  634.074013]  [<ffffffff81810976>] smp_apic_timer_interrupt+0x46/0x60
      [  634.074013]  [<ffffffff8180e950>] apic_timer_interrupt+0x70/0x80
      [  634.074013]  <EOI>  [<ffffffff8106b9d6>] ? native_safe_halt+0x6/0x10
      [  634.074013]  [<ffffffff81101d8d>] ? trace_hardirqs_on+0xd/0x10
      [  634.074013]  [<ffffffff81027d43>] default_idle+0x23/0x200
      [  634.074013]  [<ffffffff8102852f>] arch_cpu_idle+0xf/0x20
      [  634.074013]  [<ffffffff810f89ba>] default_idle_call+0x2a/0x40
      [  634.074013]  [<ffffffff810f8dcc>] cpu_startup_entry+0x39c/0x4c0
      [  634.074013]  [<ffffffff817f9cad>] rest_init+0x13d/0x150
      [  634.074013]  [<ffffffff81f69038>] start_kernel+0x4a8/0x4c9
      [  634.074013]  [<ffffffff81f68120>] ?
      [  634.074013]  [<ffffffff81f68339>] x86_64_start_reservations+0x2a/0x2c
      [  634.074013]  [<ffffffff81f68485>] x86_64_start_kernel+0x14a/0x16d
      It would seem vrf_master_ifindex_rcu() can be called without RCU held in
      other contexts as well so introduce a new helper which acquires rcu and
      returns the ifindex.
      Also add curly braces around both the "if" and "else" parts as per the
      style guide.
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  7. 13 Aug, 2015 2 commits
  8. 21 Jul, 2015 1 commit
  9. 03 Apr, 2015 1 commit
  10. 31 Jan, 2015 1 commit
  11. 18 Nov, 2014 1 commit
  12. 11 Nov, 2014 1 commit
    • Joe Perches's avatar
      net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited · ba7a46f1
      Joe Perches authored
      Use the more common dynamic_debug capable net_dbg_ratelimited
      and remove the LIMIT_NETDEBUG macro.
      All messages are still ratelimited.
      Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
      This may have some negative impact on messages that were
      emitted at KERN_INFO that are not not enabled at all unless
      DEBUG is defined or dynamic_debug is enabled.  Even so,
      these messages are now _not_ emitted by default.
      This also eliminates the use of the net_msg_warn sysctl
      "/proc/sys/net/core/warnings".  For backward compatibility,
      the sysctl is not removed, but it has no function.  The extern
      declaration of net_msg_warn is removed from sock.h and made
      static in net/core/sysctl_net_core.c
      o Update the sysctl documentation
      o Remove the embedded uses of pr_fmt
      o Coalesce format fragments
      o Realign arguments
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  13. 23 Sep, 2014 1 commit
    • Eric Dumazet's avatar
      icmp: add a global rate limitation · 4cdf507d
      Eric Dumazet authored
      Current ICMP rate limiting uses inetpeer cache, which is an RBL tree
      protected by a lock, meaning that hosts can be stuck hard if all cpus
      want to check ICMP limits.
      When say a DNS or NTP server process is restarted, inetpeer tree grows
      quick and machine comes to its knees.
      iptables can not help because the bottleneck happens before ICMP
      messages are even cooked and sent.
      This patch adds a new global limitation, using a token bucket filter,
      controlled by two new sysctl :
      icmp_msgs_per_sec - INTEGER
          Limit maximal number of ICMP packets sent per second from this host.
          Only messages whose type matches icmp_ratemask are
          controlled by this limit.
          Default: 1000
      icmp_msgs_burst - INTEGER
          icmp_msgs_per_sec controls number of ICMP packets sent per second,
          while icmp_msgs_burst controls the burst size of these packets.
          Default: 50
      Note that if we really want to send millions of ICMP messages per
      second, we might extend idea and infra added in commit 04ca6973
      ("ip: make IP identifiers less predictable") :
      add a token bucket in the ip_idents hash and no longer rely on inetpeer.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  14. 02 Aug, 2014 1 commit
  15. 31 Jul, 2014 1 commit
  16. 07 Jul, 2014 1 commit
    • Edward Allcutt's avatar
      ipv4: icmp: Fix pMTU handling for rare case · 68b7107b
      Edward Allcutt authored
      Some older router implementations still send Fragmentation Needed
      errors with the Next-Hop MTU field set to zero. This is explicitly
      described as an eventuality that hosts must deal with by the
      standard (RFC 1191) since older standards specified that those
      bits must be zero.
      Linux had a generic (for all of IPv4) implementation of the algorithm
      described in the RFC for searching a list of MTU plateaus for a good
      value. Commit 46517008 ("ipv4: Kill ip_rt_frag_needed().")
      removed this as part of the changes to remove the routing cache.
      Subsequently any Fragmentation Needed packet with a zero Next-Hop
      MTU has been discarded without being passed to the per-protocol
      handlers or notifying userspace for raw sockets.
      When there is a router which does not implement RFC 1191 on an
      MTU limited path then this results in stalled connections since
      large packets are discarded and the local protocols are not
      notified so they never attempt to lower the pMTU.
      One example I have seen is an OpenBSD router terminating IPSec
      tunnels. It's worth pointing out that this case is distinct from
      the BSD 4.2 bug which incorrectly calculated the Next-Hop MTU
      since the commit in question dismissed that as a valid concern.
      All of the per-protocols handlers implement the simple approach from
      RFC 1191 of immediately falling back to the minimum value. Although
      this is sub-optimal it is vastly preferable to connections hanging
      Remove the Next-Hop MTU != 0 check and allow such packets
      to follow the normal path.
      Fixes: 46517008 ("ipv4: Kill ip_rt_frag_needed().")
      Signed-off-by: default avatarEdward Allcutt <edward.allcutt@openmarket.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  17. 13 May, 2014 1 commit
    • Lorenzo Colitti's avatar
      net: add a sysctl to reflect the fwmark on replies · e110861f
      Lorenzo Colitti authored
      Kernel-originated IP packets that have no user socket associated
      with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
      are emitted with a mark of zero. Add a sysctl to make them have
      the same mark as the packet they are replying to.
      This allows an administrator that wishes to do so to use
      mark-based routing, firewalling, etc. for these replies by
      marking the original packets inbound.
      Tested using user-mode linux:
       - ICMP/ICMPv6 echo replies and errors.
       - TCP RST packets (IPv4 and IPv6).
      Signed-off-by: default avatarLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  18. 08 May, 2014 1 commit
  19. 13 Jan, 2014 1 commit
    • Hannes Frederic Sowa's avatar
      ipv4: introduce hardened ip_no_pmtu_disc mode · 8ed1dc44
      Hannes Frederic Sowa authored
      This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors
      to be honored by protocols which do more stringent validation on the
      ICMP's packet payload. This knob is useful for people who e.g. want to
      run an unmodified DNS server in a namespace where they need to use pmtu
      for TCP connections (as they are used for zone transfers or fallback
      for requests) but don't want to use possibly spoofed UDP pmtu information.
      Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
      if the returned packet is in the window or if the association is valid.
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Suggested-by: default avatarFlorian Weimer <fweimer@redhat.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  20. 18 Dec, 2013 2 commits
  21. 28 Sep, 2013 1 commit
    • Francesco Fusco's avatar
      ipv4: processing ancillary IP_TOS or IP_TTL · aa661581
      Francesco Fusco authored
      If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
      packets with the specified TTL or TOS overriding the socket values specified
      with the traditional setsockopt().
      The struct inet_cork stores the values of TOS, TTL and priority that are
      passed through the struct ipcm_cookie. If there are user-specified TOS
      (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
      used to override the per-socket values. In case of TOS also the priority
      is changed accordingly.
      Two helper functions get_rttos and get_rtconn_flags are defined to take
      into account the presence of a user specified TOS value when computing
      Signed-off-by: default avatarFrancesco Fusco <ffusco@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  22. 03 Jun, 2013 2 commits
  23. 29 May, 2013 1 commit
  24. 25 May, 2013 1 commit
    • Lorenzo Colitti's avatar
      net: ipv6: Add IPv6 support to the ping socket. · 6d0bfe22
      Lorenzo Colitti authored
      This adds the ability to send ICMPv6 echo requests without a
      raw socket. The equivalent ability for ICMPv4 was added in
      Instead of having separate code paths for IPv4 and IPv6, make
      most of the code in net/ipv4/ping.c dual-stack and only add a
      few IPv6-specific bits (like the protocol definition) to a new
      net/ipv6/ping.c. Hopefully this will reduce divergence and/or
      duplication of bugs in the future.
      - Setting options via ancillary data (e.g., using IPV6_PKTINFO
        to specify the outgoing interface) is not yet supported.
      - There are no separate security settings for IPv4 and IPv6;
        everything is controlled by /proc/net/ipv4/ping_group_range.
      - The proc interface does not yet display IPv6 ping sockets
      Tested with a patched copy of ping6 and using raw socket calls.
      Compiles and works with all of CONFIG_IPV6={n,m,y}.
      Signed-off-by: default avatarLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  25. 29 Apr, 2013 1 commit
  26. 22 Feb, 2013 1 commit
  27. 26 Nov, 2012 1 commit
  28. 23 Jul, 2012 2 commits
    • David S. Miller's avatar
      ipv4: Prepare for change of rt->rt_iif encoding. · 92101b3b
      David S. Miller authored
      Use inet_iif() consistently, and for TCP record the input interface of
      cached RX dst in inet sock.
      rt->rt_iif is going to be encoded differently, so that we can
      legitimately cache input routes in the FIB info more aggressively.
      When the input interface is "use SKB device index" the rt->rt_iif will
      be set to zero.
      This forces us to move the TCP RX dst cache installation into the ipv4
      specific code, and as well it should since doing the route caching for
      ipv6 is pointless at the moment since it is not inspected in the ipv6
      input paths yet.
      Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
      obsolete set to a non-zero value to force invocation of the check
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      ipv4: Really ignore ICMP address requests/replies. · 838942a5
      David S. Miller authored
      Alexey removed kernel side support for requests, and the
      only thing we do for replies is log a message if something
      doesn't look right.
      As Alexey's comment indicates, this belongs in userspace (if
      anywhere), and thus we can safely just get rid of this code.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  29. 12 Jul, 2012 1 commit
  30. 11 Jul, 2012 4 commits
  31. 10 Jul, 2012 1 commit
  32. 28 Jun, 2012 1 commit
    • David S. Miller's avatar
      ipv4: Create and use fib_compute_spec_dst() helper. · 35ebf65e
      David S. Miller authored
      The specific destination is the host we direct unicast replies to.
      Usually this is the original packet source address, but if we are
      responding to a multicast or broadcast packet we have to use something
      Specifically we must use the source address we would use if we were to
      send a packet to the unicast source of the original packet.
      The routing cache precomputes this value, but we want to remove that
      precomputation because it creates a hard dependency on the expensive
      rpfilter source address validation which we'd like to make cheaper.
      There are only three places where this matters:
      1) ICMP replies.
      2) pktinfo CMSG
      3) IP options
      Now there will be no real users of rt->rt_spec_dst and we can simply
      remove it altogether.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  33. 19 Jun, 2012 1 commit