1. 25 Jun, 2012 1 commit
  2. 01 Jun, 2012 1 commit
    • Eric Dumazet's avatar
      tcp: reflect SYN queue_mapping into SYNACK packets · fff32699
      Eric Dumazet authored
      While testing how linux behaves on SYNFLOOD attack on multiqueue device
      (ixgbe), I found that SYNACK messages were dropped at Qdisc level
      because we send them all on a single queue.
      Obvious choice is to reflect incoming SYN packet @queue_mapping to
      SYNACK packet.
      Under stress, my machine could only send 25.000 SYNACK per second (for
      200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues.
      After patch, not a single SYNACK is dropped.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  3. 17 May, 2012 1 commit
  4. 15 May, 2012 1 commit
  5. 04 May, 2012 1 commit
    • Eric Dumazet's avatar
      tcp: be more strict before accepting ECN negociation · bd14b1b2
      Eric Dumazet authored
      It appears some networks play bad games with the two bits reserved for
      ECN. This can trigger false congestion notifications and very slow
      Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can
      disable TCP ECN negociation if it happens we receive mangled CT bits in
      the SYN packet.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Perry Lorier <perryl@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Wilmer van der Gaast <wilmer@google.com>
      Cc: Ankur Jain <jankur@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Dave Täht <dave.taht@bufferbloat.net>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  6. 26 Apr, 2012 1 commit
    • Eric Dumazet's avatar
      ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing · 67469601
      Eric Dumazet authored
      Quoting Tore Anderson from :
      When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
      size does not take into account the size of the IPv6 Fragmentation
      header that needs to be included in outbound packets, causing every
      transmitted TCP segment to be fragmented across two IPv6 packets, the
      latter of which will only contain 8 bytes of actual payload.
      RTAX_FEATURE_ALLFRAG is typically set on a route in response to
      receving a ICMPv6 Packet Too Big message indicating a Path MTU of less
      than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
      PTBs with MTU < 1280 are still valid, in particular when an IPv6
      packet is sent to an IPv4 destination through a stateless translator.
      Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
      the path will be translated to ICMPv6 PTB which may then indicate an
      MTU of less than 1280.
      The Linux kernel refuses to reduce the effective MTU to anything below
      1280 bytes, instead it sets it to exactly 1280 bytes, and
      RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
      to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
      instead of 1232 (additionally taking into account the 8 bytes required
      by the IPv6 Fragmentation extension header).
      This in turn results in rather inefficient transmission, as every
      transmitted TCP segment now is split in two fragments containing
      1232+8 bytes of payload.
      After this patch, all the outgoing packets that includes a
      Fragmentation header all are "atomic" or "non-fragmented" fragments,
      i.e., they both have Offset=0 and More Fragments=0.
      With help from David S. Miller
      Reported-by: default avatarTore Anderson <tore@fud.no>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Tested-by: default avatarTore Anderson <tore@fud.no>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  7. 23 Apr, 2012 3 commits
    • Eric Dumazet's avatar
      tcp: sk_add_backlog() is too agressive for TCP · da882c1f
      Eric Dumazet authored
      While investigating TCP performance problems on 10Gb+ links, we found a
      tcp sender was dropping lot of incoming ACKS because of sk_rcvbuf limit
      in sk_add_backlog(), especially if receiver doesnt use GRO/LRO and sends
      one ACK every two MSS segments.
      A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default
      value (87380), allowing a too small backlog.
      A TCP ACK, even being small, can consume nearly same truesize space than
      outgoing packets. Using sk_rcvbuf + sk_sndbuf as a limit makes sense and
      is fast to compute.
      Performance results on netperf, single flow, receiver with disabled
      GRO/LRO : 7500 Mbits instead of 6050 Mbits, no more TCPBacklogDrop
      increments at sender.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Rick Jones <rick.jones2@hp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      net: add a limit parameter to sk_add_backlog() · f545a38f
      Eric Dumazet authored
      sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
      memory limit. We need to make this limit a parameter for TCP use.
      No functional change expected in this patch, all callers still using the
      old sk_rcvbuf limit.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Rick Jones <rick.jones2@hp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      tcp: Fix build warning after tcp_{v4,v6}_init_sock consolidation. · ac807fa8
      David S. Miller authored
      net/ipv4/tcp_ipv4.c: In function 'tcp_v4_init_sock':
      net/ipv4/tcp_ipv4.c:1891:19: warning: unused variable 'tp' [-Wunused-variable]
      net/ipv6/tcp_ipv6.c: In function 'tcp_v6_init_sock':
      net/ipv6/tcp_ipv6.c:1836:19: warning: unused variable 'tp' [-Wunused-variable]
      Reported-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  8. 22 Apr, 2012 1 commit
  9. 21 Apr, 2012 1 commit
    • Neal Cardwell's avatar
      tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock() · 900f65d3
      Neal Cardwell authored
      This commit moves the (substantial) common code shared between
      tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family
      independent function, tcp_init_sock().
      Centralizing this functionality should help avoid drift issues,
      e.g. where the IPv4 side is updated without a corresponding update to
      IPv6. There was already some drift: IPv4 initialized snd_cwnd to
      TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to
      2 (in this case it should not matter, since snd_cwnd is also
      initialized in tcp_init_metrics(), but the general risks and
      maintenance overhead remain).
      When diffing the old and new code, note that new tcp_init_sock()
      function uses the order of steps from the tcp_v4_init_sock()
      implementation (the order is slightly different in
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  10. 19 Apr, 2012 1 commit
  11. 05 Apr, 2012 1 commit
  12. 12 Feb, 2012 1 commit
    • Jiri Benc's avatar
      net: implement IP_RECVTOS for IP_PKTOPTIONS · 4c507d28
      Jiri Benc authored
      Currently, it is not easily possible to get TOS/DSCP value of packets from
      an incoming TCP stream. The mechanism is there, IP_PKTOPTIONS getsockopt
      with IP_RECVTOS set, the same way as incoming TTL can be queried. This is
      not actually implemented for TOS, though.
      This patch adds this functionality, both for IPv4 (IP_PKTOPTIONS) and IPv6
      (IPV6_2292PKTOPTIONS). For IPv4, like in the IP_RECVTTL case, the value of
      the TOS field is stored from the other party's ACK.
      This is needed for proxies which require DSCP transparency. One such example
      is at http://zph.bratcheda.org/.
      Signed-off-by: default avatarJiri Benc <jbenc@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  13. 01 Feb, 2012 2 commits
    • Shawn Lu's avatar
      tcp: md5: RST: getting md5 key from listener · 658ddaaf
      Shawn Lu authored
      TCP RST mechanism is broken in TCP md5(RFC2385). When
      connection is gone, md5 key is lost, sending RST
      without md5 hash is deem to ignored by peer. This can
      be a problem since RST help protocal like bgp to fast
      recove from peer crash.
      In most case, users of tcp md5, such as bgp and ldp,
      have listener on both sides to accept connection from peer.
      md5 keys for peers are saved in listening socket.
      There are two cases in finding md5 key when connection is
      1.Passive receive RST: The message is send to well known port,
      tcp will associate it with listner. md5 key is gotten from
      2.Active receive RST (no sock): The message is send to ative
      side, there is no socket associated with the message. In this
      case, finding listener from source port, then find md5 key from
      we are not loosing sercuriy here:
      packet is checked with md5 hash. No RST is generated
      if md5 hash doesn't match or no md5 key can be found.
      Signed-off-by: default avatarShawn Lu <shawn.lu@ericsson.com>
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      tcp: md5: protects md5sig_info with RCU · a8afca03
      Eric Dumazet authored
      This patch makes sure we use appropriate memory barriers before
      publishing tp->md5sig_info, allowing tcp_md5_do_lookup() being used from
      tcp_v4_send_reset() without holding socket lock (upcoming patch from
      Shawn Lu)
      Note we also need to respect rcu grace period before its freeing, since
      we can free socket without this grace period thanks to
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Shawn Lu <shawn.lu@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  14. 31 Jan, 2012 2 commits
  15. 22 Jan, 2012 1 commit
  16. 12 Dec, 2011 3 commits
  17. 23 Nov, 2011 1 commit
  18. 22 Nov, 2011 1 commit
  19. 01 Nov, 2011 1 commit
  20. 26 Oct, 2011 1 commit
  21. 24 Oct, 2011 1 commit
  22. 21 Oct, 2011 1 commit
  23. 04 Oct, 2011 1 commit
  24. 28 Sep, 2011 1 commit
  25. 27 Sep, 2011 1 commit
  26. 15 Sep, 2011 1 commit
    • Eric Dumazet's avatar
      tcp: Change possible SYN flooding messages · 946cedcc
      Eric Dumazet authored
      "Possible SYN flooding on port xxxx " messages can fill logs on servers.
      Change logic to log the message only once per listener, and add two new
      SNMP counters to track :
      TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client
      TCPReqQFullDrop : number of times a SYN request was dropped because
      syncookies were not enabled.
      Based on a prior patch from Tom Herbert, and suggestions from David.
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  27. 17 Aug, 2011 1 commit
  28. 06 Aug, 2011 1 commit
    • David S. Miller's avatar
      net: Compute protocol sequence numbers and fragment IDs using MD5. · 6e5714ea
      David S. Miller authored
      Computers have become a lot faster since we compromised on the
      partial MD4 hash which we use currently for performance reasons.
      MD5 is a much safer choice, and is inline with both RFC1948 and
      other ISS generators (OpenBSD, Solaris, etc.)
      Furthermore, only having 24-bits of the sequence number be truly
      unpredictable is a very serious limitation.  So the periodic
      regeneration and 8-bit counter have been removed.  We compute and
      use a full 32-bit sequence number.
      For ipv6, DCCP was found to use a 32-bit truncated initial sequence
      number (it needs 43-bits) and that is fixed here as well.
      Reported-by: default avatarDan Kaminsky <dan@doxpara.com>
      Tested-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  29. 17 Jun, 2011 1 commit
    • Eric Dumazet's avatar
      net: rfs: enable RFS before first data packet is received · 1eddcead
      Eric Dumazet authored
      Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit :
      > From: Ben Hutchings <bhutchings@solarflare.com>
      > Date: Fri, 17 Jun 2011 00:50:46 +0100
      > > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
      > >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
      > >>  			goto discard;
      > >>
      > >>  		if (nsk != sk) {
      > >> +			sock_rps_save_rxhash(nsk, skb->rxhash);
      > >>  			if (tcp_child_process(sk, nsk, skb)) {
      > >>  				rsk = nsk;
      > >>  				goto reset;
      > >>
      > >
      > > I haven't tried this, but it looks reasonable to me.
      > >
      > > What about IPv6?  The logic in tcp_v6_do_rcv() looks very similar.
      > Indeed ipv6 side needs the same fix.
      > Eric please add that part and resubmit.  And in fact I might stick
      > this into net-2.6 instead of net-next-2.6
      OK, here is the net-2.6 based one then, thanks !
      [PATCH v2] net: rfs: enable RFS before first data packet is received
      First packet received on a passive tcp flow is not correctly RFS
      One sock_rps_record_flow() call is missing in inet_accept()
      But before that, we also must record rxhash when child socket is setup.
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Jamal Hadi Salim <hadi@cyberus.ca>
      Signed-off-by: default avatarDavid S. Miller <davem@conan.davemloft.net>
  30. 08 Jun, 2011 1 commit
    • Jerry Chu's avatar
      tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side · 9ad7c049
      Jerry Chu authored
      This patch lowers the default initRTO from 3secs to 1sec per
      RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
      has been retransmitted, AND the TCP timestamp option is not on.
      It also adds support to take RTT sample during 3WHS on the passive
      open side, just like its active open counterpart, and uses it, if
      valid, to seed the initRTO for the data transmission phase.
      The patch also resets ssthresh to its initial default at the
      beginning of the data transmission phase, and reduces cwnd to 1 if
      there has been MORE THAN ONE retransmission during 3WHS per RFC5681.
      Signed-off-by: default avatarH.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  31. 23 May, 2011 1 commit
    • Dan Rosenberg's avatar
      net: convert %p usage to %pK · 71338aa7
      Dan Rosenberg authored
      The %pK format specifier is designed to hide exposed kernel pointers,
      specifically via /proc interfaces.  Exposing these pointers provides an
      easy target for kernel write vulnerabilities, since they reveal the
      locations of writable structures containing easily triggerable function
      pointers.  The behavior of %pK depends on the kptr_restrict sysctl.
      If kptr_restrict is set to 0, no deviation from the standard %p behavior
      occurs.  If kptr_restrict is set to 1, the default, if the current user
      (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
      (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
       If kptr_restrict is set to 2, kernel pointers using %pK are printed as
      0's regardless of privileges.  Replacing with 0's was chosen over the
      default "(null)", which cannot be parsed by userland %p, which expects
      The supporting code for kptr_restrict and %pK are currently in the -mm
      tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
      pointers to the syslog are not covered, since this would eliminate useful
      information for postmortem debugging and the reading of the syslog is
      already optionally protected by the dmesg_restrict sysctl.
      Signed-off-by: default avatarDan Rosenberg <drosenberg@vsecurity.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Thomas Graf <tgraf@infradead.org>
      Cc: Eugene Teo <eugeneteo@kernel.org>
      Cc: Kees Cook <kees.cook@canonical.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  32. 28 Apr, 2011 1 commit
    • Eric Dumazet's avatar
      inet: add RCU protection to inet->opt · f6d8bd05
      Eric Dumazet authored
      We lack proper synchronization to manipulate inet->opt ip_options
      Problem is ip_make_skb() calls ip_setup_cork() and
      ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
      without any protection against another thread manipulating inet->opt.
      Another thread can change inet->opt pointer and free old one under us.
      Use RCU to protect inet->opt (changed to inet->inet_opt).
      Instead of handling atomic refcounts, just copy ip_options when
      necessary, to avoid cache line dirtying.
      We cant insert an rcu_head in struct ip_options since its included in
      skb->cb[], so this patch is large because I had to introduce a new
      ip_options_rcu structure.
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  33. 22 Apr, 2011 1 commit
  34. 06 Apr, 2011 1 commit
    • Neil Horman's avatar
      ipv6: Enable RFS sk_rxhash tracking for ipv6 sockets (v2) · 47482f13
      Neil Horman authored
      properly record sk_rxhash in ipv6 sockets (v2)
      Noticed while working on another project that flows to sockets which I had open
      on a test systems weren't getting steered properly when I had RFS enabled.
      Looking more closely I found that:
      1) The affected sockets were all ipv6
      2) They weren't getting steered because sk->sk_rxhash was never set from the
      incomming skbs on that socket.
      This was occuring because there are several points in the IPv4 tcp and udp code
      which save the rxhash value when a new connection is established.  Those calls
      to sock_rps_save_rxhash were never added to the corresponding ipv6 code paths.
      This patch adds those calls.  Tested by myself to properly enable RFS
      functionalty on ipv6.
      Change notes:
      	Filtered UDP to only arm RFS on bound sockets (Eric Dumazet)
      Signed-off-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>