1. 20 Aug, 2012 1 commit
  2. 09 Aug, 2012 1 commit
    • Eric Dumazet's avatar
      net: tcp: ipv6_mapped needs sk_rx_dst_set method · 63d02d15
      Eric Dumazet authored
      commit 5d299f3d
       (net: ipv6: fix TCP early demux) added a
      regression for ipv6_mapped case.
      [   67.422369] SELinux: initialized (dev autofs, type autofs), uses
      [   67.449678] SELinux: initialized (dev autofs, type autofs), uses
      [   92.631060] BUG: unable to handle kernel NULL pointer dereference at
      [   92.631435] IP: [<          (null)>]           (null)
      [   92.631645] PGD 0
      [   92.631846] Oops: 0010 [#1] SMP
      [   92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror
      dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp
      parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event
      snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm
      snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore
      snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd
      [   92.634294] CPU 0
      [   92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
      [   92.634294] RIP: 0010:[<0000000000000000>]  [<          (null)>]
      [   92.634294] RSP: 0018:ffff880245fc7cb0  EFLAGS: 00010282
      [   92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX:
      [   92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI:
      [   92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09:
      [   92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12:
      [   92.634294] R13: ffff880254735380 R14: 0000000000000000 R15:
      [   92.634294] FS:  00007f4516ccd6f0(0000) GS:ffff880256600000(0000)
      [   92.634294] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4:
      [   92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      [   92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
      [   92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000,
      task ffff880254b8cac0)
      [   92.634294] Stack:
      [   92.634294]  ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8
      [   92.634294]  ffffffff81385083 00000000001d2680 ffff8802547353a8
      [   92.634294]  ffffffff8105903a ffff88024827ad60 0000000000000002
      [   92.634294] Call Trace:
      [   92.634294]  [<ffffffff813837a7>] ? tcp_finish_connect+0x2c/0xfa
      [   92.634294]  [<ffffffff81385083>] tcp_rcv_state_process+0x2b6/0x9c6
      [   92.634294]  [<ffffffff8105903a>] ? sched_clock_cpu+0xc3/0xd1
      [   92.634294]  [<ffffffff81059073>] ? local_clock+0x2b/0x3c
      [   92.634294]  [<ffffffff8138caf3>] tcp_v4_do_rcv+0x63a/0x670
      [   92.634294]  [<ffffffff8133278e>] release_sock+0x128/0x1bd
      [   92.634294]  [<ffffffff8139f060>] __inet_stream_connect+0x1b1/0x352
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8104b333>] ? wake_up_bit+0x25/0x25
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8139f223>] ? inet_stream_connect+0x22/0x4b
      [   92.634294]  [<ffffffff8139f234>] inet_stream_connect+0x33/0x4b
      [   92.634294]  [<ffffffff8132e8cf>] sys_connect+0x78/0x9e
      [   92.634294]  [<ffffffff813fd407>] ? sysret_check+0x1b/0x56
      [   92.634294]  [<ffffffff81088503>] ? __audit_syscall_entry+0x195/0x1c8
      [   92.634294]  [<ffffffff811cc26e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [   92.634294]  [<ffffffff813fd3e2>] system_call_fastpath+0x16/0x1b
      [   92.634294] Code:  Bad RIP value.
      [   92.634294] RIP  [<          (null)>]           (null)
      [   92.634294]  RSP <ffff880245fc7cb0>
      [   92.634294] CR2: 0000000000000000
      [   92.648982] ---[ end trace 24e2bed94314c8d9 ]---
      [   92.649146] Kernel panic - not syncing: Fatal exception in interrupt
      Fix this using inet_sk_rx_dst_set(), and export this function in case
      IPv6 is modular.
      Reported-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  3. 06 Aug, 2012 1 commit
  4. 31 Jul, 2012 1 commit
    • Andrew Morton's avatar
      memcg: rename config variables · c255a458
      Andrew Morton authored
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  5. 30 Jul, 2012 1 commit
    • Eric Dumazet's avatar
      net: ipv4: fix RCU races on dst refcounts · 404e0a8b
      Eric Dumazet authored
      commit c6cffba4
       (ipv4: Fix input route performance regression.)
      added various fatal races with dst refcounts.
      crashes happen on tcp workloads if routes are added/deleted at the same
      The dst_free() calls from free_fib_info_rcu() are clearly racy.
      We need instead regular dst refcounting (dst_release()) and make
      sure dst_release() is aware of RCU grace periods :
      Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
      before dst destruction for cached dst
      Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
      to make sure we dont increase a zero refcount (On a dst currently
      waiting an rcu grace period before destruction)
      rt_cache_route() must take a reference on the new cached route, and
      release it if was not able to install it.
      With this patch, my machines survive various benchmarks.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  6. 27 Jul, 2012 1 commit
  7. 24 Jul, 2012 1 commit
    • Eric Dumazet's avatar
      tcp: early_demux fixes · 9cb429d6
      Eric Dumazet authored
      1) Remove a non needed pskb_may_pull() in tcp_v4_early_demux()
         and fix a potential bug if skb->head was reallocated
         (iph & th pointers were not reloaded)
      TCP stack will pull/check headers anyway.
      2) must reload iph in ip_rcv_finish() after early_demux()
       call since skb->head might have changed.
      3) skb->dev->ifindex can be now replaced by skb->skb_iif
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  8. 23 Jul, 2012 2 commits
    • David S. Miller's avatar
      ipv4: Prepare for change of rt->rt_iif encoding. · 92101b3b
      David S. Miller authored
      Use inet_iif() consistently, and for TCP record the input interface of
      cached RX dst in inet sock.
      rt->rt_iif is going to be encoded differently, so that we can
      legitimately cache input routes in the FIB info more aggressively.
      When the input interface is "use SKB device index" the rt->rt_iif will
      be set to zero.
      This forces us to move the TCP RX dst cache installation into the ipv4
      specific code, and as well it should since doing the route caching for
      ipv6 is pointless at the moment since it is not inspected in the ipv6
      input paths yet.
      Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
      obsolete set to a non-zero value to force invocation of the check
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      tcp: dont drop MTU reduction indications · 563d34d0
      Eric Dumazet authored
      ICMP messages generated in output path if frame length is bigger than
      mtu are actually lost because socket is owned by user (doing the xmit)
      One example is the ipgre_tunnel_xmit() calling
      icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
      We had a similar case fixed in commit a34a101e
       (ipv6: disable GSO on
      sockets hitting dst_allfrag).
      Problem of such fix is that it relied on retransmit timers, so short tcp
      sessions paid a too big latency increase price.
      This patch uses the tcp_release_cb() infrastructure so that MTU
      reduction messages (ICMP messages) are not lost, and no extra delay
      is added in TCP transmits.
      Reported-by: default avatarMaciej Żenczykowski <maze@google.com>
      Diagnosed-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  9. 20 Jul, 2012 1 commit
  10. 19 Jul, 2012 3 commits
    • Yuchung Cheng's avatar
      net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN) · cf60af03
      Yuchung Cheng authored
      sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
      and write(2). The application should replace connect() with it to
      send data in the opening SYN packet.
      For blocking socket, sendmsg() blocks until all the data are buffered
      locally and the handshake is completed like connect() call. It
      returns similar errno like connect() if the TCP handshake fails.
      For non-blocking socket, it returns the number of bytes queued (and
      transmitted in the SYN-data packet) if cookie is available. If cookie
      is not available, it transmits a data-less SYN packet with Fast Open
      cookie request option and returns -EINPROGRESS like connect().
      Using MSG_FASTOPEN on connecting or connected socket will result in
      simlar errno like repeating connect() calls. Therefore the application
      should only use this flag on new sockets.
      The buffer size of sendmsg() is independent of the MSS of the connection.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Yuchung Cheng's avatar
      net-tcp: Fast Open base · 2100c8d2
      Yuchung Cheng authored
      This patch impelements the common code for both the client and server.
      1. TCP Fast Open option processing. Since Fast Open does not have an
         option number assigned by IANA yet, it shares the experiment option
         code 254 by implementing draft-ietf-tcpm-experimental-options
         with a 16 bits magic number 0xF989. This enables global experiments
         without clashing the scarce(2) experimental options available for TCP.
         When the draft status becomes standard (maybe), the client should
         switch to the new option number assigned while the server supports
         both numbers for transistion.
      2. The new sysctl tcp_fastopen
      3. A place holder init function
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      ipv4: tcp: remove per net tcp_sock · be9f4a44
      Eric Dumazet authored
      tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
      per network namespace.
      This leads to bad behavior on multiqueue NICS, because many cpus
      contend for the socket lock and once socket lock is acquired, extra
      false sharing on various socket fields slow down the operations.
      To better resist to attacks, we use a percpu socket. Each cpu can
      run without contention, using appropriate memory (local node)
      Additional features :
      1) We also mirror the queue_mapping of the incoming skb, so that
      answers use the same queue if possible.
      2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree()
      3) We now limit the number of in-flight RST/ACK [1] packets
      per cpu, instead of per namespace, and we honor the sysctl_wmem_default
      limit dynamically. (Prior to this patch, sysctl_wmem_default value was
      copied at boot time, so any further change would not affect tcp_sock
      [1] These packets are only generated when no socket was matched for
      the incoming packet.
      Reported-by: default avatarBill Sommerfeld <wsommerfeld@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  11. 17 Jul, 2012 1 commit
    • David S. Miller's avatar
      net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() · 6700c270
      David S. Miller authored
      This will be used so that we can compose a full flow key.
      Even though we have a route in this context, we need more.  In the
      future the routes will be without destination address, source address,
      etc. keying.  One ipv4 route will cover entire subnets, etc.
      In this environment we have to have a way to possess persistent storage
      for redirects and PMTU information.  This persistent storage will exist
      in the FIB tables, and that's why we'll need to be able to rebuild a
      full lookup flow key here.  Using that flow key will do a fib_lookup()
      and create/update the persistent entry.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  12. 16 Jul, 2012 1 commit
    • David S. Miller's avatar
      ipv4: Add helper inet_csk_update_pmtu(). · 80d0a69f
      David S. Miller authored
      This abstracts away the call to dst_ops->update_pmtu() so that we can
      transparently handle the fact that, in the future, the dst itself can
      be invalidated by the PMTU update (when we have non-host routes cached
      in sockets).
      So we try to rebuild the socket cached route after the method
      invocation if necessary.
      This isn't used by SCTP because it needs to cache dsts per-transport,
      and thus will need it's own local version of this helper.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  13. 12 Jul, 2012 1 commit
  14. 11 Jul, 2012 2 commits
    • David S. Miller's avatar
    • Eric Dumazet's avatar
      tcp: TCP Small Queues · 46d3ceab
      Eric Dumazet authored
      This introduce TSQ (TCP Small Queues)
      TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
      device queues), to reduce RTT and cwnd bias, part of the bufferbloat
      sk->sk_wmem_alloc not allowed to grow above a given limit,
      allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
      given time.
      TSO packets are sized/capped to half the limit, so that we have two
      TSO packets in flight, allowing better bandwidth use.
      As a side effect, setting the limit to 40000 automatically reduces the
      standard gso max limit (65536) to 40000/2 : It can help to reduce
      latencies of high prio packets, having smaller TSO packets.
      This means we divert sock_wfree() to a tcp_wfree() handler, to
      queue/send following frames when skb_orphan() [2] is called for the
      already queued skbs.
      Results on my dev machines (tg3/ixgbe nics) are really impressive,
      using standard pfifo_fast, and with or without TSO/GSO.
      Without reduction of nominal bandwidth, we have reduction of buffering
      per bulk sender :
      < 1ms on Gbit (instead of 50ms with TSO)
      < 8ms on 100Mbit (instead of 132 ms)
      I no longer have 4 MBytes backlogged in qdisc by a single netperf
      session, and both side socket autotuning no longer use 4 Mbytes.
      As skb destructor cannot restart xmit itself ( as qdisc lock might be
      taken at this point ), we delegate the work to a tasklet. We use one
      tasklest per cpu for performance reasons.
      If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
      This flag is tested in a new protocol method called from release_sock(),
      to eventually send new segments.
      [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
      [2] skb_orphan() is usually called at TX completion time,
        but some drivers call it in their start_xmit() handler.
        These drivers should at least use BQL, or else a single TCP
        session can still fill the whole NIC TX ring, since TSQ will
        have no effect.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  15. 10 Jul, 2012 3 commits
  16. 28 Jun, 2012 1 commit
  17. 27 Jun, 2012 3 commits
    • David S. Miller's avatar
      ipv4: Kill early demux method return value. · 160eb5a6
      David S. Miller authored
      It's completely unnecessary.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Revert "ipv4: tcp: dont cache unconfirmed intput dst" · c10237e0
      David S. Miller authored
      This reverts commit c074da28
      This change has several unwanted side effects:
      1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
         thus never create a real cached route.
      2) All TCP traffic will use DST_NOCACHE and never use the routing
         cache at all.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      ipv4: tcp: dont cache unconfirmed intput dst · c074da28
      Eric Dumazet authored
      DDOS synflood attacks hit badly IP route cache.
      On typical machines, this cache is allowed to hold up to 8 Millions dst
      entries, 256 bytes for each, for a total of 2GB of memory.
      rt_garbage_collect() triggers and tries to cleanup things.
      Eventually route cache is disabled but machine is under fire and might
      OOM and crash.
      This patch exploits the new TCP early demux, to set a nocache
      boolean in case incoming TCP frame is for a not yet ESTABLISHED or
      TIMEWAIT socket.
      This 'nocache' boolean is then used in case dst entry is not found in
      route cache, to create an unhashed dst entry (DST_NOCACHE)
      SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache
      output dst for syncookies), so after this patch, a machine is able to
      absorb a DDOS synflood attack without polluting its IP route cache.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  18. 24 Jun, 2012 1 commit
  19. 22 Jun, 2012 1 commit
  20. 21 Jun, 2012 1 commit
  21. 19 Jun, 2012 1 commit
    • David S. Miller's avatar
      ipv4: Early TCP socket demux. · 41063e9d
      David S. Miller authored
      Input packet processing for local sockets involves two major demuxes.
      One for the route and one for the socket.
      But we can optimize this down to one demux for certain kinds of local
      Currently we only do this for established TCP sockets, but it could
      at least in theory be expanded to other kinds of connections.
      If a TCP socket is established then it's identity is fully specified.
      This means that whatever input route was used during the three-way
      handshake must work equally well for the rest of the connection since
      the keys will not change.
      Once we move to established state, we cache the receive packet's input
      route to use later.
      Like the existing cached route in sk->sk_dst_cache used for output
      packets, we have to check for route invalidations using dst->obsolete
      and dst->ops->check().
      Early demux occurs outside of a socket locked section, so when a route
      invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
      actually inside of established state packet processing and thus have
      the socket locked.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  22. 09 Jun, 2012 3 commits
    • David S. Miller's avatar
      [PATCH] tcp: Cache inetpeer in timewait socket, and only when necessary. · 2397849b
      David S. Miller authored
      Since it's guarenteed that we will access the inetpeer if we're trying
      to do timewait recycling and TCP options were enabled on the
      connection, just cache the peer in the timewait socket.
      In the future, inetpeer lookups will be context dependent (per routing
      realm), and this helps facilitate that as well.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      tcp: Get rid of inetpeer special cases. · 4670fd81
      David S. Miller authored
      The get_peer method TCP uses is full of special cases that make no
      sense accommodating, and it also gets in the way of doing more
      reasonable things here.
      First of all, if the socket doesn't have a usable cached route, there
      is no sense in trying to optimize timewait recycling.
      Likewise for the case where we have IP options, such as SRR enabled,
      that make the IP header destination address (and thus the destination
      address of the route key) differ from that of the connection's
      destination address.
      Just return a NULL peer in these cases, and thus we're also able to
      get rid of the clumsy inetpeer release logic.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      inet: Create and use rt{,6}_get_peer_create(). · fbfe95a4
      David S. Miller authored
      There's a lot of places that open-code rt{,6}_get_peer() only because
      they want to set 'create' to one.  So add an rt{,6}_get_peer_create()
      for their sake.
      There were also a few spots open-coding plain rt{,6}_get_peer() and
      those are transformed here as well.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  23. 08 Jun, 2012 1 commit
  24. 04 Jun, 2012 1 commit
  25. 01 Jun, 2012 1 commit
    • Eric Dumazet's avatar
      tcp: reflect SYN queue_mapping into SYNACK packets · fff32699
      Eric Dumazet authored
      While testing how linux behaves on SYNFLOOD attack on multiqueue device
      (ixgbe), I found that SYNACK messages were dropped at Qdisc level
      because we send them all on a single queue.
      Obvious choice is to reflect incoming SYN packet @queue_mapping to
      SYNACK packet.
      Under stress, my machine could only send 25.000 SYNACK per second (for
      200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues.
      After patch, not a single SYNACK is dropped.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  26. 17 May, 2012 1 commit
  27. 15 May, 2012 1 commit
  28. 04 May, 2012 1 commit
    • Eric Dumazet's avatar
      tcp: be more strict before accepting ECN negociation · bd14b1b2
      Eric Dumazet authored
      It appears some networks play bad games with the two bits reserved for
      ECN. This can trigger false congestion notifications and very slow
      Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can
      disable TCP ECN negociation if it happens we receive mangled CT bits in
      the SYN packet.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Perry Lorier <perryl@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Wilmer van der Gaast <wilmer@google.com>
      Cc: Ankur Jain <jankur@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Dave Täht <dave.taht@bufferbloat.net>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  29. 23 Apr, 2012 2 commits
    • Eric Dumazet's avatar
      tcp: sk_add_backlog() is too agressive for TCP · da882c1f
      Eric Dumazet authored
      While investigating TCP performance problems on 10Gb+ links, we found a
      tcp sender was dropping lot of incoming ACKS because of sk_rcvbuf limit
      in sk_add_backlog(), especially if receiver doesnt use GRO/LRO and sends
      one ACK every two MSS segments.
      A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default
      value (87380), allowing a too small backlog.
      A TCP ACK, even being small, can consume nearly same truesize space than
      outgoing packets. Using sk_rcvbuf + sk_sndbuf as a limit makes sense and
      is fast to compute.
      Performance results on netperf, single flow, receiver with disabled
      GRO/LRO : 7500 Mbits instead of 6050 Mbits, no more TCPBacklogDrop
      increments at sender.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Rick Jones <rick.jones2@hp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      net: add a limit parameter to sk_add_backlog() · f545a38f
      Eric Dumazet authored
      sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
      memory limit. We need to make this limit a parameter for TCP use.
      No functional change expected in this patch, all callers still using the
      old sk_rcvbuf limit.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Rick Jones <rick.jones2@hp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>