1. 10 Jul, 2013 1 commit
  2. 19 Jun, 2013 1 commit
  3. 10 Jun, 2013 1 commit
  4. 07 Jun, 2013 1 commit
  5. 20 May, 2013 1 commit
    • Eric Dumazet's avatar
      tcp: md5: remove spinlock usage in fast path · 71cea17e
      Eric Dumazet authored
      
      
      TCP md5 code uses per cpu variables but protects access to them with
      a shared spinlock, which is a contention point.
      
      [ tcp_md5sig_pool_lock is locked twice per incoming packet ]
      
      Makes things much simpler, by allocating crypto structures once, first
      time a socket needs md5 keys, and not deallocating them as they are
      really small.
      
      Next step would be to allow crypto allocations being done in a NUMA
      aware way.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      71cea17e
  6. 29 Apr, 2013 1 commit
  7. 09 Apr, 2013 1 commit
    • Al Viro's avatar
      procfs: new helper - PDE_DATA(inode) · d9dda78b
      Al Viro authored
      
      
      The only part of proc_dir_entry the code outside of fs/proc
      really cares about is PDE(inode)->data.  Provide a helper
      for that; static inline for now, eventually will be moved
      to fs/proc, along with the knowledge of struct proc_dir_entry
      layout.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      d9dda78b
  8. 18 Mar, 2013 1 commit
    • Eric Dumazet's avatar
      tcp: dont handle MTU reduction on LISTEN socket · 0d4f0608
      Eric Dumazet authored
      When an ICMP ICMP_FRAG_NEEDED (or ICMPV6_PKT_TOOBIG) message finds a
      LISTEN socket, and this socket is currently owned by the user, we
      set TCP_MTU_REDUCED_DEFERRED flag in listener tsq_flags.
      
      This is bad because if we clone the parent before it had a chance to
      clear the flag, the child inherits the tsq_flags value, and next
      tcp_release_cb() on the child will decrement sk_refcnt.
      
      Result is that we might free a live TCP socket, as reported by
      Dormando.
      
      IPv4: Attempt to release TCP socket in state 1
      
      Fix this issue by testing sk_state against TCP_LISTEN early, so that we
      set TCP_MTU_REDUCED_DEFERRED on appropriate sockets (not a LISTEN one)
      
      This bug was introduced in commit 563d34d0
      
      
      (tcp: dont drop MTU reduction indications)
      Reported-by: default avatardormando <dormando@rydia.net>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0d4f0608
  9. 17 Mar, 2013 1 commit
    • Christoph Paasch's avatar
      tcp: Remove TCPCT · 1a2c6181
      Christoph Paasch authored
      
      
      TCPCT uses option-number 253, reserved for experimental use and should
      not be used in production environments.
      Further, TCPCT does not fully implement RFC 6013.
      
      As a nice side-effect, removing TCPCT increases TCP's performance for
      very short flows:
      
      Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
      for files of 1KB size.
      
      before this patch:
      	average (among 7 runs) of 20845.5 Requests/Second
      after:
      	average (among 7 runs) of 21403.6 Requests/Second
      Signed-off-by: default avatarChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1a2c6181
  10. 12 Mar, 2013 1 commit
    • Nandita Dukkipati's avatar
      tcp: Tail loss probe (TLP) · 6ba8a3b1
      Nandita Dukkipati authored
      This patch series implement the Tail loss probe (TLP) algorithm described
      in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
      
      . The
      first patch implements the basic algorithm.
      
      TLP's goal is to reduce tail latency of short transactions. It achieves
      this by converting retransmission timeouts (RTOs) occuring due
      to tail losses (losses at end of transactions) into fast recovery.
      TLP transmits one packet in two round-trips when a connection is in
      Open state and isn't receiving any ACKs. The transmitted packet, aka
      loss probe, can be either new or a retransmission. When there is tail
      loss, the ACK from a loss probe triggers FACK/early-retransmit based
      fast recovery, thus avoiding a costly RTO. In the absence of loss,
      there is no change in the connection state.
      
      PTO stands for probe timeout. It is a timer event indicating
      that an ACK is overdue and triggers a loss probe packet. The PTO value
      is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
      ACK timer when there is only one oustanding packet.
      
      TLP Algorithm
      
      On transmission of new data in Open state:
        -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
        -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
        -> PTO = min(PTO, RTO)
      
      Conditions for scheduling PTO:
        -> Connection is in Open state.
        -> Connection is either cwnd limited or no new data to send.
        -> Number of probes per tail loss episode is limited to one.
        -> Connection is SACK enabled.
      
      When PTO fires:
        new_segment_exists:
          -> transmit new segment.
          -> packets_out++. cwnd remains same.
      
        no_new_packet:
          -> retransmit the last segment.
             Its ACK triggers FACK or early retransmit based recovery.
      
      ACK path:
        -> rearm RTO at start of ACK processing.
        -> reschedule PTO if need be.
      
      In addition, the patch includes a small variation to the Early Retransmit
      (ER) algorithm, such that ER and TLP together can in principle recover any
      N-degree of tail loss through fast recovery. TLP is controlled by the same
      sysctl as ER, tcp_early_retrans sysctl.
      tcp_early_retrans==0; disables TLP and ER.
      		 ==1; enables RFC5827 ER.
      		 ==2; delayed ER.
      		 ==3; TLP and delayed ER. [DEFAULT]
      		 ==4; TLP only.
      
      The TLP patch series have been extensively tested on Google Web servers.
      It is most effective for short Web trasactions, where it reduced RTOs by 15%
      and improved HTTP response time (average by 6%, 99th percentile by 10%).
      The transmitted probes account for <0.5% of the overall transmissions.
      Signed-off-by: default avatarNandita Dukkipati <nanditad@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6ba8a3b1
  11. 07 Mar, 2013 1 commit
  12. 27 Feb, 2013 1 commit
    • Sasha Levin's avatar
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      
      
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: default avatarPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  13. 18 Feb, 2013 1 commit
  14. 13 Feb, 2013 1 commit
    • Andrey Vagin's avatar
      tcp: send packets with a socket timestamp · ee684b6f
      Andrey Vagin authored
      
      
      A socket timestamp is a sum of the global tcp_time_stamp and
      a per-socket offset.
      
      A socket offset is added in places where externally visible
      tcp timestamp option is parsed/initialized.
      
      Connections in the SYN_RECV state are not supported, global
      tcp_time_stamp is used for them, because repair mode doesn't support
      this state. In a future it can be implemented by the similar way
      as for TIME_WAIT sockets.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarAndrey Vagin <avagin@openvz.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ee684b6f
  15. 04 Feb, 2013 1 commit
  16. 29 Jan, 2013 1 commit
  17. 23 Jan, 2013 1 commit
    • Tom Herbert's avatar
      soreuseport: TCP/IPv4 implementation · da5e3630
      Tom Herbert authored
      
      
      Allow multiple listener sockets to bind to the same port.
      
      Motivation for soresuseport would be something like a web server
      binding to port 80 running with multiple threads, where each thread
      might have it's own listener socket.  This could be done as an
      alternative to other models: 1) have one listener thread which
      dispatches completed connections to workers. 2) accept on a single
      listener socket from multiple threads.  In case #1 the listener thread
      can easily become the bottleneck with high connection turn-over rate.
      In case #2, the proportion of connections accepted per thread tends
      to be uneven under high connection load (assuming simple event loop:
      while (1) { accept(); process() }, wakeup does not promote fairness
      among the sockets.  We have seen the  disproportion to be as high
      as 3:1 ratio between thread accepting most connections and the one
      accepting the fewest.  With so_reusport the distribution is
      uniform.
      Signed-off-by: default avatarTom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      da5e3630
  18. 20 Jan, 2013 1 commit
  19. 06 Jan, 2013 1 commit
  20. 14 Dec, 2012 1 commit
    • Christoph Paasch's avatar
      inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock · e337e24d
      Christoph Paasch authored
      If in either of the above functions inet_csk_route_child_sock() or
      __inet_inherit_port() fails, the newsk will not be freed:
      
      unreferenced object 0xffff88022e8a92c0 (size 1592):
        comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
        hex dump (first 32 bytes):
          0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00  ................
          02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
          [<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
          [<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
          [<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
          [<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
          [<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
          [<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
          [<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
          [<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
          [<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
          [<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
          [<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
          [<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
          [<ffffffff814cee68>] ip_rcv+0x217/0x267
          [<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
          [<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82
      
      This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
      a single sock_put() is not enough to free the memory. Additionally, things
      like xfrm, memcg, cookie_values,... may have been initialized.
      We have to free them properly.
      
      This is fixed by forcing a call to tcp_done(), ending up in
      inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
      because it ends up doing all the cleanup on xfrm, memcg, cookie_values,
      xfrm,...
      
      Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
      force it entering inet_csk_destroy_sock. To avoid the warning in
      inet_csk_destroy_sock, inet_num has to be set to 0.
      As inet_csk_destroy_sock does a dec on orphan_count, we first have to
      increase it.
      
      Calling tcp_done() allows us to remove the calls to
      tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().
      
      A similar approach is taken for dccp by calling dccp_done().
      
      This is in the kernel since 093d2823
      
       (tproxy: fix hash locking issue
      when using port redirection in __inet_inherit_port()), thus since
      version >= 2.6.37.
      Signed-off-by: default avatarChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e337e24d
  21. 22 Nov, 2012 1 commit
  22. 03 Nov, 2012 1 commit
    • Eric Dumazet's avatar
      tcp: better retrans tracking for defer-accept · e6c022a4
      Eric Dumazet authored
      
      
      For passive TCP connections using TCP_DEFER_ACCEPT facility,
      we incorrectly increment req->retrans each time timeout triggers
      while no SYNACK is sent.
      
      SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
      which we received the ACK from client). Only the last SYNACK is sent
      so that we can receive again an ACK from client, to move the req into
      accept queue. We plan to change this later to avoid the useless
      retransmit (and potential problem as this SYNACK could be lost)
      
      TCP_INFO later gives wrong information to user, claiming imaginary
      retransmits.
      
      Decouple req->retrans field into two independent fields :
      
      num_retrans : number of retransmit
      num_timeout : number of timeouts
      
      num_timeout is the counter that is incremented at each timeout,
      regardless of actual SYNACK being sent or not, and used to
      compute the exponential timeout.
      
      Introduce inet_rtx_syn_ack() helper to increment num_retrans
      only if ->rtx_syn_ack() succeeded.
      
      Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
      when we re-send a SYNACK in answer to a (retransmitted) SYN.
      Prior to this patch, we were not counting these retransmits.
      
      Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
      only if a synack packet was successfully queued.
      Reported-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e6c022a4
  23. 31 Oct, 2012 1 commit
  24. 23 Oct, 2012 1 commit
  25. 22 Oct, 2012 1 commit
  26. 12 Oct, 2012 1 commit
    • Alexey Kuznetsov's avatar
      tcp: resets are misrouted · 4c675258
      Alexey Kuznetsov authored
      After commit e2446eaa
      
       ("tcp_v4_send_reset: binding oif to iif in no
      sock case").. tcp resets are always lost, when routing is asymmetric.
      Yes, backing out that patch will result in misrouting of resets for
      dead connections which used interface binding when were alive, but we
      actually cannot do anything here.  What's died that's died and correct
      handling normal unbound connections is obviously a priority.
      
      Comment to comment:
      > This has few benefits:
      >   1. tcp_v6_send_reset already did that.
      
      It was done to route resets for IPv6 link local addresses. It was a
      mistake to do so for global addresses. The patch fixes this as well.
      
      Actually, the problem appears to be even more serious than guaranteed
      loss of resets.  As reported by Sergey Soloviev <sol@eqv.ru>, those
      misrouted resets create a lot of arp traffic and huge amount of
      unresolved arp entires putting down to knees NAT firewalls which use
      asymmetric routing.
      Signed-off-by: default avatarAlexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      4c675258
  27. 01 Oct, 2012 1 commit
  28. 27 Sep, 2012 1 commit
  29. 24 Sep, 2012 1 commit
    • Eric Dumazet's avatar
      net: use a per task frag allocator · 5640f768
      Eric Dumazet authored
      
      
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      operations.
      
      This page is used to build fragments for skbs.
      
      Its done to increase probability of coalescing small write() into
      single segments in skbs still in write queue (not yet sent)
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page
      
      Its also quite inefficient to build TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
      page allocator more than wanted.
      
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      
      (up to 32768 bytes per frag, thats order-3 pages on x86)
      
      This increases TCP stream performance by 20% on loopback device,
      but also benefits on other network devices, since 8x less frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      
      Its possible some SG enabled hardware cant cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment in sub fragments, since some arches have PAGE_SIZE=65536
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: default avatarVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5640f768
  30. 22 Sep, 2012 2 commits
    • Neal Cardwell's avatar
      tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHS · 016818d0
      Neal Cardwell authored
      
      
      When taking SYNACK RTT samples for servers using TCP Fast Open, fix
      the code to ensure that we only call tcp_valid_rtt_meas() after we
      receive the ACK that completes the 3-way handshake.
      
      Previously we were always taking an RTT sample in
      tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
      tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
      receive the SYN. So for TFO we must wait until tcp_rcv_state_process()
      to take the RTT sample.
      
      To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock()
      before we set the snt_synack timestamp, since tcp_synack_rtt_meas()
      already ensures that we only take a SYNACK RTT sample if snt_synack is
      non-zero. To be careful, we only take a snt_synack timestamp when
      a SYNACK transmit or retransmit succeeds.
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      016818d0
    • Neal Cardwell's avatar
      tcp: extract code to compute SYNACK RTT · 623df484
      Neal Cardwell authored
      
      
      In preparation for adding another spot where we compute the SYNACK
      RTT, extract this code so that it can be shared.
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      623df484
  31. 31 Aug, 2012 2 commits
    • Jerry Chu's avatar
      tcp: TCP Fast Open Server - main code path · 168a8f58
      Jerry Chu authored
      
      
      This patch adds the main processing path to complete the TFO server
      patches.
      
      A TFO request (i.e., SYN+data packet with a TFO cookie option) first
      gets processed in tcp_v4_conn_request(). If it passes the various TFO
      checks by tcp_fastopen_check(), a child socket will be created right
      away to be accepted by applications, rather than waiting for the 3WHS
      to finish.
      
      In additon to the use of TFO cookie, a simple max_qlen based scheme
      is put in place to fend off spoofed TFO attack.
      
      When a valid ACK comes back to tcp_rcv_state_process(), it will cause
      the state of the child socket to switch from either TCP_SYN_RECV to
      TCP_ESTABLISHED, or TCP_FIN_WAIT1 to TCP_FIN_WAIT2. At this time
      retransmission will resume for any unack'ed (data, FIN,...) segments.
      Signed-off-by: default avatarH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      168a8f58
    • Jerry Chu's avatar
      tcp: TCP Fast Open Server - support TFO listeners · 8336886f
      Jerry Chu authored
      
      
      This patch builds on top of the previous patch to add the support
      for TFO listeners. This includes -
      
      1. allocating, properly initializing, and managing the per listener
      fastopen_queue structure when TFO is enabled
      
      2. changes to the inet_csk_accept code to support TFO. E.g., the
      request_sock can no longer be freed upon accept(), not until 3WHS
      finishes
      
      3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
      if it's a TFO socket
      
      4. properly closing a TFO listener, and a TFO socket before 3WHS
      finishes
      
      5. supporting TCP_FASTOPEN socket option
      
      6. modifying tcp_check_req() to use to check a TFO socket as well
      as request_sock
      
      7. supporting TCP's TFO cookie option
      
      8. adding a new SYN-ACK retransmit handler to use the timer directly
      off the TFO socket rather than the listener socket. Note that TFO
      server side will not retransmit anything other than SYN-ACK until
      the 3WHS is completed.
      
      The patch also contains an important function
      "reqsk_fastopen_remove()" to manage the somewhat complex relation
      between a listener, its request_sock, and the corresponding child
      socket. See the comment above the function for the detail.
      Signed-off-by: default avatarH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8336886f
  32. 21 Aug, 2012 1 commit
    • Eric Dumazet's avatar
      tcp: fix possible socket refcount problem · 144d56e9
      Eric Dumazet authored
      Commit 6f458dfb
      
       (tcp: improve latencies of timer triggered events)
      added bug leading to following trace :
      
      [ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
      [ 2866.131726]
      [ 2866.132188] =========================
      [ 2866.132281] [ BUG: held lock freed! ]
      [ 2866.132281] 3.6.0-rc1+ #622 Not tainted
      [ 2866.132281] -------------------------
      [ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
      [ 2866.132281]  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
      [ 2866.132281] 4 locks held by kworker/0:1/652:
      [ 2866.132281]  #0:  (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
      [ 2866.132281]  #1:  ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
      [ 2866.132281]  #2:  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
      [ 2866.132281]  #3:  (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
      [ 2866.132281]
      [ 2866.132281] stack backtrace:
      [ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
      [ 2866.132281] Call Trace:
      [ 2866.132281]  <IRQ>  [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
      [ 2866.132281]  [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
      [ 2866.132281]  [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
      [ 2866.132281]  [<ffffffff818a0839>] __sk_free+0xfd/0x114
      [ 2866.132281]  [<ffffffff818a08c0>] sk_free+0x1c/0x1e
      [ 2866.132281]  [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
      [ 2866.132281]  [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
      [ 2866.132281]  [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
      [ 2866.132281]  [<ffffffff810f5831>] ? rb_commit+0x58/0x85
      [ 2866.132281]  [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
      [ 2866.132281]  [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
      [ 2866.132281]  [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
      [ 2866.132281]  [<ffffffff81a1227c>] call_softirq+0x1c/0x30
      [ 2866.132281]  [<ffffffff81039f38>] do_softirq+0x4a/0xa6
      [ 2866.132281]  [<ffffffff81070f2b>] irq_exit+0x51/0xad
      [ 2866.132281]  [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
      [ 2866.132281]  [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
      [ 2866.132281]  <EOI>  [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
      [ 2866.132281]  [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
      [ 2866.132281]  [<ffffffff81078692>] mod_timer+0x178/0x1a9
      [ 2866.132281]  [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
      [ 2866.132281]  [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
      [ 2866.132281]  [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
      [ 2866.132281]  [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
      [ 2866.132281]  [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
      [ 2866.132281]  [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
      [ 2866.132281]  [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
      [ 2866.132281]  [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
      [ 2866.132281]  [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
      [ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
      [ 2866.132281]  [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
      [ 2866.132281]  [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
      [ 2866.132281]  [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
      [ 2866.132281]  [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
      [ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
      [ 2866.132281]  [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
      [ 2866.132281]  [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
      [ 2866.132281]  [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
      [ 2866.132281]  [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
      [ 2866.132281]  [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
      [ 2866.132281]  [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
      [ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
      [ 2866.132281]  [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
      [ 2866.132281]  [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
      [ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
      [ 2866.132281]  [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
      [ 2866.132281]  [<ffffffff810835d6>] process_one_work+0x24d/0x47f
      [ 2866.132281]  [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
      [ 2866.132281]  [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
      [ 2866.132281]  [<ffffffff81083a6d>] worker_thread+0x236/0x317
      [ 2866.132281]  [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
      [ 2866.132281]  [<ffffffff8108b7b8>] kthread+0x9a/0xa2
      [ 2866.132281]  [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
      [ 2866.132281]  [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
      [ 2866.132281]  [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
      [ 2866.132281]  [<ffffffff81a12180>] ? gs_change+0x13/0x13
      [ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
      [ 2866.309689] =============================================================================
      [ 2866.310254] BUG TCP (Not tainted): Object already free
      [ 2866.310254] -----------------------------------------------------------------------------
      [ 2866.310254]
      
      The bug comes from the fact that timer set in sk_reset_timer() can run
      before we actually do the sock_hold(). socket refcount reaches zero and
      we free the socket too soon.
      
      timer handler is not allowed to reduce socket refcnt if socket is owned
      by the user, or we need to change sk_reset_timer() implementation.
      
      We should take a reference on the socket in case TCP_DELACK_TIMER_DEFERRED
      or TCP_DELACK_TIMER_DEFERRED bit are set in tsq_flags
      
      Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
      was used instead of TCP_DELACK_TIMER_DEFERRED.
      
      For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
      even if not fired from a timer.
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Tested-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      144d56e9
  33. 20 Aug, 2012 1 commit
  34. 14 Aug, 2012 1 commit
  35. 09 Aug, 2012 2 commits
    • Eric Dumazet's avatar
      net: tcp: ipv6_mapped needs sk_rx_dst_set method · 63d02d15
      Eric Dumazet authored
      commit 5d299f3d
      
       (net: ipv6: fix TCP early demux) added a
      regression for ipv6_mapped case.
      
      [   67.422369] SELinux: initialized (dev autofs, type autofs), uses
      genfs_contexts
      [   67.449678] SELinux: initialized (dev autofs, type autofs), uses
      genfs_contexts
      [   92.631060] BUG: unable to handle kernel NULL pointer dereference at
      (null)
      [   92.631435] IP: [<          (null)>]           (null)
      [   92.631645] PGD 0
      [   92.631846] Oops: 0010 [#1] SMP
      [   92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror
      dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp
      parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event
      snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm
      snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore
      snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd
      uhci_hcd
      [   92.634294] CPU 0
      [   92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
      [   92.634294] RIP: 0010:[<0000000000000000>]  [<          (null)>]
      (null)
      [   92.634294] RSP: 0018:ffff880245fc7cb0  EFLAGS: 00010282
      [   92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX:
      0000000000000000
      [   92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI:
      ffff88024827ad00
      [   92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09:
      0000000000000000
      [   92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12:
      ffff880254735380
      [   92.634294] R13: ffff880254735380 R14: 0000000000000000 R15:
      7fffffffffff0218
      [   92.634294] FS:  00007f4516ccd6f0(0000) GS:ffff880256600000(0000)
      knlGS:0000000000000000
      [   92.634294] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4:
      00000000000007f0
      [   92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000
      [   92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
      0000000000000400
      [   92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000,
      task ffff880254b8cac0)
      [   92.634294] Stack:
      [   92.634294]  ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8
      ffff880245fc7d68
      [   92.634294]  ffffffff81385083 00000000001d2680 ffff8802547353a8
      ffff880245fc7d18
      [   92.634294]  ffffffff8105903a ffff88024827ad60 0000000000000002
      00000000000000ff
      [   92.634294] Call Trace:
      [   92.634294]  [<ffffffff813837a7>] ? tcp_finish_connect+0x2c/0xfa
      [   92.634294]  [<ffffffff81385083>] tcp_rcv_state_process+0x2b6/0x9c6
      [   92.634294]  [<ffffffff8105903a>] ? sched_clock_cpu+0xc3/0xd1
      [   92.634294]  [<ffffffff81059073>] ? local_clock+0x2b/0x3c
      [   92.634294]  [<ffffffff8138caf3>] tcp_v4_do_rcv+0x63a/0x670
      [   92.634294]  [<ffffffff8133278e>] release_sock+0x128/0x1bd
      [   92.634294]  [<ffffffff8139f060>] __inet_stream_connect+0x1b1/0x352
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8104b333>] ? wake_up_bit+0x25/0x25
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8139f223>] ? inet_stream_connect+0x22/0x4b
      [   92.634294]  [<ffffffff8139f234>] inet_stream_connect+0x33/0x4b
      [   92.634294]  [<ffffffff8132e8cf>] sys_connect+0x78/0x9e
      [   92.634294]  [<ffffffff813fd407>] ? sysret_check+0x1b/0x56
      [   92.634294]  [<ffffffff81088503>] ? __audit_syscall_entry+0x195/0x1c8
      [   92.634294]  [<ffffffff811cc26e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [   92.634294]  [<ffffffff813fd3e2>] system_call_fastpath+0x16/0x1b
      [   92.634294] Code:  Bad RIP value.
      [   92.634294] RIP  [<          (null)>]           (null)
      [   92.634294]  RSP <ffff880245fc7cb0>
      [   92.634294] CR2: 0000000000000000
      [   92.648982] ---[ end trace 24e2bed94314c8d9 ]---
      [   92.649146] Kernel panic - not syncing: Fatal exception in interrupt
      
      Fix this using inet_sk_rx_dst_set(), and export this function in case
      IPv6 is modular.
      Reported-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      63d02d15
    • Eric Dumazet's avatar
      time: jiffies_delta_to_clock_t() helper to the rescue · a399a805
      Eric Dumazet authored
      Various /proc/net files sometimes report crazy timer values, expressed
      in clock_t units.
      
      This happens when an expired timer delta (expires - jiffies) is passed
      to jiffies_to_clock_t().
      
      This function has an overflow in :
      
      return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
      
      commit cbbc719f
      
       (time: Change jiffies_to_clock_t() argument type
      to unsigned long) only got around the problem.
      
      As we cant output negative values in /proc/net/tcp without breaking
      various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper
      that caps the negative delta to a 0 value.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarMaciej Żenczykowski <maze@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: hank <pyu@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a399a805
  36. 06 Aug, 2012 1 commit
  37. 31 Jul, 2012 1 commit
    • Andrew Morton's avatar
      memcg: rename config variables · c255a458
      Andrew Morton authored
      
      
      Sanity:
      
      CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
      CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
      
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c255a458