1. 12 May, 2016 2 commits
  2. 26 Apr, 2016 2 commits
  3. 25 Apr, 2016 6 commits
  4. 24 Apr, 2016 1 commit
    • Eric Dumazet's avatar
      tcp-tso: do not split TSO packets at retransmit time · 10d3be56
      Eric Dumazet authored
      Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
      This was fine back in the days when TSO/GSO were emerging, with their
      bugs, but we believe the dark age is over.
      Keeping big packets in write queues, but also in stack traversal
      has a lot of benefits.
       - Less memory overhead, because write queues have less skbs
       - Less cpu overhead at ACK processing.
       - Better SACK processing, as lot of studies mentioned how
         awful linux was at this ;)
       - Less cpu overhead to send the rtx packets
         (IP stack traversal, netfilter traversal, drivers...)
       - Better latencies in presence of losses.
       - Smaller spikes in fq like packet schedulers, as retransmits
         are not constrained by TCP Small Queues.
      1 % packet losses are common today, and at 100Gbit speeds, this
      translates to ~80,000 losses per second.
      Losses are often correlated, and we see many retransmit events
      leading to 1-MSS train of packets, at the time hosts are already
      under stress.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  5. 23 Apr, 2016 6 commits
  6. 21 Apr, 2016 5 commits
  7. 20 Apr, 2016 2 commits
  8. 19 Apr, 2016 2 commits
  9. 18 Apr, 2016 2 commits
  10. 17 Apr, 2016 1 commit
  11. 16 Apr, 2016 2 commits
    • Alexander Duyck's avatar
      ip_tunnel_core: iptunnel_handle_offloads returns int and doesn't free skb · aed069df
      Alexander Duyck authored
      This patch updates the IP tunnel core function iptunnel_handle_offloads so
      that we return an int and do not free the skb inside the function.  This
      actually allows us to clean up several paths in several tunnels so that we
      can free the skb at one point in the path without having to have a
      secondary path if we are supporting tunnel offloads.
      In addition it should resolve some double-free issues I have found in the
      tunnels paths as I believe it is possible for us to end up triggering such
      an event in the case of fou or gue.
      Signed-off-by: default avatarAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Hannes Frederic Sowa's avatar
      vxlan: synchronously and race-free destruction of vxlan sockets · 0412bd93
      Hannes Frederic Sowa authored
      Due to the fact that the udp socket is destructed asynchronously in a
      work queue, we have some nondeterministic behavior during shutdown of
      vxlan tunnels and creating new ones. Fix this by keeping the destruction
      process synchronous in regards to the user space process so IFF_UP can
      be reliably set.
      udp_tunnel_sock_release destroys vs->sock->sk if reference counter
      indicates so. We expect to have the same lifetime of vxlan_sock and
      vxlan_sock->sock->sk even in fast paths with only rcu locks held. So
      only destruct the whole socket after we can be sure it cannot be found
      by searching vxlan_net->sock_list.
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jiri Benc <jbenc@redhat.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  12. 15 Apr, 2016 5 commits
    • Xin Long's avatar
      sctp: export some apis or variables for sctp_diag and reuse some for proc · 626d16f5
      Xin Long authored
      For some main variables in sctp.ko, we couldn't export it to other modules,
      so we have to define some api to access them.
      It will include sctp transport and endpoint's traversal.
      There are some transport traversal functions for sctp_diag, we can also
      use it for sctp_proc. cause they have the similar situation to traversal
      - rhashtable_walk_init need the parameter gfp, because of recent upstrem
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Xin Long's avatar
      sctp: add sctp_info dump api for sctp_diag · 52c52a61
      Xin Long authored
      sctp_diag will dump some important details of sctp's assoc or ep, we use
      sctp_info to describe them,  sctp_get_sctp_info to get them, and export
      it to sctp_diag.ko.
      - we will not use list_for_each_safe in sctp_get_sctp_info, cause
        all the callers of it will use lock_sock.
      - fix the holes in struct sctp_info with __reserved* field.
        because sctp_diag is a new feature, and sctp_info is just for now,
        it may be changed in the future.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Marcelo Ricardo Leitner's avatar
      sctp: simplify sk_receive_queue locking · 311b2177
      Marcelo Ricardo Leitner authored
      SCTP already serializes access to rcvbuf through its sock lock:
      sctp_recvmsg takes it right in the start and release at the end, while
      rx path will also take the lock before doing any socket processing. On
      sctp_rcv() it will check if there is an user using the socket and, if
      there is, it will queue incoming packets to the backlog. The backlog
      processing will do the same. Even timers will do such check and
      re-schedule if an user is using the socket.
      Simplifying this will allow us to remove sctp_skb_list_tail and get ride
      of some expensive lockings.  The lists that it is used on are also
      mangled with functions like __skb_queue_tail and __skb_unlink in the
      same context, like on sctp_ulpq_tail_event() and sctp_clear_pd().
      sctp_close() will also purge those while using only the sock lock.
      Therefore the lockings performed by sctp_skb_list_tail() are not
      necessary. This patch removes this function and replaces its calls with
      just skb_queue_splice_tail_init() instead.
      The biggest gain is at sctp_ulpq_tail_event(), because the events always
      contain a list, even if it's queueing a single skb and this was
      triggering expensive calls to spin_lock_irqsave/_irqrestore for every
      data chunk received.
      As SCTP will deliver each data chunk on a corresponding recvmsg, the
      more effective the change will be.
      Before this patch, with chunks with 30 bytes:
      netperf -t SCTP_STREAM -H -cC -l 60 -- -m 30 -S 400000
      400000 -s 400000 400000
      on a 10Gbit link with 1500 MTU:
      SCTP STREAM TEST from ( port 0 AF_INET to () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      425984 425984     30    60.00       137.45   7.34     7.36     52.504  52.608
      With it:
      SCTP STREAM TEST from ( port 0 AF_INET to () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      425984 425984     30    60.00       179.10   7.97     6.70     43.740  36.788
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      tcp: do not mess with listener sk_wmem_alloc · b3d05147
      Eric Dumazet authored
      When removing sk_refcnt manipulation on synflood, I missed that
      using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
      transitioned to 0.
      We should hold sk_refcnt instead, but this is a big deal under attack.
      (Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)
      In this patch, I chose to not attach a socket to syncookies skb.
      Performance is now 5 Mpps instead of 3.2 Mpps.
      Following patch will remove last known false sharing in
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Jiri Pirko's avatar
      devlink: fix sb register stub in case devlink is disabled · de33efd0
      Jiri Pirko authored
      Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
      Fixes: bf797471 ("devlink: add shared buffer configuration")
      Signed-off-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  13. 14 Apr, 2016 4 commits
    • Craig Gallek's avatar
      soreuseport: fix ordering for mixed v4/v6 sockets · d894ba18
      Craig Gallek authored
      With the SO_REUSEPORT socket option, it is possible to create sockets
      in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
      This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
      the AF_INET6 sockets.
      Prior to the commits referenced below, an incoming IPv4 packet would
      always be routed to a socket of type AF_INET when this mixed-mode was used.
      After those changes, the same packet would be routed to the most recently
      bound socket (if this happened to be an AF_INET6 socket, it would
      have an IPv4 mapped IPv6 address).
      The change in behavior occurred because the recent SO_REUSEPORT optimizations
      short-circuit the socket scoring logic as soon as they find a match.  They
      did not take into account the scoring logic that favors AF_INET sockets
      over AF_INET6 sockets in the event of a tie.
      To fix this problem, this patch changes the insertion order of AF_INET
      and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
      have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
      list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
      the tail of the list.  This will force AF_INET sockets to always be
      considered first.
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: default avatarMaciej Żenczykowski <maze@google.com>
      Signed-off-by: default avatarCraig Gallek <kraig@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Martin KaFai Lau's avatar
      ipv6: udp: Do a route lookup and update during release_cb · e646b657
      Martin KaFai Lau authored
      This patch adds a release_cb for UDPv6.  It does a route lookup
      and updates sk->sk_dst_cache if it is needed.  It picks up the
      left-over job from ip6_sk_update_pmtu() if the sk was owned
      by user during the pmtu update.
      It takes a rcu_read_lock to protect the __sk_dst_get() operations
      because another thread may do ip6_dst_store() without taking the
      sk lock (e.g. sendmsg).
      Fixes: 45e4fd26 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Reported-by: default avatarWei Wang <weiwan@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Martin KaFai Lau's avatar
      ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update · 33c162a9
      Martin KaFai Lau authored
      There is a case in connected UDP socket such that
      getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
      sequence could be the following:
      1. Create a connected UDP socket
      2. Send some datagrams out
      3. Receive a ICMPV6_PKT_TOOBIG
      4. No new outgoing datagrams to trigger the sk_dst_check()
         logic to update the sk->sk_dst_cache.
      5. getsockopt(IPV6_MTU) returns the mtu from the invalid
         sk->sk_dst_cache instead of the newly created RTF_CACHE clone.
      This patch updates the sk->sk_dst_cache for a connected datagram sk
      during pmtu-update code path.
      Note that the sk->sk_v6_daddr is used to do the route lookup
      instead of skb->data (i.e. iph).  It is because a UDP socket can become
      connected after sending out some datagrams in un-connected state.  or
      It can be connected multiple times to different destinations.  Hence,
      iph may not be related to where sk is currently connected to.
      It is done under '!sock_owned_by_user(sk)' condition because
      the user may make another ip6_datagram_connect()  (i.e changing
      the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
      code path.
      For the sock_owned_by_user(sk) == true case, the next patch will
      introduce a release_cb() which will update the sk->sk_dst_cache.
      Server (Connected UDP Socket):
      Route Details:
      [root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
      2fac::/64 dev eth0  proto kernel  metric 256  pref medium
      2fac:face::/64 via 2fac::face dev eth0  metric 1024  pref medium
      A simple python code to create a connected UDP socket:
      import socket
      import errno
      HOST = '2fac::1'
      PORT = 8080
      s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
      s.bind((HOST, PORT))
      s.connect(('2fac:face::face', 53))
      while True:
      	data = s.recv(1024)
          except socket.error as se:
      	if se.errno == errno.EMSGSIZE:
      		pmtu = s.getsockopt(41, 24)
      		print("PMTU:%d" % pmtu)
      Python program output after getting a ICMPV6_PKT_TOOBIG:
      [root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
      Cache routes after recieving TOOBIG:
      [root@arch-fb-vm1 ~]# ip -6 r show table cache
      2fac:face::face via 2fac::face dev eth0  metric 0
          cache  expires 463sec mtu 1300 pref medium
      Client (Send the ICMPV6_PKT_TOOBIG):
      scapy is used to generate the TOOBIG message.  Here is the scapy script I have
      >>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
      1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
      >>> sendp(p, iface='qemubr0')
      Fixes: 45e4fd26 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Reported-by: default avatarWei Wang <weiwan@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Jiri Pirko's avatar
      devlink: implement shared buffer occupancy monitoring interface · df38dafd
      Jiri Pirko authored
      User needs to monitor shared buffer occupancy. For that, he issues a
      snapshot command in order to instruct hardware to catch current and
      maximal occupancy values, and clear command in order to clear the
      historical maximal values.
      Also port-pool and tc-pool-bind command response messages are extended to
      carry occupancy values.
      Signed-off-by: default avatarJiri Pirko <jiri@mellanox.com>
      Reviewed-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>