1. 17 Sep, 2016 1 commit
    • net: avoid sk_forward_alloc overflows · 20c64d5c
      Eric Dumazet authored
      A malicious TCP receiver, sending SACK, can force the sender to split
      skbs in write queue and increase its memory usage.
      
      Then, when the socket is closed and its write queue purged, we might
      overflow sk_forward_alloc (it becomes negative).
      
      sk_mem_reclaim() does nothing in this case, and more than 2GB
      are leaked from the TCP perspective (tcp_memory_allocated is not changed).
      
      Warnings then trigger from inet_sock_destruct() and
      sk_stream_kill_queues() seeing a non-zero sk_forward_alloc.
      
      The whole TCP stack can then stall, because TCP believes it is under memory pressure.
      
      A simple fix is to preemptively reclaim from sk_mem_uncharge().
      
      This makes sure a socket won't have more than 2 MB forward-allocated
      after a burst followed by an idle period.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
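      For illustration, a minimal sketch of the preemptive reclaim in sk_mem_uncharge()
      (include/net/sock.h), assuming the two-argument __sk_mem_reclaim() helper; the
      thresholds below are assumptions, not necessarily the exact values the patch uses:
      
      	static inline void sk_mem_uncharge(struct sock *sk, int size)
      	{
      		if (!sk_has_account(sk))
      			return;
      		sk->sk_forward_alloc += size;
      
      		/* Purging a huge write queue can return gigabytes at once;
      		 * reclaim early so sk_forward_alloc can never wrap past INT_MAX.
      		 */
      		if (unlikely(sk->sk_forward_alloc >= 1 << 21))
      			__sk_mem_reclaim(sk, 1 << 20);	/* hand ~1 MB back to protocol memory accounting */
      	}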
  2. 13 Jul, 2016 1 commit
    • dccp: limit sk_filter trim to payload · 4f0c40d9
      Willem de Bruijn authored
      DCCP verifies packet integrity, including length, at initial receive in
      dccp_invalid_packet(), and later pulls headers in dccp_enqueue_skb().
      
      A call to sk_filter in between can cause __skb_pull to wrap skb->len.
      skb_copy_datagram_msg interprets this as a negative value, so it
      (correctly) fails with EFAULT. The negative length is reported in
      ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.
      
      Introduce an sk_receive_skb variant that bounds how far a filter
      program can trim packets, and call it in dccp with the header
      length. Excessively trimmed packets are now processed normally and
      queued for reception as zero-byte payloads.
      
      Fixes: 7c657876 ("[DCCP]: Initial implementation")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
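      For illustration, a hedged sketch of the capped-trim idea; the real helper also
      runs the LSM hook and handles more error paths. bpf_prog_run_save_cb() returns
      the length the filter wants to keep, and the cap (DCCP passes its header length)
      is enforced as a lower bound:
      
      	int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
      	{
      		struct sk_filter *filter;
      		unsigned int pkt_len;
      		int err = 0;
      
      		rcu_read_lock();
      		filter = rcu_dereference(sk->sk_filter);
      		if (filter) {
      			pkt_len = bpf_prog_run_save_cb(filter->prog, skb);
      			/* never trim below 'cap', so later header pulls cannot wrap skb->len */
      			err = pkt_len ? pskb_trim(skb, max(cap, pkt_len)) : -EPERM;
      		}
      		rcu_read_unlock();
      		return err;
      	}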
  3. 20 May, 2016 1 commit
  4. 04 May, 2016 1 commit
  5. 03 May, 2016 1 commit
    • net: add __sock_wfree() helper · 1d2077ac
      Eric Dumazet authored
      Hosts sending lots of ACK packets exhibit high sock_wfree() cost
      because of a cache line miss when testing SOCK_USE_WRITE_QUEUE.
      
      We could move this flag close to sk_wmem_alloc, but it is better
      to perform the atomic_sub_and_test() on a clean cache line,
      as it avoids one extra bus transaction.
      
      skb_orphan_partial() can also take a fast path for packets that either
      are TCP ACKs or already went through another skb_orphan_partial().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
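      A hedged sketch of the helper, assuming the pre-refcount_t era where
      sk_wmem_alloc is an atomic_t; callers use it only when they already know
      SOCK_USE_WRITE_QUEUE is not set, so the flag test (and its cache line miss)
      can be skipped entirely:
      
      	void __sock_wfree(struct sk_buff *skb)
      	{
      		struct sock *sk = skb->sk;
      
      		/* only the write-memory accounting needs touching here */
      		if (atomic_sub_and_test(skb->truesize, &sk->sk_wmem_alloc))
      			__sk_free(sk);
      	}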
  6. 02 May, 2016 1 commit
    • tcp: make tcp_sendmsg() aware of socket backlog · d41a69f1
      Eric Dumazet authored
      A large sendmsg()/write() holds the socket lock for the duration of the call,
      unless the sk->sk_sndbuf limit is hit. This is bad because incoming packets
      are parked in the socket backlog for a long time.
      Critical decisions like fast retransmit might be delayed.
      Receivers have to maintain a big out-of-order queue, with additional cpu
      overhead, and possible stalls in TX once windows are full.
      
      Bidirectional flows are particularly hurt, since the backlog can become
      quite big if the copy from user space triggers IO (page faults).
      
      Some applications learned to use sendmsg() (or sendmmsg()) with small
      chunks to avoid this issue.
      
      The kernel should know better, right?
      
      Add a generic sk_flush_backlog() helper and use it right
      before a new skb is allocated. Typically we put 64KB of payload
      per skb (unless MSG_EOR is requested), and checking the socket backlog
      every 64KB gives good results.
      
      As a matter of fact, tests with TSO/GSO disabled give very nice
      results, as we manage to keep a small write queue and smaller
      perceived rtt.
      
      Note that sk_flush_backlog() maintains socket ownership,
      so it is not equivalent to a {release_sock(sk); lock_sock(sk);} pair;
      this preserves the implicit atomicity that sendmsg() was
      giving to (possibly buggy) applications.
      
      In this simple implementation, I chose to not call tcp_release_cb(),
      but we might consider this later.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
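      A hedged sketch of the helper pair; the exact check and its placement inside
      tcp_sendmsg()'s skb-allocation path are simplified here. The backlog is drained
      under the socket spinlock while ownership stays with the user context, which is
      what makes it different from release_sock()/lock_sock():
      
      	/* net/core/sock.c (sketch): drain the backlog without giving up ownership */
      	void __sk_flush_backlog(struct sock *sk)
      	{
      		spin_lock_bh(&sk->sk_lock.slock);
      		__release_sock(sk);		/* runs sk_backlog_rcv() on every parked skb */
      		spin_unlock_bh(&sk->sk_lock.slock);
      	}
      
      	/* include/net/sock.h (sketch): cheap per-new-skb check for tcp_sendmsg() */
      	static inline void sk_flush_backlog(struct sock *sk)
      	{
      		if (unlikely(sk->sk_backlog.tail))
      			__sk_flush_backlog(sk);
      	}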
  7. 27 Apr, 2016 2 commits
  8. 25 Apr, 2016 2 commits
  9. 14 Apr, 2016 1 commit
    • soreuseport: fix ordering for mixed v4/v6 sockets · d894ba18
      Craig Gallek authored
      With the SO_REUSEPORT socket option, it is possible to create sockets
      in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
      This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
      the AF_INET6 sockets.
      
      Prior to the commits referenced below, an incoming IPv4 packet would
      always be routed to a socket of type AF_INET when this mixed-mode was used.
      After those changes, the same packet would be routed to the most recently
      bound socket (if this happened to be an AF_INET6 socket, it would
      have an IPv4 mapped IPv6 address).
      
      The change in behavior occurred because the recent SO_REUSEPORT optimizations
      short-circuit the socket scoring logic as soon as they find a match.  They
      did not take into account the scoring logic that favors AF_INET sockets
      over AF_INET6 sockets in the event of a tie.
      
      To fix this problem, this patch changes the insertion order of AF_INET
      and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
      have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
      list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
      the tail of the list.  This will force AF_INET sockets to always be
      considered first.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Craig Gallek <kraig@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
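      For illustration only, a hedged sketch of the insertion policy; reuseport_insert()
      is a made-up name, and the exact list primitives differ between the UDP and TCP
      hash-insertion paths and kernel versions (a tail-insertion RCU primitive may be
      provided by this very series), so treat them as assumptions:
      
      	static void reuseport_insert(struct sock *sk, struct hlist_head *chain)
      	{
      		if (sk->sk_reuseport && sk->sk_family == AF_INET6)
      			hlist_add_tail_rcu(&sk->sk_node, chain);	/* IPv6 sockets go last */
      		else
      			hlist_add_head_rcu(&sk->sk_node, chain);	/* IPv4 sockets stay first and win ties */
      	}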
  10. 13 Apr, 2016 2 commits
    • net: force inlining of netif_tx_start/stop_queue, sock_hold, __sock_put · f9a7cbbf
      Denys Vlasenko authored
      Sometimes gcc mysteriously doesn't inline
      very small functions we expect to be inlined. See
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      Arguably, gcc should do better, but the gcc developers aren't willing
      to invest time into it and ask that __always_inline be used instead.
      
      With this .config:
      http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
      the following functions get deinlined many times.
      
      netif_tx_stop_queue: 207 copies, 590 calls:
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 80 8f e0 01 00 00 01 lock orb $0x1,0x1e0(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      netif_tx_start_queue: 47 copies, 111 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 80 a7 e0 01 00 00 fe lock andb $0xfe,0x1e0(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      sock_hold: 39 copies, 124 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 ff 87 80 00 00 00    lock incl 0x80(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      __sock_put: 6 copies, 13 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 ff 8f 80 00 00 00    lock decl 0x80(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      This patch fixes this via s/inline/__always_inline/.
      
      Code size decrease after the patch is ~2.5k:
      
          text      data      bss       dec     hex filename
      56719876  56364551 36196352 149280779 8e5d80b vmlinux_before
      56717440  56364551 36196352 149278343 8e5ce87 vmlinux
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: netfilter-devel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
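      For illustration, the shape of the change on one of the affected helpers
      (netif_tx_stop_queue() in include/linux/netdevice.h), shown as a diff sketch:
      
      	-static inline void netif_tx_stop_queue(struct netdev_queue *dev_queue)
      	+static __always_inline void netif_tx_stop_queue(struct netdev_queue *dev_queue)
      	 {
      	 	set_bit(__QUEUE_STATE_DRV_XOFF, &dev_queue->state);
      	 }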
    • sock: tighten lockdep checks for sock_owned_by_user · fafc4e1e
      Hannes Frederic Sowa authored
      sock_owned_by_user() should not be used without the socket lock held. It
      seems to be common practice to check .owned before lock reclassification,
      so provide a little helper to abstract this check away.
      
      Cc: linux-cifs@vger.kernel.org
      Cc: linux-bluetooth@vger.kernel.org
      Cc: linux-nfs@vger.kernel.org
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
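      The ownership-assertion helper this change provides looks roughly like the
      hedged sketch below; the exact lockdep accessors are from memory and should
      be treated as assumptions:
      
      	static inline bool lockdep_sock_is_held(const struct sock *csk)
      	{
      		struct sock *sk = (struct sock *)csk;
      
      		/* owned either via lock_sock() (sk_lock) or via the backlog spinlock */
      		return lockdep_is_held(&sk->sk_lock) ||
      		       lockdep_is_held(&sk->sk_lock.slock);
      	}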
  11. 07 Apr, 2016 4 commits
  12. 05 Apr, 2016 3 commits
  13. 04 Apr, 2016 6 commits
    • tcp: increment sk_drops for dropped rx packets · 532182cd
      Eric Dumazet authored
      Now that ss can report sk_drops, we can instruct TCP to increment
      this per-socket counter when it drops an incoming frame, to refine
      monitoring and debugging.
      
      The following patch takes care of listener drops.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
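      For illustration, a minimal sketch of a drop wrapper that bumps the counter
      before freeing; tcp_drop() is used here as a plausible name for such a helper
      (sk_drops is an atomic_t in struct sock):
      
      	static void tcp_drop(struct sock *sk, struct sk_buff *skb)
      	{
      		atomic_inc(&sk->sk_drops);	/* visible via ss / inet_diag */
      		__kfree_skb(skb);
      	}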
    • Eric Dumazet's avatar
      udp: no longer use SLAB_DESTROY_BY_RCU · ca065d0c
      Eric Dumazet authored
      Tom Herbert would like to avoid touching the UDP socket refcnt for encapsulated
      traffic. For this to happen, we need to use normal RCU rules, with a grace
      period before freeing a socket. UDP sockets are not short-lived in the
      high-usage case, so the added cost of call_rcu() should not be a concern.
      
      This actually removes a lot of complexity in UDP stack.
      
      Multicast receives no longer need to hold a bucket spinlock.
      
      Note that ip early demux still needs to take a reference on the socket.
      
      The same remark applies to functions used by the xt_socket and xt_TPROXY
      netfilter modules, but this might be changed later.
      
      Performance for a single UDP socket receiving flood traffic from
      many RX queues/cpus:
      
      a simple udp_rx test using a plain recvfrom() loop goes from
      374 kpps to 438 kpps, a 17% increase in the peak rate.
      
      v2: Addressed Willem de Bruijn's feedback on multicast handling
       - keep the early demux break in __udp4_lib_demux_lookup()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
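      For illustration, a hedged sketch of the receive-side effect: once sockets are
      freed only after an RCU grace period, the lookup can walk the hash chain under
      rcu_read_lock() without bumping sk_refcnt for every candidate. udp_match() and
      hslot are illustrative names, not the real helpers:
      
      	rcu_read_lock();
      	sk_for_each_rcu(sk, &hslot->head) {
      		if (udp_match(sk, saddr, sport, daddr, dport, dif)) {
      			result = sk;	/* deliver without a refcount round-trip */
      			break;
      		}
      	}
      	rcu_read_unlock();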
    • net: add SOCK_RCU_FREE socket flag · a4298e45
      Eric Dumazet authored
      We want a generic way to insert an RCU grace period before socket
      freeing, for cases where SLAB_DESTROY_BY_RCU adds too
      much overhead.
      
      SLAB_DESTROY_BY_RCU's strict rules force us to take a reference
      on the socket sk_refcnt, and this is a performance problem for UDP
      encapsulation and TCP synflood behavior, as many CPUs might
      attempt atomic operations on a shared sk_refcnt.
      
      UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
      lookup can use traditional RCU rules, without refcount changes.
      They can set the flag only once hashed and visible to other cpus.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Tested-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
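      For illustration, a sketch of how the flag is honored on the free path,
      assuming an rcu_head (sk_rcu) embedded in struct sock:
      
      	void sk_destruct(struct sock *sk)
      	{
      		if (sock_flag(sk, SOCK_RCU_FREE))
      			call_rcu(&sk->sk_rcu, __sk_destruct);	/* defer by one grace period */
      		else
      			__sk_destruct(&sk->sk_rcu);		/* free immediately, as before */
      	}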
    • sock: enable timestamping using control messages · c14ac945
      Soheil Hassas Yeganeh authored
      Currently, SO_TIMESTAMPING can only be enabled using setsockopt.
      This is very costly when users want to sample writes to gather
      tx timestamps.
      
      Add support for enabling SO_TIMESTAMPING via control messages, by
      using the tsflags field in struct sockcm_cookie (added in the previous
      patches in this series) to set the tx_flags of the last skb created in
      a sendmsg. With this patch, the timestamp recording bits in the skbuff's
      tx_flags are overridden if SO_TIMESTAMPING is passed in a cmsg.
      
      Please note that this is only effective for overriding the recording
      timestamp flags. Users should enable timestamp reporting (e.g.,
      SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
      socket options, and should then ask for SOF_TIMESTAMPING_TX_*
      via control messages per sendmsg to sample timestamps for each
      write.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
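      A hedged userspace sketch of how the new path is meant to be used on a kernel
      with this series: reporting flags still go through setsockopt(SO_TIMESTAMPING),
      and only the per-write recording flags ride in the cmsg. send_sampled() is an
      illustrative helper name:
      
      	#include <string.h>
      	#include <sys/socket.h>
      	#include <linux/net_tstamp.h>
      
      	static ssize_t send_sampled(int fd, const void *buf, size_t len)
      	{
      		char control[CMSG_SPACE(sizeof(__u32))];
      		struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
      		struct msghdr msg = {
      			.msg_iov = &iov, .msg_iovlen = 1,
      			.msg_control = control, .msg_controllen = sizeof(control),
      		};
      		struct cmsghdr *cmsg;
      		__u32 tsflags = SOF_TIMESTAMPING_TX_SOFTWARE | SOF_TIMESTAMPING_TX_SCHED;
      
      		memset(control, 0, sizeof(control));
      		cmsg = CMSG_FIRSTHDR(&msg);
      		cmsg->cmsg_level = SOL_SOCKET;
      		cmsg->cmsg_type  = SO_TIMESTAMPING;	/* recording flags for this write only */
      		cmsg->cmsg_len   = CMSG_LEN(sizeof(tsflags));
      		memcpy(CMSG_DATA(cmsg), &tsflags, sizeof(tsflags));
      
      		return sendmsg(fd, &msg, 0);
      	}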
    • sock: accept SO_TIMESTAMPING flags in socket cmsg · 3dd17e63
      Soheil Hassas Yeganeh authored
      Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
      as a basis to accept timestamping requests per write.
      
      This implementation only accepts TX recording flags (i.e.,
      SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
      SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
      control messages. Users need to set reporting flags (e.g.,
      SOF_TIMESTAMPING_OPT_ID) per socket via socket options.
      
      This commit adds a tsflags field to sockcm_cookie, set in
      __sock_cmsg_send. It only overrides the SOF_TIMESTAMPING_TX_*
      bits in sockcm_cookie.tsflags, allowing the control message
      to override the recording behavior per write while maintaining
      the value of the other flags.
      
      This patch implements validating the control message and setting
      tsflags in struct sockcm_cookie. The next commits in this series
      actually implement timestamping per write for different protocols.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
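      For illustration, a hedged fragment of the validation step described above (a
      case inside __sock_cmsg_send()); SOF_TIMESTAMPING_TX_RECORD_MASK is assumed to
      be the union of the four TX recording flags listed above:
      
      	case SO_TIMESTAMPING: {
      		u32 tsflags;
      
      		if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
      			return -EINVAL;
      		tsflags = *(u32 *)CMSG_DATA(cmsg);
      		/* only TX recording flags may be requested per write */
      		if (tsflags & ~SOF_TIMESTAMPING_TX_RECORD_MASK)
      			return -EINVAL;
      
      		sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
      		sockc->tsflags |= tsflags;	/* reporting flags set via setsockopt stay intact */
      		break;
      	}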
    • sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop · 39771b12
      Willem de Bruijn authored
      To process cmsgs of the SOL_SOCKET level in addition to
      cmsgs of another level, protocols can call sock_cmsg_send().
      This causes a double walk of the cmsghdr list, one for SOL_SOCKET
      and one for the other level.
      
      Extract the inner demultiplex logic from the loop that walks the list,
      so that it can be called directly from a walker in the protocol-specific
      code.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
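      A hedged sketch of the resulting outer loop, with __sock_cmsg_send() doing the
      per-cmsg demultiplex so a protocol walker can call it directly for SOL_SOCKET
      entries it encounters:
      
      	int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
      			   struct sockcm_cookie *sockc)
      	{
      		struct cmsghdr *cmsg;
      		int ret;
      
      		for_each_cmsghdr(cmsg, msg) {
      			if (!CMSG_OK(msg, cmsg))
      				return -EINVAL;
      			if (cmsg->cmsg_level != SOL_SOCKET)
      				continue;	/* other levels handled by the caller's own walker */
      			ret = __sock_cmsg_send(sk, msg, cmsg, sockc);
      			if (ret)
      				return ret;
      		}
      		return 0;
      	}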
  14. 11 Feb, 2016 1 commit
  15. 21 Jan, 2016 1 commit
  16. 14 Jan, 2016 7 commits
  17. 04 Jan, 2016 1 commit
    • soreuseport: define reuseport groups · ef456144
      Craig Gallek authored
      struct sock_reuseport is an optional shared structure referenced by each
      socket belonging to a reuseport group.  When a socket is bound to an
      address/port not yet in use and the reuseport flag has been set, the
      structure will be allocated and attached to the newly bound socket.
      When subsequent calls to bind are made for the same address/port, the
      shared structure will be updated to include the new socket and the
      newly bound socket will reference the group structure.
      
      Usually, when an incoming packet was destined for a reuseport group,
      all sockets in the same group needed to be considered before a
      dispatching decision was made.  With this structure, an appropriate
      socket can be found after looking up just one socket in the group.
      
      This shared structure will also allow for more complicated decisions to
      be made when selecting a socket (e.g., a BPF filter).
      
      This work is based off a similar implementation written by
      Ying Cai <ycai@google.com> for implementing policy-based reuseport
      selection.
      Signed-off-by: Craig Gallek <kraig@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
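      For illustration, a hedged sketch of the shared structure; the exact field
      layout (and later additions such as a BPF selector) is an assumption here:
      
      	struct sock_reuseport {
      		struct rcu_head		rcu;
      
      		u16			max_socks;	/* allocated slots in socks[]   */
      		u16			num_socks;	/* slots currently in use       */
      		struct sock		*socks[0];	/* all sockets in the group     */
      	};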
  18. 16 Dec, 2015 1 commit
  19. 15 Dec, 2015 2 commits
  20. 14 Dec, 2015 1 commit
    • net: fix IP early demux races · 5037e9ef
      Eric Dumazet authored
      David Wilder reported crashes caused by dst reuse.
      
      <quote David>
        I am seeing a crash on a distro V4.2.3 kernel caused by a double
        release of a dst_entry.  In ipv4_dst_destroy() the call to
        list_empty() finds a poisoned next pointer, indicating the dst_entry
        has already been removed from the list and freed. The crash occurs
        18 to 24 hours into a run of a network stress exerciser.
      </quote>
      
      Thanks to his detailed report and analysis, we were able to understand
      the core issue.
      
      IP early demux can associate a dst with the skb after a lookup in the
      TCP/UDP sockets.
      
      When the socket cache is not properly set, we want to store the dst
      into sk->sk_dst_cache for future IP early demux lookups,
      by acquiring a stable refcount on the dst.
      
      The problem is that this acquisition simply uses an atomic_inc(),
      which works well unless the dst was already queued for destruction by
      dst_release() noticing that the dst refcount went to zero (when DST_NOCACHE
      was set on the dst).
      
      We need to make sure current refcount is not zero before incrementing
      it, or risk double free as David reported.
      
      This patch, being a stable candidate, adds two new helpers, and uses
      them only in the problematic IP early demux paths.
      
      It might be possible to merge skb_dst_force() and skb_dst_force_safe()
      in net-next, but I prefer having the smallest patch for stable
      kernels: maybe some skb_dst_force() callers do not expect that skb->dst
      can suddenly be cleared.
      
      Can probably be backported to linux-3.6 kernels.
      Reported-by: David J. Wilder <dwilder@us.ibm.com>
      Tested-by: David J. Wilder <dwilder@us.ibm.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
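      A hedged sketch of the two helpers described above, in the pre-refcount_t era
      where dst->__refcnt is an atomic_t; the ref is taken only if the dst is not
      already on its way to destruction, and skb->dst is cleared if we lost the race:
      
      	static inline bool dst_hold_safe(struct dst_entry *dst)
      	{
      		return atomic_inc_not_zero(&dst->__refcnt);
      	}
      
      	static inline void skb_dst_force_safe(struct sk_buff *skb)
      	{
      		if (skb_dst_is_noref(skb)) {
      			struct dst_entry *dst = skb_dst(skb);
      
      			if (!dst_hold_safe(dst))
      				dst = NULL;	/* lost the race with dst_release() */
      
      			skb->_skb_refdst = (unsigned long)dst;
      		}
      	}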