1. 31 Jul, 2013 8 commits
    • Joe Perches's avatar
      addrconf.h: Remove extern function prototypes · e8e54d3c
      Joe Perches authored
      There are a mix of function prototypes with and without extern
      in the kernel sources.  Standardize on not using extern for
      function prototypes.
      
      Function prototypes don't need to be written with extern.
      extern is assumed by the compiler.  Its use is as unnecessary as
      using auto to declare automatic/local variables in a block.
      
      Reflow modified prototypes to 80 columns.
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e8e54d3c
    • Stefan Tomanek's avatar
      fib_rules: add .suppress operation · 7764a45a
      Stefan Tomanek authored
      This change adds a new operation to the fib_rules_ops struct; it allows the
      suppression of routing decisions if certain criteria are not met by its
      results.
      
      The first implemented constraint is a minimum prefix length added to the
      structures of routing rules. If a rule is added with a minimum prefix length
      >0, only routes meeting this threshold will be considered. Any other (more
      general) routing table entries will be ignored.
      
      When configuring a system with multiple network uplinks and default routes, it
      is often convinient to reference the main routing table multiple times - but
      omitting the default route. Using this patch and a modified "ip" utility, this
      can be achieved by using the following command sequence:
      
        $ ip route add table secuplink default via 10.42.23.1
      
        $ ip rule add pref 100            table main prefixlength 1
        $ ip rule add pref 150 fwmark 0xA table secuplink
      
      With this setup, packets marked 0xA will be processed by the additional routing
      table "secuplink", but only if no suitable route in the main routing table can
      be found. By using a minimal prefixlength of 1, the default route (/0) of the
      table "main" is hidden to packets processed by rule 100; packets traveling to
      destinations with more specific routing entries are processed as usual.
      Signed-off-by: default avatarStefan Tomanek <stefan.tomanek@wertarbyte.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7764a45a
    • Joe Perches's avatar
      net: Remove extern from include/net/ scheduling prototypes · 5c15257f
      Joe Perches authored
      There are a mix of function prototypes with and without extern
      in the kernel sources.  Standardize on not using extern for
      function prototypes.
      
      Function prototypes don't need to be written with extern.
      extern is assumed by the compiler.  Its use is as unnecessary as
      using auto to declare automatic/local variables in a block.
      
      Reflow modified prototypes to 80 columns.
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5c15257f
    • Eric Dumazet's avatar
      net: skb_orphan() changes · c34a7612
      Eric Dumazet authored
      It is illegal to set skb->sk without corresponding destructor.
      
      Its therefore safe for skb_orphan() to not clear skb->sk if
      skb->destructor is not set.
      
      Also avoid clearing skb->destructor if already NULL.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c34a7612
    • Eric Dumazet's avatar
      netem: Introduce skb_orphan_partial() helper · f2f872f9
      Eric Dumazet authored
      Commit 547669d4 ("tcp: xps: fix reordering issues") added
      unexpected reorders in case netem is used in a MQ setup for high
      performance test bed.
      
      ETH=eth0
      tc qd del dev $ETH root 2>/dev/null
      tc qd add dev $ETH root handle 1: mq
      for i in `seq 1 32`
      do
       tc qd add dev $ETH parent 1:$i netem delay 100ms
      done
      
      As all tcp packets are orphaned by netem, TCP stack believes it can
      set skb->ooo_okay on all packets.
      
      In order to allow producers to send more packets, we want to
      keep sk_wmem_alloc from reaching sk_sndbuf limit.
      
      We can do that by accounting one byte per skb in netem queues,
      so that TCP stack is not fooled too much.
      
      Tested:
      
      With above MQ/netem setup, scaling number of concurrent flows gives
      linear results and no reorders/retransmits
      
      lpq83:~# for n in 1 10 20 30 40 50 60 70 80 90 100
       do echo -n "n:$n " ; ./super_netperf $n -H 10.7.7.84; done
      n:1 198.46
      n:10 2002.69
      n:20 4000.98
      n:30 6006.35
      n:40 8020.93
      n:50 10032.3
      n:60 12081.9
      n:70 13971.3
      n:80 16009.7
      n:90 17117.3
      n:100 17425.5
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f2f872f9
    • fan.du's avatar
      net: split rt_genid for ipv4 and ipv6 · ca4c3fc2
      fan.du authored
      Current net name space has only one genid for both IPv4 and IPv6, it has below
      drawbacks:
      
      - Add/delete an IPv4 address will invalidate all IPv6 routing table entries.
      - Insert/remove XFRM policy will also invalidate both IPv4/IPv6 routing table
        entries even when the policy is only applied for one address family.
      
      Thus, this patch attempt to split one genid for two to cater for IPv4 and IPv6
      separately in a fine granularity.
      Signed-off-by: default avatarFan Du <fan.du@windriver.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ca4c3fc2
    • Dmitry Popov's avatar
      tcp: Remove unused tcpct declarations and comments · c0155b2d
      Dmitry Popov authored
      Remove declaration, 4 defines and confusing comment that are no longer used
      since 1a2c6181 ("tcp: Remove TCPCT").
      Signed-off-by: default avatarDmitry Popov <dp@highloadlab.com>
      Acked-by: default avatarChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c0155b2d
    • Jacob Keller's avatar
      PCI: Add function to obtain minimum link width and speed · 81377c8d
      Jacob Keller authored
      A PCI Express device can potentially report a link width and speed which it will
      not properly fulfill due to being plugged into a slower link higher in the
      chain. This function walks up the PCI bus chain and calculates the minimum link
      width and speed of this entire chain. This can be useful to enable a device to
      determine if it has enough bandwidth for optimum functionality.
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Tested-by: default avatarPhil Schmitt <phillip.j.schmitt@intel.com>
      Acked-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      81377c8d
  2. 30 Jul, 2013 4 commits
  3. 29 Jul, 2013 1 commit
  4. 27 Jul, 2013 5 commits
  5. 24 Jul, 2013 4 commits
    • Thomas Gleixner's avatar
      net: Make devnet_rename_seq static · 18afa4b0
      Thomas Gleixner authored
      No users outside net/core/dev.c.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      18afa4b0
    • Eric Dumazet's avatar
      tcp: TCP_NOTSENT_LOWAT socket option · c9bee3b7
      Eric Dumazet authored
      Idea of this patch is to add optional limitation of number of
      unsent bytes in TCP sockets, to reduce usage of kernel memory.
      
      TCP receiver might announce a big window, and TCP sender autotuning
      might allow a large amount of bytes in write queue, but this has little
      performance impact if a large part of this buffering is wasted :
      
      Write queue needs to be large only to deal with large BDP, not
      necessarily to cope with scheduling delays (incoming ACKS make room
      for the application to queue more bytes)
      
      For most workloads, using a value of 128 KB or less is OK to give
      applications enough time to react to POLLOUT events in time
      (or being awaken in a blocking sendmsg())
      
      This patch adds two ways to set the limit :
      
      1) Per socket option TCP_NOTSENT_LOWAT
      
      2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
      not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
      Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
      
      This changes poll()/select()/epoll() to report POLLOUT
      only if number of unsent bytes is below tp->nosent_lowat
      
      Note this might increase number of sendmsg()/sendfile() calls
      when using non blocking sockets,
      and increase number of context switches for blocking sockets.
      
      Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
      defined as :
       Specify the minimum number of bytes in the buffer until
       the socket layer will pass the data to the protocol)
      
      Tested:
      
      netperf sessions, and watching /proc/net/protocols "memory" column for TCP
      
      With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
      used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      Using 128KB has no bad effect on the throughput or cpu usage
      of a single flow, although there is an increase of context switches.
      
      A bonus is that we hold socket lock for a shorter amount
      of time and should improve latencies of ACK processing.
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
                 412,514 context-switches
      
           200.034645535 seconds time elapsed
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
               2,675,818 context-switches
      
           200.029651391 seconds time elapsed
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-By: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c9bee3b7
    • Eric Dumazet's avatar
      net: add sk_stream_is_writeable() helper · 64dc6130
      Eric Dumazet authored
      Several call sites use the hardcoded following condition :
      
      sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)
      
      Lets use a helper because TCP_NOTSENT_LOWAT support will change this
      condition for TCP sockets.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      64dc6130
    • Daniel Borkmann's avatar
      net: sctp: trivial: update mailing list address · 91705c61
      Daniel Borkmann authored
      The SCTP mailing list address to send patches or questions
      to is linux-sctp@vger.kernel.org and not
      lksctp-developers@lists.sourceforge.net anymore. Therefore,
      update all occurences.
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Acked-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      91705c61
  6. 23 Jul, 2013 3 commits
  7. 22 Jul, 2013 4 commits
  8. 18 Jul, 2013 1 commit
    • Eric Dumazet's avatar
      vlan: mask vlan prio bits · d4b812de
      Eric Dumazet authored
      In commit 48cc32d3
      ("vlan: don't deliver frames for unknown vlans to protocols")
      Florian made sure we set pkt_type to PACKET_OTHERHOST
      if the vlan id is set and we could find a vlan device for this
      particular id.
      
      But we also have a problem if prio bits are set.
      
      Steinar reported an issue on a router receiving IPv6 frames with a
      vlan tag of 4000 (id 0, prio 2), and tunneled into a sit device,
      because skb->vlan_tci is set.
      
      Forwarded frame is completely corrupted : We can see (8100:4000)
      being inserted in the middle of IPv6 source address :
      
      16:48:00.780413 IP6 2001:16d8:8100:4000:ee1c:0:9d9:bc87 >
      9f94:4d95:2001:67c:29f4::: ICMP6, unknown icmp6 type (0), length 64
             0x0000:  0000 0029 8000 c7c3 7103 0001 a0ae e651
             0x0010:  0000 0000 ccce 0b00 0000 0000 1011 1213
             0x0020:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223
             0x0030:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233
      
      It seems we are not really ready to properly cope with this right now.
      
      We can probably do better in future kernels :
      vlan_get_ingress_priority() should be a netdev property instead of
      a per vlan_dev one.
      
      For stable kernels, lets clear vlan_tci to fix the bugs.
      Reported-by: default avatarSteinar H. Gunderson <sesse@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d4b812de
  9. 16 Jul, 2013 10 commits