1. 26 Nov, 2008 3 commits
  2. 25 Nov, 2008 14 commits
  3. 24 Nov, 2008 16 commits
    • tcp: handle shift/merge of cloned skbs too · 0ace2856
      Ilpo Järvinen authored
      This caused me to repeatedly get:
      
        tcpdump: pcap_loop: recvfrom: Bad address
      
      Happens occasionally when I tcpdump my for-looped test xfers:
        while [ : ]; do echo -n "$(date '+%s.%N') "; ./sendfile; sleep 20; done
      
      Rest of the relevant commands:
        ethtool -K eth0 tso off
        tc qdisc add dev eth0 root netem drop 4%
        tcpdump -n -s0 -i eth0 -w sacklog.all
      
      Running net-next under kvm, connection goes to the same host
      (basically just out of kvm). The connection itself works ok
      and data gets sent without corruption even with a large
      number of tests while tcpdump fails usually within less than
      5 tests.
      
      Whether it only happens because of this change or not, I
      don't know for sure but it's the only thing with which
      I've seen that error. The non-cloned variant works w/o it
      for a much longer time. I have yet to debug where the error
      actually comes from.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Ilpo Järvinen's avatar
      111cc8b9
    • tcp: Make shifting not clear the hints · 92ee76b6
      Ilpo Järvinen authored
      The earlier version was a very basic one that "played safe"
      by always clearing the hints. However, clearing a hint is an
      extremely costly operation with large windows, so it must be
      avoided whenever possible. Shifting, too, can be done without
      clearing the hints.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Try to restore large SKBs while SACK processing · 832d11c5
      Ilpo Järvinen authored
      During SACK processing, most of the benefits of TSO are eaten by
      the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
      Then we're in trouble when the cleanup work for them has to be
      done when a large cumulative ACK arrives. Try to return to the
      pre-split state already while more and more SACK info gets
      discovered, by combining newly discovered SACK areas with the
      previous skb if that's SACKed as well.
      
      This approach has a number of benefits:
      
      1) The processing overhead is spread more evenly over the RTT
      2) The write queue has fewer skbs to process (this affects
         everything that has to walk the queue past the SACKed areas)
      3) The write queue is consistent the whole time, so no other
         parts of TCP have to be aware of this (this was not the case
         with some other approach that was, well, quite intrusive all
         around).
      4) Clean_rtx_queue can release most of the pages using a single
         put_page instead of the previous PAGE_SIZE/mss+1 calls
      
      In case a hole is fully filled by the new SACK block, we attempt
      to combine the next skb too. This allows construction of skbs
      that are even larger than what TSO split them to, and it handles
      the hole-on-every-nth-segment patterns that often occur during
      slow start overshoot pretty nicely. For this to be really
      useful, though, a retransmission would also have to get lost,
      since cumulative ACKs advance one hole at a time in the most
      typical case.
      
      TODO: handle upwards-only merging. That should be rather easy
      when the segment is fully SACKed, but I'm leaving that as a
      future work item (it won't make a very large difference anyway
      since the current approach already covers quite a lot of normal
      cases).
      
      I was earlier thinking of some sophisticated way of tracking
      timestamps of the first and the last segment but later on
      realized that it won't be that necessary at all to store the
      timestamp of the last segment. The cases that can occur are
      basically either:
        1) ambiguous => no sensible measurement can be taken anyway
        2) non-ambiguous due to reordering => having the timestamp
           of the last segment there would skew things more than it
           would help, since the ack got triggered by one of the
           holes (besides some subtle issues that would make
           determining the right hole/skb an even harder problem).
           Anyway, it has nothing to do with this change then.
      
      I chose to route some abnormal-looking cases with goto noop;
      some could be handled differently (e.g., by stopping the
      walk at that skb, but again). In general, they either
      shouldn't happen at all or are rare enough to make no difference
      in practice.
      
      In theory this change (as a whole) could cause some macroscale
      (global) regression because of cache misses that are taken over
      the round-trip time, but it very likely gets better because of
      far fewer (local) cache misses for other write queue walkers and
      for the big recovery-clearing cumulative ACK.
      
      It is worth noting that these benefits would be very easy to get
      even without TSO/GSO being on, as long as the data is in pages so
      that we can merge them. Currently I won't let that happen because
      DSACK splitting at a fragment would mess up pcounts due to
      sk_can_gso in tcp_set_skb_tso_segs. Once DSACK fragments are
      avoided, some conditions can be made less strict.
      
      TODO: I will probably have to convert the excessive pointer
      passing to struct sacktag_state... :-)
      
      My testing revealed that a considerable number of skbs couldn't
      be shifted because they were cloned (most likely still awaiting
      tx reclaim)...
      
      [The rest describes future work instead, since I repeatedly
      got EFAULT from tcpdump's recvfrom when I added
      pskb_expand_head to deal with clones, so I separated that
      into another, later patch]
      
      ...To counter that, I gave up on the fifth advantage:
      
      5) When growing the previous SACK block, fewer allocs for new
         skbs are done; basically a new alloc is needed only when a new
         hole is detected or when the previous skb runs out of frags space

      ...which now only happens if reclaim is fast enough to dispose of
      the clone before the SACK block comes in (the window is RTT long),
      otherwise we'll have to alloc some.
      
      With clones being handled I got these numbers (will be somewhat
      worse without that), taken with fine-grained mibs:
      
                        TCPSackShifted 398
                         TCPSackMerged 877
                  TCPSackShiftFallback 320
            TCPSACKCOLLAPSEFALLBACKGSO 0
        TCPSACKCOLLAPSEFALLBACKSKBBITS 0
        TCPSACKCOLLAPSEFALLBACKSKBDATA 0
          TCPSACKCOLLAPSEFALLBACKBELOW 0
          TCPSACKCOLLAPSEFALLBACKFIRST 1
       TCPSACKCOLLAPSEFALLBACKPREVBITS 318
            TCPSACKCOLLAPSEFALLBACKMSS 1
         TCPSACKCOLLAPSEFALLBACKNOHEAD 0
          TCPSACKCOLLAPSEFALLBACKSHIFT 0
                TCPSACKCOLLAPSENOOPSEQ 0
        TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
           TCPSACKCOLLAPSENOOPSMALLLEN 0
                   TCPSACKCOLLAPSEHOLE 12
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
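The core idea of the commit above, folding a newly SACKed segment into the previous one when that one is already SACKed, can be illustrated with a small userspace sketch. This is not the kernel code: `struct seg`, the array-based queue, and `sack_and_merge()` are hypothetical names invented for the illustration, and real skbs carry page fragments and packet counts that are ignored here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb in the write queue. */
struct seg {
	unsigned int start, end;	/* sequence range [start, end) */
	int sacked;			/* non-zero once covered by a SACK block */
};

/*
 * Mark segment idx as SACKed. If the previous segment is SACKed too and
 * ends exactly where this one starts, fold this segment into it and close
 * the gap in the array. Returns the new number of segments in the queue.
 */
static size_t sack_and_merge(struct seg *q, size_t n, size_t idx)
{
	q[idx].sacked = 1;
	if (idx == 0 || !q[idx - 1].sacked || q[idx - 1].end != q[idx].start)
		return n;		/* nothing to merge with */

	q[idx - 1].end = q[idx].end;
	for (size_t i = idx; i + 1 < n; i++)
		q[i] = q[i + 1];
	return n - 1;
}
```

Calling it once per newly SACKed segment keeps the queue short while SACK information arrives, which is the effect benefit 2) in the message describes; a later cumulative ACK then has fewer entries to release.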
    • tcp: make tcp_sacktag_one able to handle partial skb too · f58b22fd
      Ilpo Järvinen authored
      This is preparatory work for the SACK combiner patch, which may
      have to count TCP state changes for only a part of the skb
      because it will intentionally avoid splitting an skb into
      SACKed and non-SACKed parts.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Make SACK code to split only at mss boundaries · adb92db8
      Ilpo Järvinen authored
      Sadly enough, this adds a possible divide, though we try to
      avoid it by checking for one mss as the common case.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
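The trade-off mentioned above (a divide on the slow path, a plain comparison in the common case) might look roughly like the following userspace sketch. The function name `mss_split_point()` is hypothetical, and returning lengths of at most one MSS unchanged is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>

/*
 * Round a split length down to a whole number of MSS-sized chunks.
 * mss must be non-zero. Lengths of at most one MSS are returned as-is
 * (assumed common case), so no division is executed for them.
 */
static unsigned int mss_split_point(unsigned int len, unsigned int mss)
{
	if (len <= mss)			/* fast path: comparison only */
		return len;
	return (len / mss) * mss;	/* slow path: one divide */
}
```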
    • tcp: more aggressive skipping · e8bae275
      Ilpo Järvinen authored
      I knew already when rewriting the sacktag that this condition
      was too conservative. Change it now, since it prevents a lot of
      useless work (especially in the sack shifter decision code
      that is being added by a later patch). This shouldn't change
      anything really, just save some processing regardless of the
      shifter.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Ilpo Järvinen's avatar
    • tcp: collapse more than two on retransmission · 4a17fc3a
      Ilpo Järvinen authored
      I had always thought that collapsing up to two at a time was
      an intentional decision to avoid excessive processing if 1 byte
      sized skbs are to be combined for a full mtu, and consecutive
      retransmissions would make the size of the retransmittee
      double each round anyway, but some recent discussion made me
      understand that was not the case. Thus make collapse work
      more and wait less.
      
      It would be possible to take advantage of the shifting
      machinery (added in the later patch) in the case of paged
      data but that can be implemented on top of this change.
      
      tcp_skb_is_last check is now provided by the loop.
      
      I tested a bit (ss-after-idle-off, fill 4096x4096B xfer,
      10s sleep + 4096 x 1byte writes while dropping them for
      a while with netem):
      
      . 16774097:16775545(1448) ack 1 win 46
      . 16775545:16776993(1448) ack 1 win 46
      . ack 16759617 win 2399
      P 16776993:16777217(224) ack 1 win 46
      . ack 16762513 win 2399
      . ack 16765409 win 2399
      . ack 16768305 win 2399
      . ack 16771201 win 2399
      . ack 16774097 win 2399
      . ack 16776993 win 2399
      . ack 16777217 win 2399
      P 16777217:16777257(40) ack 1 win 46
      . ack 16777257 win 2399
      P 16777257:16778705(1448) ack 1 win 46
      P 16778705:16780153(1448) ack 1 win 46
      FP 16780153:16781313(1160) ack 1 win 46
      . ack 16778705 win 2399
      . ack 16780153 win 2399
      F 1:1(0) ack 16781314 win 2399
      
      Without the drop-all period I get this:
      
      . 16773585:16775033(1448) ack 1 win 46
      . ack 16764897 win 9367
      . ack 16767793 win 9367
      . ack 16770689 win 9367
      . ack 16773585 win 9367
      . 16775033:16776481(1448) ack 1 win 46
      P 16776481:16777217(736) ack 1 win 46
      . ack 16776481 win 9367
      . ack 16777217 win 9367
      P 16777217:16777218(1) ack 1 win 46
      P 16777218:16777219(1) ack 1 win 46
      P 16777219:16777220(1) ack 1 win 46
        ...
      P 16777247:16777248(1) ack 1 win 46
      . ack 16777218 win 9367
      . ack 16777219 win 9367
        ...
      . ack 16777233 win 9367
      . ack 16777248 win 9367
      P 16777248:16778696(1448) ack 1 win 46
      P 16778696:16780144(1448) ack 1 win 46
      FP 16780144:16781313(1169) ack 1 win 46
      . ack 16780144 win 9367
      F 1:1(0) ack 16781314 win 9367
      
      The window seems to be 30-40 segments, which were successfully
      combined into: P 16777217:16777257(40) ack 1 win 46
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
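The change from "collapse at most two" to "collapse as many as fit" can be pictured with a toy loop over segment lengths. This is only an illustration: `collapse_into_first()` and its `space` budget are hypothetical, and the real code works on skbs and has further conditions that are not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fold following segments into len[0] for as long as the combined length
 * stays within `space` (a stand-in for whatever limit applies to one
 * retransmitted segment). Absorbed entries are zeroed. Returns how many
 * segments were absorbed.
 */
static size_t collapse_into_first(unsigned int *len, size_t n, unsigned int space)
{
	size_t absorbed = 0;

	for (size_t i = 1; i < n; i++) {
		if (len[0] + len[i] > space)
			break;		/* next one no longer fits */
		len[0] += len[i];
		len[i] = 0;
		absorbed++;
	}
	return absorbed;
}
```

With forty 1-byte segments and a budget of 1448 bytes, the first entry grows to 40 bytes, which mirrors the single 40-byte retransmission seen in the first trace above.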
    • net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames() · a21bba94
      Eric Dumazet authored
      We can reduce pressure on the dst entry refcount, which slows down
      the UDP transmit path on SMP machines. This pressure is visible on
      RTP servers when delivering content to mediagateways, especially
      big ones handling thousands of streams. Several cpus send UDP
      frames to the same destination, hence use the same dst entry.

      This patch makes ip_push_pending_frames() steal the refcount its
      callers had to take when filling inet->cork.dst.

      This doesn't avoid all refcounting, but still gives speedups on SMP,
      on the UDP/RAW transmit path.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
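"Stealing" a reference means moving ownership of an already-held reference instead of taking a second one and dropping the first. The userspace sketch below contrasts the two; all types and function names in it (`struct dst`, `struct cork`, `struct pkt`, `attach_hold_release()`, `attach_steal()`) are hypothetical stand-ins, and the counter is a plain int rather than the atomic counter a kernel would use.

```c
#include <assert.h>
#include <stddef.h>

struct dst  { int refcnt; };
struct cork { struct dst *dst; };	/* holds one reference while corked */
struct pkt  { struct dst *dst; };	/* needs one reference when sent */

static void dst_hold(struct dst *d)    { d->refcnt++; }
static void dst_release(struct dst *d) { d->refcnt--; }

/* Before: take a new reference for the packet, then drop the cork's. */
static void attach_hold_release(struct pkt *p, struct cork *c)
{
	dst_hold(c->dst);
	p->dst = c->dst;
	dst_release(c->dst);
	c->dst = NULL;
}

/* After: hand the cork's reference to the packet; the counter is untouched. */
static void attach_steal(struct pkt *p, struct cork *c)
{
	p->dst = c->dst;
	c->dst = NULL;
}
```

Both variants leave the counter at the same value, but the second one never writes to it. On SMP that matters because every write to a shared counter pulls its cache line away from the other CPUs using the same dst entry.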
    • net: avoid a pair of dst_hold()/dst_release() in ip_append_data() · 2e77d89b
      Eric Dumazet authored
      We can reduce pressure on the dst entry refcount, which slows down
      the UDP transmit path on SMP machines. This pressure is visible on
      RTP servers when delivering content to mediagateways, especially
      big ones handling thousands of streams. Several cpus send UDP
      frames to the same destination, hence use the same dst entry.

      This patch makes ip_append_data() eventually steal the refcount its
      callers had to take on the dst entry.

      This doesn't avoid all refcounting, but still gives speedups on SMP,
      on the UDP/RAW transmit path.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: gen_estimator: Fix gen_kill_estimator() lookups · 4db0acf3
      Jarek Poplawski authored
      gen_kill_estimator() linear list lookups are very slow, and e.g. while
      deleting a large number of HTB classes soft lockups were reported. Here
      is another try to fix this problem: this time internally, with an rbtree,
      so similar to Jamal's hashing idea IIRC. (Looking for the next hits could
      still be optimized, but it's really fast as it is.)
      Reported-by: Badalian Vyacheslav <slavon@bigtelecom.ru>
      Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
      Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
      Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • pkt_sched: sch_drr: fix drr_dequeue loop() · 3f0947c3
      Patrick McHardy authored
      Jarek Poplawski points out:
      
      If all child qdiscs of sch_drr are non-work-conserving (e.g. sch_tbf)
      drr_dequeue() will busy-loop waiting for skbs instead of leaving the
      job for a watchdog. Checking for list_empty() in each loop isn't
      necessary either, because this can never be true except the first time.
      
      Using non-work-conserving qdiscs as children of DRR makes no sense;
      simply bail out in that case.
      Reported-by: Jarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Make sure BHs are disabled in sock_prot_inuse_add() · 3755810c
      Eric Dumazet authored
      There is still a call to sock_prot_inuse_add() in af_netlink
      while in a preemptable section. Add explicit BH disable around
      this call.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Make sure BHs are disabled in sock_prot_inuse_add() · 920de804
      Eric Dumazet authored
      The rule of calling sock_prot_inuse_add() is that BHs must
      be disabled.  Some new calls were added where this was not
      true and this triggers warnings as reported by Ilpo.
      
      Fix this by adding explicit BH disabling around those call sites,
      or moving sock_prot_inuse_add() call inside an existing BH disabled
      section.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • eth: Declare an optimized compare_ether_addr_64bits() function · 1f87e235
      Eric Dumazet authored
      Linus mentioned we could try to perform long word operations, even
      on potentially unaligned addresses, on x86 at least. David mentioned
      the HAVE_EFFICIENT_UNALIGNED_ACCESS test to handle this on all
      arches that have efficient unaligned accesses.
      
      I tried this idea and got nice assembly on 32 bits:
      
      158:   33 82 38 01 00 00       xor    0x138(%edx),%eax
      15e:   33 8a 34 01 00 00       xor    0x134(%edx),%ecx
      164:   c1 e0 10                shl    $0x10,%eax
      167:   09 c1                   or     %eax,%ecx
      169:   74 0b                   je     176 <eth_type_trans+0x87>
      
      And very nice assembly on 64 bits of course (one xor, one shl)
      
      Nice oprofile improvement in eth_type_trans(), 0.17 % instead of 0.41 %,
      expected since we remove 8 instructions on a fast path.
      
      This patch implements a compare_ether_addr_64bits() function, that
      uses the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS ifdef to efficiently
      perform the 6 bytes comparison on all capable arches.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
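The "one xor, one shl" trick from the commit above can be tried in userspace. The sketch below is an approximation rather than the kernel source: the name `ether_addr_differs_64()` is made up for the example, `memcpy` stands in for the direct unaligned load, and the byte-order test relies on the GCC/Clang `__BYTE_ORDER__` macros (little endian is assumed when they are absent). As in the commit, both buffers must have two readable bytes after the 6-byte address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Compare two 6-byte Ethernet addresses with a single 64-bit xor.
 * Each buffer must be at least 8 bytes long; the last 2 bytes are loaded
 * but ignored. Returns 0 when the addresses match, non-zero otherwise.
 */
static unsigned int ether_addr_differs_64(const uint8_t a[8], const uint8_t b[8])
{
	uint64_t x, y, fold;

	/* memcpy keeps the load well defined for any alignment; compilers
	 * usually turn it into one 8-byte load where that is cheap. */
	memcpy(&x, a, sizeof(x));
	memcpy(&y, b, sizeof(y));
	fold = x ^ y;

#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	/* Big endian: the 2 padding bytes are the low 16 bits. */
	return (fold >> 16) != 0;
#else
	/* Little endian: the 2 padding bytes are the high 16 bits. */
	return (fold << 16) != 0;
#endif
}
```

The shift discards whatever happens to sit in the two padding bytes, so only the six address bytes decide the result.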
  4. 23 Nov, 2008 7 commits
    • net: Make sure BHs are disabled in sock_prot_inuse_add() · 6f756a8c
      David S. Miller authored
      The rule of calling sock_prot_inuse_add() is that BHs must
      be disabled.  Some new calls were added where this was not
      true and this triggers warnings as reported by Ilpo.
      
      Fix this by adding explicit BH disabling around those call sites.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix tunnels in netns after ndo_ changes · be77e593
      Alexey Dobriyan authored
      dev_net_set() should be the very first thing after alloc_netdev().
      
      "ndo_" changes turned a simple assignment (which is OK to do before netns
      assignment) into a quite non-trivial operation (which is not OK, init_net was
      used). This leads to incomplete initialisation of the tunnel device in netns.
      
      BUG: unable to handle kernel NULL pointer dereference at 00000004
      IP: [<c02efdb5>] ip6_tnl_exit_net+0x37/0x4f
      *pde = 00000000 
      Oops: 0000 [#1] PREEMPT DEBUG_PAGEALLOC
      last sysfs file: /sys/class/net/lo/operstate
      
      Pid: 10, comm: netns Not tainted (2.6.28-rc6 #1) 
      EIP: 0060:[<c02efdb5>] EFLAGS: 00010246 CPU: 0
      EIP is at ip6_tnl_exit_net+0x37/0x4f
      EAX: 00000000 EBX: 00000020 ECX: 00000000 EDX: 00000003
      ESI: c5caef30 EDI: c782bbe8 EBP: c7909f50 ESP: c7909f48
       DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
      Process netns (pid: 10, ti=c7908000 task=c7905780 task.ti=c7908000)
      Stack:
       c03e75e0 c7390bc8 c7909f60 c0245448 c7390bd8 c7390bf0 c7909fa8 c012577a
       00000000 00000002 00000000 c0125736 c782bbe8 c7909f90 c0308fe3 c782bc04
       c7390bd4 c0245406 c084b718 c04f0770 c03ad785 c782bbe8 c782bc04 c782bc0c
      Call Trace:
       [<c0245448>] ? cleanup_net+0x42/0x82
       [<c012577a>] ? run_workqueue+0xd6/0x1ae
       [<c0125736>] ? run_workqueue+0x92/0x1ae
       [<c0308fe3>] ? schedule+0x275/0x285
       [<c0245406>] ? cleanup_net+0x0/0x82
       [<c0125ae1>] ? worker_thread+0x81/0x8d
       [<c0128344>] ? autoremove_wake_function+0x0/0x33
       [<c0125a60>] ? worker_thread+0x0/0x8d
       [<c012815c>] ? kthread+0x39/0x5e
       [<c0128123>] ? kthread+0x0/0x5e
       [<c0103b9f>] ? kernel_thread_helper+0x7/0x10
      Code: db e8 05 ff ff ff 89 c6 e8 dc 04 f6 ff eb 08 8b 40 04 e8 38 89 f5 ff 8b 44 9e 04 85 c0 75 f0 43 83 fb 20 75 f2 8b 86 84 00 00 00 <8b> 40 04 e8 1c 89 f5 ff e8 98 04 f6 ff 89 f0 e8 f8 63 e6 ff 5b 
      EIP: [<c02efdb5>] ip6_tnl_exit_net+0x37/0x4f SS:ESP 0068:c7909f48
      ---[ end trace 6c2f2328fccd3e0c ]---
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Convert TCP/DCCP listening hash tables to use RCU · c25eb3bf
      Eric Dumazet authored
      This is the last step to be able to perform full RCU lookups
      in __inet_lookup() : After established/timewait tables, we
      add RCU lookups to listening hash table.
      
      The only trick here is that a socket of a given type (TCP ipv4,
      TCP ipv6, ...) can now move between two different tables
      (established and listening) during an RCU grace period, so we
      must use different 'nulls' end-of-chain values for the two tables.
      
      We define a large value :
      
      #define LISTENING_NULLS_BASE (1U << 29)
      
      So that slots in listening table are guaranteed to have different
      end-of-chain values than slots in established table. A reader can
      still detect it finished its lookup in the right chain.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
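How distinct end-of-chain values let a lockless reader notice that it finished in the wrong chain can be shown with a single-threaded userspace model. Everything except `LISTENING_NULLS_BASE` is invented for this sketch (`struct node`, `make_nulls()`, `chain_lookup()`), and the marker encoding (value shifted left by one with the low bit set) is an assumption modelled on how such "nulls" lists are commonly built.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LISTENING_NULLS_BASE (1U << 29)

struct node {
	int key;
	uintptr_t next;		/* a node pointer, or an odd "nulls" marker */
};

/* End-of-chain marker: low bit set, so it can never be a real pointer. */
static uintptr_t make_nulls(unsigned int value)
{
	return ((uintptr_t)value << 1) | 1;
}

static int is_nulls(uintptr_t p)
{
	return (int)(p & 1);
}

static unsigned int nulls_value(uintptr_t p)
{
	return (unsigned int)(p >> 1);
}

/*
 * Walk one chain. Returns 1 if key was found, 0 if the chain ended with
 * the expected marker (a definite miss), and -1 if it ended with a
 * different marker, meaning the walk drifted into another chain and the
 * caller should restart the lookup.
 */
static int chain_lookup(uintptr_t head, int key, unsigned int expected_nulls)
{
	uintptr_t p = head;

	while (!is_nulls(p)) {
		const struct node *n = (const struct node *)p;

		if (n->key == key)
			return 1;
		p = n->next;
	}
	return nulls_value(p) == expected_nulls ? 0 : -1;
}
```

Because listening slots end in `LISTENING_NULLS_BASE + slot` while established slots end in plain `slot`, a reader that started in listening slot 5 and ends on marker 5 knows it was carried into the established table and retries.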
    • dccp: Header option insertion routine for feature-negotiation · 8c862c23
      Gerrit Renker authored
      The patch extends existing code:
       * Confirm options divide into the confirmed value plus an optional preference
         list for SP values. Previously only the preference list was echoed for SP
         values, now the confirmed value is added as per RFC 4340, 6.1;
       * length and sanity checks are added to avoid illegal memory (or NULL) access.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: Support for Mandatory options · d3710566
      Gerrit Renker authored
      Support for Mandatory options is provided by this patch, which will
      be used by subsequent feature-negotiation patches.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: Increase the scope of variable-length htonl/ntohl functions · 02fa460e
      Gerrit Renker authored
      This extends the scope of two available functions,
      encode|decode_value_var, to work up to 6 (8) bytes, to match maximum
      requirements in the RFC.
      
      These functions are going to be used both by general option processing
      and feature negotiation code, hence declarations have been put into
      feat.h.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
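A variable-length htonl/ntohl pair amounts to writing and reading an integer in network byte order using only as many bytes as the option carries. The following is a minimal userspace sketch of that idea, not the DCCP implementation; `sketch_encode_var()` and `sketch_decode_var()` are hypothetical names and signatures chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Store the low `len` bytes (len <= 8) of value, most significant first. */
static void sketch_encode_var(uint64_t value, uint8_t *to, uint8_t len)
{
	for (uint8_t i = 0; i < len; i++)
		to[i] = (uint8_t)(value >> (8 * (len - 1 - i)));
}

/* Read `len` bytes (len <= 8) in network byte order back into an integer. */
static uint64_t sketch_decode_var(const uint8_t *from, uint8_t len)
{
	uint64_t value = 0;

	for (uint8_t i = 0; i < len; i++)
		value = (value << 8) | from[i];
	return value;
}
```

Working byte by byte sidesteps alignment and host endianness entirely, which is convenient for lengths such as 6 that have no native integer type.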
    • dccp: API to query the current TX/RX CCID · 71c262a3
      Gerrit Renker authored
      This provides a function to query the current TX/RX CCID dynamically,
      without relying on the minisock value, using dynamic information
      available in the currently loaded CCID module.
      
      This query function is then used to
       (a) provide the getsockopt part for getting/setting CCIDs via sockopts;
       (b) replace the current test for "which CCID is in use" in probe.c.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>