1. 23 Sep, 2008 1 commit
  2. 21 Sep, 2008 1 commit
  3. 20 Sep, 2008 8 commits
  4. 09 Sep, 2008 1 commit
  5. 03 Sep, 2008 1 commit
  6. 23 Aug, 2008 3 commits
  7. 25 Jul, 2008 1 commit
  8. 23 Jul, 2008 1 commit
    • tcp: Clear probes_out more aggressively in tcp_ack(). · 4b53fb67
      David S. Miller authored
      This is based upon an excellent bug report from Eric Dumazet.
      tcp_ack() should clear ->icsk_probes_out even if there are packets
      outstanding.  Otherwise if we get a sequence of ACKs while we do have
      packets outstanding over and over again, we'll never clear the
      probes_out value and eventually think the connection is too sick and
      we'll reset it.
      This appears to be some "optimization" added to tcp_ack() in the 2.4.x
      timeframe.  In 2.2.x, probes_out is pretty much always cleared by
      tcp_ack().
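The effect described above can be sketched with a toy model. This is not the kernel code: `toy_conn`, `toy_send_probe`, `toy_ack`, `toy_run` and `TOY_RETRIES2` are hypothetical names, and the limit of 15 is only assumed to mirror the default mentioned in the report below. The sketch alternates one probe and one ACK while data stays outstanding, and compares clearing the counter only when nothing is outstanding with clearing it on every ACK.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the retry limit (assumed default: 15). */
#define TOY_RETRIES2 15

struct toy_conn {
	unsigned int probes_out;	/* models icsk_probes_out */
	bool reset;			/* set once the connection is given up on */
};

/* A probe is sent: count it and give up once the limit is exceeded. */
static void toy_send_probe(struct toy_conn *c)
{
	if (++c->probes_out > TOY_RETRIES2)
		c->reset = true;
}

/*
 * An ACK arrives. With always_clear == false the counter is cleared only
 * when no packets are outstanding; with always_clear == true it is
 * cleared on every ACK.
 */
static void toy_ack(struct toy_conn *c, bool packets_outstanding,
		    bool always_clear)
{
	if (always_clear || !packets_outstanding)
		c->probes_out = 0;
}

/*
 * Alternate probe and ACK while data stays in flight the whole time.
 * Returns true if the toy connection ended up reset.
 */
static bool toy_run(int rounds, bool always_clear)
{
	struct toy_conn c = { 0, false };
	int i;

	for (i = 0; i < rounds && !c.reset; i++) {
		toy_send_probe(&c);
		toy_ack(&c, true, always_clear);
	}
	return c.reset;
}
```

In this model `toy_run(16, false)` reports a reset even though every probe was acknowledged, while `toy_run(n, true)` never does, which is the behaviour the commit argues for.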
      Here is Eric's original report:
      Apparently, we can in some situations reset TCP connections in a couple of seconds when some frames are lost.
      In order to reproduce the problem, please try the following program on linux-2.6.25.*
      Setup some iptables rules to allow two frames per second sent on loopback interface to tcp destination port 12000
      iptables -N SLOWLO
      iptables -A SLOWLO -m hashlimit --hashlimit 2 --hashlimit-burst 1 --hashlimit-mode dstip --hashlimit-name slow2 -j ACCEPT
      iptables -A SLOWLO -j DROP
      iptables -A OUTPUT -o lo -p tcp --dport 12000 -j SLOWLO
      Then run the attached program and see the output :
      # ./loop
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                                              timer:(persist,200ms,1)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                                              timer:(persist,200ms,3)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                                              timer:(persist,200ms,5)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                                              timer:(persist,200ms,7)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                                              timer:(persist,200ms,9)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                                              timer:(persist,200ms,11)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                                              timer:(persist,201ms,13)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                                              timer:(persist,188ms,15)
      write(): Connection timed out
      wrote 890 bytes but was interrupted after 9 seconds
      ESTAB      0      0         
      Exiting read() because no data available (4000 ms timeout).
      read 860 bytes
      While this TCP session makes progress (sending frames with 50 bytes of payload every 500 ms), the Linux TCP stack decides to reset it when tcp_retries2 is reached (default value: 15).
      tcpdump :
      15:30:28.856695 IP > S 33788768:33788768(0) win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
      15:30:28.856711 IP > S 33899253:33899253(0) ack 33788769 win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
      15:30:29.356947 IP > P 1:61(60) ack 1 win 257
      15:30:29.356966 IP > . ack 61 win 257
      15:30:29.866415 IP > P 61:111(50) ack 1 win 257
      15:30:29.866427 IP > . ack 111 win 257
      15:30:30.366516 IP > P 111:161(50) ack 1 win 257
      15:30:30.366527 IP > . ack 161 win 257
      15:30:30.876196 IP > P 161:211(50) ack 1 win 257
      15:30:30.876207 IP > . ack 211 win 257
      15:30:31.376282 IP > P 211:261(50) ack 1 win 257
      15:30:31.376290 IP > . ack 261 win 257
      15:30:31.885619 IP > P 261:311(50) ack 1 win 257
      15:30:31.885631 IP > . ack 311 win 257
      15:30:32.385705 IP > P 311:361(50) ack 1 win 257
      15:30:32.385715 IP > . ack 361 win 257
      15:30:32.895249 IP > P 361:411(50) ack 1 win 257
      15:30:32.895266 IP > . ack 411 win 257
      15:30:33.395341 IP > P 411:461(50) ack 1 win 257
      15:30:33.395351 IP > . ack 461 win 257
      15:30:33.918085 IP > P 461:511(50) ack 1 win 257
      15:30:33.918096 IP > . ack 511 win 257
      15:30:34.418163 IP > P 511:561(50) ack 1 win 257
      15:30:34.418172 IP > . ack 561 win 257
      15:30:34.927685 IP > P 561:611(50) ack 1 win 257
      15:30:34.927698 IP > . ack 611 win 257
      15:30:35.427757 IP > P 611:661(50) ack 1 win 257
      15:30:35.427766 IP > . ack 661 win 257
      15:30:35.937359 IP > P 661:711(50) ack 1 win 257
      15:30:35.937376 IP > . ack 711 win 257
      15:30:36.437451 IP > P 711:761(50) ack 1 win 257
      15:30:36.437464 IP > . ack 761 win 257
      15:30:36.947022 IP > P 761:811(50) ack 1 win 257
      15:30:36.947039 IP > . ack 811 win 257
      15:30:37.447135 IP > P 811:861(50) ack 1 win 257
      15:30:37.447203 IP > . ack 861 win 257
      15:30:41.448171 IP > F 1:1(0) ack 861 win 257
      15:30:41.448189 IP > R 33789629:33789629(0) win 0
      Source of program :
      /*
       * Small producer/consumer program.
       * Sets up a listener on the loopback address, port 12000.
       * Forks a child:
       *   the child connects to the listener and sends 10 bytes on this TCP socket every 100 ms.
       * The father accepts the connection and reads all data.
       */
      #include <sys/types.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <unistd.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>
      #include <sys/poll.h>

      int port = 12000;
      char buffer[4096];

      int main(int argc, char *argv[])
      {
              int lfd = socket(AF_INET, SOCK_STREAM, 0);
              struct sockaddr_in socket_address;
              time_t t0, t1;
              int on = 1, sfd, res;
              unsigned long total = 0;
              socklen_t alen = sizeof(socket_address);
              pid_t pid;

              time(&t0);      /* start time for the elapsed-seconds report */
              socket_address.sin_family = AF_INET;
              socket_address.sin_port = htons(port);
              socket_address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
              if (lfd == -1) {
                      perror("socket()");
                      return 1;
              }
              setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(int));
              if (bind(lfd, (struct sockaddr *)&socket_address, sizeof(socket_address)) == -1) {
                      perror("bind()");
                      return 1;
              }
              if (listen(lfd, 1) == -1) {
                      perror("listen()");
                      return 1;
              }
              pid = fork();
              if (pid == 0) {
                      /* child: the producer */
                      int i, cfd = socket(AF_INET, SOCK_STREAM, 0);

                      if (connect(cfd, (struct sockaddr *)&socket_address, sizeof(socket_address)) == -1) {
                              perror("connect()");
                              return 1;
                      }
                      for (i = 0 ; ;) {
                              res = write(cfd, "blablabla\n", 10);
                              if (res > 0) total += res;
                              else if (res == -1) {
                                      perror("write()");
                                      break;
                              } else break;
                              usleep(100000); /* 100 ms between writes */
                              if (++i == 10) {
                                      /* dump socket state once per second */
                                      system("ss -on dst");
                                      i = 0;
                              }
                      }
                      time(&t1);
                      fprintf(stderr, "wrote %lu bytes but was interrupted after %g seconds\n", total, difftime(t1, t0));
                      system("ss -on | grep");
                      return 0;
              }
              /* father: the consumer */
              sfd = accept(lfd, (struct sockaddr *)&socket_address, &alen);
              if (sfd == -1) {
                      perror("accept()");
                      return 1;
              }
              while (1) {
                      struct pollfd pfd[1];

                      pfd[0].fd = sfd;
                      pfd[0].events = POLLIN;
                      if (poll(pfd, 1, 4000) == 0) {
                              fprintf(stderr, "Exiting read() because no data available (4000 ms timeout).\n");
                              break;
                      }
                      res = read(sfd, buffer, sizeof(buffer));
                      if (res > 0) total += res;
                      else if (res == 0) break;
                      else perror("read()");
              }
              fprintf(stderr, "read %lu bytes\n", total);
              return 0;
      }
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 19 Jul, 2008 2 commits
  10. 16 Jul, 2008 3 commits
  11. 03 Jul, 2008 1 commit
    • tcp: de-bloat a bit with factoring NET_INC_STATS_BH out · 40b215e5
      Pavel Emelyanov authored
      There are some places in TCP that select one MIB index to
      bump snmp statistics like this:
      	if (<something>)
      		NET_INC_STATS_BH(<some_id>);
      	else if (<something_else>)
      		NET_INC_STATS_BH(<some_another_id>);
      or in a more tricky but still similar way.
      On the other hand, this NET_INC_STATS_BH is a camouflaged
      increment of percpu variable, which is not that small.
      Factoring those cases out de-bloats 235 bytes on non-preemptible
      i386 config and drives parts of the code into 80 columns.
      add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-235 (-235)
      function                                     old     new   delta
      tcp_fastretrans_alert                       1437    1424     -13
      tcp_dsack_set                                137     124     -13
      tcp_xmit_retransmit_queue                    690     676     -14
      tcp_try_undo_recovery                        283     265     -18
      tcp_sacktag_write_queue                     1550    1515     -35
      tcp_update_reordering                        162     106     -56
      tcp_retransmit_timer                         990     904     -86
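The factoring described in this commit message can be sketched outside the kernel as follows. Everything here is a hypothetical stand-in: `TOY_INC_STATS` replaces NET_INC_STATS_BH with a plain array increment, and the `toy_mib` indices, including the default branch, are invented for illustration. The point is only the shape of the change: the "before" variant expands the macro once per branch, the "after" variant selects an index first and expands the macro once.

```c
/* Hypothetical MIB indices, for illustration only. */
enum toy_mib { TOY_MIB_A, TOY_MIB_B, TOY_MIB_DEFAULT, TOY_MIB_MAX };

static unsigned long toy_stats[TOY_MIB_MAX];

/* Stand-in for NET_INC_STATS_BH: here just an array increment. */
#define TOY_INC_STATS(idx) (toy_stats[(idx)]++)

/* Before: the macro is expanded in every branch. */
static void toy_count_before(int a, int b)
{
	if (a)
		TOY_INC_STATS(TOY_MIB_A);
	else if (b)
		TOY_INC_STATS(TOY_MIB_B);
	else
		TOY_INC_STATS(TOY_MIB_DEFAULT);
}

/* After: the branches only pick an index; the macro is expanded once. */
static void toy_count_after(int a, int b)
{
	int mib_idx;

	if (a)
		mib_idx = TOY_MIB_A;
	else if (b)
		mib_idx = TOY_MIB_B;
	else
		mib_idx = TOY_MIB_DEFAULT;

	TOY_INC_STATS(mib_idx);
}
```

Both variants bump the same counters. When the macro body is large, as the message says of the per-CPU increment, the second form should emit that body once per function instead of once per branch.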
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 12 Jun, 2008 1 commit
    • tcp: Revert 'process defer accept as established' changes. · ec0a1966
      David S. Miller authored
      This reverts two changesets, ec3c0982
      ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
      the follow-on bug fix 9ae27e0a
      ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").
      This change causes several problems, first reported by Ingo Molnar
      as a distcc-over-loopback regression where connections were getting
      stuck.
      Ilpo Järvinen first spotted the locking problems.  The new function
      added by this code, tcp_defer_accept_check(), only has the
      child socket locked, yet it is modifying state of the parent
      listening socket.
      Fixing that is non-trivial at best, because we can't simply just grab
      the parent listening socket lock at this point, because it would
      create an ABBA deadlock.  The normal ordering is parent listening
      socket --> child socket, but this code path would require the
      reverse lock ordering.
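The ordering constraint described above can be illustrated with a small lock-rank check. This is a single-threaded toy, not kernel locking: `toy_ctx`, `toy_lock`, `toy_unlock` and the two rank names are hypothetical. It records which ranks a context holds and refuses any acquisition that would take a lower-ranked lock while a higher-ranked one is held, which is the reverse ordering the message warns about.

```c
#include <stdbool.h>

/* Hypothetical lock ranks: the listener must be taken before the child. */
enum toy_rank { TOY_RANK_LISTENER = 0, TOY_RANK_CHILD = 1 };

struct toy_ctx {
	unsigned int held;	/* bitmask of ranks currently held */
};

/*
 * Take the lock of the given rank. Refuse (return false) if a lock of
 * equal or higher rank is already held, since that would invert the
 * agreed listener --> child order.
 */
static bool toy_lock(struct toy_ctx *ctx, enum toy_rank rank)
{
	if (ctx->held >> (unsigned int)rank)
		return false;
	ctx->held |= 1u << (unsigned int)rank;
	return true;
}

/* Release the lock of the given rank. */
static void toy_unlock(struct toy_ctx *ctx, enum toy_rank rank)
{
	ctx->held &= ~(1u << (unsigned int)rank);
}
```

Taking the listener rank and then the child rank succeeds. A context that already holds the child rank is refused the listener rank, mirroring the code path that has only the child socket locked.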
      Next is a problem noticed by Vitaliy Gusev, who noted:
      >--- a/net/ipv4/tcp_timer.c
      >+++ b/net/ipv4/tcp_timer.c
      >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
      > 		goto death;
      > 	}
      >+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
      >+		tcp_send_active_reset(sk, GFP_ATOMIC);
      >+		goto death;
      >+	}
      Here socket sk is not attached to listening socket's request queue. tcp_done()
      will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
      release this sk) as socket is not DEAD. Therefore socket sk will be lost for
      freeing.
      Finally, Alexey Kuznetsov argues that there might not even be any
      real value or advantage to these new semantics even if we fix all
      of the bugs:
      Hiding from accept() sockets with only out-of-order data only
      is the only thing which is impossible with old approach. Is this really
      so valuable? My opinion: no, this is nothing but a new loophole
      to consume memory without control.
      So revert this thing for now.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 11 Jun, 2008 2 commits
  14. 04 Jun, 2008 2 commits
    • tcp: fix skb vs fack_count out-of-sync condition · a6604471
      Ilpo Järvinen authored
      This bug is able to corrupt fackets_out in very rare cases.
      In order for this to cause corruption:
        1) DSACK in the middle of previous SACK block must be generated.
        2) In order to take that particular branch, part or all of the
           DSACKed segment must already be SACKed so that we have that
           in cache in the first place.
        3) The new info must be top enough so that fackets_out will be
           updated on this iteration.
      ...then fack_count is updated while skb wasn't, then we walk again
      that particular segment thus updating fack_count twice for
      a single skb and finally that value is assigned to fackets_out
      by tcp_sacktag_one.
      It is safe to call tcp_sacktag_one just once for a segment (at
      DSACK), no need to call again for plain SACK.
      Potential problems from the miscount are limited to premature entry
      to recovery and to an inflated reordering metric (which could even
      cancel each other out in the luckiest scenarios :-)).
      Both are quite insignificant in the worst case too, and code also
      exists to reset them (fackets_out once sacked_out becomes zero
      and reordering metric on RTO).
      This has been reported by a number of people, but because it occurred
      quite rarely, it has been very evasive. Andy Furniss was able to
      get it to occur a couple of times so that a bit more info was
      collected about the problem using a debug patch, though it still
      required a lot of checking around. Thanks also to others who have
      tried to help here.
      This is listed as Bugzilla #10346. The bug was introduced by
      me in commit 68f8353b ("[TCP]: Rewrite SACK block processing &
      sack_recv_cache use"). I probably thought back then that there's
      a need to scan that entry twice, or didn't dare to make it go
      through it just once there. Going through twice would have
      required restoring fack_count after the walk but, as noted above,
      I chose to drop the additional walk step altogether here.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Fix inconsistency source (CA_Open only when !tcp_left_out(tp)) · 8aca6cb1
      Ilpo Järvinen authored
      It is possible that this skip path causes TCP to end up in an
      invalid state where ca_state was left at CA_Open while some
      segments already came into sacked_out. If the next valid ACK doesn't
      contain new SACK information, TCP fails to enter
      tcp_fastretrans_alert(). Thus at least high_seq is set
      incorrectly to a too high seqno because some new data segments
      could be sent in between (and also, limited transmit is not
      being correctly invoked there). Reordering in both directions
      can easily cause this situation to occur.
      I guess we would want to use tcp_moderate_cwnd(tp) there as well
      as it may be possible to use this to trigger oversized burst to
      network by sending an old ACK with huge amount of SACK info, but
      I'm a bit unsure about its effects (mainly to FlightSize), so to
      be on the safe side I just currently fixed it minimally to keep
      TCP's state consistent (obviously, such nasty ACKs have been
      possible this far). Though it seems that FlightSize is already
      underestimated by some amount, so probably on the long term we
      might want to trigger recovery there too, if appropriate, to make
      FlightSize calculation resemble reality at the time when the
      losses were discovered (but such change scares me too much now
      and requires some more thinking anyway how to do that as it
      likely involves some code shuffling).
      This bug was found by Brian Vowell while running my TCP debug
      patch to find the cause of another TCP issue (fackets_out
      miscount).
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 13 May, 2008 2 commits
    • tcp FRTO: work-around inorder receivers · 79d44516
      Ilpo Järvinen authored
      If receiver consumes segments successfully only in-order, FRTO
      fallback to conventional recovery produces RTO loop because
      FRTO's forward transmissions will always get dropped and need to
      be resent, yet by default they're not marked as lost (which are
      the only segments we will retransmit in CA_Loss).
      The price to pay for this is occasionally unnecessarily
      retransmitting the forward transmission(s). SACK blocks help
      a bit to avoid this, so it's mainly a concern for NewReno case
      though SACK is not fully immune either.
      This change has a side-effect of fixing a SACKFRTO problem where
      it didn't have snd_nxt of the RTO time available anymore when
      fallback became necessary (this problem would have only occurred
      when RTO would occur for two or more segments and ECE arrives
      in step 3; no need to figure out how to fix that unless the
      TODO item of selective behavior is considered in future).
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Reported-by: Damon L. Chesser <damon@damtek.com>
      Tested-by: Damon L. Chesser <damon@damtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp FRTO: Fix fallback to conventional recovery · a1c1f281
      Ilpo Järvinen authored
      It seems that commit 009a2e3e ("[TCP] FRTO: Improve
      interoperability with other undo_marker users") ran into
      another land-mine which caused fallback to conventional
      recovery to break:
      1. Cumulative ACK arrives after FRTO retransmission
      2. tcp_try_to_open sees zero retrans_out, clears retrans_stamp
         which should be kept like in CA_Loss state it would be
      3. undo_marker change allowed tcp_packet_delayed to return
         true because of the cleared retrans_stamp once FRTO is
         terminated causing LossUndo to occur, which means all loss
         markings FRTO made are reverted.
      This means that the conventional recovery basically recovered
      one loss per RTT, which is not that efficient. It was far from
      obvious that the undo_marker change broke something like
      this; I had quite a long session tracking it down because of
      the non-intuitiveness of the bug (luckily I had a trivial
      reproducer at hand and I was also able to learn to use kprobes
      in the process as well :-)).
      This, together with the NewReno+FRTO fix and the FRTO in-order
      workaround, fixes Damon's problems; this and the first
      mentioned are enough to fix Bugzilla #10063.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Reported-by: Damon L. Chesser <damon@damtek.com>
      Tested-by: Damon L. Chesser <damon@damtek.com>
      Tested-by: Sebastian Hyrwall <zibbe@cisko.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 08 May, 2008 1 commit
    • tcp FRTO: SACK variant is errorneously used with NewReno · 62ab2227
      Ilpo Järvinen authored
      Note: there's actually another bug in FRTO's SACK variant, which
      is causing the failure in the NewReno case because of the error
      that's fixed here. I'll fix the SACK case separately (it's
      a separate bug really, though related, but in order to fix that
      I need to audit tp->snd_nxt usage a bit).
      There were two places where SACK variant of FRTO is getting
      incorrectly used even if SACK wasn't negotiated by the TCP flow.
      This leads to incorrect setting of frto_highmark with NewReno
      if a previous recovery was interrupted by another RTO.
      An eventual fallback to conventional recovery then incorrectly
      considers one or a couple of segments as forward transmissions
      though they weren't, which then are not LOST marked during
      fallback, making them "non-retransmittable" until the next RTO.
      In a bad case, those segments are really lost and are the only
      ones left in the window. Thus TCP needs another RTO to continue.
      The next FRTO, however, could again repeat the same events
      making the progress of the TCP flow extremely slow.
      In order for these events to occur at all, FRTO must occur
      again in FRTO's step 3 while the key segments must be lost as
      well, which is not too likely in practice. It seems to occur most
      frequently with some small devices such as network printers
      that *seem* to accept TCP segments only in-order. In cases
      where key segments weren't lost, things get automatically
      resolved because those wrongly marked segments don't need to be
      retransmitted in order to continue.
      I found a reproducer after digging up relevant reports (few
      reports in total, none at netdev or lkml that I know of); some
      cases seemed to indicate middlebox issues, which now seems
      to be a false assumption some people had made. Bugzilla
      #10063 _might_ be related. Damon L. Chesser <damon@damtek.com>
      had a reproducible case and was kind enough to tcpdump it
      for me. With the tcpdump log it was quite trivial to figure
      out.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 04 May, 2008 1 commit
  18. 02 May, 2008 1 commit
  19. 27 Apr, 2008 1 commit
    • tcp: Fix slab corruption with ipv6 and tcp6fuzz · 9ae27e0a
      Evgeniy Polyakov authored
      From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
      This fixes a regression added by ec3c0982
      ("[TCP]: TCP_DEFER_ACCEPT updates - process as established")
      tcp_v6_do_rcv()->tcp_rcv_established(), the latter goes to step5, where
      eventually skb can be freed via tcp_data_queue() (drop: label), then if
      check for tcp_defer_accept_check() returns true and thus
      tcp_rcv_established() returns -1, which forces tcp_v6_do_rcv() to jump
      to reset: label, which in turn will pass through discard: label and free
      the same skb again.
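The double free described above can be modelled with a counter instead of real memory. This is a toy under stated assumptions, not the kernel call chain: `toy_skb`, `toy_kfree_skb`, `toy_rcv_established` and `toy_do_rcv` are hypothetical names, and the two boolean parameters simply select the drop path and the outcome of the deferred-accept check.

```c
#include <stdbool.h>

/* Toy buffer: counts how many times it has been "freed". */
struct toy_skb {
	int free_calls;
};

static void toy_kfree_skb(struct toy_skb *skb)
{
	skb->free_calls++;
}

/*
 * Models the callee: on its drop path it frees the buffer, and it
 * returns -1 when the deferred-accept check fires, regardless of
 * whether the buffer was already freed.
 */
static int toy_rcv_established(struct toy_skb *skb, bool drop,
			       bool defer_check)
{
	if (drop)
		toy_kfree_skb(skb);
	return defer_check ? -1 : 0;
}

/*
 * Models the caller: a non-zero return sends it down the reset and
 * discard path, which frees the buffer. Returns the number of frees.
 */
static int toy_do_rcv(struct toy_skb *skb, bool drop, bool defer_check)
{
	if (toy_rcv_established(skb, drop, defer_check))
		toy_kfree_skb(skb);
	return skb->free_calls;
}
```

Only the combination of the drop path and a positive deferred-accept check yields two frees of the same buffer. Either condition alone frees it once.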
      Tested by Eric Sesterhenn.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-By: Patrick McManus <mcmanus@ducksong.com>
  20. 21 Apr, 2008 1 commit
  21. 15 Apr, 2008 2 commits
  22. 14 Apr, 2008 3 commits