1. 17 Jun, 2006 2 commits
  2. 04 May, 2006 1 commit
    • [TCP]: Fix sock_orphan dead lock · 75c2d907
      Herbert Xu authored
      
      
      Calling sock_orphan inside bh_lock_sock in tcp_close can lead to
      deadlocks.  For example, the inet_diag code holds sk_callback_lock without
      disabling BH.  If an inbound packet arrives during that admittedly tiny
      window, it will cause a deadlock on bh_lock_sock.  Another possible
      path would be through sock_wfree if the network device driver frees the
      tx skb in process context with BH enabled.
      
      We can fix this by moving sock_orphan out of bh_lock_sock.
      
      The tricky bit is to work out when we need to destroy the socket
      ourselves and when it has already been destroyed by someone else.
      
      By moving sock_orphan before the release_sock we can solve this
      problem.  This is because as long as we own the socket lock its
      state cannot change.
      
      So we simply record the socket state before the release_sock
      and then check the state again after we regain the socket lock.
      If the socket state has transitioned to TCP_CLOSE in the meantime,
      we know that the socket has been destroyed.  Otherwise the socket is
      still ours to keep.
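
      A minimal C sketch of that pattern, assuming 2.6-era socket helpers
      (sock_orphan, release_sock, bh_lock_sock); the function name and the
      elided close-path work are placeholders, not the committed diff:

      #include <net/tcp.h>

      static void tcp_close_sketch(struct sock *sk)
      {
              int state;

              lock_sock(sk);
              /* ... drain the receive queue, send FIN/RST, etc. ... */

              sock_orphan(sk);                        /* no longer under bh_lock_sock */
              atomic_inc(sk->sk_prot->orphan_count);  /* bumped early, see note below */

              state = sk->sk_state;   /* snapshot while we still own the lock */
              release_sock(sk);

              local_bh_disable();
              bh_lock_sock(sk);

              /* If the state became TCP_CLOSE only after we dropped the lock,
               * someone else destroyed the socket; otherwise it is still ours. */
              if (state != TCP_CLOSE && sk->sk_state == TCP_CLOSE)
                      goto out;

              /* ... FIN_WAIT2/TIME_WAIT handling and, if needed, destroy ... */
      out:
              bh_unlock_sock(sk);
              local_bh_enable();
              sock_put(sk);
      }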
      
      Note that I've also moved the increment of the orphan count forward.
      This may look like a problem, as we're increasing it even if the socket
      is just about to be destroyed, in which case it will be decreased again.
      However, this simply enlarges a window that already exists.  It also
      changes the orphan count test by one.

      Considering what the orphan count is meant to do, this is no big deal.
      
      This problem was discovered by Ingo Molnar using his lock validator.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 25 Mar, 2006 3 commits
  4. 20 Mar, 2006 3 commits
  5. 03 Jan, 2006 2 commits
  6. 29 Nov, 2005 2 commits
    • [NET]: Add const markers to various variables. · 9b5b5cff
      Arjan van de Ven authored
      
      
      The patch below marks various variables const in net/; the goal is to
      move them to the .rodata section so that they can't false-share
      cachelines with things that get written to, as well as potentially
      helping gcc a bit with optimisations.  (These were found using a gcc
      patch that warns about such variables.)
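
      A minimal illustration of the kind of change involved (the array name
      here is made up, not one of the variables the patch actually touches):

      /* Before: the table lives in .data and can share a cacheline with
       * data that gets written to. */
      static int example_backoff_table[] = { 1, 2, 4, 8, 16 };

      /* After: const lets the compiler place it in .rodata. */
      static const int example_backoff_table_ro[] = { 1, 2, 4, 8, 16 };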
      Signed-off-by: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [IPV4] tcp/route: Another look at hash table sizes · 18955cfc
      Mike Stroyan authored
      
      
        The tcp_ehash hash table gets too big on systems with really big memory.
      It is worse on systems with pages larger than 4KB.  It wastes memory that
      could be better used.  It also makes the netstat command slow because reading
      /proc/net/tcp and /proc/net/tcp6 needs to go through the full hash table.
      
        The default value should not be larger for larger page sizes.  It seems
      that the effect of page size is an unintended error dating back a long
      time.  I also wonder if the default value really should be a larger
      fraction of memory for systems with more memory.  While systems with
      really big ram can afford more space for hash tables, it is not clear to
      me that they benefit from increasing the allocation ratio for this table.
      
        The amount of memory allocated is determined by net/ipv4/tcp.c:tcp_init and
      mm/page_alloc.c:alloc_large_system_hash.
      
      tcp_init calls alloc_large_system_hash passing parameters-
          bucketsize=sizeof(struct tcp_ehash_bucket)
          numentries=thash_entries
          scale=(num_physpages >= 128 * 1024) ? (25-PAGE_SHIFT) : (27-PAGE_SHIFT)
          limit=0
      
      On i386, PAGE_SHIFT is 12 for a page size of 4K
      On ia64, PAGE_SHIFT defaults to 14 for a page size of 16K
      
      The num_physpages test above makes the allocation take a larger fraction
      of the total memory on systems with larger memory.  The threshold size
      for a i386 system is 512MB.  For an ia64 system with 16KB pages the
      threshold is 2GB.
      
      For smaller memory systems-
      On i386, scale = (27 - 12) = 15
      On ia64, scale = (27 - 14) = 13
      For larger memory systems-
      On i386, scale = (25 - 12) = 13
      On ia64, scale = (25 - 14) = 11
      
        For the rest of this discussion, I'll just track the larger memory case.
      
        The default behavior has numentries=thash_entries=0, so the allocated
      size is determined by either scale or by the default limit of 1/16 of
      total memory.
      
      In alloc_large_system_hash-
      |	numentries = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages;
      |	numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
      |	numentries >>= 20 - PAGE_SHIFT;
      |	numentries <<= 20 - PAGE_SHIFT;
      
        At this point, numentries is pages for all of memory, rounded up to the
      nearest megabyte boundary.
      
      |	/* limit to 1 bucket per 2^scale bytes of low memory */
      |	if (scale > PAGE_SHIFT)
      |		numentries >>= (scale - PAGE_SHIFT);
      |	else
      |		numentries <<= (PAGE_SHIFT - scale);
      
      On i386, numentries >>= (13 - 12), so numentries is 1/8192 of
      bytes of total memory.
      On ia64, numentries <<= (14 - 11), so numentries is 1/2048 of
      bytes of total memory.
      
      |        log2qty = long_log2(numentries);
      |
      |        do {
      |                size = bucketsize << log2qty;
      
      bucketsize is 16, so size is 16 times numentries, rounded
      down to a power of two.
      
      On i386, size is 1/512 of bytes of total memory.
      On ia64, size is 1/128 of bytes of total memory.
      
      For smaller systems the results are
      On i386, size is 1/2048 of bytes of total memory.
      On ia64, size is 1/512 of bytes of total memory.
      
        The large page effect can be removed by just replacing
      the use of PAGE_SHIFT with a constant of 12 in the calls to
      alloc_large_system_hash.  That makes them more like the other uses of
      that function from fs/inode.c and fs/dcache.c.
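
      A small userspace sketch of the arithmetic above, assuming a 4GB
      machine; the function and constants are local to this example, and the
      round-to-a-megabyte step is omitted because it does not change the
      result for whole-gigabyte sizes:

      #include <stdio.h>

      /* numentries starts as the page count, is scaled by 2^(scale -
       * page_shift), rounded down to a power of two, and then multiplied
       * by the bucket size (16 bytes here). */
      static unsigned long long ehash_bytes(unsigned long long mem_bytes,
                                            int page_shift, int scale)
      {
              unsigned long long numentries = mem_bytes >> page_shift;
              const unsigned long long bucketsize = 16;

              /* limit to 1 bucket per 2^scale bytes of low memory */
              if (scale > page_shift)
                      numentries >>= (scale - page_shift);
              else
                      numentries <<= (page_shift - scale);

              /* size = bucketsize << long_log2(numentries) */
              while (numentries & (numentries - 1))
                      numentries &= numentries - 1;

              return bucketsize * numentries;
      }

      int main(void)
      {
              unsigned long long mem = 4ULL << 30;    /* 4GB of memory */

              printf("i386,  4K pages, scale 25-12: %llu MB\n",
                     ehash_bytes(mem, 12, 25 - 12) >> 20);
              printf("ia64, 16K pages, scale 25-14: %llu MB\n",
                     ehash_bytes(mem, 14, 25 - 14) >> 20);
              /* proposed fix: a constant 12 instead of PAGE_SHIFT */
              printf("ia64, 16K pages, scale 25-12: %llu MB\n",
                     ehash_bytes(mem, 14, 25 - 12) >> 20);
              return 0;
      }

      This prints 8 MB for i386, 32 MB for today's ia64, and 8 MB for ia64
      with the constant shift, matching the 1/512 and 1/128 fractions derived
      above.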
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 10 Nov, 2005 2 commits
  8. 05 Nov, 2005 1 commit
  9. 05 Sep, 2005 1 commit
  10. 01 Sep, 2005 2 commits
  11. 29 Aug, 2005 17 commits
  12. 23 Aug, 2005 1 commit
    • [TCP]: Unconditionally clear TCP_NAGLE_PUSH in skb_entail(). · 89ebd197
      David S. Miller authored
      
      
      The intention of this bit is to force pushing of the existing
      send queue when the TCP_CORK or TCP_NODELAY state changes via
      setsockopt().
      
      But it's easy to create a situation where the bit never
      clears.  For example, if the send queue starts empty:
      
      1) set TCP_NODELAY
      2) clear TCP_NODELAY
      3) set TCP_CORK
      4) do small write()
      
      The current code will leave TCP_NAGLE_PUSH set after that
      sequence.  Unconditionally clearing the bit when new data
      is added via skb_entail() solves the problem.
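
      A hedged sketch of the fix; the function name is a placeholder and the
      queueing work is elided, so only the final statement reflects the
      change being described:

      #include <net/tcp.h>

      static void skb_entail_sketch(struct sock *sk, struct tcp_sock *tp,
                                    struct sk_buff *skb)
      {
              /* ... initialise TCP_SKB_CB(skb) and append skb to
               * sk->sk_write_queue as before ... */

              /* New data always restarts the Nagle decision, so any stale
               * one-shot push request left over from setsockopt() is dropped. */
              tp->nonagle &= ~TCP_NAGLE_PUSH;
      }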
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 08 Jul, 2005 1 commit
  14. 05 Jul, 2005 2 commits
    • [TCP]: Move to new TSO segmenting scheme. · c1b4a7e6
      David S. Miller authored
      
      
      Make TSO segment transmit size decisions at send time not earlier.
      
      The basic scheme is that we try to build as large a TSO frame as
      possible when pulling in the user data, but the size of the TSO frame
      output to the card is determined at transmit time.
      
      This is guided by tp->xmit_size_goal.  It is always set to a multiple
      of MSS and tells sendmsg/sendpage how large an SKB to try and build.
      
      Later, tcp_write_xmit() and tcp_push_one() chop up the packet if
      necessary and conditions warrant.  These routines can also decide to
      "defer" in order to wait for more ACKs to arrive and thus allow larger
      TSO frames to be emitted.
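
      A rough sketch of that two-phase split, assuming the tp->xmit_size_goal
      field this series introduces; the two helper tests below are crude
      stand-ins for the real congestion-window and ACK-deferral logic, not
      kernel functions:

      #include <net/tcp.h>

      /* sendmsg()/sendpage() side: how large an SKB to try to build
       * (always a multiple of MSS). */
      static unsigned int skb_build_goal_sketch(const struct tcp_sock *tp)
      {
              return tp->xmit_size_goal;
      }

      /* Stand-in for the window check: how much may go out right now. */
      static unsigned int send_limit_sketch(const struct tcp_sock *tp,
                                            unsigned int mss_now)
      {
              return tp->snd_cwnd * mss_now;
      }

      /* Stand-in for the deferral heuristic: hold back in the hope that more
       * ACKs arrive and a larger TSO frame can be emitted. */
      static int defer_for_more_acks_sketch(const struct tcp_sock *tp)
      {
              return 0;
      }

      /* Transmit side (tcp_write_xmit()/tcp_push_one() in the real code):
       * the frame size handed to the card is decided only here. */
      static void transmit_one_sketch(struct sock *sk, struct tcp_sock *tp,
                                      struct sk_buff *skb, unsigned int mss_now)
      {
              unsigned int limit = send_limit_sketch(tp, mss_now);

              if (skb->len > limit) {
                      if (defer_for_more_acks_sketch(tp))
                              return;
                      /* otherwise chop skb down to 'limit' before sending */
              }
              /* ... hand the (possibly trimmed) skb to the device ... */
      }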
      
      A general observation is that TSO elongates the pipe, thus requiring a
      larger congestion window and larger buffering especially at the sender
      side.  Therefore, it is important that applications 1) get a large
      enough socket send buffer (this is accomplished by our dynamic send
      buffer expansion code) 2) do large enough writes.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [TCP]: Fix send-side cpu utilization regression. · b4e26f5e
      David S. Miller authored
      
      
      Only put user data purely to pages when doing TSO.
      
      The extra page allocations cause two problems:
      
      1) Add the overhead of the page allocations themselves.
      2) Make us do small user copies when we get to the end
         of the TCP socket cache page.
      
      It is still beneficial to purely use pages for TSO,
      so we will do it for that case.
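
      A hedged sketch of the resulting sizing decision in the send path; the
      function name and the exact condition are illustrative (the real code
      also looks at scatter-gather support), not the committed diff:

      #include <net/tcp.h>

      /* How much linear (copied) space to give a new SKB in tcp_sendmsg().
       * Returning 0 means "build the payload purely from page fragments". */
      static int select_size_sketch(const struct sock *sk,
                                    const struct tcp_sock *tp)
      {
              if (sk->sk_route_caps & NETIF_F_TSO)
                      return 0;               /* TSO still uses the all-pages layout */

              return tp->mss_cache;           /* otherwise copy into the linear area */
      }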
      Signed-off-by: David S. Miller <davem@davemloft.net>