1. 21 Nov, 2008 1 commit
  2. 20 Nov, 2008 1 commit
  3. 16 Nov, 2008 1 commit
    • Eric Dumazet's avatar
      net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls · 3ab5aee7
      Eric Dumazet authored
      
      
      RCU was added to UDP lookups, using a fast infrastructure :
      - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
        price of call_rcu() at freeing time.
      - hlist_nulls permits to use few memory barriers.
      
      This patch uses same infrastructure for TCP/DCCP established
      and timewait sockets.
      
      Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
      using short lived TCP connections. A followup patch, converting
      rwlocks to spinlocks will even speedup this case.
      
      __inet_lookup_established() is pretty fast now we dont have to
      dirty a contended cache line (read_lock/read_unlock)
      
      Only established and timewait hashtable are converted to RCU
      (bind table and listen table are still using traditional locking)
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3ab5aee7
  4. 16 Oct, 2008 1 commit
  5. 28 Aug, 2008 1 commit
  6. 11 Jun, 2008 1 commit
  7. 31 Jan, 2008 3 commits
  8. 28 Jan, 2008 1 commit
  9. 02 Dec, 2007 1 commit
  10. 29 Nov, 2007 1 commit
    • Pavel Emelyanov's avatar
      [INET]: Fix inet_diag register vs rcv race · 07693198
      Pavel Emelyanov authored
      
      
      The following race is possible when one cpu unregisters the handler
      while other one is trying to receive a message and call this one:
      
      CPU1:                                                 CPU2:
      inet_diag_rcv()                                       inet_diag_unregister()
        mutex_lock(&inet_diag_mutex);
        netlink_rcv_skb(skb, &inet_diag_rcv_msg);
          if (inet_diag_table[nlh->nlmsg_type] == 
                                     NULL) /* false handler is still registered */
          ...
          netlink_dump_start(idiagnl, skb, nlh,
                                 inet_diag_dump, NULL);
                 cb = kzalloc(sizeof(*cb), GFP_KERNEL);
                         /* sleep here freeing memory 
                          * or preempt
                          * or sleep later on nlk->cb_mutex
                          */
                                                               spin_lock(&inet_diag_register_lock);
                                                               inet_diag_table[type] = NULL;
          ...                                                  spin_unlock(&inet_diag_register_lock);
                                                               synchronize_rcu();
                                                               /* CPU1 is sleeping - RCU quiescent
                                                                * state is passed
                                                                */
                                                               return;
          /* inet_diag_dump is finally called: */
          inet_diag_dump()
            handler = inet_diag_table[cb->nlh->nlmsg_type];
            BUG_ON(handler == NULL); 
            /* OOPS! While we slept the unregister has set
             * handler to NULL :(
             */
      
      Grep showed, that the register/unregister functions are called
      from init/fini module callbacks for tcp_/dccp_diag, so it's OK
      to use the inet_diag_mutex to synchronize manipulations with the
      inet_diag_table and the access to it.
      
      Besides, as Herbert pointed out, asynchronous dumps should hold 
      this mutex as well, and thus, we provide the mutex as cb_mutex one.
      Signed-off-by: default avatarPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      07693198
  11. 07 Nov, 2007 1 commit
    • Eric Dumazet's avatar
      [INET]: Remove per bucket rwlock in tcp/dccp ehash table. · 230140cf
      Eric Dumazet authored
      As done two years ago on IP route cache table (commit
      22c047cc
      
      ) , we can avoid using one
      lock per hash bucket for the huge TCP/DCCP hash tables.
      
      On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for
      litle performance differences. (we hit a different cache line for the
      rwlock, but then the bucket cache line have a better sharing factor
      among cpus, since we dirty it less often). For netstat or ss commands
      that want a full scan of hash table, we perform fewer memory accesses.
      
      Using a 'small' table of hashed rwlocks should be more than enough to
      provide correct SMP concurrency between different buckets, without
      using too much memory. Sizing of this table depends on
      num_possible_cpus() and various CONFIG settings.
      
      This patch provides some locking abstraction that may ease a future
      work using a different model for TCP/DCCP table.
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Acked-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      230140cf
  12. 22 Oct, 2007 1 commit
  13. 10 Oct, 2007 4 commits
    • Denis V. Lunev's avatar
      [NET]: make netlink user -> kernel interface synchronious · cd40b7d3
      Denis V. Lunev authored
      
      
      This patch make processing netlink user -> kernel messages synchronious.
      This change was inspired by the talk with Alexey Kuznetsov about current
      netlink messages processing. He says that he was badly wrong when introduced 
      asynchronious user -> kernel communication.
      
      The call netlink_unicast is the only path to send message to the kernel
      netlink socket. But, unfortunately, it is also used to send data to the
      user.
      
      Before this change the user message has been attached to the socket queue
      and sk->sk_data_ready was called. The process has been blocked until all
      pending messages were processed. The bad thing is that this processing
      may occur in the arbitrary process context.
      
      This patch changes nlk->data_ready callback to get 1 skb and force packet
      processing right in the netlink_unicast.
      
      Kernel -> user path in netlink_unicast remains untouched.
      
      EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock
      drop, but the process remains in the cycle until the message will be fully
      processed. So, there is no need to use this kludges now.
      Signed-off-by: default avatarDenis V. Lunev <den@openvz.org>
      Acked-by: default avatarAlexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cd40b7d3
    • Herbert Xu's avatar
      [NETLINK]: Avoid pointer in netlink_run_queue · 0cfad075
      Herbert Xu authored
      
      
      I was looking at Patrick's fix to inet_diag and it occured
      to me that we're using a pointer argument to return values
      unnecessarily in netlink_run_queue.  Changing it to return
      the value will allow the compiler to generate better code
      since the value won't have to be memory-backed.
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0cfad075
    • Eric W. Biederman's avatar
      [NET]: Support multiple network namespaces with netlink · b4b51029
      Eric W. Biederman authored
      
      
      Each netlink socket will live in exactly one network namespace,
      this includes the controlling kernel sockets.
      
      This patch updates all of the existing netlink protocols
      to only support the initial network namespace.  Request
      by clients in other namespaces will get -ECONREFUSED.
      As they would if the kernel did not have the support for
      that netlink protocol compiled in.
      
      As each netlink protocol is updated to be multiple network
      namespace safe it can register multiple kernel sockets
      to acquire a presence in the rest of the network namespaces.
      
      The implementation in af_netlink is a simple filter implementation
      at hash table insertion and hash table look up time.
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b4b51029
    • Ilpo Järvinen's avatar
      [NET]: DIV_ROUND_UP cleanup (part two) · 172589cc
      Ilpo Järvinen authored
      
      
      Hopefully captured all single statement cases under net/. I'm
      not too sure if there is some policy about #includes that are
      "guaranteed" (ie., in the current tree) to be available through
      some other #included header, so I just added linux/kernel.h to
      each changed file that didn't #include it previously.
      Signed-off-by: default avatarIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      172589cc
  14. 11 Sep, 2007 1 commit
    • Patrick McHardy's avatar
      [INET_DIAG]: Fix oops in netlink_rcv_skb · 0a9c7301
      Patrick McHardy authored
      
      
      netlink_run_queue() doesn't handle multiple processes processing the
      queue concurrently. Serialize queue processing in inet_diag to fix
      a oops in netlink_rcv_skb caused by netlink_run_queue passing a
      NULL for the skb.
      
      BUG: unable to handle kernel NULL pointer dereference at virtual address 00000054
      [349587.500454]  printing eip:
      [349587.500457] c03318ae
      [349587.500459] *pde = 00000000
      [349587.500464] Oops: 0000 [#1]
      [349587.500466] PREEMPT SMP
      [349587.500474] Modules linked in: w83627hf hwmon_vid i2c_isa
      [349587.500483] CPU:    0
      [349587.500485] EIP:    0060:[<c03318ae>]    Not tainted VLI
      [349587.500487] EFLAGS: 00010246   (2.6.22.3 #1)
      [349587.500499] EIP is at netlink_rcv_skb+0xa/0x7e
      [349587.500506] eax: 00000000   ebx: 00000000   ecx: c148d2a0   edx: c0398819
      [349587.500510] esi: 00000000   edi: c0398819   ebp: c7a21c8c   esp: c7a21c80
      [349587.500517] ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
      [349587.500521] Process oidentd (pid: 17943, ti=c7a20000 task=cee231c0 task.ti=c7a20000)
      [349587.500527] Stack: 00000000 c7a21cac f7c8ba78 c7a21ca4 c0331962 c0398819 f7c8ba00 0000004c
      [349587.500542]        f736f000 c7a21cb4 c03988e3 00000001 f7c8ba00 c7a21cc4 c03312a5 0000004c
      [349587.500558]        f7c8ba00 c7a21cd4 c0330681 f7c8ba00 e4695280 c7a21d00 c03307c6 7fffffff
      [349587.500578] Call Trace:
      [349587.500581]  [<c010361a>] show_trace_log_lvl+0x1c/0x33
      [349587.500591]  [<c01036d4>] show_stack_log_lvl+0x8d/0xaa
      [349587.500595]  [<c010390e>] show_registers+0x1cb/0x321
      [349587.500604]  [<c0103bff>] die+0x112/0x1e1
      [349587.500607]  [<c01132d2>] do_page_fault+0x229/0x565
      [349587.500618]  [<c03c8d3a>] error_code+0x72/0x78
      [349587.500625]  [<c0331962>] netlink_run_queue+0x40/0x76
      [349587.500632]  [<c03988e3>] inet_diag_rcv+0x1f/0x2c
      [349587.500639]  [<c03312a5>] netlink_data_ready+0x57/0x59
      [349587.500643]  [<c0330681>] netlink_sendskb+0x24/0x45
      [349587.500651]  [<c03307c6>] netlink_unicast+0x100/0x116
      [349587.500656]  [<c0330f83>] netlink_sendmsg+0x1c2/0x280
      [349587.500664]  [<c02fcce9>] sock_sendmsg+0xba/0xd5
      [349587.500671]  [<c02fe4d1>] sys_sendmsg+0x17b/0x1e8
      [349587.500676]  [<c02fe92d>] sys_socketcall+0x230/0x24d
      [349587.500684]  [<c01028d2>] syscall_call+0x7/0xb
      [349587.500691]  =======================
      [349587.500693] Code: f0 ff 4e 18 0f 94 c0 84 c0 0f 84 66 ff ff ff 89 f0 e8 86 e2 fc ff e9 5a ff ff ff f0 ff 40 10 eb be 55 89 e5 57 89 d7 56 89 c6 53 <8b> 50 54 83 fa 10 72 55 8b 9e 9c 00 00 00 31 c9 8b 03 83 f8 0f
      
      Reported by Athanasius <link@miggy.org>
      Signed-off-by: default avatarPatrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a9c7301
  15. 25 Apr, 2007 6 commits
  16. 11 Feb, 2007 1 commit
  17. 08 Feb, 2007 2 commits
    • Eric Dumazet's avatar
      [NET]: change layout of ehash table · dbca9b27
      Eric Dumazet authored
      
      
      ehash table layout is currently this one :
      
      First half of this table is used by sockets not in TIME_WAIT state
      Second half of it is used by sockets in TIME_WAIT state.
      
      This is non optimal because of for a given hash or socket, the two chain heads 
      are located in separate cache lines.
      Moreover the locks of the second half are never used.
      
      If instead of this halving, we use two list heads in inet_ehash_bucket instead 
      of only one, we probably can avoid one cache miss, and reduce ram usage, 
      particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK, 
      CONFIG_DEBUG_LOCK_ALLOC settings). So we still halves the table but we keep 
      together related chains to speedup lookups and socket state change.
      
      In this patch I did not try to align struct inet_ehash_bucket, but a future 
      patch could try to make this structure have a convenient size (a power of two 
      or a multiple of L1_CACHE_SIZE).
      I guess rwlock will just vanish as soon as RCU is plugged into ehash :) , so 
      maybe we dont need to scratch our heads to align the bucket...
      
      Note : In case struct inet_ehash_bucket is not a power of two, we could 
      probably change alloc_large_system_hash() (in case it use __get_free_pages()) 
      to free the unused space. It currently allocates a big zone, but the last 
      quarter of it could be freed. Again, this should be a temporary 'problem'.
      
      Patch tested on ipv4 tcp only, but should be OK for IPV6 and DCCP.
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dbca9b27
    • Patrick McHardy's avatar
      [NETLINK]: Don't BUG on undersized allocations · 26932566
      Patrick McHardy authored
      
      
      Currently netlink users BUG when the allocated skb for an event
      notification is undersized. While this is certainly a kernel bug,
      its not critical and crashing the kernel is too drastic, especially
      when considering that these errors have appeared multiple times in
      the past and it BUGs even if no listeners are present.
      
      This patch replaces BUG by WARN_ON and changes the notification
      functions to inform potential listeners of undersized allocations
      using a unique error code (EMSGSIZE).
      Signed-off-by: default avatarPatrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      26932566
  18. 28 Sep, 2006 1 commit
  19. 21 Jul, 2006 1 commit
  20. 30 Jun, 2006 1 commit
  21. 09 Jan, 2006 4 commits
  22. 03 Jan, 2006 2 commits
  23. 09 Nov, 2005 1 commit
  24. 29 Aug, 2005 2 commits
    • Patrick McHardy's avatar
    • Arnaldo Carvalho de Melo's avatar
      [INET_DIAG]: Move the tcp_diag interface to the proper place · 17b085ea
      Arnaldo Carvalho de Melo authored
      
      
      With this the previous setup is back, i.e. tcp_diag can be built as a module,
      as dccp_diag and both share the infrastructure available in inet_diag.
      
      If one selects CONFIG_INET_DIAG as module CONFIG_INET_TCP_DIAG will also be
      built as a module, as will CONFIG_INET_DCCP_DIAG, if CONFIG_IP_DCCP was
      selected static or as a module, if CONFIG_INET_DIAG is y, being statically
      linked CONFIG_INET_TCP_DIAG will follow suit and CONFIG_INET_DCCP_DIAG will be
      built in the same manner as CONFIG_IP_DCCP.
      
      Now to aim at UDP, converting it to use inet_hashinfo, so that we can use
      iproute2 for UDP sockets as well.
      
      Ah, just to show an example of this new infrastructure working for DCCP :-)
      
      [root@qemu ~]# ./ss -dane
      State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
      LISTEN     0      0                  *:5001             *:*     ino:942 sk:cfd503a0
      ESTAB      0      0          127.0.0.1:5001     127.0.0.1:32770 ino:943 sk:cfd50a60
      ESTAB      0      0          127.0.0.1:32770    127.0.0.1:5001  ino:947 sk:cfd50700
      TIME-WAIT  0      0          127.0.0.1:32769    127.0.0.1:5001  timer:(timewait,3.430ms,0) ino:0 sk:cf209620
      Signed-off-by: default avatarArnaldo Carvalho de Melo <acme@mandriva.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      17b085ea