1. 30 Aug, 2012 1 commit
    • Pablo Neira Ayuso's avatar
      netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectation · 3f509c68
      Pablo Neira Ayuso authored
      
      
      We're hitting bug while trying to reinsert an already existing
      expectation:
      
      kernel BUG at kernel/timer.c:895!
      invalid opcode: 0000 [#1] SMP
      [...]
      Call Trace:
       <IRQ>
       [<ffffffffa0069563>] nf_ct_expect_related_report+0x4a0/0x57a [nf_conntrack]
       [<ffffffff812d423a>] ? in4_pton+0x72/0x131
       [<ffffffffa00ca69e>] ip_nat_sdp_media+0xeb/0x185 [nf_nat_sip]
       [<ffffffffa00b5b9b>] set_expected_rtp_rtcp+0x32d/0x39b [nf_conntrack_sip]
       [<ffffffffa00b5f15>] process_sdp+0x30c/0x3ec [nf_conntrack_sip]
       [<ffffffff8103f1eb>] ? irq_exit+0x9a/0x9c
       [<ffffffffa00ca738>] ? ip_nat_sdp_media+0x185/0x185 [nf_nat_sip]
      
      We have to remove the RTP expectation if the RTCP expectation hits EBUSY
      since we keep trying with other ports until we succeed.
      Reported-by: default avatarRafal Fitt <rafalf@aplusc.com.pl>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      3f509c68
  2. 20 Aug, 2012 1 commit
  3. 14 Aug, 2012 1 commit
  4. 10 Aug, 2012 3 commits
  5. 09 Aug, 2012 3 commits
    • Eric Dumazet's avatar
      net: tcp: ipv6_mapped needs sk_rx_dst_set method · 63d02d15
      Eric Dumazet authored
      commit 5d299f3d
      
       (net: ipv6: fix TCP early demux) added a
      regression for ipv6_mapped case.
      
      [   67.422369] SELinux: initialized (dev autofs, type autofs), uses
      genfs_contexts
      [   67.449678] SELinux: initialized (dev autofs, type autofs), uses
      genfs_contexts
      [   92.631060] BUG: unable to handle kernel NULL pointer dereference at
      (null)
      [   92.631435] IP: [<          (null)>]           (null)
      [   92.631645] PGD 0
      [   92.631846] Oops: 0010 [#1] SMP
      [   92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror
      dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp
      parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event
      snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm
      snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore
      snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd
      uhci_hcd
      [   92.634294] CPU 0
      [   92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
      [   92.634294] RIP: 0010:[<0000000000000000>]  [<          (null)>]
      (null)
      [   92.634294] RSP: 0018:ffff880245fc7cb0  EFLAGS: 00010282
      [   92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX:
      0000000000000000
      [   92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI:
      ffff88024827ad00
      [   92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09:
      0000000000000000
      [   92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12:
      ffff880254735380
      [   92.634294] R13: ffff880254735380 R14: 0000000000000000 R15:
      7fffffffffff0218
      [   92.634294] FS:  00007f4516ccd6f0(0000) GS:ffff880256600000(0000)
      knlGS:0000000000000000
      [   92.634294] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4:
      00000000000007f0
      [   92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000
      [   92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
      0000000000000400
      [   92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000,
      task ffff880254b8cac0)
      [   92.634294] Stack:
      [   92.634294]  ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8
      ffff880245fc7d68
      [   92.634294]  ffffffff81385083 00000000001d2680 ffff8802547353a8
      ffff880245fc7d18
      [   92.634294]  ffffffff8105903a ffff88024827ad60 0000000000000002
      00000000000000ff
      [   92.634294] Call Trace:
      [   92.634294]  [<ffffffff813837a7>] ? tcp_finish_connect+0x2c/0xfa
      [   92.634294]  [<ffffffff81385083>] tcp_rcv_state_process+0x2b6/0x9c6
      [   92.634294]  [<ffffffff8105903a>] ? sched_clock_cpu+0xc3/0xd1
      [   92.634294]  [<ffffffff81059073>] ? local_clock+0x2b/0x3c
      [   92.634294]  [<ffffffff8138caf3>] tcp_v4_do_rcv+0x63a/0x670
      [   92.634294]  [<ffffffff8133278e>] release_sock+0x128/0x1bd
      [   92.634294]  [<ffffffff8139f060>] __inet_stream_connect+0x1b1/0x352
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8104b333>] ? wake_up_bit+0x25/0x25
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8139f223>] ? inet_stream_connect+0x22/0x4b
      [   92.634294]  [<ffffffff8139f234>] inet_stream_connect+0x33/0x4b
      [   92.634294]  [<ffffffff8132e8cf>] sys_connect+0x78/0x9e
      [   92.634294]  [<ffffffff813fd407>] ? sysret_check+0x1b/0x56
      [   92.634294]  [<ffffffff81088503>] ? __audit_syscall_entry+0x195/0x1c8
      [   92.634294]  [<ffffffff811cc26e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [   92.634294]  [<ffffffff813fd3e2>] system_call_fastpath+0x16/0x1b
      [   92.634294] Code:  Bad RIP value.
      [   92.634294] RIP  [<          (null)>]           (null)
      [   92.634294]  RSP <ffff880245fc7cb0>
      [   92.634294] CR2: 0000000000000000
      [   92.648982] ---[ end trace 24e2bed94314c8d9 ]---
      [   92.649146] Kernel panic - not syncing: Fatal exception in interrupt
      
      Fix this using inet_sk_rx_dst_set(), and export this function in case
      IPv6 is modular.
      Reported-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      63d02d15
    • Eric Dumazet's avatar
      ipv4: tcp: unicast_sock should not land outside of TCP stack · 3a7c384f
      Eric Dumazet authored
      commit be9f4a44
      
       (ipv4: tcp: remove per net tcp_sock) added a
      selinux regression, reported and bisected by John Stultz
      
      selinux_ip_postroute_compat() expect to find a valid sk->sk_security
      pointer, but this field is NULL for unicast_sock
      
      It turns out that unicast_sock are really temporary stuff to be able
      to reuse  part of IP stack (ip_append_data()/ip_push_pending_frames())
      
      Fact is that frames sent by ip_send_unicast_reply() should be orphaned
      to not fool LSM.
      
      Note IPv6 never had this problem, as tcp_v6_send_response() doesnt use a
      fake socket at all. I'll probably implement tcp_v4_send_response() to
      remove these unicast_sock in linux-3.7
      Reported-by: default avatarJohn Stultz <johnstul@us.ibm.com>
      Bisected-by: default avatarJohn Stultz <johnstul@us.ibm.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3a7c384f
    • Eric Dumazet's avatar
      tcp: must free metrics at net dismantle · 36471012
      Eric Dumazet authored
      
      
      We currently leak all tcp metrics at struct net dismantle time.
      
      tcp_net_metrics_exit() frees the hash table, we must first
      iterate it to free all metrics.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      36471012
  6. 08 Aug, 2012 1 commit
  7. 06 Aug, 2012 3 commits
  8. 02 Aug, 2012 2 commits
  9. 31 Jul, 2012 7 commits
    • Mel Gorman's avatar
      netvm: prevent a stream-specific deadlock · c76562b6
      Mel Gorman authored
      
      
      This patch series is based on top of "Swap-over-NBD without deadlocking
      v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.
      
      When a user or administrator requires swap for their application, they
      create a swap partition and file, format it with mkswap and activate it
      with swapon.  In diskless systems this is not an option so if swap if
      required then swapping over the network is considered.  The two likely
      scenarios are when blade servers are used as part of a cluster where the
      form factor or maintenance costs do not allow the use of disks and thin
      clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap but this is not always an option.  There is no
      guarantee that the network attached storage (NAS) device is running Linux
      or supports NBD.  However, it is likely that it supports NFS so there are
      users that want support for swapping over NFS despite any performance
      concern.  Some distributions currently carry patches that support swapping
      over NFS but it would be preferable to support it in the mainline kernel.
      
      Patch 1 avoids a stream-specific deadlock that potentially affects TCP.
      
      Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
      	reserves.
      
      Patch 3 adds three helpers for filesystems to handle swap cache pages.
      	For example, page_file_mapping() returns page->mapping for
      	file-backed pages and the address_space of the underlying
      	swap file for swap cache pages.
      
      Patch 4 adds two address_space_operations to allow a filesystem
      	to pin all metadata relevant to a swapfile in memory. Upon
      	successful activation, the swapfile is marked SWP_FILE and
      	the address space operation ->direct_IO is used for writing
      	and ->readpage for reading in swap pages.
      
      Patch 5 notes that patch 3 is bolting
      	filesystem-specific-swapfile-support onto the side and that
      	the default handlers have different information to what
      	is available to the filesystem. This patch refactors the
      	code so that there are generic handlers for each of the new
      	address_space operations.
      
      Patch 6 adds an API to allow a vector of kernel addresses to be
      	translated to struct pages and pinned for IO.
      
      Patch 7 adds support for using highmem pages for swap by kmapping
      	the pages before calling the direct_IO handler.
      
      Patch 8 updates NFS to use the helpers from patch 3 where necessary.
      
      Patch 9 avoids setting PF_private on PG_swapcache pages within NFS.
      
      Patch 10 implements the new swapfile-related address_space operations
      	for NFS and teaches the direct IO handler how to manage
      	kernel addresses.
      
      Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
      	where appropriate.
      
      Patch 12 fixes a NULL pointer dereference that occurs when using
      	swap-over-NFS.
      
      With the patches applied, it is possible to mount a swapfile that is on an
      NFS filesystem.  Swap performance is not great with a swap stress test
      taking roughly twice as long to complete than if the swap device was
      backed by NBD.
      
      This patch: netvm: prevent a stream-specific deadlock
      
      It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
      that we're over the global rmem limit.  This will prevent SOCK_MEMALLOC
      buffers from receiving data, which will prevent userspace from running,
      which is needed to reduce the buffered data.
      
      Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.  Once
      this change it applied, it is important that sockets that set
      SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
      If this happens, a warning is generated and the tokens reclaimed to avoid
      accounting errors until the bug is fixed.
      
      [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c76562b6
    • Mel Gorman's avatar
      net: introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket · 99a1dec7
      Mel Gorman authored
      
      
      Introduce sk_gfp_atomic(), this function allows to inject sock specific
      flags to each sock related allocation.  It is only used on allocation
      paths that may be required for writing pages back to network storage.
      
      [davem@davemloft.net: Use sk_gfp_atomic only when necessary]
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99a1dec7
    • Andrew Morton's avatar
      memcg: rename config variables · c255a458
      Andrew Morton authored
      
      
      Sanity:
      
      CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
      CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
      
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c255a458
    • David S. Miller's avatar
      ipv4: Properly purge netdev references on uncached routes. · caacf05e
      David S. Miller authored
      
      
      When a device is unregistered, we have to purge all of the
      references to it that may exist in the entire system.
      
      If a route is uncached, we currently have no way of accomplishing
      this.
      
      So create a global list that is scanned when a network device goes
      down.  This mirrors the logic in net/core/dst.c's dst_ifdown().
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      caacf05e
    • David S. Miller's avatar
      c5038a83
    • Eric Dumazet's avatar
      ipv4: percpu nh_rth_output cache · d26b3a7c
      Eric Dumazet authored
      
      
      Input path is mostly run under RCU and doesnt touch dst refcnt
      
      But output path on forwarding or UDP workloads hits
      badly dst refcount, and we have lot of false sharing, for example
      in ipv4_mtu() when reading rt->rt_pmtu
      
      Using a percpu cache for nh_rth_output gives a nice performance
      increase at a small cost.
      
      24 udpflood test on my 24 cpu machine (dummy0 output device)
      (each process sends 1.000.000 udp frames, 24 processes are started)
      
      before : 5.24 s
      after : 2.06 s
      For reference, time on linux-3.5 : 6.60 s
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Tested-by: default avatarAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d26b3a7c
    • Eric Dumazet's avatar
      ipv4: Restore old dst_free() behavior. · 54764bb6
      Eric Dumazet authored
      commit 404e0a8b
      
       (net: ipv4: fix RCU races on dst refcounts) tried
      to solve a race but added a problem at device/fib dismantle time :
      
      We really want to call dst_free() as soon as possible, even if sockets
      still have dst in their cache.
      dst_release() calls in free_fib_info_rcu() are not welcomed.
      
      Root of the problem was that now we also cache output routes (in
      nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
      rt_free(), because output route lookups are done in process context.
      
      Based on feedback and initial patch from David Miller (adding another
      call_rcu_bh() call in fib, but it appears it was not the right fix)
      
      I left the inet_sk_rx_dst_set() helper and added __rcu attributes
      to nh_rth_output and nh_rth_input to better document what is going on in
      this code.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      54764bb6
  10. 30 Jul, 2012 5 commits
  11. 27 Jul, 2012 3 commits
    • Jiri Kosina's avatar
      tcp: perform DMA to userspace only if there is a task waiting for it · 59ea33a6
      Jiri Kosina authored
      Back in 2006, commit 1a2449a8
      
       ("[I/OAT]: TCP recv offload to I/OAT")
      added support for receive offloading to IOAT dma engine if available.
      
      The code in tcp_rcv_established() tries to perform early DMA copy if
      applicable. It however does so without checking whether the userspace
      task is actually expecting the data in the buffer.
      
      This is not a problem under normal circumstances, but there is a corner
      case where this doesn't work -- and that's when MSG_TRUNC flag to
      recvmsg() is used.
      
      If the IOAT dma engine is not used, the code properly checks whether
      there is a valid ucopy.task and the socket is owned by userspace, but
      misses the check in the dmaengine case.
      
      This problem can be observed in real trivially -- for example 'tbench' is a
      good reproducer, as it makes a heavy use of MSG_TRUNC. On systems utilizing
      IOAT, you will soon find tbench waiting indefinitely in sk_wait_data(), as they
      have been already early-copied in tcp_rcv_established() using dma engine.
      
      This patch introduces the same check we are performing in the simple
      iovec copy case to the IOAT case as well. It fixes the indefinite
      recvmsg(MSG_TRUNC) hangs.
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      59ea33a6
    • Eric Dumazet's avatar
      ipv4: fix TCP early demux · 505fbcf0
      Eric Dumazet authored
      commit 92101b3b
      
       (ipv4: Prepare for change of rt->rt_iif encoding.)
      invalidated TCP early demux, because rx_dst_ifindex is not properly
      initialized and checked.
      
      Also remove the use of inet_iif(skb) in favor or skb->skb_iif
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      505fbcf0
    • Hangbin Liu's avatar
      tcp: Add TCP_USER_TIMEOUT negative value check · 42493570
      Hangbin Liu authored
      TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int. But
      patch "tcp: Add TCP_USER_TIMEOUT socket option"(dca43c75
      
      ) didn't check the negative
      values. If a user assign -1 to it, the socket will set successfully and wait
      for 4294967295 miliseconds. This patch add a negative value check to avoid
      this issue.
      Signed-off-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      42493570
  12. 26 Jul, 2012 2 commits
  13. 25 Jul, 2012 1 commit
  14. 24 Jul, 2012 1 commit
    • Eric Dumazet's avatar
      tcp: early_demux fixes · 9cb429d6
      Eric Dumazet authored
      
      
      1) Remove a non needed pskb_may_pull() in tcp_v4_early_demux()
         and fix a potential bug if skb->head was reallocated
         (iph & th pointers were not reloaded)
      
      TCP stack will pull/check headers anyway.
      
      2) must reload iph in ip_rcv_finish() after early_demux()
       call since skb->head might have changed.
      
      3) skb->dev->ifindex can be now replaced by skb->skb_iif
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9cb429d6
  15. 23 Jul, 2012 6 commits