1. 25 May, 2016 1 commit
    • netfilter: nf_queue: Make the queue_handler pernet · dc3ee32e
      Eric W. Biederman authored
      Florian Weber reported:
      > Under full load (unshare() in loop -> OOM conditions) we can
      > get kernel panic:
      >
      > BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      > IP: [<ffffffff81476c85>] nfqnl_nf_hook_drop+0x35/0x70
      > [..]
      > task: ffff88012dfa3840 ti: ffff88012dffc000 task.ti: ffff88012dffc000
      > RIP: 0010:[<ffffffff81476c85>]  [<ffffffff81476c85>] nfqnl_nf_hook_drop+0x35/0x70
      > RSP: 0000:ffff88012dfffd80  EFLAGS: 00010206
      > RAX: 0000000000000008 RBX: ffffffff81add0c0 RCX: ffff88013fd80000
      > [..]
      > Call Trace:
      >  [<ffffffff81474d98>] nf_queue_nf_hook_drop+0x18/0x20
      >  [<ffffffff814738eb>] nf_unregister_net_hook+0xdb/0x150
      >  [<ffffffff8147398f>] netfilter_net_exit+0x2f/0x60
      >  [<ffffffff8141b088>] ops_exit_list.isra.4+0x38/0x60
      >  [<ffffffff8141b652>] setup_net+0xc2/0x120
      >  [<ffffffff8141bd09>] copy_net_ns+0x79/0x120
      >  [<ffffffff8106965b>] create_new_namespaces+0x11b/0x1e0
      >  [<ffffffff810698a7>] unshare_nsproxy_namespaces+0x57/0xa0
      >  [<ffffffff8104baa2>] SyS_unshare+0x1b2/0x340
      >  [<ffffffff81608276>] entry_SYSCALL_64_fastpath+0x1e/0xa8
      > Code: 65 00 48 89 e5 41 56 41 55 41 54 53 83 e8 01 48 8b 97 70 12 00 00 48 98 49 89 f4 4c 8b 74 c2 18 4d 8d 6e 08 49 81 c6 88 00 00 00 <49> 8b 5d 00 48 85 db 74 1a 48 89 df 4c 89 e2 48 c7 c6 90 68 47
      >
      
      The simple fix for this requires a new pernet variable for struct
      nf_queue that indicates when it is safe to use the dynamically
      allocated nf_queue state.
      
      As we need a variable anyway make nf_register_queue_handler and
      nf_unregister_queue_handler pernet.  This allows the existing logic of
      when it is safe to use the state from the nfnetlink_queue module to be
      reused with no changes except for making it per net.
      
      The synchronize_rcu from nf_unregister_queue_handler is moved to a new
      function nfnl_queue_net_exit_batch so that the worst case of having a
      synchronize_rcu in the pernet exit path is not experienced in batch
      mode.
      Reported-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
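The per-namespace handler idea described in this commit can be illustrated with a minimal userspace sketch. It is not the kernel code: the type and function names ending in `_sketch`, the `dropped` counter, and the helper names are assumptions made for illustration, and the RCU protection the kernel relies on is left out entirely. The point it shows is that a namespace whose handler pointer is still NULL can have hooks dropped without touching queue state that was never allocated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
struct net_sketch;

struct queue_handler_sketch {
    /* Called when a hook is removed from the given namespace. */
    void (*hook_drop)(struct net_sketch *net);
};

struct net_sketch {
    /* NULL until the queue module has set up its state for this namespace. */
    const struct queue_handler_sketch *queue_handler;
    int dropped; /* counts hook_drop calls, for illustration only */
};

static void count_drop(struct net_sketch *net)
{
    net->dropped++;
}

static const struct queue_handler_sketch sketch_handler = {
    .hook_drop = count_drop,
};

/* Register the handler for one namespace only. */
static void register_queue_handler(struct net_sketch *net,
                                   const struct queue_handler_sketch *qh)
{
    assert(net->queue_handler == NULL); /* at most one handler per namespace */
    net->queue_handler = qh;
}

static void unregister_queue_handler(struct net_sketch *net)
{
    net->queue_handler = NULL;
}

/* Safe in a half-initialised namespace: no handler means nothing to drop. */
static void queue_hook_drop(struct net_sketch *net)
{
    const struct queue_handler_sketch *qh = net->queue_handler;

    if (qh != NULL)
        qh->hook_drop(net);
}
```

Because the pointer lives in the namespace rather than in a global, the "is it safe to use the state" question is answered per namespace, which is the property the commit message relies on.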
  2. 09 May, 2016 2 commits
  3. 06 May, 2016 1 commit
  4. 05 May, 2016 1 commit
  5. 25 Apr, 2016 1 commit
  6. 11 Apr, 2016 1 commit
    • net: ipv4: Consider failed nexthops in multipath routes · a6db4494
      David Ahern authored
      Multipath route lookups should consider knowledge about next hops and not
      select a hop that is known to be failed.
      
      Example:
      
                           [h2]                   [h3]   15.0.0.5
                            |                      |
                           3|                     3|
                          [SP1]                  [SP2]--+
                           1  2                   1     2
                           |  |     /-------------+     |
                           |   \   /                    |
                           |     X                      |
                           |    / \                     |
                           |   /   \---------------\    |
                           1  2                     1   2
               12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                           4                         4
                            \                       /
                              \                    /
                               \                  /
                                -------|   |-----/
                                       1   2
                                      [TOR3]
                                        3|
                                         |
                                        [h1]  12.0.0.1
      
      host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
      
          root@h1:~# ip ro ls
          ...
          12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
          15.0.0.0/16
                  nexthop via 12.0.0.2  dev swp1 weight 1
                  nexthop via 12.0.0.3  dev swp1 weight 1
          ...
      
      If the link between tor3 and tor1 is down, and the link between tor1
      and tor2 is down as well, then tor1 is effectively cut off from h1. Yet the route lookups
      in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
      ssh 15.0.0.5 gets the other. Connections that attempt to use the
      12.0.0.2 nexthop fail since that neighbor is not reachable:
      
          root@h1:~# ip neigh show
          ...
          12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
          12.0.0.2 dev swp1  FAILED
          ...
      
      The failed path can be avoided by considering known neighbor information
      when selecting next hops. If the neighbor lookup fails we have no
      knowledge about the nexthop, so give it a shot. If there is an entry
      then only select the nexthop if the state is sane. This is similar to
      what fib_detect_death does.
      
      To maintain backward compatibility use of the neighbor information is
      based on a new sysctl, fib_multipath_use_neigh.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
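The selection rule described above (use a hop when nothing is known about it, skip it when its neighbor entry is known to be failed) can be sketched in userspace C. This is only an illustration under stated assumptions: the types, the three-value state enum, the hash-then-fall-forward strategy and the "keep the hashed hop if everything is failed" fallback are all hypothetical and are not taken from the kernel implementation. The `use_neigh` parameter stands in for the fib_multipath_use_neigh sysctl.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified neighbor states. */
enum neigh_state_sketch {
    NEIGH_UNKNOWN,   /* no neighbor entry: nothing is known about this hop */
    NEIGH_REACHABLE,
    NEIGH_FAILED,
};

struct nexthop_sketch {
    const char *gateway;
    enum neigh_state_sketch state;
};

/* A hop is usable unless it is known to be failed. */
static int nexthop_usable(const struct nexthop_sketch *nh)
{
    return nh->state != NEIGH_FAILED;
}

/*
 * Start at the hop the hash picked, then fall forward to the next usable
 * hop. If every hop is failed, keep the hashed choice (an assumption of
 * this sketch). With use_neigh == 0 the old behaviour is retained.
 */
static size_t select_nexthop(const struct nexthop_sketch *nh, size_t count,
                             unsigned int hash, int use_neigh)
{
    size_t first, i;

    assert(count > 0);
    first = hash % count;
    if (!use_neigh)
        return first;

    for (i = 0; i < count; i++) {
        size_t idx = (first + i) % count;

        if (nexthop_usable(&nh[idx]))
            return idx;
    }
    return first;
}
```

With the two hops from the example (12.0.0.2 failed, 12.0.0.3 reachable), every lookup lands on the second hop once `use_neigh` is set, while the default behaviour still alternates by hash.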
  7. 17 Mar, 2016 1 commit
  8. 08 Mar, 2016 2 commits
  9. 16 Feb, 2016 3 commits
  10. 11 Feb, 2016 4 commits
  11. 07 Feb, 2016 9 commits
  12. 10 Jan, 2016 3 commits
  13. 18 Dec, 2015 1 commit
    • net: Allow accepted sockets to be bound to l3mdev domain · 6dd9a14e
      David Ahern authored
      Allow accepted sockets to derive their sk_bound_dev_if setting from the
      l3mdev domain in which the packets originated. A sysctl setting is added
      to control the behavior which is similar to sk_mark and
      sysctl_tcp_fwmark_accept.
      
      This effectively allows a process to have a "VRF-global" listen socket,
      with child sockets bound to the VRF device in which the packet originated.
      A similar behavior can be achieved using sk_mark, but a solution using marks
      is incomplete as it does not handle duplicate addresses in different L3
      domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
      domain provides a complete solution.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 16 Dec, 2015 1 commit
  15. 14 Oct, 2015 1 commit
  16. 12 Oct, 2015 1 commit
    • ipv4/icmp: redirect messages can use the ingress daddr as source · e2ca690b
      Paolo Abeni authored
      This patch allows configuring how the source address of ICMP
      redirect messages is selected; by default the old behaviour is
      retained, while setting icmp_redirects_use_orig_daddr forces the
      use of the destination address of the packet that caused the
      redirect.
      
      The new behaviour closely follows RFC 5798 section 8.1.1 and fixes the
      following scenario:
      
      Two machines are set up with VRRP to act as routers out of a subnet,
      they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
      x.x.x.254/24.
      
      If a host in said subnet needs to get an ICMP redirect from the VRRP
      router, i.e. to reach a destination behind a different gateway, the
      source IP in the ICMP redirect is chosen as the primary IP on the
      interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
      
      The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
      and will continue to use the wrong next hop.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 20 Jul, 2015 1 commit
    • netfilter: fix netns dependencies with conntrack templates · 0838aa7f
      Pablo Neira Ayuso authored
      Quoting Daniel Borkmann:
      
      "When adding connection tracking template rules to a netns, f.e. to
      configure netfilter zones, the kernel will endlessly busy-loop as soon
      as we try to delete the given netns in case there's at least one
      template present, which is problematic i.e. if there is such bravery that
      the privileged user inside the netns is assumed untrusted.
      
      Minimal example:
      
        ip netns add foo
        ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1
        ip netns del foo
      
      What happens is that when nf_ct_iterate_cleanup() is being called from
      nf_conntrack_cleanup_net_list() for a provided netns, we always end up
      with a net->ct.count > 0 and thus jump back to i_see_dead_people. We
      don't get a soft-lockup as we still have a schedule() point, but the
      serving CPU spins on 100% from that point onwards.
      
      Since templates are normally allocated with nf_conntrack_alloc(), we
      also bump net->ct.count. The issue why they are not yet nf_ct_put() is
      because the per netns .exit() handler from x_tables (which would eventually
      invoke xt_CT's xt_ct_tg_destroy() that drops reference on info->ct) is
      called in the dependency chain at a *later* point in time than the per
      netns .exit() handler for the connection tracker.
      
      This is clearly a chicken'n'egg problem: after the connection tracker
      .exit() handler, we've torn down all the connection tracking
      infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be
      invoked at a later point in time during the netns cleanup, as that would
      lead to a use-after-free. At the same time, we cannot make x_tables depend
      on the connection tracker module, so that the xt_ct_tg_destroy() would
      be invoked earlier in the cleanup chain."
      
      Daniel confirms this has to do with the order in which modules are loaded, or
      with nf_conntrack being compiled as a module while x_tables is built in. So we
      have no guarantees regarding the order in which netns callbacks are executed.
      
      Fix this by allocating the templates through kmalloc() from the respective
      SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache.
      Then, release them via nf_ct_tmpl_free() from destroy_conntrack(). This branch
      is marked as unlikely since conntrack templates are rarely allocated and only
      from the configuration plane path.
      
      Note that templates are no longer kept in any list, to avoid further
      dependencies on nf_conntrack; thus the tmpl larval list is removed.
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Tested-by: Daniel Borkmann <daniel@iogearbox.net>
  18. 15 Jul, 2015 1 commit
    • netfilter: Per network namespace netfilter hooks. · 085db2c0
      Eric W. Biederman authored
      - Add a new set of functions for registering and unregistering per
        network namespace hooks.
      
      - Modify the old global namespace hook functions to use the per
        network namespace hooks in their implementation, so there remains a
        single list that needs to be walked for any hook (this is important
        for keeping the hook priority working and for keeping the code
        walking the hooks simple).
      
      - Only allow registering the per netdevice hooks in the network
        namespace where the network device lives.
      
      - Dynamically allocate the structures in the per network namespace
        hook list in nf_register_net_hook, and unregister them in
        nf_unregister_net_hook.
      
        Dynamic allocation is required somewhere as the number of network
        namespaces is not fixed, so we might as well allocate them in the
        registration function.
      
        The chain of registered hooks on any list is expected to be small so
        the cost of walking that list to find the entry we are unregistering
        should also be small.
      
        Performing the management of the dynamically allocated list entries
        in the registration and unregistration functions keeps the complexity
        from spreading.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
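The allocation scheme in the list above (allocate the list entry at registration time, keep one priority-ordered list per namespace, walk the list to find the entry on unregistration) can be sketched in plain userspace C. All names here are hypothetical stand-ins rather than the kernel's structures, a singly linked list is assumed for brevity, and locking is omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical description of a hook, shared by every namespace. */
struct hook_ops_sketch {
    int priority; /* lower values run first */
    int id;
};

/* Per-namespace list entry, allocated when the hook is registered. */
struct hook_entry_sketch {
    struct hook_entry_sketch *next;
    const struct hook_ops_sketch *ops;
};

struct netns_hooks_sketch {
    struct hook_entry_sketch *hooks; /* single list, ordered by priority */
};

/* Allocate an entry and insert it in priority order. Returns 0 or -1. */
static int register_net_hook(struct netns_hooks_sketch *net,
                             const struct hook_ops_sketch *ops)
{
    struct hook_entry_sketch **pos;
    struct hook_entry_sketch *entry;

    assert(net != NULL && ops != NULL);
    entry = malloc(sizeof(*entry));
    if (entry == NULL)
        return -1;
    entry->ops = ops;

    for (pos = &net->hooks; *pos != NULL; pos = &(*pos)->next)
        if (ops->priority < (*pos)->ops->priority)
            break;
    entry->next = *pos;
    *pos = entry;
    return 0;
}

/* Walk the (short) list, unlink the matching entry and free it. */
static void unregister_net_hook(struct netns_hooks_sketch *net,
                                const struct hook_ops_sketch *ops)
{
    struct hook_entry_sketch **pos;

    for (pos = &net->hooks; *pos != NULL; pos = &(*pos)->next) {
        if ((*pos)->ops == ops) {
            struct hook_entry_sketch *entry = *pos;

            *pos = entry->next;
            free(entry);
            return;
        }
    }
}
```

The same `hook_ops_sketch` object can be registered in several namespaces because each registration owns its own list entry, which is why the entry has to be allocated dynamically.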
  19. 09 Jul, 2015 1 commit
    • ipv6: Nonlocal bind · 35a256fe
      Tom Herbert authored
      Add support to allow non-local binds similar to how this was done for IPv4.
      Non-local binds are very useful in emulating the Internet in a box, etc.
      
      This adds the ip_nonlocal_bind sysctl under ipv6.
      
      Testing:
      
      Set up nonlocal binding and receive routing on a host, e.g.:
      
      ip -6 rule add from ::/0 iif eth0 lookup 200
      ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
      sysctl -w net.ipv6.ip_nonlocal_bind=1
      
      Set up routing to 2001:0:0:1::/64 on peer to go to first host
      
      ping6 -I 2001:0:0:1::1 peer-address -- to verify
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 18 Jun, 2015 2 commits
    • netfilter: don't pull include/linux/netfilter.h from netns headers · a263653e
      Pablo Neira Ayuso authored
      Including linux/netfilter.h from the netns headers pulls the full netfilter
      hook definitions into every file that includes net_namespace.h.
      
      Instead let's just include the bare minimum required in the new
      linux/netfilter_defs.h file, and use it from the netfilter netns header files.
      
      I also needed to include in.h and in6.h from linux/netfilter.h otherwise we hit
      this compilation error:
      
      In file included from include/linux/netfilter_defs.h:4:0,
                       from include/net/netns/netfilter.h:4,
                       from include/net/net_namespace.h:22,
                       from include/linux/netdevice.h:43,
                       from net/netfilter/nfnetlink_queue_core.c:23:
      include/uapi/linux/netfilter.h:76:17: error: field ‘in’ has incomplete type
        struct in_addr in;
      
      Also explicitly include linux/netfilter.h in several spots.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • netfilter: use forward declaration instead of including linux/proc_fs.h · 10c04a8e
      Pablo Neira Ayuso authored
      We don't need to pull the full definitions in that file, a simple forward
      declaration is enough.
      
      Moreover, include linux/proc_fs.h from nf_synproxy_core, otherwise this hits a
      compilation error due to missing declarations, i.e.:
      
      net/netfilter/nf_synproxy_core.c: In function ‘synproxy_proc_init’:
      net/netfilter/nf_synproxy_core.c:326:2: error: implicit declaration of function ‘proc_create’ [-Werror=implicit-function-declaration]
        if (!proc_create("synproxy", S_IRUGO, net->proc_net_stat,
        ^
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
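The forward-declaration technique mentioned in this commit works because C only needs a complete type when a struct is stored by value or dereferenced. A minimal sketch, with hypothetical type and function names that are not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Forward declaration: enough for any code that only stores or passes a
 * pointer. The full definition would live in a heavier header that this
 * file no longer has to include.
 */
struct proc_entry_sketch;

struct netns_state_sketch {
    /* A pointer to an incomplete type is valid as a struct member. */
    struct proc_entry_sketch *proc_dir;
};

static void netns_state_init(struct netns_state_sketch *st)
{
    assert(st != NULL);
    st->proc_dir = NULL;
}
```

Only the files that actually call into the full API (such as the one hitting the `proc_create` error above) then need to include the complete header themselves.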
  21. 14 Jun, 2015 1 commit
    • sctp: fix ASCONF list handling · 2d45a02d
      Marcelo Ricardo Leitner authored
      ->auto_asconf_splist is per namespace and mangled by functions like
      sctp_setsockopt_auto_asconf(), which do not guarantee any serialization.
      
      Also, the call to inet_sk_copy_descendant() was backing up
      ->auto_asconf_list through the copy but was not honoring
      ->do_auto_asconf, which could lead to list corruption if it was
      different between both sockets.
      
      This commit thus fixes the list handling by using the ->addr_wq_lock
      spinlock to protect the list. Special handling is done upon socket
      creation and destruction for that. sctp_init_sock() will never
      return an error after having initialized asconf, so
      sctp_destroy_sock() can be called without addr_wq_lock. The lock will
      now be taken in sctp_close_sock(), before locking the socket, so we
      don't do it in inverse order compared to sctp_addr_wq_timeout_handler().
      
      Instead of taking the lock on sctp_sock_migrate() for copying and
      restoring the list values, it's preferred to avoid rewriting it by
      implementing sctp_copy_descendant().
      
      Issue was found with a test application that kept flipping sysctl
      default_auto_asconf on and off, but one could trigger it by issuing
      simultaneous setsockopt() calls on multiple sockets or by
      creating/destroying sockets fast enough. This is only triggerable
      locally.
      
      Fixes: 9f7d653b ("sctp: Add Auto-ASCONF support (core).")
      Reported-by: Ji Jianwen <jiji@redhat.com>
      Suggested-by: Neil Horman <nhorman@tuxdriver.com>
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
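The locking rule described above (one lock protects the per-namespace list, and it is taken before the socket lock so the order is never inverted) can be sketched with pthread mutexes in userspace. This is an illustration only: the structures, the singly linked list and the function names are hypothetical, and the kernel uses a spinlock and its own socket locking rather than pthread mutexes.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct sock_sketch {
    pthread_mutex_t sock_lock;
    int do_auto_asconf;           /* 1 while linked on the list */
    struct sock_sketch *next;
};

struct netns_sketch {
    pthread_mutex_t addr_wq_lock; /* protects auto_asconf_list */
    struct sock_sketch *auto_asconf_list;
};

/* Caller must hold addr_wq_lock. */
static void list_unlink(struct netns_sketch *ns, struct sock_sketch *sk)
{
    struct sock_sketch **pos;

    for (pos = &ns->auto_asconf_list; *pos != NULL; pos = &(*pos)->next) {
        if (*pos == sk) {
            *pos = sk->next;
            sk->next = NULL;
            return;
        }
    }
}

/*
 * Toggle list membership. Lock order is always list lock first, socket
 * lock second, and the flag is checked so a socket is never linked twice.
 */
static void set_auto_asconf(struct netns_sketch *ns, struct sock_sketch *sk,
                            int on)
{
    assert(ns != NULL && sk != NULL);
    pthread_mutex_lock(&ns->addr_wq_lock);
    pthread_mutex_lock(&sk->sock_lock);
    if (on && !sk->do_auto_asconf) {
        sk->next = ns->auto_asconf_list;
        ns->auto_asconf_list = sk;
        sk->do_auto_asconf = 1;
    } else if (!on && sk->do_auto_asconf) {
        list_unlink(ns, sk);
        sk->do_auto_asconf = 0;
    }
    pthread_mutex_unlock(&sk->sock_lock);
    pthread_mutex_unlock(&ns->addr_wq_lock);
}

/* Count list members under the list lock. */
static size_t list_length(struct netns_sketch *ns)
{
    size_t n = 0;
    struct sock_sketch *sk;

    pthread_mutex_lock(&ns->addr_wq_lock);
    for (sk = ns->auto_asconf_list; sk != NULL; sk = sk->next)
        n++;
    pthread_mutex_unlock(&ns->addr_wq_lock);
    return n;
}
```

Keeping a single, fixed lock order in every path that touches both locks is what rules out the inverse-order deadlock the commit message mentions.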
  22. 27 May, 2015 1 commit