1. 12 Mar, 2015 2 commits
  2. 11 Mar, 2015 1 commit
    • Eric Dumazet's avatar
      net: add real socket cookies · 33cf7c90
      Eric Dumazet authored
      A long standing problem in netlink socket dumps is the use
      of kernel socket addresses as cookies.
      
      1) It is a security concern.
      
      2) Sockets can be reused quite quickly, so there is
         no guarantee a cookie is used once and identify
         a flow.
      
      3) request sock, establish sock, and timewait socks
         for a given flow have different cookies.
      
      Part of our effort to bring better TCP statistics requires
      to switch to a different allocator.
      
      In this patch, I chose to use a per network namespace 64bit generator,
      and to use it only in the case a socket needs to be dumped to netlink.
      (This might be refined later if needed)
      
      Note that I tried to carry cookies from request sock, to establish sock,
      then timewait sockets.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Eric Salo <salo@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      33cf7c90
  3. 03 Mar, 2015 1 commit
    • Eric W. Biederman's avatar
      mpls: Basic routing support · 0189197f
      Eric W. Biederman authored
      This change adds a new Kconfig option MPLS_ROUTING.
      
      The core of this change is the code to look at an mpls packet received
      from another machine.  Look that packet up in a routing table and
      forward the packet on.
      
      Support of MPLS over ATM is not considered or attempted here.  This
      implemntation follows RFC3032 and implements the MPLS shim header that
      can pass over essentially any network.
      
      What RFC3021 refers to as the as the Incoming Label Map (ILM) I call
      net->mpls.platform_label[].  What RFC3031 refers to as the Next Label
      Hop Forwarding Entry (NHLFE) I call mpls_route.  Though calling it the
      label fordwarding information base (lfib) might also be valid.
      
      Further the implemntation forwards packets as described in RFC3032.
      There is no need and given the original motivation for MPLS a strong
      discincentive to have a flexible label forwarding path.  In essence
      the logic is the topmost label is read, looked up, removed, and
      replaced by 0 or more new lables and the sent out the specified
      interface to it's next hop.
      
      Quite a few optional features are not implemented here.  Among them
      are generation of ICMP errors when the TTL is exceeded or the packet
      is larger than the next hop MTU (those conditions are detected and the
      packets are dropped instead of generating an icmp error).  The traffic
      class field is always set to 0.  The implementation focuses on IP over
      MPLS and does not handle egress of other kinds of protocols.
      
      Instead of implementing coordination with the neighbour table and
      sorting out how to input next hops in a different address family (for
      which there is value).  I was lazy and implemented a next hop mac
      address instead.  The code is simpler and there are flavor of MPLS
      such as MPLS-TP where neither an IPv4 nor an IPv6 next hop is
      appropriate so a next hop by mac address would need to be implemented
      at some point.
      
      Two new definitions AF_MPLS and PF_MPLS are exposed to userspace.
      
      Decoding the mpls header must be done by first byeswapping a 32bit bit
      endian word into the local cpu endian and then bit shifting to extract
      the pieces.  There is no C bit-field that can represent a wire format
      mpls header on a little endian machine as the low bits of the 20bit
      label wind up in the wrong half of third byte.  Therefore internally
      everything is deal with in cpu native byte order except when writing
      to and reading from a packet.
      
      For management simplicity if a label is configured to forward out
      an interface that is down the packet is dropped early.  Similarly
      if an network interface is removed rt_dev is updated to NULL
      (so no reference is preserved) and any packets for that label
      are dropped.  Keeping the label entries in the kernel allows
      the kernel label table to function as the definitive source
      of which labels are allocated and which are not.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0189197f
  4. 19 Jan, 2015 1 commit
  5. 04 Dec, 2014 1 commit
  6. 30 Sep, 2014 1 commit
    • Hannes Frederic Sowa's avatar
      ipv6: remove rt6i_genid · 705f1c86
      Hannes Frederic Sowa authored
      Eric Dumazet noticed that all no-nonexthop or no-gateway routes which
      are already marked DST_HOST (e.g. input routes routes) will always be
      invalidated during sk_dst_check. Thus per-socket dst caching absolutely
      had no effect and early demuxing had no effect.
      
      Thus this patch removes rt6i_genid: fn_sernum already gets modified during
      add operations, so we only must ensure we mutate fn_sernum during ipv6
      address remove operations. This is a fairly cost extensive operations,
      but address removal should not happen that often. Also our mtu update
      functions do the same and we heard no complains so far. xfrm policy
      changes also cause a call into fib6_flush_trees. Also plug a hole in
      rt6_info (no cacheline changes).
      
      I verified via tracing that this change has effect.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      705f1c86
  7. 24 Apr, 2014 1 commit
  8. 20 Apr, 2014 1 commit
  9. 16 Apr, 2014 1 commit
    • Cong Wang's avatar
      ipv4, fib: pass LOOPBACK_IFINDEX instead of 0 to flowi4_iif · 6a662719
      Cong Wang authored
      As suggested by Julian:
      
      	Simply, flowi4_iif must not contain 0, it does not
      	look logical to ignore all ip rules with specified iif.
      
      because in fib_rule_match() we do:
      
              if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
                      goto out;
      
      flowi4_iif should be LOOPBACK_IFINDEX by default.
      
      We need to move LOOPBACK_IFINDEX to include/net/flow.h:
      
      1) It is mostly used by flowi_iif
      
      2) Fix the following compile error if we use it in flow.h
      by the patches latter:
      
      In file included from include/linux/netfilter.h:277:0,
                       from include/net/netns/netfilter.h:5,
                       from include/net/net_namespace.h:21,
                       from include/linux/netdevice.h:43,
                       from include/linux/icmpv6.h:12,
                       from include/linux/ipv6.h:61,
                       from include/net/ipv6.h:16,
                       from include/linux/sunrpc/clnt.h:27,
                       from include/linux/nfs_fs.h:30,
                       from init/do_mounts.c:32:
      include/net/flow.h: In function ‘flowi4_init_output’:
      include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6a662719
  10. 28 Feb, 2014 1 commit
  11. 09 Feb, 2014 1 commit
  12. 14 Oct, 2013 1 commit
  13. 28 Sep, 2013 1 commit
    • Eric W. Biederman's avatar
      net: Delay default_device_exit_batch until no devices are unregistering v2 · 50624c93
      Eric W. Biederman authored
      There is currently serialization network namespaces exiting and
      network devices exiting as the final part of netdev_run_todo does not
      happen under the rtnl_lock.  This is compounded by the fact that the
      only list of devices unregistering in netdev_run_todo is local to the
      netdev_run_todo.
      
      This lack of serialization in extreme cases results in network devices
      unregistering in netdev_run_todo after the loopback device of their
      network namespace has been freed (making dst_ifdown unsafe), and after
      the their network namespace has exited (making the NETDEV_UNREGISTER,
      and NETDEV_UNREGISTER_FINAL callbacks unsafe).
      
      Add the missing serialization by a per network namespace count of how
      many network devices are unregistering and having a wait queue that is
      woken up whenever the count is decreased.  The count and wait queue
      allow default_device_exit_batch to wait until all of the unregistration
      activity for a network namespace has finished before proceeding to
      unregister the loopback device and then allowing the network namespace
      to exit.
      
      Only a single global wait queue is used because there is a single global
      lock, and there is a single waiter, per network namespace wait queues
      would be a waste of resources.
      
      The per network namespace count of unregistering devices gives a
      progress guarantee because the number of network devices unregistering
      in an exiting network namespace must ultimately drop to zero (assuming
      network device unregistration completes).
      
      The basic logic remains the same as in v1.  This patch is now half
      comment and half rtnl_lock_unregistering an expanded version of
      wait_event performs no extra work in the common case where no network
      devices are unregistering when we get to default_device_exit_batch.
      Reported-by: default avatarFrancesco Ruggeri <fruggeri@aristanetworks.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      50624c93
  14. 21 Sep, 2013 1 commit
  15. 31 Jul, 2013 1 commit
  16. 26 Jun, 2013 1 commit
  17. 03 Jun, 2013 1 commit
    • Timo Teräs's avatar
      ipv4: use separate genid for next hop exceptions · 5aad1de5
      Timo Teräs authored
      commit 13d82bf5 (ipv4: Fix flushing of cached routing informations)
      added the support to flush learned pmtu information.
      
      However, using rt_genid is quite heavy as it is bumped on route
      add/change and multicast events amongst other places. These can
      happen quite often, especially if using dynamic routing protocols.
      
      While this is ok with routes (as they are just recreated locally),
      the pmtu information is learned from remote systems and the icmp
      notification can come with long delays. It is worthy to have separate
      genid to avoid excessive pmtu resets.
      
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarTimo Teräs <timo.teras@iki.fi>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5aad1de5
  18. 05 Apr, 2013 1 commit
  19. 20 Nov, 2012 1 commit
    • Eric W. Biederman's avatar
      proc: Usable inode numbers for the namespace file descriptors. · 98f842e6
      Eric W. Biederman authored
      Assign a unique proc inode to each namespace, and use that
      inode number to ensure we only allocate at most one proc
      inode for every namespace in proc.
      
      A single proc inode per namespace allows userspace to test
      to see if two processes are in the same namespace.
      
      This has been a long requested feature and only blocked because
      a naive implementation would put the id in a global space and
      would ultimately require having a namespace for the names of
      namespaces, making migration and certain virtualization tricks
      impossible.
      
      We still don't have per superblock inode numbers for proc, which
      appears necessary for application unaware checkpoint/restart and
      migrations (if the application is using namespace file descriptors)
      but that is now allowd by the design if it becomes important.
      
      I have preallocated the ipc and uts initial proc inode numbers so
      their structures can be statically initialized.
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      98f842e6
  20. 18 Nov, 2012 4 commits
  21. 05 Oct, 2012 1 commit
  22. 19 Sep, 2012 1 commit
  23. 18 Sep, 2012 1 commit
  24. 15 Aug, 2012 1 commit
  25. 09 Aug, 2012 2 commits
  26. 16 Jul, 2012 1 commit
    • Andrey Vagin's avatar
      net: make sock diag per-namespace · 51d7cccf
      Andrey Vagin authored
      Before this patch sock_diag works for init_net only and dumps
      information about sockets from all namespaces.
      
      This patch expands sock_diag for all name-spaces.
      It creates a netlink kernel socket for each netns and filters
      data during dumping.
      
      v2: filter accoding with netns in all places
          remove an unused variable.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: default avatarAndrew Vagin <avagin@openvz.org>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      51d7cccf
  27. 23 Apr, 2012 1 commit
  28. 20 Apr, 2012 4 commits
  29. 11 Dec, 2011 1 commit
  30. 26 Jul, 2011 1 commit
  31. 01 Jul, 2011 1 commit
    • Thomas Graf's avatar
      rtnl: provide link dump consistency info · 4e985ada
      Thomas Graf authored
      This patch adds a change sequence counter to each net namespace
      which is bumped whenever a netdevice is added or removed from
      the list. If such a change occurred while a link dump took place,
      the dump will have the NLM_F_DUMP_INTR flag set in the first
      message which has been interrupted and in all subsequent messages
      of the same dump.
      
      Note that links may still be modified or renamed while a dump is
      taking place but we can guarantee for userspace to receive a
      complete list of links and not miss any.
      
      Testing:
      I have added 500 VLAN netdevices to make sure the dump is split
      over multiple messages. Then while continuously dumping links in
      one process I also continuously deleted and re-added a dummy
      netdevice in another process. Multiple dumps per seconds have
      had the NLM_F_DUMP_INTR flag set.
      
      I guess we can wait for Johannes patch to hit net-next via the
      wireless tree.  I just wanted to give this some testing right away.
      Signed-off-by: default avatarThomas Graf <tgraf@infradead.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4e985ada
  32. 12 Jun, 2011 1 commit
    • Al Viro's avatar
      Delay struct net freeing while there's a sysfs instance refering to it · a685e089
      Al Viro authored
      	* new refcount in struct net, controlling actual freeing of the memory
      	* new method in kobj_ns_type_operations (->drop_ns())
      	* ->current_ns() semantics change - it's supposed to be followed by
      corresponding ->drop_ns().  For struct net in case of CONFIG_NET_NS it bumps
      the new refcount; net_drop_ns() decrements it and calls net_free() if the
      last reference has been dropped.  Method renamed to ->grab_current_ns().
      	* old net_free() callers call net_drop_ns() instead.
      	* sysfs_exit_ns() is gone, along with a large part of callchain
      leading to it; now that the references stored in ->ns[...] stay valid we
      do not need to hunt them down and replace them with NULL.  That fixes
      problems in sysfs_lookup() and sysfs_readdir(), along with getting rid
      of sb->s_instances abuse.
      
      	Note that struct net *shutdown* logics has not changed - net_cleanup()
      is called exactly when it used to be called.  The only thing postponed by
      having a sysfs instance refering to that struct net is actual freeing of
      memory occupied by struct net.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      a685e089