      net: Pass ndm_state to route netlink FDB notifications. · b3379041
      Hubert Sokolowski authored
      Before this change, applications monitoring FDB notifications
      were not able to determine whether a new FDB entry is permanent
      or not:
      bridge fdb add f1:f2:f3:f4:f5:f8 dev sw0p1 temp self
      bridge fdb add f1:f2:f3:f4:f5:f9 dev sw0p1 self
      bridge monitor fdb
      f1:f2:f3:f4:f5:f8 dev sw0p1 self permanent
      f1:f2:f3:f4:f5:f9 dev sw0p1 self permanent
      With this change ndm_state from the original netlink message
      is passed to the new netlink message sent as notification.
      bridge fdb add f1:f2:f3:f4:f5:f6 dev sw0p1 self
      bridge fdb add f1:f2:f3:f4:f5:f7 dev sw0p1 temp self
      bridge monitor fdb
      f1:f2:f3:f4:f5:f6 dev sw0p1 self permanent
      f1:f2:f3:f4:f5:f7 dev sw0p1 self static
      Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
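A monitoring application can only tell the two entries apart if `ndm_state` in the notification carries the original value. The following is a minimal sketch of how a listener might read that field, assuming the payload starts with a `struct ndmsg` in native byte order. The `NUD_*` values are taken from `include/uapi/linux/neighbour.h`; the helper names and the label mapping in `describe_state()` are hypothetical and only illustrate the idea.

```python
import struct

# NUD_* state bits from include/uapi/linux/neighbour.h.
NUD_STALE = 0x04
NUD_NOARP = 0x40
NUD_PERMANENT = 0x80

# struct ndmsg: ndm_family, ndm_pad1, ndm_pad2, ndm_ifindex,
# ndm_state, ndm_flags, ndm_type (12 bytes in total).
NDMSG = struct.Struct("=BBHiHBB")


def parse_ndm_state(payload: bytes) -> int:
    """Return ndm_state from the ndmsg at the start of a notification payload."""
    fields = NDMSG.unpack_from(payload)
    return fields[4]


def describe_state(state: int) -> str:
    """Hypothetical label mapping, for illustration only."""
    if state & NUD_PERMANENT:
        return "permanent"
    if state & NUD_NOARP:
        return "static"
    if state & NUD_STALE:
        return "stale"
    return ""
```

If the notification always reported `NUD_PERMANENT`, `describe_state()` would return the same label for every entry, which is the behaviour described before the change.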
      switchdev: Pass original device to port netdev driver · 6ff64f6f
      Ido Schimmel authored
      switchdev drivers need to know the netdev on which the switchdev op was
      invoked. For example, the STP state of a VLAN interface configured on top
      of a port can change while it is a member of a bridge. In this case, the
      underlying driver should only change the STP state of that particular
      VLAN and not of all the VLANs configured on the port.
      However, current switchdev infrastructure only passes the port netdev down
      to the driver. Solve that by passing the original device down to the
      driver as part of the required switchdev object / attribute.
      This doesn't entail any change in current switchdev drivers. It simply
      enables those supporting stacked devices to know the originating device
      and act accordingly.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0
      Arad, Ronen authored
      if_nlmsg_size() overestimates the minimum allocation size of netlink
      dump request (when called from rtnl_calcit()) or the size of the
      message (when called from rtnl_getlink()). This is because
      ext_filter_mask is not supported by rtnl_link_get_af_size() and
      The over-estimation is significant when at least one netdev has many
      VLANs configured (8 bytes for each configured VLAN).
      This patch-set "rightsizes" the protocol specific attribute size
      calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
      and adding it as an argument to the get_link_af_size op in rtnl_af_ops.
      Bridge module already used filtering aware sizing for notifications.
      br_get_link_af_size_filtered() is consistent with the modified
      get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
      br_get_link_af_size() becomes unused and thus removed.
      Signed-off-by: Ronen Arad <ronen.arad@intel.com>
      Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
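The "8 bytes for each configured VLAN" figure follows from netlink attribute alignment: a 4-byte attribute header plus a 4-byte `struct bridge_vlan_info`. A rough sketch of a filter-aware size estimate is shown below. It is not the kernel's `br_get_link_af_size_filtered()`; the function `vlan_af_size()` is a hypothetical helper, and it deliberately ignores the compressed-VLAN filter flag.

```python
NLA_HDRLEN = 4             # sizeof(struct nlattr)
BRIDGE_VLAN_INFO_SIZE = 4  # struct bridge_vlan_info: __u16 flags, __u16 vid

# From include/uapi/linux/rtnetlink.h.
RTEXT_FILTER_BRVLAN = 1 << 1


def nla_align(length: int) -> int:
    """Round up to the 4-byte netlink attribute alignment."""
    return (length + 3) & ~3


def nla_total_size(payload: int) -> int:
    """Attribute header plus payload, padded to the alignment."""
    return nla_align(NLA_HDRLEN + payload)


def vlan_af_size(num_vlans: int, ext_filter_mask: int) -> int:
    """Hypothetical estimate: VLAN space is only needed when it was requested."""
    if not ext_filter_mask & RTEXT_FILTER_BRVLAN:
        return 0
    return num_vlans * nla_total_size(BRIDGE_VLAN_INFO_SIZE)
```

With 4094 VLANs on a port, ignoring the filter mask would reserve roughly 32 KB for a request that never asked for VLAN information.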
      tcp: use dctcp if enabled on the route to the initiator · c3a8d947
      Daniel Borkmann authored
      Currently, the following case doesn't use DCTCP, even if it should:
      A responder has e.g. Cubic as system-wide default, but for a specific
      route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
      initiating host then uses DCTCP as congestion control, but since the
      initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
      and we have to fall back to Reno after 3WHS completes.
      We were thinking about how to solve this in a minimal, non-intrusive
      way without bloating tcp_ecn_create_request() needlessly: let's cache
      the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
      is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
      contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
      doing only a single metric feature lookup inside tcp_ecn_create_request().
      Joint work with Florian Westphal.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
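The decision rule from the message above can be written as a small predicate. This is a simplified model, not the body of `tcp_ecn_create_request()`: the function name, its parameters and the bit values are assumptions made for the illustration, and the listener's own congestion control is left out.

```python
# Bit positions are assumptions for this sketch.
RTAX_FEATURE_ECN = 1 << 0
DST_FEATURE_ECN_CA = 1 << 31  # internal-only flag cached in the route features


def request_ecn_ok(syn_requests_ecn: bool, syn_is_ect: bool,
                   ecn_enabled: bool, route_features: int) -> bool:
    """Decide whether the connection request may use ECN.

    syn_requests_ecn: the SYN carried the ECN setup flags
    syn_is_ect:       the SYN's IP header was marked ECT(0)
    ecn_enabled:      ECN is enabled system wide
    route_features:   feature bits stored on the route to the initiator
    """
    if not syn_requests_ecn:
        return False
    if syn_is_ect:
        # An ECT-marked SYN is accepted only if the route's congestion
        # control algorithm needs ECN (e.g. DCTCP set via RTAX_CC_ALGO).
        return bool(route_features & DST_FEATURE_ECN_CA)
    return ecn_enabled or bool(route_features & RTAX_FEATURE_ECN)
```

The single `route_features` test is the point of the change: no per-request lookup of the congestion control algorithm is needed.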
      rtnetlink: verify IFLA_VF_INFO attributes before passing them to driver · 4f7d2cdf
      Daniel Borkmann authored
      Jason Gunthorpe reported that since commit c02db8c6 ("rtnetlink: make
      SR-IOV VF interface symmetric"), we don't verify IFLA_VF_INFO attributes
      anymore with respect to their policy, that is, ifla_vfinfo_policy[].
      Before, they were part of ifla_policy[], but they have been nested since
      placed under IFLA_VFINFO_LIST, that contains the attribute IFLA_VF_INFO,
      which is another nested attribute for the actual VF attributes such as
      Despite the policy being split out from ifla_policy[] in this commit,
      it's never applied anywhere. nla_for_each_nested() only does basic nla_ok()
      testing for struct nlattr, but it doesn't know about the data context and
      their requirements.
      The fix, on top of Jason's initial work, does 1) parsing of the attributes
      with the right policy, and 2) using the resulting parsed attribute table
      from 1) instead of the nla_for_each_nested() loop (just like we used to
      do when still part of ifla_policy[]).
      Reference: http://thread.gmane.org/gmane.linux.network/368913
      Fixes: c02db8c6 ("rtnetlink: make SR-IOV VF interface symmetric")
      Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
      Cc: Greg Rose <gregory.v.rose@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Rony Efraim <ronye@mellanox.com>
      Cc: Vlad Zolotarov <vladz@cloudius-systems.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Vlad Zolotarov <vladz@cloudius-systems.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
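The difference between walking nested attributes and parsing them against a policy can be modelled in a few lines. The sketch below is a userspace approximation, not the kernel's `nla_parse_nested()`: `iter_nested()` only checks header and length sanity (comparable to `nla_ok()`), while `parse_nested()` additionally enforces a per-type minimum payload length. The helper names and the dictionary-based policy are hypothetical.

```python
import struct

NLA_HDRLEN = 4  # struct nlattr: __u16 nla_len, __u16 nla_type


def nla_align(length: int) -> int:
    return (length + 3) & ~3


def nla_pack(nla_type: int, payload: bytes) -> bytes:
    """Build one attribute, padded to the 4-byte alignment."""
    raw = struct.pack("=HH", NLA_HDRLEN + len(payload), nla_type) + payload
    return raw + b"\0" * (nla_align(len(raw)) - len(raw))


def iter_nested(buf: bytes):
    """Yield (type, payload) pairs, checking only that each header fits."""
    off = 0
    while len(buf) - off >= NLA_HDRLEN:
        nla_len, nla_type = struct.unpack_from("=HH", buf, off)
        if nla_len < NLA_HDRLEN or nla_len > len(buf) - off:
            break
        yield nla_type, buf[off + NLA_HDRLEN:off + nla_len]
        off += nla_align(nla_len)


def parse_nested(buf: bytes, policy: dict) -> dict:
    """Return a type -> payload table, rejecting payloads that are too short.

    policy maps attribute type to the minimum payload length; types that
    are not listed are skipped in this sketch.
    """
    table = {}
    for nla_type, payload in iter_nested(buf):
        min_len = policy.get(nla_type)
        if min_len is None:
            continue
        if len(payload) < min_len:
            raise ValueError(f"attribute {nla_type} shorter than {min_len} bytes")
        table[nla_type] = payload
    return table
```

A truncated attribute passes `iter_nested()` unchanged, which is why a loop of that kind alone cannot protect the driver from undersized VF structures.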
      switchdev; add VLAN support for port's bridge_getlink · 7d4f8d87
      Scott Feldman authored
      One more missing piece of the puzzle.  Add vlan dump support to switchdev
      port's bridge_getlink.  iproute2 "bridge vlan show" cmd already knows how
      to show the vlans installed on the bridge and the device, but (until now)
      no one implemented the port vlan part of the netlink PF_BRIDGE:RTM_GETLINK
      msg.  Before this patch, "bridge vlan show":
      	$ bridge -c vlan show
      	port    vlan ids
      	sw1p1    30-34			<< bridge side vlans
      	sw1p1				<< device side vlans (missing)
      	sw1p2    57
      	br0     None
      (When the port is bridged, the output repeats the vlan list for the vlans
      on the bridge side of the port and the vlans on the device side of the
      port.  The listing above shows no vlans for the device side even though they
      are installed).
      After this patch:
      	$ bridge -c vlan show
      	port    vlan ids
      	sw1p1    30-34			<< bridge side vlan
      	sw1p1    30-34			<< device side vlans
      		 3840 PVID
      	sw1p2    57
      	sw1p2    57
      		 3840 PVID
      	sw1p3    3842 PVID
      	sw1p4    3843 PVID
      	br0     None
      I re-used ndo_dflt_bridge_getlink to add vlan fill call-back func.
      switchdev support adds an obj dump for VLAN objects, using the same
      call-back scheme as FDB dump.  Support included for both compressed and
      un-compressed vlan dumps.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
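The compressed dump mentioned above reports runs of consecutive VLAN IDs as ranges such as `30-34` instead of one entry per VLAN. A minimal sketch of that grouping follows; it assumes a range may only span VLANs with identical flags. `compress_vlans()` and `format_ranges()` are hypothetical helpers, not the switchdev or bridge code, and the flag value comes from `include/uapi/linux/if_bridge.h`.

```python
BRIDGE_VLAN_INFO_PVID = 1 << 1  # from include/uapi/linux/if_bridge.h


def compress_vlans(vlans):
    """Group (vid, flags) pairs into (first_vid, last_vid, flags) ranges."""
    ranges = []
    for vid, flags in sorted(vlans):
        if ranges and ranges[-1][1] == vid - 1 and ranges[-1][2] == flags:
            first, _last, _flags = ranges[-1]
            ranges[-1] = (first, vid, flags)  # extend the current run
        else:
            ranges.append((vid, vid, flags))  # start a new run
    return ranges


def format_ranges(ranges):
    """Render ranges roughly the way the listings above show them."""
    out = []
    for first, last, flags in ranges:
        text = str(first) if first == last else f"{first}-{last}"
        if flags & BRIDGE_VLAN_INFO_PVID:
            text += " PVID"
        out.append(text)
    return out
```

For the `sw1p1` port in the listing, VLANs 30 to 34 plus PVID 3840 would collapse into two entries.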
      rtnl/bond: don't send rtnl msg for unregistered iface · ed2a80ab
      Nicolas Dichtel authored
      Before the patch, the command 'ip link add bond2 type bond mode 802.3ad'
      causes the kernel to send a rtnl message for the bond2 interface, with an
      ifindex 0.
      'ip monitor' shows:
      0: bond2: <BROADCAST,MULTICAST,MASTER> mtu 1500 state DOWN group default
          link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
      9: bond2@NONE: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default
          link/ether ea:3e:1f:53:92:7b brd ff:ff:ff:ff:ff:ff
      The patch fixes the spotted bug by checking in bond driver if the interface
      is registered before calling the notifier chain.
      It also adds a check in rtmsg_ifinfo() to prevent this kind of bug in the
      Fixes: d4261e56 ("bonding: create netlink event when bonding option is changed")
      CC: Jiri Pirko <jiri@resnulli.us>
      Reported-by: Julien Meunier <julien.meunier@6wind.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bridge/nl: remove wrong use of NLM_F_MULTI · 46c264da
      Nicolas Dichtel authored
      NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
      it is sent only at the end of a dump.
      Libraries like libnl will wait forever for NLMSG_DONE.
      Fixes: e5a55a89 ("net: create generic bridge ops")
      Fixes: 815cccbf ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
      CC: John Fastabend <john.r.fastabend@intel.com>
      CC: Sathya Perla <sathya.perla@emulex.com>
      CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
      CC: Ajit Khaparde <ajit.khaparde@emulex.com>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: intel-wired-lan@lists.osuosl.org
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: bridge@lists.linux-foundation.org
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
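The reason a stray `NLM_F_MULTI` makes a receiver wait forever can be seen in a simplified receive check. The sketch below is an assumption about how a library might decide that a reply is complete; it is not libnl's implementation. The constants come from `include/uapi/linux/netlink.h`, and `reply_complete()` is a hypothetical helper that expects the buffer to hold at least one message.

```python
import struct

NLMSG_DONE = 3
NLM_F_MULTI = 0x2

# struct nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSGHDR = struct.Struct("=IHHII")


def nlmsg_align(length: int) -> int:
    return (length + 3) & ~3


def reply_complete(buf: bytes) -> bool:
    """Return True if no further recv() is needed for this reply."""
    off = 0
    multipart = False
    while len(buf) - off >= NLMSGHDR.size:
        length, msg_type, flags, _seq, _pid = NLMSGHDR.unpack_from(buf, off)
        if length < NLMSGHDR.size or length > len(buf) - off:
            break
        if msg_type == NLMSG_DONE:
            return True           # end of a multipart reply
        if flags & NLM_F_MULTI:
            multipart = True      # more parts are promised
        off += nlmsg_align(length)
    return not multipart
```

A single reply that sets `NLM_F_MULTI` but is never followed by `NLMSG_DONE` leaves such a receiver in the "more parts are promised" state indefinitely.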
      net: allow to delete a whole device group · 66400d54
      WANG Cong authored
      With dev group, we can change a batch of net devices,
      so we should allow deleting them together too.
      Group 0 is not allowed to be deleted since it is
      the default group.
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      net: use for_each_netdev_safe() in rtnl_group_changelink() · d079535d
      WANG Cong authored
      In case we move the whole dev group to another netns,
      we should call for_each_netdev_safe(), otherwise we get
      a soft lockup:
       NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ip:798]
       irq event stamp: 255424
       hardirqs last  enabled at (255423): [<ffffffff81a2aa95>] restore_args+0x0/0x30
       hardirqs last disabled at (255424): [<ffffffff81a2ad5a>] apic_timer_interrupt+0x6a/0x80
       softirqs last  enabled at (255422): [<ffffffff81079ebc>] __do_softirq+0x2c1/0x3a9
       softirqs last disabled at (255417): [<ffffffff8107a190>] irq_exit+0x41/0x95
       CPU: 0 PID: 798 Comm: ip Not tainted 4.0.0-rc4+ #881
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       task: ffff8800d1b88000 ti: ffff880119530000 task.ti: ffff880119530000
       RIP: 0010:[<ffffffff810cad11>]  [<ffffffff810cad11>] debug_lockdep_rcu_enabled+0x28/0x30
       RSP: 0018:ffff880119533778  EFLAGS: 00000246
       RAX: ffff8800d1b88000 RBX: 0000000000000002 RCX: 0000000000000038
       RDX: 0000000000000000 RSI: ffff8800d1b888c8 RDI: ffff8800d1b888c8
       RBP: ffff880119533778 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 000000000000b5c2 R12: 0000000000000246
       R13: ffff880119533708 R14: 00000000001d5a40 R15: ffff88011a7d5a40
       FS:  00007fc01315f740(0000) GS:ffff88011a600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 00007f367a120988 CR3: 000000011849c000 CR4: 00000000000007f0
        ffff880119533798 ffffffff811ac868 ffffffff811ac831 ffffffff811ac828
        ffff8801195337c8 ffffffff811ac8c9 ffff8801195339b0 ffff8801197633e0
        0000000000000000 ffff8801195339b0 ffff8801195337d8 ffffffff811ad2d7
       Call Trace:
        [<ffffffff811ac868>] rcu_read_lock+0x37/0x6e
        [<ffffffff811ac831>] ? rcu_read_unlock+0x5f/0x5f
        [<ffffffff811ac828>] ? rcu_read_unlock+0x56/0x5f
        [<ffffffff811ac8c9>] __fget+0x2a/0x7a
        [<ffffffff811ad2d7>] fget+0x13/0x15
        [<ffffffff811be732>] proc_ns_fget+0xe/0x38
        [<ffffffff817c7714>] get_net_ns_by_fd+0x11/0x59
        [<ffffffff817df359>] rtnl_link_get_net+0x33/0x3e
        [<ffffffff817df3d7>] do_setlink+0x73/0x87b
        [<ffffffff810b28ce>] ? trace_hardirqs_off+0xd/0xf
        [<ffffffff81a2aa95>] ? retint_restore_args+0xe/0xe
        [<ffffffff817e0301>] rtnl_newlink+0x40c/0x699
        [<ffffffff817dffe0>] ? rtnl_newlink+0xeb/0x699
        [<ffffffff81a29246>] ? _raw_spin_unlock+0x28/0x33
        [<ffffffff8143ed1e>] ? security_capable+0x18/0x1a
        [<ffffffff8107da51>] ? ns_capable+0x4d/0x65
        [<ffffffff817de5ce>] rtnetlink_rcv_msg+0x181/0x194
        [<ffffffff817de407>] ? rtnl_lock+0x17/0x19
        [<ffffffff817de407>] ? rtnl_lock+0x17/0x19
        [<ffffffff817de44d>] ? __rtnl_unlock+0x17/0x17
        [<ffffffff818327c6>] netlink_rcv_skb+0x4d/0x93
        [<ffffffff817de42f>] rtnetlink_rcv+0x26/0x2d
        [<ffffffff81830f18>] netlink_unicast+0xcb/0x150
        [<ffffffff8183198e>] netlink_sendmsg+0x501/0x523
        [<ffffffff8115cba9>] ? might_fault+0x59/0xa9
        [<ffffffff817b5398>] ? copy_from_user+0x2a/0x2c
        [<ffffffff817b7b74>] sock_sendmsg+0x34/0x3c
        [<ffffffff817b7f6d>] ___sys_sendmsg+0x1b8/0x255
        [<ffffffff8115c5eb>] ? handle_pte_fault+0xbd5/0xd4a
        [<ffffffff8100a2b0>] ? native_sched_clock+0x35/0x37
        [<ffffffff8109e94b>] ? sched_clock_local+0x12/0x72
        [<ffffffff8109eb9c>] ? sched_clock_cpu+0x9e/0xb7
        [<ffffffff810cadbf>] ? rcu_read_lock_held+0x3b/0x3d
        [<ffffffff811ac1d8>] ? __fcheck_files+0x4c/0x58
        [<ffffffff811ac946>] ? __fget_light+0x2d/0x52
        [<ffffffff817b8adc>] __sys_sendmsg+0x42/0x60
        [<ffffffff817b8b0c>] SyS_sendmsg+0x12/0x1c
        [<ffffffff81a29e32>] system_call_fastpath+0x12/0x17
      Fixes: e7ed828f ("netlink: support setting devgroup parameters")
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
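The lockup comes from following a device's `next` pointer after the loop body has moved that device onto another list. The toy model below mimics a circular doubly linked list with a sentinel head (its `name` is `None`). It is only an illustration of the safe-iteration pattern, not kernel code, and all names in it are hypothetical; the unsafe walk is capped with `max_steps` so that the demonstration terminates.

```python
class Node:
    def __init__(self, name=None):
        self.name = name
        self.prev = self.next = self  # unlinked node or empty list head


def list_add_tail(node, head):
    node.prev, node.next = head.prev, head
    head.prev.next = node
    head.prev = node


def list_move_tail(node, head):
    node.prev.next, node.next.prev = node.next, node.prev  # unlink
    list_add_tail(node, head)


def names(head):
    out, node = [], head.next
    while node is not head:
        out.append(node.name)
        node = node.next
    return out


def visit_unsafe(head, body, max_steps=100):
    """Read node.next after the body ran; False means head was never reached."""
    node, steps = head.next, 0
    while node is not head:
        if steps == max_steps:
            return False
        body(node)
        node = node.next
        steps += 1
    return True


def visit_safe(head, body):
    """Remember the successor before the body may move the node away."""
    node = head.next
    while node is not head:
        nxt = node.next
        body(node)
        node = nxt


def demo():
    def build():
        src, dst = Node(), Node()
        for name in ("dev0", "dev1"):
            list_add_tail(Node(name), src)
        return src, dst

    def mover(dst):
        # Move real devices only; the sentinel head has no name.
        return lambda n: list_move_tail(n, dst) if n.name is not None else None

    src, dst = build()
    unsafe_finished = visit_unsafe(src, mover(dst))
    src, dst = build()
    visit_safe(src, mover(dst))
    return unsafe_finished, names(src), names(dst)
```

In the unsafe walk the cursor ends up circling the destination list and never sees the source head again, which corresponds to the soft lockup in the trace above.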
      net: do not use rcu in rtnl_dump_ifinfo() · cac5e65e
      Eric Dumazet authored
      We did a failed attempt in the past to only use rcu in rtnl dump
      operations (commit e67f88dd "net: dont hold rtnl mutex during
      netlink dump callbacks").
      Now that dumps are holding RTNL anyway, there is no need to also
      use rcu locking, as it forbids any scheduling ability, like
      GFP_KERNEL allocations that controlling path should use instead
      of GFP_ATOMIC whenever possible.
      This should fix the following splat Cong Wang reported:
       [ INFO: suspicious RCU usage. ]
       3.19.0+ #805 Tainted: G        W
       include/linux/rcupdate.h:538 Illegal context switch in RCU read-side critical section!
       other info that might help us debug this:
       rcu_scheduler_active = 1, debug_locks = 0
       2 locks held by ip/771:
        #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8182b8f4>] netlink_dump+0x21/0x26c
        #1:  (rcu_read_lock){......}, at: [<ffffffff817d785b>] rcu_read_lock+0x0/0x6e
       stack backtrace:
       CPU: 3 PID: 771 Comm: ip Tainted: G        W       3.19.0+ #805
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        0000000000000001 ffff8800d51e7718 ffffffff81a27457 0000000029e729e6
        ffff8800d6108000 ffff8800d51e7748 ffffffff810b539b ffffffff820013dd
        00000000000001c8 0000000000000000 ffff8800d7448088 ffff8800d51e7758
       Call Trace:
        [<ffffffff81a27457>] dump_stack+0x4c/0x65
        [<ffffffff810b539b>] lockdep_rcu_suspicious+0x107/0x110
        [<ffffffff8109796f>] rcu_preempt_sleep_check+0x45/0x47
        [<ffffffff8109e457>] ___might_sleep+0x1d/0x1cb
        [<ffffffff8109e67d>] __might_sleep+0x78/0x80
        [<ffffffff814b9b1f>] idr_alloc+0x45/0xd1
        [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
        [<ffffffff814b9f9d>] ? idr_for_each+0x53/0x101
        [<ffffffff817c1383>] alloc_netid+0x61/0x69
        [<ffffffff817c14c3>] __peernet2id+0x79/0x8d
        [<ffffffff817c1ab7>] peernet2id+0x13/0x1f
        [<ffffffff817d8673>] rtnl_fill_ifinfo+0xa8d/0xc20
        [<ffffffff810b17d9>] ? __lock_is_held+0x39/0x52
        [<ffffffff817d894f>] rtnl_dump_ifinfo+0x149/0x213
        [<ffffffff8182b9c2>] netlink_dump+0xef/0x26c
        [<ffffffff8182bcba>] netlink_recvmsg+0x17b/0x2c5
        [<ffffffff817b0adc>] __sock_recvmsg+0x4e/0x59
        [<ffffffff817b1b40>] sock_recvmsg+0x3f/0x51
        [<ffffffff817b1f9a>] ___sys_recvmsg+0xf6/0x1d9
        [<ffffffff8115dc67>] ? handle_pte_fault+0x6e1/0xd3d
        [<ffffffff8100a3a0>] ? native_sched_clock+0x35/0x37
        [<ffffffff8109f45b>] ? sched_clock_local+0x12/0x72
        [<ffffffff8109f6ac>] ? sched_clock_cpu+0x9e/0xb7
        [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
        [<ffffffff811abde8>] ? __fcheck_files+0x4c/0x58
        [<ffffffff811ac556>] ? __fget_light+0x2d/0x52
        [<ffffffff817b376f>] __sys_recvmsg+0x42/0x60
        [<ffffffff817b379f>] SyS_recvmsg+0x12/0x1c
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 0c7aecd4 ("netns: add rtnl cmd to add and get peer netns ids")
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      net: Verify permission to link_net in newlink · 06615bed
      Eric W. Biederman authored
      When applicable, verify that the caller has permission to the underlying
      network namespace for a newly created network device.
      Similar checks exist for the network namespace a network device will
      be created in.
      Fixes: 317f4810 ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      net: Verify permission to dest_net in newlink · 505ce415
      Eric W. Biederman authored
      When applicable, verify that the caller has permission to create a
      network device in another network namespace.  This check is already
      present when moving a network device between network namespaces in
      setlink so all that is needed is to duplicate that check in newlink.
      This change almost backports cleanly, but there are context conflicts
      as the code that follows was added in v4.0-rc1.
      Fixes: b51642f6 ("net: Enable a userns root rtnl calls that are safe for unprivilged users")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY · 364d5716
      Daniel Borkmann authored
      ifla_vf_policy[] is wrong in advertising its individual member types as
      NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
      len member as *max* attribute length [0, len].
      The issue is that when do_setvfinfo() is being called to set up a VF
      through ndo handler, we could set corrupted data if the attribute length
      is less than the size of the related structure itself.
      The intent is exactly the opposite, namely to make sure to pass at least
      data of minimum size of len.
      Fixes: ebc08a6f ("rtnetlink: Add VF config code to rtnetlink")
      Cc: Mitch Williams <mitch.a.williams@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
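The two length semantics contrasted in this message can be put side by side. The predicates below are a hypothetical model of the checks, not the kernel's attribute validation code; the 36-byte figure in the usage note assumes a structure made of a 4-byte VF index followed by a 32-byte address buffer.

```python
def accepts_as_binary(attr_len: int, policy_len: int) -> bool:
    """NLA_BINARY with .len set: the payload may be anywhere in [0, len]."""
    return 0 <= attr_len <= policy_len


def accepts_as_min_len(attr_len: int, policy_len: int) -> bool:
    """Intended check: the payload must carry at least policy_len bytes."""
    return attr_len >= policy_len
```

For a 36-byte structure, a 4-byte payload satisfies `accepts_as_binary(4, 36)` but not `accepts_as_min_len(4, 36)`, so only the second check stops a handler from reading past the data the sender actually supplied.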
      rtnetlink: pass link_net to the newlink handler · 7b4ce694
      Nicolas Dichtel authored
      When IFLA_LINK_NETNSID is used, the netdevice should be built in this link netns
      and moved at the end to another netns (pointed by the socket netns or
      Existing users of the newlink handler will use the netns argument (src_net) to
      find a link netdevice or to check some other information in the link netns.
      For example, to find a netdevice, two pieces of information are required: an
      ifindex (usually from IFLA_LINK) and a netns (this link netns).
      Note: when using IFLA_LINK_NETNSID and IFLA_NET_NS_[PID|FD], a user may create a
      netdevice that stands in netnsX and with its link part in netnsY, by sending a
      rtnl message from netnsZ.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>