1. 19 Apr, 2013 24 commits
    • Rajesh Borundia's avatar
      qlcnic: Change 82xx adapter VLAN id endian type. · f80bc8fe
      Rajesh Borundia authored
      
      
      o 82xx adapter requires VLAN id in little endian format.
        Instead of passing vlan id parameter as __le16, pass the
        parameter as u16 and use cpu_to_le16 at appropriate places.
      Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f80bc8fe
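The convention described in this commit (keep the VLAN id in CPU byte order in the driver, convert only where the value is handed to the adapter) can be sketched in userspace C. This is a minimal illustration, not the qlcnic code: `struct fw_vlan_cmd`, `fill_vlan_cmd()` and `first_wire_byte()` are hypothetical names, and glibc's `htole16()` stands in for the kernel's `cpu_to_le16()`.

```c
#define _DEFAULT_SOURCE   /* expose htole16()/le16toh() in <endian.h> on glibc */
#include <assert.h>
#include <endian.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical firmware command layout: the device expects little-endian fields. */
struct fw_vlan_cmd {
    uint16_t vlan_id_le;   /* stored in little-endian byte order */
    uint16_t flags_le;
};

/* Callers pass the VLAN id in CPU byte order; the conversion happens only here. */
static void fill_vlan_cmd(struct fw_vlan_cmd *cmd, uint16_t vlan_id)
{
    assert(vlan_id < 4096);            /* VLAN ids are 12 bits wide */
    memset(cmd, 0, sizeof(*cmd));
    cmd->vlan_id_le = htole16(vlan_id);
}

/* Return the first byte as laid out in memory, independent of host endianness. */
static uint8_t first_wire_byte(const struct fw_vlan_cmd *cmd)
{
    uint8_t b;
    memcpy(&b, &cmd->vlan_id_le, 1);
    return b;
}
```

With this shape, every caller works with a plain `uint16_t`, and the byte-order conversion lives in one place instead of being spread across call sites.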
    • David S. Miller's avatar
      Merge branch 'netlink-mmap' · 42bbcb78
      David S. Miller authored
      Patrick McHardy says:
      
      ====================
      The following patches contain an implementation of memory mapped I/O for
      netlink. The implementation is modelled after AF_PACKET memory mapped I/O
      with a few differences:
      
      - In order to perform memory mapped I/O to userspace, the kernel allocates
        skbs with the data area pointing to the data area of the mapped frames.
        All netlink subsystems assume a linear data area, so for the sake of
        simplicity, the mapped data area is not attached to the paged area but
        to skb->data. This requires introduction of a special skb allocation
        function that just allocates an skb head without the data area. Since this
        is a quite rare use case, I introduced a new function based on __alloc_skb
        instead of splitting it up into head and data allocation. The alternative
        would be to introduce an __alloc_skb_head and __alloc_skb_data function,
        which would actually be useful for a specific error case in memory mapped
        netlink, but would require a couple of extra instructions for the common
        skb allocation case, so it doesn't really seem worth it.
      
        In order to get the destination memory area for skb->data before message
        construction, memory mapped netlink I/O needs to look up the destination
        socket during allocation instead of during transmission because the
        ring is owned by the receiving socket/process. A special skb allocation
        function (netlink_alloc_skb) taking the destination pid as an argument is
        used for this. All subsystems that want to support memory mapped I/O need
        to use this function; automatic fallback to the receive queue happens
        for unconverted subsystems. Dumps automatically use memory mapped I/O if
        the receiving socket has enabled it.
      
        The visible effect of looking up the destination socket during allocation
        instead of transmission is that message ordering in userspace might
        change in case allocation and transmission aren't performed atomically.
        This usually doesn't matter since most subsystems have a BKL-like lock
        like the rtnl mutex. To my knowledge the only currently existing case
        where it might matter is nfnetlink_queue combined with the recently
        introduced batched verdicts, but a) that subsystem already includes
        sequence numbers which allow userspace to reorder messages in case it
        cares to, and the reordering window is quite small, and b) with memory
        mapped transmission batching can be performed in a subsystem independent
        manner.
      
      - AF_NETLINK contains flow control for database dumps. With regular I/O,
        dump continuations are triggered based on the socket's receive queue space
        and by recvmsg() calls. Since with memory mapped I/O there are no
        recvmsg() calls under normal operation, this is done in netlink_poll(),
        under the assumption that userspace has processed all pending frames
        before invoking poll(), thus the ring is expected to have room for new
        messages. Dumps currently don't benefit as much as they could from
        memory mapped I/O because each single continuation requires a poll()
        call. A more aggressive approach seems like a good idea to me, especially
        in case the socket is not subscribed to any multicast groups (IOW only
        receiving explicitly requested data).
      
      Besides that, the memory mapped netlink implementation extends the states
      defined by AF_PACKET between userspace and the kernel by a SKIP status, this
      is intended for the case that userspace wants to queue frames (specifically
      when using nfnetlink_queue, an IDS and stream reassembly, requested by
      Eric Leblond) for a longer period of time. The kernel skips over all frames
      marked with SKIP when looking for unused frames and only fails when not finding
      a free frame or when having skipped the entire ring.
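The SKIP handling described above can be sketched as a ring scan. This is one possible reading of the paragraph, not the kernel implementation: the enum values and `find_unused_frame()` are hypothetical names, and the sketch assumes that the first non-SKIP frame after the head must be unused for the lookup to succeed.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative frame states; the names are assumptions, not kernel identifiers. */
enum frame_status { FRAME_UNUSED, FRAME_VALID, FRAME_SKIP };

/*
 * Starting at head, step over SKIP frames and return the index of the
 * first frame that is not SKIP, provided it is UNUSED. Return -1 if that
 * frame is still in use or if every frame in the ring is marked SKIP.
 */
static int find_unused_frame(const enum frame_status *ring, size_t nframes,
                             size_t head)
{
    assert(nframes > 0 && head < nframes);
    for (size_t n = 0; n < nframes; n++) {
        size_t idx = (head + n) % nframes;

        if (ring[idx] == FRAME_SKIP)
            continue;               /* userspace is holding this frame */
        return ring[idx] == FRAME_UNUSED ? (int)idx : -1;
    }
    return -1;                      /* skipped the entire ring */
}
```

Bounding the loop by the number of frames is what turns "skipped the entire ring" into a clean failure instead of an endless scan.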
      
      Also noteworthy is memory mapped sendmsg: the kernel performs validation
      of messages before accepting and processing them. In order to prevent
      userspace from changing the message contents after validation, the
      kernel checks that the ring is only mapped once and the file descriptor
      is not shared (in order to avoid having userspace set up another mapping
      after the first mentioned check). If either condition does not hold, the
      message is copied to an allocated skb and processed as with regular I/O.
      I'd especially appreciate review of this part since I'm not really versed
      in memory, file and process management.
      
      The remaining interesting details are included in the changelogs of the
      individual patches and the documentation, so I won't repeat them here.
      
      As an example, nfnetlink_queue is converted to support memory mapped
      I/O. Other subsystems that would probably benefit are nfnetlink_log,
      audit and maybe ISCSI, not sure.
      
      Following are some numbers collected by Florian Westphal based on a
      slightly older version, which included an experimental patch for the
      nfnetlink_queue ordering issue.
      
      ===
      
      Test hardware is a 12-core machine
      Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
      ixgbe interfaces are used (i.e., multiqueue nics).
      irqs are distributed across the cpus.
      
      I've made several tests.
      
      The simple one consists of 3GBit UDP traffic, packets are 1500 bytes
      in size (i.e., no fragmentation), with a single nfqueue
      and the test client programs in libmnl examples directory.
      Packets are sent from one /24 net to another /24 net, i.e.
      there are a few hundred flows active at any given time.
      
      I've also tested with snort, but I disabled all rules.
      6Gbit UDP traffic is generated in the snort case, and
      6 nfqueues are used (i.e., 6 snorts run in parallel).
      
      I've tested with 3 different kernels, all based on 3.7.1.
      - 3.7.1, without the mmap patches
      - 3.7.1, with Patrick's mmap patches
      - 3.7.1, with mmap patches and extended spinlock to ensure packet ids are
        monotonically increasing and cannot be re-ordered.  This is what we
        currently ship in our product.
      
        [ the spinlock that is extended is the per nfqueue spinlock, it will
          be held from the time the netlink skb is allocated until the netlink
          skb is sent to userspace:
      
          http://1984.lsi.us.es/git/nf-next/commit/?h=mmap-netlink3&id=b8eb19c46650fef4e9e4fe53f367f99bbf72afc9
      
      
        ]
      
      snort is normally used in "batch mode", i.e., after processing 25 packets
      a single "batch verdict" is sent to accept the packets seen so far.
      "mmap snort" means RX_RING + sendmsg(), i.e. TX_RING is not used at this
      time (except where noted below).
      
      One reason is that snort has a reload thread, so kernel needs to copy;
      also in the snort case no payload rewrite takes place, so compared
      to the rx path the tx path is cheap.
      
      Results:
      
      3.7.1, without mmap patches, i.e. recv()+sendmsg() for everyone
      nfq-queue:           1.7 gbit out
      snort-recv-batch-25  5.1 gbit out
      snort-recv-no-batch  3.1 gbit out
      
      3.7.1 + mmap + without extended spinlocked section
      nfq-queue:           1.7 gbit out (recv/sendmsg)
      nfq-queue-mmap:      2.4 gbit out
      snort-mmap-batch-25  5.6 gbit out (warning: since ids can be
                                         re-ordered, this version is "broken").
      snort-recv-batch-25  5.1 gbit out
      snort-mmap-no-batch  4.6 gbit out (i.e., one verdict per packet)
      
      Kernel 3.7.1 + mmap + extended spinlock section:
      nfq-queue:           1.4 gbit out
      nfq-queue-mmap:      2.3 gbit out
      snort:               5.6 gbit out
      
      Conclusions:
      - The "extended spinlocked section" hurts performance in the
        single queue case; with 6 snorts there is no measurable slowdown.
      - I tried to re-write the mmap-snort to work without batch verdicts, but
        results were not very encouraging:
      
      kernel 3.7.1 + mmap (without extended spinlocked section):
      
      snort-mmap-batch-25      5.6 gbit out (what we currently ship)
      snort-recv-batch-25      5.1 gbit out (without using mmap)
      snort-mmap-batch-1       4.6 gbit out (with mmap but without batch verdicts)
      snort-mmap-txring-25     5.2 gbit out (with mmap but without batch verdicts)
      snort-mmap-txring-1      4.6 gbit out (with mmap but without batch verdicts)
      
      The difference between the last two is that in the txring-25 case, we
      put a verdict into the tx ring after every packet, but will only
      invoke sendmsg(, NULL, 0) after processing 25 packets.  So the only
      difference is the number of sendmsg calls/context switches.
      
      So, i.o.w, kernel 3.7.1 + mmap + the extra locking crap is faster
      than 3.7.1 + mmap-without-extra-locking and single-verdict-per packet.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      42bbcb78
    • Patrick McHardy's avatar
      netfilter: rename netlink related "pid" variables to "portid" · ec464e5d
      Patrick McHardy authored
      
      
      Get rid of the confusing mix of pid and portid and use portid consistently
      for all netlink related socket identities.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec464e5d
    • Patrick McHardy's avatar
      netlink: add RX/TX-ring support to netlink diag · 4ae9fbee
      Patrick McHardy authored
      
      
      Based on AF_PACKET.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ae9fbee
    • Patrick McHardy's avatar
      netlink: add flow control for memory mapped I/O · cd1df525
      Patrick McHardy authored
      
      
      Add flow control for memory mapped RX. Since user-space usually doesn't
      invoke recvmsg() when using memory mapped I/O, flow control is performed
      in netlink_poll(). Dumps are allowed to continue if at least half of the
      ring frames are unused.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd1df525
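The continuation rule stated in this commit ("at least half of the ring frames are unused") can be written as a small predicate. This is a sketch under assumptions: `dump_may_continue()` is a hypothetical name, and how the kernel rounds for odd ring sizes is not stated in the changelog, so the comparison below avoids division altogether.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when at least half of the frames in the RX ring are unused. */
static bool dump_may_continue(size_t unused_frames, size_t total_frames)
{
    assert(unused_frames <= total_frames);
    /* Compare the doubled count so odd ring sizes need no rounding decision. */
    return total_frames > 0 && 2 * unused_frames >= total_frames;
}
```

A poll handler modelled on this description would call such a check before resuming a pending dump, so that the dump only continues once userspace has released enough frames.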
    • Patrick McHardy's avatar
      netlink: implement memory mapped recvmsg() · f9c22888
      Patrick McHardy authored
      
      
      Add support for mmap'ed recvmsg(). To allow the kernel to construct messages
      into the mapped area, a dataless skb is allocated and the data pointer is
      set to point into the ring frame. This means frames will be delivered to
      userspace in order of allocation instead of order of transmission. This
      usually doesn't matter since the order is either not determinable by
      userspace or message creation/transmission is serialized. The only case
      where this can have a visible difference is nfnetlink_queue. Userspace
      can't assume mmap'ed messages have ordered IDs anymore and needs to check
      this if using batched verdicts.
      
      For non-mapped sockets, nothing changes.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f9c22888
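The check that userspace is asked to perform before using batched verdicts might look like the following sketch. It assumes that a batch verdict covers every queued packet with an id up to the one given, and that ids run ahead of each other by fewer than 64; `struct id_tracker`, `tracker_seen()` and `tracker_batch_limit()` are hypothetical names, not part of any netfilter library.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW 64   /* assumed upper bound on how far ids can arrive ahead */

struct id_tracker {
    uint32_t next;      /* lowest packet id not yet seen */
    uint64_t ahead;     /* bit i set: id (next + i) has been seen */
};

/* Record a received packet id; returns 0, or -1 if it lies outside the window. */
static int tracker_seen(struct id_tracker *t, uint32_t id)
{
    uint32_t off = id - t->next;        /* unsigned wrap-around is intended */

    assert(t);
    if (off >= WINDOW)
        return -1;                      /* too far ahead, or already completed */
    t->ahead |= UINT64_C(1) << off;
    /* Slide the window over every id that is now contiguous. */
    while (t->ahead & 1) {
        t->ahead >>= 1;
        t->next++;
    }
    return 0;
}

/* Highest id a batch verdict may safely cover (meaningful once any id completed). */
static uint32_t tracker_batch_limit(const struct id_tracker *t)
{
    assert(t);
    return t->next - 1;
}
```

If ids 1 and 3 have arrived but 2 has not, the limit stays at 1, so a batch verdict cannot accidentally cover a packet that userspace has not inspected yet.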
    • Patrick McHardy's avatar
      netlink: implement memory mapped sendmsg() · 5fd96123
      Patrick McHardy authored
      
      
      Add support for mmap'ed sendmsg() to netlink. Since the kernel validates
      received messages before processing them, the code makes sure userspace
      can't modify the message contents after invoking sendmsg(). To do that
      only a single mapping of the TX ring is allowed to exist and the socket
      must not be shared. If either of these two conditions does not hold, it
      falls back to copying.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5fd96123
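The two conditions named in this commit reduce to a simple decision. The sketch below only models that decision; `choose_tx_path()` and the enum are hypothetical names, and how the kernel actually counts mappings and detects a shared socket is not shown in the changelog.

```c
#include <assert.h>
#include <stdbool.h>

enum tx_path {
    TX_FROM_RING,     /* process the message directly from the mapped frame */
    TX_COPY_TO_SKB    /* copy the message first, as with regular I/O */
};

/* Use the ring directly only if it is mapped exactly once and the socket is not shared. */
static enum tx_path choose_tx_path(unsigned int ring_mappings, bool socket_shared)
{
    assert(ring_mappings >= 1);
    if (ring_mappings == 1 && !socket_shared)
        return TX_FROM_RING;
    return TX_COPY_TO_SKB;
}
```

Falling back to a copy keeps the validation guarantee intact: once copied, the message can no longer be modified by another mapping or another holder of the socket.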
    • Patrick McHardy's avatar
      netlink: add mmap'ed netlink helper functions · 9652e931
      Patrick McHardy authored
      
      
      Add helper functions for looking up mmap'ed frame headers, reading and
      writing their status, allocating skbs with mmap'ed data areas and a poll
      function.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9652e931
    • Patrick McHardy's avatar
      netlink: mmaped netlink: ring setup · ccdfcc39
      Patrick McHardy authored
      
      
      Add support for mmap'ed RX and TX ring setup and teardown based on the
      af_packet.c code. The following patches will use this to add the real
      mmap'ed receive and transmit functionality.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ccdfcc39
    • Patrick McHardy's avatar
      netlink: add netlink_skb_set_owner_r() · cf0a018a
      Patrick McHardy authored
      
      
      For mmap'ed I/O a netlink specific skb destructor needs to be invoked
      after the final kfree_skb() to clean up state. This doesn't work currently
      since the skb's ownership is transferred to the receiving socket using
      skb_set_owner_r(), which orphans the skb, thereby invoking the destructor
      prematurely.
      
      Since netlink doesn't account skbs to the originating socket, there's no
      need to orphan the skb. Add a netlink specific skb_set_owner_r() variant
      that does not orphan the skb and use a netlink specific destructor to
      call sock_rfree().
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf0a018a
    • Patrick McHardy's avatar
      netlink: don't orphan skb in netlink_trim() · 1298ca46
      Patrick McHardy authored
      
      
      Netlink doesn't account skbs to the sending socket, so there's no
      need to orphan the skb before trimming it.
      
      Removing the skb_orphan() call is required for mmap'ed netlink, which uses
      a netlink specific skb destructor that must not be invoked before the
      final freeing of the skb.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1298ca46
    • Patrick McHardy's avatar
      net: add function to allocate sk_buff head without data area · 0ebd0ac5
      Patrick McHardy authored
      
      
      Add a function to allocate a sk_buff head without any data. This will
      be used by memory mapped netlink to attach data from the mmaped area
      to the skb.
      
      Additionally change skb_release_all() to check whether the skb has a
      data area to allow the skb destructor to clear the data pointer in case
      only a head has been allocated.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ebd0ac5
    • Patrick McHardy's avatar
      netlink: rename ssk to sk in struct netlink_skb_params · e32123e5
      Patrick McHardy authored
      
      
      Memory mapped netlink needs to store the receiving userspace socket
      when sending from the kernel to userspace. Rename 'ssk' to 'sk' to
      avoid confusion.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e32123e5
    • Patrick McHardy's avatar
      cd967e05
    • David S. Miller's avatar
      Merge branch '8021ad' · 447b816f
      David S. Miller authored
      
      
      Patrick McHardy says:
      
      ====================
      The following patches add support for 802.1ad (provider tagging) to the
      VLAN driver. The patchset consists of the following parts:
      
      - renaming of the NETIF_F_HW_VLAN feature flags to indicate that they only
        operate on CTAGs
      
      - preparation for 802.1ad VLAN filtering offload by adding a proto argument
        to the rx_{add,kill}_vid net_device_ops callbacks
      
      - preparation of the VLAN code to support multiple protocols by making the
        protocol used for tagging a property of the VLAN device and converting
        the device lookup functions accordingly
      
      - second step of preparation of the VLAN code by making the packet tagging
        functions take a protocol argument
      
      - introduction of 802.1ad support in the VLAN code, consisting mainly of
        checking for ETH_P_8021AD in a couple of places and testing the netdevice
        offload feature checks to take the protocol into account
      
      - announcement of STAG offloading capabilities in a couple of drivers for
        virtual network devices
      
      The patchset is based on net-next.git and has been tested with single and
      double tagging with and without HW acceleration (for CTAGs).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      447b816f
    • Patrick McHardy's avatar
      net: vlan: announce STAG offload capability in some drivers · 28d2b136
      Patrick McHardy authored
      
      
      - macvlan: propagate STAG filtering capabilities from underlying device
      - ifb: announce STAG tagging support in addition to CTAG tagging support
      - veth: announce STAG tagging/stripping support in addition to CTAG support
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      28d2b136
    • Patrick McHardy's avatar
      net: vlan: add 802.1ad support · 8ad227ff
      Patrick McHardy authored
      
      
      Add support for 802.1ad VLAN devices. This mainly consists of checking for
      ETH_P_8021AD in addition to ETH_P_8021Q in a couple of places and checking
      offloading capabilities based on the protocol used.
      
      Configuration is done using "ip link":
      
      # ip link add link eth0 eth0.1000 \
      	type vlan proto 802.1ad id 1000
      # ip link add link eth0.1000 eth0.1000.1000 \
      	type vlan proto 802.1q id 1000
      
      52:54:00:12:34:56 > 92:b1:54:28:e4:8c, ethertype 802.1Q (0x8100), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
          20.1.0.2 > 20.1.0.1: ICMP echo request, id 3003, seq 8, length 64
      92:b1:54:28:e4:8c > 52:54:00:12:34:56, ethertype 802.1Q-QinQ (0x88a8), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 47944, offset 0, flags [none], proto ICMP (1), length 84)
          20.1.0.1 > 20.1.0.2: ICMP echo reply, id 3003, seq 8, length 64
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ad227ff
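The stacked configuration above (an 802.1ad device with a nested 802.1Q device, both with id 1000) corresponds to two VLAN tags on the wire. The following sketch builds that tag stack by hand, assuming the standard TPID values 0x88A8 for the service tag and 0x8100 for the customer tag; `put_vlan_tag()` and `put_qinq_tags()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TPID_8021Q  0x8100u   /* customer tag (CTAG) */
#define TPID_8021AD 0x88A8u   /* service tag (STAG) */

/* Write one 4-byte VLAN tag (TPID + TCI) in network byte order. */
static size_t put_vlan_tag(uint8_t *buf, uint16_t tpid, uint16_t vid, uint8_t prio)
{
    assert(vid < 4096 && prio < 8);
    uint16_t tci = (uint16_t)((prio << 13) | vid);   /* DEI bit left at 0 */

    buf[0] = (uint8_t)(tpid >> 8);
    buf[1] = (uint8_t)(tpid & 0xff);
    buf[2] = (uint8_t)(tci >> 8);
    buf[3] = (uint8_t)(tci & 0xff);
    return 4;
}

/* Write the tag stack that follows the two MAC addresses; returns bytes written. */
static size_t put_qinq_tags(uint8_t *buf, uint16_t svid, uint16_t cvid,
                            uint16_t inner_ethertype)
{
    size_t n = 0;

    n += put_vlan_tag(buf + n, TPID_8021AD, svid, 0);   /* outer tag: eth0.1000 */
    n += put_vlan_tag(buf + n, TPID_8021Q, cvid, 0);    /* inner tag: eth0.1000.1000 */
    buf[n++] = (uint8_t)(inner_ethertype >> 8);
    buf[n++] = (uint8_t)(inner_ethertype & 0xff);
    return n;
}
```

Calling `put_qinq_tags(buf, 1000, 1000, 0x0800)` fills 10 bytes: the 802.1ad tag, the 802.1Q tag, and the IPv4 ethertype.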
    • Patrick McHardy's avatar
      net: vlan: add protocol argument to packet tagging functions · 86a9bad3
      Patrick McHardy authored
      
      
      Add a protocol argument to the VLAN packet tagging functions. In case of HW
      tagging, we need that protocol available in the ndo_start_xmit functions,
      so it is stored in a new field in the skb. The new field fits into a hole
      (on 64 bit) and doesn't increase the skb's size.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86a9bad3
    • Patrick McHardy's avatar
      net: vlan: prepare for 802.1ad support · 1fd9b1fc
      Patrick McHardy authored
      
      
      Make the encapsulation protocol value a property of VLAN devices and change
      the device lookup functions to take the protocol value into account.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1fd9b1fc
    • Patrick McHardy's avatar
      net: vlan: prepare for 802.1ad VLAN filtering offload · 80d5c368
      Patrick McHardy authored
      
      
      Change the rx_{add,kill}_vid callbacks to take a protocol argument in
      preparation of 802.1ad support. The protocol argument used so far is
      always htons(ETH_P_8021Q).
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      80d5c368
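The effect of the extra protocol argument can be illustrated with a toy filter table. This is a userspace sketch, not a driver: `struct toy_dev`, `toy_rx_add_vid()` and `toy_vid_is_set()` are hypothetical, and the real callbacks operate on a `struct net_device`. As in the commit, the protocol is passed in network byte order.

```c
#include <arpa/inet.h>   /* htons() */
#include <assert.h>
#include <stdint.h>

#ifndef ETH_P_8021Q
#define ETH_P_8021Q  0x8100   /* 802.1Q customer tag */
#endif
#ifndef ETH_P_8021AD
#define ETH_P_8021AD 0x88A8   /* 802.1ad service tag */
#endif

/* Hypothetical device that can only filter CTAG VLAN ids in hardware. */
struct toy_dev {
    unsigned char ctag_filter[4096 / 8];   /* one bit per VLAN id */
};

/* proto is in network byte order, mirroring the new callback argument. */
static int toy_rx_add_vid(struct toy_dev *dev, uint16_t proto, uint16_t vid)
{
    if (proto != htons(ETH_P_8021Q))
        return -1;                         /* no STAG filtering on this device */
    if (vid >= 4096)
        return -1;
    dev->ctag_filter[vid / 8] |= (unsigned char)(1u << (vid % 8));
    return 0;
}

static int toy_vid_is_set(const struct toy_dev *dev, uint16_t vid)
{
    return vid < 4096 && (dev->ctag_filter[vid / 8] >> (vid % 8)) & 1;
}
```

Before 802.1ad support, every caller would pass `htons(ETH_P_8021Q)`; the argument only starts to matter once a second protocol value can reach the callback.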
    • Patrick McHardy's avatar
      net: vlan: rename NETIF_F_HW_VLAN_* feature flags to NETIF_F_HW_VLAN_CTAG_* · f646968f
      Patrick McHardy authored
      
      
      Rename the hardware VLAN acceleration features to include "CTAG" to indicate
      that they only support CTAGs. Follow up patches will introduce 802.1ad
      service provider tagging (STAGs) and require the distinction for hardware not
      supporting accelerating both.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f646968f
    • David S. Miller's avatar
      Merge branch 'intel' · c2962897
      David S. Miller authored
      
      
      Jeff Kirsher says:
      
      ====================
      This series contains updates to ixgbe and igb.
      
      The ixgbe changes contain 2 patches from the community, one which is a
      fix from akepner to fix an issue where netif_running() in shutdown was
      not done under rtnl_lock.  The other community fix from Joe Perches
      cleans up #ifdef CONFIG_DEBUG_FS which is no longer necessary.  The
      last ixgbe patch, from Jacob Keller, adds support for WoL on 82599
      SFP+ LOM.
      
      The remaining patches are against igb, 10 of which were previously
      submitted in a pull request where changes were requested.
      
      The following igb patches:
       igb: Support for 100base-fx SFP
       igb: Support to read and export SFF-8472/8079 data
      are v2 based on feedback from Dan Carpenter and Ben Hutchings in
      the previous pull request.
      
      The largest set of changes are in my patch to cleanup code comments
      and whitespace to align the igb driver with the networking style of
      code comments.  While cleaning up the code comments, fixed several
      other whitespace/checkpatch.pl code formatting issues.
      
      Other notable igb patches make EEE capable devices query the PHY to
      determine what the link partner is advertising, add support for
      i354 devices and add support for spoofchk config.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c2962897
  2. 18 Apr, 2013 16 commits