1. 12 Mar, 2015 1 commit
    • Eric W. Biederman's avatar
      net: Introduce possible_net_t · 0c5c9fb5
      Eric W. Biederman authored
      Having to say
      > #ifdef CONFIG_NET_NS
      > 	struct net *net;
      > #endif
      in structures is a little bit wordy and a little bit error prone.
      Instead it is possible to say:
      > typedef struct {
      > #ifdef CONFIG_NET_NS
      >       struct net *net;
      > #endif
      > } possible_net_t;
      And then in a header say:
      > 	possible_net_t net;
      Which is cleaner and easier to use and easier to test, as the
      possible_net_t is always there no matter what the compile options.
      Further this allows read_pnet and write_pnet to be functions in all
      cases which is better at catching typos.
      This change adds possible_net_t, updates the definitions of read_pnet
      and write_pnet, updates optional struct net * variables that
      write_pnet uses on to have the type possible_net_t, and finally fixes
      up the b0rked users of read_pnet and write_pnet.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  2. 21 Aug, 2014 1 commit
  3. 16 Jan, 2014 1 commit
    • Daniel Borkmann's avatar
      packet: use percpu mmap tx frame pending refcount · b0138408
      Daniel Borkmann authored
      In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
      and one atomic_dec() call in skb destructor and use a percpu
      reference count instead in order to determine if packets are
      still pending to be sent out. Micro-benchmark with [1] that has
      been slightly modified (that is, protcol = 0 in socket(2) and
      bind(2)), example on a rather crappy testing machine; I expect
      it to scale and have even better results on bigger machines:
      ./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:
      With patch:    4,022,015 cyc
      Without patch: 4,812,994 cyc
      time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:
      With patch:
        real         1m32.241s
        user         0m0.287s
        sys          1m29.316s
      Without patch:
        real         1m38.386s
        user         0m0.265s
        sys          1m35.572s
      In function tpacket_snd(), it is okay to use packet_read_pending()
      since in fast-path we short-circuit the condition already with
      ph != NULL, since we have next frames to process. In case we have
      MSG_DONTWAIT, we also do not execute this path as need_wait is
      false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
      okay to call a packet_read_pending(), because when we ever reach
      that path, we're done processing outgoing frames anyway and only
      look if there are skbs still outstanding to be orphaned. We can
      stay lockless in this percpu counter since it's acceptable when we
      reach this path for the sum to be imprecise first, but we'll level
      out at 0 after all pending frames have reached the skb destructor
      eventually through tx reclaim. When people pin a tx process to
      particular CPUs, we expect overflows to happen in the reference
      counter as on one CPU we expect heavy increase; and distributed
      through ksoftirqd on all CPUs a decrease, for example. As
      David Laight points out, since the C language doesn't define the
      result of signed int overflow (i.e. rather than wrap, it is
      allowed to saturate as a possible outcome), we have to use
      unsigned int as reference count. The sum over all CPUs when tx
      is complete will result in 0 again.
      The BUG_ON() in tpacket_destruct_skb() we can remove as well. It
      can _only_ be set from inside tpacket_snd() path and we made sure
      to increase tx_ring.pending in any case before we called po->xmit(skb).
      So testing for tx_ring.pending == 0 is not too useful. Instead, it
      would rather have been useful to test if lower layers didn't orphan
      the skb so that we're missing ring slots being put back to
      TP_STATUS_AVAILABLE. But such a bug will be caught in user space
      already as we end up realizing that we do not have any
      TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.
      Btw, in case of RX_RING path, we do not make use of the pending
      member, therefore we also don't need to use up any percpu memory
      here. Also note that __alloc_percpu() already returns a zero-filled
      percpu area, so initialization is done already.
        [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmapSigned-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  4. 09 Dec, 2013 1 commit
    • Daniel Borkmann's avatar
      packet: introduce PACKET_QDISC_BYPASS socket option · d346a3fa
      Daniel Borkmann authored
      This patch introduces a PACKET_QDISC_BYPASS socket option, that
      allows for using a similar xmit() function as in pktgen instead
      of taking the dev_queue_xmit() path. This can be very useful when
      PF_PACKET applications are required to be used in a similar
      scenario as pktgen, but with full, flexible packet payload that
      needs to be provided, for example.
      On default, nothing changes in behaviour for normal PF_PACKET
      TX users, so everything stays as is for applications. New users,
      however, can now set PACKET_QDISC_BYPASS if needed to prevent
      own packets from i) reentering packet_rcv() and ii) to directly
      push the frame to the driver.
      In doing so we can increase pps (here 64 byte packets) for
      PF_PACKET a bit:
        # CPUs -- QDISC_BYPASS   -- qdisc path -- qdisc path[**]
        1 CPU  ==  1,509,628 pps --  1,208,708 --  1,247,436
        2 CPUs ==  3,198,659 pps --  2,536,012 --  1,605,779
        3 CPUs ==  4,787,992 pps --  3,788,740 --  1,735,610
        4 CPUs ==  6,173,956 pps --  4,907,799 --  1,909,114
        5 CPUs ==  7,495,676 pps --  5,956,499 --  2,014,422
        6 CPUs ==  9,001,496 pps --  7,145,064 --  2,155,261
        7 CPUs == 10,229,776 pps --  8,190,596 --  2,220,619
        8 CPUs == 11,040,732 pps --  9,188,544 --  2,241,879
        9 CPUs == 12,009,076 pps -- 10,275,936 --  2,068,447
       10 CPUs == 11,380,052 pps -- 11,265,337 --  1,578,689
       11 CPUs == 11,672,676 pps -- 11,845,344 --  1,297,412
       20 CPUs == 11,363,192 pps -- 11,014,933 --  1,245,081
       [**]: qdisc path with packet_rcv(), how probably most people
             seem to use it (hopefully not anymore if not needed)
      The test was done using a modified trafgen, sending a simple
      static 64 bytes packet, on all CPUs.  The trick in the fast
      "qdisc path" case, is to avoid reentering packet_rcv() by
      setting the RAW socket protocol to zero, like:
      socket(PF_PACKET, SOCK_RAW, 0);
      Tradeoffs are documented as well in this patch, clearly, if
      queues are busy, we will drop more packets, tc disciplines are
      ignored, and these packets are not visible to taps anymore. For
      a pktgen like scenario, we argue that this is acceptable.
      The pointer to the xmit function has been placed in packet
      socket structure hole between cached_dev and prot_hook that
      is hot anyway as we're working on cached_dev in each send path.
      Done in joint work together with Jesper Dangaard Brouer.
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  5. 21 Nov, 2013 1 commit
    • Daniel Borkmann's avatar
      packet: fix use after free race in send path when dev is released · e40526cb
      Daniel Borkmann authored
      Salam reported a use after free bug in PF_PACKET that occurs when
      we're sending out frames on a socket bound device and suddenly the
      net device is being unregistered. It appears that commit 827d9780
      introduced a possible race condition between {t,}packet_snd() and
      packet_notifier(). In the case of a bound socket, packet_notifier()
      can drop the last reference to the net_device and {t,}packet_snd()
      might end up suddenly sending a packet over a freed net_device.
      To avoid reverting 827d9780 and thus introducing a performance
      regression compared to the current state of things, we decided to
      hold a cached RCU protected pointer to the net device and maintain
      it on write side via bind spin_lock protected register_prot_hook()
      and __unregister_prot_hook() calls.
      In {t,}packet_snd() path, we access this pointer under rcu_read_lock
      through packet_cached_dev_get() that holds reference to the device
      to prevent it from being freed through packet_notifier() while
      we're in send path. This is okay to do as dev_put()/dev_hold() are
      per-cpu counters, so this should not be a performance issue. Also,
      the code simplifies a bit as we don't need need_rls_dev anymore.
      Fixes: 827d9780 ("af-packet: Use existing netdev reference for bound sockets.")
      Reported-by: default avatarSalam Noureddine <noureddine@aristanetworks.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarSalam Noureddine <noureddine@aristanetworks.com>
      Cc: Ben Greear <greearb@candelatech.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  6. 24 Apr, 2013 2 commits
    • Daniel Borkmann's avatar
      packet: account statistics only in tpacket_stats_u · ee80fbf3
      Daniel Borkmann authored
      Currently, packet_sock has a struct tpacket_stats stats member for
      TPACKET_V1 and TPACKET_V2 statistic accounting, and with TPACKET_V3
      ``union tpacket_stats_u stats_u'' was introduced, where however only
      statistics for TPACKET_V3 are held, and when copied to user space,
      TPACKET_V3 does some hackery and access also tpacket_stats' stats,
      although everything could have been done within the union itself.
      Unify accounting within the tpacket_stats_u union so that we can
      remove 8 bytes from packet_sock that are there unnecessary. Note that
      even if we switch to TPACKET_V3 and would use non mmap(2)ed option,
      this still works due to the union with same types + offsets, that are
      exposed to the user space.
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Daniel Borkmann's avatar
      packet: reorder a member in packet_ring_buffer · 0578edc5
      Daniel Borkmann authored
      There's a 4 byte hole in packet_ring_buffer structure before
      prb_bdqc, that can be filled with 'pending' member, thus we can
      reduce the overall structure size from 224 bytes to 216 bytes.
      This also has the side-effect, that in struct packet_sock 2*4 byte
      holes after the embedded packet_ring_buffer members are removed,
      and overall, packet_sock can be reduced by 1 cacheline:
      Before: size: 1344, cachelines: 21, members: 24
      After:  size: 1280, cachelines: 20, members: 24
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  7. 19 Mar, 2013 1 commit
    • Willem de Bruijn's avatar
      packet: packet fanout rollover during socket overload · 77f65ebd
      Willem de Bruijn authored
        v3->v2: rebase (no other changes)
                passes selftest
        v2->v1: read f->num_members only once
                fix bug: test rollover mode + flag
      Minimize packet drop in a fanout group. If one socket is full,
      roll over packets to another from the group. Maintain flow
      affinity during normal load using an rxhash fanout policy, while
      dispersing unexpected traffic storms that hit a single cpu, such
      as spoofed-source DoS flows. Rollover breaks affinity for flows
      arriving at saturated sockets during those conditions.
      The patch adds a fanout policy ROLLOVER that rotates between sockets,
      filling each socket before moving to the next. It also adds a fanout
      flag ROLLOVER. If passed along with any other fanout policy, the
      primary policy is applied until the chosen socket is full. Then,
      rollover selects another socket, to delay packet drop until the
      entire system is saturated.
      Probing sockets is not free. Selecting the last used socket, as
      rollover does, is a greedy approach that maximizes chance of
      success, at the cost of extreme load imbalance. In practice, with
      sufficiently long queues to absorb bursts, sockets are drained in
      parallel and load balance looks uniform in `top`.
      To avoid contention, scales counters with number of sockets and
      accesses them lockfree. Values are bounds checked to ensure
      Tested using an application with 9 threads pinned to CPUs, one socket
      per thread and sufficient busywork per packet operation to limits each
      thread to handling 32 Kpps. When sent 500 Kpps single UDP stream
      packets, a FANOUT_CPU setup processes 32 Kpps in total without this
      patch, 270 Kpps with the patch. Tested with read() and with a packet
      ring (V1).
      Also, passes psock_fanout.c unit test added to selftests.
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  8. 07 Nov, 2012 1 commit
  9. 20 Aug, 2012 1 commit
  10. 14 Aug, 2012 1 commit