1. 30 Dec, 2012 1 commit
    • Daniel Borkmann's avatar
      net: filter: return -EINVAL if BPF_S_ANC* operation is not supported · aa1113d9
      Daniel Borkmann authored
      Currently, we return -EINVAL for malformed or wrong BPF filters.
      However, this is not done for BPF_S_ANC* operations, which makes it
      more difficult to detect if it's actually supported or not by the
      BPF machine. Therefore, we should also return -EINVAL if K is within
      the SKF_AD_OFF universe and the ancillary operation did not match.
      Why exactly is it needed? If tools such as libpcap/tcpdump want to
      make use of new ancillary operations (like filtering VLAN in kernel
      space), there is currently no sane way to test if this feature /
      BPF_S_ANC* op is present or not, since no error is returned. This
      patch will make life easier for that and allow for a proper usage
      for user space applications.
      There was concern, if this patch will break userland. Short answer: Yes
      and no. Long answer: It will "break" only for code that calls ...
        { BPF_LD | BPF_(W|H|B) | BPF_ABS, 0, 0, <K> },
      ... where <K> is in [0xfffff000, 0xffffffff] _and_ <K> is *not* an
      ancillary. And here comes the BUT: assuming some *old* code will have
      such an instruction where <K> is between [0xfffff000, 0xffffffff] and
      it doesn't know ancillary operations, then this will give a
      non-expected / unwanted behavior as well (since we do not return the
      BPF machine with 0 after a failed load_pointer(), which was the case
      before introducing ancillary operations, but load sth. into the
      accumulator instead, and continue with the next instruction, for
      instance). Thus, user space code would already have been broken by
      introducing ancillary operations into the BPF machine per se. Code
      that does such a direct load, e.g. "load word at packet offset
      0xffffffff into accumulator" ("ld [0xffffffff]") is quite broken,
      isn't it? The whole assumption of ancillary operations is that no-one
      intentionally calls things like "ld [0xffffffff]" and expect this
      word to be loaded from such a packet offset. Hence, we can also safely
      make use of this feature testing patch and facilitate application
      development. Therefore, at least from this patch onwards, we have
      *for sure* a check whether current or in future implemented BPF_S_ANC*
      ops are supported in the kernel. Patch was tested on x86_64.
      (Thanks to Eric for the previous review.)
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Reported-by: default avatarAni Sinha <ani@aristanetworks.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  2. 01 Nov, 2012 1 commit
    • Pavel Emelyanov's avatar
      sk-filter: Add ability to get socket filter program (v2) · a8fc9277
      Pavel Emelyanov authored
      The SO_ATTACH_FILTER option is set only. I propose to add the get
      ability by using SO_ATTACH_FILTER in getsockopt. To be less
      irritating to eyes the SO_GET_FILTER alias to it is declared. This
      ability is required by checkpoint-restore project to be able to
      save full state of a socket.
      There are two issues with getting filter back.
      First, kernel modifies the sock_filter->code on filter load, thus in
      order to return the filter element back to user we have to decode it
      into user-visible constants. Fortunately the modification in question
      is interconvertible.
      Second, the BPF_S_ALU_DIV_K code modifies the command argument k to
      speed up the run-time division by doing kernel_k = reciprocal(user_k).
      Bad news is that different user_k may result in same kernel_k, so we
      can't get the original user_k back. Good news is that we don't have
      to do it. What we need to is calculate a user2_k so, that
        reciprocal(user2_k) == reciprocal(user_k) == kernel_k
      i.e. if it's re-loaded back the compiled again value will be exactly
      the same as it was. That said, the user2_k can be calculated like this
        user2_k = reciprocal(kernel_k)
      with an exception, that if kernel_k == 0, then user2_k == 1.
      The optlen argument is treated like this -- when zero, kernel returns
      the amount of sock_fprog elements in filter, otherwise it should be
      large enough for the sock_fprog array.
      changes since v1:
      * Declared SO_GET_FILTER in all arch headers
      * Added decode of vlan-tag codes
      Signed-off-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  3. 31 Oct, 2012 1 commit
  4. 24 Sep, 2012 1 commit
  5. 10 Sep, 2012 1 commit
  6. 31 Jul, 2012 1 commit
    • Mel Gorman's avatar
      netvm: allow skb allocation to use PFMEMALLOC reserves · c93bdd0e
      Mel Gorman authored
      Change the skb allocation API to indicate RX usage and use this to fall
      back to the PFMEMALLOC reserve when needed.  SKBs allocated from the
      reserve are tagged in skb->pfmemalloc.  If an SKB is allocated from the
      reserve and the socket is later found to be unrelated to page reclaim, the
      packet is dropped so that the memory remains available for page reclaim.
      Network protocols are expected to recover from this packet loss.
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [davem@davemloft.net: Use static branches, coding style corrections]
      [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  7. 08 Jun, 2012 1 commit
  8. 15 Apr, 2012 1 commit
  9. 13 Apr, 2012 1 commit
  10. 03 Apr, 2012 3 commits
  11. 28 Mar, 2012 1 commit
  12. 19 Oct, 2011 1 commit
  13. 02 Aug, 2011 1 commit
  14. 26 May, 2011 1 commit
  15. 23 May, 2011 1 commit
  16. 28 Apr, 2011 1 commit
    • Eric Dumazet's avatar
      net: filter: Just In Time compiler for x86-64 · 0a14842f
      Eric Dumazet authored
      In order to speedup packet filtering, here is an implementation of a
      JIT compiler for x86_64
      It is disabled by default, and must be enabled by the admin.
      echo 1 >/proc/sys/net/core/bpf_jit_enable
      It uses module_alloc() and module_free() to get memory in the 2GB text
      kernel range since we call helpers functions from the generated code.
      EAX : BPF A accumulator
      EBX : BPF X accumulator
      RDI : pointer to skb   (first argument given to JIT function)
      RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
      r9d : skb->len - skb->data_len (headlen)
      r8  : skb->data
      To get a trace of generated code, use :
      echo 2 >/proc/sys/net/core/bpf_jit_enable
      Example of generated code :
      # tcpdump -p -n -s 0 -i eth1 host
      flen=18 proglen=147 pass=3 image=ffffffffa00b5000
      JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
      JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
      JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
      JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
      JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
      JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
      JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
      JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
      JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
      JIT code: ffffffffa00b5090: c0 c9 c3
      BPF program is 144 bytes long, so native program is almost same size ;)
      (000) ldh      [12]
      (001) jeq      #0x800           jt 2    jf 8
      (002) ld       [26]
      (003) and      #0xffffff00
      (004) jeq      #0xc0a81400      jt 16   jf 5
      (005) ld       [30]
      (006) and      #0xffffff00
      (007) jeq      #0xc0a81400      jt 16   jf 17
      (008) jeq      #0x806           jt 10   jf 9
      (009) jeq      #0x8035          jt 10   jf 17
      (010) ld       [28]
      (011) and      #0xffffff00
      (012) jeq      #0xc0a81400      jt 16   jf 13
      (013) ld       [38]
      (014) and      #0xffffff00
      (015) jeq      #0xc0a81400      jt 16   jf 17
      (016) ret      #65535
      (017) ret      #0
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  17. 31 Mar, 2011 1 commit
  18. 18 Jan, 2011 1 commit
  19. 09 Jan, 2011 1 commit
  20. 21 Dec, 2010 1 commit
    • Eric Dumazet's avatar
      filter: optimize accesses to ancillary data · 12b16dad
      Eric Dumazet authored
      We can translate pseudo load instructions at filter check time to
      dedicated instructions to speed up filtering and avoid one switch().
      libpcap currently uses SKF_AD_PROTOCOL, but custom filters probably use
      other ancillary accesses.
      Note : I made the assertion that ancillary data was always accessed with
      BPF_LD|BPF_?|BPF_ABS instructions, not with BPF_LD|BPF_?|BPF_IND ones
      (offset given by K constant, not by K + X register)
      On x86_64, this saves a few bytes of text :
      # size net/core/filter.o.*
         text	   data	    bss	    dec	    hex	filename
         4864	      0	      0	   4864	   1300	net/core/filter.o.new
         4944	      0	      0	   4944	   1350	net/core/filter.o.old
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  21. 09 Dec, 2010 1 commit
  22. 08 Dec, 2010 1 commit
  23. 06 Dec, 2010 3 commits
    • Eric Dumazet's avatar
      filter: add a security check at install time · 2d5311e4
      Eric Dumazet authored
      We added some security checks in commit 57fe93b3
      (filter: make sure filters dont read uninitialized memory) to close a
      potential leak of kernel information to user.
      This added a potential extra cost at run time, while we can perform a
      check of the filter itself, to make sure a malicious user doesnt try to
      abuse us.
      This patch adds a check_loads() function, whole unique purpose is to
      make this check, allocating a temporary array of mask. We scan the
      filter and propagate a bitmask information, telling us if a load M(K) is
      allowed because a previous store M(K) is guaranteed. (So that
      sk_run_filter() can possibly not read unitialized memory)
      Note: this can uncover application bug, denying a filter attach,
      previously allowed.
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Dan Rosenberg <drosenberg@vsecurity.com>
      Cc: Changli Gao <xiaosuo@gmail.com>
      Acked-by: default avatarChangli Gao <xiaosuo@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      filter: add SKF_AD_RXHASH and SKF_AD_CPU · da2033c2
      Eric Dumazet authored
      Add SKF_AD_RXHASH and SKF_AD_CPU to filter ancillary mechanism,
      to be able to build advanced filters.
      This can help spreading packets on several sockets with a fast
      selection, after RPS dispatch to N cpus for example, or to catch a
      percentage of flows in one queue.
      tcpdump -s 500 "cpu = 1" :
      [0] ld CPU
      [1] jeq #1  jt 2  jf 3
      [2] ret #500
      [3] ret #0
      # take 12.5 % of flows (average)
      tcpdump -s 1000 "rxhash & 7 = 2" :
      [0] ld RXHASH
      [1] and #7
      [2] jeq #2  jt 3  jf 4
      [3] ret #1000
      [4] ret #0
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Rui <wirelesser@gmail.com>
      Acked-by: default avatarChangli Gao <xiaosuo@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      filter: fix sk_filter rcu handling · 46bcf14f
      Eric Dumazet authored
      Pavel Emelyanov tried to fix a race between sk_filter_(de|at)tach and
      sk_clone() in commit 47e958ea
      Problem is we can have several clones sharing a common sk_filter, and
      these clones might want to sk_filter_attach() their own filters at the
      same time, and can overwrite old_filter->rcu, corrupting RCU queues.
      We can not use filter->rcu without being sure no other thread could do
      the same thing.
      Switch code to a more conventional ref-counting technique : Do the
      atomic decrement immediately and queue one rcu call back when last
      reference is released.
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  24. 19 Nov, 2010 4 commits
  25. 18 Nov, 2010 2 commits
  26. 10 Nov, 2010 1 commit
    • David S. Miller's avatar
      filter: make sure filters dont read uninitialized memory · 57fe93b3
      David S. Miller authored
      There is a possibility malicious users can get limited information about
      uninitialized stack mem array. Even if sk_run_filter() result is bound
      to packet length (0 .. 65535), we could imagine this can be used by
      hostile user.
      Initializing mem[] array, like Dan Rosenberg suggested in his patch is
      expensive since most filters dont even use this array.
      Its hard to make the filter validation in sk_chk_filter(), because of
      the jumps. This might be done later.
      In this patch, I use a bitmap (a single long var) so that only filters
      using mem[] loads/stores pay the price of added security checks.
      For other filters, additional cost is a single instruction.
      [ Since we access fentry->k a lot now, cache it in a local variable
        and mark filter entry pointer as const. -DaveM ]
      Reported-by: default avatarDan Rosenberg <drosenberg@vsecurity.com>
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  27. 25 Oct, 2010 1 commit
  28. 27 Sep, 2010 1 commit
  29. 25 Jun, 2010 1 commit
    • Hagen Paul Pfeifer's avatar
      net: optimize Berkeley Packet Filter (BPF) processing · 01f2f3f6
      Hagen Paul Pfeifer authored
      Gcc is currenlty not in the ability to optimize the switch statement in
      sk_run_filter() because of dense case labels. This patch replace the
      OR'd labels with ordered sequenced case labels. The sk_chk_filter()
      function is modified to patch/replace the original OPCODES in a
      ordered but equivalent form. gcc is now in the ability to transform the
      switch statement in sk_run_filter into a jump table of complexity O(1).
      Until this patch gcc generates a sequence of conditional branches (O(n) of 567
      byte .text segment size (arch x86_64):
      7ff: 8b 06                 mov    (%rsi),%eax
      801: 66 83 f8 35           cmp    $0x35,%ax
      805: 0f 84 d0 02 00 00     je     adb <sk_run_filter+0x31d>
      80b: 0f 87 07 01 00 00     ja     918 <sk_run_filter+0x15a>
      811: 66 83 f8 15           cmp    $0x15,%ax
      815: 0f 84 c5 02 00 00     je     ae0 <sk_run_filter+0x322>
      81b: 77 73                 ja     890 <sk_run_filter+0xd2>
      81d: 66 83 f8 04           cmp    $0x4,%ax
      821: 0f 84 17 02 00 00     je     a3e <sk_run_filter+0x280>
      827: 77 29                 ja     852 <sk_run_filter+0x94>
      829: 66 83 f8 01           cmp    $0x1,%ax
      With the modification the compiler translate the switch statement into
      the following jump table fragment:
      7ff: 66 83 3e 2c           cmpw   $0x2c,(%rsi)
      803: 0f 87 1f 02 00 00     ja     a28 <sk_run_filter+0x26a>
      809: 0f b7 06              movzwl (%rsi),%eax
      80c: ff 24 c5 00 00 00 00  jmpq   *0x0(,%rax,8)
      813: 44 89 e3              mov    %r12d,%ebx
      816: e9 43 03 00 00        jmpq   b5e <sk_run_filter+0x3a0>
      81b: 41 89 dc              mov    %ebx,%r12d
      81e: e9 3b 03 00 00        jmpq   b5e <sk_run_filter+0x3a0>
      Furthermore, I reordered the instructions to reduce cache line misses by
      order the most common instruction to the start.
      Signed-off-by: default avatarHagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  30. 22 Apr, 2010 1 commit
  31. 30 Mar, 2010 1 commit
    • Tejun Heo's avatar
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo authored
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      The script does the followings.
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
      The conversion was done in the following steps.
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
      6. percpu.h was updated not to include slab.h.
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Guess-its-ok-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  32. 25 Feb, 2010 1 commit
    • Paul E. McKenney's avatar
      net: Add checking to rcu_dereference() primitives · a898def2
      Paul E. McKenney authored
      Update rcu_dereference() primitives to use new lockdep-based
      checking. The rcu_dereference() in __in6_dev_get() may be
      protected either by rcu_read_lock() or RTNL, per Eric Dumazet.
      The rcu_dereference() in __sk_free() is protected by the fact
      that it is never reached if an update could change it.  Check
      for this by using rcu_dereference_check() to verify that the
      struct sock's ->sk_wmem_alloc counter is zero.
      Acked-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>