1. 31 Jul, 2015 1 commit
    • bpf: add helpers to access tunnel metadata · d3aa45ce
      Alexei Starovoitov authored
      Introduce helpers to let eBPF programs attached to TC manipulate tunnel metadata:
      bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
      skb: pointer to skb
      key: pointer to 'struct bpf_tunnel_key'
      size: size of 'struct bpf_tunnel_key'
      flags: room for future extensions
      
      The first eBPF program that uses these helpers will allocate per_cpu
      metadata_dst structures that will be used on TX.
      On RX, metadata_dst is allocated by the tunnel driver.
      
      Typical usage for TX:
      struct bpf_tunnel_key tkey;
      ... populate tkey ...
      bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), 0);
      bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
      
      RX:
      struct bpf_tunnel_key tkey = {};
      bpf_skb_get_tunnel_key(skb, &tkey, sizeof(tkey), 0);
      ... lookup or redirect based on tkey ...
      
      'struct bpf_tunnel_key' will be extended in the future by adding
      elements to the end and the 'size' argument will indicate which fields
      are populated, thereby keeping backwards compatibility.
      The 'flags' argument may be used as well when the 'size' is not enough or
      to indicate a completely different layout of bpf_tunnel_key.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 20 Jul, 2015 2 commits
    • bpf: introduce bpf_skb_vlan_push/pop() helpers · 4e10df9a
      Alexei Starovoitov authored
      Allow eBPF programs attached to TC qdiscs to call skb_vlan_push/pop via
      helper functions. These functions may change skb->data/hlen which are
      cached by some JITs to improve performance of ld_abs/ld_ind instructions.
      Therefore JITs need to recognize bpf_skb_vlan_push/pop() calls,
      re-compute header len and re-cache skb->data/hlen back into cpu registers.
      Note, skb->data/hlen are not directly accessible from the programs,
      so any changes to skb->data done either by these helpers or by other
      TC actions are safe.
      
      eBPF JIT is supported by three architectures:
      - arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is.
      - x64 JIT re-caches skb->data/hlen unconditionally after vlan_push/pop calls
        (experiments showed that conditional re-caching is slower).
      - s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present
        in the program (re-caching is tbd).
      
      These helpers allow more scalable handling of VLANs from the programs.
      Instead of creating thousands of vlan netdevs on top of eth0 and attaching
      TC+ingress+bpf to all of them, the program can be attached to eth0 directly
      and manipulate vlans as necessary.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ebpf: add helper to retrieve net_cls's classid cookie · 8d20aabe
      Daniel Borkmann authored
      It would be very useful to retrieve the net_cls cgroup's classid from an
      eBPF program to allow for more fine-grained classification; it could be
      used directly or in conjunction with additional policies. E.g. docker,
      but also tooling such as cgexec, can easily run applications via net_cls
      cgroups:
      
        cgcreate -g net_cls:/foo
        echo 42 > foo/net_cls.classid
        cgexec -g net_cls:foo <prog>
      
      Thus, the respective classid cookie of foo can then be looked up on
      the egress path to apply further policies. The helper is designed such
      that a non-zero value returns the cgroup id.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 15 Jun, 2015 1 commit
    • bpf: introduce current->pid, tgid, uid, gid, comm accessors · ffeedafb
      Alexei Starovoitov authored
      eBPF programs attached to kprobes need to filter based on
      current->pid, uid and other fields, so introduce helper functions:
      
      u64 bpf_get_current_pid_tgid(void)
      Return: current->tgid << 32 | current->pid
      
      u64 bpf_get_current_uid_gid(void)
      Return: current_gid << 32 | current_uid
      
      bpf_get_current_comm(char *buf, int size_of_buf)
      stores current->comm into buf
      
      They can be used from the programs attached to TC as well to classify packets
      based on current task fields.
      
      Update tracex2 example to print histogram of write syscalls for each process
      instead of aggregated for all.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 07 Jun, 2015 1 commit
    • bpf: allow programs to write to certain skb fields · d691f9e8
      Alexei Starovoitov authored
      Allow programs to read/write the skb->mark and tc_index fields, and
      ((struct qdisc_skb_cb *)cb)->data.
      
      mark and tc_index are generically useful in TC.
      cb[0]-cb[4] are primarily used to pass arguments from one
      program to another called via bpf_tail_call(), as can be
      seen in the sockex3_kern.c example.
      
      All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs.
      mark, tc_index are writeable from tc_cls_act only.
      cb[0]-cb[4] are writeable by both sockets and tc_cls_act.
      
      Add verifier tests and improve sample code.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 03 Jun, 2015 1 commit
  6. 30 May, 2015 1 commit
  7. 21 May, 2015 1 commit
    • bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
      Alexei Starovoitov authored
      Introduce the bpf_tail_call(ctx, &jmp_table, index) helper function,
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail is that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      extra call frame.
      It's trivially done in the interpreter and a bit trickier in JITs.
      In the case of the x64 JIT, the bigger part of the generated assembler
      prologue is common for all programs, so it is simply skipped while jumping.
      Other JITs can do a similar prologue-skipping optimization or
      unwind the stack before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      Since all BPF programs are identified by file descriptor, user space
      needs to populate the jmp_table with FDs of other BPF programs.
      If jmp_table[index] is empty, bpf_tail_call() doesn't jump anywhere
      and program execution continues as normal.
      
      New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
      populate this jmp_table array with FDs of other bpf programs.
      Programs can share the same jmp_table array or use multiple jmp_tables.
      
      The chain of tail calls can form unpredictable dynamic loops; therefore
      tail_call_cnt is used to limit the number of calls and is currently set to 32.
      
      Use cases:
      ==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
      - for the TC use case, bpf_tail_call() allows implementing reclassify-like logic
      
      - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
        are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
      - high performance of bpf_tail_call() is the goal.
        It could have been implemented without JIT changes as a wrapper on top of
        the BPF_PROG_RUN() macro, but with two downsides:
        . all programs would have to pay a performance penalty for this feature,
          and the tail call itself would be slower, since a mandatory stack unwind,
          return and stack allocation would be done for every tail call.
        . tail calls would be limited to programs running with preemption disabled,
          since the generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it
          would need to be either a global per_cpu variable accessed by helper and
          wrapper, or a global variable protected by locks.
      
        In this implementation x64 JIT bypasses stack unwind and jumps into the
        callee program after prologue.
      
      - bpf_prog_array_compatible() ensures that the prog_type of callee and caller
        are the same and that the JITed/non-JITed flag is the same, since calling a
        JITed program from a non-JITed one is invalid because the stack frames are
        different. Similarly, calling a kprobe-type program from a socket-type
        program is invalid.
      
      - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
        abstraction, its user space API and all of verifier logic.
        It's in the existing arraymap.c file, since several functions are
        shared with regular array map.
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 16 Apr, 2015 1 commit
    • bpf: fix bpf helpers to use skb->mac_header relative offsets · a166151c
      Alexei Starovoitov authored
      As the short-term solution, let's fix the bpf helper functions to use
      skb->mac_header relative offsets instead of skb->data, in order to
      get the same eBPF programs with cls_bpf and act_bpf to work on the
      ingress and egress qdisc paths. We need to ensure that mac_header is set
      before calling into programs. This is effectively the first option
      from the below referenced discussion.
      
      A more long-term solution for LD_ABS|LD_IND instructions will be more
      intrusive but also more beneficial than this; it will be implemented
      later, as it's too risky at this point in time.
      
      I.e., we plan to look into the option of moving skb_pull() out of
      eth_type_trans() and into netif_receive_skb() as has been suggested
      as second option. Meanwhile, this solution ensures ingress can be
      used with eBPF, too, and that we won't run into ABI troubles later.
      For dealing with negative offsets inside eBPF helper functions,
      we've implemented bpf_skb_clone_unwritable() to test for unwriteable
      headers.
      
      Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
      Fixes: 608cd71a ("tc: bpf: generalize pedit action")
      Fixes: 91bc4822 ("tc: bpf: add checksum helpers")
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 06 Apr, 2015 1 commit
    • tc: bpf: add checksum helpers · 91bc4822
      Alexei Starovoitov authored
      Commit 608cd71a ("tc: bpf: generalize pedit action") has added the
      possibility to mangle packet data to BPF programs in the tc pipeline.
      This patch adds two helpers bpf_l3_csum_replace() and bpf_l4_csum_replace()
      for fixing up the protocol checksums after the packet mangling.
      
      It also adds a 'flags' argument to the bpf_skb_store_bytes() helper to avoid
      unnecessary checksum recomputations when BPF programs are adjusting l3/l4
      checksums, and documents all three helpers in the uapi header.
      
      Moreover, a sample program is added to show how BPF programs can make use
      of the mangle and csum helpers.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 03 Apr, 2015 1 commit
  11. 02 Apr, 2015 3 commits
    • tracing: Allow BPF programs to call bpf_trace_printk() · 9c959c86
      Alexei Starovoitov authored
      Debugging of BPF programs needs some form of printk from the
      program, so let programs call a limited trace_printk() with %d %u
      %x %p modifiers only.
      
      Similar to kernel modules, during program load the verifier checks
      whether the program calls bpf_trace_printk() and, if so, the kernel
      allocates trace_printk buffers and emits a big 'this is debug
      only' banner.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-6-git-send-email-ast@plumgrid.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • tracing: Allow BPF programs to call bpf_ktime_get_ns() · d9847d31
      Alexei Starovoitov authored
      bpf_ktime_get_ns() is used by programs to compute the time delta
      between events or as a timestamp.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-5-git-send-email-ast@plumgrid.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • tracing, perf: Implement BPF programs attached to kprobes · 2541517c
      Alexei Starovoitov authored
      BPF programs, attached to kprobes, provide a safe way to execute
      user-defined BPF byte-code programs without being able to crash or
      hang the kernel in any way. The BPF engine makes sure that such
      programs have a finite execution time and that they cannot break
      out of their sandbox.
      
      The user interface is to attach to a kprobe via the perf syscall:
      
      	struct perf_event_attr attr = {
      		.type	= PERF_TYPE_TRACEPOINT,
      		.config	= event_id,
      		...
      	};
      
      	event_fd = perf_event_open(&attr,...);
      	ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
      
      'prog_fd' is a file descriptor associated with a previously loaded
      BPF program.
      
      'event_id' is the ID of the created kprobe.
      
      Closing 'event_fd':
      
      	close(event_fd);
      
      ... automatically detaches BPF program from it.
      
      BPF programs can call in-kernel helper functions to:
      
        - lookup/update/delete elements in maps
      
        - probe_read - wrapper of probe_kernel_read() used to access any
          kernel data structures
      
      BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
      architecture dependent) and return 0 to ignore the event and 1 to store
      kprobe event into the ring buffer.
      
      Note, kprobes are fundamentally _not_ a stable kernel ABI,
      so BPF programs attached to kprobes must be recompiled for
      every kernel version, and the user must supply the correct
      LINUX_VERSION_CODE in attr.kern_version during the bpf_prog_load() call.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 29 Mar, 2015 1 commit
  13. 24 Mar, 2015 1 commit
  14. 20 Mar, 2015 1 commit
    • ebpf: add sched_act_type and map it to sk_filter's verifier ops · 94caee8c
      Daniel Borkmann authored
      In order to prepare eBPF support for tc actions, we need to add
      sched_act_type, so that the eBPF verifier is aware of which helper
      functions act_bpf may use, and that it can load skb data and read out
      currently available skb fields.
      
      This is basically analogous to 96be4325 ("ebpf: add sched_cls_type
      and map it to sk_filter's verifier ops").
      
      BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT need to be
      separate since both will have a different set of functionality in
      the future (classifier vs action); thus, we won't run into ABI troubles
      when the point in time comes to diverge functionality from the
      classifier.
      
      The future plan for act_bpf would be that it will be able to write
      into skb->data and alter selected fields mirrored in struct __sk_buff.
      
      For an initial support, it's sufficient to map it to sk_filter_ops.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Reviewed-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 17 Mar, 2015 1 commit
  16. 15 Mar, 2015 3 commits
  17. 01 Mar, 2015 2 commits
    • ebpf: add sched_cls_type and map it to sk_filter's verifier ops · 96be4325
      Daniel Borkmann authored
      As discussed recently and at netconf/netdev01, we want to prevent making
      bpf_verifier_ops registration available for modules, but have them at a
      controlled place inside the kernel instead.
      
      The reason for this is that out-of-tree modules can go crazy and define
      and register any verifier ops they want, doing all sorts of crap, even
      bypassing available GPLed eBPF helper functions. We don't want to offer
      such a shiny playground, of course, but keep strict control to ourselves
      inside the core kernel.
      
      This also encourages us to design eBPF user helpers carefully and
      generically, so they can be shared among various subsystems using eBPF.
      
      For the eBPF traffic classifier (cls_bpf), it's a good start to share
      the same helper facilities as we currently do in eBPF for socket filters.
      
      That way, BPF_PROG_TYPE_SCHED_CLS looks like its own type, thus
      one day if there's a good reason to diverge the set of helper functions
      from the set available to socket filters, we keep ABI compatibility.
      
      In future, we could place all bpf_prog_type_list at a central place,
      perhaps.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ebpf: export BPF_PSEUDO_MAP_FD to uapi · f1a66f85
      Daniel Borkmann authored
      We need to export BPF_PSEUDO_MAP_FD to user space, as it's used in the
      ELF BPF loader where instructions are being loaded that need map fixups.
      
      An initial stage loads all maps into the kernel, and later on replaces
      related instructions in the eBPF blob with BPF_PSEUDO_MAP_FD as source
      register and the actual fd as immediate value.
      
      The kernel verifier recognizes this keyword and replaces the map fd with
      a real pointer internally.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 05 Dec, 2014 1 commit
  19. 18 Nov, 2014 4 commits
    • bpf: allow eBPF programs to use maps · d0003ec0
      Alexei Starovoitov authored
      Expose the bpf_map_lookup_elem(), bpf_map_update_elem() and bpf_map_delete_elem()
      map accessors to eBPF programs.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add array type of eBPF maps · 28fbcfa0
      Alexei Starovoitov authored
      add new map type BPF_MAP_TYPE_ARRAY and its implementation
      
      - optimized for fastest possible lookup()
        . in the future the verifier/JIT may recognize lookup() with a constant key
          and optimize it into a constant pointer. A non-constant key can be
          optimized into direct pointer arithmetic as well, since pointers and
          value_size are constant for the life of the eBPF program.
          In other words, array_map_lookup_elem() may be 'inlined' by the verifier/JIT
          while preserving concurrent access to this map from user space
      
      - two main use cases for array type:
        . 'global' eBPF variables: array of 1 element with key=0 and value is a
          collection of 'global' variables which programs can use to keep the state
          between events
        . aggregation of tracing events into fixed set of buckets
      
      - all array elements pre-allocated and zero initialized at init time
      
      - key is an index in the array and can only be 4 bytes
      
      - map_delete_elem() returns EINVAL, since elements cannot be deleted
      
      - map_update_elem() replaces elements in a non-atomic way
        (for atomic updates the hashtable type should be used instead)
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add hashtable type of eBPF maps · 0f8e4bd8
      Alexei Starovoitov authored
      add new map type BPF_MAP_TYPE_HASH and its implementation
      
      - maps are created/destroyed by userspace. Both userspace and eBPF programs
        can lookup/update/delete elements from the map
      
      - eBPF programs can be called in_irq(), so the spin_lock_irqsave() mechanism
        is used for concurrent updates
      
      - key/value are opaque range of bytes (aligned to 8 bytes)
      
      - user space provides 3 configuration attributes via BPF syscall:
        key_size, value_size, max_entries
      
      - map takes care of allocating/freeing key/value pairs
      
      - map_update_elem() must fail to insert new element when max_entries
        limit is reached to make sure that eBPF programs cannot exhaust memory
      
      - map_update_elem() replaces elements in an atomic way
      
      - optimized for speed of lookup(), which can be called multiple times from
        an eBPF program which itself is triggered by a high volume of events
        . in the future the JIT compiler may recognize the lookup() call and
          optimize it further, since key_size is constant for the life of the
          eBPF program
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add 'flags' attribute to BPF_MAP_UPDATE_ELEM command · 3274f520
      Alexei Starovoitov authored
      The current meaning of the BPF_MAP_UPDATE_ELEM syscall command is:
      either update an existing map element or create a new one.
      Initially the plan was to add a new command to handle the case of
      'create new element if it didn't exist', but the 'flags' style looks
      cleaner and the overall diff is much smaller (more code reused), so add a
      'flags' attribute to the BPF_MAP_UPDATE_ELEM command with the following meaning:
       #define BPF_ANY	0 /* create new element or update existing */
       #define BPF_NOEXIST	1 /* create new element if it didn't exist */
       #define BPF_EXIST	2 /* update existing element */
      
      bpf_update_elem(fd, key, value, BPF_NOEXIST) call can fail with EEXIST
      if element already exists.
      
      bpf_update_elem(fd, key, value, BPF_EXIST) can fail with ENOENT
      if element doesn't exist.
      
      Userspace will call it as:
      int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
      {
          union bpf_attr attr = {
              .map_fd = fd,
              .key = ptr_to_u64(key),
              .value = ptr_to_u64(value),
              .flags = flags,
          };
      
          return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
      }
      
      First two bits of 'flags' are used to encode style of bpf_update_elem() command.
      Bits 2-63 are reserved for future use.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 14 Oct, 2014 1 commit
  21. 26 Sep, 2014 4 commits
    • bpf: verifier (add ability to receive verification log) · cbd35700
      Alexei Starovoitov authored
      add optional attributes for BPF_PROG_LOAD syscall:
      union bpf_attr {
          struct {
      	...
      	__u32         log_level; /* verbosity level of eBPF verifier */
      	__u32         log_size;  /* size of user buffer */
      	__aligned_u64 log_buf;   /* user supplied 'char *buffer' */
          };
      };
      
      When log_level > 0, the verifier will return its verification log in the
      user-supplied buffer 'log_buf', which can be used by the program author to
      analyze why the verifier rejected a given program.
      
      'Understanding eBPF verifier messages' section of Documentation/networking/filter.txt
      provides several examples of these messages, like the program:
      
        BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
        BPF_LD_MAP_FD(BPF_REG_1, 0),
        BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
        BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
        BPF_EXIT_INSN(),
      
      will be rejected with the following multi-line message in log_buf:
      
        0: (7a) *(u64 *)(r10 -8) = 0
        1: (bf) r2 = r10
        2: (07) r2 += -8
        3: (b7) r1 = 0
        4: (85) call 1
        5: (15) if r0 == 0x0 goto pc+1
         R0=map_ptr R10=fp
        6: (7a) *(u64 *)(r0 +4) = 0
        misaligned access off 4 size 8
      
      The format of the output can change at any time as the verifier evolves.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: expand BPF syscall with program load/unload · 09756af4
      Alexei Starovoitov authored
      eBPF programs are similar to kernel modules. They are loaded by the user
      process and automatically unloaded when the process exits. Each eBPF program
      is a safe run-to-completion set of instructions. The eBPF verifier statically
      determines that the program terminates and is safe to execute.
      
      The following syscall wrapper can be used to load the program:
      int bpf_prog_load(enum bpf_prog_type prog_type,
                        const struct bpf_insn *insns, int insn_cnt,
                        const char *license)
      {
          union bpf_attr attr = {
              .prog_type = prog_type,
              .insns = ptr_to_u64(insns),
              .insn_cnt = insn_cnt,
              .license = ptr_to_u64(license),
          };
      
          return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
      }
      where 'insns' is an array of eBPF instructions and 'license' is a string
      that must be GPL compatible to call helper functions marked gpl_only.
      
      Upon successful load the syscall returns prog_fd.
      Use close(prog_fd) to unload the program.
      
      User space tests and examples follow in the later patches.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add lookup/update/delete/iterate methods to BPF maps · db20fd2b
      Alexei Starovoitov authored
      'maps' is a generic storage of different types for sharing data between kernel
      and userspace.
      
      The maps are accessed from user space via BPF syscall, which has commands:
      
      - create a map with given type and attributes
        fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
        returns fd or negative error
      
      - lookup key in a given map referenced by fd
        err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->value
        returns zero and stores the found element into value, or a negative error
      
      - create or update key/value pair in a given map
        err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->value
        returns zero or negative error
      
      - find and delete element by key in a given map
        err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key
      
      - iterate map elements (based on input key return next_key)
        err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->next_key
      
      - close(fd) deletes the map
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: introduce BPF syscall and maps · 99c55f7d
      Alexei Starovoitov authored
      BPF syscall is a multiplexor for a range of different operations on eBPF.
      This patch introduces syscall with single command to create a map.
      Next patch adds commands to access maps.
      
      'maps' is a generic storage of different types for sharing data between kernel
      and userspace.
      
      Userspace example:
      /* this syscall wrapper creates a map with given type and attributes
       * and returns map_fd on success.
       * use close(map_fd) to delete the map
       */
      int bpf_create_map(enum bpf_map_type map_type, int key_size,
                         int value_size, int max_entries)
      {
          union bpf_attr attr = {
              .map_type = map_type,
              .key_size = key_size,
              .value_size = value_size,
              .max_entries = max_entries
          };
      
          return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
      }
      
      'union bpf_attr' is backwards compatible with future extensions.
      
      More details in Documentation/networking/filter.txt and in the manpage.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 09 Sep, 2014 1 commit