ebpf-rs
ebpf-rs is a verifier, interpreter, and JIT compiler for eBPF, together with the facilities to hook them into rCore. It is written in Rust and currently in development. This document serves as a reference for eBPF, Kprobes, and the other technologies used in this project, and as a guide for future implementations.
link: https://github.com/NickCao/ebpf-rs
eBPF machine
the eBPF machine is a traditional RISC register machine with a total of 11 64-bit registers, a program counter and a 512-byte fixed-size stack.
register allocation
10 general-purpose read-write registers numbered r0 to r9, and 1 read-only frame pointer numbered r10. The pc register is implicit and cannot be read.
calling convention
- r0: return value
- r1-r5: arguments (more than 5 arguments are not supported)
- r6-r9: callee saved
eBPF instruction
encoding
64-bit fixed-size instructions encoded in host byte order
63 31 15 11 7 0
+------------------------+----------------+----+----+--------+
|immediate |offset |src |dst |opcode |
+------------------------+----------------+----+----+--------+
from most significant bit to least significant bit
- 32 bit signed immediate
- 16 bit signed offset
- 4 bit source register
- 4 bit destination register
- 8 bit opcode
when fields are not used, they should be zeroed
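For illustration, decoding this layout in an interpreter could look like the following Rust sketch (the struct and field names are ours, not part of the specification):

struct Insn {
    opcode: u8,  // bits 0-7
    dst: u8,     // bits 8-11
    src: u8,     // bits 12-15
    offset: i16, // bits 16-31
    imm: i32,    // bits 32-63
}

fn decode(raw: u64) -> Insn {
    Insn {
        opcode: raw as u8,
        dst: ((raw >> 8) & 0xf) as u8,
        src: ((raw >> 12) & 0xf) as u8,
        offset: (raw >> 16) as u16 as i16,
        imm: (raw >> 32) as u32 as i32,
    }
}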
opcode
opcode has two different encodings
type 1 type 2
7 4 2 0 7 3 2 0
+----+----+-----+ +----+---+-----+
|mode|size|class| |op |src|class|
+----+----+-----+ +----+---+-----+
the two encodings are distinguished by the class field
type 1
LD 0x0 // load
LDX 0x1
ST 0x2 // store
STX 0x3
type 2
ALU 0x4
JMP 0x5
JMP32 0x6
ALU64 0x7
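Building on the decoding sketch above, the class field and the remaining opcode fields can be extracted with simple shifts and masks (the function names are ours):

fn class(opcode: u8) -> u8 { opcode & 0x07 }            // both encodings
fn size(opcode: u8) -> u8 { (opcode >> 3) & 0x03 }      // type 1 only
fn mode(opcode: u8) -> u8 { (opcode >> 5) & 0x07 }      // type 1 only
fn src_flag(opcode: u8) -> u8 { (opcode >> 3) & 0x01 }  // type 2 only
fn op(opcode: u8) -> u8 { opcode & 0xf0 }               // type 2 only, matches the tables below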
type 1
type 1 instructions are generally related to memory access, with a few specialized exceptions for accessing packet data, which are out of the scope of this document
the size field specifies the size of the memory access
W 0x0 // 4 byte
H 0x1 // 2 byte
B 0x2 // 1 byte
DW 0x3 // 8 byte
the mode field specifies the addressing mode
IMM 0x0 // immediate
ABS 0x1 // packet absolute offset
IND 0x2 // packet indirect (register) offset
MEM 0x3 // memory
LEN 0x4 // reserved
MSH 0x5 // reserved
XADD 0x6 // exclusive (atomic) add
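As an example, an interpreter might execute an LDX instruction with MEM addressing (load from src + offset into dst) roughly as follows; this is a sketch that reuses the Insn struct from the earlier example and assumes the verifier has already proven the access safe:

fn exec_ldx_mem(regs: &mut [u64; 11], insn: &Insn) {
    // dst = *(size *)(src + offset)
    let addr = regs[insn.src as usize].wrapping_add(insn.offset as i64 as u64) as usize;
    regs[insn.dst as usize] = unsafe {
        match (insn.opcode >> 3) & 0x03 {
            0x0 => core::ptr::read_unaligned(addr as *const u32) as u64, // W, 4 bytes
            0x1 => core::ptr::read_unaligned(addr as *const u16) as u64, // H, 2 bytes
            0x2 => core::ptr::read_unaligned(addr as *const u8) as u64,  // B, 1 byte
            _   => core::ptr::read_unaligned(addr as *const u64),        // DW, 8 bytes
        }
    };
}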
type 2
type 2 instructions are for ALU and control flow operations
the src field specifies the source of the operand
K 0x0 // immediate
X 0x1 // register
the op field specifies the operation
for class ALU or ALU64
BPF_ADD 0x00
BPF_SUB 0x10
BPF_MUL 0x20
BPF_DIV 0x30
BPF_OR 0x40
BPF_AND 0x50
BPF_LSH 0x60
BPF_RSH 0x70
BPF_NEG 0x80
BPF_MOD 0x90
BPF_XOR 0xa0
BPF_MOV 0xb0
BPF_ARSH 0xc0
BPF_END 0xd0 // endianness conversion
for class JMP or JMP32
BPF_JA 0x00 // JMP only
BPF_JEQ 0x10
BPF_JGT 0x20
BPF_JGE 0x30
BPF_JSET 0x40
BPF_JNE 0x50
BPF_JSGT 0x60
BPF_JSGE 0x70
BPF_CALL 0x80 // JMP only, function call
BPF_EXIT 0x90 // JMP only, function return
BPF_JLT 0xa0
BPF_JLE 0xb0
BPF_JSLT 0xc0
BPF_JSLE 0xd0
JMP32 and ALU instructions operate on the lower 32 bits of the registers; ALU instructions additionally zero the upper half of the destination register
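For instance, a 32-bit ADD works on the lower halves of its operands and zero-extends the result back into the 64-bit destination register; in an interpreter this could look like the following sketch (names are ours):

fn exec_add32(regs: &mut [u64; 11], dst: usize, src_val: u64) {
    // operate on the lower 32 bits, then zero-extend into the destination
    regs[dst] = (regs[dst] as u32).wrapping_add(src_val as u32) as u64;
}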
double word load
there exists one special instruction that occupies 128 bits: LD IMM DW. It loads two 32-bit immediates, one from this instruction and one from the following one, into a single register, effectively loading a 64-bit immediate
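A runtime could assemble the 64-bit immediate roughly as follows (a sketch reusing the Insn struct above; per the specification, the first instruction's imm holds the lower 32 bits and the second one's imm holds the upper 32 bits):

fn exec_ld_imm_dw(regs: &mut [u64; 11], first: &Insn, second: &Insn) {
    let lo = first.imm as u32 as u64;   // lower half from the first slot
    let hi = second.imm as u32 as u64;  // upper half from the second, pseudo slot
    regs[first.dst as usize] = (hi << 32) | lo;
}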
implementation notes
for eBPF runtimes, there is no need to implement the full eBPF instruction set; a subset is more than enough for kernel instrumentation. A great reference is the LLVM unit tests for the eBPF target: llvm test eBPF
eBPF function
eBPF functions are called with the specialized instruction CALL, using a predefined function index encoded in the immediate. Functions are normally provided as helpers by the eBPF runtime to augment the restricted functionalities of eBPF programs. Some eBPF runtimes also allow eBPF programs to call each other following the same calling convention, with function indices assigned dynamically at load time.
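A runtime could dispatch CALL instructions through a table of helper functions indexed by the immediate, with arguments taken from r1-r5 and the return value placed in r0. A minimal sketch follows; the helper shown is hypothetical, not an actual helper of this project:

type Helper = fn(u64, u64, u64, u64, u64) -> u64;

// hypothetical helper with index 0: return the sum of its first two arguments
fn helper_add(a: u64, b: u64, _c: u64, _d: u64, _e: u64) -> u64 {
    a.wrapping_add(b)
}

static HELPERS: &[Helper] = &[helper_add];

fn exec_call(regs: &mut [u64; 11], imm: i32) {
    let helper = HELPERS[imm as usize]; // the immediate selects the helper
    regs[0] = helper(regs[1], regs[2], regs[3], regs[4], regs[5]);
}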
map
as the eBPF machine only has a stack of limited size and no heap allocation, most eBPF runtimes provide hashmap-like data structures in the form of helper functions. Maps are normally created when eBPF programs are loaded into the kernel, by a userland helper, and the parameters for maps such as the data type or maximum size can be specified in special ELF sections.
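As an illustration, a runtime-side map and its lookup helper might be shaped like this; the types and names are ours and only sketch the idea, not the project's actual API (a no_std kernel would use the alloc crate's BTreeMap instead):

use std::collections::BTreeMap;

// a toy map owned by the runtime; key and value sizes are fixed at creation time
struct Map {
    key_size: usize,
    data: BTreeMap<Vec<u8>, Vec<u8>>,
}

// helper exposed to eBPF programs: takes the map and a pointer to the key,
// returns a pointer to the value, or 0 if the key is absent
fn map_lookup_elem(map: &Map, key_ptr: u64) -> u64 {
    let key = unsafe { core::slice::from_raw_parts(key_ptr as usize as *const u8, map.key_size) };
    match map.data.get(key) {
        Some(value) => value.as_ptr() as u64,
        None => 0,
    }
}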
eBPF with clang
build
instead of manually crafting eBPF programs, we can use clang to compile a subset of C code into eBPF programs
clang -target bpf -Werror -O2 -c ebpf.c -o ebpf.o
notice that -O2 MUST NOT be omitted, otherwise clang will generate pseudo instructions that fall outside the eBPF specification
disassemble
llvm-objdump --triple=bpf -S ebpf.o
c subset
- all functions must be inlined
- no global variables
- no unbounded loops
intro to Kprobes
Kprobes enables you to dynamically insert a breakpoint into nearly any kernel routine and specify a handler routine to be invoked when the breakpoint is hit. It does so by replacing the instruction at the probe point with INT3 on x86 (or an equivalent instruction on other platforms) and single stepping the replaced instruction in the breakpoint handler after executing the user specified handler routines.
A fork of rCore with a Kprobes implementation can be found at https://github.com/NickCao/rCore
single step
single stepping the replaced instruction is trickier than it seems. Only if the instruction would not produce exceptions or results related to pc or privilege levels can we safely run it outside of its original context; otherwise we must take all possible side effects into consideration and remediate them.
simplified implementation
here we propose a simplified implementation of Kprobes that is easier to implement, while introducing some limitations. Instead of allowing a trap at any point in any traced function, we only trace functions that eventually return to their caller without recursion, and limit the breakpoint to the entry point of the function.
the way to insert the breakpoint remains the same, but when the breakpoint is hit, we replace the content of the ra register with the address of our own handler, restore the original instruction, run the user defined handlers, then jump back to the entry point of the function. When the function returns (to our handler), we insert the breakpoint again and wait for it to be hit once more.
while this implementation saves us the cost of single stepping the trapped instruction, it imposes heavy restrictions on the functions that can be traced and generally won't work on SMP systems.
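A rough sketch of this scheme on RISC-V follows; TrapFrame, write_insn and return_stub are placeholders standing in for the corresponding rCore facilities, and locking and SMP concerns are omitted:

const EBREAK: u32 = 0x0010_0073; // riscv ebreak encoding

struct TrapFrame { pc: usize, ra: usize } // placeholder for the real trap frame

fn write_insn(_addr: usize, _insn: u32) { /* patch kernel text and flush the icache */ }
fn return_stub() { /* the probed function returns here, trapping into on_return */ }

struct Kprobe {
    entry: usize,        // address of the probed function's entry point
    saved_insn: u32,     // original instruction replaced by ebreak
    saved_ra: usize,     // caller's return address, captured on entry
    pre_handler: fn(),   // user supplied handler run on function entry
    post_handler: fn(),  // user supplied handler run on function return
}

// called from the trap handler when the ebreak at the entry point fires
fn on_entry(probe: &mut Kprobe, tf: &mut TrapFrame) {
    probe.saved_ra = tf.ra;                    // remember the real return address
    tf.ra = return_stub as usize;              // make the function return into our stub
    write_insn(probe.entry, probe.saved_insn); // restore the original instruction
    (probe.pre_handler)();                     // run the user defined handler
    tf.pc = probe.entry;                       // jump back to the entry point
}

// the probed function "returns" into the stub, which traps back here
fn on_return(probe: &mut Kprobe, tf: &mut TrapFrame) {
    (probe.post_handler)();
    write_insn(probe.entry, EBREAK); // re-arm the breakpoint
    tf.pc = probe.saved_ra;          // resume execution in the original caller
}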
run handler on function return
when probing functions, both the entry point and the exit of the function are of interest, thus we need to redirect the return address of the function to an address we control in order to run custom handlers on function return. In recursive functions and on SMP systems, a single function may be invoked multiple times simultaneously, so we cannot set ra to a static value; instead we need a trampoline, a contiguous region of ebreak instructions, and set ra to an unused address within that range, in order to distinguish between invocations. A trampoline implemented this way may not be able to handle functions on hot paths due to its limited size, thus measures to dynamically grow it should be implemented, and don't forget to flush the icache when doing so.
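A minimal sketch of the slot bookkeeping for such a trampoline (names are ours; growing the region and flushing the icache are left out):

const SLOTS: usize = 64;

// N consecutive ebreak slots; each live invocation of a probed function gets
// its own slot, so the trap address identifies the invocation
struct Trampoline {
    base: usize,                      // start address of the ebreak region
    saved_ra: [Option<usize>; SLOTS], // caller return address per slot
}

impl Trampoline {
    // pick a free slot and return the address to put into ra
    fn enter(&mut self, caller_ra: usize) -> Option<usize> {
        for (i, slot) in self.saved_ra.iter_mut().enumerate() {
            if slot.is_none() {
                *slot = Some(caller_ra);
                return Some(self.base + i * 4); // each ebreak is 4 bytes
            }
        }
        None // trampoline full: grow it (and flush the icache) or skip the probe
    }

    // the ebreak at `addr` fired: recover the caller's ra and release the slot
    fn exit(&mut self, addr: usize) -> Option<usize> {
        let i = (addr - self.base) / 4;
        self.saved_ra[i].take()
    }
}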
symbols in Rust programs
Rust follows an obscure symbol name mangling scheme, making it impossible to recover the original type names from the symbol table without further information from the compiler.
However, a new mangling scheme has been proposed, Rust Symbol Mangling (v0), which makes the mangled names more machine readable. Now that support for this mangling scheme is implemented in both rustc and the upstream toolchains, we can use it for better visibility into Rust programs.
# .cargo/config.toml
[build]
rustflags = ["-Zsymbol-mangling-version=v0"]
both gdb and objdump should support this scheme out of the box; for other tools, you can still use rustfilt manually to demangle the names.
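for example, to get a demangled disassembly of the kernel image (the file name here is only an example), pipe the output through rustfilt:
llvm-objdump -d kernel.elf | rustfilt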
probe async Rust functions
Async functions in Rust are not functions at all, but AST-level constructs that get translated into a finite state machine (FSM) that drives itself to completion when polled. Thus we cannot probe async functions the way we probe normal functions. To solve the problem, we can either add no-op normal functions to async functions and probe those instead, or probe the poll function of the generated FSM, which results in less visibility but still yields some useful information about the async function, especially performance related information.
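A sketch of the first approach: an empty, never-inlined marker function is called at the start of the async function, and the Kprobe is placed on the marker's symbol instead (the names here are ours, chosen for illustration):

// an empty marker that must not be inlined, so its symbol stays in the
// symbol table and its entry point can be probed
#[inline(never)]
fn probe_marker_handle_request(id: usize) {
    // black_box keeps the call from being optimized away entirely
    core::hint::black_box(id);
}

async fn handle_request(id: usize) {
    probe_marker_handle_request(id); // probe this symbol instead of the async fn
    // ... the actual async work follows here
}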