ebpf-rs

ebpf-rs is a verifier, interpreter, and JIT compiler for eBPF, along with the facilities to hook them into rCore, written in Rust and currently in development. This document serves as a reference for eBPF, Kprobes, and the other technologies used in this project, and as a guide for future implementations.

link: https://github.com/NickCao/ebpf-rs

eBPF machine

The eBPF machine is a traditional RISC register machine with a total of 11 64-bit registers, a program counter, and a 512-byte fixed-size stack.

register allocation

10 general-purpose read-write registers numbered from r0 to r9, and 1 read-only frame pointer, r10. The pc register is implicit and cannot be read directly.

calling convention

  • r0: return value
  • r1-r5: arguments (more than 5 arguments are not supported)
  • r6-r9: callee saved
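The machine state described above can be sketched as a plain Rust struct. This is a minimal sketch; the names and the choice to point r10 at the top of the stack are illustrative, not taken from any particular runtime.

```rust
/// Minimal sketch of the eBPF machine state (names are ours).
struct Machine {
    regs: [u64; 11],  // r0..r9 general purpose, r10 read-only frame pointer
    pc: usize,        // implicit program counter, not addressable by programs
    stack: [u8; 512], // fixed-size stack, no heap
}

impl Machine {
    fn new() -> Self {
        let mut m = Machine { regs: [0; 11], pc: 0, stack: [0; 512] };
        // r10 points at the top of the stack; the stack grows downwards
        m.regs[10] = m.stack.len() as u64;
        m
    }
}

fn main() {
    let m = Machine::new();
    assert_eq!(m.regs[10], 512);
    println!("stack size: {}", m.stack.len());
}
```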

eBPF instruction

encoding

64-bit fixed-size instructions, encoded in host byte order:

63                       31               15   11   7        0
+------------------------+----------------+----+----+--------+
|immediate               |offset          |src |dst |opcode  |
+------------------------+----------------+----+----+--------+

From the most significant bit to the least significant bit:

  • 32-bit signed immediate
  • 16-bit signed offset
  • 4-bit source register
  • 4-bit destination register
  • 8-bit opcode

Unused fields should be zeroed.
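The field layout above maps directly to shifts and masks. A minimal decoder sketch (the struct and function names are ours):

```rust
/// Decoded fields of a 64-bit eBPF instruction (host byte order).
#[derive(Debug, PartialEq)]
struct Insn {
    opcode: u8,
    dst: u8,
    src: u8,
    offset: i16,
    imm: i32,
}

fn decode(word: u64) -> Insn {
    Insn {
        opcode: (word & 0xff) as u8,          // bits 0..8
        dst: ((word >> 8) & 0xf) as u8,       // bits 8..12
        src: ((word >> 12) & 0xf) as u8,      // bits 12..16
        offset: (word >> 16) as u16 as i16,   // bits 16..32, sign-preserving
        imm: (word >> 32) as u32 as i32,      // bits 32..64, sign-preserving
    }
}

fn main() {
    // mov r1, -1  =>  opcode 0xb7 (ALU64 | MOV | K), dst = 1, imm = -1
    let insn = decode(0xffff_ffff_0000_01b7);
    assert_eq!(insn.opcode, 0xb7);
    assert_eq!(insn.dst, 1);
    assert_eq!(insn.imm, -1);
}
```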

opcode

The opcode has two different encodings:

type 1                type 2
7    4    2     0     7    3   2     0
+----+----+-----+     +----+---+-----+
|mode|size|class|     |op  |src|class|
+----+----+-----+     +----+---+-----+

The two can be distinguished by the class field:

type 1
LD    0x0 // load
LDX   0x1
ST    0x2 // store
STX   0x3

type 2
ALU   0x4
JMP   0x5
JMP32 0x6
ALU64 0x7
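Splitting an opcode byte therefore starts with the class field; a sketch of the two sub-encodings (enum and helper names are ours):

```rust
/// The two opcode sub-encodings, distinguished by the class field.
enum Opcode {
    Type1 { mode: u8, size: u8, class: u8 }, // memory-access classes
    Type2 { op: u8, src: u8, class: u8 },    // ALU / jump classes
}

fn split(opcode: u8) -> Opcode {
    let class = opcode & 0x07;
    if class <= 0x03 {
        // LD, LDX, ST, STX
        Opcode::Type1 {
            mode: opcode >> 5,
            size: (opcode >> 3) & 0x3,
            class,
        }
    } else {
        // ALU, JMP, JMP32, ALU64
        Opcode::Type2 {
            op: opcode & 0xf0,
            src: (opcode >> 3) & 0x1,
            class,
        }
    }
}

fn main() {
    // 0x61 = LDX | MEM | W: class 1, mode 3 (MEM), size 0 (W)
    assert!(matches!(split(0x61), Opcode::Type1 { mode: 3, size: 0, class: 1 }));
    // 0xb7 = ALU64 | MOV | K: class 7, op 0xb0 (MOV), src 0 (K)
    assert!(matches!(split(0xb7), Opcode::Type2 { op: 0xb0, src: 0, class: 7 }));
}
```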

type 1

Type 1 instructions are generally related to memory access, with a few specialized exceptions for accessing packet data, which are out of the scope of this document.

The size field specifies the size of the memory access:

W   0x0  // 4 byte
H   0x1  // 2 byte
B   0x2  // 1 byte
DW  0x3  // 8 byte

The mode field specifies the addressing mode:

IMM   0x0  // immediate
ABS   0x1  // packet absolute offset
IND   0x2  // packet indirect (register) offset
MEM   0x3  // memory
LEN   0x4  // reserved
MSH   0x5  // reserved
XADD  0x6  // exclusive (atomic) add

type 2

Type 2 instructions are for ALU and control-flow operations.

The src field specifies the source of the operand:

K  0x0  // immediate
X  0x1  // register

The op field specifies the operation.

For class ALU or ALU64:

BPF_ADD   0x00
BPF_SUB   0x10
BPF_MUL   0x20
BPF_DIV   0x30
BPF_OR    0x40
BPF_AND   0x50
BPF_LSH   0x60
BPF_RSH   0x70
BPF_NEG   0x80
BPF_MOD   0x90
BPF_XOR   0xa0
BPF_MOV   0xb0
BPF_ARSH  0xc0
BPF_END   0xd0  // endianness conversion

For class JMP or JMP32:

BPF_JA    0x00  // JMP only
BPF_JEQ   0x10
BPF_JGT   0x20
BPF_JGE   0x30
BPF_JSET  0x40
BPF_JNE   0x50
BPF_JSGT  0x60
BPF_JSGE  0x70
BPF_CALL  0x80  // JMP only, function call
BPF_EXIT  0x90  // JMP only, function return
BPF_JLT   0xa0
BPF_JLE   0xb0
BPF_JSLT  0xc0
BPF_JSLE  0xd0

JMP32 and ALU instructions operate on the lower 32 bits of the registers; ALU instructions additionally zero the upper half of the destination register.
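The zero-extension behaviour is easy to get wrong in an interpreter. A minimal sketch contrasting the 32-bit and 64-bit variants of BPF_ADD (function names are ours):

```rust
// Sketch of the ALU vs ALU64 BPF_ADD step; a real interpreter
// dispatches on the full opcode, this only shows the width handling.
fn alu32_add(dst: u64, src: u64) -> u64 {
    // operate on the lower 32 bits, zeroing the upper half of the result
    (dst as u32).wrapping_add(src as u32) as u64
}

fn alu64_add(dst: u64, src: u64) -> u64 {
    dst.wrapping_add(src)
}

fn main() {
    let dst = 0xdead_beef_ffff_ffff_u64;
    assert_eq!(alu32_add(dst, 1), 0); // upper half zeroed, lower half wrapped
    assert_eq!(alu64_add(dst, 1), 0xdead_bef0_0000_0000);
}
```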

double word load

There exists one special instruction that takes 128 bits: LD IMM DW. It loads two 32-bit immediates, one from this instruction and one from the next, into a single register, effectively loading a 64-bit immediate.
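Assembling the 64-bit value from the two immediate fields can be sketched as follows (the function name is ours):

```rust
// Sketch: build the 64-bit immediate of LD IMM DW from the imm fields
// of two consecutive instruction slots.
fn ld_imm_dw(imm_lo: i32, imm_hi: i32) -> u64 {
    // low 32 bits from this instruction, high 32 bits from the next one
    (imm_lo as u32 as u64) | ((imm_hi as u32 as u64) << 32)
}

fn main() {
    let v = ld_imm_dw(0x89ab_cdef_u32 as i32, 0x0123_4567);
    assert_eq!(v, 0x0123_4567_89ab_cdef);
}
```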

implementation notes

For eBPF runtimes, there is no need to implement the full eBPF instruction set; a subset is more than enough for kernel instrumentation. A great reference is the LLVM unit tests for the eBPF target: llvm test eBPF

eBPF function

eBPF functions are called with the specialized CALL instruction, using a predefined function index encoded in the immediate. Functions are normally provided as helpers by the eBPF runtime to augment the restricted functionality of eBPF programs. Some eBPF runtimes also allow eBPF programs to call each other following the same calling convention, with function indices assigned dynamically at load time.
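The index-based dispatch can be sketched as a match on the immediate; the indices and helper bodies below are hypothetical, not from any real runtime:

```rust
// Sketch: dispatch a CALL instruction by the function index encoded
// in its immediate. Both helpers here are hypothetical placeholders.
fn call_helper(idx: i32, r1: u64, r2: u64) -> u64 {
    match idx {
        1 => r1.wrapping_add(r2), // hypothetical helper #1
        2 => r1 ^ r2,             // hypothetical helper #2
        _ => panic!("unknown helper index {idx}"),
    }
}

fn main() {
    // the return value of a helper lands in r0 by the calling convention
    assert_eq!(call_helper(1, 2, 3), 5);
    assert_eq!(call_helper(2, 0b1100, 0b1010), 0b0110);
}
```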

map

As the eBPF machine only has a stack of limited size and no heap allocation, most eBPF runtimes provide hashmap-like data structures in the form of helper functions. Maps are normally created by a userland helper when eBPF programs are loaded into the kernel, and the parameters of a map, such as its data types or maximum size, can be specified in special ELF sections.
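A runtime-side map backing such helpers can be sketched as below. This is an illustration only: real runtimes key maps by fixed-size byte strings and enforce the parameters declared at load time; the names and u64 keys here are ours.

```rust
use std::collections::HashMap;

// Sketch of a map object behind hypothetical lookup/update helpers.
struct Map {
    max_entries: usize,
    data: HashMap<u64, u64>,
}

impl Map {
    fn update(&mut self, key: u64, value: u64) -> Result<(), ()> {
        // respect the maximum size declared when the map was created
        if self.data.len() >= self.max_entries && !self.data.contains_key(&key) {
            return Err(()); // map full
        }
        self.data.insert(key, value);
        Ok(())
    }

    fn lookup(&self, key: u64) -> Option<u64> {
        self.data.get(&key).copied()
    }
}

fn main() {
    let mut m = Map { max_entries: 1, data: HashMap::new() };
    assert!(m.update(1, 10).is_ok());
    assert_eq!(m.lookup(1), Some(10));
    assert!(m.update(2, 20).is_err()); // over max_entries
}
```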

eBPF with clang

build

Instead of manually crafting eBPF programs, we can use clang to compile a subset of C code into eBPF programs:

clang -target bpf -Werror -O2 -c ebpf.c -o ebpf.o

Notice that -O2 MUST NOT be omitted; otherwise clang would generate pseudo instructions outside the eBPF specification.

disassemble

llvm-objdump --triple=bpf -S ebpf.o

c subset

  • all functions must be inlined
  • no global variables
  • no unbounded loops

intro to Kprobes

Kprobes enables you to dynamically insert a breakpoint into nearly any kernel routine and specify a handler routine to be invoked when the breakpoint is hit. It does so by replacing the instruction at the trap site with INT3 on x86 (or an equivalent instruction on other platforms) and single-stepping the replaced instruction in the breakpoint handler after executing the user-specified handler routines.

A fork of rCore with a Kprobes implementation can be found at https://github.com/NickCao/rCore

single step

Single-stepping the replaced instruction is trickier than it seems. Only if the instruction produces no exceptions and no results that depend on pc or privilege level can we safely run it outside its original context; otherwise we must take all possible side effects into consideration and remediate them.

simplified implementation

Hereby we propose a simplified implementation of Kprobes that is easier to implement, while introducing some limitations. Instead of allowing a trap at any point in any traced function, we only trace functions that eventually return to their caller without recursion, and limit the breakpoint to the entry point of the function.

The way to insert the breakpoint remains the same, but when the breakpoint is hit, we replace the content of the ra register with the address of our own handler, restore the original instruction, run the user-defined handlers, then jump back to the entry point of the function. When the function returns (to our handler), we insert the breakpoint again and wait for it to be hit again.

While this implementation saves us the cost of single-stepping the trapped instruction, it imposes heavy restrictions on the functions that can be traced and generally won't work on SMP systems.

run handler on function return

When probing functions, both the entry point and the exit of a function are of interest, so we need to redirect the return address of the function to an address we control in order to run custom handlers on function return. Since a single function may be live multiple times simultaneously, through recursion or on SMP systems, we cannot set ra to a single static value. Instead we need a trampoline: a contiguous region of ebreak instructions, with ra set to a distinct unused address within that range, in order to distinguish between multiple invocations. A trampoline implemented as such may not be able to handle functions on hot paths due to its limited size, so measures to dynamically increase its size should be implemented; don't forget to flush the icache when doing so.

symbols in Rust programs

Rust follows an obscure symbol name mangling scheme, making it impossible to recover the original type names from the symbol table without further information from the compiler.

However, a new mangling scheme has been proposed, Rust Symbol Mangling (v0), which makes the mangled names more machine readable. Now that support for this scheme is implemented in both rustc and the upstream toolchains, we can use it for better visibility into Rust programs.

# .cargo/config.toml
[build]
rustflags = ["-Zsymbol-mangling-version=v0"]

Both gdb and objdump should support this scheme out of the box; for other tools, you can still use rustfilt to demangle the names manually.

probe async Rust functions

Async functions in Rust are not functions at all, but AST-level constructs that get translated into a state machine that drives itself to completion when polled. Thus we cannot probe async functions the way we probe normal functions. To solve the problem, we can either add no-op normal functions inside async functions and probe those instead, or probe the poll function of the generated state machine, which results in less visibility but still yields useful information about the async function, especially performance-related data.
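The first approach can be sketched as follows: a no-op marker function, kept out of line so its symbol survives as a probe target, called from inside the async body. The marker name is ours, and the hand-rolled block_on exists only to make the sketch self-contained.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// No-op marker: probe this symbol instead of the async fn itself.
#[inline(never)]
fn probe_marker_enter() {
    // deliberately empty; black_box keeps the call from being optimized out
    std::hint::black_box(());
}

async fn traced_work(x: u64) -> u64 {
    probe_marker_enter(); // a kprobe on probe_marker_enter fires here
    x * 2
}

// Minimal executor so the sketch runs without external crates.
fn block_on<F: Future>(fut: F) -> F::Output {
    fn noop_waker() -> Waker {
        fn clone(_: *const ()) -> RawWaker {
            RawWaker::new(std::ptr::null(), &VTABLE)
        }
        fn noop(_: *const ()) {}
        static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
        unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
    }
    let mut fut = pin!(fut);
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    assert_eq!(block_on(traced_work(21)), 42);
}
```

The second approach would instead place the probe on the generated state machine's poll method, whose v0-mangled symbol can be located with the tooling from the previous section.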