
Exploring eBPF, IO Visor and Beyond


I recently got acquainted with eBPF as the enabling technology for PLUMgrid’s distributed, programmable data-plane. While working on the product, perhaps due to my academic mindset, my interest was piqued by this technology, described here as the “In-kernel Universal Virtual Machine”.

This led me to explore the history of eBPF and its parent Linux Foundation project, IO Visor, resulting in this set of slides [link], which I have used to deliver talks at universities, labs, and conferences. I communicated the technology at a high level, along with the efforts of the IO Visor Project to make it more accessible (important since, as Brendan Gregg described it, raw eBPF programming is “brutal”). While other people have already explained eBPF and IO Visor earlier (BPF Internals I, BPF Internals II, and IO Visor Challenges Open vSwitch), I wanted to talk here about possible research directions and the wide scope for this exciting technology.

Before I go into these areas, however, a short primer on eBPF is essential for completeness.

A brief history of Packet Filters

So let’s start with eBPF’s ancestor — the Berkeley Packet Filter (BPF). Essentially built to enable line-rate monitoring of packets, BPF allows the description of a simple filter inside the kernel that lets through (to userspace) only those packets that meet its criteria. This is the technology used by tcpdump (and derivatives like Wireshark and other tools using libpcap). Let’s look at an example (taken from this Cloudflare blog):

$ sudo tcpdump -p -ni eth0 "ip and udp"

With this program, tcpdump will sniff through all traffic at the eth0 interface and return UDP packets only. Adding a -d flag to the above command actually does something interesting:

(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 5
(002) ldb      [23]
(003) jeq      #0x11            jt 4    jf 5
(004) ret      #65535
(005) ret      #0

The above, recognizable as an assembly program for an ISA, shows a basic implementation of the filter as bytecode: it assumes a received packet resides at memory location [0], then uses the offsets of the EtherType and Protocol fields to drop the packet — by indicating a 0 (zero) return value — if a UDP packet is not detected.

This code is bytecode for the BPF “pseudo-machine” architecture that is the underpinning of BPF. Thus what BPF provides is a simple machine architecture, sufficient to build packet-filtering logic. Most of BPF’s safety guarantees arise from the very limited set of instructions allowed by the pseudo-machine architecture, and a basic verifier that prevents loops is also exercised before the code is inserted into the kernel. The in-kernel BPF is implemented as an interpreter that executes the filter on every packet. For more details refer to this excellent blog post by Sukarma (link).

Extending BPF

Now this basic abstraction has been in the Linux kernel for nearly two decades, but recently got an upgrade with eBPF (e for extended) that has made BPF more than just a packet filter.

The main motivation behind eBPF was to extend the capabilities of the BPF pseudo-machine to become more powerful and expressive, all the while providing the stability guarantees that ensured its existence in the kernel in the first place. This is a tough balancing act: on one hand, making the BPF machine architecture more powerful means more machine instructions, more registers, a bigger stack, and 64-bit instructions. On the other hand — in line with the “with great power comes great responsibility” adage — the verification of the new bytecode also becomes significantly more challenging.

This challenge was taken up and delivered in the form of the eBPF patches to the Linux kernel “filter”, with a new bpf syscall added in kernel 3.18. While the details of these patches can be found in various places, including the presentation I have been giving, I will only briefly cover the exciting new features and what they enable.

  • Souped-up machine architecture: eBPF makes the instruction set 64-bit and significantly expands its supported instruction count. Think of this like upgrading to a new Intel Core architecture and the gain in efficiency and capability therein. Another important consideration was to make the new architecture similar to the x86-64 and ARM64 architectures, thus making it easy to write a JIT compiler for eBPF bytecode.
  • Support for maps: This new feature allows storage of values between eBPF code executions. Notice that previously BPF programs ran only in isolation, with no recall. The new map feature is crucially important as it allows the retention of state between executions of eBPF code (full recall), thus making it possible to implement a state machine based on events triggering an eBPF function.
  • Helper functions: Helper functions are akin to having a library that allows eBPF programs — restricted to the confines of the isolated/virtualized pseudo-machine of BPF — to access resources (like the above-mentioned maps) in an approved, kernel-safe way. This increases eBPF capabilities by offloading some functions, like requesting pseudo-random numbers or recalculating checksums, to kernel code outside the eBPF program (a small sketch combining maps and helpers follows this list).
  • Tail-calls: As we noted earlier, pre-eBPF programs would execute in isolation and had no (direct) control to trigger another filter/program. With the tail-call feature, an eBPF program can control the next eBPF program to execute. This ability provides a (sort-of) get-out-of-jail card from the per-program restriction of 4096 instructions; more importantly, it allows containerized kernel code to be stitched together — hence enabling micro-services inside the kernel.
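
To make the map and helper features concrete, here is a minimal sketch of a socket-attached eBPF program, written in the restricted C that LLVM's BPF backend accepts, that counts packets per IP protocol in an array map shared with userspace. The map declaration, SEC() annotations, and the load_byte()/bpf_map_lookup_elem() calls follow the conventions of the kernel's samples/bpf; the names themselves are illustrative.

#include <uapi/linux/bpf.h>
#include <uapi/linux/if_ether.h>
#include <uapi/linux/ip.h>
#include "bpf_helpers.h"        /* SEC(), bpf_map_def, load_byte(), helper stubs (samples/bpf) */

/* one counter per IP protocol number; userspace reads this via the bpf syscall */
struct bpf_map_def SEC("maps") proto_count = {
        .type        = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u64),
        .max_entries = 256,
};

SEC("socket")
int count_protocols(struct __sk_buff *skb)
{
        /* read the IP protocol byte out of the packet */
        __u32 proto = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        __u64 *count = bpf_map_lookup_elem(&proto_count, &proto);   /* helper call */

        if (count)
                __sync_fetch_and_add(count, 1);   /* state persists across packets */
        return 0;
}

char _license[] SEC("license") = "GPL";

The counters stay in the map after every invocation, which is exactly the “full recall” property described above, and userspace can read them at any time through the map interface.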

There are lots of details about the internals of eBPF program types, where they can be hooked, and the type of context they work with all over the interwebs (links at the end of this blog). We will skip these to focus next on the various things people are already doing, and on some “out there” ideas that can get people thinking in a different direction.

Before we go, note that most of the complexity of pushing code into the kernel and then using helper functions and tail-calls is being progressively reduced under the IO Visor Project’s GitHub repository. This includes the ability to build code and manage maps through a Python front-end (bcc) and a persistent file system in the form of bpf-fuse.

…. ask what eBPF can do for you?

So now that we understand the basics of the eBPF ecosystem, we can discuss a few opportunities arising from having a programmable engine inside the kernel. Some of the ideas I will be throwing out there are already possible or being used, while others require some modifications to the eBPF ecosystem (e.g., adding new types of maps and helper functions) — something a creative developer can upstream into the Linux kernel (not easy, I know!).

Security

While BPF was originally meant to help with packet filtering, the seccomp patch provided a mechanism to trap system calls and possibly block them, thereby limiting the set of calls accessible to an application.

With eBPF’s ability to share maps across instances, we can do dynamic taint analysis with minimal overhead. This will increase our ability to track and stop malware and security breaches.

Similarly, with the ability to keep state, it becomes trivial to implement stateful packet filtering. With eBPF, each application can also build its own filtering mechanism — web servers can install intelligent DDoS-rejection programs that block traffic in the kernel without disrupting the application threads, and that can be configured through maps. The DDoS signatures themselves can be generated by a more involved detection algorithm running in user space.
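
As a hedged illustration of that last idea, here is a sketch (again in samples/bpf-style restricted C) of a socket filter that drops packets whose IPv4 source address appears in a map; a userspace detection process would populate the map through the bpf syscall. The map name and layout are illustrative.

#include <uapi/linux/bpf.h>
#include <uapi/linux/if_ether.h>
#include <uapi/linux/ip.h>
#include "bpf_helpers.h"

/* blocklist of IPv4 source addresses, maintained from userspace */
struct bpf_map_def SEC("maps") blocklist = {
        .type        = BPF_MAP_TYPE_HASH,
        .key_size    = sizeof(__u32),   /* source address (host byte order) */
        .value_size  = sizeof(__u64),   /* count of packets dropped for it */
        .max_entries = 65536,
};

SEC("socket")
int drop_blocklisted(struct __sk_buff *skb)
{
        /* load_word() converts to host byte order, so userspace must insert keys accordingly */
        __u32 saddr = load_word(skb, ETH_HLEN + offsetof(struct iphdr, saddr));
        __u64 *dropped = bpf_map_lookup_elem(&blocklist, &saddr);

        if (dropped) {
                __sync_fetch_and_add(dropped, 1);
                return 0;               /* 0 bytes: the packet never reaches the socket */
        }
        return 0xffff;                  /* pass up to 64KB of the packet through */
}

char _license[] SEC("license") = "GPL";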

Tracing

The introduction of eBPF has ignited significant interest in the tracing community. While I could describe several uses, the following blogs by Brendan Gregg (hist, off_cpu, uprobes) provide greater detail and convey the potential quite well.

Briefly, the ability to monitor kernel and user events (through kprobes and uprobes), and then keep statistics in maps that can be polled from user space, provides the key differentiation from other tracing tools. These are useful as they enable the “Goldilocks effect” — the right amount of insight without significantly perturbing the system being monitored.
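
As one hedged example of the pattern, the sketch below pairs a kprobe and a kretprobe to record per-call latency of a kernel function into a coarse millisecond histogram held in a map, which a userspace tool could poll and print. The attach point (vfs_read) and the map names are illustrative; the conventions again follow the kernel's samples/bpf.

#include <uapi/linux/bpf.h>
#include <uapi/linux/ptrace.h>
#include "bpf_helpers.h"

/* entry timestamp per thread, so the return probe can compute a delta */
struct bpf_map_def SEC("maps") start_ns = {
        .type        = BPF_MAP_TYPE_HASH,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u64),
        .max_entries = 4096,
};

/* latency histogram in milliseconds, polled from userspace */
struct bpf_map_def SEC("maps") lat_ms = {
        .type        = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u64),
        .max_entries = 64,
};

SEC("kprobe/vfs_read")
int trace_entry(struct pt_regs *ctx)
{
        __u32 tid = bpf_get_current_pid_tgid();     /* lower 32 bits: thread id */
        __u64 ts  = bpf_ktime_get_ns();

        bpf_map_update_elem(&start_ns, &tid, &ts, BPF_ANY);
        return 0;
}

SEC("kretprobe/vfs_read")
int trace_return(struct pt_regs *ctx)
{
        __u32 tid  = bpf_get_current_pid_tgid();
        __u64 *tsp = bpf_map_lookup_elem(&start_ns, &tid);

        if (tsp) {
                __u32 slot = (bpf_ktime_get_ns() - *tsp) / 1000000;   /* ns to ms */
                __u64 *cnt;

                if (slot > 63)
                        slot = 63;                  /* clamp into the last bucket */
                cnt = bpf_map_lookup_elem(&lat_ms, &slot);
                if (cnt)
                        __sync_fetch_and_add(cnt, 1);
                bpf_map_delete_elem(&start_ns, &tid);
        }
        return 0;
}

char _license[] SEC("license") = "GPL";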

Networking

Since eBPF programs can hook into different places of the networking stack, we can now program networking logic that can look at the entire payload, read any protocol header type, and keep state to implement any protocol machine. These capabilities allow for building a programmable data-plane inside commodity machines.

Another key feature of eBPF, the ability to call other programs, allows us to connect eBPF programs and pass packets between them. This in essence allows us to build an orchestration system that connects eBPF programs to implement a chain of network functions. Thus third-party vendors can build best-of-breed network elements, which can then be stitched together using the tail-call capability in eBPF. This independence guarantees the greatest possible flexibility for users planning to build a powerful and cost-effective data-plane. The fact that PLUMgrid builds its entire ONS product line on top of this core functionality is a testament to its potential in this area as well.
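
To make the chaining idea concrete, here is a hedged sketch of the entry stage of such a chain: a program-array map acts as a jump table whose slots are filled (and re-filled at runtime) by an orchestrator in userspace, and bpf_tail_call() hands the packet to whichever module occupies the chosen slot. The slot names and numbering are purely illustrative.

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

/* jump table of eBPF programs; an orchestrator installs program fds into the slots */
struct bpf_map_def SEC("maps") next_func = {
        .type        = BPF_MAP_TYPE_PROG_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u32),
        .max_entries = 8,
};

#define SLOT_FIREWALL 0
#define SLOT_NAT      1

SEC("socket")
int chain_entry(struct __sk_buff *skb)
{
        /* hand the packet to the next IO module in the chain */
        bpf_tail_call(skb, &next_func, SLOT_FIREWALL);

        /* reached only if the slot is empty: fall back to passing the packet */
        return 0xffff;
}

char _license[] SEC("license") = "GPL";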

IoT devices and development

How can a tech blog be complete without throwing in the IoT buzzword? Well, it seems to me there is an interesting — albeit futuristic — application of eBPF in the IoT space. First, let me postulate that widespread IoT acceptance is hampered by the current approach of using specialized, energy-efficient operating systems like TinyOS, Contiki, and RIOT. If instead we had the ability to use tools familiar to typical developers (within a standard Linux environment), the integration and development of solutions would be accelerated.

With the above premise, it is interesting to think of building an event-based, microkernel-like OS inside the monolithic Linux kernel. This can happen if it becomes feasible (and safe) to trap (even a subset of) I/O interrupts and invoke an energy-aware scheduler to appropriately set the state of both the radio and the processor on these devices. The event-driven approach to building an IoT application is perfectly in line with current best practices, inasmuch as the IoT-specific OSes above use this approach to optimize performance. At the same time, before deployment, and for debugging or even upgrading the IoT application, normal Linux tools will be available for developers and users alike.

Mobile Apps

Android will soon have eBPF functionality — when it does, the possibility of pushing functionality to monitor per-application usage inside the kernel can make for some very interesting monitoring apps. We can implement several of the applications above, but with the additional benefit of a lower impact on battery life.

Conclusion

While I have tried to convey a summary of eBPF’s capabilities and its possible use cases, community efforts driven by the IO Visor Project continue to expand the horizon. IO Visor argues for an abstract model of an IO Module along with a mechanism to connect such modules to each other and to other system components. These modules, described as eBPF programs, can as one instantiation be run within the Linux kernel, but can also be extrapolated to other implementations using offloads. Having the same interface — an eBPF program and its capabilities — will allow users to design and define IO interactions in an implementation-independent way, with the actual implementation optimized for a particular use case, e.g., NFV and data-plane acceleration.

If you are interested in IO Visor, join the IO Visor developer mailing list, follow @iovisor on Twitter, and find out more about the project. See you there.

Useful Links

https://github.com/iovisor/bpf-docs
http://lwn.net/Articles/603984/
http://lwn.net/Articles/603983/
https://lwn.net/Articles/625224/
https://www.kernel.org/doc/Documentation/networking/filter.txt
http://man7.org/linux/man-pages/man2/bpf.2.html
https://videos.cdn.redhat.com/summit2015/presentations/13737_an-overview-of-linux-networking-subsystem-extended-bpf.pdf
https://github.com/torvalds/linux/tree/master/samples/bpf
http://lxr.free-electrons.com/source/net/sched/cls_bpf.c

 

About the author of this post
Affan Ahmed Syed
Director Engineering at PLUMgrid Inc.
LinkedIn | Twitter: @aintiha

Linux eBPF Stack Trace Hack


Stack trace support by Linux eBPF will make many new and awesome things possible; however, it didn’t make it into the just-released Linux 4.4, which added other eBPF features. Envisaging some time on older kernels that have eBPF but not stack tracing, I’ve developed a hacky workaround for doing awesome things now.

I’ll show my new bcc tools (eBPF front-end) that do this, then explain how it works.

stackcount: Frequency Counting Kernel Stack Traces

The stackcount tool frequency counts kernel stacks for a given function. This is performed in kernel for efficiency using an eBPF map. Only unique stacks and their counts are copied to user-level for printing.

For example, frequency counting kernel stack traces that led to submit_bio():

[Screenshot: stackcount output — kernel stack traces leading to submit_bio(), with counts]

The order of printed stack traces is from least to most frequent. The most frequent in this example, printed last, was taken 79 times during tracing.

The last stack trace shows syscall handling, ext4_rename(), and filemap_flush(): it looks like an application-issued file rename has caused back-end disk I/O due to ext4 block allocation and a filemap_flush().

This tool should be very useful for exploring and studying kernel behavior, quickly answering how a given function is being called.

stacksnoop: Printing Kernel Stack Traces

The stacksnoop tool prints kernel stack traces for each event. For example, for ext4_sync_fs():

[Screenshot: stacksnoop output — kernel stack traces for ext4_sync_fs() events]

Since the output is verbose, this isn’t suitable for high-frequency calls (e.g., over 1,000 per second). You can use funccount from bcc tools to measure the rate of a function call, and if it is high, try stackcount instead.

How It Works: Crazy Stuff

eBPF is an in-kernel virtual machine that can do all sorts of things, including “crazy stuff”. So I wrote a user-defined stack walker in eBPF, which the kernel can run. Here is the relevant code from stackcount (you are not expected to understand this):

[Screenshot: the eBPF stack-walker code from stackcount]

Once eBPF supports this properly, much of the above code will become a single function call.

If you are curious: I’ve used an unrolled loop to walk each frame (eBPF doesn’t do backwards jumps), with a maximum of ten frames in this case. It walks the RBP register (base pointer) and saves the return instruction pointer for each frame into an array. I’ve had to use explicit bpf_probe_read()s to dereference pointers (bcc can automatically do this in some cases). I’ve also left the unrolled loop in the code (Python could have generated it) to keep it simple, and to help illustrate overhead.
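
For readers who want to see the shape of the hack, here is a rough sketch of such an unrolled walker in bcc's restricted C (not the verbatim tool code; the map and key names are illustrative):

#include <uapi/linux/ptrace.h>

struct stack_key_t {
    u64 ret[10];                  /* return addresses, innermost frame first */
};
BPF_HASH(counts, struct stack_key_t, u64);

/* x86_64 only: follow the saved RBP chain one frame at a time */
static u64 get_frame(u64 *bp) {
    u64 ret = 0;
    if (*bp) {
        /* the return address sits one word above the saved frame pointer */
        bpf_probe_read(&ret, sizeof(ret), (void *)(*bp + 8));
        /* step to the previous frame by dereferencing the saved RBP */
        bpf_probe_read(bp, sizeof(*bp), (void *)*bp);
    }
    return ret;
}

int trace_count(struct pt_regs *ctx) {
    struct stack_key_t key = {};
    u64 zero = 0, *val, bp = ctx->bp;

    /* unrolled loop: eBPF does not allow backwards jumps */
    key.ret[0] = get_frame(&bp);
    key.ret[1] = get_frame(&bp);
    key.ret[2] = get_frame(&bp);
    key.ret[3] = get_frame(&bp);
    key.ret[4] = get_frame(&bp);
    key.ret[5] = get_frame(&bp);
    key.ret[6] = get_frame(&bp);
    key.ret[7] = get_frame(&bp);
    key.ret[8] = get_frame(&bp);
    key.ret[9] = get_frame(&bp);

    /* frequency count unique stacks entirely in kernel context */
    val = counts.lookup_or_init(&key, &zero);
    if (val)
        (*val)++;
    return 0;
}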

This hack (so far) only works for x86_64, kernel-mode, and to a limited stack depth. If I (or you) really need more, keep hacking, although bear in mind that this is just a workaround until proper stack walking exists.

Other Solutions

stackcount implements an important new capability for the core Linux kernel: frequency counting stack traces. Just printing stack traces, like stacksnoop does, has been possible for a long time: ftrace can do this, which I use in my kprobe tool from perf-tools. perf_events can also dump stack traces and has a reporting mode that will print unique paths and percentages (although that is performed less efficiently in user mode).

SystemTap has long had the capability to frequency count kernel- and user-mode stack traces, also in kernel for efficiency, although it is an add-on and not part of the mainline kernel.

Future Readers

If you’re on Linux 4.5 or later, then eBPF may officially support stack walking. To check, look for something like a BPF_FUNC_get_stack in bpf_func_id. Or check the latest source code to tools like stackcount – the tool should still exist, but the above stack walker hack may be replaced with a simple call.

Thanks to Brenden Blanco (PLUMgrid) for help with this hack. If you’re at SCaLE14x you can catch his IO Visor eBPF talk on Saturday, and my Broken Linux Performance Tools talk on Sunday!

**Used with permission from Brendan Gregg. (original post)**

About the author of this post
Brendan Gregg
Brendan Gregg is a senior performance architect at Netflix, where he does large scale computer performance design, analysis, and tuning. He is the author of Systems Performance published by Prentice Hall, and received the USENIX LISA Award for Outstanding Achievement in System Administration. He has previously worked as a performance and kernel engineer, and has created performance analysis tools included in multiple operating systems, as well as visualizations and methodologies.

Come and learn more about IO Visor at P4 Workshop


After a successful and exciting OPNFV Summit last week, you can learn more about IO Visor at the 2nd P4 Workshop by Stanford/ONRC on Wednesday, November 18th. At 3:00pm I’ll be discussing how IO Visor and the P4 language are ideally suited to each other in the session, P4 and IO Visor for Building Data Center Infrastructure Components. During this session you’ll see in action a programmable data plane and development tools that simplify the creation and sharing of dynamic “IO modules” for building your data center block by block. These IO modules can be used to create virtual network infrastructure, monitoring tools and security frameworks within your data center. You’ll also learn about eBPF, which enables infrastructure developers to create any in-kernel IO module and load/unload it at runtime, without recompiling or rebooting.

I’m looking forward to meeting you all at the P4 Workshop this Wednesday.



BPF INTERNALS – II



Continuing from where I left off before, in this post we will see some of the major changes in BPF that have happened recently – how it is evolving to be a very stable and accepted in-kernel VM and can probably be the next big thing – not just in filtering but beyond. From what I observe, the most attractive feature of BPF is its ability to give developers access to execute dynamically compiled code within the kernel – in a limited context, but still securely. This in itself is a valuable asset.

As we have seen already, the use of BPF is not limited to filtering network packets but extends to seccomp, tracing, etc. The eventual step for BPF in such a scenario was to evolve and come out of its use in the network filtering world. To improve the architecture and bytecode, lots of additions have been proposed. I started a bit late, when I saw Alexei’s patches for kernel version 3.17-rcX. Perhaps this was the relevant mail by Alexei that got me interested in the upcoming changes. So, here is a summary of the major changes that have occurred. We will be seeing each of them in sufficient detail.

Architecture

The classic BPF we discussed in the last post had two 32-bit registers – A and X. All arithmetic operations were supported and performed using these two registers. The newer BPF, called extended BPF or eBPF, has ten 64-bit registers and supports arbitrary load/stores. It also contains new instructions like BPF_CALL which can be used to call new kernel-side helper functions. We will look into this in detail a bit later as well. The new eBPF follows calling conventions which are more like modern machines (x86_64). Here is the mapping of the new eBPF registers to x86 registers:

[Figure: mapping of eBPF registers to x86-64 registers]

The closeness to the machine ABI also ensures that unnecessary register spilling/copying can be avoided. The R0 register stores the return value from the eBPF program, and the eBPF program context can be loaded through register R1. Earlier, there used to be just two jump targets, i.e., either jump to the TRUE or the FALSE target. Now, there can be arbitrary jump targets – true or fall-through. Another aspect of the eBPF instruction set is the ease of use with the in-kernel JIT compiler. eBPF registers and most instructions are now mapped one-to-one. This makes emitting these eBPF instructions from any external compiler (in userspace) not such a daunting task. Of course, prior to any execution, the generated bytecode is passed through a verifier in the kernel to check its sanity. The verifier in itself is a very interesting and important piece of code and probably a story for another day.

Building BPF Programs

From a user’s perspective, the new eBPF bytecode can now be another headache to generate. But fear not, an LLVM-based backend now supports generating instructions for the BPF pseudo-machine type directly. It is being ‘graduated’ from just being an experimental backend and can hit the shelf any time soon. In the meantime, you can always use this script to set up the BPF-supported LLVM yourself. But then, what next? A BPF program (not necessarily just a filter anymore) can be done in two parts – a kernel part (the BPF bytecode which will get loaded in the kernel) and the userspace part (which may, if needed, gather data from the kernel part). Currently you can specify an eBPF program in a restricted C-like language. For example, here is a program in the restricted C which returns true if the first argument of the input program context is 42. Nothing fancy:

[Code: a restricted-C eBPF program that returns true when the first context argument is 42]
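
Since the original listing was an image, here is a hedged reconstruction of what such a program looks like; the struct bpf_context layout with an arg1 field is illustrative of Alexei's early examples rather than a fixed kernel API.

struct bpf_context {
        unsigned long arg1;     /* illustrative context layout */
        unsigned long arg2;
};

int bpf_prog(struct bpf_context *ctx)
{
        /* "return true if the first argument of the context is 42" */
        if (ctx->arg1 == 42)
                return 1;
        return 0;
}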

This C-like syntax generates a BPF binary which can then be loaded in the kernel. Here is what it looks like in the BPF ‘assembly’ representation as generated by the LLVM backend (supplied with 3.4):

[Code: the BPF assembly generated by the LLVM backend for the program above]

If you are adventurous enough, you can also probably write complete and valid BPF programs in assembly in a single go – right from your userspace program. I do not know if this is of any use these days. I have done this some time back for a moderately elaborate trace filtering program though. It is also not that effective, because I think at this point in human history, LLVM can generate assembly better and more efficiently than a human.

What we discussed just now is probably not a relevant program anymore. An example by Alexei here is what is more relevant these days. With the integration of kprobes with BPF, a BPF program can be run at any valid dynamically instrumentable function in the kernel. So now we can just use pt_regs as the context and get individual register values each time the probe is hit. As of now, some helper functions are available in BPF as well, which can get the current timestamp. You can have a very cheap tracing tool right there 🙂

BPF Maps

I think one of the most interesting features in this new eBPF is the BPF maps. It looks like an abstract data type – initially a hash-table, but from kernel 3.19 onwards, support for array-maps seems to have been added as well. These bpf_maps can be used to store data generated from an eBPF program being executed. You can see the implementation details in arraymap.c or hashtab.c. Let’s pause for a while and see some more magic added in eBPF – especially the BPF syscall, which forms the primary interface for the user to interact with and use eBPF. The reason we want to know more about this syscall is to know how to work with these cool BPF maps.

BPF Syscall

Another nice thing about eBPF is a new syscall being added to make life easier while dealing with BPF programs. In an article last year on LWN, Jonathan Corbet discussed the use of the BPF syscall. For example, to load a BPF program you could call

[Code: invoking the bpf() syscall with BPF_PROG_LOAD]

with, of course, the corresponding bpf_attr structure being filled in beforehand:

[Code: filling the bpf_attr structure for BPF_PROG_LOAD]
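
As a hedged sketch of both steps together (field names follow the bpf(2) man page; error handling is omitted and the program type is just an example):

#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

static __u64 ptr_to_u64(const void *ptr)
{
        return (__u64)(unsigned long)ptr;
}

int load_prog(const struct bpf_insn *insns, int insn_cnt, char *log, int log_sz)
{
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));         /* unused fields must be zero */
        attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
        attr.insns     = ptr_to_u64(insns);
        attr.insn_cnt  = insn_cnt;
        attr.license   = ptr_to_u64("GPL");
        attr.log_buf   = ptr_to_u64(log);       /* verifier messages land here */
        attr.log_size  = log_sz;
        attr.log_level = 1;

        /* returns a program fd on success, -1 on failure */
        return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}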

Yes, this may seem cumbersome to some, so for now there are some wrapper functions in bpf_load.c and libbpf.c released to help folks out, where you need not give too many details about your compiled BPF program. Much of what happens in the BPF syscall is determined by the arguments supported here. To elaborate more, let’s see how to load the BPF program we wrote before. Assuming that we have the sample program in its BPF bytecode form generated and now want to load it up, we take the help of the wrapper function load_bpf_file() which parses the BPF ELF file and extracts the BPF bytecode from the relevant section. It also iterates over all ELF sections to get licence info, map info etc. Eventually, as per the type of BPF program – kprobe/kretprobe or socket program – and the info and bytecode just gathered from the ELF parsing, the bpf_attr attribute structure is filled and the actual syscall is made.

Creating and accessing BPF maps

Coming back to the maps: apart from this simple syscall to load the BPF program, there are many more actions that can be taken based on just the arguments. Have a look at bpf/syscall.c. From the userspace side, the new BPF syscall comes to the rescue and allows most of these operations on bpf_maps to be performed! From the kernel side, however, with some special helper functions and the use of the BPF_CALL instruction, the values in these maps can be updated/deleted/accessed etc. These helpers in turn call the actual function according to the type of map – hash-map or array. For example, here is a BPF program that just creates an array-map and does nothing else,

[Code: an eBPF program that declares an array-map and does nothing else]
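
A hedged sketch of what such a declaration looks like in the samples/bpf style (the loader issues BPF_MAP_CREATE when it parses the "maps" ELF section; names are illustrative):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"        /* bpf_map_def and SEC(), from samples/bpf */

struct bpf_map_def SEC("maps") my_array = {
        .type        = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(__u32),   /* array index */
        .value_size  = sizeof(__u64),
        .max_entries = 16,
};

/* a program section must still exist for the object to be loadable;
   this one touches nothing */
SEC("socket")
int do_nothing(struct __sk_buff *skb)
{
        return 0;
}

char _license[] SEC("license") = "GPL";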

When loaded in the kernel, the array-map is created. From userspace we can then initialize the map with some values with a function that looks like this,

[Code: userspace function initializing the array-map element by element]

where the bpf_update_elem() wrapper in turn calls the BPF syscall with the proper arguments and attributes, as

[Code: the bpf_update_elem() wrapper invoking the bpf() syscall with BPF_MAP_UPDATE_ELEM]
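
Sketched out (again following the bpf(2) man page rather than the exact libbpf.c source), that wrapper boils down to:

#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

static __u64 ptr_to_u64(const void *ptr)
{
        return (__u64)(unsigned long)ptr;
}

static int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags)
{
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.map_fd = fd;
        attr.key    = ptr_to_u64(key);
        attr.value  = ptr_to_u64(value);
        attr.flags  = flags;            /* BPF_ANY, BPF_NOEXIST or BPF_EXIST */

        return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}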

This in turn calls map_update_elem(), which securely copies the key and value using copy_from_user() and then calls the specialized function for updating the value of the array-map at the specified index. Similar things happen for reading/deleting/creating hash or array maps from userspace.

So things will probably start falling into place now, looking back at the earlier post by Brendan Gregg where he was updating a map from the BPF program (using the BPF_CALL instruction which calls the internal kernel helpers) and then concurrently accessing it from userspace to generate a beautiful histogram (through the syscall I just mentioned above). BPF maps are indeed a very powerful addition to the system. You can also check out more detailed and complete examples now that you know what is going on. To summarize, this is how an example BPF program, written in restricted C for the kernel part and normal C for the userspace part, would run these days:

[Figure: end-to-end flow of a BPF program with a restricted-C kernel part and a C userspace part]

In the next BPF post, I will discuss the eBPF verifier in detail. This is the most crucial part of BPF and deserves detailed attention, I think. There is also something cool happening these days on the PLUMgrid side – the BPF Compiler Collection. There was a very interesting demo using such tools and the power of eBPF at the recent Red Hat Summit. I got BCC working and tried out some examples with probes – where I could easily compile and load BPF programs from my Python scripts! How cool is that? Also, I have been digging through LTTng’s interpreter lately, so probably another post detailing how the BPF and LTTng interpreters work would be nice. That’s all for now. Run BPF.



BPF INTERNALS – I


A recent post by Brendan Gregg inspired me to write my own blog post about my findings on how the Berkeley Packet Filter (BPF) evolved, its interesting history and the immense powers it holds – the way Brendan calls it, ‘brutal’. I came across this while studying interpreters and small process virtual machines like the proposed KTap VM. I was looking at some known papers on register- vs stack-based VMs, their performance and the various code dispatch mechanisms used in these small VMs. The review of the state of the art soon moved to native code compilation, and a discussion on LWN caught my eye. The benefits of JIT were too good to be overlooked, and BPF’s application in things like filtering, tracing and seccomp (used in Chrome as well) made me interested. I knew that the kernel devs were on to something here. This is when I started digging through the BPF background.

Background

Network packet analysis requires an interesting bunch of tech, right from the time a packet reaches the embedded controller on the network hardware in your PC (hardware/data link layer) to the point it does something useful in your system, such as displaying something in your browser (application layer). For the connected systems evolving these days, the amount of data transfer is huge, and the support infrastructure for network analysis needed a way to filter things out pretty fast. The initial concept of packet filtering developed keeping such needs in mind, and many strategies were discussed with every filter, such as the CMU/Stanford Packet Filter (CSPF), Sun’s NIT filter and so on. For example, some earlier filtering approaches used a tree-based model (in CSPF) to represent filters and filter packets by predicate-tree walking. This earlier approach was also inherited in the Linux kernel’s old filter in the net subsystem.

Consider an engineer’s need for a simple (and probably unrealistic) filter on network packets with the predicates P1, P2, P3 and P4:

[Figure: a filter combining the predicates P1–P4]

A filtering approach like that of CSPF would have represented this filter in an expression tree structure as follows:

[Figure: expression-tree representation of the filter]

It is then trivial to walk the tree, evaluating each expression and performing operations on each of them. But this means there can be extra costs associated with evaluating predicates which may not necessarily have to be evaluated. For example, what if the packet is neither an ARP packet nor an IP packet? Having the knowledge that the P1 and P2 predicates are untrue, we need not evaluate the other two predicates and perform two more boolean operations on them to determine the outcome.

In 1992-93, McCanne et al. proposed the BSD Packet Filter with a new CFG-bytecode-based filter design. This was an in-kernel approach where a tiny interpreter would evaluate expressions represented as BPF bytecodes. Instead of simple expression trees, they proposed a CFG-based filter design. One control-flow-graph representation of the same filter above can be:

[Figure: control-flow-graph representation of the filter]

The evaluation can start from P1; the right edge is for FALSE and the left is for TRUE, with each predicate being evaluated in this fashion until the evaluation reaches the final result of TRUE or FALSE. The CFG has an inherent property of ‘remembering’: if P1 and P2 are false, the fact that the path reaches a final FALSE is remembered, and P3 and P4 need not be evaluated. This was then easy to represent in bytecode form, where a minimal BPF VM can be designed to evaluate these predicates with jumps to TRUE or FALSE targets.

The BPF Machine

A pseudo-instruction representation of the same filter described above for earlier versions of BPF in Linux kernel can be shown as,

[Code: BPF pseudo-instructions implementing the filter]

To know how to read these BPF instructions, look at the filter documentation in the kernel source and see what each line does. Each of these instructions is actually just bytecode which the BPF machine interprets. Like all real machines, this requires a definition of how the VM internals look. In the Linux kernel’s version of the BPF-based in-kernel filtering technique, there were initially just 2 important registers, A and X, with another 16-register ‘scratch space’ M[0-15]. The instruction format and some sample instructions for this earlier version of BPF are shown below:

[Figure: classic BPF instruction format and sample instructions]
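
For reference, the classic instruction format is the four-field struct sock_filter described in the kernel's filter documentation; a filter is simply an array of these:

struct sock_filter {            /* one classic BPF "instruction" */
        __u16   code;           /* opcode */
        __u8    jt;             /* jump offset if true */
        __u8    jf;             /* jump offset if false */
        __u32   k;              /* generic multiuse field (immediate, offset, ...) */
};

/* e.g. { 0x28, 0, 0, 0x0000000c } encodes "ldh [12]", loading the half-word
 * at packet offset 12 (the EtherType field) into the A register */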

There were some radical changes done to the BPF infrastructure recently – extensions to its instruction set and registers, the addition of things like BPF maps, etc. We shall discuss those changes in detail, probably in the next post in this series. For now we’ll just see the good ol’ way of how BPF worked.

Interpreter

Each of the instructions seen above is represented as an array of these 4 values, and each program is an array of such instructions. The BPF interpreter sees each opcode and performs the operations on the registers or data accordingly, after the program goes through a verifier for a sanity check to make sure the filter code is secure and would not cause harm. The program, which consists of these instructions, then passes through a dispatch routine. As an example, here is a small snippet from the BPF instruction dispatch for the instruction ‘add’, before it was restructured in Linux kernel v3.15 onwards,

[Code: the interpreter dispatch for the ‘add’ instruction, from net/core/filter.c in v3.14]
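
A simplified sketch of that dispatch loop is shown below; the real v3.14 sk_run_filter() remaps opcodes to internal values and handles dozens of cases (its load instructions read packet bytes out of the sk_buff), so this only keeps the shape.

#include <linux/filter.h>       /* struct sock_filter and the BPF_* opcode macros */

static unsigned int run_classic_bpf(const struct sock_filter *fentry)
{
        unsigned int A = 0;     /* accumulator */
        unsigned int X = 0;     /* index register */

        /* dispatch loop: walk the filter one instruction at a time */
        for (;; fentry++) {
                switch (fentry->code) {
                case BPF_ALU | BPF_ADD | BPF_X:
                        A += X;                 /* the 'add' discussed here */
                        continue;
                case BPF_ALU | BPF_ADD | BPF_K:
                        A += fentry->k;
                        continue;
                case BPF_RET | BPF_K:
                        return fentry->k;       /* filter verdict */
                /* ... loads, stores, jumps and the other ALU opcodes ... */
                default:
                        return 0;
                }
        }
}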

The original snippet is from net/core/filter.c in Linux kernel v3.14. Here, fentry is the sock_filter structure being executed and the filter is applied to the sk_buff data element. The dispatch loop runs until all the instructions are exhausted. The dispatch is basically one huge switch-case, with each opcode being tested and the necessary action being taken. For example, an ‘add’ operation on registers would add A + X and store it in A. Yes, this is simple, isn’t it? Let us take it a level above.

JIT Compilation

This is nothing new. JIT compilation of bytecode has been around for a long time. I think it is one of those eventual steps taken once an interpreted language decides to optimize bytecode execution speed. Interpreter dispatches can be a bit costly once the size of the filter/code and the execution time increase. With high-frequency packet filtering, we need to save as much time as possible, and a good way is to convert the bytecode to native machine code by just-in-time compiling it and then executing the native code from the code cache. For BPF, JIT was discussed first in the BPF+ research paper by Begel et al. in 1999. Along with other optimizations (redundant predicate elimination, peephole optimizations, etc.), a JIT assembler for BPF bytecodes was also discussed. They showed improvements from 3.5x to 9x in certain cases. I quickly started checking whether the Linux kernel had done something similar. And behold, here is how the JIT looks for the ‘add’ instruction we discussed before (Linux kernel v3.14),

[Code: the x86 JIT handling of the ‘add’ instruction, from arch/x86/net/bpf_jit_comp.c in v3.14]

As seen in arch/x86/net/bpf_jit_comp.c for v3.14, instead of performing operations during the code dispatch directly, the JIT compiler emits the native code to a memory area and keeps it ready for execution. The JITed filter image is built like a function call, so we add some prologue and epilogue to it as well,

[Code: emitting the JIT prologue and epilogue]

There are rules to BPF (such as no loops, etc.) which the verifier checks before the image is built, as we are now in the dangerous waters of executing external machine code inside the Linux kernel. In those days, all this would have been done by bpf_jit_compile, which upon completion would point the filter function to the filter image,

[Code: bpf_jit_compile() pointing the filter function at the JITed image]

Smooooooth… Upon execution of the filter function, instead of interpreting, the filter will now start executing the native code. Even though things have changed a bit recently, this has indeed been a fun way to learn how interpreters and JIT compilers work in general and the kinds of optimizations that can be done. In the next part of this post series, I will look into what changes have been done recently, the restructuring and extension efforts to BPF and its evolution to eBPF along with BPF maps, and the very recent and ongoing efforts around hist-triggers. I will discuss my experimental userspace eBPF library and its use for LTTng’s UST event filtering, and its comparison to LTTng’s bytecode interpreter. Brendan’s blog post is highly recommended, and so are the links to ‘More Reading’ in that post.

Thanks to Alexei Starovoitov, Eric Dumazet and all the other kernel contributors to BPF that I may have missed. They are doing awesome work and are the direct source for my learnings as well. Looking at the versatility of eBPF, its adoption in newer tools like shark, and Brendan’s views and first experiments, it seems this may indeed be the next big thing in tracing.



Hello World!


First off: welcome to the IO Visor Project! We are excited for the birth of this community and thrilled about the future that lies ahead of us all.

It has been over 4 years since the conception of what has now become the IO Visor Project, and it feels like a pretty adventurous journey. We’d like to take you down Memory Lane and share  how a group of end users and vendors got here and what the IO Visor Project is all about.

So … how did the IO Visor Project get started?

Several PLUMgrid engineers had a vision: a dream of creating a new type of programmable data plane. This new type of extensible architecture would for the first time enable developers to dynamically build IO modules (think stand alone “programs” that can manipulate a packet in the kernel and perform all sort of functions on it), load and unload those in-kernel at run time and do it without any disruption to the system.

We wanted to transform how functions like networking or security or tracing are designed, implemented and delivered and more importantly we wanted to build a technology that would future-proof large-scale deployments with easy-to-extend functionalities.

Yes, it was an ambitious target, but this is why we contributed initial IP and code to kickstart the IO Visor Project. Now, a diverse and engaged open source community is taking that initial work and running with it: a technology, compilers, a set of developer tools and real-world use case examples that can be used to create the next set of IO Modules that your applications and users demand.

What is so unique about it?

The developers that work on eBPF (extended Berkeley Packet Filter, the core technology behind the IO Visor Project) refer to it as a universal in-kernel virtual machine with run-time extensibility. IO Visor provides infrastructure developers the ability to create applications, publish them, and deploy them in live systems without having to recompile or reboot a full datacenter. The IO modules are platform independent, meaning that they can run on any hardware that uses Linux.

Running IO and networking functions in-kernel delivers the performance of hardware without layers of software and middleware. With functions running in-kernel of each compute node in a data center, IO Visor enables distributed, scale-out performance, eliminating hairpinning, tromboning and bottlenecks that are prevalent in so many implementations today.

Data center operators no longer need to compromise on flexibility and performance.

And finally … why should you care?

  1. This is the first time in the history of the Linux kernel that a developer can envision a new functionality and simply make it happen.

  2. Use cases are constantly changing, and we need an infrastructure that can evolve with them.

  3. Software development cycles should not be longer than hardware cycles.

  4. Single-node implementations won’t cut it in the land of cattle.

Where next?

Browse through iovisor.org where you will find plenty of resources and information on the project and its components. Although the IO Visor Project was just formed, there is a lively community of developers who have been working together for several years. The community leverages GitHub for developer resources at https://github.com/iovisor

The IO Visor Project is open to all developers and there is no fee to join or participate so we hope to see many of you become a part of it!

Welcome again to the IO Visor Project!



What are the implications of the IO Visor project and why it matters


The IO Visor Project is an IO engine that resides between the Linux OS and hardware, along with a set of development tools. It is an in-kernel virtual machine for IO instructions, somewhat like the Java virtual machine. You see apps and a runtime engine atop a host and hardware layer. Being software defined, it has the flexibility needed for modern IO infrastructure and can become a foundation for a new generation of Linux virtualization and networking.

Extended Berkeley Packet Filter (eBPF), the technology that underpins IO Visor, is not new but being a project hosted by the Linux Foundation will enable proliferation. It’s general purpose enough to build storage systems, distributed virtual networks or security sandboxes, but let us examine networking uses.

Don’t we have IO virtualization such as SR-IOV (Single Root I/O Virtualization)? Don’t dataplane libraries such as DPDK (Data Plane Development Kit) and projects such as P4 provide flexible packet processing too? They may seem to overlap, but they are actually complementary. IO Visor combines kernel-space performance with extensibility via plug-ins to low-level functions (e.g., DPDK or directly to hardware), so you can run IO Visor modules implemented atop DPDK.

With the support of Broadcom, Cavium, Cisco, Huawei and Intel we may see plug-ins to support a variety of hardware devices. Networking endpoints have increasingly moved into virtual switches, so it makes sense to provide IO extensibility within the kernel and not rely solely on physical switches. But physical switches are also important, and with hardware vendor support for this project, we may see IO Visor apps that span software and hardware devices.

Linux portability gives this project a potentially large footprint. Since Linux is the basis for many network switch OSs – including those from Arista, Cisco Systems, Dell Networking, Cumulus Networks, Extreme Networks, and Open Networking Linux (the basis for Big Switch Networks’ Switch Light) – in the long term, many vendors may choose to examine IO Visor.

Since IO Visor is platform independent, it can be hosted on different CPUs or hardware network processing units. SUSE and Ubuntu, as founding members, may jumpstart support for the commercial Linux community across a variety of platforms and devices.

Here are some practical business use cases.

  • Security. Performance requirements traditionally require I/O to run in the kernel, but updates were hard to make, creating a tradeoff between speed and security functionality. IO Visor reduces this limitation, so I foresee the development of high-performance IO security functions that can be updated with new capabilities, just like anti-virus programs updating with signatures. Security use cases have used BPF for years. The popular OpenSSH utilities use it to sandbox privileges, and Google’s Chrome browser on Linux and Chrome OS uses it to sandbox Adobe Flash. Having it in upstream Linux should enable it to find more uses.
  • Cloud building blocks. Converged systems integrate storage, compute and virtualization, and will benefit from a universal IO layer. Systems like VMware vSphere distributed switches provide networking devices that span multiple hosts, but don’t offer platform-independent extensibility. IO Visor enables creation of distributed virtual networks. PLUMgrid, which contributed the initial IO Visor code, based their Open Networking Suite on this technology, so it’s known to work commercially.
  • Carrier networking. Carriers support NFV in the pursuit of reducing opex and capex and increasing agility, but performance demands have been a concern. IO Visor can provide the performance with dynamic changes. Since IO Visor does not require physical or virtual appliances to create distributed networks, it can drive high density and reduced capex for carrier uses such as vCPE. Some founding member companies provide technologies to carriers, and through collaboration with OPNFV, I expect carrier networking requirements will influence IO Visor development in new ways.

Foundational software systems, regardless of technical soundness, cannot succeed unless there are applications. Since the project founding members provide a wide range of solutions, we expect their contributions to build applications, tools and IO Modules and not focus solely on the IO Visor engine.

End-users won’t directly interact with IO Visor but they will instead see improvements in performance, flexibility and security and being introduced to new classes of Linux based tools and devices.

Given that Linux is used widely, we feel this project can have widespread effects throughout the Linux virtualization and networking space. With this project, another layer of the IT infrastructure may get transformed to provide more flexibility in a portable, open manner.



Programmable IO across Virtual and Physical Infrastructures


In recent years, with the advent of virtualization, private and public clouds, the nature of application development and deployment has changed significantly. The demands on today’s businesses require applications to be deployed at scale in minutes, and not days, months or years. The need for agility and scale extends to the infrastructure functions needed to support these applications, like networking, storage, security and load balancing.

The advent of software-defined functions, like Software-Defined Networking (SDN), Software-Defined Storage (SDS), and Software-Defined Security (SD-Sec), attempts to deliver on the promise of just-in-time provisioning, auto-scaling and fine-grained policy control. A logical way to scale some of these IO functions is to allow them to be implemented in the host OS stack.

The IO Visor project attempts to add the ability to programmatically insert IO functions into the data plane of the Linux kernel, to allow for software-defined control of the infrastructure needed to support modern-day applications. IO Visor requires that clearly specified data plane functions be compiled into a format that can be inserted into the data plane of the Linux kernel.

Barefoot Networks has contributed to the development of P4, an open, domain specific language, that is designed to specify and program networking data planes. P4 is an imperative language that allows one to describe the data plane behavior in an intuitive manner.

P4 is the perfect high-level language to specify IO functions in IO Visor. P4 allows for concise, unambiguous, and human readable specification of IO behavior, especially when compared to alternate forms of specification using procedural languages like C. IO functions specified in P4 can be compiled into the extended Berkeley Packet Filter (eBPF) format and pushed into the data plane of the Linux kernel.

Barefoot Networks has prototyped a compiler that compiles P4 programs into LLVM IR. This LLVM IR is further compiled into eBPF using the LLVM compiler. We will contribute this compiler and the supporting tool chain to the IO Visor project by open sourcing the implementation under a permissive license. We believe that this contribution will significantly accelerate the specification of IO functions in the IO Visor project.

We look forward to a world where IO functions can be programmed easily and quickly on both virtual and physical networking infrastructures to adapt to the needs of the applications.


About the author of this post

Chaitanya (CK) Kodeboyina

Datacenter Security with IO Visor


Firewalls as intermediary networking devices have played an important role in protecting a company or organization’s internal servers and hosts, but with networking functionality virtualized (NFV), more and more company applications are moving to the public cloud and the traditional security perimeter is becoming obscure. Before the advent of virtualized firewalls, physical firewalls continued to provide security for the public cloud in a way similar to traditional network design. Usually the firewall is placed in the aggregation layer and all traffic from different tenants is routed to the physical firewall, which obviously needs multi-tenant support (Figure 1, left).

[Figure 1. Physical firewall and NFV-like firewall solution in the data center]

With this solution, the traffic between servers in two different security zones will be routed to the aggregation layer router, and the firewall enforces the security policy the same way it used to in the enterprise environment. The drawback of this solution is obvious: all traffic, even internal traffic between VMs, will be routed to the aggregation layer, which makes the system less scalable. When the traffic increases to a point that the physical firewall cannot handle, the firewall has to be upgraded or a load-balancing device added for traffic distribution.

An alternative solution to address this issue is to move the physical firewall function to an NFV-like deployment, i.e., a firewall running in a VM. If VMs hosting different applications need separation, they connect to the firewall NFV and the security policy is enforced inside that VM. When there is more need, a new firewall NFV can be instantiated along with the application VMs (illustrated in Figure 1, right). With NFV, each tenant can be thought of as one virtual domain, security can be enforced within each tenant domain, and the traffic within one virtual domain can be optimized; e.g., if all VMs are hosted in one rack, the traffic will not even leave the rack.

When a firewall is deployed as a VM there are some challenges:

– The networking security perimeter shouldn’t change with dynamic VM migration;

– Security information should be carried along the data path within the tenant domain

– The Virtual Appliance may need to migrate to maximize the networking performance

It is non-trivial to implement all those goals with the existing infrastructure because existing configuration-based networking controllers lack the ability to distribute the traffic in a dynamic way. Controller plugins can provide some levels of programmability but the overall networking scalability issue still exists.

IO Visor creates a run-time extensible data plane that allows NFV vendors or customers to define their packet processing logic dynamically. IO Visor provides the ability to create a module running in the hypervisor of each data center server and to create a virtual fabric overlay on top of it. By overlaying the networking dynamically and logically, IO Visor can keep the application VMs protected. The firewall can run either in a static networking environment or in a dynamic networking environment with the IO Visor platform running underneath. Additionally, with IO Visor’s programmability, users can carry and interpret security information in their own way without the NFV’s awareness, building up another layer of networking transparency to deploy virtual security devices like an NFV firewall. Finally, IO Visor can help optimize network performance: in the above example, if a VM migration in a tenant domain leads to sub-optimal networking performance, the user can create their own NFV by describing the networking functionality (optimization algorithm) in the IO Visor framework and load it at run time, without waiting for delivery from the NFV vendor or any third party. This will significantly reduce the NFV delivery time.

In summary, IO Visor can help NFV-based security architectures be deployed in the data center transparently and allows for more extensibility, with the ability to add security features quickly.



Better Networking Through Networking


Immersed in the technology world, we tend to think of networking as linking machines, especially computers, to operate interactively. Before there were computers though, we might have said that networking refers to how people come together intelligently to get things done. In our hyper-connected world, open source software development is an excellent example of the latter, and increasingly a driving force behind innovation of the former.

With the announcement today at LinuxCon, SUSE is excited to be a founding member of IO Visor, an open source project and a community of developers to accelerate the innovation, development, and sharing of new IO and networking functions.

As an enterprise Linux company, our job is to produce the most reliable, secure, stable and enterprise-ready Linux on which our customers can base their entire physical and virtual infrastructure, whether they’re running their workloads in the data center, public or private cloud, or some combination of all of them.

The hybrid nature of our customers’ infrastructures is putting new pressures on the networking stack in particular. Since the physical network interface protocols vary across vendors and evolve over time, a handy thing would be to provide a “network hypervisor” that abstracts away the physical network interface thereby accommodating new requirements more easily. This would help accelerate innovation and enable enterprises to more easily deploy software defined network infrastructures.

This is what IO Visor does. Through a module that is upstream in the Linux kernel, Linux distributions like SUSE Linux Enterprise Server will be able to provide a programmable environment in which network function data planes can be loaded and instantiated at runtime, giving developers the ability to create applications, publish them, and deploy them in live systems without having to recompile. And the IO Visor solution is agnostic, not bound to any particular vendor’s hardware or software solution.

Open source software development is now the new normal for business IT, and projects like OpenStack, Open Container Initiative and Cloud Foundry are where the innovation is happening today. With the increased frequency that new open source projects are forming, we have to be judicious about which ones we elect to participate in, and we put a high priority on those that will help to solve challenges that our customers have today. IO Visor is such a project.

