Workload isolation makes it harder for a vulnerability in one service to compromise every other part of the platform. It has a long history going back to 1990s qmail, and we generally agree that it’s a good, useful thing.
Despite a plethora of isolation options, in the time I spent consulting for technology companies I learned that the most common isolation mechanism is “nothing”. And that makes some sense! Most services are the single tenant of their deployment environment, or at least so central to the logical architecture that there’s nothing to meaningfully isolate them from. Since isolation can be expensive, and security is under-resourced generally, elaborate containment schemes are often not high up on the list of priorities.
That logic goes out the window when you’re hosting other people’s stuff. Fly.io is a content delivery network for Docker containers. We make applications fast by parking them close to their users; we do that by running bare metal servers in a bunch of data centers around the world, and knitting them together with a global WireGuard mesh. Fly.io is extremely easy to play with — single-digit minutes to get your head around, and rather than talk about it, I’ll just suggest you grab a free account and try it.
Meanwhile, I’m going to rattle off a bunch of different isolation techniques. I’ll spoil the list for you now: we use Firecracker, the virtualization engine behind Amazon’s Lambda and Fargate services. But the solution space we chose Firecracker from is interesting, and so you’re going to hear about it.
People like to say “chroot isn’t a security boundary”, but that isn’t really true; it’s just not very strong by itself. Chroot is the original sandboxing technique.
The funniest problem with chroot is how it’s implemented: in the kernel process table, every struct proc (I was raised on BSD) has a pointer to its current working directory and to its root directory. The root directory is “enforced” when you try to cd to “..”; if your current working directory is already the root, the kernel won’t let “..” go below it. But when you call chroot(2), you don’t necessarily change directories; if you’re “above” your new root, the kernel will never see that new root in a path traversal.
The real problem, of course, is the kernel attack surface. We don’t need to get cute yet; by itself, considering no other countermeasures, chroot gives you ptrace, procfs, device nodes, and, of course, the network.
You shake a lot of these problems off by not running anything as “root”, but not all of them. A quick-but-important aside: in real-world attacks, the most important capability you can concede to an attacker is access to your internal network. It’s for the same reason that SSRF vulnerabilities (“unexpected HTTP proxies”) are almost always game-over, even though at first blush they might not seem much scarier than an unchecked redirect: there will be something you can aim an internal HTTP request to that will give an attacker code execution. Network access in a chroot jail is like that, but far more flexible.
This problem will loom over almost everything I write about here; just keep it in mind.
chroot is a popular component in modern sandboxes, but none of them really rely on it exclusively.
It’s 1998 and the only serious language you have available to build in is C. You want to receive mail for a group of users, or authenticate and kick off a new SSH session. But those are complicated, multi-step operations, and nobody knows how to write secure C code; it’ll be 30 years before anyone figures that out. You assume you’re going to screw up a parse somewhere and cough up RCE. But you need privileges to get your job done.
One solution: break the service up into smaller services. Give the services different user IDs. Connect services with group IDs. Mush the code around so that the gnarliest stuff winds up in the low-privileged services with the fewest connections to other services. Keep the stuff that needs to be privileged, like mailbox delivery or setting the login user, as tiny as you can.
Call this approach “privsep”.
Despite what its author said about his design, this approach works well. It’s not foolproof, but it has, in fact, a pretty good track record. The major downside is that it takes a lot of effort to implement; your application needs to be aware that you’re doing it.
If you can change your applications to fit the sandbox, you can take privsep pretty far. OpenBSD got this right with “pledge” and “unveil”, which allow programs to gradually ratchet down the access they get from the kernel. It’s a better, more flexible idiom than seccomp, about which more later. But you’re not running OpenBSD, so, moving on.
People like to say “Docker isn’t a security boundary”, but that’s not so true anymore, though it once was.
The core idea behind containers is kernel namespacing, which is chroot extended to other kernel identifiers — process IDs, user IDs, network interfaces. Configured carefully, these features give the appearance of a program running on its own machine, even as it shares a running kernel with other programs outside its container.
But even with its own PID space, its own users and groups, and its own network interfaces, we still can’t have processes writing handler paths to `/sys`, rebooting the system, loading kernel modules, or making new device nodes. Many of these concerns can be avoided simply by not running as root, but not all of them can.
Systems security people spent almost a decade dunking on Docker because of all the gaps in this simplified container model. But nobody really runs containers like this anymore.
Enter mandatory access control, system call filtering, and capabilities.
Mandatory access control frameworks (AppArmor is the one you’ll see) offer system- (or container-) wide access control lists. You can read a version of Docker’s default AppArmor template to see what problems this fixes; it’s a nice concise description of the weaknesses of namespaces on their own.
System call filters let us turn off kernel features; in 2020, if you’re filtering system calls, you’re probably doing it with seccomp-bpf.
Capabilities split “root” into a whole mess of sub-privileges, ensuring that there’s rarely a need to give any program superuser access.
There are lots of implementations of this idea.
Modern Docker, for instance, takes advantage of all these features. Though imperfect, the solution Docker security people arrived at is, I think, a success story. Developers don’t harden their application environments consciously, and yet, for the most part, they also don’t run containers privileged, or give them extra capabilities, or disable the MAC policies and system call filters Docker enforces by default.
It may be even easier to jail a process outside of Docker; Googlers built minijail and nsjail, Cloudflare has “sandbox”, there’s “firejail”, which is somewhat tuned for things like browsers, and systemd will do some of this work for you. Which tool you use is a matter of taste; nsjail has nice BPF UX; firejail interoperates with AppArmor. Some of them can be preloaded into uncooperative processes.
With namespaced jails, we’ve arrived at the most popular current endpoint for workload isolation. You can do better, but the attacks you’ll be dealing with start to get subtle.
A limitation of jailed application environments is that they tend to be applied container- or at least process-wide. At high volumes, allocating a process for every job might be expensive.
From a security perspective, assuming you trust the language runtimes (I guess I do), these approaches are attractive when you can expose a limited system interface, which is what everyone does with them, and less attractive as a general design if you need all of POSIX.
Here’s a problem we haven’t addressed yet: you can design an intricate, minimal whitelist of system calls, drop all privileges, and cut most of the filesystem off. But then a Linux kernel developer restructures the memory access checks the kernel uses when deref’ing pointers passed to system calls, and someone forgets to tell the person who maintains waitid(2), and now userland programs can pass kernel addresses to waitid and whack random kernel memory. waitid(2) is innocuous, you weren’t going to filter it out, and yet there you were, boned.
Or, how about this: every time a process faults an address, the kernel has to look up the backing storage to resolve the address. Since this is relatively slow, the kernel caches. But it has to keep those caches synchronized between all the threads in a process, so the per-thread caches get counters tied to the containing process. Except: the counters are 32 bits wide, and the invalidation logic is screwed up, so that if you roll the counter, then immediately spawn a thread, then have that thread roll the counter again, you can desynchronize a thread’s cache and get the kernel to follow stale pointers.
Bugs like this happen. They’re called kernel LPEs. A lot of them, you can mitigate by tightening system call and device filters, and compiling a minimal kernel (you weren’t really using IPv6 DCCP anyway). But some of them, like Jann Horn’s cache invalidation bug, you can’t fix that way. How concerned you are about them depends on your workloads. If you’re just running your own applications, you might not care much: the attacker exploiting this flaw already has RCE on your systems and thus some access to your internal network. If you’re running someone else’s applications, you should probably care a lot, because this is your primary security barrier.
If namespaces and filters constitute a “jail”, gVisor is The Village from The Prisoner. Instead of filtering system calls, what if we just reimplement most of Linux? We run ordinary Unix programs, but intercept all the system calls, and, for the most part, instead of passing them to the kernel, we satisfy them ourselves. The Linux kernel has almost 400 system calls. How many of them do we need to efficiently emulate the rest? gVisor needs less than 20.
With those, gVisor implements basically all of Linux in userland. Processes. Devices. Tasks. Address spaces and page tables. Filesystems. TCP/IP; the entire IP network stack, all reimplemented, in Go, backed by native Linux userland.
The pitch here is straightforward: you’re unlikely to have routine exploitable memory corruption flaws in Go code. You are sort of likely to have them in the C-language Linux kernel. Go is fast enough to credibly emulate Linux in userland. Why expose C code if you don’t have to?
As batshit as this plan is, it works surprisingly well; you can build gVisor and `runsc`, its container runtime, relatively easily. Once you have `runsc` installed, it will run Docker containers for you. After reading the code, I sort of couldn’t believe it was working as well as it did, or, if it was, that it was actually using the code I had read. But I scattered a bunch of panic calls across the codebase and, yup, all that stuff is actually happening. It’s pretty amazing.
You are probably strictly better off with gVisor than you are with a tuned Docker configuration, and I like it a lot. The big downside is performance; you’ll be looking at a low-double-digits percentage hit, degrading with I/O load. Google runs this stuff at scale in GCE; you can probably get away with it too. If you’re running gVisor, you should brag about it, because, again, gVisor is pretty bananas.
If you’re worried about kernel attack surface but don’t want to reimplement the entire kernel in userland, there’s an easier approach: just virtualize. Let Linux be Linux, and boot it in a virtual machine.
You almost certainly already trust virtualization; if hypervisors are comprehensively broken, so is all of AWS, GCE, and Azure. And Linux makes hypervising pretty simple!
The challenge here is primarily about performance. A big part of the point of containers is that they’re lightweight. In a sense, the grail of serverside isolation is virtualization that’s light enough to run container workloads.
It turns out, this is a reasonable ask. A major part of what makes virtual machines so expensive is hardware emulation, with enough fidelity to run multiple operating systems. But we don’t care about diverse operating systems; it’s usually fine to constrain our workloads to Linux. How lightweight can we make a virtual machine if it’s only going to boot a simple Linux kernel, with simple devices?
Turns out: pretty lightweight! So we’ve got Kata Containers, which is the big-company supported serverside lightweight virtualization project that came out of Intel’s Clear Containers (mission statement: “come up with a container scheme that is locked in to VT-x”). Using QEMU-Lite, Kata gets rid of BIOS boot overhead, replaces real devices with their virtio equivalents, and aggressively caches, and manages to get boot time down by like 75%. kvmtool, an alternative KVM runtime, gets even lighter.
There’s two catches.
The first, and really the big problem for the whole virtualization approach, is that you need bare metal servers to efficiently do lightweight virtualization; you want KVM but without nested virtualization. You’re probably not going to shell out for EC2 metal instances just to get some extra isolation.
The second, more philosophical problem is that QEMU and kvmtool are relatively complicated C codebases, and we’d like to minimize our dependence on these. You could reasonably take the argument either way between gVisor, which emulates Linux in a memory-safe language, or Kata/kvmtool, which runs virtualized Linux with a small memory-unsafe hypervisor. They’re both probably better than locked-down `runc` Docker, though.
Lightweight virtualization is how AWS runs Lambda, its function-as-a-service platform, and Fargate, its serverless container platform. But rather than trusting (and painstakingly tuning) QEMU, AWS reimplemented it, in Rust. The result is Firecracker.
Firecracker is a VMM optimized for security. It’s really kind of difficult to oversell how clean Firecracker is; the Firecracker paper boasts that they’ve implemented their block device in around 1400 lines of Rust, but it looks to me like they’re counting a lot of test code; you only need to get your head around a couple hundred lines of Rust code to grok it. The network driver, which adapts a Linux tap device to a virtio device a guest Linux kernel can talk to, is about 700 lines before you hit tests — and that’s Rust, so something like 1/3 of those lines are use-statements! It’s really great.
The reason Firecracker (and, if you overlook the C code, kvmtool) can be this simple is that they’re pushing the system complexity down a layer. It’s still there; you’re booting an actual, make-menuconfig’d kernel, in all of its memory-unsafe glory. But you’re doing it inside a hypervisor where, in the Firecracker case, really you’re only worried about the integrity of the kvm subsystem itself.
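To give a sense of how little Firecracker needs to be told, this is roughly the shape of an entire machine definition in its JSON config format (the paths are placeholders, and this is a sketch of the format rather than a tuned production config): a kernel, a root drive, and a size.

```json
{
  "boot-source": {
    "kernel_image_path": "vmlinux.bin",
    "boot_args": "console=ttyS0 reboot=k panic=1"
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "rootfs.ext4",
      "is_root_device": true,
      "is_read_only": false
    }
  ],
  "machine-config": {
    "vcpu_count": 1,
    "mem_size_mib": 128
  }
}
```

There’s no BIOS, no PCI enumeration, no VGA; the brevity of the config mirrors the brevity of the device model.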
We aren’t yet significant contributors to Firecracker, but it still feels weird talking the project up because it’s such a core part of our offering. That said: the team at AWS really did this thing the Western District Way:
- The Firecracker VMM is tiny, easily readable, and deliberately implements the minimal number of concepts required to run a Linux server workload.
- The VMM is written in Rust.
- The VMM seccomp-bpf’s itself down to something like 40 system calls, several of them, including basic things like `ioctl`, with tight argument filters.
- The VMM runs itself under an external jailer that chroots, namespaces, and drops privileges.
Keep in mind that no matter how intricate your Linux system isolation is, the most important attack surface to reduce is exposure to your network. If you have time to spend either segmenting an unsegmented single-VPC network or further tightening the default Docker seccomp-bpf policy, your time is probably better spent on the network.
Remember also that when security tools designers think about isolation and attack surface reduction, they’re generally assuming that you need ordinary tools to run, and ordinary tools want Internet access; your isolation tools aren’t going to do the network isolation out of the box, the way they might, for instance, shield you from Video4Linux bugs.
It seems to me like, for new designs, the basic menu of mainstream options today is:
- Jailing otherwise-unmanaged Unix programs with `nsjail` or something like it.
- Running unprivileged Docker containers, perhaps with a tighter seccomp profile than the default.
- Going full gVisor.
- Running Firecracker, either directly or, in a K8s environment, with something like Kata.
These are all valid options! I’ll say this: for ROI purposes, if time and effort is a factor, and if I wasn’t hosting hostile code, I would probably tune an `nsjail` configuration before I bought into a containerization strategy.