Advanced Container Security
This guide covers runtime hardening: how to prevent a compromised container from damaging the host or other containers, even if the attacker has "root" access inside the container.
We will focus on the Linux kernel primitives that enforce isolation: Seccomp, AppArmor/SELinux, and Capabilities.
1. Understanding "Container Breakouts"
A container breakout occurs when a process inside a container manages to access resources outside its isolated namespace.
Common Vectors:
Kernel Exploits: Exploiting a bug in the host kernel (shared by all containers).
Misconfiguration: Mounting the Docker socket (/var/run/docker.sock) or the host root filesystem (/) into a container.
Excessive Capabilities: Having privileges like CAP_SYS_ADMIN allows a container to manipulate the kernel.
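The misconfiguration and capability vectors above can be checked for mechanically before a container is ever started. The following is a minimal sketch, not an official Docker tool: `audit_docker_run_args` and `RISKY_MOUNT_SOURCES` are hypothetical names introduced here, and the check only covers the three cases named in this section.

```python
# Hypothetical helper: flags the risky `docker run` options named in this section.
RISKY_MOUNT_SOURCES = {"/var/run/docker.sock", "/"}
RISKY_CAPABILITIES = {"SYS_ADMIN", "ALL"}


def audit_docker_run_args(args):
    """Return a list of findings for risky mounts and capabilities in `args`."""
    findings = []
    i = 0
    while i < len(args):
        arg = args[i]
        # Accept both "--flag=value" and "--flag value" / "-v value" forms.
        if arg.startswith("--") and "=" in arg:
            flag, value = arg.split("=", 1)
        elif arg in ("-v", "--volume", "--cap-add") and i + 1 < len(args):
            flag, value = arg, args[i + 1]
            i += 1
        else:
            flag, value = arg, ""

        if flag in ("-v", "--volume"):
            # A volume spec looks like "host_path:container_path[:options]".
            source = value.split(":", 1)[0]
            if source in RISKY_MOUNT_SOURCES:
                findings.append(f"host path {source} is mounted into the container")
        elif flag == "--cap-add":
            cap = value.upper()
            if cap.startswith("CAP_"):
                cap = cap[4:]
            if cap in RISKY_CAPABILITIES:
                findings.append(f"capability {cap} is granted to the container")
        i += 1
    return findings


if __name__ == "__main__":
    example = ["run", "-v", "/var/run/docker.sock:/var/run/docker.sock",
               "--cap-add=SYS_ADMIN", "nginx"]
    for finding in audit_docker_run_args(example):
        print("WARNING:", finding)
```

A check like this only inspects command-line arguments; it says nothing about kernel exploits, which is why the kernel-level controls in the following sections still matter.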
2. Limiting System Calls with Seccomp
The Linux kernel has hundreds of system calls (syscalls). A standard web server needs only a small fraction of them (e.g., read, write, socket, accept). Other syscalls (like mount, ptrace, reboot) are dangerous and unnecessary.
Seccomp (Secure Computing Mode) acts as a firewall for syscalls. It allows you to whitelist exactly which calls a container can make.
The Default Profile
Docker applies a default Seccomp profile that blocks ~44 dangerous syscalls out of the box. This prevents many common exploits.
Custom Profiles
For high-security environments, you can create a custom profile (JSON file) that whitelists only what your app needs.
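As an illustration of what such a profile might look like, the sketch below generates a deny-by-default profile in the JSON layout Docker accepts (`defaultAction`, `architectures`, `syscalls`). The function name `build_seccomp_profile` and the syscall list are assumptions made for this example; a real service needs considerably more syscalls than the handful shown, so treat the list as a placeholder to be replaced with what your application actually uses.

```python
import json

# Illustrative only: far too short for a real service.
EXAMPLE_ALLOWED_SYSCALLS = ["read", "write", "socket", "accept", "close", "exit_group"]


def build_seccomp_profile(allowed_syscalls):
    """Return a profile dict that rejects every syscall not explicitly listed."""
    return {
        # Any syscall not matched below fails with an error instead of running.
        "defaultAction": "SCMP_ACT_ERRNO",
        # Assumes an x86_64 host; other architectures need their own entry.
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


if __name__ == "__main__":
    # Print the profile; redirect the output to profile.json to use it.
    print(json.dumps(build_seccomp_profile(EXAMPLE_ALLOWED_SYSCALLS), indent=2))
```

Generating the file from a list keeps the allowed set reviewable in one place, which helps when the list grows to the size a real application requires.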
Running with a custom profile:
docker run --security-opt seccomp=profile.json my-app
3. Mandatory Access Control (AppArmor / SELinux)
While standard Linux permissions (rwx) control access based on users, Mandatory Access Control (MAC) systems control access based on programs.
AppArmor (Ubuntu/Debian)
AppArmor restricts what specific programs can do.
- Example: You can write a profile that says "Nginx can only read from /var/www/html and write to /var/log/nginx." Even if Nginx runs as root, it cannot touch /etc/shadow.
Applying a profile:
docker run --security-opt apparmor=my-nginx-profile nginx
SELinux (RHEL/CentOS/Fedora)
SELinux uses a labeling system. Every file and process has a label.
Container Isolation: Docker automatically labels container processes (container_t) and container files (container_file_t).
Protection: This prevents a container process from reading host files (like /home/user/.ssh), because the SELinux policy does not allow a container_t process to access files that carry host labels.
4. Capabilities: Breaking Down "Root"
In Linux, "root" is not a single permission; it is a collection of ~40 "Capabilities."
The Principle of Least Privilege: Even if your container runs as UID 0 (root), you should drop all capabilities it doesn't explicitly need.
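To see which capabilities a process actually holds, you can decode the `CapEff` bitmask that the kernel exposes in `/proc/<pid>/status`. The sketch below is a minimal decoder; the function names are hypothetical, and `CAP_BITS` lists only a subset of the capability numbers from `<linux/capability.h>`, so unlisted bits are reported by number.

```python
# Bit positions from <linux/capability.h>; only a subset is listed here.
CAP_BITS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}


def decode_capabilities(hex_mask):
    """Translate a hexadecimal capability mask into a list of names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(CAP_BITS.get(bit, f"bit {bit}"))
        bit += 1
    return names


def effective_capabilities(status_path="/proc/self/status"):
    """Read the CapEff line of a Linux status file and decode it."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return decode_capabilities(line.split()[1])
    return []


if __name__ == "__main__":
    print(effective_capabilities())
```

Running this inside a container before and after dropping capabilities is a quick way to confirm that the restriction took effect.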
The Ultimate Hardening Command:
# 1. --cap-drop=ALL: drop everything, so the container has no special powers.
# 2. --cap-add=NET_BIND_SERVICE: add back only the permission to bind ports below 1024.
# 3. no-new-privileges: prevent processes from gaining new privileges (e.g., via setuid binaries such as sudo).
docker run \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --security-opt no-new-privileges \
  my-web-server
5. Rootless Docker
The ultimate defense against daemon exploits is to run the Docker daemon itself as a non-root user.
Rootless Mode:
The dockerd daemon runs as your standard user (e.g., alice).
It uses user namespaces to map your user to "root" inside the container.
Benefit: If the daemon is compromised, the attacker only gains the privileges of alice, not root on the host.
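The user-namespace mapping can be made concrete with the table the kernel exposes in `/proc/<pid>/uid_map`, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below resolves a container UID to a host UID; the function names and the sample mapping (alice as host UID 1000, with a subordinate range starting at 100000) are assumptions for illustration.

```python
# Assumed sample mapping: container root is host UID 1000 (e.g., alice),
# and container UIDs 1..65536 map onto a subordinate range starting at 100000.
SAMPLE_UID_MAP = """\
0 1000 1
1 100000 65536
"""


def parse_uid_map(text):
    """Parse uid_map text into (inside_start, outside_start, count) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            inside, outside, count = (int(part) for part in parts)
            entries.append((inside, outside, count))
    return entries


def host_uid_for(container_uid, entries):
    """Return the host UID for a container UID, or None if it is unmapped."""
    for inside, outside, count in entries:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None


if __name__ == "__main__":
    entries = parse_uid_map(SAMPLE_UID_MAP)
    print("container root maps to host UID", host_uid_for(0, entries))
```

With a mapping like this, "root" inside the container is an ordinary unprivileged UID on the host, which is the property the benefit above relies on.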
Installation (Linux):
curl -fsSL https://get.docker.com/rootless | sh
Note: Rootless mode has some limitations regarding networking drivers and cgroup resource limits.
6. Runtime Threat Detection (Falco)
Prevention is ideal, but detection is mandatory. How do you know if a container has already been compromised?
Falco is a CNCF tool that monitors kernel syscalls in real time to detect anomalous behavior.
Falco can detect:
A shell being spawned in a running container (often the first step of an attack).
A process writing to a binary directory (like /bin).
Outbound connections to unexpected IPs (crypto-mining or C2 servers).
Sensitive files (like /etc/shadow) being read.
Example Falco Rule:
- rule: Terminal Shell in Container
  desc: A shell was spawned in a container with an attached terminal.
  condition: >
    container.id != host and
    proc.name = bash and
    proc.tty != 0 and
    evt.type = execve
  output: "Shell spawned in container (user=%user.name container=%container.name)"
  priority: NOTICE