I recently came across an exceptionally dense technical analysis about container security that’s worth sharing. The author started with a simple hypothesis: container filesystem isolation should be sufficient for multi-tenant workloads without virtual machines, if you sufficiently understand what’s happening at the syscall level.
After thorough investigation, the conclusion is more uncomfortable than expected: the defaults protect you well, but the moment you reach for “advanced” features like bidirectional mount propagation or SELinux relabeling, you’re one misconfiguration away from handing an attacker the keys to your host.
Why Filesystem Isolation Matters
The Economics of Multi-Tenancy
Running multiple tenants on shared infrastructure only works if you can guarantee that Tenant A’s compromise can’t touch Tenant B. Linux namespaces provide the foundation; mount namespaces give each container its own view of the filesystem tree. Without this isolation, the cloud economics that make containers attractive fall apart.
Attack Surface Reduction
Containers restrict what a running process can see and do by leveraging kernel namespace features. But “restrict” doesn’t mean “eliminate.” The shared kernel attack surface means a vulnerability in OverlayFS (like CVE-2023-0386) can escalate privileges from inside a container that satisfies the exploit’s preconditions. The kernel is the ultimate shared resource.
The Six Pillars of Filesystem Isolation
1. Mount Namespaces & OverlayFS
Mount namespaces (CLONE_NEWNS) give each container a private view of the mount tree. OverlayFS layers a writable upperdir on top of read-only lowerdir layers, implementing copy-on-write semantics.
The copy-on-write mechanism works like this: when a container modifies a file from a lower layer, the kernel copies it to the upperdir first. All subsequent operations target this copy, leaving the base image untouched.
The shared kernel problem: OverlayFS runs in kernel space. CVE-2023-0386 demonstrated a flaw in how the kernel handles UID/GID mapping validation when copying a capable file from a nosuid mount into another mount during OverlayFS copy-up. This enables local privilege escalation.
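The copy-up behavior is easy to see by hand. Here's a hedged sketch using scratch directories: mounting overlayfs needs CAP_SYS_ADMIN in the mount namespace, so it tries an unprivileged user+mount namespace and skips gracefully on kernels or distros that forbid that.

```shell
# OverlayFS copy-up demo (sketch; all paths are scratch dirs created here).
set -eu
d=$(mktemp -d)
mkdir -p "$d/lower" "$d/upper" "$d/work" "$d/merged"
echo "from base image" > "$d/lower/app.conf"

if unshare --map-root-user --mount sh -eu -c "
  mount -t overlay overlay \
    -o lowerdir=$d/lower,upperdir=$d/upper,workdir=$d/work $d/merged
  echo 'container edit' >> $d/merged/app.conf   # write triggers copy-up
  ls $d/upper                                   # the copied file lands in upperdir
" 2>/dev/null; then
  cat "$d/lower/app.conf"   # lower layer untouched: still 'from base image'
else
  echo "unprivileged userns/overlay not available on this kernel; demo skipped"
fi
rm -rf "$d"
```

On a kernel that allows unprivileged overlay mounts (5.11+), the modified file appears in `upper/` while `lower/` keeps the original content.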
2. FUSE (Filesystem in Userspace)
FUSE lets you implement a filesystem as a userspace daemon rather than kernel code. When an application makes a syscall on a FUSE mount, the request travels: VFS → FUSE kernel module → /dev/fuse queue → userspace daemon → response back through the same path.
Why FUSE matters for containers: Kernel overlayfs mounting requires CAP_SYS_ADMIN in the relevant namespace. Where that support isn’t available or reliable, container runtimes like Podman fall back to fuse-overlayfs, which implements overlay semantics in userspace without requiring elevated privileges.
The daemon controls everything: A malicious FUSE daemon can lie about file ownership, permissions, and content. If you mount untrusted FUSE filesystems, access control is theater.
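The rootless fallback is simple to reproduce. This sketch mounts the same lower/upper/work layout as kernel overlayfs, but via the fuse-overlayfs daemon; it skips itself if the binary or /dev/fuse is missing:

```shell
set -eu
if command -v fuse-overlayfs >/dev/null && [ -e /dev/fuse ]; then
  d=$(mktemp -d)
  mkdir -p "$d/lower" "$d/upper" "$d/work" "$d/merged"
  echo hello > "$d/lower/f"
  # Same layout as kernel overlayfs, but the filesystem is served by an
  # unprivileged userspace daemon talking to the kernel via /dev/fuse.
  if fuse-overlayfs -o lowerdir="$d/lower",upperdir="$d/upper",workdir="$d/work" "$d/merged" 2>/dev/null; then
    cat "$d/merged/f"    # lower-layer file, served by the daemon
    fusermount -u "$d/merged" 2>/dev/null || fusermount3 -u "$d/merged" 2>/dev/null || true
  else
    echo "fuse-overlayfs mount failed (no FUSE access in this environment)"
  fi
  rm -rf "$d" 2>/dev/null || true
else
  echo "fuse-overlayfs or /dev/fuse unavailable; demo skipped"
fi
```

Note that every byte you read back from `merged/` passed through the daemon, which is exactly why a malicious daemon can lie about anything.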
3. Mount Propagation
This is where most of us get burned. Mount propagation determines whether mount events cross namespace boundaries.
The four propagation types are:
- Shared: Mount events propagate bidirectionally
- Slave: Events propagate one direction only
- Private: Events don’t propagate
- Unbindable: The mount cannot be bind mounted
The kernel’s protection for less-privileged namespaces: When you create a mount namespace that is less privileged than its parent - when the new namespace’s owning user namespace differs from the parent mount namespace’s owner - the kernel automatically demotes inherited shared mounts to MS_SLAVE.
This means creating a mount namespace with a different owning user namespace triggers automatic demotion. Creating one as root (without --user) does not demote - the mount stays shared.
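You can inspect propagation state from userspace without any privileges. In /proc/self/mountinfo, the optional-fields column carries `shared:N` for shared peer groups and `master:N` for slaves; private mounts show neither tag:

```shell
# Show shared and slave mounts visible from this namespace
grep -E 'shared:[0-9]+|master:[0-9]+' /proc/self/mountinfo | head -n 3 \
  || echo "no shared or slave mounts visible from this namespace"

# findmnt (util-linux) decodes the same information, if installed
command -v findmnt >/dev/null && findmnt -o TARGET,PROPAGATION | head -n 5 || true
```

Running this inside a container versus on the host is a quick way to confirm whether the demotion actually happened.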
4. MAC Enforcement (AppArmor/SELinux)
Mandatory Access Control (MAC) provides a second layer of defense. Even if a process gains capabilities, MAC policies can deny specific operations.
Docker’s default AppArmor profile (docker-default) denies mount operations except for specific allowed types.
SELinux relabeling (:z and :Z): This is where I found surprising behavior. When you run:
podman run -v /host/path:/container/path:Z myimage
The :Z option instructs the container runtime to relabel /host/path with an MCS label matching the container. This is a recursive relabel of file objects under the mount path - the runtime walks the host filesystem tree and changes the label on each file via libselinux.
This happens before the container starts, on the host. It’s not sandboxed. If you relabel a directory shared by other host services (like /var/log), you break those services. This is a host-side operation masquerading as a container configuration option.
Critical warning: Never use :Z on:
- Container runtime state directories (/var/lib/docker, /var/lib/containers)
- Shared system directories (/var/log, /tmp, /var/run)
- Directories mounted by multiple containers
- Network filesystems (NFS, CIFS) where relabeling may fail or corrupt remote state
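To make the host-side nature of this concrete, here's roughly what the runtime does for :Z, demonstrated on a scratch directory. The MCS category pair c123,c456 is a made-up example, and the demo skips itself on hosts without SELinux:

```shell
set -eu
if [ -d /sys/fs/selinux ] && command -v chcon >/dev/null; then
  d=$(mktemp -d); touch "$d/app.log"
  # Recursive, host-side relabel -- exactly why :Z on a shared dir is dangerous
  chcon -R system_u:object_r:container_file_t:s0:c123,c456 "$d" \
    && ls -Z "$d" \
    || echo "relabel failed (container-selinux policy not installed?)"
  rm -rf "$d"
else
  echo "SELinux not active on this host; relabel demo skipped"
fi
```

Point that `chcon -R` at /var/log instead of a scratch directory and you've broken every host service that reads those files.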
5. Seccomp (The Syscall Gatekeeper)
Here’s something I initially underestimated: for most container deployments, seccomp is the first line of defense against mount-based attacks, not AppArmor or SELinux. Docker and Podman’s default seccomp profiles block the mount syscall entirely for unprivileged containers.
How seccomp filtering works: When a container starts, the runtime installs a BPF (Berkeley Packet Filter) program that intercepts every syscall. The filter examines the syscall number and, optionally, its arguments, then decides: allow, deny (EPERM), kill the process, or trap to userspace.
Docker’s default seccomp profile blocks approximately 44 syscalls by default, including mount, kexec, swapoff, pivot_root, among others.
The practical implication: Even if you somehow grant CAP_SYS_ADMIN to a container (bad idea), the default seccomp profile still blocks mount(). You’d need both the capability and a permissive seccomp profile (or --security-opt seccomp=unconfined) for mount attacks to work.
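Checking which of these states you're in takes one line. In /proc/self/status, the Seccomp field is 0 (off), 1 (strict), or 2 (filter mode, which is what the default Docker/Podman profiles install):

```shell
# Is the current process under a seccomp filter?
grep '^Seccomp:' /proc/self/status \
  || echo "no Seccomp field (kernel built without CONFIG_SECCOMP?)"
```

Inside a stock Docker container this reports `Seccomp: 2`; a container started with `--security-opt seccomp=unconfined` reports 0.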
6. Cgroups v2 and I/O Resource Isolation
This is my other fear for multi-tenancy: filesystem operations consume shared resources. A container performing unbounded I/O can starve other containers and host services, even with perfect namespace isolation. This isn't filesystem integrity isolation; it's filesystem availability isolation.
The problem: By default, most container deployments don’t configure I/O limits. Containers share the host’s I/O capacity without restrictions. A malicious or misbehaving container can saturate disk I/O, causing performance degradation or outages for other containers and host services.
Why this matters for multi-tenancy: Even if your containers are perfectly isolated at the namespace and MAC level, a single container can degrade performance for everyone on the node. In shared clusters, this is a denial-of-service vector that doesn’t require any privilege escalation.
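Opting in is straightforward with Docker's cgroup flags. A hedged sketch, where the device path /dev/sda and the image name "myimage" are placeholders you'd replace for your host:

```shell
if command -v docker >/dev/null; then
  docker run -d \
    --device-read-bps  /dev/sda:10mb \
    --device-write-bps /dev/sda:10mb \
    --device-read-iops /dev/sda:1000 \
    --pids-limit 256 \
    myimage || echo "run failed (image/device names are placeholders)"
else
  echo "docker not installed; showing flags only"
fi
```

The `--device-*` flags map to the cgroup v2 io.max controller; `--pids-limit` caps fork bombs for good measure.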
End-to-End Attack Chains
Attack Chain 1: Docker Socket Mount → Full Host Compromise
Scenario: A container has /var/run/docker.sock mounted (common for CI/CD pipelines, monitoring tools, and “Docker-in-Docker” patterns).
Preconditions:
- Container has access to Docker socket (bind mount)
- Docker daemon runs as root on host
- No additional restrictions (this is the default when mounting the socket)
Step-by-step exploitation:
- Attacker discovers Docker socket access
- Creates privileged container with host filesystem mounted
- Reads sensitive host files (/etc/shadow, SSH keys, application secrets)
- Chroots into the host filesystem and gets a root shell
From here they could:
- Add SSH keys to /root/.ssh/authorized_keys
- Create new root users in /etc/passwd
- Install backdoors or cryptominers
- Pivot to other systems on the network
Lesson: Never mount the Docker socket into untrusted containers. If you must, use a Docker socket proxy with API filtering.
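A safe-to-run check for the precondition, with the exploitation step shown (not executed) as the command it collapses into. The alpine image is an illustrative choice:

```shell
if [ -S /var/run/docker.sock ]; then
  echo "docker.sock is reachable -- one command away from a host root shell, e.g.:"
  echo '  docker run -it --rm --privileged -v /:/host alpine chroot /host sh'
else
  echo "docker.sock not reachable from this process"
fi
```

That one `docker run` line covers steps 2 through 4 above: a privileged container, the host root mounted at /host, and a chroot into it.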
Attack Chain 2: Bidirectional Mount Propagation → Host Mount Manipulation
Scenario: A container has Bidirectional mount propagation enabled (required for some CSI drivers) and CAP_SYS_ADMIN.
Preconditions:
- Pod/container has mountPropagation: Bidirectional
- Container has CAP_SYS_ADMIN (via privileged: true or an explicit capability grant)
- Seccomp allows the mount syscall (seccomp is disabled by privileged: true)
- A shared mount exists on the host that the container can see
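These preconditions correspond to a Kubernetes pod spec roughly like this (the names and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: csi-node-plugin            # hypothetical
spec:
  containers:
  - name: plugin
    image: example/csi-driver:latest   # hypothetical
    securityContext:
      privileged: true             # grants CAP_SYS_ADMIN, disables seccomp
    volumeMounts:
    - name: host-mounts
      mountPath: /host/mnt
      mountPropagation: Bidirectional  # mounts made here propagate to the host
  volumes:
  - name: host-mounts
    hostPath:
      path: /mnt
```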
Step-by-step exploitation:
- Attacker creates a bind mount inside the container
- The mount propagates to the host
- Can shadow legitimate host content by mounting over it
Any host process reading this path sees attacker-controlled content.
Lesson: Bidirectional propagation + CAP_SYS_ADMIN = host mount control. This is why Kubernetes restricts Bidirectional to privileged pods only.
Attack Chain 3: OverlayFS CVE → Kernel Privilege Escalation
Scenario: Exploiting CVE-2023-0386 to escalate from container to host root.
Preconditions:
- Unpatched kernel (vulnerable OverlayFS)
- Unprivileged user namespaces enabled
- Ability to mount OverlayFS in a user namespace
- Mount topology with nosuid lower layer and non-nosuid upper layer
The actual exploit adds one critical component: FUSE.
- Create a FUSE filesystem that lies about file ownership (claims files are owned by UID 0 with SUID bit set)
- Use FUSE as the lower layer of an overlay mount inside a user namespace
- Trigger copy-up by modifying the fake SUID binary
- Vulnerable kernel flaw: during copy-up, the kernel didn’t verify that UID 0 in the user namespace maps to a valid UID on the host
- Result: a real root-owned SUID binary appears in the upper directory
- Execute the binary → local privilege escalation to root
Lesson: Patch your kernel. This is the only real fix.
Where My Assumptions Failed
Assumption 1: “FUSE is the weakest link”
I went into this investigation expecting FUSE to be the primary vulnerability vector. User-kernel context switching, daemon-controlled responses, performance overhead - it all screamed “attack surface.”
What I found: FUSE’s weakness is availability (DoS) and cross-user data exposure, not host privilege escalation. A malicious FUSE daemon can hang processes, waste resources, or serve inconsistent data to different users, but it can’t directly escalate privileges to the host.
The actual weakest link is mount propagation misconfiguration. MS_SHARED mode creates a direct, bidirectional channel between container and host mount tables. Unlike FUSE, this isn’t a userspace daemon you can terminate - it’s kernel-enforced behavior that’s easy to enable and hard to notice.
Assumption 2: “SELinux relabeling is sandboxed”
I assumed :z and :Z options were container-side operations, maybe using some capability or namespace trick to relabel files from the container’s perspective.
What I found: Relabeling happens on the host, before the container starts. The container runtime recursively walks the filesystem tree and invokes SELinux relabeling via libselinux.
This means if you mount /var/log:Z into a container, you just relabeled every file in /var/log with a container-specific MCS label. Other host services reading those files may now fail with permission errors.
This isn’t a bug; it’s documented behavior. But it’s surprising if you assume container options stay inside the container.
Assumption 3: “Kernel filesystem bugs are rare”
I knew containers share the kernel, but I assumed mainline filesystem code was battle-tested enough that exploitable bugs were rare.
What I found: CVE-2023-0386 (OverlayFS) exploits the copy-up mechanism. The bug - flawed UID/GID mapping validation during file capability handling - was subtle enough to survive years of production use. CISA added it to their KEV catalog in June 2025, confirming active exploitation two years after the fix was merged.
The shared kernel isn’t a theoretical risk. It’s a practical attack vector with ongoing exploitation.
Alternatives: When Containers Aren’t Enough
The Shared Kernel Problem
The CVE-2023-0386 demonstration illustrates a fundamental constraint: containers share the host kernel, so a kernel vulnerability affects every container simultaneously. No amount of seccomp filtering, capability dropping, or MAC policy can protect against a bug in the kernel’s own filesystem implementation.
This isn’t a configuration problem I can fix. It’s an architectural boundary. The question becomes: what’s the minimum isolation primitive that eliminates this class of vulnerability?
Isolation Boundary Analysis
Each alternative interposes a different boundary between container workloads and the host kernel. The key metric is the Trusted Computing Base (TCB), the set of components that must be correct for isolation to hold.
gVisor takes a fundamentally different approach: instead of filtering syscalls, it reimplements Linux in memory-safe Go. The Sentry component is essentially a userspace kernel, handling syscalls, memory management, filesystems, networking - the works.
This dramatically shrinks the host attack surface. The Sentry needs only 53 host syscalls without networking, 68 with it. Compare that to the ~350 syscalls in Linux 5.3 - that’s an 80% reduction in host kernel exposure.
What I appreciate about the architecture is the defense-in-depth: even the Sentry runs inside seccomp-bpf, namespaces, and cgroups as secondary boundaries. Filesystem operations go through Sentry’s VFS, then to a separate Gofer process that handles host filesystem access via LISAFS.
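Trying it out is low-friction if you already have it installed. A hedged example, assuming runsc is registered as a Docker runtime named "runsc"; it skips itself otherwise:

```shell
if command -v docker >/dev/null && docker info 2>/dev/null | grep -q runsc; then
  # Inside the sandbox, uname reports the Sentry's emulated kernel version,
  # not the host kernel's -- a quick sanity check that you're not on runc.
  docker run --rm --runtime=runsc alpine uname -r
else
  echo "runsc runtime not available; skipped"
fi
```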
Firecracker takes the opposite approach: instead of reimplementing the kernel, just give each workload its own kernel. Each microVM runs a real Linux guest, so a host kernel vulnerability requires escaping KVM first - a much harder target.
Kata Containers gives you VM isolation with container UX, but the TCB varies dramatically by hypervisor choice. QEMU has nearly 2 million lines of C with decades of device emulation code - lots of attack surface.
Was My Hypothesis Correct?
My original hypothesis: “Container filesystem isolation is good enough for multi-tenant workloads without VMs, if you understand exactly what’s happening at the syscall level.”
Verdict: Partially correct, with important caveats.
Where the Defaults Hold
For trusted internal workloads, I found the default isolation model works better than I expected. Here’s why:
- Seccomp filtering: Docker/Podman block mount() and other dangerous syscalls by default. This is the first line of defense.
- Kernel behavior for less-privileged namespaces: When a container runtime creates a user namespace, inherited shared mounts are automatically demoted to slave.
- Runtime hardening: Most container runtimes (runc, crun, containerd) explicitly apply MS_REC|MS_SLAVE to the container's root early in setup.
- MAC policies: AppArmor/SELinux provide defense-in-depth even if other layers fail.
Together, these layers mean that out-of-the-box containers from Docker, Podman, or Kubernetes have reasonable filesystem isolation without special configuration.
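You can spot-check several of these layers from inside a stock container. This requires docker, and the alpine image is an illustrative choice; the script is a no-op otherwise:

```shell
if command -v docker >/dev/null; then
  docker run --rm alpine sh -c '
    grep "^Seccomp:" /proc/self/status              # 2 = BPF filter installed
    grep "^CapEff:"  /proc/self/status              # bounded capability set
    grep -c "shared:" /proc/self/mountinfo || true  # shared peer groups (often 0:
                                                    # nothing propagates to the host)
  ' || echo "docker run failed (daemon not running or image unavailable)"
else
  echo "docker not installed; skipped"
fi
```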
Where It Breaks
For untrusted multi-tenant workloads (running other people’s code), I found gaps that can’t be configured away:
- Kernel bugs are game over: A single OverlayFS vulnerability can grant root to any container that satisfies the exploit’s preconditions.
- Configuration complexity creates risk: MS_SHARED propagation, host path mounts, privileged containers - each "advanced" feature punches holes in isolation.
- MAC depends on runtime integrity: CVE-2023-28642 showed that path resolution bugs in runc can bypass AppArmor.
- Resource isolation is opt-in: Cgroup I/O limits, storage quotas, and PID limits aren’t enabled by default.
- Production drifts from defaults: Teams add privileged: true to "make things work," mount Docker sockets for CI/CD, and accumulate dangerous configurations over time.
Final Thoughts
Container filesystem isolation isn’t a wall - it’s an interlocking set of kernel mechanisms that work together to create a security boundary. Seccomp filters, capabilities, mount namespaces, OverlayFS, propagation rules, MAC policies, and cgroup resource limits - each contributes a piece.
The defaults work better than I initially expected. Seccomp blocking mount() out of the box means most mount-based attacks fail before they start. The kernel’s shared→slave demotion for less-privileged namespaces, combined with runtime hardening, provides meaningful protection.
But the power to break isolation lives in configuration options that promise convenience. Bidirectional mount propagation, host path SELinux relabeling, privileged containers, Docker socket mounts - these are documented features, not bugs. They work exactly as designed. The problem is that their design trades isolation for functionality, and production environments accumulate these trades over time.
The line between secure multi-tenancy and a compromised host is drawn where you choose to override defaults. And if you’re running truly untrusted code, consider whether containers are the right abstraction at all.
gVisor, Firecracker, and Kata exist because sometimes the answer to “is container isolation enough?” is simply “no.”