Containers do not contain

Many people assume that containers:

  • provide the same or similar isolation as virtual machines
  • protect the host system
  • sandbox applications

The first assumption is false, and the other two are only partially true: some parts of the host filesystem are still exposed to containers by default, and there are ways to gain full access to the host.

This section highlights and explains problematic exploitation possibilities that lockc aims to fix via policies.

Please note that lockc is still at an early stage of development, so it does not yet protect against all of the examples provided here. Covering them all is on the roadmap.

The goal of lockc is to eventually prevent a regular user from carrying out any of those examples. Running some of them as root, by explicitly choosing the privileged policy level in lockc, will still be allowed. However, using the privileged level is discouraged for containers which are not part of the Kubernetes infrastructure (CNI plugins, operators, network meshes etc.). We might still consider restricting some behaviors even at the privileged level (e.g. it's probably hard to justify chroot inside containers under any circumstances).

Not everything is namespaced

Despite the fact that containers come with their own rootfs, some parts of the filesystem are not namespaced, which means that the content of some directories is exactly the same as on the host OS. Examples:

  • kernel filesystems under /sys
  • many sysctls under /proc/sys

For non-privileged containers, the content of those directories is read-only, whereas privileged containers can write to them. We think that exposing many of those directories, even without write access, is unnecessary for regular containers.
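One way to see how a given container is exposed to these directories is to look at its mount table. The following is a minimal sketch, not part of lockc: it assumes a Linux system that provides /proc/self/mounts, and mount_mode is a hypothetical helper name introduced only for illustration. Run inside a container, it reports whether /sys and /proc/sys are mounted read-only or read-write.

```shell
#!/bin/sh
# mount_mode is a hypothetical helper (not part of lockc or Docker).
# It prints "ro" or "rw" for a mount point, based on a mount table in
# /proc/self/mounts format: device, mount point, fstype, options, ...
# If the path is mounted several times, the last entry wins.
mount_mode() {
    target=$1
    table=${2:-/proc/self/mounts}
    awk -v target="$target" '
        $2 == target {
            mode = "unknown"
            # Options are comma separated, e.g. "ro,nosuid,nodev".
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro" || opts[i] == "rw") mode = opts[i]
        }
        END { print (mode == "" ? "not-mounted" : mode) }
    ' "$table"
}

# Report the mode of the directories discussed above.
for dir in /sys /proc/sys; do
    printf '%s: %s\n' "$dir" "$(mount_mode "$dir")"
done
```

With default Docker settings we would expect ro for both paths in a non-privileged container and rw for /sys in a privileged one. On the host itself /proc/sys is usually not a separate mount point, so the script reports not-mounted for it there.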

To give some more concrete examples, access to those directories makes it possible to:

  • check and change GPU settings
❯ docker run --rm -it opensuse/tumbleweed:latest bash
f4891490a2f3:/ # cat /sys/class/drm/card0/device/power_dpm_force_performance_level
auto
f4891490a2f3:/ # exit
❯ docker run --rm --privileged -it opensuse/tumbleweed:latest bash
bad479286479:/ # echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
bad479286479:/ # cat /sys/class/drm/card0/device/power_dpm_force_performance_level
high
bad479286479:/ # exit
❯ cat /sys/class/drm/card0/device/power_dpm_force_performance_level
high
  • look at the host OS filesystem metadata
❯ docker run --rm -it opensuse/tumbleweed:latest bash
0d35122d08f9:~ # ls /sys/fs/btrfs/a8222a26-d11e-4276-9c38-9df2812cead2/
allocation  bdi  bg_reclaim_threshold  checksum  clone_alignment  devices  devinfo  exclusive_operation  features  generation  label  metadata_uuid  nodesize  qgroups  quota_override  read_policy  sectorsize
  • use fdisk in a privileged container
❯ docker run --rm -it --privileged registry.opensuse.org/opensuse/toolbox:latest bash
8b71e0119552:/ # fdisk -l
Disk /dev/nvme0n1: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: Samsung SSD 970 EVO Plus 2TB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 8EEBDAB8-F965-4BA0-918A-2671BC67117C

Device           Start        End    Sectors  Size Type
/dev/nvme0n1p1    2048    1026047    1024000  500M EFI System
/dev/nvme0n1p2 1026048 3907029134 3906003087  1.8T Linux filesystem

Host mounts

Container engines allow bind mounting any directory from the host. With local, non-clustered container engines (Docker, Podman etc.) there are no restrictions on what can be mounted. In the case of Docker, anyone who has access to the socket (usually a member of the docker group) can mount anything.

That gives every member of the docker group access to the host OS as root:

❯ docker run --rm --privileged -it -v /:/rootfs opensuse/tumbleweed:latest bash
efa4f6e0529a:/ # chroot /rootfs
sh-4.4#

The chroot works without --privileged as well:

❯ docker run --rm -it -v /:/rootfs opensuse/tumbleweed:latest bash
abb67212044d:/ # chroot /rootfs
sh-4.4#
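Because membership in the docker group is effectively equivalent to root access, it can be useful to check who has it. The following is a minimal sketch under a few assumptions: the group is named docker, the socket lives at /var/run/docker.sock, and getent is available. group_members is a hypothetical helper name, and users whose primary group is docker are not listed in the member field, so they would be missed.

```shell
#!/bin/sh
# group_members is a hypothetical helper: it prints the members of a group,
# one per line, given an entry in /etc/group format
# (name:password:gid:member1,member2,...).
group_members() {
    printf '%s\n' "$1" | awk -F: '{
        n = split($4, m, ",")
        for (i = 1; i <= n; i++) if (m[i] != "") print m[i]
    }'
}

# Assumption: the group that owns the socket is called "docker".
if entry=$(getent group docker 2>/dev/null); then
    echo "Members of the docker group:"
    group_members "$entry"
else
    echo "No docker group found on this system"
fi

# Assumption: default socket location. Show its owner and permissions.
sock=/var/run/docker.sock
if [ -S "$sock" ]; then
    ls -l "$sock"
fi
```

Every user printed by this script can run the chroot examples above.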

Another approach is to mount the Docker socket. The image used here is docker, the official image with the Docker binaries installed. After starting the first container, we can list the containers running on the host. Then, from inside the first container, we can run another one that mounts directories from the host:

❯ docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock docker sh
/ # docker ps
CONTAINER ID   IMAGE     COMMAND                  CREATED         STATUS         PORTS     NAMES
066811b60d69   docker    "docker-entrypoint.s…"   5 seconds ago   Up 5 seconds             suspicious_liskov
/ # docker run --rm --privileged -it opensuse/tumbleweed:latest bash
fcb94c1d3af6:/ # exit
/ # docker run --rm --privileged -it -v /:/rootfs opensuse/tumbleweed:latest bash
54b08e30fd9e:/ # chroot /rootfs
sh-4.4# cat /etc/os-release
NAME="openSUSE Leap"
VERSION="15.3"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.3"
PRETTY_NAME="openSUSE Leap 15.3"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.3"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"

Notice the difference between the Linux distributions. The second container image we used is openSUSE Tumbleweed, but /etc/os-release inside the chroot reports openSUSE Leap 15.3, which is what the host is running.
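To spot running containers that use either of the mounts shown in this section, the mount sources reported by docker inspect can be filtered. This is a rough sketch rather than a complete audit: flag_risky_mounts is a hypothetical helper name, the list of paths it treats as risky is our own assumption, and mount sources containing spaces are not handled.

```shell
#!/bin/sh
# flag_risky_mounts is a hypothetical helper. It reads lines of the form
# "<container-name> <mount-source>..." on stdin and prints the containers
# whose mounts include the host root or the Docker socket.
flag_risky_mounts() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "/" || $i == "/var/run/docker.sock" || $i == "/run/docker.sock")
                print $1 ": mounts " $i
    }'
}

if command -v docker >/dev/null 2>&1; then
    ids=$(docker ps -q 2>/dev/null)
    if [ -n "$ids" ]; then
        # $ids is intentionally unquoted so that each ID becomes one argument.
        docker inspect --format '{{.Name}}{{range .Mounts}} {{.Source}}{{end}}' $ids |
            flag_risky_mounts
    fi
fi
```

Running the script requires access to the Docker socket itself, so it is meant for an administrator on the host rather than for use inside a container.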