lockc

Introduction

lockc is open source software that provides MAC (Mandatory Access Control) type security for container workloads.

The main reason why lockc exists is that containers do not contain. Containers are not as secure and isolated as VMs. By default, they expose a lot of information about the host OS and provide ways to "break out" of the container. lockc aims to provide more isolation to containers and make them more secure.

The Containers do not contain documentation section explains what we mean by that phrase and what kind of behavior we want to restrict with lockc.

The main technology behind lockc is eBPF - to be more precise, its ability to attach programs to LSM hooks.
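To give a flavor of how this works, here is a minimal, self-contained sketch of a BPF LSM program in C, following the usual libbpf conventions. This is only an illustration (the hook choice and the deny-everything behavior are arbitrary), not lockc's actual code:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define EPERM 1

char LICENSE[] SEC("license") = "GPL";

/* Attached to the sb_mount LSM hook; denies every mount attempt. */
SEC("lsm/sb_mount")
int BPF_PROG(deny_all_mounts, const char *dev_name, const struct path *path,
             const char *type, unsigned long flags, void *data, int ret)
{
        /* Respect a denial already issued by a previous LSM program. */
        if (ret != 0)
                return ret;
        return -EPERM;
}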

Please note that lockc is currently an experimental project, not meant for production environments. We do not publish any official binaries or packages yet, except for a Rust crate. For now, the most convenient way to use lockc is to build it from source and follow the guide.

Contributing

If you need help or want to talk with contributors, please come chat with us in the #lockc channel on the Rust Cloud Native Discord server.

You can find the source code on GitHub, and issues and feature requests can be posted on the GitHub issue tracker. lockc relies on the community to fix bugs and add features: if you'd like to contribute, please read the CONTRIBUTING guide and consider opening a pull request.

License

lockc's userspace part is licensed under the Apache License, version 2.0.

eBPF programs inside the lockc/src/bpf directory are licensed under the GNU General Public License, version 2.

Documentation is licensed under the Mozilla Public License v2.0.

Containers do not contain

Many people assume that containers:

  • provide the same or similar isolation to virtual machines
  • protect the host system
  • sandbox applications

While all of those points except the first are partially true, some parts of the host filesystem are still exposed to containers by default, and there are ways to gain full access to the host.

This section highlights and explains problematic exploitation possibilities that lockc aims to fix via policies.

Please note that, as lockc is still at an early stage of development, it does not yet protect against all of the examples provided here. However, covering them all is on the roadmap.

The goal of lockc is to eventually prevent a regular user from performing any of those examples. Performing some of them as root, by explicitly choosing the privileged policy level in lockc, is still going to be allowed. However, using the privileged level is discouraged for containers which are not part of the Kubernetes infrastructure (CNI plugins, operators, network meshes etc.). We might still consider restricting some behaviors even for privileged containers (i.e. it's probably hard to justify chroot inside containers under any circumstance).

Not everything is namespaced

Despite the fact that containers come with their own rootfs, some parts of the filesystem are not namespaced, which means that the content of some directories is exactly the same as on the host OS. Examples:

  • Kernel filesystems under /sys
  • many sysctls under /proc/sys

For non-privileged containers, the content of those directories is read-only. However, privileged containers can write to them. In both cases, we think that exposing many of those directories, even without write access, is unnecessary for regular containers.

To give some more concrete examples, access to those directories makes it possible to:

  • Check and change GPU settings
❯ docker run --rm -it opensuse/tumbleweed:latest bash
f4891490a2f3:/ # cat /sys/class/drm/card0/device/power_dpm_force_performance_level
auto
f4891490a2f3:/ # exit
❯ docker run --rm --privileged -it opensuse/tumbleweed:latest bash
bad479286479:/ # echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
bad479286479:/ # cat /sys/class/drm/card0/device/power_dpm_force_performance_level
high
bad479286479:/ # exit
❯ cat /sys/class/drm/card0/device/power_dpm_force_performance_level
high
  • Look at the host OS filesystem metadata
❯ docker run --rm -it opensuse/tumbleweed:latest bash
0d35122d08f9:~ # ls /sys/fs/btrfs/a8222a26-d11e-4276-9c38-9df2812cead2/
allocation  bdi  bg_reclaim_threshold  checksum  clone_alignment  devices  devinfo  exclusive_operation  features  generation  label  metadata_uuid  nodesize  qgroups  quota_override  read_policy  sectorsize
  • Use fdisk in a privileged container
❯ docker run --rm -it --privileged registry.opensuse.org/opensuse/toolbox:latest bash
8b71e0119552:/ # fdisk -l
Disk /dev/nvme0n1: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: Samsung SSD 970 EVO Plus 2TB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 8EEBDAB8-F965-4BA0-918A-2671BC67117C

Device           Start        End    Sectors  Size Type
/dev/nvme0n1p1    2048    1026047    1024000  500M EFI System
/dev/nvme0n1p2 1026048 3907029134 3906003087  1.8T Linux filesystem

Host mounts

Container engines allow bind mounting any directory from the host. When using local, non-clustered container engines (docker, podman etc.), there are no restrictions on what can be mounted. In the case of Docker, anyone who has access to the socket (usually a member of the docker group) can mount anything.

That gives every member of the docker group access to the host OS as root:

❯ docker run --rm --privileged -it -v /:/rootfs opensuse/tumbleweed:latest bash
efa4f6e0529a:/ # chroot /rootfs
sh-4.4#

The chroot works without --privileged as well:

❯ docker run --rm -it -v /:/rootfs opensuse/tumbleweed:latest bash
abb67212044d:/ # chroot /rootfs
sh-4.4#

The other approach is to mount the Docker socket. The image used here is docker, the official image with the Docker binaries installed. After starting the first container, we are able to list the containers running on the host. Then, from inside the first container, we can run another container which mounts directories from the host:

❯ docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock docker sh
/ # docker ps
CONTAINER ID   IMAGE     COMMAND                  CREATED         STATUS         PORTS     NAMES
066811b60d69   docker    "docker-entrypoint.s…"   5 seconds ago   Up 5 seconds             suspicious_liskov
/ # docker run --rm --privileged -it opensuse/tumbleweed:latest bash
fcb94c1d3af6:/ # exit
/ # docker run --rm --privileged -it -v /:/rootfs opensuse/tumbleweed:latest bash
54b08e30fd9e:/ # chroot /rootfs
sh-4.4# cat /etc/os-release
NAME="openSUSE Leap"
VERSION="15.3"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.3"
PRETTY_NAME="openSUSE Leap 15.3"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.3"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"

Notice the difference between the Linux distribution versions. The second container image we used is openSUSE Tumbleweed, but the host is running openSUSE Leap 15.3.

Architecture

The project consists of two parts:

  • the set of BPF programs (written in C)
    • programs for monitoring processes, which detect whether new processes are running inside any container and, if so, apply policies to them
    • programs attached to particular LSM hooks, which allow or deny actions based on the policy applied to the container (currently all containers have the baseline policy applied; the mechanism for differentiating between policies per container/pod is yet to be implemented)
  • lockcd - the userspace program (written in Rust)
    • loads the BPF programs into the kernel and pins them in BPFFS
    • monitors runc processes, registers new containers and determines which policy should be applied to a container
    • in the future, it is going to serve as the configuration manager and log collector
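lockcd itself is written in Rust, but the load-and-pin flow it performs maps onto the classic libbpf C API. The following sketch only illustrates that flow (the object file name and pin path are hypothetical, and program attachment is omitted):

/* Illustrative sketch of a load-and-pin flow with libbpf; not lockcd's
 * actual code (lockcd is written in Rust). */
#include <bpf/libbpf.h>

int load_and_pin(void)
{
        struct bpf_object *obj;
        int err;

        /* Open a compiled BPF object file (hypothetical name). */
        obj = bpf_object__open_file("lockc.bpf.o", NULL);
        if (!obj)
                return -1;

        /* Load all programs and maps into the kernel. */
        err = bpf_object__load(obj);
        if (err)
                goto out;

        /* Pin programs and maps in BPFFS so they survive daemon restarts. */
        err = bpf_object__pin(obj, "/sys/fs/bpf/lockc");
out:
        bpf_object__close(obj);
        return err;
}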

Getting started

Install

lockc provides integration with two container engines and separate installation methods for each of them:

  • Kubernetes (+ containerd-cri) - installation through a Helm chart, after which lockc secures newly created pods
  • Docker - installation on a single machine with Docker as a local container engine

With Docker

This documentation section explains how to install lockc on a single machine with Docker. In order to do that, we need to install the lockcd binary and a systemd unit for it.

Installation methods

There are two ways to do that.

Install with cargo

If you have the lockc source code on the machine, you can install lockc with cargo. You need to build lockc with Cargo first. After building lockc, you can install it with the following command:

cargo xtask install

Do not run this command with sudo! Why?

tl;dr: you will be asked for a password when necessary, don't worry!

Explanation: Running cargo with sudo has weird consequences, like not seeing the cargo content from your home directory or leaving some files owned by root in target. When any destination directory is owned by root, sudo will be launched automatically by xtask install just to perform the necessary installation steps.

By default it tries to install the lockcd binary in /usr/local/bin, but the destination directories can be changed with the following arguments:

  • --destdir - the rootfs of your system, default: /
  • --prefix - prefix for most of the installation destinations, default: usr/local
  • --bindir - directory for binary files, default: bin
  • --unitdir - directory for systemd units, default: lib/systemd/system
  • --sysconfdir - directory for configuration files, default: etc

By default, binaries are installed from the debug target profile. If you want to change that, use the --profile argument. --profile release is what you most likely want when packaging or installing on a production system.
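For example, installing a release build into a custom staging directory (the path here is only illustrative) could look like:

cargo xtask install --profile release --destdir /tmp/lockc-staging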

Unpack the bintar

The documentation sections about building lockc mention Building tarball with binary and unit. To quickly sum it up, you can build a "bintar" by doing:

dapper cargo xtask bintar

or:

cargo xtask bintar

Both commands will produce a bintar available as target/[profile]/lockc.tar.gz (e.g. target/debug/lockc.tar.gz).

That tarball can be copied to any machine and unpacked with the following command:

sudo tar -C / -xzf lockc.tar.gz

Verify the installation

After installing lockc, you should be able to enable and start the lockcd service:

sudo systemctl enable --now lockcd

After starting the service, you can verify that lockc is running by trying to run a "not containing" container, like:

$ docker run --rm -it -v /:/rootfs registry.opensuse.org/opensuse/toolbox:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/" to rootfs at "/rootfs" caused: mount through procfd: operation not permitted: unknown.
ERRO[0020] error waiting for container: context canceled

Or you can run a less insecure container and try to ls the contents of /sys:

$ docker run --rm -it registry.opensuse.org/opensuse/toolbox:latest
9b34d760017f:/ # ls /sys
ls: cannot open directory '/sys': Operation not permitted
9b34d760017f:/ # ls /sys/fs/btrfs
ls: cannot access '/sys/fs/btrfs': No such file or directory
9b34d760017f:/ # ls /sys/fs/cgroup
blkio  cpu,cpuacct  cpuset   freezer  memory  net_cls           net_prio    pids  systemd
cpu    cpuacct      devices  hugetlb  misc    net_cls,net_prio  perf_event  rdma

You should be able to see cgroups (which is fine), but other parts of /sys should be hidden.

However, running insecure containers as root with the privileged policy level should work:

$ sudo -i
# docker run --label org.lockc.policy=privileged --rm -it -v /:/rootfs registry.opensuse.org/opensuse/toolbox:latest bash
8ea310609fce:/ # 

On Kubernetes

This section explains how to install lockc on a Kubernetes cluster with Helm.

The Helm chart is available on the lockc-helm-chart website. Installation with the default values can be done with:

kubectl apply -f https://lockc-project.github.io/helm-charts/namespace.yaml
helm repo add lockc https://lockc-project.github.io/helm-charts/
helm install -n lockc lockc lockc/lockc

More info on the lockc Helm chart installation can be found here.

Policies

lockc provides three policy levels for containers:

  • baseline - meant for regular applications
  • restricted - meant for applications which we need to be more cautious about and secure more strictly
  • privileged - meant for parts of the infrastructure which can have full access to host resources, e.g. CNI plugins in Kubernetes

The default policy level is baseline. The policy level can be changed with the pod-security.kubernetes.io/enforce label on the namespace in which the container is running. We make an exception for the kube-system namespace, for which the privileged policy is applied by default.
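For example, assuming an existing namespace called my-namespace (an illustrative name), the restricted policy level could be selected with:

kubectl label namespace my-namespace pod-security.kubernetes.io/enforce=restricted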

For now it is not possible to apply policy levels on local container engines (Docker, containerd, podman), but such a feature is planned for the future.

File access

lockc comes with file access policies based on allow- and deny-listing. The baseline and restricted policies have their own pairs of lists. All those lists contain path prefixes. All the children of the listed paths/directories are included, since the decision is made by prefix matching.

The deny list has precedence over the allow list. That's because the main purpose of the deny list is specifying exceptions: paths whose prefixes appear in the allow list, but which we still don't want to allow.

To sum it up, when any process in the container tries to access a file, lockc does the following (a sketch of this logic is shown after the list):

  1. Checks whether the given path's prefix is in the deny list. If yes, denies the access.
  2. Checks whether the given path's prefix is in the allow list. If yes, allows the access.
  3. In case of no matches, denies the access.
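A minimal sketch of this decision logic in C (the list contents and function names here are illustrative, not lockc's actual code, which performs the matching in BPF):

#include <stdbool.h>
#include <string.h>

/* Illustrative lists; the real ones live in lockc's BPF maps. */
static const char *deny_list[]  = { "/proc/acpi", NULL };
static const char *allow_list[] = { "/bin", "/etc", "/home", "/proc", NULL };

static bool matches_prefix(const char *path, const char **list)
{
        for (size_t i = 0; list[i] != NULL; i++)
                if (strncmp(path, list[i], strlen(list[i])) == 0)
                        return true;
        return false;
}

/* Returns true if access to path should be allowed. */
bool access_allowed(const char *path)
{
        if (matches_prefix(path, deny_list))
                return false;   /* 1. deny list has precedence */
        if (matches_prefix(path, allow_list))
                return true;    /* 2. allowed prefix found */
        return false;           /* 3. no match: deny */
}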

By default, the contents of lists are:

  • baseline
    • allow list
      • /bin
      • /dev/console
      • /dev/full
      • /dev/null
      • /dev/pts
      • /dev/tty
      • /dev/urandom
      • /dev/zero
      • /etc
      • /home
      • /lib
      • /proc
      • /sys/fs/cgroup
      • /tmp
      • /usr
      • /var
    • deny list
      • /proc/acpi
  • restricted
    • allow list
      • /bin
      • /dev/console
      • /dev/full
      • /dev/null
      • /dev/pts
      • /dev/tty
      • /dev/urandom
      • /dev/zero
      • /etc
      • /home
      • /lib
      • /proc
      • /sys/fs/cgroup
      • /tmp
      • /usr
      • /var
    • deny list
      • /proc/acpi
      • /proc/sys

By default, with the baseline policy level, this is a good example of disallowed behavior:

# docker run --rm -it registry.opensuse.org/opensuse/toolbox:latest
9b34d760017f:/ # ls /sys
ls: cannot open directory '/sys': Operation not permitted
9b34d760017f:/ # ls /sys/fs/btrfs
ls: cannot access '/sys/fs/btrfs': No such file or directory
9b34d760017f:/ # ls /sys/fs/cgroup
blkio  cpu,cpuacct  cpuset   freezer  memory  net_cls           net_prio    pids  systemd
cpu    cpuacct      devices  hugetlb  misc    net_cls,net_prio  perf_event  rdma

We are able to see cgroups (which is fine), but other parts of /sys are hidden.

Mount policies

lockc comes with the following policies about bind mounts from the host filesystem to containers (via the -v option) for each policy level:

  • baseline - allows bind mounts only from inside /home and /var/data
  • restricted - does not allow any bind mounts from the host
  • privileged - no restrictions, everything can be bind mounted

The baseline behavior in lockc is slightly different from that of the Kubernetes Pod Security Admission controller, which disallows all host mounts for baseline containers as well as restricted ones. The motivation behind lockc allowing /home and /var/data is that they are often used with local container engines (Docker, podman) for reasons like the following (an example of an allowed mount is shown after the list):

  • mounting the source code to build or check
  • storing database content on the host for local development
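Conversely, because /home is on the baseline allow list, a mount like the following (the host path is illustrative) should be permitted under the baseline policy:

❯ docker run --rm -it -v /home/user/project:/src opensuse/tumbleweed:latest bash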

By default, with the baseline policy level, this is a good example of disallowed behavior:

# docker run --rm -it -v /:/rootfs registry.opensuse.org/opensuse/toolbox:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/" to rootfs at "/rootfs" caused: mount through procfd: operation not permitted: unknown.

Syslog

lockc comes with the following policies about access to the kernel message ring buffer for each policy level:

  • baseline - not allowed
  • restricted - not allowed
  • privileged - allowed

By default, with the baseline policy level, checking the kernel logs from the container is not allowed:

# docker run -it --rm registry.opensuse.org/opensuse/toolbox:latest
b10f9fa4a385:/ # dmesg
dmesg: read kernel buffer failed: Operation not permitted

Tuning

This guide shows options and tricks for achieving optimal performance and resource usage.

Memory usage

Memory usage by lockc depends mostly on the size of its BPF maps. BPF maps are stored in memory, and the biggest ones are those related to tracking processes and containers. The size of those maps depends on the limit on the number of processes (in separate memory spaces) in the system, which is determined by the kernel.pid_max sysctl. By default the limit is 32768. With such a limit, memory usage by lockc should be approximately 10-20 MB.
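As a rough back-of-the-envelope illustration (the per-entry cost here is an assumption, not a measured value): if tracking a single PID takes a few hundred bytes across the process and container maps, then 32768 entries amount to roughly 32768 * 300 B ≈ 10 MB, which is consistent with the estimate above.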

If you observe too much memory being used after installing lockc, check the value of kernel.pid_max, which can be done with:

sudo sysctl kernel.pid_max

That value can be changed (e.g. to 10000) with:

sudo sysctl kernel.pid_max=10000

But that change will not persist across reboots. Changing it persistently requires adding a configuration file to /etc/sysctl.d. For example, we could create the file /etc/sysctl.d/05-lockc.conf with the following content:

kernel.pid_max = 10000

After creating that file, the lower limit will persist across reboots.

For Developers

Repositories

lockc currently uses two git repositories:

  • lockc - the main repository, containing the lockc source code
  • helm-charts - the repository containing the Helm charts used to deploy lockc on Kubernetes

If you are interested in development and contributing to lockc, we recommend forking and cloning both of them. Both will be needed, especially for building a development environment based on Terraform.

The latter chapters assume that you have lockc and helm-charts cloned in the same parent directory. For example, as $HOME/my-repositories/lockc and $HOME/my-repositories/helm-charts.

Building lockc

The first step to try out lockc is to build it. There are several ways to do that:

  • Cargo - build binaries with Cargo (Rust build system) on the host
    • convenient for local development, IDE/editor integration
  • Container image - build a container image which can be deployed on Kubernetes
    • the only method to try lockc on Kubernetes
    • doesn't work for the Docker integration, where we instead install lockc as a binary on the host, managed by systemd

Cargo

lockc is written entirely in Rust and uses Cargo as a build system.

Prerequisites

This document assumes that you have Rust installed with rustup.

To build lockc, you will need Rust stable and nightly:

rustup install stable
rustup toolchain install nightly --component rust-src

Then you need to install bpf-linker for linking eBPF programs:

cargo install bpf-linker

By default, bpf-linker tries to use the internal LLVM library available in Rust. That might not work if you are using a musl target. In that case, you need to install LLVM with static libraries on your host system and then install bpf-linker with a different command:

cargo install --git https://github.com/aya-rs/bpf-linker --tag v0.9.3 --no-default-features --features system-llvm -- bpf-linker

Building lockc

After installing all needed dependencies, it's time to build lockc.

You can build and run the entire project with:

cargo xtask run

If you prefer to only build the project, it can be done with:

cargo xtask build-ebpf
cargo build

Running tests:

cargo test

Running lints:

cargo clippy

Installing lockc

To install lockc on your host, use the following command:

cargo xtask install

Do not run this command with sudo! Why?

tl;dr: you will be asked for a password when necessary, don't worry!

Explanation: Running cargo with sudo has weird consequences, like not seeing the cargo content from your home directory or leaving some files owned by root in target. When any destination directory is owned by root, sudo will be launched automatically by xtask install just to perform the necessary installation steps.

By default it tries to install the lockcd binary in /usr/local/bin, but the destination directories can be changed with the following arguments:

  • --destdir - the rootfs of your system, default: /
  • --prefix - prefix for most of the installation destinations, default: usr/local
  • --bindir - directory for binary files, default: bin
  • --unitdir - directory for systemd units, default: lib/systemd/system
  • --sysconfdir - directory for configuration files, default: etc

By default, binaries are installed from the debug target profile. If you want to change that, use the --profile argument. --profile release is what you most likely want when packaging or installing on a production system.

Building tarball with binary and unit

To make distribution of lockc for Docker users easier, it is possible to build an archive with the binary and systemd unit which can simply be unpacked in the / directory. It can be done with the following command:

cargo xtask bintar

By default it archives the lockcd binary under usr/local/bin, but the destination directories can be changed with the following arguments:

  • --prefix - prefix for most of the installation destinations, default: usr/local
  • --bindir - directory for binary files, default: bin
  • --unitdir - directory for systemd units, default: lib/systemd/system
  • --sysconfdir - directory for configuration files, default: etc

By default, binaries are installed from the debug target profile. If you want to change that, use the --profile argument. --profile release is what you most likely want when creating a tarball for releases and production systems.

The resulting tarball should be available as target/[profile]/lockc.tar.gz (e.g. target/debug/lockc.tar.gz).
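For example, to build a release tarball:

cargo xtask bintar --profile release

The archive will then be available as target/release/lockc.tar.gz.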

Container image

The lockc repository contains a Dockerfile which can be used for building a container image. The main purpose of building it is the ability to deploy lockc on Kubernetes.

Building a local image can be done in a basic way, like:

docker build -t lockcd .

For quick development and usage of the image on different (virtual) machines, it's convenient to use ttl.sh, which is an anonymous and ephemeral container image registry.

To build and push an image to ttl.sh, you can use the following commands:

export IMAGE_NAME=$(uuidgen)
docker build -t ttl.sh/${IMAGE_NAME}:30m .
docker push ttl.sh/${IMAGE_NAME}:30m

After building the container image, you will be able to install lockc on Kubernetes.

Development environment (Vagrant)

It is possible to run lockc, built from source, in a virtual machine using Vagrant:

vagrant up

Our Vagrantfile supports the following environment variables:

  • LOCKC_VAGRANT_CPUS - number of vCPUs
  • LOCKC_VAGRANT_MEMORY - memory (in MB)
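For example, to provision the VM with 4 vCPUs and 4096 MB of memory (arbitrary values), run:

LOCKC_VAGRANT_CPUS=4 LOCKC_VAGRANT_MEMORY=4096 vagrant up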

When the VM is provisioned successfully, you can access it using:

vagrant ssh

That VM is running k3s. It's also running lockc as a systemd service, which can be checked with:

sudo systemctl status lockc
sudo journalctl -fu lockc

The lockc source tree is available in the /vagrant directory. After making changes in the code, you can sync them (from the host):

vagrant rsync

Then build, install and restart lockc in the VM (inside a vagrant ssh session):

sudo systemctl stop lockc
cargo xtask build-ebpf
cargo build
cargo xtask install
sudo systemctl start lockc

Demos

This section of the book contains demos.

Mount policies

Kubernetes

The following demo shows mount policies being enforced on Kubernetes pods.

YAML files can be found here.

The policy violations in the deployments-should-fail.yaml file are:

  • the nginx-restricted-fail deployment trying to make a host mount while having the restricted policy
  • the bpf-default-fail and bpf-baseline-fail deployments trying to mount /sys/fs/bpf while having the baseline policy
  • bpf-restricted-fail trying to mount /sys/fs/bpf while having the restricted policy.
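As an illustration, the kind of volume that makes those deployments violate the mount policy is a hostPath mount of /sys/fs/bpf. In a pod spec it looks roughly like this (abbreviated; see the linked YAML files for the full definitions):

volumes:
  - name: bpffs
    hostPath:
      path: /sys/fs/bpf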

asciicast