Introduction
lockc is open source software that provides MAC (Mandatory Access Control) type security for container workloads.
The main reason why lockc exists is that containers do not contain. Containers are not as secure and isolated as VMs. By default, they expose a lot of information about the host OS and provide ways to "break out" of the container. lockc aims to provide more isolation to containers and make them more secure.
The Containers do not contain documentation section explains what we mean by that phrase and what kind of behavior we want to restrict with lockc.
The main technology behind lockc is eBPF - to be more precise, its ability to attach to LSM hooks.
Please note that lockc is currently an experimental project, not meant for production environments. We do not publish any official binaries or packages yet, except for a Rust crate. For now, the most convenient way to try lockc is to build it from source and follow the guide.
Contributing
If you need help or want to talk with contributors, come chat with us on the #lockc channel on the Rust Cloud Native Discord server.
You can find the source code on GitHub, and issues and feature requests can be posted on the GitHub issue tracker. lockc relies on the community to fix bugs and add features: if you'd like to contribute, please read the CONTRIBUTING guide and consider opening a pull request.
License
lockc's userspace part is licensed under the Apache License, version 2.0.
The eBPF programs inside the lockc/src/bpf directory are licensed under the GNU General Public License, version 2.
The documentation is licensed under the Mozilla Public License v2.0.
Containers do not contain
Many people assume that containers:
- provide the same or similar isolation to virtual machines
- protect the host system
- sandbox applications
While all the points except the first are partially true, some parts of the host filesystem are still exposed to containers by default, and there are ways to gain full access to the host.
This section highlights and explains problematic exploitation possibilities that lockc aims to fix via policies.
Please note that, as lockc is still in an early development stage, it does not yet protect against all of the examples provided here. However, covering them all is on the roadmap.
The goal of lockc is to eventually prevent any of those examples from being performed by a regular user. Performing some of them as root, by explicitly choosing the privileged policy level in lockc, is still going to be allowed. However, using the privileged level is discouraged for containers which are not part of Kubernetes infrastructure (CNI plugins, operators, network meshes etc.). We might still consider restricting some of these behaviors even for privileged containers (for example, it's probably hard to justify chroot inside containers under any circumstance).
Not everything is namespaced
Despite the fact that containers come with their own rootfs, some parts of the filesystem are not namespaced, which means that the content of some directories is exactly the same as on the host OS. Examples:
- Kernel filesystems under /sys
- many sysctls under /proc/sys
For non-privileged containers, the content of those directories is read-only. However, privileged containers can write to them. In both cases, we think that exposing many of those directories, even without write access, is unnecessary for regular containers.
To give some more concrete examples, access to those directories can allow a container to:
- Check and change GPU settings
❯ docker run --rm -it opensuse/tumbleweed:latest bash
f4891490a2f3:/ # cat /sys/class/drm/card0/device/power_dpm_force_performance_level
auto
f4891490a2f3:/ # exit
❯ docker run --rm --privileged -it opensuse/tumbleweed:latest bash
bad479286479:/ # echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
bad479286479:/ # cat /sys/class/drm/card0/device/power_dpm_force_performance_level
high
bad479286479:/ # exit
❯ cat /sys/class/drm/card0/device/power_dpm_force_performance_level
high
- Look at the host OS filesystem metadata
❯ docker run --rm -it opensuse/tumbleweed:latest bash
0d35122d08f9:~ # ls /sys/fs/btrfs/a8222a26-d11e-4276-9c38-9df2812cead2/
allocation bdi bg_reclaim_threshold checksum clone_alignment devices devinfo exclusive_operation features generation label metadata_uuid nodesize qgroups quota_override read_policy sectorsize
- Use fdisk in a privileged container
❯ docker run --rm -it --privileged registry.opensuse.org/opensuse/toolbox:latest bash
8b71e0119552:/ # fdisk -l
Disk /dev/nvme0n1: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: Samsung SSD 970 EVO Plus 2TB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 8EEBDAB8-F965-4BA0-918A-2671BC67117C
Device Start End Sectors Size Type
/dev/nvme0n1p1 2048 1026047 1024000 500M EFI System
/dev/nvme0n1p2 1026048 3907029134 3906003087 1.8T Linux filesystem
Host mounts
Container engines allow bind mounting any directory from the host. When using local, non-clustered container engines (docker, podman etc.), there are no restrictions on what can be mounted. In the case of Docker, anyone who has access to the socket (usually a member of the docker group) can mount anything. That gives every member of the docker group access to the host OS as root:
❯ docker run --rm --privileged -it -v /:/rootfs opensuse/tumbleweed:latest bash
efa4f6e0529a:/ # chroot /rootfs
sh-4.4#
The chroot works without --privileged as well:
❯ docker run --rm -it -v /:/rootfs opensuse/tumbleweed:latest bash
abb67212044d:/ # chroot /rootfs
sh-4.4#
The other approach is to mount the Docker socket. The image used here is docker, the official image with Docker binaries installed. After starting the first container, we are able to list the containers running on the host. Then we are able to run another container - from inside the first one - which mounts directories from the host:
❯ docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock docker sh
/ # docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
066811b60d69 docker "docker-entrypoint.s…" 5 seconds ago Up 5 seconds suspicious_liskov
/ # docker run --rm --privileged -it opensuse/tumbleweed:latest bash
fcb94c1d3af6:/ # exit
/ # docker run --rm --privileged -it -v /:/rootfs opensuse/tumbleweed:latest bash
54b08e30fd9e:/ # chroot /rootfs
sh-4.4# cat /etc/os-release
NAME="openSUSE Leap"
VERSION="15.3"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.3"
PRETTY_NAME="openSUSE Leap 15.3"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.3"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
Notice the difference between the Linux distribution versions. The second container image we used is openSUSE Tumbleweed, but the host is running openSUSE Leap 15.3.
Architecture
The project consists of 3 parts:
- the set of BPF programs (written in C)
- programs for monitoring processes, which detect when new processes are running inside any container, so that policies can be applied to them
- programs attached to particular LSM hooks, which allow or deny actions based on the policy applied to the container (currently all containers have the baseline policy applied; the mechanism for differentiating policies per container/pod is yet to be implemented)
- lockcd - the userspace program (written in Rust)
- loads the BPF programs into the kernel and pins them in BPFFS (see the sketch after this list)
- monitors runc processes, registers new containers and determines which policy should be applied to a container
- in the future, it's going to serve as a configuration manager and log collector
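After lockcd loads and pins the programs, you can peek at them from the host with bpftool. This is only a quick sketch: bpftool must be installed, and the pin location under BPFFS is an assumption here, as lockc may organize its pins differently:
sudo bpftool prog list | grep lsm
sudo ls /sys/fs/bpf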
Getting started
- Build - How to build lockc from the sources
- Install - Configuring and installing lockc
- Development environment (Terraform) - Using Terraform for setting up a lockc development environment
Install
lockc provides integration with two container engines and separate installation methods for each of them:
- Kubernetes (+ containerd-cri) - installation through a Helm chart, after which lockc secures newly created pods
- Docker - installation on a single machine with Docker as a local container engine
With Docker
This documentation section explains how to install lockc on a single machine with Docker. In order to do that, we need to install the lockcd binary and a systemd unit for it.
Installation methods
There are two ways to do that.
Install with cargo
If you have the lockc source code on the machine, you can install lockc there with cargo. You need to build lockc with Cargo first. After building it, you can install it with the following command:
cargo xtask install
Do not run this command with sudo! Why?
tl;dr: you will be asked for a password when necessary, don't worry!
Explanation: Running cargo with sudo ends with weird consequences, like not seeing cargo content from your home directory or leaving some files owned by root in target. When any destination directory is owned by root, sudo will be launched automatically by xtask install just to perform the necessary installation steps.
By default it tries to install the lockcd binary in /usr/local/bin, but the destination directories can be changed with the following arguments:
- --destdir - the rootfs of your system, default: /
- --prefix - prefix for most of the installation destinations, default: usr/local
- --bindir - directory for binary files, default: bin
- --unitdir - directory for systemd units, default: lib/systemd/system
- --sysconfdir - directory for configuration files, default: etc
By default, binaries are installed from the debug target profile. If you want to change that, use the --profile argument. --profile release is what you most likely want to use when packaging or installing on a production system.
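As an example, here is a release installation that just spells out the default destinations using the flags documented above:
cargo xtask install --profile release --destdir / --prefix usr/local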
Unpack the bintar
The documentation sections about building lockc mention Building tarball with binary and unit. To quickly sum it up, you can build a "bintar" by doing:
dapper cargo xtask bintar
or:
cargo xtask bintar
Both commands will produce a bintar available as target/[profile]/lockc.tar.gz (e.g. target/debug/lockc.tar.gz).
That tarball can be copied to any machine and unpacked with the following command:
sudo tar -C / -xzf lockc.tar.gz
Verify the installation
After installing lockc, you should be able to enable and start the lockcd service:
sudo systemctl enable --now lockcd
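Before testing, you can also check that the service started cleanly and watch its logs:
sudo systemctl status lockcd
sudo journalctl -fu lockcd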
After starting the service, you can verify that lockc is running by trying to run a "not containing" container, like:
$ docker run --rm -it -v /:/rootfs registry.opensuse.org/opensuse/toolbox:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/" to rootfs at "/rootfs" caused: mount through procfd: operation not permitted: unknown.
ERRO[0020] error waiting for container: context canceled
Or you can try to run a less insecure container and try to ls the contents of /sys:
$ docker run --rm -it registry.opensuse.org/opensuse/toolbox:latest
9b34d760017f:/ # ls /sys
ls: cannot open directory '/sys': Operation not permitted
9b34d760017f:/ # ls /sys/fs/btrfs
ls: cannot access '/sys/fs/btrfs': No such file or directory
9b34d760017f:/ # ls /sys/fs/cgroup
blkio cpu,cpuacct cpuset freezer memory net_cls net_prio pids systemd
cpu cpuacct devices hugetlb misc net_cls,net_prio perf_event rdma
You should be able to see cgroups (which is fine), but other parts of /sys should be hidden.
However, running insecure containers as root with the privileged policy level should work:
$ sudo -i
# docker run --label org.lockc.policy=privileged --rm -it -v /:/rootfs registry.opensuse.org/opensuse/toolbox:latest bash
8ea310609fce:/ #
On Kubernetes
This section explains how to install lockc on a Kubernetes cluster with Helm.
The Helm chart is available on the lockc-helm-chart website. Installation with default values can be done with:
kubectl apply -f https://lockc-project.github.io/helm-charts/namespace.yaml
helm repo add lockc https://lockc-project.github.io/helm-charts/
helm install -n lockc lockc lockc/lockc
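To verify that everything is up, you can check the pods in the lockc namespace (assuming the namespace created by the first command above):
kubectl -n lockc get pods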
More info on lockc Helm chart installation can be found here.
Policies
lockc provides three policy levels for containers:
- baseline - meant for regular applications
- restricted - meant for applications for which we need to be more cautious and secure them more strictly
- privileged - meant for parts of the infrastructure which need full access to host resources, e.g. CNI plugins in Kubernetes
The default policy level is baseline. The policy level can be changed with the pod-security.kubernetes.io/enforce label on the namespace in which the container is running. We make an exception for the kube-system namespace, to which the privileged policy is applied by default.
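For example, to enforce the restricted policy level for all containers in a hypothetical namespace called demo:
kubectl label namespace demo pod-security.kubernetes.io/enforce=restricted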
For now, there is no way to apply policy levels on local container engines (Docker, containerd, podman), but such a feature is planned for the future.
File access
lockc comes with file access policies based on allow- and deny-listing. The baseline and restricted policy levels have their own pairs of lists. All those lists contain path prefixes: all children of listed paths/directories are included, since the decision is made by prefix matching.
The deny list has precedence over the allow list. That's because the main purpose of the deny list is to specify exceptions: paths whose prefixes are in the allow list, but which we don't want to allow anyway. For example, /proc is allowed by default, but its child /proc/acpi is denied.
To sum it up, when any process in the container tries to access a file, lockc:
- Checks whether the given path matches a prefix in the deny list. If yes, it denies the access.
- Checks whether the given path matches a prefix in the allow list. If yes, it allows the access.
- If there is no match, it denies the access.
By default, the contents of the lists are:
- baseline
- allow list
- /bin
- /dev/console
- /dev/full
- /dev/null
- /dev/pts
- /dev/tty
- /dev/urandom
- /dev/zero
- /etc
- /home
- /lib
- /proc
- /sys/fs/cgroup
- /tmp
- /usr
- /var
- deny list
- /proc/acpi
- restricted
- allow list
- /bin
- /dev/console
- /dev/full
- /dev/null
- /dev/pts
- /dev/tty
- /dev/urandom
- /dev/zero
- /etc
- /home
- /lib
- /proc
- /sys/fs/cgroup
- /tmp
- /usr
- /var
- deny list
- /proc/acpi
- /proc/sys
By default, with the baseline policy level, this is a good example of behavior that is not allowed:
# docker run --rm -it registry.opensuse.org/opensuse/toolbox:latest
9b34d760017f:/ # ls /sys
ls: cannot open directory '/sys': Operation not permitted
9b34d760017f:/ # ls /sys/fs/btrfs
ls: cannot access '/sys/fs/btrfs': No such file or directory
9b34d760017f:/ # ls /sys/fs/cgroup
blkio cpu,cpuacct cpuset freezer memory net_cls net_prio pids systemd
cpu cpuacct devices hugetlb misc net_cls,net_prio perf_event rdma
We are able to see cgroups (which is fine), but other parts of /sys are hidden.
Mount policies
lockc comes with the following policies about bind mounts from the host filesystem to containers (via the -v option) for each policy level:
- baseline - allows bind mounts from inside /home and /var/data
- restricted - does not allow any bind mounts from the host
- privileged - no restrictions, everything can be bind mounted
The baseline behavior in lockc is slightly different from the Kubernetes Pod Security Admission controller, which disallows all host mounts for baseline containers as well as for restricted ones. The motivation behind allowing /home and /var/data in lockc is that they are often used with local container engines (Docker, podman) for purposes like:
- mounting the source code to build or check
- storing database content on the host for local development
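For example, under the baseline policy, a bind mount from inside /home should still be allowed (the source directory here is just an illustration):
docker run --rm -it -v /home/user/project:/src registry.opensuse.org/opensuse/toolbox:latest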
By default, with the baseline policy level, this is a good example of behavior that is not allowed:
# docker run --rm -it -v /:/rootfs registry.opensuse.org/opensuse/toolbox:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/" to rootfs at "/rootfs" caused: mount through procfd: operation not permitted: unknown.
Syslog
lockc comes with the following policies about access to the kernel message ring buffer for each policy level:
- baseline - not allowed
- restricted - not allowed
- privileged - allowed
By default, with the baseline policy level, checking the kernel logs from the container is not allowed:
# docker run -it --rm registry.opensuse.org/opensuse/toolbox:latest
b10f9fa4a385:/ # dmesg
dmesg: read kernel buffer failed: Operation not permitted
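Conversely, as root with the privileged policy level (applied here with the org.lockc.policy label shown in the Docker installation section), reading the kernel logs should work:
# docker run --label org.lockc.policy=privileged --rm -it registry.opensuse.org/opensuse/toolbox:latest dmesg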
Tuning
This guide shows options and tricks for achieving optimal performance and resource usage.
Memory usage
Memory usage by lockc depends mostly on BPF maps size. BPF maps are stored in
memory and the biggest BPF maps are the ones related to tracking processes and
containers. Size of those maps depends on the limit of processes (in separate
memory spaces) in the system. That limit is determined by the kernel.pid_max
sysctl. By default the limit is 32768. With such limit, memory usage by lockc
should be aproximately 10-20 MB.
If you observe too much memory being used after installing lockc, check the value of kernel.pid_max, which can be done with:
sudo sysctl kernel.pid_max
Changing that value (e.g. to 10000) can be done with:
sudo sysctl kernel.pid_max=10000
But that change will not persist after a reboot. Changing it persistently requires adding a configuration file to /etc/sysctl.d. For example, we could create the file /etc/sysctl.d/05-lockc.conf with the following content:
kernel.pid_max = 10000
After creating that file, the lower limit will persist across reboots.
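As a shortcut, you can create the file and apply it immediately, without waiting for a reboot:
echo "kernel.pid_max = 10000" | sudo tee /etc/sysctl.d/05-lockc.conf
sudo sysctl --system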
For Developers
- Repositories - Cloning and working with our git repositories
- Build - How to build lockc from the sources
- Development environment (Terraform) - Using Terraform for setting up a lockc development environment
Repositories
lockc currently uses two git repositories:
- lockc-project/lockc - the main repository containing lockc source code
- lockc-project/helm-charts - repository with Helm charts to deploy lockc on Kubernetes
If you are interested in development and contributing to lockc, we recommend forking and cloning both of them. Both will be needed, especially for building a development environment based on Terraform.
The later chapters assume that you have lockc and helm-charts cloned in the same parent directory, for example as $HOME/my-repositories/lockc and $HOME/my-repositories/helm-charts.
Building lockc
The first step to try out lockc is to build it. There are several ways to do that:
- Cargo - build binaries with Cargo (Rust build system) on the host
- convenient for local development, IDE/editor integration
- Container image - build a container image which can be deployed on Kubernetes
- the only method to try lockc on Kubernetes
- doesn't work for the Docker integration, where lockc is instead installed as a binary on the host, managed by systemd
Cargo
lockc is written entirely in Rust and uses Cargo as a build system.
Prerequisites
This document assumes that you have Rust installed with rustup.
To build lockc, you will need Rust stable and nightly:
rustup install stable
rustup toolchain install nightly --component rust-src
Then you need to install bpf-linker
for linking eBPF programs:
cargo install bpf-linker
By default, bpf-linker tries to use the internal LLVM library available in Rust. That might not work if you are using a musl target. In that case, you need to install LLVM with static libraries on your host system and then install bpf-linker with a different command:
cargo install --git https://github.com/aya-rs/bpf-linker --tag v0.9.3 --no-default-features --features system-llvm -- bpf-linker
Building lockc
After installing all needed dependencies, it's time to build lockc.
You can build and run the entire project with:
cargo xtask run
If you prefer to only build the project, it can be done with:
cargo xtask build-ebpf
cargo build
Running tests:
cargo test
Running lints:
cargo clippy
Installing lockc
To install lockc on your host, use the following command:
cargo xtask install
Do not run this command with sudo! Why?
tl;dr: you will be asked for a password when necessary, don't worry!
Explanation: Running cargo with sudo ends with weird consequences, like not seeing cargo content from your home directory or leaving some files owned by root in target. When any destination directory is owned by root, sudo will be launched automatically by xtask install just to perform the necessary installation steps.
By default it tries to install the lockcd binary in /usr/local/bin, but the destination directories can be changed with the following arguments:
- --destdir - the rootfs of your system, default: /
- --prefix - prefix for most of the installation destinations, default: usr/local
- --bindir - directory for binary files, default: bin
- --unitdir - directory for systemd units, default: lib/systemd/system
- --sysconfdir - directory for configuration files, default: etc
By default, binaries are installed from the debug target profile. If you want to change that, use the --profile argument. --profile release is what you most likely want to use when packaging or installing on a production system.
Building tarball with binary and unit
To make distribution of lockc for Docker users easier, it is possible to build an archive with the binary and systemd unit which can simply be unpacked into the / directory. That can be done with the following command:
cargo xtask bintar
By default it archives the lockcd binary under usr/local/bin, but the destination directories can be changed with the following arguments:
- --prefix - prefix for most of the installation destinations, default: usr/local
- --bindir - directory for binary files, default: bin
- --unitdir - directory for systemd units, default: lib/systemd/system
- --sysconfdir - directory for configuration files, default: etc
By default, binaries are archived from the debug target profile. If you want to change that, use the --profile argument. --profile release is what you most likely want when creating a tarball for releases and production systems.
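For example, to build a release tarball:
cargo xtask bintar --profile release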
The resulting tarball should be available as target/[profile]/lockc.tar.gz (e.g. target/debug/lockc.tar.gz).
Container image
The lockc repository contains a Dockerfile which can be used for building a container image. The main purpose of building it is the ability to deploy lockc on Kubernetes.
Building a local image can be done in a basic way, like:
docker build -t lockcd .
For quick development and usage of the image on different (virtual) machines, it's convenient to use ttl.sh, which is an anonymous and ephemeral container image registry.
To build and push an image to ttl.sh, you can use the following commands:
export IMAGE_NAME=$(uuidgen)
docker build -t ttl.sh/${IMAGE_NAME}:30m .
docker push ttl.sh/${IMAGE_NAME}:30m
After building the container image, you will be able to install lockc on Kubernetes.
Development environment (Vagrant)
It is possible to run lockc, built from source, in a virtual machine using Vagrant:
vagrant up
Our Vagrantfile supports the following environment variables:
- LOCKC_VAGRANT_CPUS - number of vCPUs
- LOCKC_VAGRANT_MEMORY - memory (in MB)
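For example, to provision a bigger VM (the values here are only illustrative):
LOCKC_VAGRANT_CPUS=4 LOCKC_VAGRANT_MEMORY=8192 vagrant up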
When the VM is provisioned successfully, you can access it using:
vagrant ssh
That VM is running k3s. It's also running lockc as a systemd service, which can be checked with:
sudo systemctl status lockc
sudo journalctl -fu lockc
The lockc source tree is available in the /vagrant directory. After making changes in the code, you can sync them (from the host):
vagrant rsync
Then build, install and restart lockc in the VM (inside a vagrant ssh session):
sudo systemctl stop lockc
cargo xtask build-ebpf
cargo build
cargo xtask install
sudo systemctl start lockc
Demos
This section of the book contains demos.
- Mount - mount policies
Mount policies
Kubernetes
The following demo shows mount policies being enforced on Kubernetes pods.
YAML files can be found here.
The policy violations in the deployments-should-fail.yaml file are:
- the nginx-restricted-fail deployment trying to make a host mount while having a restricted policy
- the bpf-default-fail and bpf-baseline-fail deployments trying to mount /sys/fs/bpf while having a baseline policy
- the bpf-restricted-fail deployment trying to mount /sys/fs/bpf while having a restricted policy
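To get a feeling for what such a violation looks like, here is a hypothetical pod spec (not one of the actual demo files) that bind mounts /sys/fs/bpf from the host and should therefore be rejected under the baseline policy:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: bpf-baseline-fail-example
spec:
  containers:
  - name: test
    image: registry.opensuse.org/opensuse/toolbox:latest
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: bpffs
      mountPath: /sys/fs/bpf
  volumes:
  - name: bpffs
    hostPath:
      path: /sys/fs/bpf
EOF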