Containers: Under the Hood
Introduction
Nowadays, in software engineering, we take containers for granted. We rely on them for day-to-day work, and we build highly available and highly scalable production environments with them. Yet many of us software engineers struggle to understand, and consequently to explain, what containers fundamentally are. Usually, when explaining them to others, we point out that they are not virtual machines, which is true, but we struggle to state precisely what they are. In this article, we will try to gain a more in-depth understanding of what containers are, how they work, and how we can leverage them for building industry-standard systems.
Environment Set-Up
To understand containers, we want to play around with a container runtime. Docker is the most popular implementation of a container runtime, so we will use it for this article. There are several other implementations out there, for example Podman, LXC/LXD, rkt, and many others.
Moving on with our setup, we want to use a Linux (Ubuntu) machine on which we can install Docker Engine following the steps from the Docker documentation. We specifically want Docker Engine and not Docker Desktop: Docker Desktop runs the host inside a virtual machine, which we don't want for our current purposes.
Process Isolation
Containers are not virtual machines (VMs). Despite having their own hostname, filesystem, process space, and networking stack, they do not have a standalone kernel, and they cannot have separate kernel modules or device drivers installed. They can have multiple processes, which are isolated from the processes running on the host machine.
On our Ubuntu host, we can run the following command to get information about the kernel:
root@ip-172-31-24-119:~# uname -s -r
Linux 5.15.0-1019-aws
From the output, we can see that the name of the kernel currently in use is Linux, with the kernel release 5.15.0-1019-aws (the aws suffix comes from the fact that I'm using an EC2 machine on AWS).
Let's output some more information about our Linux distribution:
root@ip-172-31-24-119:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
Now, let's run Rocky Linux from a Docker container using the following command:
docker run -ti rockylinux:8
The -ti flags run the container in interactive mode with a terminal attached, dropping us into a shell inside the container. Let's fetch some OS information:
[root@3564a12dd942 /]# cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.6 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.6 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8"
It seems like we are connected to a different machine. But if we get information about the kernel, we will get something familiar.
[root@3564a12dd942 /]# uname -s -r
Linux 5.15.0-1019-aws
We can notice that it is the same as on the host machine. We can conclude that the container and the Ubuntu host machine share the same kernel. Containers rely on the ability of the host operating system to isolate one program from another while still allowing these programs to share resources such as CPU, memory, storage, and networking. The isolation is accomplished by a capability of the Linux kernel named namespaces.
Linux namespaces are not a new technology or a recently added feature of the kernel; they have been available for many years. The role of a process namespace is to isolate the processes running inside of it, so that they cannot see things they shouldn't, such as processes, mount points, or network interfaces belonging to the host or to other containers.
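To make namespaces a bit more tangible, here is a minimal sketch that lists the namespaces the current shell belongs to. It is not part of the original walkthrough and assumes a Linux host with /proc mounted, where every process exposes its namespaces as symbolic links under /proc/<pid>/ns. The variable name pid_ns is only used for illustration.

```shell
#!/bin/sh
# List the namespaces of the current shell by reading the symbolic links
# under /proc/$$/ns. Each link target has the form "type:[identifier]".
for ns in /proc/$$/ns/*; do
    printf '%-20s %s\n' "$(basename "$ns")" "$(readlink "$ns")"
done

# Keep the PID namespace identifier of this shell for later comparison.
pid_ns=$(readlink "/proc/$$/ns/pid")
echo "PID namespace of this shell: $pid_ns"
```

Two processes live in the same namespace exactly when these identifiers match, which is what the rest of this section relies on.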
To watch process namespaces created by container runtimes in action, we will use containerd. If we followed the installation link from above, we should have containerd installed with Docker Engine. Docker uses containerd under the hood as its container runtime. A container runtime (container engine) provides low-level functionalities to execute containerized processes.
To access containerd, we can use the ctr command. For example, to check if containerd was installed and works correctly, we can run ctr images ls, which should return a list of images in case of success, or an error. At this point we most likely don't have any images pulled, so we should get an empty response. To pull a busybox image we can do the following:
ctr image pull docker.io/library/busybox:latest
We can check the existing images again with ctr images ls, which should list the busybox image. We can run this image using:
ctr run -t --rm docker.io/library/busybox:latest v1
This command starts a container from the image in interactive mode, meaning that we are provided with a shell waiting for commands. Now if we list the currently running tasks from the host machine (in another terminal) with ctr task ls, we should get an answer similar to the following:
TASK PID STATUS
v1 1517 RUNNING
If we take the PID of the running container, we can find its parent process:
root@ip-172-31-24-119:~# ps -ef | grep 1517 | grep -v grep
root 1517 1493 0 21:55 pts/0 00:00:00 sh
root@ip-172-31-24-119:~# ps -ef | grep 1493 | grep -v grep
root 1493 1 0 21:55 ? 00:00:00 /usr/bin/containerd-shim-runc-v2 -namespace default -id v1 -address /run/containerd/containerd.sock
root 1517 1493 0 21:55 pts/0 00:00:00 sh
As we might have expected, the parent process belongs to containerd: it is containerd-shim-runc-v2, a shim process that containerd starts for the container. We can get the process namespaces created by containerd as well:
root@ip-172-31-24-119:~# lsns | grep 1517
4026532279 mnt 1 1517 root sh
4026532280 uts 1 1517 root sh
4026532281 ipc 1 1517 root sh
4026532282 pid 1 1517 root sh
4026532283 net 1 1517 root sh
containerd has created five different types of namespaces for isolating the processes running in our busybox container. These are the following:
- mnt: mount points;
- uts: Unix time sharing (hostname and domain name);
- ipc: interprocess communication;
- pid: process identifiers;
- net: network interfaces, routing tables, and firewalls.
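To check whether two processes are actually isolated from each other, we can compare their namespace identifiers. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper name same_ns is made up for this example. Reading the namespaces of processes owned by another user (such as the container's sh process) generally requires root.

```shell
#!/bin/sh
# same_ns TYPE PID_A PID_B
# Succeeds (exit status 0) when both processes are in the same namespace
# of the given type (mnt, uts, ipc, pid, net). Returns 2 when one of the
# processes cannot be inspected.
same_ns() {
    a=$(readlink "/proc/$2/ns/$1") || return 2
    b=$(readlink "/proc/$3/ns/$1") || return 2
    [ "$a" = "$b" ]
}

# Demo: a child started from this shell inherits all of its namespaces.
sleep 30 &
child=$!
for type in mnt uts ipc pid net; do
    if same_ns "$type" "$$" "$child"; then
        echo "$type: shared"
    else
        echo "$type: isolated"
    fi
done
kill "$child"
```

Run as root on the host, a call such as same_ns pid $$ 1517 (with the PID reported by ctr task ls) would be expected to fail for the busybox container, since containerd placed it in its own namespaces.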
Network Isolation
containerd uses network namespaces to provide network isolation and to simplify configuration. In a lot of cases, our containers act as web servers. To run a web server, we need to choose a network interface and a port on which the server will listen. To solve the issue of port collision (two or more processes listening on the same port of the same interface), container runtimes use virtual network interfaces.
If we want to see the network namespace created by containerd, we will run into an issue. Unfortunately, network namespaces created by containerd are invisible to the ip tool, which only lists namespaces registered under /var/run/netns. This means that if we execute ip netns list to list all the network namespaces present on our host machine, we most likely get no output. We can still get hold of the namespace created by containerd if we do the following:
- Get the PID of the currently running container:
root@ip-172-31-24-119:~# ctr task ls
TASK PID STATUS
v1 13744 RUNNING
- Create an empty file in the /var/run/netns location, named after the container identifier (we will use the container PID as the identifier):
mkdir -p /var/run/netns
touch /var/run/netns/13744
- Bind the net process namespace to this file:
mount -o bind /proc/13744/ns/net /var/run/netns/13744
Now if we run ip netns list, we get the following:
root@ip-172-31-24-119:~# ip netns list
13744
We can also look at the interfaces in the network namespace:
root@ip-172-31-24-119:~# ip netns exec 13744 ip addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
Running ip a from inside the busybox container, we get similar output:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
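The manual steps from this section can be collected into a small helper. The following is a minimal sketch, not something containerd or ip provides: the function names expose_netns and hide_netns and the DRY_RUN switch are made up for this illustration. With DRY_RUN set, the commands are only printed, so the sketch can be tried without root privileges.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN is set (hypothetical switch
# introduced for this sketch so the steps can be previewed without root).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# expose_netns PID
# Makes the network namespace of PID visible to "ip netns" under the name PID.
expose_netns() {
    pid=$1
    run mkdir -p /var/run/netns
    run touch "/var/run/netns/$pid"
    run mount -o bind "/proc/$pid/ns/net" "/var/run/netns/$pid"
}

# hide_netns PID
# Reverts expose_netns: removes the bind mount and the placeholder file.
hide_netns() {
    pid=$1
    run umount "/var/run/netns/$pid"
    run rm -f "/var/run/netns/$pid"
}

# Preview the commands for the PID used in this article.
DRY_RUN=1
expose_netns 13744
```

Unsetting DRY_RUN and running the script as root would execute the same three commands shown in the steps above; hide_netns cleans up once we are done inspecting the namespace.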
Filesystem Isolation
The idea of process isolation involves preventing a process from seeing things it should not. In terms of files and folders, Linux provides filesystem permissions. The Linux kernel associates an owner and a group with each file and folder, and on top of that it manages read, write, and execute permissions. This permission system works well, but if a process manages to elevate its privileges, it can see files and folders that should have been forbidden to it.
A more advanced solution for isolation provided by the Linux kernel is to run a process in an isolated filesystem. This can be achieved by an approach known as change root. The chroot command changes the apparent root directory for the currently running process and its children.
For example, we can download Alpine Linux inside a folder and launch an isolated shell using chroot:
ssm-user@ip-172-31-24-119:~$ ls
ssm-user@ip-172-31-24-119:~$ mkdir alpine
ssm-user@ip-172-31-24-119:~$ cd alpine
ssm-user@ip-172-31-24-119:~/alpine$ curl -o alpine.tar.gz http://dl-cdn.alpinelinux.org/alpine/v3.16/releases/x86_64/alpine-minirootfs-3.16.0-x86_64.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2649k 100 2649k 0 0 65.6M 0 --:--:-- --:--:-- --:--:-- 66.3M
ssm-user@ip-172-31-24-119:~/alpine$ ls
alpine.tar.gz
Let's unpack the archive:
ssm-user@ip-172-31-24-119:~/alpine$ tar xf alpine.tar.gz
ssm-user@ip-172-31-24-119:~/alpine$ ls
alpine.tar.gz bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
ssm-user@ip-172-31-24-119:~/alpine$ rm alpine.tar.gz
ssm-user@ip-172-31-24-119:~/alpine$ ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
We can recognize these folders from any other Linux distribution. The alpine folder has the necessary resources to be used as the root folder. We can run an isolated Alpine shell as follows:
ssm-user@ip-172-31-24-119:~/alpine$ cd ..
ssm-user@ip-172-31-24-119:~$ sudo chroot alpine sh
/ # ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
/ #
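From the host, we can verify which directory a process treats as its root. The following is a minimal sketch, assuming a Linux host with /proc mounted, where /proc/<pid>/root is a symbolic link to the root directory of the process; the helper name root_of is made up for this example.

```shell
#!/bin/sh
# root_of PID
# Prints the root directory a process is confined to, as seen from the
# caller's own view of the filesystem.
root_of() {
    readlink "/proc/$1/root"
}

# A shell that was not started through chroot reports "/".
echo "root of this shell: $(root_of $$)"
```

For the chrooted Alpine shell started above, calling root_of as root from another terminal on the host, with the PID of that sh process, should print the path of the alpine folder instead of /.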
Container runtimes such as containerd (or Docker) implement an approach similar to chroot for filesystem isolation. On top of that, they provide a more practical way of setting up the isolation by using container images. Container images are ready-to-use bundles that contain all the required files and folders for the base filesystem, metadata (environment variables, arguments), and other executables.
Resource Limiting
Until now we've seen how we can have process, networking, and filesystem isolation. There is one piece missing from the puzzle: hardware resource limiting. Even if our processes are entirely isolated, they can still "see" the host's CPU, memory, networking, and storage. In the following lines, we will discuss how we can guarantee that a container only uses the resources allocated to it.
In the Linux kernel, the scheduler keeps a list of all the processes, tracking which ones are ready to run and how much time each process has received. The scheduler is designed to be fair, meaning it tries to give each process time to run. It also accepts input regarding the priority of each process.
In terms of prioritization, we can distinguish between real-time and non-real-time scheduling policies. Real-time processes have to react fast ("in real time") to outside events, so these processes get a higher priority compared to non-real-time processes.
We can use the ps Linux command to see which processes run under a real-time or a non-real-time policy:
root@ip-172-31-19-81:~# ps -e -o pid,class,rtprio,ni,comm
PID CLS RTPRIO NI COMMAND
1 TS - 0 systemd
2 TS - 0 kthreadd
3 TS - -20 rcu_gp
4 TS - -20 rcu_par_gp
5 TS - -20 netns
...
15 FF 99 - migration/0
16 FF 50 - idle_inject/0
17 TS - 0 kworker/0:1-events
...
1374 TS - 0 bash
1385 TS - 0 ps
In the output above, the CLS column shows the scheduling class of each process. It can have the following values:
- TS: time-sharing, non-real-time policy
- FF: FIFO (first in, first out), real-time policy
- RR: round-robin, also a real-time policy
In the RTPRIO and NI columns, we can see the priority of each process. RTPRIO (real-time priority) applies only to real-time processes, while NI ("nice" level) applies to non-real-time processes and can have a value between -20 (least nice, highest priority) and 19 (nicest, lowest priority).
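The nice level can be observed and changed from the shell. The following is a minimal sketch using the standard nice command; the variable names base and raised are only for illustration. It starts a child with its niceness raised by 5, which lowers its priority and does not require special privileges.

```shell
#!/bin/sh
# "nice" without arguments prints the niceness of the current process.
base=$(nice)

# Run the same query in a child whose niceness is raised by 5.
raised=$(nice -n 5 nice)

echo "niceness of this shell:     $base"
echo "niceness of the child task: $raised"
```

The second value should be five higher than the first, up to the maximum of 19. Lowering the niceness (a negative adjustment) is normally reserved for root.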
Real-time processes are usually not computationally intensive, but when they need the CPU, they need it immediately. To make sure all real-time processes get the CPU they require, the Linux kernel reserves slices of CPU time for each process. The ability to provide slices of CPU time to each process is the basis of CPU resource isolation for containers. This approach is preferred because a process cannot influence the scheduler by requesting more processing time for computationally intensive tasks.
To manage container use of CPU cores, Linux uses control groups (cgroups). Control groups can do even more. To cite Wikipedia:
"cgroups is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes."
We can add processes to cgroups. After a process is in a cgroup, the kernel automatically applies the controls from that group.
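Every process already belongs to some cgroup, and we can look up which one. The following is a minimal sketch, assuming a Linux host with /proc mounted and cgroups enabled; the variable name membership is only for illustration.

```shell
#!/bin/sh
# Each line of /proc/<pid>/cgroup has the form
#   hierarchy-ID:controller-list:cgroup-path
# On a cgroup v2 (unified) host there is typically a single line whose
# controller list is empty.
membership=$(cat "/proc/$$/cgroup")

echo "$membership" | while IFS=: read -r id controllers path; do
    echo "hierarchy=$id controllers=${controllers:-<unified>} path=$path"
done
```

The reported path is relative to the cgroup filesystem root, so it points to the directory that holds the limits applied to this shell.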
The creation and configuration of cgroups is handled through a specific kind of filesystem, which by default can be found under the /sys/fs/cgroup path:
root@ip-172-31-19-81:/sys/fs/cgroup# ls -F -la
total 0
dr-xr-xr-x 11 root root 0 Oct 17 20:16 ./
drwxr-xr-x 8 root root 0 Oct 17 20:16 ../
-r--r--r-- 1 root root 0 Oct 17 20:16 cgroup.controllers
-rw-r--r-- 1 root root 0 Oct 17 20:17 cgroup.max.depth
-rw-r--r-- 1 root root 0 Oct 17 20:17 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Oct 17 20:16 cgroup.procs
-r--r--r-- 1 root root 0 Oct 17 20:17 cgroup.stat
-rw-r--r-- 1 root root 0 Oct 17 20:43 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Oct 17 20:17 cgroup.threads
-rw-r--r-- 1 root root 0 Oct 17 20:17 cpu.pressure
-r--r--r-- 1 root root 0 Oct 17 20:17 cpu.stat
-r--r--r-- 1 root root 0 Oct 17 20:17 cpuset.cpus.effective
-r--r--r-- 1 root root 0 Oct 17 20:17 cpuset.mems.effective
drwxr-xr-x 2 root root 0 Oct 17 20:17 dev-hugepages.mount/
drwxr-xr-x 2 root root 0 Oct 17 20:17 dev-mqueue.mount/
drwxr-xr-x 2 root root 0 Oct 17 20:16 init.scope/
-rw-r--r-- 1 root root 0 Oct 17 20:17 io.cost.model
-rw-r--r-- 1 root root 0 Oct 17 20:17 io.cost.qos
-rw-r--r-- 1 root root 0 Oct 17 20:17 io.pressure
-rw-r--r-- 1 root root 0 Oct 17 20:17 io.prio.class
-r--r--r-- 1 root root 0 Oct 17 20:17 io.stat
-r--r--r-- 1 root root 0 Oct 17 20:17 memory.numa_stat
-rw-r--r-- 1 root root 0 Oct 17 20:17 memory.pressure
-r--r--r-- 1 root root 0 Oct 17 20:17 memory.stat
-r--r--r-- 1 root root 0 Oct 17 20:17 misc.capacity
drwxr-xr-x 2 root root 0 Oct 17 20:17 sys-fs-fuse-connections.mount/
drwxr-xr-x 2 root root 0 Oct 17 20:17 sys-kernel-config.mount/
drwxr-xr-x 2 root root 0 Oct 17 20:17 sys-kernel-debug.mount/
drwxr-xr-x 2 root root 0 Oct 17 20:17 sys-kernel-tracing.mount/
drwxr-xr-x 35 root root 0 Oct 17 20:43 system.slice/
drwxr-xr-x 3 root root 0 Oct 17 20:20 user.slice/
The files listed above expose the properties of different resources (CPU, I/O, memory), while the directories are child cgroups. We can configure these properties by applying limits.
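To illustrate what applying a limit looks like, here is a minimal sketch for a cgroup v2 host. It is not how Docker or containerd do it internally; the helper name limit_cpu and the group name demo are made up for this example. It assumes the cpu controller is enabled for child groups (listed in cgroup.subtree_control). The demo call below rehearses against a scratch directory, so nothing on the host changes and no root access is needed.

```shell
#!/bin/sh
# limit_cpu CGROUP_ROOT NAME QUOTA_US PERIOD_US [PID]
# Creates a child cgroup and caps its CPU time to QUOTA_US microseconds
# per PERIOD_US microseconds (the cgroup v2 "cpu.max" format). When PID
# is given, that process is moved into the new group.
limit_cpu() {
    group="$1/$2"
    mkdir -p "$group" || return 1
    echo "$3 $4" > "$group/cpu.max" || return 1
    if [ -n "${5:-}" ]; then
        echo "$5" > "$group/cgroup.procs" || return 1
    fi
}

# Rehearsal against a scratch directory instead of the real cgroup tree.
scratch=$(mktemp -d)
limit_cpu "$scratch" demo 50000 100000
cat "$scratch/demo/cpu.max"
```

On a real host, running limit_cpu /sys/fs/cgroup demo 50000 100000 followed by a PID as root would restrict that process to roughly half of one CPU core.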
Building Container Images
Before building a container image ourselves, let's step back a little and investigate how other popular images are built. We will use Docker to pull an Apache httpd image, which we will take apart to see its content.
Let's pull the image from the Docker registry:
ssm-user@ip-172-31-24-119:~$ docker pull httpd
Using default tag: latest
latest: Pulling from library/httpd
bd159e379b3b: Already exists
36d838c2f6d6: Pull complete
b55eda22bb18: Pull complete
f6e6bfa28393: Pull complete
a1b49b7ecb8a: Pull complete
Digest: sha256:4400fb49c9d7d218d3c8109ef721e0ec1f3897028a3004b098af587d565f4ae5
Status: Downloaded newer image for httpd:latest
docker.io/library/httpd:latest
We can launch a container based on this image and connect to a shell using the command below:
ssm-user@ip-172-31-24-119:~$ docker run -it httpd /bin/bash
Having a shell, we can navigate to the root of the container and list all the files and folders:
root@17556581d317:/usr/local/apache2# cd /
root@17556581d317:/# ls -lart
total 72
drwxr-xr-x 2 root root 4096 Sep 3 12:10 home
drwxr-xr-x 2 root root 4096 Sep 3 12:10 boot
drwxr-xr-x 1 root root 4096 Oct 4 00:00 var
drwxr-xr-x 1 root root 4096 Oct 4 00:00 usr
drwxr-xr-x 2 root root 4096 Oct 4 00:00 srv
drwxr-xr-x 2 root root 4096 Oct 4 00:00 sbin
drwxr-xr-x 3 root root 4096 Oct 4 00:00 run
drwx------ 2 root root 4096 Oct 4 00:00 root
drwxr-xr-x 2 root root 4096 Oct 4 00:00 opt
drwxr-xr-x 2 root root 4096 Oct 4 00:00 mnt
drwxr-xr-x 2 root root 4096 Oct 4 00:00 media
drwxr-xr-x 2 root root 4096 Oct 4 00:00 lib64
drwxrwxrwt 1 root root 4096 Oct 5 04:09 tmp
drwxr-xr-x 1 root root 4096 Oct 5 04:10 lib
drwxr-xr-x 1 root root 4096 Oct 5 04:10 bin
drwxr-xr-x 1 root root 4096 Oct 15 20:19 etc
-rwxr-xr-x 1 root root 0 Oct 15 20:19 .dockerenv
drwxr-xr-x 1 root root 4096 Oct 15 20:19 ..
drwxr-xr-x 1 root root 4096 Oct 15 20:19 .
dr-xr-xr-x 13 root root 0 Oct 15 20:19 sys
dr-xr-xr-x 176 root root 0 Oct 15 20:19 proc
drwxr-xr-x 5 root root 360 Oct 15 20:19 dev
The Apache httpd image we are using is based on a Debian base image. This means it has a filesystem similar to what we would expect from the Debian Linux distribution. It contains all the files and folders that the Apache web server needs to function correctly.
Also, if we take another look at the output of the docker pull command, we can observe that several layers were downloaded. Some layers are skipped with the message that they already exist on the host machine. Container images are made up of layers that can be shared between images. A layer is skipped during a pull when it was already downloaded for another image or for a previous version of the same image. Docker detects that more than one image uses the same layer and does not retrieve it twice. Layers are used to save space and to speed up builds, pulls, and pushes.
Layers are created when images are built. Usually, we rely on other base images when building our image. As an example, we can use the httpd base image, on top of which we add our website, essentially creating another layer. Base images also have to come from somewhere; usually, they are built from a minimal Linux filesystem. The Alpine Linux resources downloaded and used for chroot could be used as the base for a container image.
There are several ways to build images; the most popular is the Docker approach, which uses Dockerfiles. A minimal Dockerfile using httpd as the base image would look like this:
FROM httpd:2.4
RUN mkdir -p /usr/local/apache2/conf/sites/
COPY my-httpd.conf /usr/local/apache2/conf/httpd.conf
COPY ./public-html/ /usr/local/apache2/htdocs/
Many instructions can be used when building Docker images. For more information, check out the Docker documentation. Some widely used Dockerfile instructions are the following:
- FROM: specifies the base image for the current build
- ENV: specifies an environment variable
- RUN: runs a command inside the container while the image is being built
- COPY: copies files from the host machine into the image while it is being built
- ENTRYPOINT: specifies the initial process for the container
- CMD: sets the default parameters for the initial process
Conclusions
In this article, we have seen what containers are. They are not virtual machines; they are essentially a group of isolated processes with their own isolated filesystem and networking. They share the kernel with the host machine. Because of this, they can be lightweight compared to a fully-fledged virtual machine. They can be part of an agile architecture, since they can be spun up and torn down quickly.
Links and References
- Install Docker Engine on Ubuntu: Docker docs
- Linux Namespaces: Wikipedia page
- Docker Container Network Namespace is Invisible: baeldung.com
- chroot: Wikipedia page
- Completely Fair Scheduler: Wikipedia page
- cgroups: Wikipedia page
- Dockerfile reference: Docker docs
Additional Reading
- Building containers by hand using namespaces: The net namespace: https://www.redhat.com/sysadmin/net-namespaces
- Basics of Container Isolation: https://blog.devgenius.io/basics-of-container-isolation-5eabdb258409
This article is heavily inspired by these 2 books:
- Alan Hohn - The Book of Kubernetes: amazon.com
- Liz Rice - Container Security: Fundamental Technology Concepts that Protect Containerized Applications: amazon.com