Docker run
Let’s say we have some basic Dockerfile describing an image running Ubuntu 20.04.
FROM ubuntu:20.04
We can build the image and give it a name.
docker build -t my_image .
Finally we can execute commands in a container based on our image.
docker run my_image echo "hello, world!"
Which prints “hello, world!” to the screen. Nice 🙂
Inside the container
But what does it mean to “run in a container”? Let’s open a shell inside of one and have a look.
Note that the docker run command simply spawns a new process on my MacBook.
❯ ps
...
6486 ttys001 0:00.05 docker run -it my_image /bin/sh
...
But if we attach to a shell running inside the container process, we get a very different view of the world.
❯ docker run -it my_image /bin/sh
# ls
bin dev home media opt root sbin sys usr
boot etc lib mnt proc run srv tmp var
# ps
PID TTY TIME CMD
1 pts/0 00:00:00 sh
8 pts/0 00:00:00 ps
# hostname
d5437e71fda7
# whoami
root
# exit
❯
We have our own hostname, process list, filesystem, etc. In other words, the sub-process is isolated from the host system that created it.
This idea of isolated, lightweight processes running on one or many host machines is what allowed us to move into the cloud. I rely on containers every time I ship my code to production.
It’s time to understand what allows containers to exist and be isolated from the host system.
LonelyContainers
Let’s try to re-create the behavior of Docker containers ourselves! This C program uses clone to spawn a new sub-process.
clone is similar to fork, meaning it also creates a new process. But unlike fork, it gives us more control over which resources we want to share with the child process.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
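/* entry point of the sub-process */
int container(void *args)
{
    printf("PID seen from container: %d\n", getpid());
    system("/bin/bash");
    return 0; /* becomes the exit status of the sub-process */
}
int main()
{
    /* clone expects a pointer to the top of the child's stack, hence the + 4096 */
    pid_t p = clone(container, malloc(4096) + 4096, SIGCHLD, NULL);
    if (p == -1) {
        perror("clone");
        exit(1);
    }
    printf("PID seen from host system: %d\n", p);
    /* SIGCHLD in the flags lets us wait for the sub-process like for a forked child */
    waitpid(p, NULL, 0);
    return 0;
}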
We simply clone the current process, and in the newly spawned sub-process we execute /bin/bash. Additionally, we print the PID of the sub-process twice: once as seen from the host system and once as seen from inside the sub-process.
We can compile the program using gcc -o lonely_container lonely_container.c and run it:
❯ ./lonely_container
PID seen from host system: 11775
PID seen from container: 11775
$ echo Hello, world!
Hello, world!
$ exit
exit
❯
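To get a feeling for what "control over which resources we share" means, here is a minimal sketch that is separate from the container program. The names counter, set_counter and counter_after_child are made up for this illustration, and the 1 MiB stack size is simply an assumption that leaves plenty of room. The child writes to a global variable, and the flags passed to clone decide whether the parent gets to see that write.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024) /* assumed to be plenty for this tiny child */

static volatile int counter;

/* Runs in the child: writes to a global variable and exits. */
static int set_counter(void *arg)
{
    (void)arg;
    counter = 42;
    return 0;
}

/*
 * Clones a child with the given extra flags and returns the value of
 * counter that the parent sees afterwards, or -1 if something failed.
 */
int counter_after_child(int extra_flags)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    counter = 0;
    /* the stack grows downwards, so clone gets a pointer to its top */
    pid_t p = clone(set_counter, stack + STACK_SIZE, SIGCHLD | extra_flags, NULL);
    if (p == -1) {
        free(stack);
        return -1;
    }
    waitpid(p, NULL, 0);
    free(stack);
    return counter;
}
```

Calling counter_after_child(0) behaves like fork: the child works on its own copy of the memory, so the parent still sees 0. Calling counter_after_child(CLONE_VM) shares the address space, so the parent sees 42.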
PID independence
Note how the sub-process has the same PID from both points of view:
PID seen from host system: 11775
PID seen from container: 11775
Let’s change that. When we have a look at the manpage for clone, we see that the signature is
int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...);
and there is one flag, called CLONE_NEWPID, which states
If CLONE_NEWPID is set, then create the process in a new PID namespace.
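So, let’s call clone like this (creating a new PID namespace requires root privileges, so from now on we run the program with sudo):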
int flags = CLONE_NEWPID;
pid_t p = clone(container, malloc(4096) + 4096, SIGCHLD|flags, NULL);
And we get:
PID seen from host system: 13342
PID seen from container: 1
Nice, we’re in our own namespace of processes. From the container’s perspective, there’s only one process: itself, and it has PID 1.
But when we run ps inside the container to look at other processes, we still see many more:
PID TTY TIME CMD
1038 tty2 00:03:22 Xorg
1046 tty2 00:00:00 gnome-session-b
2600 pts/0 00:00:00 zsh
2610 pts/0 00:00:00 zsh
2611 pts/0 00:00:00 zsh
2613 pts/0 00:00:00 gitstatusd
2972 pts/1 00:00:00 zsh
3007 pts/1 00:00:00 zsh
3008 pts/1 00:00:00 zsh
3010 pts/1 00:00:00 gitstatusd
7060 pts/2 00:00:00 zsh
7068 pts/2 00:00:00 zsh
7069 pts/2 00:00:00 zsh
7071 pts/2 00:00:00 gitstatusd
13114 pts/0 00:00:00 man
13122 pts/0 00:00:00 less
13617 pts/1 00:00:01 node
13636 pts/1 00:00:00 npm exec hugo s
13647 pts/1 00:00:00 hugo
14022 pts/2 00:00:00 sudo
14023 pts/2 00:00:00 lonely_containe
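Even though ps still shows the host processes, the sub-process really does live in a different PID namespace. One way to check is to read the symlinks the Linux kernel exposes under /proc/self/ns. The helper below is a small sketch for that; the name namespace_id is made up for this illustration.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copies the identifier of one of the calling process's namespaces
 * into buf, for example "pid:[4026531836]" for kind == "pid".
 * Returns 0 on success and -1 on error.
 */
int namespace_id(const char *kind, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    if (len == 0)
        return -1;
    snprintf(path, sizeof path, "/proc/self/ns/%s", kind);
    /* readlink does not terminate the string, so we do it ourselves */
    n = readlink(path, buf, len - 1);
    if (n == -1)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

Printing namespace_id("pid", ...) once in main and once in the container function should give two different numbers when CLONE_NEWPID is set, and the same number when it is not.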
File system independence
The problem with ps above is that it reads the process list from /proc. Since our sub-process still shares the file system with the host process, we can see all the host-system processes inside the container.
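To make this concrete, here is a rough sketch of the core idea behind a process listing: every running process shows up as a directory in /proc whose name is its PID. This is a simplification and not how ps is actually implemented, and the function name list_proc_pids is made up for this illustration.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

/*
 * Walks /proc and counts the entries whose names consist only of
 * digits, i.e. the PIDs. Prints them if print is non-zero.
 * Returns the number of PIDs found, or -1 if /proc cannot be opened.
 */
int list_proc_pids(int print)
{
    DIR *dir = opendir("/proc");
    struct dirent *entry;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((entry = readdir(dir)) != NULL) {
        const char *s = entry->d_name;
        int numeric = (*s != '\0');

        for (; *s != '\0'; s++) {
            if (!isdigit((unsigned char)*s)) {
                numeric = 0;
                break;
            }
        }
        if (numeric) {
            if (print)
                printf("%s\n", entry->d_name);
            count++;
        }
    }
    closedir(dir);
    return count;
}
```

Whatever /proc the calling process can see decides the result, which is why the container keeps listing host processes as long as it looks at the host's /proc.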
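Let’s give the container its own filesystem. First, we create a directory container_root next to lonely_container.c. This will be the root of our container’s filesystem, /. It needs to contain everything the container should be able to see, at the very least /bin/bash with the libraries it depends on and an empty proc directory to mount /proc on.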
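A missing shell is an easy mistake to make here, so a small sanity check before starting the container can save some head-scratching. The helper below is only a sketch with a made-up name, root_is_usable. It checks for an executable bin/bash and a proc directory inside the given root, and it does not check the shared libraries bash needs.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if root contains an executable bin/bash and a proc
 * directory, 0 otherwise. Shared libraries are not checked.
 */
int root_is_usable(const char *root)
{
    char path[4096];
    struct stat st;

    snprintf(path, sizeof path, "%s/bin/bash", root);
    if (access(path, X_OK) == -1) {
        perror(path);
        return 0;
    }

    snprintf(path, sizeof path, "%s/proc", root);
    if (stat(path, &st) == -1 || !S_ISDIR(st.st_mode)) {
        fprintf(stderr, "%s: missing or not a directory\n", path);
        return 0;
    }
    return 1;
}
```

Calling root_is_usable("./container_root") at the start of main and exiting when it returns 0 gives a clearer error than a shell that silently fails to start.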
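We now add a new flag, CLONE_NEWNS, which states
If CLONE_NEWNS is set, the cloned child is started in a new mount namespace
The call to clone now looks like this (CLONE_NEWIPC is not strictly needed here, it additionally gives the child its own IPC namespace):
int flags = CLONE_NEWPID|CLONE_NEWIPC|CLONE_NEWNS;
pid_t p = clone(container, malloc(4096) + 4096, SIGCHLD|flags, NULL);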
We change the container function to chroot into the new container_root before executing the shell, and to mount /proc in our new root filesystem:
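#include <sys/mount.h> /* needed for mount(), goes next to the other includes */
int container(void *args)
{
    printf("PID seen from container: %d\n", getpid());
    /* make container_root the new / and move into it */
    if (chroot("./container_root") == -1 || chdir("/") == -1) {
        perror("chroot");
        return 1;
    }
    /* mount a fresh procfs that reflects our new PID namespace */
    if (mount("proc", "/proc", "proc", 0, NULL) == -1) {
        perror("mount");
        return 1;
    }
    system("/bin/bash");
    return 0;
}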
When we compile and run it, we can see that we have our own filesystem and no longer see the host system processes:
❯ ./lonely_container
PID seen from host system: 11775
PID seen from container: 1
$ ps -a
PID TTY TIME CMD
1 pts/2 00:00:00 bash
7 pts/2 00:00:00 ps
$ exit
exit
❯
Further independence
To try this yourself, I suggest looking into the manpage for clone and checking out what other flags you can add and what other resources you might isolate.
One example could be the hostname via CLONE_NEWUTS.
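As a starting point, here is a sketch of what hostname isolation could look like. It is a standalone example, not a drop-in change to lonely_container.c, and the names set_and_print_hostname and run_with_hostname as well as the 1 MiB stack size are assumptions made for this illustration. Like CLONE_NEWPID, CLONE_NEWUTS requires root privileges.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024) /* assumed to be plenty for this child */

/* Runs in the child: changes the hostname and prints what it sees. */
static int set_and_print_hostname(void *arg)
{
    const char *name = arg;
    char seen[256];

    if (sethostname(name, strlen(name)) == -1) {
        perror("sethostname");
        return 1;
    }
    if (gethostname(seen, sizeof seen) == -1) {
        perror("gethostname");
        return 1;
    }
    printf("hostname seen from container: %s\n", seen);
    /* the child does not exit via exit(), so flush stdout by hand */
    fflush(stdout);
    return 0;
}

/*
 * Runs a child in a new UTS namespace and sets its hostname to name.
 * Returns the child's exit status, or -1 if it could not be started.
 */
int run_with_hostname(const char *name)
{
    char *stack = malloc(STACK_SIZE);
    int status = 0;
    pid_t p;

    if (stack == NULL)
        return -1;
    p = clone(set_and_print_hostname, stack + STACK_SIZE,
              CLONE_NEWUTS | SIGCHLD, (void *)name);
    if (p == -1) {
        perror("clone"); /* typically EPERM when not run as root */
        free(stack);
        return -1;
    }
    waitpid(p, &status, 0);
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

After run_with_hostname("lonely") returns, running hostname on the host still prints the old name, because the change only happened inside the child's UTS namespace.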
The next step could be to isolate CPU and memory for the sub-process and to limit how much of each it can use. To achieve this, we will have to look into a Linux feature called control groups, aka cgroups. Maybe in the next post!
so long