Container Primitives - Linux Namespaces
Container technology has revolutionized software development and deployment, enabling greater efficiency, scalability, and consistency across computing environments. Containers allow for greater server density by avoiding the overhead of running a resource-hungry second operating system within a virtual machine. And while Docker leads the pack, there are numerous other implementations, such as Podman, LXC, Mesos, Flatpak and others, several of which conform to the OCI standards. Though they may appear different, all of these container technologies are fundamentally built upon the same basic Linux kernel primitives.
In this series, we will explain how Linux containers are constructed from those primitives and what their security implications are. The roadmap for these posts is:
- Namespaces - Process Isolation
- Capabilities - Fine-Grained access to sensitive functionality on the operating system
- CGroups - Resource Limitation
- Seccomp - Runtime restrictions on the system calls a process would otherwise have access to.
Let's begin!
Namespaces
Originally introduced in 2002, Linux namespaces are a “kernel primitive” (a building-block technology) designed for process isolation. By default, all processes on a Linux system are able to “see” and interact with each other by listing running processes, sending signals, using shared files and IPC, monitoring network traffic and listing the hardware on the system. Out of the box, processes are only isolated by standard user and group permissions. While this accomplishes a degree of segmentation, it still allows for visibility into what else is running on a host system, by whom, what timezone they are in, what networking devices are available, etc. In namespace parlance, such processes are said to share the same namespace.
Linux namespaces take process isolation further by limiting which resources on a Linux system are even visible, much less accessible, to a process. For example, if a process is in its own process ID (PID) namespace, it will not be able to view other running processes on the host. Running ps will only list the running bash shell and ps itself (or in a Docker container, whatever process was specified by the CMD or ENTRYPOINT directive). A process that attempts to send a signal (e.g., kill -9) to a process outside of its namespace will be told that the process ID does not exist.
In total, Namespaces isolate a process in eight ways:
- CGroup - Isolates a process's view of the cgroup hierarchy. CGroups are a mechanism to limit system resources; we will discuss them in more detail in a future post.
- Inter-Process Communication (IPC) - Isolates the objects that Unix processes use to communicate directly with each other, such as System V shared memory, semaphores and message queues, and POSIX message queues.
- Network - Access and visibility of network devices
- Mount - Visibility of mount points and file system paths.
- PID - Visibility into the running processes of the OS
- Time - Clock isolation, such that two processes can report different monotonic and boot-time clock values.
- User - Mapping and isolating user IDs and how this affects operating system permissions.
- UNIX Time-Sharing (UTS) - A fancy way of saying different local hostnames per process.
A process can be either partially isolated, by placing it in a single type of namespace (e.g., isolating just the networking layer), or completely isolated, by placing it in its own instance of every type of namespace. In effect, the latter would look and feel like a separate VM.
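To see which of these namespace types your own kernel exposes, one option is to look under /proc/self/ns, where each entry is a handle to one namespace of the current process. The following is a minimal sketch; the function name list_ns_types is my own invention and not a standard utility.

```shell
# List the namespace types exposed for the current shell process.
# Entries ending in _for_children are skipped: they describe the
# namespaces that future child processes will be placed in.
list_ns_types() {
    for entry in /proc/self/ns/*; do
        name=${entry##*/}          # strip the directory part
        case "$name" in
            *_for_children) continue ;;
        esac
        printf '%s\n' "$name"
    done
}

list_ns_types
```

On a recent kernel this typically prints abbreviated names for the types listed above (for example mnt for Mount and uts for UTS); older kernels may lack some entries, such as time.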
But this is all theoretical; let's get practical.
Unshare: Starting a process in a separate namespace.
The unshare utility starts a new process in one or more new namespaces, as specified by the user. For example, if I run ip addr on the host, it will enumerate the host's network interfaces. However, if I run sudo unshare --net bash, the resulting bash process will be placed in its own unique network namespace, isolating it from the network of the base system. If I then run ip addr, I will not see the host's interfaces.
$ sudo unshare --net bash
root@lxc:/home/farhan# ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@lxc:/home/farhan#
Note that no system-level interfaces are listed and even the loopback interface is unconfigured. All child processes started from this shell will be within the same namespace, which is why running ip from the bash prompt only reports the unconfigured loopback interface.
Now let's try the same with a PID namespace, by running sudo unshare --pid --mount-proc --fork bash. From within the namespace, run ps aux to list all processes and you will only see the running bash process and the ps command itself.
Combining these together, you can meaningfully isolate a process from the rest of the system. Getting a sense of the power here?
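As a rough illustration of combining namespace types, the flags from the two examples above can be passed to a single unshare invocation, along with --uts and --ipc. The wrapper name run_isolated below is hypothetical, and the sketch assumes the util-linux version of unshare and a user who is allowed to use sudo.

```shell
# Hypothetical helper: run a command in fresh network, PID, UTS and IPC
# namespaces at once. --mount-proc also implies a new mount namespace,
# so the remounted /proc does not affect the host.
run_isolated() {
    sudo unshare --net --pid --fork --mount-proc --uts --ipc "$@"
}
```

Running run_isolated bash and then ip addr and ps aux inside the resulting shell should show both effects described above at the same time. This still falls well short of a real container, since the file system, user IDs and resource limits are untouched.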
For any code reviewers, note that the unshare utility is based on the unshare syscall.
NS Enter: Enter an existing namespace
In the prior example, unshare gave each process a brand new namespace. But what if you want two processes to share the same namespace, such as the same isolated network or PID table? This can be done by using the nsenter utility.
First, you must bind a namespace to a path, as follows:
$ touch /tmp/mynetnamespace # Create an empty file
$ sudo unshare --net=/tmp/mynetnamespace bash
root@lxc:/home/farhan# ip a add 127.0.0.123/24 dev lo
root@lxc:/home/farhan# ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.123/24 scope host lo
valid_lft forever preferred_lft forever
The empty file serves as a mount point: unshare bind-mounts the new namespace onto it, giving the namespace a path that other processes can reference. The only thing we did in this namespace is add the IP address 127.0.0.123/24 to the loopback interface and list it with ip.
In another terminal on the same machine, enter the existing namespace as follows:
$ sudo nsenter --net=/tmp/mynetnamespace
root@lxc:/home/farhan# ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.123/24 scope host lo
valid_lft forever preferred_lft forever
Notice how the IP address remains, demonstrating that both processes share the same namespace.
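Another way to confirm that two processes share a namespace is to compare the symbolic links under /proc/PID/ns, whose targets look like net:[4026531840]. Here is a minimal sketch; same_ns is a helper name I made up for illustration, and reading the links of another user's process may require root.

```shell
# Hypothetical helper: same_ns PID1 PID2 TYPE
# Succeeds (exit 0) if both PIDs are in the same namespace of TYPE
# (net, pid, uts, ...), exits 1 if they differ, and exits 2 if either
# link cannot be read.
same_ns() {
    a=$(readlink "/proc/$1/ns/$3" 2>/dev/null) || return 2
    b=$(readlink "/proc/$2/ns/$3" 2>/dev/null) || return 2
    [ "$a" = "$b" ]
}
```

For example, with the PIDs of the two bash shells from the terminals above, sudo sh -c with same_ns PID1 PID2 net should succeed, while comparing either of them against a shell on the host should not.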
This method is also used to maintain persistent namespaces, most often for network persistence. In fact, this requirement is so common that network namespace functionality is built into the ip utility as follows:
$ sudo ip netns add mynamespace # Create a bound (persistent) namespace called mynamespace
$ sudo ip link add veth0 type veth peer name veth1 # Create a virtual Ethernet (veth) pair
$ sudo ip link set veth1 netns mynamespace # Move the veth1 end of the pair into the namespace
$ sudo ip netns exec mynamespace ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
706: veth1@if707: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 26:8c:92:02:32:78 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Note that this newly created namespace is bound at /var/run/netns/mynamespace. Similarly, Docker’s network namespace file paths can be seen by running docker inspect [container ID] and searching for the SandboxKey value.
Also for the developers and code reviewers, nsenter is based on the setns syscall.
Listing Namespaces and their Processes
Lastly, if you want to list all namespaces on a system, run the lsns utility, which will give you:
- The namespace ID (an inode number)
- Type of namespace (e.g., network, mount point, UTS, etc.)
- Number of processes within the namespace
- The lowest process ID within the namespace
- The user that process runs as
- And the command line of that process.
If for any reason this utility is not available, you can view the namespace IDs of a process via ls -l /proc/PID/ns
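Building on the /proc approach, a rough stand-in for the process count that lsns reports can be put together with readlink, sort and uniq. This is only a sketch, and count_by_ns is a made-up name rather than an existing tool.

```shell
# Hypothetical helper: count_by_ns TYPE
# Prints one line per namespace of the given type, prefixed with the
# number of processes found in it. Processes we are not allowed to
# inspect (or that exit mid-scan) are silently skipped.
count_by_ns() {
    for link in /proc/[0-9]*/ns/"$1"; do
        readlink "$link" 2>/dev/null || true
    done | sort | uniq -c
}

count_by_ns net
```

Run as an unprivileged user, this only counts the processes whose namespace links you are permitted to read, so running it with root privileges gives a more complete picture.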
Breaking out of a Namespace
A container entails a robust set of controls to prevent breaking out into the base system. Namespace-level isolation is but one layer of isolation and on its own is insufficient to prevent a breakout. For example, a threat actor running as root inside the namespace can trivially break out into the default namespace as follows:
$ sudo unshare --net bash
root@lxc:/home/farhan# ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@lxc:/home/farhan# nsenter --net=/proc/1/ns/net bash
root@lxc:/home/farhan# echo We are now in the default network namespace!
This can be reproduced for all eight namespace types, invalidating the isolation. Needless to say, namespaces alone are insufficient.
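The breakout above depends on the process being able to reach /proc/1/ns. As a quick check, the sketch below compares the current process's namespace with that of PID 1. The function name ns_vs_init is hypothetical, and note the assumption that /proc still reflects the host: inside a PID namespace with a remounted /proc, PID 1 is that namespace's own first process, not the host's init.

```shell
# Hypothetical helper: ns_vs_init TYPE
# Prints "same" if this process shares the TYPE namespace with PID 1,
# "different" if it does not, and "unknown" if either link is unreadable
# (for example when running without sufficient privileges).
ns_vs_init() {
    mine=$(readlink "/proc/self/ns/$1" 2>/dev/null) || { echo unknown; return 0; }
    init=$(readlink "/proc/1/ns/$1" 2>/dev/null) || { echo unknown; return 0; }
    if [ "$mine" = "$init" ]; then
        echo same
    else
        echo different
    fi
}

ns_vs_init net
```

If this prints different while running as root, the handle needed for the nsenter breakout shown above is readable from inside the namespace.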
And that’s all for today! Join us next time when we talk about the exciting world of Linux Capabilities!