Learning Namespaces: Unsharing is caring

Notes from the Linux Containers the Hard Way workshop.

Linux namespaces are created with either the clone or the unshare system call. The unshare command is a command-line wrapper (part of util-linux) around the unshare system call.

$  unshare --help

Usage:
 unshare [options] [<program> [<argument>...]]

Run a program with some namespaces unshared from the parent.

Options:
 -m, --mount[=<file>]      unshare mounts namespace
 -u, --uts[=<file>]        unshare UTS namespace (hostname etc)
 -i, --ipc[=<file>]        unshare System V IPC namespace
 -n, --net[=<file>]        unshare network namespace
 -p, --pid[=<file>]        unshare pid namespace
 -U, --user[=<file>]       unshare user namespace
 -C, --cgroup[=<file>]     unshare cgroup namespace
 -T, --time[=<file>]       unshare time namespace

 -f, --fork                fork before launching <program>
 --map-user=<uid>|<name>   map current user to uid (implies --user)
 --map-group=<gid>|<name>  map current group to gid (implies --user)
 -r, --map-root-user       map current user to root (implies --user)
 -c, --map-current-user    map current user to itself (implies --user)
 --map-auto                map users and groups automatically (implies --user)
 --map-users=<outeruid>,<inneruid>,<count>
                           map count users from outeruid to inneruid (implies --user)
 --map-groups=<outergid>,<innergid>,<count>
                           map count groups from outergid to innergid (implies --user)

 --kill-child[=<signame>]  when dying, kill the forked child (implies --fork)
                             defaults to SIGKILL
 --mount-proc[=<dir>]      mount proc filesystem first (implies --mount)
 --propagation slave|shared|private|unchanged
                           modify mount propagation in mount namespace
 --setgroups allow|deny    control the setgroups syscall in user namespaces
 --keep-caps               retain capabilities granted in user namespaces

 -R, --root=<dir>          run the command with root directory set to <dir>
 -w, --wd=<dir>            change working directory to <dir>
 -S, --setuid <uid>        set uid in entered namespace
 -G, --setgid <gid>        set gid in entered namespace
 --monotonic <offset>      set clock monotonic offset (seconds) in time namespaces
 --boottime <offset>       set clock boottime offset (seconds) in time namespaces

 -h, --help                display this help
 -V, --version             display version

For more details see unshare(1).
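Before unsharing anything, it can help to see which namespaces the current shell already belongs to. A minimal sketch, assuming a Linux system with procfs mounted at /proc: every entry under /proc/<pid>/ns is a symbolic link whose target names the namespace type and an inode number, and two processes share a namespace when those targets are identical. The variable names are only for illustration.

```shell
# Print the namespace identifier for each namespace type of the current shell.
# Each link target looks like "mnt:[4026531840]" (the number varies by system).
for ns_link in /proc/$$/ns/*; do
    printf '%-20s %s\n' "${ns_link##*/}" "$(readlink "$ns_link")"
done

# Keep the mount namespace ID so it can be compared with another process later.
host_mnt_ns=$(readlink /proc/$$/ns/mnt)
echo "mount namespace of this shell: $host_mnt_ns"
```

Running the same readlink inside a shell started by unshare -m should print a different number for mnt.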

Mount Namespaces

Create a new mount namespace and start a Bash shell inside it.

unshare -m /bin/bash

Recall: The standard form of the mount command is:

mount -t type device dir

Mount a virtual memory filesystem (tmpfs) in the namespace at directory /mnt.

# mount -t tmpfs tmpfs /mnt
# mount | grep mnt
tmpfs on /mnt type tmpfs (rw,relatime,inode64)
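To check that the mount really stays inside the namespace, the mount table can be compared from both sides. This is a rough sketch under a few assumptions: it uses a scratch directory from mktemp instead of /mnt, and it adds -U -r so it can run without root on systems that allow unprivileged user namespaces. If the namespaces cannot be created, it reports "unavailable" instead of failing.

```shell
# Scratch directory used as the mount point, so /mnt on the host is untouched.
scratch=$(mktemp -d)

# Mount a tmpfs inside a new mount namespace and count matching mount entries there.
inside=$(unshare -Urm sh -c '
    mount -t tmpfs tmpfs "$1" && grep -cF " $1 " /proc/self/mounts
' sh "$scratch" 2>/dev/null) || inside=unavailable

# Count the same entries as seen from the original mount namespace.
outside=$(grep -cF " $scratch " /proc/self/mounts || true)

echo "mount entries inside: $inside, outside: $outside"
rmdir "$scratch"
```

The namespace disappears when its last process exits, which takes the tmpfs mount with it, so no unmount is needed.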

PID Basics

Processes are organized in a tree structure. Every process has a parent except the first process.

Visualize the process tree with the pstree command.

$ pstree -S

The -S flag shows namespace transitions, that is, where new namespaces have been created.

All processes are tracked in a special filesystem called procfs, which is usually mounted under /proc.
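The parent/child relationship can be read straight out of procfs. A small sketch, assuming procfs is mounted at /proc: the PPid field of /proc/<pid>/status holds the parent's PID, and /proc/<pid>/comm holds the process name. The variable names are illustrative.

```shell
# Read the parent PID of the current shell from procfs.
ppid_from_proc=$(awk '/^PPid:/ {print $2}' "/proc/$$/status")

# The shell exposes the same value through the PPID variable.
echo "shell PID: $$, parent (procfs): $ppid_from_proc, parent (shell): $PPID"

# Walk one step up the tree and print the parent's name, if it is visible.
if [ -r "/proc/$ppid_from_proc/comm" ]; then
    echo "parent process name: $(cat "/proc/$ppid_from_proc/comm")"
fi
```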

PID and IPC Namespaces

The PID namespace allows a process and its children to run in a new process tree that maps back to the host process tree. PID namespaces can be nested up to 32 levels deep.

The Inter-Process Communication (IPC) namespace isolates System V IPC objects (shared memory segments, semaphores, message queues) and POSIX message queues, so processes in the sandbox cannot share them with processes outside it.

Create a new PID namespace.

$ unshare -mipf
           ||||
           ||||
           |||fork after creating the new namespaces and before starting Bash
           |||
           ||create a new PID namespace
           ||
           |create a new IPC namespace
           |
           create a new mount namespace

If the process is not forked when creating a new PID namespace, the shell fails with the following error:

$  unshare -mip
-bash: fork: Cannot allocate memory
-bash-5.1#

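In this case, the unshare system call does not move the calling process into the new PID namespace; only its children are created there. Without --fork, unshare executes Bash directly, so Bash itself stays in the native namespace and the first child it forks becomes PID 1 of the new one. When that short-lived child exits, the namespace has lost its init process and the kernel refuses to create any further processes in it, which surfaces as the Cannot allocate memory error. The solution is to have unshare fork after creating the namespaces (-f). The new shell then runs as a child of the unshare command and becomes PID 1 of the new namespace.

With --fork, the PID of the shell inside the new namespace is 1.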

$ echo $$
1

However, if we list all the processes, we will see that the init system is still shown as PID 1.

# ps aux | head -n5
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0 173100 12476 ?        Ss   12:43   0:02 /sbin/init
root           2  0.0  0.0      0     0 ?        S    12:43   0:00 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<   12:43   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<   12:43   0:00 [rcu_par_gp]

ps reads its data from /proc, which is still the host's procfs and therefore reflects the native PID namespace.

Since a new PID namespace has been created, a new /proc must be mounted to match the new namespace.

# mount -t proc none /proc
# ps aux | head -n5
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   8352  4884 pts/1    S    13:56   0:00 -bash
root          12  0.0  0.0  10496  3664 pts/1    R+   13:56   0:00 ps aux
root          13  0.0  0.0   5956  1028 pts/1    S+   13:56   0:00 head -n5
[code-lyoko zain]#

That’s why the mount namespace is needed as well.

The proc filesystem can be mounted automatically (at /proc by default) by passing --mount-proc.

# unshare -mipf --mount-proc
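The effect of --fork and --mount-proc can be checked non-interactively. A minimal sketch, assuming unprivileged user namespaces are allowed (the -U and -r flags are added only so it can run without root); where they are not, the variables are set to "unavailable".

```shell
# With --fork, the command runs as the first process of the new PID namespace.
inner_pid=$(unshare -Urpf --mount-proc sh -c 'echo $$' 2>/dev/null) \
    || inner_pid=unavailable

# Numeric entries in the freshly mounted /proc are the PIDs visible in the namespace.
visible=$(unshare -Urpf --mount-proc sh -c 'ls /proc | grep -c "^[0-9]"' 2>/dev/null) \
    || visible=unavailable

echo "PID of the shell inside the namespace: $inner_pid"
echo "processes visible inside the namespace: $visible"
```

The second number should be small, since only the processes started inside the namespace show up in the new /proc.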

Network and UTS Namespaces

The Network namespace allows a new network stack to exist in the sandbox. The UNIX Time-Sharing (UTS) namespace exists solely to isolate the system’s hostname and NIS domain name.

Create a sandbox with a new Network and UTS namespace.

#  unshare -nu /bin/bash
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# hostname sandbox
# hostname
sandbox
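To confirm that the hostname change does not leak out of the sandbox, the name can be read before and after from the outside. A small sketch, assuming the hostname command is installed and unprivileged user namespaces are allowed (-U and -r are added so it can run without root); otherwise the inner name is reported as "unavailable".

```shell
host_before=$(uname -n)

# Change the hostname inside new user + UTS namespaces and print it from there.
inner_name=$(unshare -Uru sh -c 'hostname sandbox && uname -n' 2>/dev/null) \
    || inner_name=unavailable

host_after=$(uname -n)
echo "inside: $inner_name, host before: $host_before, host after: $host_after"
```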

User Namespace

User namespaces allow our sandboxed environment to have its own set of user and group IDs that map to very high, unique user and group IDs on the host system. They also allow the root user in the sandbox to be mapped to another user on the host.

$ unshare -Ur /bin/bash
           ||
           |map current user to root
           unshare user namespace
# whoami
root

Root in the sandbox is mapped to the user zain outside the sandbox.

In sandbox:

# date > /tmp/test
# ls -la /tmp/test
-rw-r--r-- 1 root root 29 okt 12 21:03 /tmp/test

Outside sandbox:

ls -la /tmp/test
-rw-r--r-- 1 zain zain 29 okt 12 21:03 /tmp/test
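The mapping itself is visible in /proc/self/uid_map, whose three columns are the first ID inside the namespace, the first ID outside, and the length of the range. A rough sketch, assuming unprivileged user namespaces are allowed; the variable names are illustrative.

```shell
outer_uid=$(id -u)

# Effective UID as seen from inside the user namespace.
inner_uid=$(unshare -Ur id -u 2>/dev/null) || inner_uid=unavailable

# The kernel's mapping table for that namespace: <inside> <outside> <count>.
inner_map=$(unshare -Ur cat /proc/self/uid_map 2>/dev/null) || inner_map=unavailable

echo "uid outside: $outer_uid"
echo "uid inside:  $inner_uid"
echo "uid_map:     $inner_map"
```

With -r, the second column of uid_map should be the UID of the user who ran unshare, which is why files created by root in the sandbox belong to that user on the host.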

Fully isolated environments

Putting it all together:

unshare -mipfnuUr --mount-proc /bin/bash 
mount -t tmpfs none /tmp
mount -t sysfs none /sys

The new tmpfs filesystem prevents us from sharing the /tmp directory with the host. The new sysfs filesystem gives the kernel a clean space to report its state from the point of view of our new namespaces.
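One way to check how isolated the sandbox is: compare the namespace IDs under /proc/self/ns from inside and outside. A minimal sketch, assuming unprivileged user namespaces are allowed; ns_script and the other names are illustrative, and the result is "unavailable" if the namespaces cannot be created.

```shell
# Small script that prints "<type> <namespace id>" for each namespace type.
ns_script='for ns in mnt ipc pid net uts user; do
    echo "$ns $(readlink /proc/self/ns/$ns)"
done'

# Run it once in the current namespaces and once inside the sandbox.
outside=$(sh -c "$ns_script")
inside=$(unshare -mipfnuUr --mount-proc sh -c "$ns_script" 2>/dev/null) \
    || inside=unavailable

shared=0
if [ "$inside" = unavailable ]; then
    echo "could not create the namespaces here (missing privileges?)"
else
    for ns in mnt ipc pid net uts user; do
        out_id=$(echo "$outside" | awk -v t="$ns" '$1 == t {print $2}')
        in_id=$(echo "$inside" | awk -v t="$ns" '$1 == t {print $2}')
        if [ "$out_id" = "$in_id" ]; then
            echo "$ns: still shared with the host"
            shared=$((shared + 1))
        else
            echo "$ns: isolated ($in_id)"
        fi
    done
fi
```

The cgroup and time namespaces are not listed because the command above does not unshare them.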

Container Network devices

When a container creates a new Network namespace, there are no network devices available to it. Devices must be added to the container’s network namespace to give the container network connectivity.

When containers are run as the root user:

A virtual adapter is created and attached to the container’s network namespace as eth0.
The virtual adapter joins a bridge that is shared with the host’s physical network interface.
The bridge handles routing traffic from the internal subnet to the network.

+------------------------------------------------------+
|                 Host                                 |
|                                                      | 
|+--------------------+                                |
||   Container        |                                |
||                    |                                |
||               +----|----+       +---------+      +-----+
||               |    |    |       |         |      |     |
||               |eth0|veth|-------| bridge  |------| eth |
||               |    |    |       |         |      |     |
||               +----|----+       +---------+      +-----+
||                    |                                |
|+--------------------+                                |
+------------------------------------------------------+
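The rootful plumbing in the diagram can be approximated by hand with iproute2. This is a rough sketch and not what any particular container runtime does. The device names br0, veth0 and eth0 are illustrative, and everything runs inside throwaway user and network namespaces (an assumption made so the host's real interfaces are untouched), which means both ends of the veth pair stay in the same namespace here. In a real setup the eth0 end would be moved into the container with ip link set eth0 netns <pid>.

```shell
links=$(unshare -Urn sh -c '
    set -e
    # The bridge that would sit next to the physical interface on the host.
    ip link add br0 type bridge
    # A veth pair: veth0 stays with the bridge, eth0 is the container side.
    ip link add veth0 type veth peer name eth0
    # Attach the host-side end to the bridge and bring both up.
    ip link set veth0 master br0
    ip link set br0 up
    ip link set veth0 up
    # List the device names that now exist in this network namespace.
    ip -o link show | cut -d: -f2
' 2>/dev/null) || links=unavailable

echo "$links"
```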

When containers are run as a non-root user:

slirp4netns creates a tap device inside the container’s network namespace and attaches to it.
slirp4netns handles traffic routing similar to a bridge, except in user space.

+-------------------------------------------------------+
|                 Host                                  |
|                                                       | 
|+--------------------+                                 |
||   Container        |                                 |
||                    |                                 |
||               +----|          +------------+      +-----+
||               |    |          |            |      |     |
||               |tap0|----------| slirp4netns|------| eth |
||               |    |          |            |      |     |
||               +----|          +------------+      +-----+
||                    |                                 |
|+--------------------+                                 |
+-------------------------------------------------------+

Allow rootless containers to ping through the physical network:

sysctl -w "net.ipv4.ping_group_range=0 2000000"

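ping_group_range defines the range of group IDs whose members are allowed to create unprivileged ICMP Echo (ping) sockets, that is, datagram sockets using IPPROTO_ICMP. The kernel default is 1 0, an empty range, which means nobody (not even root) may create such sockets; ping then has to use a raw socket, which requires privileges.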
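To see whether the current setting already covers your group, the range can be read back from procfs. A small sketch with a hypothetical helper, gid_in_ping_range; it only checks the primary group ID, while the kernel also considers supplementary groups.

```shell
# Succeeds when GID $1 lies within the inclusive range $2..$3.
gid_in_ping_range() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

range_file=/proc/sys/net/ipv4/ping_group_range
if [ -r "$range_file" ]; then
    # The file holds two numbers: the lowest and highest allowed GID.
    read -r low high < "$range_file"
    if gid_in_ping_range "$(id -g)" "$low" "$high"; then
        echo "GID $(id -g) may create ping sockets (range $low-$high)"
    else
        echo "GID $(id -g) is outside the ping group range ($low-$high)"
    fi
fi
```

With the default range of 1 0, the low bound is above the high bound, so the check fails for every GID.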

Create a sandbox and launch a Bash shell. Get its PID.

$ unshare -unUr /bin/bash
# echo $$
117140

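From a second terminal on the host, configure a device in the network namespace used by PID 117140. tap0 is the name of the device, and -c (--configure) brings the interface up with the recommended IP address.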

$ slirp4netns -c 117140 tap0
sent tapfd=6 for tap0
received tapfd=6
Starting slirp
* MTU:             1500
* Network:         10.0.2.0
* Netmask:         255.255.255.0
* Gateway:         10.0.2.2
* DNS:             10.0.2.3
* DHCP begin:      10.0.2.15
* DHCP end:        10.0.2.30
* Recommended IP:  10.0.2.100
WARNING: 127.0.0.1:* on the host is accessible as 10.0.2.2 (set --disable-host-loopback to prohibit connecting to 127.0.0.1:*)

The tap0 network adapter is created and configured in the sandbox.

# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: tap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 62:e3:6e:95:43:8d brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.100/24 brd 10.0.2.255 scope global tap0
       valid_lft forever preferred_lft forever
    inet6 fe80::60e3:6eff:fe95:438d/64 scope link
       valid_lft forever preferred_lft forever
# ping -c 3 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=255 time=10.3 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=255 time=8.06 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=255 time=7.19 ms

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 7.193/8.520/10.305/1.310 ms
#

October 12, 2022