Learning bits from Linux Containers the Hard Way workshop.
Linux namespaces are created with either the clone or the unshare system call.
The unshare command, part of util-linux, is a command-line wrapper around the unshare system call.
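Before creating any new namespaces, it can help to see which ones the current shell already belongs to. The sketch below assumes a Linux system with procfs mounted at /proc; the helper name list_namespaces is made up for illustration and is not part of util-linux.

```shell
#!/usr/bin/env bash
# list_namespaces is a hypothetical helper name.
# Prints one line per namespace of the given PID (default: this shell),
# in the form "<type> <type>:[<inode>]"; the inode numbers differ per system.
list_namespaces() {
    local pid="${1:-$$}" link
    for link in /proc/"$pid"/ns/*; do
        printf '%s %s\n' "${link##*/}" "$(readlink "$link")"
    done
}

list_namespaces
```

Two processes that print the same inode number for a given type share that namespace, so running this before and after unshare is a simple way to confirm that a new namespace was created.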
$ unshare --help
Usage:
unshare [options] [<program> [<argument>...]]
Run a program with some namespaces unshared from the parent.
Options:
-m, --mount[=<file>] unshare mounts namespace
-u, --uts[=<file>] unshare UTS namespace (hostname etc)
-i, --ipc[=<file>] unshare System V IPC namespace
-n, --net[=<file>] unshare network namespace
-p, --pid[=<file>] unshare pid namespace
-U, --user[=<file>] unshare user namespace
-C, --cgroup[=<file>] unshare cgroup namespace
-T, --time[=<file>] unshare time namespace
-f, --fork fork before launching <program>
--map-user=<uid>|<name> map current user to uid (implies --user)
--map-group=<gid>|<name> map current group to gid (implies --user)
-r, --map-root-user map current user to root (implies --user)
-c, --map-current-user map current user to itself (implies --user)
--map-auto map users and groups automatically (implies --user)
--map-users=<outeruid>,<inneruid>,<count>
map count users from outeruid to inneruid (implies --user)
--map-groups=<outergid>,<innergid>,<count>
map count groups from outergid to innergid (implies --user)
--kill-child[=<signame>] when dying, kill the forked child (implies --fork)
defaults to SIGKILL
--mount-proc[=<dir>] mount proc filesystem first (implies --mount)
--propagation slave|shared|private|unchanged
modify mount propagation in mount namespace
--setgroups allow|deny control the setgroups syscall in user namespaces
--keep-caps retain capabilities granted in user namespaces
-R, --root=<dir> run the command with root directory set to <dir>
-w, --wd=<dir> change working directory to <dir>
-S, --setuid <uid> set uid in entered namespace
-G, --setgid <gid> set gid in entered namespace
--monotonic <offset> set clock monotonic offset (seconds) in time namespaces
--boottime <offset> set clock boottime offset (seconds) in time namespaces
-h, --help display this help
-V, --version display version
For more details see unshare(1).
Mount Namespaces
Create a new mount namespace and start a Bash shell inside it.
unshare -m /bin/bash
Recall: The standard form of the mount command is:
mount -t type device dir
Mount a virtual memory filesystem (tmpfs) in the namespace at directory /mnt.
# mount -t tmpfs tmpfs /mnt
# mount | grep mnt
tmpfs on /mnt type tmpfs (rw,relatime,inode64)
PID Basics
Processes are organized in a tree structure. Every process has a parent except the first process.
Visualize the process tree with the pstree command.
$ pstree -S
The -S flag shows namespace transitions, i.e. where new namespaces have been created.
All processes are tracked in a special file system called procfs, which is usually mounted under /proc.
PID and IPC Namespaces
The PID namespace allows a process and its children to run in a new process tree that maps back to the host process tree. PID namespaces can be nested, up to 32 levels deep.
The Inter-Process Communication (IPC) namespace isolates IPC resources such as System V shared memory and POSIX message queues, which limits a process’s ability to share memory with processes outside the namespace.
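The mapping between a PID inside a namespace and its PID on the host can be inspected through the NSpid field of /proc/<pid>/status. This is a minimal sketch that assumes a kernel exposing that field (Linux 4.1 or later); the helper name pid_in_namespaces is hypothetical.

```shell
#!/usr/bin/env bash
# pid_in_namespaces is a hypothetical helper name.
# Prints the PID of the given process (default: this shell) as seen from
# each nested PID namespace, outermost first, innermost last.
pid_in_namespaces() {
    awk '/^NSpid:/ { $1 = ""; sub(/^ /, ""); print }' "/proc/${1:-$$}/status"
}

pid_in_namespaces
```

On the host this prints a single number. When a process that lives in a nested PID namespace is inspected from the host, the line should contain one number per level, ending with the PID the process sees for itself.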
Create a new PID namespace.
$ unshare -mipf
           ||||
           |||fork after creating the new namespaces and before starting Bash
           |||
           ||create a new PID namespace
           ||
           |create a new IPC namespace
           |
           create a new mount namespace
If the process is not forked when creating a new PID namespace, the following error occurs:
$ unshare -mip
-bash: fork: Cannot allocate memory
-bash-5.1#
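In this case, the shell cannot move into the new PID namespace. A process never changes its own PID namespace, so unshare only arranges for the children of the calling process to be created in the new one.
Without -f, Bash replaces the unshare process and stays in the native PID namespace. The first child it spawns becomes PID 1 of the new namespace, and once that child exits the namespace has no init process left, so every later fork fails with "Cannot allocate memory".
The solution is to have unshare fork before launching the program (-f). The new shell is then a child process of the unshare command and runs as PID 1 inside the new namespace.
The shell’s PID in the new namespace is 1.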
$ echo $$
1
However, if we list all the processes, we will see that the init system is PID 1.
# ps aux | head -n5
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 173100 12476 ? Ss 12:43 0:02 /sbin/init
root 2 0.0 0.0 0 0 ? S 12:43 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? I< 12:43 0:00 [rcu_gp]
root 4 0.0 0.0 0 0 ? I< 12:43 0:00 [rcu_par_gp]
ps is still showing the native PID namespace, because it reads its data from /proc, which is still the host’s mount.
Since a new PID namespace has been created, a new /proc must be mounted to match the new namespace.
# mount -t proc none /proc
# ps aux | head -n5
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 8352 4884 pts/1 S 13:56 0:00 -bash
root 12 0.0 0.0 10496 3664 pts/1 R+ 13:56 0:00 ps aux
root 13 0.0 0.0 5956 1028 pts/1 S+ 13:56 0:00 head -n5
[code-lyoko zain]#
That’s why the mount namespace is needed as well.
The proc filesystem can also be mounted automatically with --mount-proc, which mounts it at /proc unless another directory is given.
# unshare -mipf --mount-proc
Network and UTS Namespaces
The Network namespace allows a new network stack to exist in the sandbox. The Unix Time Sharing (UTS) namespace exists mainly to isolate the system’s hostname (and NIS domain name).
Create a sandbox with a new Network and UTS namespace.
# unshare -nu /bin/bash
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# hostname sandbox
# hostname
sandbox
User Namespace
User namespaces allow our sandboxed environment to have its own set of user and group IDs, which map to very high, unique user and group IDs back on the host system. They also allow the root user in the sandbox to be mapped to another, unprivileged user on the host.
$ unshare -Ur /bin/bash
           ||
           |map current user to root
           unshare user namespace
# whoami
root
Root in the sandbox is mapped to the user zain outside the sandbox.
In sandbox:
# date > /tmp/test
# ls -la /tmp/test
-rw-r--r-- 1 root root 29 okt 12 21:03 /tmp/test
Outside sandbox:
$ ls -la /tmp/test
-rw-r--r-- 1 zain zain 29 okt 12 21:03 /tmp/test
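The mapping that makes root in the sandbox appear as zain on the host can be read from the uid_map and gid_map files in procfs. This is a small sketch, assuming procfs is mounted at /proc; show_id_maps is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# show_id_maps is a hypothetical helper name.
# Each mapping line reads: <first ID inside> <first ID outside> <count>
show_id_maps() {
    local pid="${1:-$$}"
    echo "uid_map:"
    cat "/proc/$pid/uid_map"
    echo "gid_map:"
    cat "/proc/$pid/gid_map"
}

show_id_maps
```

Run inside a sandbox started with unshare -Ur, each map should contain a single line that maps ID 0 inside the namespace to the invoking user’s ID outside it, with a count of 1.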
Fully isolated environments
Putting it all together:
unshare -mipfnuUr --mount-proc /bin/bash
mount -t tmpfs none /tmp
mount -t sysfs none /sys
The new tmpfs filesystem prevents us from sharing a /tmp directory with the host. The new sysfs filesystem gives the kernel a clean place to report its state for all of our new namespaces.
Container Network devices
When a container creates a new Network namespace, there are no network devices available to it apart from a loopback interface that is down. Devices must be added to the container’s network namespace to give the container network connectivity.
When containers are run as the root user:
- A virtual adapter is created and attached to the container’s network namespace as eth0.
- The virtual adapter joins a bridge that is shared with the host’s physical network interface.
- The bridge handles routing traffic from the internal subnet to the network.
+------------------------------------------------------+
| Host |
| |
|+--------------------+ |
|| Container | |
|| | |
|| +----|----+ +---------+ +-----+
|| | | | | | | |
|| |eth0|veth|-------| bridge |------| eth |
|| | | | | | | |
|| +----|----+ +---------+ +-----+
|| | |
|+--------------------+ |
+------------------------------------------------------+
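One common way to build the layout above by hand is with iproute2 and nsenter. The sketch below is illustrative rather than what any particular container runtime does: the interface names veth0, ceth0 and br0, the PID 12345, and the helpers run and attach_veth are all assumptions. By default it only prints the commands; executing them requires root on the host.

```shell
#!/usr/bin/env bash
# Sketch only. veth0, ceth0, br0 and the PID below are illustrative assumptions.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# attach_veth PID: give the network namespace of PID an eth0 joined to br0.
attach_veth() {
    local pid="$1"
    run ip link add br0 type bridge                       # host-side bridge
    run ip link add veth0 type veth peer name ceth0       # veth pair: two connected ends
    run ip link set ceth0 netns "$pid"                    # move one end into the sandbox
    run nsenter -t "$pid" -n ip link set ceth0 name eth0  # rename it inside the sandbox
    run nsenter -t "$pid" -n ip link set eth0 up
    run ip link set veth0 master br0                      # plug the host end into the bridge
    run ip link set veth0 up
    run ip link set br0 up
}

attach_veth 12345   # hypothetical PID of a shell started with unshare -n
```

Address assignment, routing or NAT, and attaching the host’s physical interface to the bridge are left out to keep the sketch short.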
When containers are run as a non-root user:
- slirp4netns creates a tap device inside the container’s network namespace and attaches to it.
- slirp4netns handles traffic routing similar to a bridge, except in user space.
+-------------------------------------------------------+
| Host |
| |
|+--------------------+ |
|| Container | |
|| | |
|| +----| +------------+ +-----+
|| | | | | | |
|| |tap0|----------| slirp4netns|------| eth |
|| | | | | | |
|| +----| +------------+ +-----+
|| | |
|+--------------------+ |
+-------------------------------------------------------+
Allow rootless containers to ping through the physical network:
sysctl -w "net.ipv4.ping_group_range=0 2000000"
ping_group_range determines the range of group IDs whose members are allowed to create IPPROTO_ICMP sockets.
By default this range is 1 0, an empty range, which means no group is allowed to create IPPROTO_ICMP sockets. Pinging then requires a raw socket, which is reserved for privileged processes (root or CAP_NET_RAW).
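To check whether the current user is covered by the configured range, the two bounds can be read straight from procfs. This is a rough sketch: gid_in_range is a hypothetical helper, and only the primary group is tested, whereas the kernel also considers supplementary groups.

```shell
#!/usr/bin/env bash
# gid_in_range GID LOW HIGH: succeeds if LOW <= GID <= HIGH.
# Hypothetical helper; the kernel default "1 0" never matches any GID.
gid_in_range() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Read the current range; assume the kernel default if the file is unreadable.
range_file=/proc/sys/net/ipv4/ping_group_range
if [ -r "$range_file" ]; then
    read -r low high < "$range_file"
else
    low=1 high=0
fi

if gid_in_range "$(id -g)" "$low" "$high"; then
    echo "GID $(id -g) may create ICMP sockets (range: $low $high)"
else
    echo "GID $(id -g) may not create ICMP sockets (range: $low $high)"
fi
```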
Create a sandbox and launch a Bash shell. Get its PID.
$ unshare -unUr /bin/bash
# echo $$
117140
From a second terminal on the host, configure a device in the network namespace used by PID 117140. tap0 is the name of the device.
$ slirp4netns -c 117140 tap0
sent tapfd=6 for tap0
received tapfd=6
Starting slirp
* MTU: 1500
* Network: 10.0.2.0
* Netmask: 255.255.255.0
* Gateway: 10.0.2.2
* DNS: 10.0.2.3
* DHCP begin: 10.0.2.15
* DHCP end: 10.0.2.30
* Recommended IP: 10.0.2.100
WARNING: 127.0.0.1:* on the host is accessible as 10.0.2.2 (set --disable-host-loopback to prohibit connecting to 127.0.0.1:*)
The tap0 network adapter is created and configured in the sandbox.
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
link/ether 62:e3:6e:95:43:8d brd ff:ff:ff:ff:ff:ff
inet 10.0.2.100/24 brd 10.0.2.255 scope global tap0
valid_lft forever preferred_lft forever
inet6 fe80::60e3:6eff:fe95:438d/64 scope link
valid_lft forever preferred_lft forever
# ping -c 3 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=255 time=10.3 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=255 time=8.06 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=255 time=7.19 ms
--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 7.193/8.520/10.305/1.310 ms
#