Rootfull, rootless containers on Btrfs and ZFS
I recently tried to automate my rootless container setup inside an Ansible playbook. It didn’t turn out well, mainly because I couldn’t handle every disk scenario (e.g. single disk or RAID; getting a disk’s UUID; mounting the Btrfs root volume; mount options). After a couple of days without success, I decided to write a blog post about it instead (better to remind my future self of what to do, step by step).
All of my computers have either ZFS or Btrfs on root. Therefore, I’ll discuss how to make nerdctl and podman (in both rootless and rootfull mode) work on such filesystems. I don’t use Docker personally (because everyone does), so it won’t be covered here[1].
Motivation §
The default overlayfs storage backend[2] of containerd (what nerdctl uses behind the scenes) and podman pretty much works out of the box. So, here are the reasons why I’m trying not to use it:
- I like using non-default settings (tinkering is fun).
- containerd and podman have ZFS and Btrfs storage backends (mostly because both are copy-on-write filesystems), so using them makes my systems feel more cohesive.
- A lot of people just add themselves to the docker group[3]. Meanwhile, setting up rootless containers to work nicely usually isn’t straightforward, and there are quite a few shortcomings[4].
With that out of the way, let’s dive into the setup.
Prerequisites §
- A ZFS pool or Btrfs filesystem to store container layers, if either of these storage backends is used
- cgroup v2 enabled (either with systemd or cgroupfs)
- /etc/subuid and /etc/subgid properly configured[5] (on AlpineLinux you’ll also need the shadow-subids package)
- nerdctl requires at least rootlesskit and slirp4netns for rootless mode
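The checks above can be scripted. What follows is a minimal sketch, not an official tool: the function names are my own, and it assumes cgroup v2 is mounted at /sys/fs/cgroup and that your subordinate ID entries use your user name (entries keyed by numeric UID would need a second check).

```shell
#!/bin/sh
# Sketch: check the rootless prerequisites. Function names are illustrative.

# cgroup v2 (unified hierarchy) exposes cgroup.controllers at the cgroup root.
# The optional argument overrides the assumed mount point.
has_cgroup_v2() {
    [ -f "${1:-/sys/fs/cgroup}/cgroup.controllers" ]
}

# Succeeds if FILE contains a "name:start:count" entry for the given user name.
has_subid_entry() {
    user=$1 file=$2
    [ -r "$file" ] && grep -q "^${user}:[0-9][0-9]*:[0-9][0-9]*\$" "$file"
}

# Print a short report for the current (or given) user.
check_prereqs() {
    user=${1:-$(id -un)}
    if has_cgroup_v2; then echo "cgroup v2: ok"; else echo "cgroup v2: missing"; fi
    for f in /etc/subuid /etc/subgid; do
        if has_subid_entry "$user" "$f"; then
            echo "$f: ok"
        else
            echo "$f: no entry for $user"
        fi
    done
}

check_prereqs
```

It only reports and never changes anything, so it is safe to run before touching the system.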
Choosing storage backends §
Things are straightforward in rootfull mode. You can just stick with the backend (ZFS/Btrfs) that is also your filesystem, with little to no configuration[6].
Rootless mode, on the other hand, comes with some limitations:
- ZFS doesn’t grant unprivileged users all the capabilities needed to run containers[7], so we can’t use it (yet?).
- For Btrfs, podman supports it well, while nerdctl currently doesn’t seem to work[8].
Here’s what we’ll use for rootless mode:
| | nerdctl | podman |
|---|---|---|
| ZFS | native | overlay[9] |
| Btrfs | overlay | btrfs |
Notice that we’ll make use of the native snapshotter for nerdctl on ZFS. Why not the default overlayfs? Because ZFS doesn’t like having an overlayfs mount on top of it[10]. Other snapshotters (fuse-overlayfs, stargz) will probably also work, but they require installing the corresponding gRPC helper binaries and setting up additional services alongside containerd, so native is the easiest choice here.
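The table can be turned into a small lookup. This is a sketch under a few assumptions: `rootless_backend` is a name I made up, `findmnt` (from util-linux) may not be installed on every system, and for nerdctl on Btrfs it prints `overlayfs`, which is containerd’s name for what the table calls overlay.

```shell
#!/bin/sh
# Sketch: map (tool, filesystem) to the rootless backend from the table.

rootless_backend() {
    case "$1:$2" in
        nerdctl:zfs)   echo native ;;
        nerdctl:btrfs) echo overlayfs ;;  # containerd's name for "overlay"
        podman:zfs)    echo overlay ;;    # backed by fuse-overlayfs
        podman:btrfs)  echo btrfs ;;
        *)             return 1 ;;        # anything else: not covered here
    esac
}

# Report the suggested backend for the filesystem holding $HOME.
target=${HOME:-/}
if command -v findmnt >/dev/null 2>&1; then
    fs=$(findmnt -n -o FSTYPE -T "$target")
    for tool in nerdctl podman; do
        if backend=$(rootless_backend "$tool" "$fs"); then
            echo "$tool on $fs: $backend"
        else
            echo "$tool on ${fs:-unknown}: not covered by the table"
        fi
    done
fi
```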
nerdctl §
Rootfull mode §
ZFS §
We’ll start with ZFS. The README in the ZFS snapshotter repository already tells us what to do:
- Set up a ZFS filesystem. The ZFS filesystem name is arbitrary, but the mount point needs to be /var/lib/containerd/io.containerd.snapshotter.v1.zfs, when the containerd root is set to /var/lib/containerd/.
$ zfs create -o mountpoint=/var/lib/containerd/io.containerd.snapshotter.v1.zfs your-zpool/containerd
- Start containerd.
Pretty simple, right? For something new, I’ll do it the Ansible way:
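- name: Create ZFS dataset for containerd
  community.general.zfs:
    name: rpool/ROOT/containerd
    # the module requires an explicit state
    state: present
    extra_zfs_properties:
      # on/off are quoted so YAML does not turn them into booleans
      devices: 'off'
      xattr: sa
      acltype: posixacl
      canmount: 'on'
      mountpoint: /var/lib/containerd/io.containerd.snapshotter.v1.zfs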
Btrfs §
The process is pretty similar for the Btrfs storage backend. Most Btrfs-on-root setups don’t mount the root subvolume on /, and you will probably want to keep container layers when switching the subvolume mounted on /, so there are some extra steps involved:
# Mount the root subvolume somewhere first, assuming it is /dev/sda1 in this case
mount -t btrfs -o rw,noatime,user_subvol_rm_allowed,subvol=/ /dev/sda1 /mnt
# Create a top level subvolume (rootid=5) for containerd's storage
btrfs subvolume create /mnt/@containerd
Then stick the newly created subvolume into /etc/fstab and mount it:
# Replace /dev/sda1 with something more proper, such as UUID=...
/dev/sda1 /var/lib/containerd/io.containerd.snapshotter.v1.btrfs btrfs rw,noatime,nodev,compress-force=zstd,rescue=usebackuproot,ssd,space_cache=v2,commit=60,subvol=/@containerd 0 2
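To replace /dev/sda1 with a UUID, the entry can be generated instead of typed. This is a rough sketch: `fstab_line` is a hypothetical helper, the default options are a trimmed subset of the ones above, and `blkid` usually needs root to read the device. It only prints the line; appending it to /etc/fstab is left to you.

```shell
#!/bin/sh
# Sketch: print an fstab entry for a Btrfs subvolume, addressed by UUID.

# fstab_line UUID MOUNTPOINT SUBVOL [OPTIONS]
fstab_line() {
    uuid=$1 mountpoint=$2 subvol=$3
    opts=${4:-rw,noatime,nodev}
    printf 'UUID=%s %s btrfs %s,subvol=%s 0 2\n' \
        "$uuid" "$mountpoint" "$opts" "$subvol"
}

# The device defaults to the /dev/sda1 used in the example above.
dev=${1:-/dev/sda1}
if [ -b "$dev" ] && uuid=$(blkid -s UUID -o value "$dev") && [ -n "$uuid" ]; then
    fstab_line "$uuid" \
        /var/lib/containerd/io.containerd.snapshotter.v1.btrfs /@containerd
else
    echo "cannot read the UUID of $dev (not a block device, or blkid needs root)" >&2
fi
```

On a multi-device Btrfs filesystem, every member device reports the same filesystem UUID, which is one reason to prefer it over a device path.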
Configuring nerdctl §
After you are done setting up the storage, enable the containerd service with your system’s service manager.
You can start using nerdctl immediately: nerdctl --snapshotter zfs run --rm -it alpine:edge
To avoid specifying the snapshotter backend every time in the CLI, a configuration file may be created:
/etc/nerdctl/nerdctl.toml:
snapshotter = "zfs"
Voilà! Enjoy your new rootfull container setup!
Rootless mode §
Starting containerd §
In rootless mode, you need the containerd-rootless.sh script and a way to run it on user login (for convenience’s sake).
If you use systemd, great! nerdctl already provides the containerd-rootless-setuptool.sh script that does the job for you. Just follow the instructions!
For other service managers, if yours supports creating user services (runit and dinit do), use it. Otherwise, just start containerd-rootless.sh inside ~/.profile (or a similar file) or use your desktop environment’s autostart mechanism.
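For the ~/.profile route, the daemon should not be started again on every new login shell. Here is a minimal sketch of one way to guard against that with a PID file. `ensure_running` is my own helper, not something nerdctl ships. I assume the script is on your PATH under the name containerd-rootless.sh (adjust if your package installs it differently) and that XDG_RUNTIME_DIR is set and cleared on logout.

```shell
#!/bin/sh
# Sketch for ~/.profile: start the rootless containerd script at most once.

# ensure_running PIDFILE COMMAND [ARGS...]
# Starts COMMAND in the background unless the PID stored in PIDFILE is alive.
ensure_running() {
    pidfile=$1; shift
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return 0
    fi
    mkdir -p "$(dirname "$pidfile")"
    nohup "$@" >/dev/null 2>&1 &
    echo $! > "$pidfile"
}

# Only act when the script exists and a runtime directory is available.
script=$(command -v containerd-rootless.sh || true)
if [ -n "$script" ] && [ -n "${XDG_RUNTIME_DIR:-}" ]; then
    ensure_running "$XDG_RUNTIME_DIR/containerd-rootless.pid" "$script"
fi
```

A PID file can go stale if the PID is reused by another process, so a real service manager is still the better choice when one is available.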
Personally, I use AlpineLinux with OpenRC, which currently doesn’t have this functionality. For demonstration, I’ll set the daemon up using superd[11]:
- First, obviously, we need to start superd on user login (the author suggests doing it with your desktop environment).
- Now, create a service file for the containerd daemon (adapted from the example systemd one):
~/.config/services/containerd.service:
[Unit]
Description=containerd (rootless)
[Service]
ExecStart=/full/path/to/containerd-rootless.sh
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always
StartLimitBurst=3
StartLimitInterval=60s
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Delegate=yes
Type=simple
KillMode=mixed
- The final step is to enable the service and start it:
superctl enable --now containerd
Configuring nerdctl §
We’re done with starting the containerd daemon inside a user namespace. It’s time to configure nerdctl. If the overlay snapshotter is chosen (when Btrfs is the underlying filesystem), there is nothing else to be done. For the native snapshotter, we, again, create a configuration file:
~/.config/nerdctl/nerdctl.toml:
# The only important field is "snapshotter"
debug_full = false
snapshotter = "native"
insecure_registry = false
At this point, running nerdctl commands should work without issues.
buildkitd §
Unlike podman, which has buildah baked into the binary, nerdctl relies on the buildkitd daemon and the buildctl command for building container images.
In rootfull mode, it’s as easy as starting the buildkitd daemon the way your OS provides.
In rootless mode, the buildkitd daemon should be started after containerd in the same user namespace. Again, I’ll demonstrate the service setup process using superd. Also, the containerd worker will be used here instead of the default OCI worker, as it appears to speed up loading images into containerd.
~/.config/services/buildkit.service:
[Unit]
Description=BuildKit Daemon (Rootless)
After=containerd.service
[Service]
Type=simple
# containerd uses 'default' namespace.
# For Kubernetes the namespace is 'k8s.io'.
# The default namespace of buildkitd is 'buildkit'.
ExecStart=/path/to/containerd-rootless-setuptool.sh nsenter -- /usr/bin/buildkitd --addr=unix:///run/user/<uid>/buildkit-default/buildkitd.sock --root=/home/user/.local/share/buildkit-default --containerd-worker-namespace=default --containerd-worker-snapshotter=native
ExecReload=/bin/kill -s HUP $MAINPID
RestartSec=2
Restart=on-failure
KillMode=mixed
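The ExecStart line above contains two placeholders, `<uid>` and `/home/user`. As a small sketch, they can be filled in from the current session. `buildkitd_execstart` is a hypothetical helper of mine that only prints the command for pasting into the service file. The setuptool path is whatever you pass in, and /usr/bin/buildkitd is taken from the unit above.

```shell
#!/bin/sh
# Sketch: print the buildkitd ExecStart command with uid and home filled in.

# buildkitd_execstart SETUPTOOL UID HOME [SNAPSHOTTER]
buildkitd_execstart() {
    setuptool=$1 uid=$2 home=$3 snapshotter=${4:-native}
    printf '%s nsenter -- /usr/bin/buildkitd' "$setuptool"
    printf ' --addr=unix:///run/user/%s/buildkit-default/buildkitd.sock' "$uid"
    printf ' --root=%s/.local/share/buildkit-default' "$home"
    printf ' --containerd-worker-namespace=default'
    printf ' --containerd-worker-snapshotter=%s\n' "$snapshotter"
}

# Replace the first argument with the real location of the setuptool script.
buildkitd_execstart /path/to/containerd-rootless-setuptool.sh \
    "$(id -u)" "${HOME:-/home/user}"
```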
buildkitd can be forced to use the containerd worker in its configuration file:
~/.config/buildkit/buildkitd.toml:
[worker.oci]
enabled = false
[worker.containerd]
enabled = true
rootless = true
podman §
Rootfull §
The filesystem setup in rootfull mode for podman is roughly the same process as for nerdctl. You just need to change the storage mountpoint from /var/lib/containerd/io.containerd.snapshotter.v1.{zfs,btrfs} to /var/lib/containers/storage and create a ZFS dataset or a Btrfs subvolume there. Additionally, a system configuration is required:
/etc/containers/storage.conf:
[storage]
driver = "zfs" # or "btrfs"
runroot = "/run/containers/storage"
graphroot = "/var/lib/containers/storage"
rootless_storage_path = "$HOME/.local/share/containers/storage"
[storage.options]
pull_options = {enable_partial_images = "false", use_hard_links = "false", ostree_repos=""}
# podman uses legacy mount for ZFS
[storage.options.zfs]
fsname = "rpool/ROOT/containers"
mountopt = "nodev"
podman can be run without a daemon, so you don’t need to do anything extra besides what’s mentioned above.
Rootless §
In the Btrfs case, podman will fall back to the system configuration, and we can just use it as is (no configuration required). If you intend to use podman in both rootless and rootfull modes, a good way to manage the container storage is to use separate nested subvolumes inside the same top-level one:
btrfs subvolume create /mnt/@containers
btrfs subvolume create /mnt/@containers/your_user
btrfs subvolume create /mnt/@containers/root
chown your_user:your_user /mnt/@containers/your_user
And your /etc/fstab should now look like this:
/dev/sda1 /var/lib/containers/storage btrfs rw,noatime,nodev,compress-force=zstd,rescue=usebackuproot,ssd,space_cache=v2,commit=60,subvol=/@containers/root 0 2
/dev/sda1 /home/your_user/.local/share/containers/storage btrfs rw,noatime,nodev,compress-force=zstd,rescue=usebackuproot,ssd,space_cache=v2,commit=60,subvol=/@containers/your_user 0 2
For ZFS, since we have to use fuse-overlayfs, let’s override the system settings with a user configuration file:
~/.config/containers/storage.conf:
[storage]
driver = "overlay"
[storage.options.overlay]
force_mask = "private"
mount_program = "/usr/bin/fuse-overlayfs"
mountopt = "nodev"
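A typo in mount_program, or a missing fuse-overlayfs package, tends to surface only when the first container starts. This small sketch checks the configured path up front. `mount_program_of` is my own helper, and it assumes the value is written on one line in double quotes, as in the file above.

```shell
#!/bin/sh
# Sketch: verify that the mount_program named in storage.conf is executable.

# Print the quoted value of mount_program from the given config file.
mount_program_of() {
    sed -n 's/^[[:space:]]*mount_program[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' "$1"
}

conf=${XDG_CONFIG_HOME:-${HOME:-}/.config}/containers/storage.conf
if [ -f "$conf" ]; then
    prog=$(mount_program_of "$conf")
    if [ -n "$prog" ] && [ ! -x "$prog" ]; then
        echo "mount_program $prog is not executable; is fuse-overlayfs installed?" >&2
    fi
fi
```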
Conclusion §
I’m quite happy with the experiment so far. Since then, podman and nerdctl have been working nicely throughout my simple day-to-day usage. There are still things I want to try out in the future, though:
- docker in rootless mode (since, unlike nerdctl, it supports Btrfs)
- stargz snapshotter (the new hot thing, and its features look promising)
I’ll probably write another blog post if they appear to be interesting, and if I have some free time to test them in the future. See you then, and thanks for reading this until the end!
[1] Docker has extensive documentation for setting up rootless containers. I think it is good enough already. In short, Btrfs works while ZFS support is absent.
[2] containerd uses the term “snapshotter” while podman calls it “storage driver”.
[3] You can do the same with nerdctl, by the way, though it is highly discouraged.
[4] See https://github.com/containers/podman/blob/main/rootless.md
[5] Detailed instructions are available at https://rootlesscontaine.rs/getting-started/common/
[6] Be aware that containerd’s Btrfs storage implementation has some performance issues, e.g. not using Btrfs quota (see containerd/containerd#4217, containerd/containerd#6067 and containerd/containerd#6581).
[7] Check out this answer.
[8] I opened issue containerd/containerd#7514 on GitHub. You can keep track of the bug there.
[9] The overlay storage driver of podman can be configured to use fuse-overlayfs if the default overlayfs doesn’t work.
[10] See issue openzfs/zfs#8648. Bonus tip for AlpineLinux users: akms mounts an overlay filesystem inside /tmp/akms to build kernel modules by default. So, you either disable this behavior in /etc/akms.conf or don’t create the <your_root_pool>/tmp dataset in the first place (mount /tmp as tmpfs instead, which you should always do).
[11] Another option is to use s6-rc. ArtixLinux has a wonderful guide on the topic.