Hardening systemd services

@aszlig, indeed. However, those directories are either empty (/etc, /root, /usr, /var) or as they should be given the configured hardening (/dev, /proc, /sys, /nix):

$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -- $(readlink /run/current-system/sw/bin/ls) -l /
Running as unit: run-u16023.service
total 0
drwxr-xr-x  20 0 0 4160 May  3 13:24 dev
drwxr-xr-x   2 0 0   40 May  4 16:35 etc
drwxr-xr-x   3 0 0   60 May  4 16:35 nix
dr-xr-xr-x 376 0 0    0 May  4 16:35 proc
drwxr-xr-x   2 0 0   40 May  4 16:35 root
drwxrwxrwt   4 0 0   80 May  4 16:35 run
dr-xr-xr-x  13 0 0    0 Apr 28 17:44 sys
drwxr-xr-x   2 0 0   40 May  4 16:35 usr
drwxr-xr-x   2 0 0   40 May  4 16:35 var

AFAIU that’s because systemd’s setup_namespace() calls base_filesystem_create(), which creates those usual top-level directories.
This would explain why, as you noticed in the original PR:

Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure.

A way to limit access to those mountpoints/directories:

$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -- $(readlink /run/current-system/sw/bin/findmnt)
Running as unit: run-u16084.service
TARGET                                                     SOURCE                                       FSTYPE     OPTIONS
/                                                          tmpfs[/systemd-confinement/test]             tmpfs      rw,nosuid,nodev,size=1988236k,mode=755
|-/dev                                                     devtmpfs                                     devtmpfs   rw,nosuid,size=397648k,nr_inodes=991015,mode=755
| |-/dev/pts                                               devpts                                       devpts     rw,nosuid,noexec,relatime,gid=3,mode=620,ptmxmode=666
| |-/dev/shm                                               tmpfs                                        tmpfs      rw,nosuid,nodev
| |-/dev/hugepages                                         hugetlbfs                                    hugetlbfs  rw,relatime,pagesize=2M
| `-/dev/mqueue                                            mqueue                                       mqueue     rw,nosuid,nodev,noexec,relatime
|-/nix/store                                               losurdo/nix[/store]                          zfs        ro,relatime,xattr,posixacl
|-/proc                                                    proc                                         proc       rw,nosuid,nodev,noexec,relatime
|-/run                                                     tmpfs                                        tmpfs      rw,relatime
| |-/run/systemd/incoming                                  tmpfs[/systemd/propagate/run-u16084.service] tmpfs      ro,nosuid,nodev,size=1988236k,mode=755
| `-/run/systemd-confinement/test                          tmpfs[/systemd-confinement/test]             tmpfs      rw,nosuid,nodev,size=1988236k,mode=755
|   |-/run/systemd-confinement/test/dev                    devtmpfs                                     devtmpfs   rw,nosuid,size=397648k,nr_inodes=991015,mode=755
|   | |-/run/systemd-confinement/test/dev/pts              devpts                                       devpts     rw,nosuid,noexec,relatime,gid=3,mode=620,ptmxmode=666
|   | |-/run/systemd-confinement/test/dev/shm              tmpfs                                        tmpfs      rw,nosuid,nodev
|   | |-/run/systemd-confinement/test/dev/hugepages        hugetlbfs                                    hugetlbfs  rw,relatime,pagesize=2M
|   | `-/run/systemd-confinement/test/dev/mqueue           mqueue                                       mqueue     rw,nosuid,nodev,noexec,relatime
|   |-/run/systemd-confinement/test/nix/store              losurdo/nix[/store]                          zfs        ro,relatime,xattr,posixacl
|   |-/run/systemd-confinement/test/proc                   proc                                         proc       rw,nosuid,nodev,noexec,relatime
|   `-/run/systemd-confinement/test/run                    tmpfs                                        tmpfs      rw,relatime
|     `-/run/systemd-confinement/test/run/systemd/incoming tmpfs[/systemd/propagate/run-u16084.service] tmpfs      rw,nosuid,nodev,size=1988236k,mode=755
`-/sys                                                     sysfs                                        sysfs      rw,nosuid,nodev,noexec,relatime
  |-/sys/kernel/security                                   securityfs                                   securityfs rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/cgroup                                         cgroup2                                      cgroup2    rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot
  |-/sys/firmware/efi/efivars                              efivarfs                                     efivarfs   rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/bpf                                            bpf                                          bpf        rw,nosuid,nodev,noexec,relatime,mode=700
  |-/sys/fs/fuse/connections                               fusectl                                      fusectl    rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/pstore                                         pstore                                       pstore     rw,nosuid,nodev,noexec,relatime
  `-/sys/kernel/config                                     configfs                                     configfs   rw,nosuid,nodev,noexec,relatime

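That way is to use InaccessiblePaths= (the - prefix ignores a path that does not exist, and the + prefix resolves it relative to RootDirectory=):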

$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -pInaccessiblePaths=-+/run/systemd-confinement/test -pInaccessiblePaths=-+/dev -pInaccessiblePaths=-+/sys -- $(readlink /run/current-system/sw/bin/findmnt)
Running as unit: run-u16079.service
TARGET                    SOURCE                                       FSTYPE OPTIONS
/                         tmpfs[/systemd-confinement/test]             tmpfs  rw,nosuid,nodev,size=1988236k,mode=755
|-/dev                    tmpfs[/systemd/inaccessible/dir]             tmpfs  ro,nosuid,nodev,noexec,size=1988236k,mode=755
|-/nix/store              losurdo/nix[/store]                          zfs    ro,relatime,xattr,posixacl
|-/proc                   proc                                         proc   rw,nosuid,nodev,noexec,relatime
|-/run                    tmpfs                                        tmpfs  rw,relatime
| `-/run/systemd/incoming tmpfs[/systemd/propagate/run-u16079.service] tmpfs  ro,nosuid,nodev,size=1988236k,mode=755
`-/sys                    tmpfs[/systemd/inaccessible/dir]             tmpfs  ro,nosuid,nodev,noexec,size=1988236k,mode=755
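
To repeat this experiment without retyping the flags, the properties used above could be wrapped in a small shell helper. This is only a sketch: confinement_props is a hypothetical name, and the values are simply the ones from the command above, with the confinement name as the only parameter.

```shell
# Hypothetical helper (not part of systemd or NixOS): print, one per line,
# the systemd-run properties used in the command above for confinement "$1".
confinement_props() {
  name=$1
  root="/run/systemd-confinement/$name"
  printf '%s\n' \
    "-pPrivateMounts=1" \
    "-pRuntimeDirectory=systemd-confinement/$name" \
    "-pRootDirectory=$root" \
    "-pUMask=066" \
    "-pBindReadOnlyPaths=/nix/store" \
    "-pInaccessiblePaths=-+$root" \
    "-pInaccessiblePaths=-+/dev" \
    "-pInaccessiblePaths=-+/sys"
}
```

It would be used as `systemd-run -P $(confinement_props test) -- $(readlink /run/current-system/sw/bin/findmnt)`. The unquoted substitution relies on none of the values containing whitespace, which holds for the paths used here.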

That’s why I think we should drop TemporaryFileSystem=/ in favor of a RootDirectory= inside a RuntimeDirectory=. What do you think?