Hardening systemd services

It’s going to be a lot of work, not least because a lot of these services come from upstream without this hardening (and so might break on upgrades), but this would be a really, really welcome addition to NixOS - yet another good reason to run that service on NixOS specifically.

As one place to start, there are a lot of services that NixOS itself makes, or that packages create as a way of running something (e.g. at boot or post-install or on upgrade). They’re mostly small, and mostly uninteresting from an overall attack-surface complexity viewpoint, but this doesn’t mean they’re a waste of time:

  • They would be a good place to start with a much tighter set of defaults, with explicit additional permissions where needed.
  • Many of them are in-tree, with no (or fewer) entanglements with upstream or external changes.
  • Any issues they have with tight defaults would be a good way to introduce Nix devs to the need for explicit permissions at the time of authorship.
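To make "tight defaults plus explicit permissions" concrete, here is a minimal sketch of what such a small in-tree job could look like. The service name `example-setup` and its command are placeholders, not an existing module; the hardening directives are standard systemd options.

```nix
{ pkgs, ... }:
{
  # Hypothetical one-shot setup job; name and command are placeholders.
  systemd.services.example-setup = {
    description = "Example post-activation setup job";
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "${pkgs.coreutils}/bin/true";

      # Tight baseline: no privilege escalation, no capabilities,
      # read-only view of the system, no devices, no network.
      NoNewPrivileges = true;
      CapabilityBoundingSet = "";
      ProtectSystem = "strict";
      ProtectHome = true;
      PrivateTmp = true;
      PrivateDevices = true;
      PrivateNetwork = true;

      # Explicit additional permission: the one directory it may write to
      # (systemd creates /var/lib/example-setup and makes it writable).
      StateDirectory = "example-setup";
    };
  };
}
```

The point of the layout is that anything the job needs beyond the baseline shows up as a visible, reviewable line.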

Migration from existing defaults, and writing test cases, are (as always) going to be challenging, but again helped by keeping the initial problem set smaller.

Indeed, it would be best to start with the various Nix-related services.

Also, we can distinguish three cases for services:

  1. services written in nixpkgs because upstream doesn’t provide any systemd unit
  2. services written in nixpkgs INSTEAD of using upstream unit (there may be various good reasons to do so)
  3. services using the upstream unit

Hardening the services would certainly be done differently in case 3 compared to cases 1 and 2.
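For case 3, a sketch of how the hardening could be layered on top of an upstream unit, assuming a hypothetical package `pkgs.example-daemon` that ships its own `example-daemon.service`. As far as I understand, NixOS turns settings under `systemd.services.<name>` into a drop-in override when the unit itself comes from `systemd.packages`, so the upstream file stays untouched.

```nix
{ pkgs, ... }:
{
  # `pkgs.example-daemon` is hypothetical; it is assumed to ship
  # lib/systemd/system/example-daemon.service.
  systemd.packages = [ pkgs.example-daemon ];

  systemd.services.example-daemon = {
    # Units pulled in through systemd.packages are not enabled by default.
    wantedBy = [ "multi-user.target" ];

    # Only the hardening is added here; ExecStart and the rest
    # keep coming from the upstream unit file.
    serviceConfig = {
      NoNewPrivileges = true;
      ProtectSystem = "strict";
      ProtectHome = true;
      PrivateTmp = true;
    };
  };
}
```

In cases 1 and 2 the same directives would simply sit next to `ExecStart` in the nixpkgs-authored unit.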

Does someone know if it’s possible to get a report or a log entry when a service started by a systemd unit tries to violate a restriction? That would help a lot with debugging.

I was trying to articulate a 4th (0th) case compared to your list, where basically there’s something that needs to be run as post activation for a package, and that gets deferred to a little “throwaway” systemd job.

Little things that make sure various paths exist with the right permissions (sometimes via systemd tmpfiles, and probably some should be migrated to that method). Other “run once for setup” type examples, but generally: not actually starting a running service in the normal sense.
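As an illustration of the tmpfiles route, a "make sure this path exists" job can often be replaced by a single rule. The user, group and path below are made up for the example.

```nix
{
  # Hypothetical service account that owns the directory.
  users.users.example = {
    isSystemUser = true;
    group = "example";
  };
  users.groups.example = { };

  # tmpfiles.d syntax: type, path, mode, user, group, age.
  # "d" creates the directory if it is missing and fixes up
  # mode and ownership; "-" means no age-based cleanup.
  systemd.tmpfiles.rules = [
    "d /var/lib/example 0750 example example -"
  ];
}
```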

See the discussion in nixos/systemd-sandbox: A generic sandboxing module by dasJ · Pull Request #87661 · NixOS/nixpkgs · GitHub, which attempted to introduce something like this already

Speaking of hardening, has there been any work put into making systemd.services.<name>.confinement compatible with ProtectSystem = strict?

There is an open systemd issue about that: ProtectSystem=strict shouldn't take precedence over TemporaryFileSystem=/ · Issue #18999 · systemd/systemd · GitHub

AFAIU, the problem stems from using ProtectSystem= with:

RootDirectory = "/var/empty";
TemporaryFileSystem = "/";

However, thinking about it, in some modules (tor, biboumi, croc, sourcehut, freeciv, transmission, public-inbox) I’ve been using ProtectSystem=/DynamicUser= without problems with this setup:

RuntimeDirectory = ["some-service/root"];
RootDirectory = "/run/some-service/root";
InaccessiblePaths = ["-+/run/some-service/root"];

AFAIU the InaccessiblePaths= is not necessary, it’s just cleaner to not have the root directory mounted twice inside the chroot (at / and /run/some-service/root).

I’ve not given it much thought, but maybe systemd-confinement could use a similar setup, with something like:

let
  rootDir = "/run/systemd-confinement/${mkPathSafeName name}";
in {
  RuntimeDirectory = [ (removePrefix "/run/" rootDir) ];
  RootDirectory = rootDir;
  InaccessiblePaths = [ "-+${rootDir}" ];
}

Ping @aszlig

Not sure whether I understand this correctly, but wouldn’t this result in a less secure environment than e.g. with confinement.mode = "chroot-only", because you get additional mounts rather than just the store paths you’re referencing?

@aszlig, I may be overlooking something and be wrong but AFAIU this would not change which store paths are available or not, it’s just changing where to put the RootDirectory=: instead of using TemporaryFileSystem=["/"], it would use a temporary tmpfs directory in /run/systemd-confinement/. From there BindPaths=/BindReadOnlyPaths= remain necessary for each path not implicitly handled by systemd.

See for instance what I did in services.sourcehut.
Notice that:

  • The store path needs to be mounted explicitly with BindReadOnlyPaths=[builtins.storeDir]; here systemd-confinement would be more selective of course.
  • Both RootDirectoryStartOnly= and ProtectSystem= are enabled.
  • The hardening options currently set by systemd-confinement’s "full-apivfs" are enabled.
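Put together, the setup described in these notes might look roughly like the following. This is a sketch based on the options named above, not a copy of the sourcehut module; `some-service` and the `pkgs.hello` command are placeholders.

```nix
{ pkgs, ... }:
let
  # Must live under /run so that RuntimeDirectory= can create it.
  rootDir = "/run/some-service/root";
in
{
  systemd.services.some-service = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      # Placeholder command.
      ExecStart = "${pkgs.hello}/bin/hello";

      # A fresh directory under /run serves as the chroot.
      RuntimeDirectory = [ "some-service/root" ];
      RootDirectory = rootDir;
      RootDirectoryStartOnly = true;
      # Optional: avoid seeing the root a second time inside the chroot.
      InaccessiblePaths = [ "-+${rootDir}" ];

      # The whole store is bound here; systemd-confinement would
      # bind only the closure of the service instead.
      BindReadOnlyPaths = [ builtins.storeDir ];

      # These are the options reported to work with this layout.
      ProtectSystem = "strict";
      DynamicUser = true;
    };
  };
}
```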

Ah, got it I guess. So are you implicitly saying that this would for example work around the DynamicUser issue?

edit: Just tested this and it breaks the chroot-only confinement:

subtest: chroot-only confinement
machine: must succeed: chroot-exec ls -1 / | paste -sd,
machine # [    8.520499] systemd[1]: Created slice Slice /system/test1.
machine # [    8.522179] systemd[1]: Started Confined Test Service 1 (PID 925/UID 0).
machine # [    8.536075] systemd[1]: test1@0-925-0.service: Deactivated successfully.
(finished: must succeed: chroot-exec ls -1 / | paste -sd,, in 0.09 seconds)
Test "chroot-only confinement" failed with error: "bin,dev,etc,nix,proc,root,run,sys,usr,var != bin,nix,run"

Even the full-apivfs variant fails because /etc ends up being bind-mounted into the chroot. It’s been a while but I faintly remember trying something similar already.

@aszlig, indeed, however those directories are either empty (/etc, /root, /usr, /var) or as they should be wrt. the configured hardening (/dev, /proc, /sys, /nix):

$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -- $(readlink /run/current-system/sw/bin/ls) -l /
Running as unit: run-u16023.service
total 0
drwxr-xr-x  20 0 0 4160 May  3 13:24 dev
drwxr-xr-x   2 0 0   40 May  4 16:35 etc
drwxr-xr-x   3 0 0   60 May  4 16:35 nix
dr-xr-xr-x 376 0 0    0 May  4 16:35 proc
drwxr-xr-x   2 0 0   40 May  4 16:35 root
drwxrwxrwt   4 0 0   80 May  4 16:35 run
dr-xr-xr-x  13 0 0    0 Apr 28 17:44 sys
drwxr-xr-x   2 0 0   40 May  4 16:35 usr
drwxr-xr-x   2 0 0   40 May  4 16:35 var

AFAIU that’s because systemd’s setup_namespace() calls base_filesystem_create(), which creates those usual top-level directories.
This would explain why, as you noticed in the original PR:

Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure.

A way to limit access to those mountpoints/directories:

$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -- $(readlink /run/current-system/sw/bin/findmnt)
Running as unit: run-u16084.service
TARGET                                                     SOURCE                                       FSTYPE     OPTIONS
/                                                          tmpfs[/systemd-confinement/test]             tmpfs      rw,nosuid,nodev,size=1988236k,mode=755
|-/dev                                                     devtmpfs                                     devtmpfs   rw,nosuid,size=397648k,nr_inodes=991015,mode=755
| |-/dev/pts                                               devpts                                       devpts     rw,nosuid,noexec,relatime,gid=3,mode=620,ptmxmode=666
| |-/dev/shm                                               tmpfs                                        tmpfs      rw,nosuid,nodev
| |-/dev/hugepages                                         hugetlbfs                                    hugetlbfs  rw,relatime,pagesize=2M
| `-/dev/mqueue                                            mqueue                                       mqueue     rw,nosuid,nodev,noexec,relatime
|-/nix/store                                               losurdo/nix[/store]                          zfs        ro,relatime,xattr,posixacl
|-/proc                                                    proc                                         proc       rw,nosuid,nodev,noexec,relatime
|-/run                                                     tmpfs                                        tmpfs      rw,relatime
| |-/run/systemd/incoming                                  tmpfs[/systemd/propagate/run-u16084.service] tmpfs      ro,nosuid,nodev,size=1988236k,mode=755
| `-/run/systemd-confinement/test                          tmpfs[/systemd-confinement/test]             tmpfs      rw,nosuid,nodev,size=1988236k,mode=755
|   |-/run/systemd-confinement/test/dev                    devtmpfs                                     devtmpfs   rw,nosuid,size=397648k,nr_inodes=991015,mode=755
|   | |-/run/systemd-confinement/test/dev/pts              devpts                                       devpts     rw,nosuid,noexec,relatime,gid=3,mode=620,ptmxmode=666
|   | |-/run/systemd-confinement/test/dev/shm              tmpfs                                        tmpfs      rw,nosuid,nodev
|   | |-/run/systemd-confinement/test/dev/hugepages        hugetlbfs                                    hugetlbfs  rw,relatime,pagesize=2M
|   | `-/run/systemd-confinement/test/dev/mqueue           mqueue                                       mqueue     rw,nosuid,nodev,noexec,relatime
|   |-/run/systemd-confinement/test/nix/store              losurdo/nix[/store]                          zfs        ro,relatime,xattr,posixacl
|   |-/run/systemd-confinement/test/proc                   proc                                         proc       rw,nosuid,nodev,noexec,relatime
|   `-/run/systemd-confinement/test/run                    tmpfs                                        tmpfs      rw,relatime
|     `-/run/systemd-confinement/test/run/systemd/incoming tmpfs[/systemd/propagate/run-u16084.service] tmpfs      rw,nosuid,nodev,size=1988236k,mode=755
`-/sys                                                     sysfs                                        sysfs      rw,nosuid,nodev,noexec,relatime
  |-/sys/kernel/security                                   securityfs                                   securityfs rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/cgroup                                         cgroup2                                      cgroup2    rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot
  |-/sys/firmware/efi/efivars                              efivarfs                                     efivarfs   rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/bpf                                            bpf                                          bpf        rw,nosuid,nodev,noexec,relatime,mode=700
  |-/sys/fs/fuse/connections                               fusectl                                      fusectl    rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/pstore                                         pstore                                       pstore     rw,nosuid,nodev,noexec,relatime
  `-/sys/kernel/config                                     configfs                                     configfs   rw,nosuid,nodev,noexec,relatime

One way to do that is to use InaccessiblePaths=:

$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -pInaccessiblePaths=-+/run/systemd-confinement/test -pInaccessiblePaths=-+/dev -pInaccessiblePaths=-+/sys -- $(readlink /run/current-system/sw/bin/findmnt)
Running as unit: run-u16079.service
TARGET                    SOURCE                                       FSTYPE OPTIONS
/                         tmpfs[/systemd-confinement/test]             tmpfs  rw,nosuid,nodev,size=1988236k,mode=755
|-/dev                    tmpfs[/systemd/inaccessible/dir]             tmpfs  ro,nosuid,nodev,noexec,size=1988236k,mode=755
|-/nix/store              losurdo/nix[/store]                          zfs    ro,relatime,xattr,posixacl
|-/proc                   proc                                         proc   rw,nosuid,nodev,noexec,relatime
|-/run                    tmpfs                                        tmpfs  rw,relatime
| `-/run/systemd/incoming tmpfs[/systemd/propagate/run-u16079.service] tmpfs  ro,nosuid,nodev,size=1988236k,mode=755
`-/sys                    tmpfs[/systemd/inaccessible/dir]             tmpfs  ro,nosuid,nodev,noexec,size=1988236k,mode=755

That’s why I think we should drop TemporaryFileSystem=/ in favor of a RootDirectory= inside a RuntimeDirectory=. What do you think?

How about we abstract these types of services on top of systemd.services.<name>?

E.g.:

  • systemd.hardware-services.<name>
  • systemd.network-services.<name>
  • systemd.privileged-services.<name>
  • systemd.unprivileged-services.<name>

The interface would be the same as systemd.services.<name>, but with different defaults befitting their general purpose/class of service.

I’d actually love to have it implemented in a whitelist manner though, with systemd.unprivileged-services.<name> as the base and the purpose-specific services whitelisting the privileges that most services of that kind (e.g. networking services) need.
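A rough sketch of the whitelist idea using only what exists today: a restrictive base set of defaults, and a "network" class that re-allows just what it needs. The attribute sets `unprivilegedDefaults` and `networkDefaults` and the service name are invented for illustration; a real implementation would presumably expose them as options rather than `let` bindings.

```nix
{ lib, pkgs, ... }:
let
  # Restrictive base. Every value is wrapped in mkDefault so that
  # any module can still override it for a specific service.
  unprivilegedDefaults = lib.mapAttrs (_: lib.mkDefault) {
    DynamicUser = true;
    NoNewPrivileges = true;
    ProtectSystem = "strict";
    ProtectHome = true;
    PrivateTmp = true;
    PrivateDevices = true;
    CapabilityBoundingSet = "";
    RestrictAddressFamilies = "none";
  };

  # Class-specific whitelist: the base plus IP sockets.
  networkDefaults = unprivilegedDefaults // {
    RestrictAddressFamilies = lib.mkDefault [ "AF_INET" "AF_INET6" ];
  };
in
{
  systemd.services.example-daemon = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig = networkDefaults // {
      # Placeholder command.
      ExecStart = "${pkgs.hello}/bin/hello";
    };
  };
}
```

Because everything is a `mkDefault`, setting e.g. `systemd.services.example-daemon.serviceConfig.PrivateDevices = false;` elsewhere wins without needing `mkForce`.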

That’s a cool idea! Makes it very obvious if a service has been hardened yet, so you don’t need to go trawling through nixpkgs to check :)

I think it’s vague: a network-service class could have programs such as

  • unbound
  • nsd
  • samba

You can’t really have the same hardening for them: unbound could work without any persistent data, nsd could use only a user-defined configuration, and samba would need to access the filesystem.

I don’t think classes of this kind could be used.

Using systemd-analyze security is the way to see how a service is hardened, so you don’t have to dig in nixpkgs :)

Emphasis on default: Nix already makes it really easy to override these things for the specific service you need. This proposal just reduces boilerplate, makes it easier for contributors with less experience to know what they should be setting, and advertises that these things are possible to those who don’t know yet.

Your mixins-alike approach is similar, and probably matches the underlying setting types better, but this one fits better within the existing nixpkgs module system. I think both are good suggestions.

systemd-analyze security only works if I have the service already installed, which I probably don’t if I’m still considering whether to use it. Installing it for the sake of checking is about as much effort as looking at its module; reading the module is perhaps even less, considering download times vs. my existing local copy of nixpkgs.

But yes, systemd-analyze security is very handy for checking the state of my running system :)

Indeed, I didn’t think about the case in which you would want to know about the service without installing it.

There are a bunch of good practices and code examples on how to secure systemd services in the nix-bitcoin project:

Nginx has a bunch of recommended* options around security. These options set the defaults for other options. Maybe such a scheme is applicable to systemd services as well?

For example, systemd.services.<name>.recommendedHardening, possibly split up into multiple recommendations. Because this only sets defaults of other options, you can still easily override it for a specific service.
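One way this could be prototyped is by extending the per-service submodule, similar in spirit to how the confinement options are attached to `systemd.services.<name>`. To be clear, `recommendedHardening` does not exist in nixpkgs; the option name and the chosen defaults below are assumptions for the sake of the sketch.

```nix
{ lib, ... }:
{
  # Declaring the option again with a submodule type merges the
  # extra option into every systemd.services.<name> entry.
  options.systemd.services = lib.mkOption {
    type = lib.types.attrsOf (lib.types.submodule ({ config, ... }: {
      # Hypothetical option, off by default.
      options.recommendedHardening =
        lib.mkEnableOption "a hypothetical set of recommended hardening defaults";

      # Only defaults are set, so per-service values take precedence.
      config.serviceConfig = lib.mkIf config.recommendedHardening {
        NoNewPrivileges = lib.mkDefault true;
        ProtectSystem = lib.mkDefault "strict";
        ProtectHome = lib.mkDefault true;
        PrivateTmp = lib.mkDefault true;
        RestrictSUIDSGID = lib.mkDefault true;
        LockPersonality = lib.mkDefault true;
      };
    }));
  };
}
```

A service would then opt in with `systemd.services.foo.recommendedHardening = true;` and relax individual settings as needed, e.g. `serviceConfig.ProtectHome = false;`.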

Are there any general hardening guidelines after all? I don’t really like how many services run as root with full privileges…

I saw the wiki: it’s something, but not much after all.

Also, the idea of having to harden all/most services from nixpkgs myself frustrates me as a beginner.

I had an idea for something similar, in the form of an additional attrset within the systemd service specification:

harden = {
  basic = true; # basic setup, like RPC, no SUID, etc.
  execOnlyNix = true; # allow executing only binaries from `/nix/store`
  protectKernel = true; # disable access to kernel logs, modules and tunables
  proc = true; # limit unit view into `/proc`
  systemCall = true; # limit system calls and allow only native system call facilities (important on x86-64)
  onlyLocalhost = true; # allow listening only on localhost
};

I am not 100% sure about the naming of the hardening parts, but the idea is there to discuss.
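To give the discussion something concrete, here is one possible mapping from those flags to systemd directives, written as a plain function from the `harden` attrset to a `serviceConfig` fragment. This is my own guess at the semantics, not the draft PR’s implementation; in particular, `onlyLocalhost` is approximated with IP address filtering, which restricts all IP traffic rather than just listening.

```nix
{ lib }:
# Takes an attrset like { basic = true; proc = true; } and returns
# the matching serviceConfig fragment. Missing flags count as false.
harden:
lib.optionalAttrs (harden.basic or false) {
  NoNewPrivileges = true;
  RestrictSUIDSGID = true;
}
// lib.optionalAttrs (harden.execOnlyNix or false) {
  # Nothing is executable except what lives in the store.
  NoExecPaths = [ "/" ];
  ExecPaths = [ "/nix/store" ];
}
// lib.optionalAttrs (harden.protectKernel or false) {
  ProtectKernelLogs = true;
  ProtectKernelModules = true;
  ProtectKernelTunables = true;
}
// lib.optionalAttrs (harden.proc or false) {
  ProtectProc = "invisible";
  ProcSubset = "pid";
}
// lib.optionalAttrs (harden.systemCall or false) {
  SystemCallArchitectures = "native";
  SystemCallFilter = [ "@system-service" ];
}
// lib.optionalAttrs (harden.onlyLocalhost or false) {
  # Assumption: approximated via IP filtering (see note above).
  IPAddressDeny = "any";
  IPAddressAllow = "localhost";
}
```

Saved as, say, `harden.nix` (a made-up file name), it could be used as `serviceConfig = import ./harden.nix { inherit lib; } { basic = true; proc = true; };`.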

I have created a draft PR with an implementation of such options.
