Hardening systemd services

Solene · January 15, 2022, 7:02pm

Hi,

With systemd it’s possible to heavily restrict the permissions of running services. I think it would be great to harden services where possible, to reduce the attack surface and the potential damage in case of a 0-day.

I started to show what’s possible with thermald (thermald: disable network access by rapenne-s · Pull Request #155142 · NixOS/nixpkgs · GitHub). We could add a much longer list of restrictions for that little service, but I’m not sure we want to go that way.

What do you think about this?

asymmetric · January 15, 2022, 7:52pm

It would be cool to have a whitelist approach instead, e.g. use PrivateNetwork=true by default on all units, and set it to false when needed.
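For illustration, here is a rough sketch of how such a default could be expressed downstream as a NixOS module. It extends the existing systemd.services submodule and sets PrivateNetwork with mkDefault, so any unit can still opt out with a plain assignment. The service name my-web-server is hypothetical and assumed to be defined elsewhere. Applying this blindly to every unit would very likely break services that need the network, so treat it as a sketch of the idea rather than something to deploy as-is.

```nix
{ lib, ... }:
{
  # Extend the per-service submodule: every unit gets PrivateNetwork=true
  # at "default" priority.
  options.systemd.services = lib.mkOption {
    type = lib.types.attrsOf (lib.types.submodule {
      config.serviceConfig.PrivateNetwork = lib.mkDefault true;
    });
  };

  # Opt-out for a (hypothetical) service that needs network access.
  # A plain assignment has higher priority than mkDefault.
  config.systemd.services.my-web-server.serviceConfig.PrivateNetwork = false;
}
```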

Solene · January 15, 2022, 9:06pm

A whitelist would indeed be good, but it should be opt-in at first so we can migrate to it. I think there is a huge list of knobs that could be set to true (meaning the capability is disabled, i.e. more secure) for most services, covering things like:

accessing other users’ /home
altering kernel tunables
changing the clock
mounting filesystems

etc…

You can use /run/current-system/sw/bin/systemd-analyze security [service] to check the list of available systemd tunables and what they do. The whitelist approach would be good, but I suppose defining some profiles within it would be even better:

services running as root / non root
services using network
services affecting hardware

You don’t really want the same default whitelist for a Matrix server as for fwupd, which updates your hardware firmware.

TLATER · January 15, 2022, 10:03pm

Whelp, doing this in nixpkgs proper makes a lot more sense than my idea of hardening all the services I use downstream

While I agree it’s the best way to get consistent use of this upstream, I have one concern with a global whitelist - it actively goes against the systemd defaults, and therefore might be very surprising to a new user.

In fact, I’m fairly sure the concept of cgroups in systemd hasn’t even penetrated the public concept of what systemd does (and will probably cause another wave of rumbling and reminiscing of the “good ol’” SysV/cron days). It’ll take careful advertising and documentation to ensure that this doesn’t turn into a reason people feel NixOS is too complicated.

I’d suggest, for example, keeping this opt-in for user-defined services even if it becomes opt-out in the future, as long as systemd keeps it opt-in.

shimun · January 16, 2022, 2:26pm

Couldn’t hardening be introduced as an option on systemd.services.* with some reasonable defaults? I occasionally write services and often want to harden them, but don’t see them as critical enough to spend the time reading up on systemd settings. An option harden = true that applies some sensible defaults would really help in this case.

Solene · January 16, 2022, 2:44pm

This would be fine, but the hardening strategy will differ depending on the service itself. That’s why I think a set of “profiles”, each defining some knobs, would be useful.

As someone maintaining a systemd unit, you would have to ask yourself some questions:

Does it run as root? If not, use the profile run-as-user, which turns on: NoNewPrivileges / ProtectKernelLogs / ProtectKernelModules / ProtectKernelTunables / ProtectProc / RestrictSUIDSGID …
Does it use the network? If not, use the profile no-network, which turns on: PrivateNetwork
Does it need to read all files on the filesystem? If not, use the profile no-fs, which turns on: PrivateUsers / PrivateTmp / ProtectSystem / ProtectProc …

Then define harden = [ run-as-user no-network no-fs ];
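To make the profile idea a bit more concrete, here is a minimal sketch of what it could look like as a NixOS module. Everything in it is an assumption for illustration: the harden option does not exist in nixpkgs, the profile contents are only the knobs named above, and the values chosen for ProtectProc and ProtectSystem ("invisible" and "strict") are guesses, since those two settings are not simple booleans.

```nix
# Hypothetical module: the `harden` option and the profile names are
# illustrative only and do not exist in nixpkgs.
{ lib, ... }:
let
  profiles = {
    run-as-user = {
      NoNewPrivileges = true;
      ProtectKernelLogs = true;
      ProtectKernelModules = true;
      ProtectKernelTunables = true;
      ProtectProc = "invisible"; # assumed value, not a boolean setting
      RestrictSUIDSGID = true;
    };
    no-network = {
      PrivateNetwork = true;
    };
    no-fs = {
      PrivateUsers = true;
      PrivateTmp = true;
      ProtectSystem = "strict"; # assumed value
      ProtectProc = "invisible";
    };
  };
in
{
  options.systemd.services = lib.mkOption {
    type = lib.types.attrsOf (lib.types.submodule ({ config, ... }: {
      options.harden = lib.mkOption {
        type = lib.types.listOf (lib.types.enum (lib.attrNames profiles));
        default = [ ];
        description = "Hardening profiles to apply to this unit.";
      };

      # Merge the selected profiles left to right, then wrap every value in
      # mkDefault so a plain serviceConfig assignment still overrides it.
      config.serviceConfig = lib.mapAttrs (_: lib.mkDefault)
        (lib.foldl' (acc: name: acc // profiles.${name}) { } config.harden);
    }));
  };
}
```

With something like this in place, a unit would select profiles with systemd.services.my-daemon.harden = [ "run-as-user" "no-network" "no-fs" ]; (profile names as strings, my-daemon being a placeholder), and could relax a single knob with an ordinary serviceConfig assignment.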

lewo · January 16, 2022, 3:02pm

FYI, two years ago we started doing this for mail services, but the postfix sandboxing is still open.

picnoir · January 16, 2022, 5:03pm

shimun:

Couldn’t hardening be introduced as an option on systemd.services.* with some reasonable defaults?

I was initially opposed to the idea of trying to generalize hardening: without a deep knowledge of the upstream software internals, it’s easy to break the service in subtle ways. As a matter of fact, I had to fix 3 services after the introduction of overzealous hardening rules in 2021. Properly hardening a service is definitely not trivial, especially for services having many options and/or poor VM test coverage.

I ended up changing my mind, though: the current “permissive” approach wrt. service hardening is creating a mess. More and more critical services are getting hardened by different people, with different ideas about how strict the hardening is meant to be. It seems like we need some norms about what to harden first, and how.

@andir brainstormed around the idea of hardening profiles last year: nixos/systemd: introduce hardening profiles for services · andir/nixpkgs@4d9c0cf · GitHub. @andir, what would be the next step to push this idea forward?

uep · January 17, 2022, 6:16am

It’s going to be a lot of work, not least because a lot of these services come from upstream without this hardening (and so might break on upgrades), but this would be a really, really welcome addition to NixOS - yet another good reason to run that service on NixOS specifically.

As one place to start, there are a lot of services that NixOS itself makes, or that packages create as a way of running something (e.g. at boot or post-install or on upgrade). They’re mostly small, and mostly uninteresting from an overall attack-surface complexity viewpoint, but this doesn’t mean they’re a waste of time:

It would be a good place to start with a much tighter set of defaults, with explicit additional permissions where needed.
Many of them are in-tree and without (or with fewer) entanglements with upstream or external changes.
Any issues they have with tight defaults would be a good way to introduce the need for explicit permissions to Nix devs at the time of authorship.

Migration from existing defaults, and test cases, is (as always) going to be challenging, but again helped by keeping the initial problem set smaller.

Solene · January 19, 2022, 9:06pm

Indeed, it would be best to start with the various Nix-related services.

Also, we can distinguish three cases for services:

1. services written in nixpkgs because upstream doesn’t provide any systemd unit
2. services written in nixpkgs INSTEAD of using the upstream unit (there may be various good reasons to do so)
3. services using the upstream unit

Hardening the services would certainly be done differently in case 3 compared to cases 1 and 2.

Does someone know if it’s possible to get a report or a log entry when a service started by a systemd unit tries to violate a restriction? That would help a lot with debugging.

uep · January 20, 2022, 12:00am

I was trying to articulate a 4th (0th) case compared to your list, where basically there’s something that needs to be run as post activation for a package, and that gets deferred to a little “throwaway” systemd job.

Little things that make sure various paths exist with the right permissions (sometimes via systemd tmpfiles, and probably some should be migrated to that method). Other “run once for setup” type examples, but generally: not actually starting a running service in the normal sense.

Infinisil · January 20, 2022, 3:44am

See the discussion in nixos/systemd-sandbox: A generic sandboxing module by dasJ · Pull Request #87661 · NixOS/nixpkgs · GitHub, which attempted to introduce something like this already

gravndal · January 22, 2022, 1:14pm

Speaking of hardening, has there been any work put into making systemd.services.<name>.confinement compatible with ProtectSystem = strict?

julm · May 3, 2022, 11:12pm

There is an open systemd issue about that: ProtectSystem=strict shouldn't take precedence over TemporaryFileSystem=/ · Issue #18999 · systemd/systemd · GitHub

AFAIU, the problem stems from using ProtectSystem= with:

RootDirectory = "/var/empty";
TemporaryFileSystem = "/";

However, thinking about it, in some modules (tor, biboumi, croc, sourcehut, freeciv, transmission, public-inbox) I’ve been using ProtectSystem=/DynamicUser= without problems with this setup:

RuntimeDirectory = ["some-service/root"];
RootDirectory = "/run/some-service/root";
InaccessiblePaths = ["-+/run/some-service/root"];

AFAIU the InaccessiblePaths= is not necessary; it’s just cleaner not to have the root directory mounted twice inside the chroot (at / and /run/some-service/root).

I’ve not given it much thought, but maybe systemd-confinement could use a similar setup using something like:

let
  rootDir = "/run/systemd-confinement/${mkPathSafeName name}";
in {
  RuntimeDirectory = [ (removePrefix "/run/" rootDir) ];
  RootDirectory = rootDir;
  InaccessiblePaths = [ "-+${rootDir}" ];
}

Ping @aszlig

aszlig · May 4, 2022, 12:10am

julm:

I’ve not given it much thought, but maybe systemd-confinement could use a similar setup using something like:

let
  rootDir = "/run/systemd-confinement/${mkPathSafeName name}";
in {
  RuntimeDirectory = [ (removePrefix "/run/" rootDir) ];
  RootDirectory = rootDir;
  InaccessiblePaths = [ "-+${rootDir}" ];
}

Not sure whether I understand this correctly, but wouldn’t this result in a less secure environment than e.g. with confinement.mode = "chroot-only", because you get additional mounts rather than just the store paths you’re referencing?

julm · May 4, 2022, 1:23am

@aszlig, I may be overlooking something and be wrong, but AFAIU this would not change which store paths are available. It only changes where the RootDirectory= is put: instead of using TemporaryFileSystem=["/"], it would use a temporary tmpfs directory in /run/systemd-confinement/. From there, BindPaths=/BindReadOnlyPaths= remain necessary for each path not implicitly handled by systemd.

See for instance what I did in services.sourcehut.
Notice that:

The store path needs to be mounted explicitly with BindReadOnlyPaths=[builtins.storeDir]; here systemd-confinement would be more selective, of course.
Both RootDirectoryStartOnly= and ProtectSystem= are enabled.
The hardening options currently set by systemd-confinement’s "full-apivfs" are enabled.

aszlig · May 4, 2022, 10:03am

Ah, got it I guess. So are you implicitly saying that this would for example work around the DynamicUser issue?

edit: Just tested this and it breaks the chroot-only confinement:

subtest: chroot-only confinement
machine: must succeed: chroot-exec ls -1 / | paste -sd,
machine # [    8.520499] systemd[1]: Created slice Slice /system/test1.
machine # [    8.522179] systemd[1]: Started Confined Test Service 1 (PID 925/UID 0).
machine # [    8.536075] systemd[1]: test1@0-925-0.service: Deactivated successfully.
(finished: must succeed: chroot-exec ls -1 / | paste -sd,, in 0.09 seconds)
Test "chroot-only confinement" failed with error: "bin,dev,etc,nix,proc,root,run,sys,usr,var != bin,nix,run"

Even the full-apivfs variant fails because /etc ends up being bind-mounted into the chroot. It’s been a while but I faintly remember trying something similar already.

julm · May 4, 2022, 5:07pm

@aszlig, indeed, however those directories are either empty (/etc, /root, /usr, /var) or as they should be wrt. the configured hardening (/dev, /proc, /sys, /nix):

$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -- $(readlink /run/current-system/sw/bin/ls) -l /
Running as unit: run-u16023.service
total 0
drwxr-xr-x  20 0 0 4160 May  3 13:24 dev
drwxr-xr-x   2 0 0   40 May  4 16:35 etc
drwxr-xr-x   3 0 0   60 May  4 16:35 nix
dr-xr-xr-x 376 0 0    0 May  4 16:35 proc
drwxr-xr-x   2 0 0   40 May  4 16:35 root
drwxrwxrwt   4 0 0   80 May  4 16:35 run
dr-xr-xr-x  13 0 0    0 Apr 28 17:44 sys
drwxr-xr-x   2 0 0   40 May  4 16:35 usr
drwxr-xr-x   2 0 0   40 May  4 16:35 var

AFAIU that’s because systemd’s setup_namespace() calls base_filesystem_create(), which creates those usual top-level directories.
This would explain why, as you noticed in the original PR:

Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure.

A way to limit access to those mountpoints/directories:

$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -- $(readlink /run/current-system/sw/bin/findmnt)
Running as unit: run-u16084.service
TARGET                                                     SOURCE                                       FSTYPE     OPTIONS
/                                                          tmpfs[/systemd-confinement/test]             tmpfs      rw,nosuid,nodev,size=1988236k,mode=755
|-/dev                                                     devtmpfs                                     devtmpfs   rw,nosuid,size=397648k,nr_inodes=991015,mode=755
| |-/dev/pts                                               devpts                                       devpts     rw,nosuid,noexec,relatime,gid=3,mode=620,ptmxmode=666
| |-/dev/shm                                               tmpfs                                        tmpfs      rw,nosuid,nodev
| |-/dev/hugepages                                         hugetlbfs                                    hugetlbfs  rw,relatime,pagesize=2M
| `-/dev/mqueue                                            mqueue                                       mqueue     rw,nosuid,nodev,noexec,relatime
|-/nix/store                                               losurdo/nix[/store]                          zfs        ro,relatime,xattr,posixacl
|-/proc                                                    proc                                         proc       rw,nosuid,nodev,noexec,relatime
|-/run                                                     tmpfs                                        tmpfs      rw,relatime
| |-/run/systemd/incoming                                  tmpfs[/systemd/propagate/run-u16084.service] tmpfs      ro,nosuid,nodev,size=1988236k,mode=755
| `-/run/systemd-confinement/test                          tmpfs[/systemd-confinement/test]             tmpfs      rw,nosuid,nodev,size=1988236k,mode=755
|   |-/run/systemd-confinement/test/dev                    devtmpfs                                     devtmpfs   rw,nosuid,size=397648k,nr_inodes=991015,mode=755
|   | |-/run/systemd-confinement/test/dev/pts              devpts                                       devpts     rw,nosuid,noexec,relatime,gid=3,mode=620,ptmxmode=666
|   | |-/run/systemd-confinement/test/dev/shm              tmpfs                                        tmpfs      rw,nosuid,nodev
|   | |-/run/systemd-confinement/test/dev/hugepages        hugetlbfs                                    hugetlbfs  rw,relatime,pagesize=2M
|   | `-/run/systemd-confinement/test/dev/mqueue           mqueue                                       mqueue     rw,nosuid,nodev,noexec,relatime
|   |-/run/systemd-confinement/test/nix/store              losurdo/nix[/store]                          zfs        ro,relatime,xattr,posixacl
|   |-/run/systemd-confinement/test/proc                   proc                                         proc       rw,nosuid,nodev,noexec,relatime
|   `-/run/systemd-confinement/test/run                    tmpfs                                        tmpfs      rw,relatime
|     `-/run/systemd-confinement/test/run/systemd/incoming tmpfs[/systemd/propagate/run-u16084.service] tmpfs      rw,nosuid,nodev,size=1988236k,mode=755
`-/sys                                                     sysfs                                        sysfs      rw,nosuid,nodev,noexec,relatime
  |-/sys/kernel/security                                   securityfs                                   securityfs rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/cgroup                                         cgroup2                                      cgroup2    rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot
  |-/sys/firmware/efi/efivars                              efivarfs                                     efivarfs   rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/bpf                                            bpf                                          bpf        rw,nosuid,nodev,noexec,relatime,mode=700
  |-/sys/fs/fuse/connections                               fusectl                                      fusectl    rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/pstore                                         pstore                                       pstore     rw,nosuid,nodev,noexec,relatime
  `-/sys/kernel/config                                     configfs                                     configfs   rw,nosuid,nodev,noexec,relatime

The way to limit access to them is to use InaccessiblePaths=:

$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -pInaccessiblePaths=-+/run/systemd-confinement/test -pInaccessiblePaths=-+/dev -pInaccessiblePaths=-+/sys -- $(readlink /run/current-system/sw/bin/findmnt)
Running as unit: run-u16079.service
TARGET                    SOURCE                                       FSTYPE OPTIONS
/                         tmpfs[/systemd-confinement/test]             tmpfs  rw,nosuid,nodev,size=1988236k,mode=755
|-/dev                    tmpfs[/systemd/inaccessible/dir]             tmpfs  ro,nosuid,nodev,noexec,size=1988236k,mode=755
|-/nix/store              losurdo/nix[/store]                          zfs    ro,relatime,xattr,posixacl
|-/proc                   proc                                         proc   rw,nosuid,nodev,noexec,relatime
|-/run                    tmpfs                                        tmpfs  rw,relatime
| `-/run/systemd/incoming tmpfs[/systemd/propagate/run-u16079.service] tmpfs  ro,nosuid,nodev,size=1988236k,mode=755
`-/sys                    tmpfs[/systemd/inaccessible/dir]             tmpfs  ro,nosuid,nodev,noexec,size=1988236k,mode=755

That’s why I think we should drop TemporaryFileSystem=/ in favor of a RootDirectory= inside a RuntimeDirectory=. What do you think?

Atemu · May 4, 2022, 9:07pm

How about we abstract these types of services on top of systemd.services.<name>?

E.g.:

systemd.hardware-services.<name>
systemd.network-services.<name>
systemd.privileged-services.<name>
systemd.unprivileged-services.<name>
…

The interface would be the same as systemd.services.<name> but with different defaults befitting their general purpose/class of service.

I’d actually love to have it implemented in a whitelist manner though, with systemd.unprivileged-services.<name> as the base and the purpose-specific services whitelisting the privileges that most services of that class (e.g. networking services) need.
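As a rough sketch of how such a namespace could be layered on top of systemd.services, the module below declares a hypothetical systemd.unprivileged-services option and forwards each entry to systemd.services with a restrictive baseline applied at mkDefault priority. The option name follows the proposal above, but nothing like it exists in nixpkgs, and the baseline settings are an arbitrary illustrative selection, not a vetted set. Using a loose attrsOf anything type is a shortcut to keep the sketch short: the real type checking only happens once the definition reaches systemd.services.

```nix
# Hypothetical module: `systemd.unprivileged-services` is not a real option.
{ config, lib, ... }:
let
  # Illustrative baseline only; a real one would need careful review.
  baseline = {
    NoNewPrivileges = true;
    PrivateNetwork = true;
    PrivateTmp = true;
    ProtectKernelTunables = true;
    ProtectSystem = "strict";
  };
in
{
  options.systemd.unprivileged-services = lib.mkOption {
    # Loosely typed pass-through; entries are validated by systemd.services.
    type = lib.types.attrsOf (lib.types.attrsOf lib.types.anything);
    default = { };
    description = "Services that start from a restrictive hardening baseline.";
  };

  # Forward every entry to systemd.services, adding the baseline at
  # mkDefault priority so the service's own serviceConfig can relax it.
  config.systemd.services = lib.mapAttrs
    (_name: service: lib.mkMerge [
      { serviceConfig = lib.mapAttrs (_: lib.mkDefault) baseline; }
      service
    ])
    config.systemd.unprivileged-services;
}
```

A purpose-specific class such as a network-services variant could then reuse the same pattern with a baseline that leaves PrivateNetwork unset, which is roughly the whitelisting of privileges described above.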

TLATER · May 6, 2022, 6:45pm

That’s a cool idea! Makes it very obvious if a service has been hardened yet, so you don’t need to go trawling through nixpkgs to check