FYI, two years ago we started doing this on mail services, but the Postfix sandboxing work is still open.
Couldn’t hardening be introduced as an option on `systemd.services.*` with some reasonable defaults?
I was initially opposed to the idea of trying to generalize hardening: without deep knowledge of the upstream software’s internals, it’s easy to break a service in subtle ways. As a matter of fact, I had to fix 3 services after the introduction of overzealous hardening rules in 2021. Properly hardening a service is definitely not trivial, especially for services with many options and/or poor VM test coverage.
I ended up changing my mind though: the current “permissive” approach wrt. service hardening is creating a mess. More and more critical services are getting hardened by different people, using different approaches to how strict the hardening is meant to be. It seems like we need some norms about what to harden first and how.
@andir brainstormed around the idea of hardening profiles last year: nixos/systemd: introduce hardening profiles for services · andir/nixpkgs@4d9c0cf · GitHub. @andir, what would be the next step to push this idea forward?
It’s going to be a lot of work, not least because a lot of these services come from upstream without this hardening (and so might break on upgrades), but this would be a really, really welcome addition to NixOS - yet another good reason to run that service on NixOS specifically.
As one place to start, there are a lot of services that NixOS itself makes, or that packages create as a way of running something (e.g. at boot or post-install or on upgrade). They’re mostly small, and mostly uninteresting from an overall attack-surface complexity viewpoint, but this doesn’t mean they’re a waste of time:
- It would be a good place to start with a much tighter set of defaults, with explicit additional permissions where needed.
- Many of them are in-tree and without (or with fewer) entanglements with upstream or external changes.
- Any issues they have with tight defaults would be a good way to introduce the need for explicit permissions to Nix devs at the time of authorship.
Migration from existing defaults, and test cases, is (as always) going to be challenging, but again helped by keeping the initial problem set smaller.
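As a sketch of what a “tight defaults, explicit additional permissions” baseline could look like for such a small one-shot unit (the service name and path here are made up; all directives are standard systemd ones):

```nix
# Hypothetical example: a small in-tree setup task with deny-by-default
# hardening, loosened only where the task actually needs access.
{
  systemd.services.my-setup-task.serviceConfig = {
    Type = "oneshot";
    DynamicUser = true;
    NoNewPrivileges = true;
    ProtectSystem = "strict";
    ProtectHome = true;
    PrivateTmp = true;
    PrivateDevices = true;
    PrivateNetwork = true;
    CapabilityBoundingSet = [ "" ];
    # Explicit additional permission: the one state directory it manages.
    ReadWritePaths = [ "/var/lib/my-setup-task" ];
  };
}
```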
Indeed, it would be best to start with the various Nix-related services.
Also, we can distinguish three cases for services:
- services written in nixpkgs because upstream doesn’t provide any systemd unit
- services written in nixpkgs INSTEAD of using upstream unit (there may be various good reasons to do so)
- services using the upstream unit
Hardening the services would certainly be done differently in case 3 compared to cases 1 and 2.
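For case 3, NixOS can keep the upstream unit and layer hardening on top; roughly like this (the package and service names are hypothetical):

```nix
{ pkgs, ... }:
{
  # Ship the unmodified upstream unit file from the package...
  systemd.packages = [ pkgs.some-daemon ];
  # ...and add hardening settings on top; NixOS emits these as an
  # override for the upstream unit instead of replacing it wholesale.
  systemd.services.some-daemon.serviceConfig = {
    NoNewPrivileges = true;
    ProtectHome = true;
  };
}
```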
Does someone know if it’s possible to get a report or a log entry when a service started by a systemd unit tries to violate a restriction? That would help a lot with debugging.
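Not a full answer, but one building block: systemd’s `SystemCallLog=` (available in newer systemd versions) logs offending syscalls to the audit log/journal, which helps when a `SystemCallFilter=` is the restriction being hit. A sketch (the service name is made up):

```nix
# Sketch: log syscalls outside the allowed set so seccomp denials
# become visible in the journal/audit log instead of failing silently.
{
  systemd.services.some-service.serviceConfig = {
    SystemCallFilter = [ "@system-service" ];
    # "~" inverts the list: log every syscall NOT in @system-service.
    SystemCallLog = [ "~@system-service" ];
  };
}
```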
I was trying to articulate a 4th (or rather 0th) case compared to your list: basically, something that needs to run as post-activation for a package and gets deferred to a little “throwaway” systemd job.
Little things like making sure various paths exist with the right permissions (sometimes via systemd-tmpfiles, and probably some should be migrated to that method), and other “run once for setup” examples; but generally: not actually starting a long-running service in the normal sense.
See the discussion in nixos/systemd-sandbox: A generic sandboxing module by dasJ · Pull Request #87661 · NixOS/nixpkgs · GitHub, which attempted to introduce something like this already
Speaking of hardening, has there been any work put into making `systemd.services.<name>.confinement` compatible with `ProtectSystem = "strict"`?
There is an open systemd issue about that: ProtectSystem=strict shouldn't take precedence over TemporaryFileSystem=/ · Issue #18999 · systemd/systemd · GitHub
AFAIU, the problem stems from using `ProtectSystem=` with:

```nix
RootDirectory = "/var/empty";
TemporaryFileSystem = "/";
```
However, thinking about it: in some modules (`tor`, `biboumi`, `croc`, `sourcehut`, `freeciv`, `transmission`, `public-inbox`) I’ve been using `ProtectSystem=`/`DynamicUser=` without problems with this setup:

```nix
RuntimeDirectory = [ "some-service/root" ];
RootDirectory = "/run/some-service/root";
InaccessiblePaths = [ "-+/run/some-service/root" ];
```

AFAIU the `InaccessiblePaths=` is not necessary; it’s just cleaner to not have the root directory mounted twice inside the chroot (at `/` and `/run/some-service/root`).
I’ve not given it much thought, but maybe `systemd-confinement` could use a similar setup, using something like:

```nix
let rootDir = "/run/systemd-confinement/${mkPathSafeName name}"; in {
  RuntimeDirectory = [ (removePrefix "/run/" rootDir) ];
  RootDirectory = rootDir;
  InaccessiblePaths = [ "-+${rootDir}" ];
}
```

Ping @aszlig
Not sure whether I understand this correctly, but wouldn’t this result in a less secure environment than e.g. with `confinement.mode = "chroot-only"`, because you get additional mounts rather than just the store path you’re referencing?
@aszlig, I may be overlooking something and be wrong, but AFAIU this would not change which store paths are available or not; it’s just changing where to put the `RootDirectory=`: instead of using `TemporaryFileSystem=["/"]`, it would use a temporary tmpfs directory in `/run/systemd-confinement/`. From there, `BindPaths=`/`BindReadOnlyPaths=` remain necessary for each path not implicitly handled by systemd.
See for instance what I did in `services.sourcehut`.
Notice that:
- The store path needs to be mounted explicitly with `BindReadOnlyPaths = [ builtins.storeDir ];` here; `systemd-confinement` would be more selective of course.
- Both `RootDirectoryStartOnly=` and `ProtectSystem=` are enabled.
- The hardening options currently set by `systemd-confinement`’s `"full-apivfs"` mode are enabled.
Ah, got it I guess. So are you implicitly saying that this would, for example, work around the `DynamicUser=` issue?

edit: Just tested this and it breaks the `chroot-only` confinement:
```
subtest: chroot-only confinement
machine: must succeed: chroot-exec ls -1 / | paste -sd,
machine # [    8.520499] systemd[1]: Created slice Slice /system/test1.
machine # [    8.522179] systemd[1]: Started Confined Test Service 1 (PID 925/UID 0).
machine # [    8.536075] systemd[1]: test1@0-925-0.service: Deactivated successfully.
(finished: must succeed: chroot-exec ls -1 / | paste -sd,, in 0.09 seconds)
Test "chroot-only confinement" failed with error: "bin,dev,etc,nix,proc,root,run,sys,usr,var != bin,nix,run"
```
Even the `full-apivfs` variant fails because `/etc` ends up being bind-mounted into the chroot. It’s been a while, but I faintly remember trying something similar already.

@aszlig, indeed; however those directories are either empty (`/etc`, `/root`, `/usr`, `/var`) or as they should be wrt. the configured hardening (`/dev`, `/proc`, `/sys`, `/nix`):
```
$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -- $(readlink /run/current-system/sw/bin/ls) -l /
Running as unit: run-u16023.service
total 0
drwxr-xr-x  20 0 0 4160 May  3 13:24 dev
drwxr-xr-x   2 0 0   40 May  4 16:35 etc
drwxr-xr-x   3 0 0   60 May  4 16:35 nix
dr-xr-xr-x 376 0 0    0 May  4 16:35 proc
drwxr-xr-x   2 0 0   40 May  4 16:35 root
drwxrwxrwt   4 0 0   80 May  4 16:35 run
dr-xr-xr-x  13 0 0    0 Apr 28 17:44 sys
drwxr-xr-x   2 0 0   40 May  4 16:35 usr
drwxr-xr-x   2 0 0   40 May  4 16:35 var
```
AFAIU that’s because systemd’s `setup_namespace()` calls `base_filesystem_create()`, which creates those usual top-level directories. This would explain why, as you noticed in the original PR:

> Another quirk we do have right now is that systemd tries to create a `/usr` directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure.
Here is what those mountpoints/directories look like from the host side:
```
$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -- $(readlink /run/current-system/sw/bin/findmnt)
Running as unit: run-u16084.service
TARGET                                            SOURCE                                        FSTYPE     OPTIONS
/                                                 tmpfs[/systemd-confinement/test]              tmpfs      rw,nosuid,nodev,size=1988236k,mode=755
|-/dev                                            devtmpfs                                      devtmpfs   rw,nosuid,size=397648k,nr_inodes=991015,mode=755
| |-/dev/pts                                      devpts                                        devpts     rw,nosuid,noexec,relatime,gid=3,mode=620,ptmxmode=666
| |-/dev/shm                                      tmpfs                                         tmpfs      rw,nosuid,nodev
| |-/dev/hugepages                                hugetlbfs                                     hugetlbfs  rw,relatime,pagesize=2M
| `-/dev/mqueue                                   mqueue                                        mqueue     rw,nosuid,nodev,noexec,relatime
|-/nix/store                                      losurdo/nix[/store]                           zfs        ro,relatime,xattr,posixacl
|-/proc                                           proc                                          proc       rw,nosuid,nodev,noexec,relatime
|-/run                                            tmpfs                                         tmpfs      rw,relatime
| |-/run/systemd/incoming                         tmpfs[/systemd/propagate/run-u16084.service]  tmpfs      ro,nosuid,nodev,size=1988236k,mode=755
| `-/run/systemd-confinement/test                 tmpfs[/systemd-confinement/test]              tmpfs      rw,nosuid,nodev,size=1988236k,mode=755
|   |-/run/systemd-confinement/test/dev           devtmpfs                                      devtmpfs   rw,nosuid,size=397648k,nr_inodes=991015,mode=755
|   | |-/run/systemd-confinement/test/dev/pts     devpts                                        devpts     rw,nosuid,noexec,relatime,gid=3,mode=620,ptmxmode=666
|   | |-/run/systemd-confinement/test/dev/shm     tmpfs                                         tmpfs      rw,nosuid,nodev
|   | |-/run/systemd-confinement/test/dev/hugepages hugetlbfs                                   hugetlbfs  rw,relatime,pagesize=2M
|   | `-/run/systemd-confinement/test/dev/mqueue  mqueue                                        mqueue     rw,nosuid,nodev,noexec,relatime
|   |-/run/systemd-confinement/test/nix/store     losurdo/nix[/store]                           zfs        ro,relatime,xattr,posixacl
|   |-/run/systemd-confinement/test/proc          proc                                          proc       rw,nosuid,nodev,noexec,relatime
|   `-/run/systemd-confinement/test/run           tmpfs                                         tmpfs      rw,relatime
|     `-/run/systemd-confinement/test/run/systemd/incoming tmpfs[/systemd/propagate/run-u16084.service] tmpfs rw,nosuid,nodev,size=1988236k,mode=755
`-/sys                                            sysfs                                         sysfs      rw,nosuid,nodev,noexec,relatime
  |-/sys/kernel/security                          securityfs                                    securityfs rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/cgroup                                cgroup2                                       cgroup2    rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot
  |-/sys/firmware/efi/efivars                     efivarfs                                      efivarfs   rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/bpf                                   bpf                                           bpf        rw,nosuid,nodev,noexec,relatime,mode=700
  |-/sys/fs/fuse/connections                      fusectl                                       fusectl    rw,nosuid,nodev,noexec,relatime
  |-/sys/fs/pstore                                pstore                                        pstore     rw,nosuid,nodev,noexec,relatime
  `-/sys/kernel/config                            configfs                                      configfs   rw,nosuid,nodev,noexec,relatime
```
A way to limit access to those mountpoints/directories is to use `InaccessiblePaths=`:
```
$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -pInaccessiblePaths=-+/run/systemd-confinement/test -pInaccessiblePaths=-+/dev -pInaccessiblePaths=-+/sys -- $(readlink /run/current-system/sw/bin/findmnt)
Running as unit: run-u16079.service
TARGET                  SOURCE                                        FSTYPE OPTIONS
/                       tmpfs[/systemd-confinement/test]              tmpfs  rw,nosuid,nodev,size=1988236k,mode=755
|-/dev                  tmpfs[/systemd/inaccessible/dir]              tmpfs  ro,nosuid,nodev,noexec,size=1988236k,mode=755
|-/nix/store            losurdo/nix[/store]                           zfs    ro,relatime,xattr,posixacl
|-/proc                 proc                                          proc   rw,nosuid,nodev,noexec,relatime
|-/run                  tmpfs                                         tmpfs  rw,relatime
| `-/run/systemd/incoming tmpfs[/systemd/propagate/run-u16079.service] tmpfs ro,nosuid,nodev,size=1988236k,mode=755
`-/sys                  tmpfs[/systemd/inaccessible/dir]              tmpfs  ro,nosuid,nodev,noexec,size=1988236k,mode=755
```
That’s why I think we should drop `TemporaryFileSystem=/` in favor of a `RootDirectory=` inside a `RuntimeDirectory=`. What do you think?
How about we abstract these types of services on top of `systemd.services.<name>`? E.g.:

- `systemd.hardware-services.<name>`
- `systemd.network-services.<name>`
- `systemd.privileged-services.<name>`
- `systemd.unprivileged-services.<name>`
- …

The interface would be the same as `systemd.services.<name>`, but with different defaults befitting their general purpose/class of service.

I’d actually love to have it implemented in a whitelist manner though, with `systemd.unprivileged-services.<name>` as the base and the purpose-specific classes whitelisting the privileges that most services of their kind (e.g. networking services) need.
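A sketch of how such a class could be implemented on top of `systemd.services.<name>` (the `systemd.network-services` option is the proposal’s hypothetical name, not something that exists):

```nix
{ config, lib, ... }:
{
  # Copy every service declared under the (hypothetical)
  # `systemd.network-services` option into `systemd.services`,
  # starting from an unprivileged baseline and whitelisting only
  # what networking services typically need.
  config.systemd.services = lib.mapAttrs (name: svc:
    svc // {
      serviceConfig = {
        DynamicUser = true;
        NoNewPrivileges = true;
        ProtectSystem = "strict";
        ProtectHome = true;
        PrivateDevices = true;
        # The networking whitelist on top of the baseline:
        RestrictAddressFamilies = [ "AF_UNIX" "AF_INET" "AF_INET6" ];
      } // (svc.serviceConfig or { });  # service definition wins
    }
  ) config.systemd.network-services;
}
```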
That’s a cool idea! It makes it very obvious whether a service has been hardened yet, so you don’t need to go trawling through nixpkgs to check.
I think it’s too vague: a network-service class could include programs such as

- unbound
- nsd
- samba

and you can’t really apply the same hardening to all of them: unbound could work without any persistent data, nsd could use only a user-defined configuration, while samba would need to access the filesystem. I don’t think classes of this kind could be used.
Using `systemd-analyze security` is the way to see how a service is hardened, so you don’t have to dig into nixpkgs.
Emphasis on default: Nix already allows overriding these things really easily for the specific service you need. This proposal just reduces boilerplate, makes it easier for contributors with less experience to know what they should be setting, and advertises that these things are possible to those who don’t know yet.
Your mixin-like approach is similar, and probably matches the underlying setting types better, but this one fits better within the existing nixpkgs module system. I think both are good suggestions.
`systemd-analyze security` only works if I already have the service installed, which, if I’m considering whether I should be using one, I probably don’t; and installing it just for the sake of checking is about as much effort as looking at its module (perhaps less, considering download times vs. my existing local copy of nixpkgs).

But yes, `systemd-analyze security` is very handy for checking the state of my running system.
Indeed, I didn’t think about the case in which you would want to know about the service without installing it.
There are a bunch of good practices and code examples about how to secure systemd services in the nix-bitcoin project:
Nginx has a bunch of `recommended*` options around security. These options set the defaults for other options. Maybe such a scheme is applicable to systemd services as well? E.g. `systemd.services.<name>.recommendedHardening`, or splitting it up into multiple recommendations. Because this only sets defaults of other options, you can still easily override them for a specific service.
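A sketch of what `recommendedHardening` could expand to (the option itself is hypothetical; the point is that everything goes through `lib.mkDefault`, so any single directive stays overridable per service):

```nix
{ lib, ... }:
{
  # What `recommendedHardening = true;` might expand to for a service:
  systemd.services.some-service.serviceConfig = {
    NoNewPrivileges = lib.mkDefault true;
    ProtectSystem = lib.mkDefault "strict";
    ProtectHome = lib.mkDefault true;
    PrivateTmp = lib.mkDefault true;
    RestrictSUIDSGID = lib.mkDefault true;
    SystemCallFilter = lib.mkDefault [ "@system-service" ];
  };
}
```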