It would be cool to have a whitelist approach instead, e.g. use PrivateNetwork=true by default on all units, and turn it to false when needed.
A whitelist would indeed be good, but it should be opt-in at first to allow migration. I think there is a huge list of knobs that should be set to true (meaning disabled, i.e. more secure) for most services:
- accessing other users' /home
- altering kernel tunables
- changing the system clock
- mounting filesystems
etc…
You can use /run/current-system/sw/bin/systemd-analyze security [service] to check the list of available systemd tunables and what they do. The whitelist approach would be good, but I suppose defining some profiles on top of it would be even better:
- services running as root / non root
- services using network
- services affecting hardware
You don’t really want the same default whitelist for a Matrix server as for fwupd, which updates your hardware firmware.
Whelp, doing this in nixpkgs proper makes a lot more sense than my idea of hardening all the services I use downstream
While I agree it’s the best way to get consistent use of this upstream, I have one concern with a global whitelist - it actively goes against the systemd defaults, and therefore might be very surprising to a new user.
In fact, I’m fairly sure the concept of cgroups in systemd hasn’t even penetrated the public concept of what systemd does (and will probably cause another wave of rumbling and reminiscing of the “good ol’” SysV/cron days). It’ll take careful advertising and documentation to ensure that this doesn’t turn into a reason people feel NixOS is too complicated.
I’d suggest, for example, keeping this opt-in for user-defined services even if it becomes opt-out in the future, as long as systemd keeps it opt-in.
Couldn’t hardening be introduced as an option on systemd.services.* with some reasonable defaults? I occasionally want to harden the services I write, but often don’t see them as critical enough to spend the time reading up on systemd settings. Having an option harden = true which applies some sensible defaults would really help in this case.
This would be fine, but the hardening strategy will differ depending on the service itself. That’s why I think using a set of “profiles” that define some knobs would be useful.
As someone maintaining a systemd unit, you would have to ask yourself some questions:
- does it run as root? No: use profile run-as-user, which will turn on NoNewPrivileges / ProtectKernelLogs / ProtectKernelModules / ProtectKernelTunables / ProtectProc / RestrictSUIDSGID …
- does it do network? No: use profile no-network, which will turn on PrivateNetwork
- does it need to read all FS files? No: then enable PrivateUsers / PrivateTmp / ProtectSystem / ProtectProc …
Then, define harden = [ run-as-user no-network no-fs ];
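As a rough sketch of what such profiles could look like (everything here is hypothetical: the profile names, the harden helper, and the my-daemon service are made up; only the serviceConfig option names are real systemd settings), they could simply be attribute sets of serviceConfig defaults that get merged:

```nix
# Hypothetical sketch: hardening "profiles" as plain attribute sets of
# serviceConfig defaults. None of these profile names exist in nixpkgs.
{ lib, pkgs, ... }:
let
  profiles = {
    run-as-user = {
      NoNewPrivileges = true;
      ProtectKernelLogs = true;
      ProtectKernelModules = true;
      ProtectKernelTunables = true;
      RestrictSUIDSGID = true;
    };
    no-network.PrivateNetwork = true;
    no-fs = {
      PrivateTmp = true;
      ProtectSystem = "strict";
      ProtectHome = true;
    };
  };
  # Merge the selected profiles into one serviceConfig fragment;
  # later profiles win on conflicting keys.
  harden = names: lib.foldl' (acc: n: acc // profiles.${n}) { } names;
in {
  # "my-daemon" is a made-up example service.
  systemd.services.my-daemon.serviceConfig =
    harden [ "run-as-user" "no-network" "no-fs" ]
    // { ExecStart = "${pkgs.hello}/bin/hello"; };
}
```

A real implementation would probably want proper module-system merging (mkDefault on each knob, so services can still override individual settings) rather than a plain attrset union.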
FYI, 2 years ago we started to do this on mail services, but postfix sandboxing is still open.
Couldn’t hardening be introduced as an option on systemd.services.* with some reasonable defaults?
I was initially opposed to the idea of trying to generalize hardening: without a deep knowledge of the upstream software internals, it’s easy to break the service in subtle ways. As a matter of fact, I had to fix 3 services after the introduction of overzealous hardening rules in 2021. Properly hardening a service is definitely not trivial, especially for services having many options and/or poor VM test coverage.
I ended up changing my mind, though: the current “permissive” approach wrt. service hardening is creating a mess. More and more critical services are getting hardened by different people, with different ideas of how strict the hardening is meant to be. It seems we need some norms about what to harden in priority and how.
@andir brainstormed around the idea of hardening profiles last year: nixos/systemd: introduce hardening profiles for services · andir/nixpkgs@4d9c0cf · GitHub. @andir, what would be the next step to push this idea forward?
It’s going to be a lot of work, not least because a lot of these services come from upstream without this hardening (and so might break on upgrades), but this would be a really, really welcome addition to NixOS - yet another good reason to run that service on NixOS specifically.
As one place to start, there are a lot of services that NixOS itself makes, or that packages create as a way of running something (e.g. at boot or post-install or on upgrade). They’re mostly small, and mostly uninteresting from an overall attack-surface complexity viewpoint, but this doesn’t mean they’re a waste of time:
- it would be a good place to start with a much tighter set of defaults, with explicit additional permissions where needed
- many of them are in-tree and without (or with fewer) entanglements with upstream or external changes
- any issues they have with tight defaults would be a good way to introduce the need for explicit permissions to Nix devs at the time of authorship
Migration from existing defaults, and test coverage, are (as always) going to be challenging, but again helped by keeping the initial problem set smaller.
Indeed, it would be best to start with the various Nix-related services.
Also, we can distinguish three cases for services:
- services written in nixpkgs because upstream doesn’t provide any systemd unit
- services written in nixpkgs INSTEAD of using upstream unit (there may be various good reasons to do so)
- services using the upstream unit
Hardening the services would certainly be done differently in case 3 compared to cases 1 and 2.
Does someone know if it’s possible to get a report or a log when a service started by a systemd unit tries to go against a restriction? That would help a lot with debugging.
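One partial mechanism I’m aware of (a sketch, and only covering syscall filters; it assumes systemd >= 247, and "my-daemon" is a placeholder service name) is SystemCallLog=, which logs the listed system calls via the kernel audit subsystem, so attempts show up without having to guess which SystemCallFilter= rule fired:

```nix
# Sketch (assumes systemd >= 247): SystemCallLog= logs attempts to use
# the listed system calls, which helps debug SystemCallFilter= denials.
# "my-daemon" is a placeholder service name.
{
  systemd.services.my-daemon.serviceConfig = {
    SystemCallFilter = [ "@system-service" ];
    SystemCallLog = [ "@privileged" "@mount" ];
  };
}
```

For most of the other sandboxing options there is, as far as I know, no dedicated report: denials just surface as EPERM/EACCES or missing paths in the service’s own logs.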
I was trying to articulate a 4th (0th) case compared to your list, where basically there’s something that needs to be run as post activation for a package, and that gets deferred to a little “throwaway” systemd job.
Little things like making sure various paths exist with the right permissions (sometimes via systemd tmpfiles, and probably some should be migrated to that method). Other “run once for setup” type examples; generally, not actually starting a running service in the normal sense.
See the discussion in nixos/systemd-sandbox: A generic sandboxing module by dasJ · Pull Request #87661 · NixOS/nixpkgs · GitHub, which attempted to introduce something like this already.
Speaking of hardening, has there been any work put into making systemd.services.<name>.confinement compatible with ProtectSystem = strict?
There is an open systemd issue about that: ProtectSystem=strict shouldn't take precedence over TemporaryFileSystem=/ · Issue #18999 · systemd/systemd · GitHub
AFAIU, the problem stems from using ProtectSystem= with:
RootDirectory = "/var/empty";
TemporaryFileSystem=/
However, thinking about it: in some modules (tor, biboumi, croc, sourcehut, freeciv, transmission, public-inbox) I’ve been using ProtectSystem=/DynamicUser= without problem with this setup:
RuntimeDirectory = ["some-service/root"];
RootDirectory = "/run/some-service/root";
InaccessiblePaths = ["-+/run/some-service/root"];
AFAIU the InaccessiblePaths= is not necessary; it’s just cleaner not to have the root directory mounted twice inside the chroot (at / and /run/some-service/root).
I’ve not given it much thought, but maybe systemd-confinement could use a similar setup, using something like:
let rootDir = "/run/systemd-confinement/${mkPathSafeName name}"; in {
RuntimeDirectory = [(removePrefix "/run/" rootDir)];
RootDirectory = rootDir;
InaccessiblePaths = ["-+${rootDir}"];
}
Ping @aszlig
Not sure whether I understand this correctly, but wouldn’t this result in a less secure environment than e.g. with confinement.mode = "chroot-only", because you get additional mounts rather than just the store path you’re referencing?
@aszlig, I may be overlooking something and be wrong, but AFAIU this would not change which store paths are available or not: it’s just changing where to put the RootDirectory=. Instead of using TemporaryFileSystem=["/"], it would use a temporary tmpfs directory in /run/systemd-confinement/. From there, BindPaths=/BindReadOnlyPaths= remain necessary for each path not implicitly handled by systemd.
See for instance what I did in services.sourcehut.
Notice that:
- The store path needs to be mounted explicitly with BindReadOnlyPaths=[builtins.storeDir]; here systemd-confinement would be more selective of course.
- Both RootDirectoryStartOnly= and ProtectSystem= are enabled.
- The hardening options currently set by systemd-confinement’s "full-apivfs" are enabled.
Ah, got it I guess. So are you implicitly saying that this would, for example, work around the DynamicUser issue?
edit: Just tested this and it breaks the chroot-only confinement:
subtest: chroot-only confinement
machine: must succeed: chroot-exec ls -1 / | paste -sd,
machine # [ 8.520499] systemd[1]: Created slice Slice /system/test1.
machine # [ 8.522179] systemd[1]: Started Confined Test Service 1 (PID 925/UID 0).
machine # [ 8.536075] systemd[1]: test1@0-925-0.service: Deactivated successfully.
(finished: must succeed: chroot-exec ls -1 / | paste -sd,, in 0.09 seconds)
Test "chroot-only confinement" failed with error: "bin,dev,etc,nix,proc,root,run,sys,usr,var != bin,nix,run"
Even the full-apivfs variant fails because /etc ends up being bind-mounted into the chroot. It’s been a while, but I faintly remember trying something similar already.
@aszlig, indeed; however, those directories are either empty (/etc, /root, /usr, /var) or as they should be wrt. the configured hardening (/dev, /proc, /sys, /nix):
$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -- $(readlink /run/current-system/sw/bin/ls) -l /
Running as unit: run-u16023.service
total 0
drwxr-xr-x 20 0 0 4160 May 3 13:24 dev
drwxr-xr-x 2 0 0 40 May 4 16:35 etc
drwxr-xr-x 3 0 0 60 May 4 16:35 nix
dr-xr-xr-x 376 0 0 0 May 4 16:35 proc
drwxr-xr-x 2 0 0 40 May 4 16:35 root
drwxrwxrwt 4 0 0 80 May 4 16:35 run
dr-xr-xr-x 13 0 0 0 Apr 28 17:44 sys
drwxr-xr-x 2 0 0 40 May 4 16:35 usr
drwxr-xr-x 2 0 0 40 May 4 16:35 var
AFAIU that’s because systemd’s setup_namespace() calls base_filesystem_create(), which creates those usual top-level directories.
This would explain why, as you noticed in the original PR:
Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure.
A way to limit access to those mountpoints/directories:
$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -- $(readlink /run/current-system/sw/bin/findmnt)
Running as unit: run-u16084.service
TARGET SOURCE FSTYPE OPTIONS
/ tmpfs[/systemd-confinement/test] tmpfs rw,nosuid,nodev,size=1988236k,mode=755
|-/dev devtmpfs devtmpfs rw,nosuid,size=397648k,nr_inodes=991015,mode=755
| |-/dev/pts devpts devpts rw,nosuid,noexec,relatime,gid=3,mode=620,ptmxmode=666
| |-/dev/shm tmpfs tmpfs rw,nosuid,nodev
| |-/dev/hugepages hugetlbfs hugetlbfs rw,relatime,pagesize=2M
| `-/dev/mqueue mqueue mqueue rw,nosuid,nodev,noexec,relatime
|-/nix/store losurdo/nix[/store] zfs ro,relatime,xattr,posixacl
|-/proc proc proc rw,nosuid,nodev,noexec,relatime
|-/run tmpfs tmpfs rw,relatime
| |-/run/systemd/incoming tmpfs[/systemd/propagate/run-u16084.service] tmpfs ro,nosuid,nodev,size=1988236k,mode=755
| `-/run/systemd-confinement/test tmpfs[/systemd-confinement/test] tmpfs rw,nosuid,nodev,size=1988236k,mode=755
| |-/run/systemd-confinement/test/dev devtmpfs devtmpfs rw,nosuid,size=397648k,nr_inodes=991015,mode=755
| | |-/run/systemd-confinement/test/dev/pts devpts devpts rw,nosuid,noexec,relatime,gid=3,mode=620,ptmxmode=666
| | |-/run/systemd-confinement/test/dev/shm tmpfs tmpfs rw,nosuid,nodev
| | |-/run/systemd-confinement/test/dev/hugepages hugetlbfs hugetlbfs rw,relatime,pagesize=2M
| | `-/run/systemd-confinement/test/dev/mqueue mqueue mqueue rw,nosuid,nodev,noexec,relatime
| |-/run/systemd-confinement/test/nix/store losurdo/nix[/store] zfs ro,relatime,xattr,posixacl
| |-/run/systemd-confinement/test/proc proc proc rw,nosuid,nodev,noexec,relatime
| `-/run/systemd-confinement/test/run tmpfs tmpfs rw,relatime
| `-/run/systemd-confinement/test/run/systemd/incoming tmpfs[/systemd/propagate/run-u16084.service] tmpfs rw,nosuid,nodev,size=1988236k,mode=755
`-/sys sysfs sysfs rw,nosuid,nodev,noexec,relatime
|-/sys/kernel/security securityfs securityfs rw,nosuid,nodev,noexec,relatime
|-/sys/fs/cgroup cgroup2 cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot
|-/sys/firmware/efi/efivars efivarfs efivarfs rw,nosuid,nodev,noexec,relatime
|-/sys/fs/bpf bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700
|-/sys/fs/fuse/connections fusectl fusectl rw,nosuid,nodev,noexec,relatime
|-/sys/fs/pstore pstore pstore rw,nosuid,nodev,noexec,relatime
`-/sys/kernel/config configfs configfs rw,nosuid,nodev,noexec,relatime
is to use InaccessiblePaths=:
$ systemd-run -P -pPrivateMounts=1 -pRuntimeDirectory=systemd-confinement/test -pRootDirectory=/run/systemd-confinement/test -pUMask=066 -pBindReadOnlyPaths=/nix/store -pInaccessiblePaths=-+/run/systemd-confinement/test -pInaccessiblePaths=-+/dev -pInaccessiblePaths=-+/sys -- $(readlink /run/current-system/sw/bin/findmnt)
Running as unit: run-u16079.service
TARGET SOURCE FSTYPE OPTIONS
/ tmpfs[/systemd-confinement/test] tmpfs rw,nosuid,nodev,size=1988236k,mode=755
|-/dev tmpfs[/systemd/inaccessible/dir] tmpfs ro,nosuid,nodev,noexec,size=1988236k,mode=755
|-/nix/store losurdo/nix[/store] zfs ro,relatime,xattr,posixacl
|-/proc proc proc rw,nosuid,nodev,noexec,relatime
|-/run tmpfs tmpfs rw,relatime
| `-/run/systemd/incoming tmpfs[/systemd/propagate/run-u16079.service] tmpfs ro,nosuid,nodev,size=1988236k,mode=755
`-/sys tmpfs[/systemd/inaccessible/dir] tmpfs ro,nosuid,nodev,noexec,size=1988236k,mode=755
That’s why I think we should drop TemporaryFileSystem=/ in favor of a RootDirectory= inside a RuntimeDirectory=. What do you think?
How about we abstract these types of services on top of systemd.services.<name>? E.g.:
- systemd.hardware-services.<name>
- systemd.network-services.<name>
- systemd.privileged-services.<name>
- systemd.unprivileged-services.<name>
- …
The interface would be the same as systemd.services.<name>, but with different defaults befitting their general purpose/class of service.
I’d actually love to have it implemented in a whitelist manner, though, with systemd.unprivileged-services.<name> as the base and the purpose-specific classes whitelisting the privileges that most services of their kind (e.g. networking services) need.
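A rough sketch of how one such class could be lowered onto ordinary systemd.services (everything here is hypothetical: the systemd.network-services option, its type, and the chosen defaults are made up for illustration; only the serviceConfig keys are real systemd settings):

```nix
# Hypothetical sketch: a "network-services" class lowered onto
# systemd.services with hardened-but-networked defaults.
# Nothing like this exists in nixpkgs; the option name is made up.
{ config, lib, ... }: {
  options.systemd.network-services = lib.mkOption {
    type = lib.types.attrsOf (lib.types.attrsOf lib.types.anything);
    default = { };
    description = "Services hardened by default, but allowed network access.";
  };

  config.systemd.services = lib.mapAttrs (name: svc:
    # Per-service definitions override the class defaults.
    lib.recursiveUpdate {
      serviceConfig = {
        DynamicUser = true;
        NoNewPrivileges = true;
        ProtectSystem = "strict";
        ProtectKernelTunables = true;
        # PrivateNetwork deliberately left off for this class, but the
        # reachable address families are still restricted:
        RestrictAddressFamilies = [ "AF_INET" "AF_INET6" "AF_UNIX" ];
      };
    } svc
  ) config.systemd.network-services;
}
```

A real module would likely use a proper submodule type and mkDefault per knob instead of recursiveUpdate, so the usual module-system merging and overriding still works.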
That’s a cool idea! It makes it very obvious whether a service has been hardened yet, so you don’t need to go trawling through nixpkgs to check.
I think it’s too vague: a network-service class could include programs such as
- unbound
- nsd
- samba
and you can’t really apply the same hardening to all of them: unbound can work without any persistent data, nsd may only need a user-defined configuration, and samba needs to access the filesystem.
I don’t think classes of this kind could be used.