I do have a working setup of cgroups v2 and systemd-run. I don't remember
how I arrived at it; it's a collection of various options and hacks
which is not straightforward to find in any single wiki or guide. I really
ought to document this in a blog post, but for now this mail will do.
Systemd has a complicated relationship with cgroups. Basically, systemd
as a project added semantic meaning to the cgroup hierarchies, and some
people were not happy with that. At the time, it was known on the
kernel mailing list that having each controller (blkio, cpu, memory) on
its own separate hierarchy led to problems and weird hacks. The idea was
to do a new version of cgroups with only a single hierarchy and a
single process modifying that hierarchy. On this new version, that job
would naturally fall to systemd.
So, to make the story short: to use systemd-run to limit a group of
processes, it is better to use only the cgroups v2 interface. On NixOS
this is as straightforward as putting
  boot = {
    kernelParams = [
      "cgroup_no_v1=all"
      "systemd.unified_cgroup_hierarchy=yes"
    ];
  };
in configuration.nix. What this does is pretty clear from the option
names, so no further explanation will be given.
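To confirm it took effect after rebooting, one quick check (just one way
of doing it) is that only the unified hierarchy is mounted:

$ findmnt -t cgroup2   # should show a single cgroup2 mount on /sys/fs/cgroup
$ findmnt -t cgroup    # should print nothing, since no v1 hierarchies are left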
On boot-up, you can check your new /sys/fs/cgroup. Mine looks like
this:
.
├── init.scope
├── system.slice
│   ├── accounts-daemon.service
│   ├── atd.service
│   ├── (..)
│   ├── upower.service
│   └── wpa_supplicant.service
└── user.slice
    ├── user-1000.slice
    │   ├── session-2.scope
    │   └── user@1000.service
    │       ├── geoclue-agent.service
    │       ├── (..)
    │       └── redshift.service
    └── user-78.slice
        └── user@78.service
            └── init.scope
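By the way, if you prefer not to stare at /sys/fs/cgroup directly,
systemd-cgls gives a similar view of the same tree, annotated with the
processes inside each cgroup:

$ systemd-cgls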
On cgroups v2, the hierarchy is different from cgroups v1. You can
check the full docs in the Linux kernel repo (you should, it's
interesting), but the gist of it is:
 - Processes live either at the root or at the leaves of the hierarchy.
   No process can belong to a cgroup that has children, except for the
   root.
 - Controllers follow the hierarchy. The available controllers are
   listed in the file /sys/fs/cgroup/cgroup.controllers, and we can
   delegate a controller to the levels below by writing, for example:
   # echo +memory \
   >>/sys/fs/cgroup/system.slice/cgroup.subtree_control
   But if we do that, the controller has to be enabled in the
   subtree_control files all the way up to the root. There are some
   pretty good diagrams on the web of how this is set up, I can't do
   a good job with pure text :-) ; a rough shell sketch follows below.
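As that sketch (take the paths as an example for a user slice, not
gospel): delegating the memory controller down to a nested cgroup means
enabling it in every subtree_control file on the way from the root:

# echo +memory >>/sys/fs/cgroup/cgroup.subtree_control
# echo +memory >>/sys/fs/cgroup/user.slice/cgroup.subtree_control
# echo +memory >>/sys/fs/cgroup/user.slice/user-1000.slice/cgroup.subtree_control

Each write only succeeds if 'memory' already shows up in that
directory's cgroup.controllers, which is exactly what the parent's
subtree_control decides.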
So where does this leave us? We want to run systemd-run to limit some
memory-intensive program. In fact, I have one in mind:
module Main where

-- A lazy right fold over a huge list: this builds a long chain of
-- thunks and leaks memory, which is exactly what we want here.
main :: IO ()
main = print $ foldr (+) 0 [1..10000000000]
Any haskeller worth their salt will tell you this leaks (on my crappy
netbook, even more so). So I compile this program and call it foldr.
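For completeness (assuming the source above is saved as foldr.hs), the
build is nothing special:

$ ghc foldr.hs -o foldr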
Let's limit the amount of memory it will use. As per the previous point,
I need to be sure the cgroup I am about to create will be able to limit
memory, that is, I need to check that the subtree_control of that cgroup
has the memory controller activated. For that, I first create the cgroup
with a shell in it, like this:
systemd-run --user --scope -p MemoryHigh=1800M -p MemoryMax=2G \
-p MemorySwapMax=40M zsh
This command prints which unit (and hence which cgroup) the shell belongs
to. We could also have gotten that info from /proc/$$/cgroup . We look
for that cgroup in the hierarchy under /sys/fs/cgroup/ and read its
subtree_control file. Mine was at
cd user.slice/user-1000.slice/user@1000.service/run-r9e5d9b1bcabe4e1dac0f55a4ef1414a5.scope
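Equivalently, since on a pure v2 setup /proc/$$/cgroup contains a single
line of the form 0::/path, one can jump straight to the right directory;
a convenience, not something the rest depends on:

$ cd /sys/fs/cgroup$(cut -d: -f3- /proc/$$/cgroup)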
Here I read the file subtree_control and it's empty. For the limits to
take effect it should have 'memory' written in it. Luckily, 'memory' is
present in the file cgroup.controllers, which means I can do
# echo +memory >>cgroup.subtree_control
and enable the limits for this group. If 'memory' had not been available
in cgroup.controllers, I would have had to 'cd ..' up a level and try the
same thing until it succeeded, and then iterate back downwards.
Now we can check everything is in order. The files memory.high,
memory.max and memory.swap.max should be set to the values you passed
to systemd-run.
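A quick way to see them, from inside that same cgroup directory:

$ cat memory.high memory.max memory.swap.max   # the three limits, in bytes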
The 'foldr' program will be a child of the current shell, which is in
this cgroup; when run, it will also belong to this cgroup and should be
limited.
$ ~/foldr
In another shell I run
$ ps -o %mem,rss,comm -p $(pgrep foldr)
%MEM RSS COMMAND
50.6 1836320 foldr
And that remains stable, not increasing past the limit I set. Depending
on the memory.high vs memory.max values you set, the OOM killer will be
triggered. I killed the process with a kill command from another shell,
recovered control of the original shell, and executed exit to tear down
the cgroup.
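If you are curious how often the limits actually kicked in, the same
cgroup directory keeps counters in memory.events (fields like 'high',
'max', 'oom' and 'oom_kill'):

$ cat memory.events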
It's an involved setup, but it's fun knowing you can remain interactive
even when running badly behaved programs (hello Matlab on my shitty
netbook). It needs some setup, but it's nothing once done, and you can
always use your shell history to remember how it's done.
Good luck