Nginx worker processes exit with signal 31 when running via systemd

I’ve had a strange issue since upgrading a server from NixOS 20.03 to 21.05. Initially I was getting read-only filesystem errors for the logs, but I found some docs mentioning that a new ReadWritePaths option is needed in my config as of 20.09. I fixed that, which brings me to my current issue.

First, I’ve made no serious changes to my config besides the aforementioned ReadWritePaths addition. This setup has been working without issue for the past couple of years, and through at least one upgrade.

Second, the issue only occurs when I start nginx through systemd. If I start nginx manually then all is well and good; that is to say, I know there’s nothing wrong with the nginx.conf that ultimately gets generated by nix on switching.

Here are the relevant details from logs, etc.:

  1. No errors are output when I start the service as usual
$ systemctl start nginx.service
  2. After starting the service, running status tells me the workers are exiting
$ systemctl status nginx.service
● nginx.service - Nginx Web Server
     Loaded: loaded (/nix/store/cyy4wdgm32v67gpgh6gnhxx8197v15h8-unit-nginx.service/nginx.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2021-06-05 21:38:36 UTC; 1min 56s ago
    Process: 395995 ExecStartPre=/nix/store/50difldgpz2h0qc4g4p8mbwkgy5ihzib-unit-script-nginx-pre-start/bin/nginx-pre-start (code=exited, status=0/SUCCESS)
   Main PID: 395997 (nginx)
         IP: 0B in, 0B out
         IO: 0B read, 0B written
      Tasks: 3 (limit: 2373)
     Memory: 52.6M
        CPU: 25.248s
     CGroup: /system.slice/nginx.service
             ├─395997 nginx: master process /nix/store/z0rqfwaw46hl0snzaiw8wzr1sxbkjqiw-nginx-1.21.0/bin/nginx -c /nix/store/jfd309xx69g4hs6x4p8fznkjh3lfjy2q-nginx.conf
             ├─398832 nginx: master process /nix/store/z0rqfwaw46hl0snzaiw8wzr1sxbkjqiw-nginx-1.21.0/bin/nginx -c /nix/store/jfd309xx69g4hs6x4p8fznkjh3lfjy2q-nginx.conf
             └─398833 nginx: master process /nix/store/z0rqfwaw46hl0snzaiw8wzr1sxbkjqiw-nginx-1.21.0/bin/nginx -c /nix/store/jfd309xx69g4hs6x4p8fznkjh3lfjy2q-nginx.conf

Jun 05 21:40:30 hostname systemd-coredump[398798]: Process 398793 (nginx) of user 0 dumped core.
Jun 05 21:40:30 hostname nginx[395997]: 2021/06/05 21:40:30 [alert] 395997#395997: worker process 398793 exited on signal 31 (core dumped)
Jun 05 21:40:31 hostname systemd-coredump[398806]: Process 398804 (nginx) of user 0 dumped core.
Jun 05 21:40:31 hostname nginx[395997]: 2021/06/05 21:40:31 [alert] 395997#395997: worker process 398804 exited on signal 31 (core dumped)
Jun 05 21:40:31 hostname systemd-coredump[398813]: Process 398808 (nginx) of user 0 dumped core.
Jun 05 21:40:31 hostname nginx[395997]: 2021/06/05 21:40:31 [alert] 395997#395997: worker process 398808 exited on signal 31 (core dumped)
Jun 05 21:40:32 hostname systemd-coredump[398828]: Process 398823 (nginx) of user 0 dumped core.
Jun 05 21:40:32 hostname systemd-coredump[398821]: Process 398816 (nginx) of user 0 dumped core.
Jun 05 21:40:32 hostname nginx[395997]: 2021/06/05 21:40:32 [alert] 395997#395997: worker process 398823 exited on signal 31 (core dumped)
Jun 05 21:40:32 hostname nginx[395997]: 2021/06/05 21:40:32 [alert] 395997#395997: worker process 398816 exited on signal 31 (core dumped)
  3. Here’s what nginx.service looks like (generated by nixos):
[Unit]
After=network.target
Description=Nginx Web Server
StartLimitIntervalSec=60

[Service]
Environment="LOCALE_ARCHIVE=/nix/store/in621vh2kj0ayqa6qc9pqnjvx6hzj5h5-glibc-locales-2.32-46/lib/locale/locale-archive"
Environment="PATH=/nix/store/a4v1akahda85rl9gfphb07zzw79z8pb1-coreutils-8.32/bin:/nix/store/1hvm45djn8wkfg64gbmlqpfj4dnjh594-findutils-4.7.0/bin:/nix/store/7n3yzh9wza4bdqc04v01xddnfhkrwk2a-gnugrep-3.6/bin:/nix/store/g34ldykl1cal5b9ir3xinnq70m52fcnq-gnused-4.8/bin:/nix/store/r2bw74x7zci7shzxq3cikww9kp1wxc6i-systemd-247.6/bin:/nix/store/a4v1akahda85rl9gfphb07zzw79z8pb1-coreutils-8.32/sbin:/nix/store/1hvm45djn8wkfg64gbmlqpfj4dnjh594-findutils-4.7.0/sbin:/nix/store/7n3yzh9wza4bdqc04v01xddnfhkrwk2a-gnugrep-3.6/sbin:/nix/store/g34ldykl1cal5b9ir3xinnq70m52fcnq-gnused-4.8/sbin:/nix/store/r2bw74x7zci7shzxq3cikww9kp1wxc6i-systemd-247.6/sbin"
Environment="TZDIR=/nix/store/y4j4k0l6w941wriprxz13dhvz896lw3m-tzdata-2020f/share/zoneinfo"


X-StopIfChanged=false
AmbientCapabilities=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_SYS_RESOURCE
CacheDirectory=nginx
CacheDirectoryMode=0750
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_SYS_RESOURCE
ExecReload=/nix/store/z0rqfwaw46hl0snzaiw8wzr1sxbkjqiw-nginx-1.21.0/bin/nginx -c '/nix/store/jfd309xx69g4hs6x4p8fznkjh3lfjy2q-nginx.conf' -t
ExecReload=/nix/store/a4v1akahda85rl9gfphb07zzw79z8pb1-coreutils-8.32/bin/kill -HUP $MAINPID
ExecStart=/nix/store/z0rqfwaw46hl0snzaiw8wzr1sxbkjqiw-nginx-1.21.0/bin/nginx -c '/nix/store/jfd309xx69g4hs6x4p8fznkjh3lfjy2q-nginx.conf'
ExecStartPre=/nix/store/50difldgpz2h0qc4g4p8mbwkgy5ihzib-unit-script-nginx-pre-start/bin/nginx-pre-start
Group=root
LockPersonality=true
LogsDirectory=nginx
LogsDirectoryMode=0750
MemoryDenyWriteExecute=true
NoNewPrivileges=true
PrivateDevices=true
PrivateMounts=true
PrivateTmp=true
ProcSubset=pid
ProtectClock=true
ProtectControlGroups=true
ProtectHome=true
ProtectHostname=true
ProtectKernelLogs=true
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectProc=invisible
ProtectSystem=strict
ReadWritePaths=/var/www/
ReadWritePaths=/run/
RemoveIPC=true
Restart=always
RestartSec=10s
RestrictAddressFamilies=AF_UNIX
RestrictAddressFamilies=AF_INET
RestrictAddressFamilies=AF_INET6
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
RuntimeDirectory=nginx
RuntimeDirectoryMode=0750
SystemCallArchitectures=native
SystemCallFilter=~@cpu-emulation @debug @keyring @ipc @mount @obsolete @privileged @setuid
UMask=0027
User=root

I don’t really know much about systemd, so I’m a little confused and lost at the moment. Like I mentioned before, nginx runs without any issues if I stop the service, then take the ExecStart command from the above service file and run it directly.

$ systemctl stop nginx.service
$ /nix/store/z0rqfwaw46hl0snzaiw8wzr1sxbkjqiw-nginx-1.21.0/bin/nginx -c '/nix/store/jfd309xx69g4hs6x4p8fznkjh3lfjy2q-nginx.conf'

At this point, I can access the sites that are served on this machine from any browser.

Other details that may or may not be important:

  • The server has 2GB of RAM, with 1.4GB available. This hasn’t changed since I spun it up 2 years ago, and has never caused an issue.
  • I’ve rebooted the server a couple times since nixos-rebuild switch --upgrade, but no difference in behavior.
  • I noticed an issue with fail2ban when I initially upgraded (it failed to start during the upgrade), but since a reboot it has been running without issue.
  • No other errors encountered during the upgrade process.
  • Server config is fairly basic, no nixops involved. I really only use nginx to proxy to Docker containers and sign certs.
  • /var/www is the path used to store pretty much all the certs, logs, and html files. /run/ I only added for the default pidfile, but I don’t think it’s actually needed. I get similar results if this is removed from the ReadWritePaths option in my config.

If anyone has any insight or thoughts, it’d be much appreciated. Otherwise, if I happen to make any headway I’ll update the thread with any relevant info.

I tried analyzing the coredumps like below, but nothing is sticking out at me:

$ coredumpctl debug
           PID: 588916 (nginx)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 31 (SYS)
     Timestamp: Sun 2021-06-06 15:42:53 UTC (2s ago)
  Command Line: nginx: master process /nix/store/8nyvpwp7c4slmcqq987n3pyb3kz9mfa2-nginx-1.20.1/bin/nginx -c /nix/store/zy3krx9x16pqzjwg69n7xhkqlx1c1z2s-nginx.conf
    Executable: /nix/store/8nyvpwp7c4slmcqq987n3pyb3kz9mfa2-nginx-1.20.1/bin/nginx
 Control Group: /system.slice/nginx.service
          Unit: nginx.service
         Slice: system.slice
       Boot ID: a5e6061771e94436b6e0d9a928a6e3c0
    Machine ID: 942e05964aac4ebfbdb6c0ae1d66e143
      Hostname: hostname
       Storage: /var/lib/systemd/coredump/core.nginx.0.a5e6061771e94436b6e0d9a928a6e3c0.588916.1622994173000000.lz4
       Message: Process 588916 (nginx) of user 0 dumped core.

GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /nix/store/8nyvpwp7c4slmcqq987n3pyb3kz9mfa2-nginx-1.20.1/bin/nginx...
(No debugging symbols found in /nix/store/8nyvpwp7c4slmcqq987n3pyb3kz9mfa2-nginx-1.20.1/bin/nginx)

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing

warning: Can't open file /run/nscd/db3NG19c (deleted) during file-backed mapping note processing

warning: Can't open file /run/nscd/dbs29XSc (deleted) during file-backed mapping note processing
[New LWP 588916]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libthread_db.so.1".
Core was generated by `nginx: master process /nix/store/8nyvpwp7c4slmcqq987n3pyb3kz9mfa2-nginx-1.20.1/'.
Program terminated with signal SIGSYS, Bad system call.
#0  0x00007f60b16038aa in __nptl_setxid () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libpthread.so.0

A couple of additional notes after trying some things out:

  • Initially I was using nixpkgs.nginxMainline in my config as suggested by the nginx team.
  • Today I tried both nixpkgs.nginx and nixpkgs.nginxUnstable, to no avail. The same issue occurs, and coredumpctl outputs the same info as in my previous post (no debug symbols in nginx, no real leads, etc.).

I’m guessing I’m getting nothing from the coredumps because there doesn’t appear to be anything wrong with nginx itself, but rather with the way it’s invoked through systemd. However, the coredumps do indicate that the executable in question is nginx.

It’s too bad there’s no way to use something other than systemd in NixOS.

It’s very likely the issue isn’t systemd, but instead NixOS module developers. Most of us are overly eager when it comes to applying systemd hardening/sandboxing options to services. If you undo the hardening options applied to the systemd service I think everything will start working for you again.

If you’re unsure how to test what I’m suggesting let me know and I can spend a few minutes coming up with a concrete configuration for you to test.

Thanks, aanderse, for being willing to take the time!

I think I understand what you’re saying, and it makes sense with that in mind why nginx would work when not being run through systemd. I’m not an expert on NixOS, but I’m pretty good at reading and testing. If you have a doc link or a brief example of what you’re suggesting I should be able to extrapolate; just need a good nudge off the right cliff.

I’m guessing there are some options to look into under the systemd namespace to disable/undo the hardening. Or is it a little more complex like writing a custom derivation?

I’m hoping going through these steps will help me understand why I’m even having troubles in the first place, and allow me to make adjustments so my configs are more or less “by the book”, if that’s even a realistic concept in Nixland!

I did a quick diff between the systemd units in 20.09 and 21.05 and I think this is what you should add to your configuration.nix to test my theory:

systemd.services.nginx.serviceConfig = {
  ProcSubset = lib.mkForce "";
  ProtectProc = lib.mkForce "";
  ProtectClock = lib.mkForce false;
  ProtectKernelLogs = lib.mkForce false;
  RestrictNamespaces = lib.mkForce false;
  RemoveIPC = lib.mkForce false;
  SystemCallFilter = lib.mkForce "";
};

This is likely wrong and we’ll have to fiddle with it a bit (removing the overrides one by one until we find the specific option), but it is a starting point for figuring out exactly which added option caused your issue.
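The one-by-one elimination could be organized roughly as follows. This is only a sketch: `candidates` and `enabled` are hypothetical local names, not NixOS options, and the values are simply the ones from the snippet above.

```nix
{ lib, ... }:
let
  # Overrides under test, with the same values as the snippet above.
  candidates = {
    ProcSubset = "";
    ProtectProc = "";
    ProtectClock = false;
    ProtectKernelLogs = false;
    RestrictNamespaces = false;
    RemoveIPC = false;
    SystemCallFilter = "";
  };

  # Hypothetical switchboard: delete one name from this list, rebuild,
  # and check whether the nginx workers start crashing again.
  enabled = [
    "ProcSubset"
    "ProtectProc"
    "ProtectClock"
    "ProtectKernelLogs"
    "RestrictNamespaces"
    "RemoveIPC"
    "SystemCallFilter"
  ];
in
{
  # Keep only the enabled candidates and force each one over the
  # value set by the nginx module.
  systemd.services.nginx.serviceConfig =
    lib.mapAttrs (_: value: lib.mkForce value)
      (lib.filterAttrs (name: _: builtins.elem name enabled) candidates);
}
```

If the crash comes back right after a name is removed from `enabled`, that option is the one that needs to stay overridden.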

SystemCallFilter = lib.mkForce ""; is the one that fixes the initial issue.
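This lines up with the earlier evidence in the thread: signal 31 is SIGSYS ("Bad system call"), which is what a process receives when it makes a system call denied by the unit’s SystemCallFilter; the core dump stopped inside glibc’s __nptl_setxid; and the generated unit denies the @setuid group. A reduced override might look like the sketch below, assuming the other hardening options can stay at the module defaults (this thread does not confirm that for every setup).

```nix
{ lib, ... }:
{
  systemd.services.nginx.serviceConfig = {
    # Assigning an empty string resets the system call deny-list that the
    # NixOS nginx module sets, so setuid/setgid calls are no longer
    # answered with SIGSYS.
    SystemCallFilter = lib.mkForce "";
  };
}
```

Note that this drops the system call deny-list for the service entirely rather than re-allowing only the calls nginx needs.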

It seems that hardening options added between 20.03 and 20.09 may still be getting in the way, as I’m now getting the following permissions-related error:

Jun 07 03:21:52 hostname systemd[1]: Started Nginx Web Server.
Jun 07 03:21:52 hostname nginx[712648]: 2021/06/07 03:21:52 [emerg] 712648#712648: setgid(65534) failed (1: Operation not permitted)
Jun 07 03:21:52 hostname nginx[712649]: 2021/06/07 03:21:52 [emerg] 712649#712649: setgid(65534) failed (1: Operation not permitted)
Jun 07 03:21:52 hostname nginx[712636]: 2021/06/07 03:21:52 [alert] 712636#712636: worker process 712648 exited with fatal code 2 and cannot be respawned
Jun 07 03:21:52 hostname nginx[712636]: 2021/06/07 03:21:52 [alert] 712636#712636: worker process 712649 exited with fatal code 2 and cannot be respawned

I worked out how you came up with the list of options to try, did the same comparison between 20.03 and 20.09, and found a ton more options that were added. I’ve tried them all, but I’m still getting this permissions error. Perhaps some of the new values I’m setting are invalid in some way (or there are better options to try?). I’ve tried each of these individually and all at once:

  systemd.services.nginx.serviceConfig = {
    ReadWritePaths = [ "/var/www/" ];
    SystemCallFilter = lib.mkForce "";
    NoNewPrivileges = lib.mkForce false;
    ProtectSystem = lib.mkForce "";
    ProtectHome = lib.mkForce false;
    PrivateTmp = lib.mkForce false;
    PrivateDevices = lib.mkForce false;
    ProtectHostname = lib.mkForce false;
    ProtectKernelTunables = lib.mkForce false;
    ProtectKernelModules = lib.mkForce false;
    ProtectControlGroups = lib.mkForce false;
    LockPersonality = lib.mkForce false;
    RestrictRealtime = lib.mkForce false;
    RestrictSUIDSGID = lib.mkForce false;
    PrivateMounts = lib.mkForce false;
    SystemCallArchitectures = lib.mkForce "";
    MemoryDenyWriteExecute = lib.mkForce "";
    UMask = lib.mkForce "0002";
  };

From various threads I’ve come across in a search of the error text, it may have something to do with RestrictSUIDSGID, but setting that to false showed no difference in behavior. Just reporting this info back for now and taking a break. I want to look into each of the options that aren’t boolean (including SystemCallFilter) to see if maybe there are some specific modes I could set and test those.

Side note: this is not on a critical server, so I’m not in a bad spot. I do have a production server I’d like to upgrade to 21.05, but it’s running fine as it is now until I sort things out on this machine.

It’s been a while since I’ve been able to really dig into this, but I finally licked it (mostly)! I bashed my head against a wall for about a month and finally took a resetting break to have a fresh mind. Just reporting here for clarity in case anyone else stumbles upon similar issues and finds it useful.

Ultimately, I marked @aanderse’s previous post as the solution, since setting some of the systemd options was the ticket for my original issue.

In my previous post I mentioned new errors that looked like permissions issues of some kind. I finally got onto the right train of thought and realized something about how the services.nginx options in my NixOS config behave.

In my config, I had the following set:

services.nginx = {
  enable = true;
  user = "root";
  group = "root";
};

I thought this would cause nginx to run as root, but it turns out these options only set User/Group in the nginx.service file. When I checked ps aux | grep nginx, I noticed the master process was running as root, but the worker processes were running as the default nobody. So I added user root root; to the main (top-level) scope of my nginx.conf to fix it.
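For anyone wanting to do the same from configuration.nix, one possible way is sketched below. It assumes services.nginx.appendConfig is used to add the line at the top level of the generated nginx.conf; the post does not say which mechanism was actually used.

```nix
{
  services.nginx = {
    enable = true;
    user = "root";
    group = "root";

    # Assumption: appendConfig places these lines in the main context of
    # the generated nginx.conf, outside the http block.
    appendConfig = ''
      user root root;
    '';
  };
}
```

With this directive the workers keep running as root, so they no longer try to switch to nobody and hit the setgid(65534) "Operation not permitted" error shown earlier.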

My only remaining issue was not being able to write to a specific directory, so I just added that directory to ReadWritePaths, and voilà! I now have a working setup under 21.05. Finally.
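As a rough illustration of that last step: with ProtectSystem=strict in the generated unit, the file system is mounted read-only for the service apart from a few API file systems, so every directory nginx writes to has to be listed explicitly. The second path below is a hypothetical placeholder, since the actual directory is not named in this thread.

```nix
{
  systemd.services.nginx.serviceConfig = {
    ReadWritePaths = [
      "/var/www/"
      # Hypothetical placeholder for the extra directory nginx must write to.
      "/path/to/writable/dir"
    ];
  };
}
```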

Thanks for nudging me in the right direction, @aanderse!