Let's Encrypt on 20.09

I’ve got some issues with Let’s Encrypt on NixOS 20.09: certificate creation fails for some certificates, but not all, with strange errors that I cannot figure out.

security.acme.certificates = {
  dovecot2.mail.example.com = {
    allowKeysForGroup = false;
    credentialsFile = «error: The option `security.acme.certs.dovecot2.mail.example.com.credentialsFile' is used but not defined.»;
    directory = "/var/lib/acme/dovecot2.mail.example.com";
    dnsPropagationCheck = true;
    dnsProvider = null;
    domain = "mail.example.com";
    email = "kontakt@example.com";
    extraDomains = { };
    group = "dovecot2";
    keyType = "ec256";
    postRun = "systemctl restart dovecot2.service";
    server = null;
    user = "root";
    webroot = "/var/lib/acme/acme-challenge";
  };
  mail.example.com = {
    allowKeysForGroup = false;
    credentialsFile = «error: The option `security.acme.certs.mail.example.com.credentialsFile' is used but not defined.»;
    directory = "/var/lib/acme/mail.example.com";
    dnsPropagationCheck = true;
    dnsProvider = null;
    domain = "mail.example.com";
    email = "kontakt@example.com";
    extraDomains = { };
    group = "nginx";
    keyType = "ec256";
    postRun = ''
      systemctl reload nginx
    '';
    server = null;
    user = "nginx";
    webroot = "/var/lib/acme/acme-challenge";
  };
  map.example.com = {
    allowKeysForGroup = false;
    credentialsFile = «error: The option `security.acme.certs.map.example.com.credentialsFile' is used but not defined.»;
    directory = "/var/lib/acme/map.example.com";
    dnsPropagationCheck = true;
    dnsProvider = null;
    domain = "map.example.com";
    email = "kontakt@example.com";
    extraDomains = { };
    group = "nginx";
    keyType = "ec256";
    postRun = ''
      systemctl reload nginx
    '';
    server = null;
    user = "nginx";
    webroot = "/var/lib/acme/acme-challenge";
  };
  postfix.mail.example.com = {
    allowKeysForGroup = false;
    credentialsFile = «error: The option `security.acme.certs.postfix.mail.example.com.credentialsFile' is used but not defined.»;
    directory = "/var/lib/acme/postfix.mail.example.com";
    dnsPropagationCheck = true;
    dnsProvider = null;
    domain = "mail.example.com";
    email = "kontakt@example.com";
    extraDomains = { };
    group = "postfix";
    keyType = "ec256";
    postRun = "systemctl restart postfix.service";
    server = null;
    user = "root";
    webroot = "/var/lib/acme/acme-challenge";
  };
};

The output is from nixos-option security.acme.certs.
The two certificates for nginx (mail.example.com and map.example.com) work fine; the other two (dovecot2… and postfix…) reliably fail with various error messages.

After upgrading from NixOS 20.03:

-- Reboot --
Nov 10 22:14:35 mail systemd[1]: Starting Renew ACME certificate for dovecot2.mail.example.com...
Nov 10 22:14:39 mail acme-dovecot2.mail.example.com-start[880]: 2020/11/10 22:14:39 Could not load RSA private key from file accounts/acme-v02.api.letsencrypt.org/kontakt@example.com/keys/kontakt@example.com.key: open accounts/acme-v02.api.letsencrypt.org/kontakt@example.com/keys/kontakt@example.com.key: permission denied
Nov 10 22:14:39 mail systemd[1]: acme-dovecot2.mail.example.com.service: Main process exited, code=exited, status=1/FAILURE
Nov 10 22:14:39 mail systemd[1]: acme-dovecot2.mail.example.com.service: Failed with result 'exit-code'.
Nov 10 22:14:39 mail systemd[1]: Failed to start Renew ACME certificate for dovecot2.mail.example.com.

I then simply removed /var/lib/acme and rebooted, and got this error:

-- Logs begin at Fri 2020-10-30 20:25:06 CET. --
Nov 10 22:17:20 mail acme-dovecot2.mail.example.com-start[951]: 2020/11/10 22:17:20 [INFO] [mail.example.com] acme: Trying to solve HTTP-01
Nov 10 22:17:26 mail acme-dovecot2.mail.example.com-start[951]: 2020/11/10 22:17:26 [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/1234567890
Nov 10 22:17:26 mail acme-dovecot2.mail.example.com-start[951]: 2020/11/10 22:17:26 [INFO] Unable to deactivate the authorization: https://acme-v02.api.letsencrypt.org/acme/authz-v3/1234567890
Nov 10 22:17:26 mail acme-dovecot2.mail.example.com-start[951]: 2020/11/10 22:17:26 Could not obtain certificates:
Nov 10 22:17:26 mail acme-dovecot2.mail.example.com-start[951]:         error: one or more domains had a problem:
Nov 10 22:17:26 mail acme-dovecot2.mail.example.com-start[951]: [mail.example.com] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: Invalid response from http://mail.example.com/.well-known/acme-challenge/YG_thaYCj1-grYnkw6dfuj4Q_9x0WkPAGyY1RXXN1bk [2a03:1234:1234::e]: "<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n", url:
Nov 10 22:17:26 mail systemd[1]: acme-dovecot2.mail.example.com.service: Main process exited, code=exited, status=1/FAILURE

This does not make any sense to me. The mail.example.com certificate works, but creating another certificate for the same domain fails?

I tried to change the group on /var/lib/acme/acme-challenge/.well-known/acme-challenge from dovecot2 to nginx, which apparently did not change anything.

At some random point (I honestly cannot figure out anymore what I did to cause that) I got

-- Logs begin at Fri 2020-10-30 20:25:06 CET, end at Tue 2020-11-10 23:04:46 CET. --
Nov 10 23:03:58 mail systemd[1]: Starting Renew ACME certificate for dovecot2.mail.example.com...
Nov 10 23:03:59 mail acme-dovecot2.mail.example.com-start[1950]: 2020/11/10 23:03:59 [INFO] [mail.example.com] acme: Obtaining bundled SAN certificate
Nov 10 23:03:59 mail acme-dovecot2.mail.example.com-start[1950]: 2020/11/10 23:03:59 Could not obtain certificates:
Nov 10 23:03:59 mail acme-dovecot2.mail.example.com-start[1950]:         acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:malformed :: JWS verification error, url:
Nov 10 23:03:59 mail systemd[1]: acme-dovecot2.mail.example.com.service: Main process exited, code=exited, status=1/FAILURE

Some time after that I hit the rate limit for creating new accounts on LE, so I reverted back to 20.03 and restored /var/lib/acme.

Any ideas what could be wrong in my setup?

Yeah, unfortunately the ACME module is kind of weird on 20.09, I got bitten by this as well.

The following workaround solved the problem for me: Restructure acme module by m1cr0man · Pull Request #91121 · NixOS/nixpkgs · GitHub

Also related: ACME fails with JWS verification error · Issue #101445 · NixOS/nixpkgs · GitHub

This indeed looks like the same race condition discussed in ACME fails with JWS verification error · Issue #101445 · NixOS/nixpkgs · GitHub

The ongoing discussion in nixos: mutually exclusive services; application to acme by symphorien · Pull Request #102387 · NixOS/nixpkgs · GitHub will result in a solution, but in the meantime a reliable workaround is to add an alias to the “email” parameter of each cert. For example, in your case you could use kontakt+dovecot@example.com. This will create a separate accounts directory for each cert. Bear in mind there is a 5 accounts per day rate limit.
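
As a rough sketch of that workaround (not a tested configuration): the cert names are taken from the first post, kontakt+dovecot@example.com is the alias suggested above, and kontakt+postfix@example.com is an alias I made up the same way for illustration.

```nix
{
  security.acme.certs = {
    # A distinct email alias per cert means a separate ACME account
    # (and a separate accounts directory) for each of them.
    "dovecot2.mail.example.com".email = "kontakt+dovecot@example.com";
    # Hypothetical alias, following the same pattern.
    "postfix.mail.example.com".email = "kontakt+postfix@example.com";
  };
}
```

Each new alias registers a new account on the next run, so it is probably best to change only the failing certs and keep the rate limit in mind.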

@m1cr0man I tried adding email to each certificate, but no luck. I’m still stuck at the same error.

@Ma27 I tried to replicate that, but it does not work, as I’m stuck at a different error message when I just tried again:

Nov 12 22:21:39 mail acme-dovecot2.mail.example.com-start[1695]: [mail.example.com] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: Invalid response from http://mail.example.com/.well-known/acme-challenge/-uMqnmW_dUqozOZ5KXkCvodh1LYYolGJTY19Pu8UsFM [2a03:2267:ffff:c00::e]: "<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n", url:

That’s very odd that you received the same error. Can you confirm that /var/lib/acme/.lego/accounts contained at least as many directories as you have certificates?

I guess this bug could be upstream then? If it’s not a race condition between systemd services then I can’t think of any other way it would fail.

I do not have that acme directory anymore, so I cannot check without trying again, but I only added suffixes in email addresses to those certificates that failed to renew.
Also please note that I was able to get that JWS error just a single time and haven’t been able to replicate it. I’m stuck at this error:

Nov 12 22:21:39 mail acme-dovecot2.mail.example.com-start[1695]: [mail.example.com] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: Invalid response from http://mail.example.com/.well-known/acme-challenge/-uMqnmW_dUqozOZ5KXkCvodh1LYYolGJTY19Pu8UsFM [2a03:2267:ffff:c00::e]: "<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n", url:

This makes no sense at all to me, and my best guess is that the acme client is failing to write the token into that directory.

@tokudan I’ve found a fix. The option security.acme.certs.<name>.group has to be set to “nginx” so that the nginx user has access to that directory.

I cannot set the group to nginx, as that would prevent dovecot and postfix from accessing those keys that were just generated for them.
If the acme client only works with the running webserver it’s broken.

What I’ve done: set the group to “sslKeys” (or whatever you’d like to call it) and added the nginx, dovecot and postfix users to that group. Does that work for you?
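
Roughly like this, as a minimal sketch. It assumes the service users are called nginx, dovecot2 and postfix (matching the group names in the first post, so adjust if yours differ), and “sslKeys” is just an arbitrary group name.

```nix
{
  # Shared group for everything that needs to read the certs.
  # The user names are assumptions; check them on your system.
  users.groups.sslKeys.members = [ "nginx" "dovecot2" "postfix" ];

  security.acme.certs = {
    "dovecot2.mail.example.com".group = "sslKeys";
    "postfix.mail.example.com".group = "sslKeys";
  };
}
```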

Yeah, that’s what I’ve done now, but it still means that services have access to other services’ certificates, which is suboptimal from a security view.

I don’t understand what the issue is. You set the ACME group individually for each cert, so “sslKeys” doesn’t have to be the only group. You could also use a group like “sslKeysMail” for dovecot & postfix, the “nginx” group for nginx, and so on…

EDIT: disregard that, now I understand. Yeah, nginx has to be able to read all of them, but it still stops other services from reading each other’s certs.
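
A sketch of what that split could look like, under the same assumption about user names (nginx, dovecot2, postfix). The group names sslKeysDovecot and sslKeysPostfix are invented for this example.

```nix
{
  # One group per mail service. nginx is a member of both because it
  # has to serve the HTTP-01 challenge files for every cert.
  users.groups.sslKeysDovecot.members = [ "nginx" "dovecot2" ];
  users.groups.sslKeysPostfix.members = [ "nginx" "postfix" ];

  security.acme.certs = {
    "dovecot2.mail.example.com".group = "sslKeysDovecot";
    "postfix.mail.example.com".group = "sslKeysPostfix";
  };
}
```

With this layout dovecot and postfix cannot read each other’s keys, while nginx can still read both.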

I just updated a NixOS 20.09 machine to the latest nixos-20.09 channel (I update weekly) and now I’m getting this:

A dependency job for acme-finished-DOMAIN.target failed. See 'journalctl -xe' for details. 

This happens for every domain with a cert. It looks like this is being caused by the selfsigned cert service:

$ systemctl status acme-selfsigned-DOMAIN.service
● acme-selfsigned-DOMAIN.service - Generate self-signed certificate for DOMAIN
     Loaded: loaded (/nix/store/9mj8sh2fn2fp6s8gh5961crx6yb45w0q-unit-acme-selfsigned-hq.pmade.com.service/acme-selfsigned-DOMAIN.servic>
     Active: inactive (dead)
  Condition: start condition failed at Mon 2020-12-07 23:40:45 UTC; 27min ago
             └─ ConditionPathExists=!/var/lib/acme/DOMAIN/key.pem was not met

What in the world is going on here? Don’t we want the selfsigned service to fail if the key exists so it doesn’t overwrite it? Is there a way to keep the failure from propagating up to the finish target?

The condition here starts with an “!”, so the logic is negated. It’s such that the selfsigned certs will not be used if a key exists already. It doesn’t stop the finish target - that is happening for some other reason. You need to check the log for acme-DOMAIN.service instead.
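
To illustrate the negation with a made-up unit (this is not the real selfsigned service from the ACME module, only a hypothetical one using the same kind of condition):

```nix
{
  # Hypothetical unit, only here to show the negated condition.
  systemd.services.example-selfsigned = {
    description = "Runs only while no key exists yet";
    # The leading "!" negates the check: the unit starts only if the
    # path does NOT exist. If key.pem is already there, systemd skips
    # the unit instead of marking it as failed.
    unitConfig.ConditionPathExists = "!/var/lib/acme/DOMAIN/key.pem";
    serviceConfig.Type = "oneshot";
    script = ''
      echo "no key.pem yet, a placeholder certificate would be generated here"
    '';
  };
}
```

A skipped condition is not treated as a failure by systemd, which is why “Active: inactive (dead)” with “start condition failed” in the status output is harmless here.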

With regard to the original issue in this thread (the 403 from Let’s Encrypt and the JWS verification error): I’ve done some work to remove the dependency on systemd-tmpfiles and I’ve prevented the race conditions surrounding multiple simultaneous account creation attempts. Could people give #106857 a test, please, and let me know here if it resolves your issue? In a couple of other related tickets, it’s been successful so far.

If you are still having issues please run systemctl clean --what=state acme-DOMAIN.service and try starting the service again. This will delete the account and certs for the listed domain, and recreate them.

I found that removing the directory /var/lib/acme/.lego/ fixed it. I did:

$ sudo mv /var/lib/acme/.lego/ /var/lib/acme/.lego.backup/

This was my error:

warning: the following units failed: acme-aaronhall.dev.service, acme-blog.aaronhall.dev.service

● acme-blog.aaronhall.dev.service - Renew ACME certificate for blog.aaronhall.dev
     Loaded: loaded (/nix/store/xnkf6bf42za28j5q1jjvsxqwbkz321j6-unit-acme-blog.aaronhall.dev.service/acme-blog.aaronhall.dev.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2020-12-16 13:58:27 EST; 180ms ago
TriggeredBy: ● acme-blog.aaronhall.dev.timer
    Process: 13106 ExecStart=/nix/store/27105pv23yxr8mhvbnm7aq1nxpzf8qbx-unit-script-acme-blog.aaronhall.dev-start/bin/acme-blog.aaronhall.dev-start (code=exited, status=1/FAILURE)
   Main PID: 13106 (code=exited, status=1/FAILURE)
         IP: 0B in, 0B out
        CPU: 28ms

Dec 16 13:58:27 nixos systemd[1]: Starting Renew ACME certificate for blog.aaronhall.dev...
Dec 16 13:58:27 nixos acme-blog.aaronhall.dev-start[13109]: 2020/12/16 13:58:27 Could not load RSA private key from file accounts/acme-v02.api.letsencrypt.org/<emailredacted>/keys/<emailredacted>.key: open accounts/acme-v02.api.letsencrypt.org/<emailredacted>/keys/<emailredacted>.key: permission denied
Dec 16 13:58:27 nixos systemd[1]: acme-blog.aaronhall.dev.service: Main process exited, code=exited, status=1/FAILURE
Dec 16 13:58:27 nixos systemd[1]: acme-blog.aaronhall.dev.service: Failed with result 'exit-code'.
Dec 16 13:58:27 nixos systemd[1]: Failed to start Renew ACME certificate for blog.aaronhall.dev.

Would you be able to do an ls -al of your /var/lib/acme/.lego.backup/accounts/acme-v02.api.letsencrypt.org/<emailredacted>/keys directory? Everything in there should be owned by the acme user, and the service should be running as acme. I can’t think of a scenario where these permissions become wrong, but here we are.

Looks like whatever I did about 3 months ago didn’t fully fix it. I’ve now renamed /var/lib/acme and rebooted in the hopes that a clean initialization lasts longer.
The certificate generation was successful, so I’m going to see if it actually helped in a bit over 2 months.

I don’t know if you’ve seen that the backport of the account rate limit fixes and the removal of systemd tmpfiles was merged today: nixos/acme: Backport account rate limit fixes and tmpfile removal by m1cr0man · Pull Request #112145 · NixOS/nixpkgs · GitHub

I would recommend pulling that as soon as it is available. It seems to be fixing a plethora of issues beyond just the account rate limit problems.

@m1cr0man that sounds very good and could end a lot of pain I had since I migrated to 20.09 :)