Let's Encrypt on 20.09

I’ve got some issues with Let’s Encrypt on NixOS 20.09: certificate creation fails for some certificates, but not all, with strange errors that I cannot figure out.

security.acme.certificates = {
  dovecot2.mail.example.com = {
    allowKeysForGroup = false;
    credentialsFile = «error: The option `security.acme.certs.dovecot2.mail.example.com.credentialsFile' is used but not defined.»;
    directory = "/var/lib/acme/dovecot2.mail.example.com";
    dnsPropagationCheck = true;
    dnsProvider = null;
    domain = "mail.example.com";
    email = "kontakt@example.com";
    extraDomains = { };
    group = "dovecot2";
    keyType = "ec256";
    postRun = "systemctl restart dovecot2.service";
    server = null;
    user = "root";
    webroot = "/var/lib/acme/acme-challenge";
  };
  mail.example.com = {
    allowKeysForGroup = false;
    credentialsFile = «error: The option `security.acme.certs.mail.example.com.credentialsFile' is used but not defined.»;
    directory = "/var/lib/acme/mail.example.com";
    dnsPropagationCheck = true;
    dnsProvider = null;
    domain = "mail.example.com";
    email = "kontakt@example.com";
    extraDomains = { };
    group = "nginx";
    keyType = "ec256";
    postRun = ''
      systemctl reload nginx
    '';
    server = null;
    user = "nginx";
    webroot = "/var/lib/acme/acme-challenge";
  };
  map.example.com = {
    allowKeysForGroup = false;
    credentialsFile = «error: The option `security.acme.certs.map.example.com.credentialsFile' is used but not defined.»;
    directory = "/var/lib/acme/map.example.com";
    dnsPropagationCheck = true;
    dnsProvider = null;
    domain = "map.example.com";
    email = "kontakt@example.com";
    extraDomains = { };
    group = "nginx";
    keyType = "ec256";
    postRun = ''
      systemctl reload nginx
    '';
    server = null;
    user = "nginx";
    webroot = "/var/lib/acme/acme-challenge";
  };
  postfix.mail.example.com = {
    allowKeysForGroup = false;
    credentialsFile = «error: The option `security.acme.certs.postfix.mail.example.com.credentialsFile' is used but not defined.»;
    directory = "/var/lib/acme/postfix.mail.example.com";
    dnsPropagationCheck = true;
    dnsProvider = null;
    domain = "mail.example.com";
    email = "kontakt@example.com";
    extraDomains = { };
    group = "postfix";
    keyType = "ec256";
    postRun = "systemctl restart postfix.service";
    server = null;
    user = "root";
    webroot = "/var/lib/acme/acme-challenge";
  };
};

The output is from nixos-option security.acme.certs.
The two certificates for nginx (mail.example.com and map.example.com) work fine; the other two (dovecot2… and postfix…) reliably fail with various error messages.

After upgrading from NixOS 20.03:

-- Reboot --
Nov 10 22:14:35 mail systemd[1]: Starting Renew ACME certificate for dovecot2.mail.example.com...
Nov 10 22:14:39 mail acme-dovecot2.mail.example.com-start[880]: 2020/11/10 22:14:39 Could not load RSA private key from file accounts/acme-v02.api.letsencrypt.org/kontakt@example.com/keys/kontakt@example.com.key: open accounts/acme-v02.api.letsencrypt.org/kontakt@example.com/keys/kontakt@example.com.key: permission denied
Nov 10 22:14:39 mail systemd[1]: acme-dovecot2.mail.example.com.service: Main process exited, code=exited, status=1/FAILURE
Nov 10 22:14:39 mail systemd[1]: acme-dovecot2.mail.example.com.service: Failed with result 'exit-code'.
Nov 10 22:14:39 mail systemd[1]: Failed to start Renew ACME certificate for dovecot2.mail.example.com.

I then simply removed /var/lib/acme and rebooted, after which I got this error:

-- Logs begin at Fri 2020-10-30 20:25:06 CET. --
Nov 10 22:17:20 mail acme-dovecot2.mail.example.com-start[951]: 2020/11/10 22:17:20 [INFO] [mail.example.com] acme: Trying to solve HTTP-01
Nov 10 22:17:26 mail acme-dovecot2.mail.example.com-start[951]: 2020/11/10 22:17:26 [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/1234567890
Nov 10 22:17:26 mail acme-dovecot2.mail.example.com-start[951]: 2020/11/10 22:17:26 [INFO] Unable to deactivate the authorization: https://acme-v02.api.letsencrypt.org/acme/authz-v3/1234567890
Nov 10 22:17:26 mail acme-dovecot2.mail.example.com-start[951]: 2020/11/10 22:17:26 Could not obtain certificates:
Nov 10 22:17:26 mail acme-dovecot2.mail.example.com-start[951]:         error: one or more domains had a problem:
Nov 10 22:17:26 mail acme-dovecot2.mail.example.com-start[951]: [mail.example.com] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: Invalid response from http://mail.example.com/.well-known/acme-challenge/YG_thaYCj1-grYnkw6dfuj4Q_9x0WkPAGyY1RXXN1bk [2a03:1234:1234::e]: "<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n", url:
Nov 10 22:17:26 mail systemd[1]: acme-dovecot2.mail.example.com.service: Main process exited, code=exited, status=1/FAILURE

This does not make any sense to me. mail.example.com works, but creating another certificate for the same domain fails?

I tried to change the group on /var/lib/acme/acme-challenge/.well-known/acme-challenge from dovecot2 to nginx, which apparently did not change anything.

At some random point (I honestly cannot figure out anymore what I did to cause it) I got:

-- Logs begin at Fri 2020-10-30 20:25:06 CET, end at Tue 2020-11-10 23:04:46 CET. --
Nov 10 23:03:58 mail systemd[1]: Starting Renew ACME certificate for dovecot2.mail.example.com...
Nov 10 23:03:59 mail acme-dovecot2.mail.example.com-start[1950]: 2020/11/10 23:03:59 [INFO] [mail.example.com] acme: Obtaining bundled SAN certificate
Nov 10 23:03:59 mail acme-dovecot2.mail.example.com-start[1950]: 2020/11/10 23:03:59 Could not obtain certificates:
Nov 10 23:03:59 mail acme-dovecot2.mail.example.com-start[1950]:         acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:malformed :: JWS verification error, url:
Nov 10 23:03:59 mail systemd[1]: acme-dovecot2.mail.example.com.service: Main process exited, code=exited, status=1/FAILURE

Some time after that I hit the rate limit for creating new accounts on LE, so I reverted to 20.03 and restored /var/lib/acme.

Any ideas what could be wrong in my setup?

Yeah, unfortunately the ACME module is kind of weird on 20.09, I got bitten by this as well.

The following workaround solved the problem for me: https://github.com/NixOS/nixpkgs/pull/91121#issuecomment-692180005

Also related: https://github.com/NixOS/nixpkgs/issues/101445

This indeed looks like the same race condition discussed in https://github.com/NixOS/nixpkgs/issues/101445

The ongoing discussion in https://github.com/NixOS/nixpkgs/pull/102387 will result in a solution, but in the meantime a reliable workaround is to add an alias to the “email” parameter of each cert. For example, in your case you could use kontakt+dovecot@example.com. This will create a separate accounts directory for each cert. Bear in mind there is a rate limit of 5 accounts per day.
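
A minimal sketch of that workaround, assuming the cert names from the original post. Only kontakt+dovecot@example.com is taken from the suggestion above; the kontakt+postfix alias is a made-up example, and any unique suffix should do:

```nix
{
  security.acme.certs = {
    # A distinct address per cert should give each one its own
    # account directory instead of a shared one.
    "dovecot2.mail.example.com".email = "kontakt+dovecot@example.com";
    # Hypothetical alias, chosen here only for illustration.
    "postfix.mail.example.com".email = "kontakt+postfix@example.com";
  };
}
```

These attributes merge with the existing definitions of the same certs, so the other settings (group, postRun, webroot) can stay where they are.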

@m1cr0man I tried adding a separate email to each certificate, but no luck. I’m still stuck at the same error.

@Ma27 I tried to replicate that, but it does not work for me. When I tried again just now, I got stuck at a different error message:

Nov 12 22:21:39 mail acme-dovecot2.mail.example.com-start[1695]: [mail.example.com] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: Invalid response from http://mail.example.com/.well-known/acme-challenge/-uMqnmW_dUqozOZ5KXkCvodh1LYYolGJTY19Pu8UsFM [2a03:2267:ffff:c00::e]: "<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n", url:

It’s very odd that you received the same error. Can you confirm that /var/lib/acme/.lego/accounts contained at least as many directories as you have certificates?

I guess this bug could be upstream then? If it’s not a race condition between systemd services then I can’t think of any other way it would fail.

I do not have that acme directory anymore, so I cannot check without trying again, but I only added suffixes to the email addresses of those certificates that failed to renew.
Also please note that I was able to get that JWS error just a single time and haven’t been able to replicate it. I’m stuck at this error:

Nov 12 22:21:39 mail acme-dovecot2.mail.example.com-start[1695]: [mail.example.com] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: Invalid response from http://mail.example.com/.well-known/acme-challenge/-uMqnmW_dUqozOZ5KXkCvodh1LYYolGJTY19Pu8UsFM [2a03:2267:ffff:c00::e]: "<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n", url:

This makes no sense at all to me. My best guess is that the acme client is failing to write the token into that directory.

@tokudan I’ve found a fix. The option security.acme.certs.<name>.group has to be set to “nginx” so that the nginx user has access to the challenge directory.

I cannot set the group to nginx, as that would prevent dovecot and postfix from accessing the keys that were just generated for them.
If the acme client only works when the group matches the running webserver, it’s broken.

What I’ve done: set the group to “sslKeys” (or whatever you’d like to call it) and added nginx, dovecot and postfix to that group. Does that work for you?
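
Roughly, that could look like the sketch below. It assumes the cert names from the original post and that the three services run under the users configured in their modules (services.nginx.user, services.dovecot2.user, services.postfix.user); adjust it if your setup differs:

```nix
{ config, ... }:
{
  # Shared group for everything that needs to read the certs and keys.
  # "sslKeys" is just the name used above; any group name works.
  users.groups.sslKeys.members = [
    config.services.nginx.user
    config.services.dovecot2.user
    config.services.postfix.user
  ];

  security.acme.certs = {
    # nginx is in the group so it can serve the HTTP-01 challenge files,
    # while dovecot and postfix can still read their keys.
    "dovecot2.mail.example.com".group = "sslKeys";
    "postfix.mail.example.com".group = "sslKeys";
  };
}
```

The two nginx certs can keep group = "nginx", since they already work.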