Discourse dying (and bringing down a bunch of things as a result), can't figure out why

Hey all! I have no idea what’s causing this issue, because it’s really difficult to actually track down.

After I initially deploy Discourse, everything works, but as soon as I try to authenticate to set it up, it just dies. And it brings down all my other web services with it.

Now, when I first tried to figure out this issue I thought it was running out of RAM or something; that’s the obvious answer. But it’s not: RAM usage never exceeds 2-3 GB, and I’ve got 8 GB available on this server. My SSH connections aren’t disrupted, and it doesn’t look like PostgreSQL is crashing, or anything else for that matter. The dmesg logs don’t show the OOM killer being triggered either. Yet all my web services – Matrix included – just stop accepting connections completely. And they’re the only ones that go boom; nothing else dies at all. (Granted, it’s really hard to tell when I view all the logs, because so much information is thrown at me, but still…) I’m using PostgreSQL 16.3 for my database, but I don’t really understand how that could cause this, because everything else using it works fine – unless Discourse is doing something unusual with it.
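For what it’s worth, this is roughly how I checked for OOM killer activity (commands from memory; the kernel logs “Out of memory: Killed process …” when it fires):

```shell
# Kernel ring buffer (may require root):
dmesg | grep -i 'out of memory'

# Kernel messages from the current boot, via the journal:
journalctl -k -b0 --grep 'killed process'
```

Neither turned up anything.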

My Discourse config is pretty much the standard config anyone would have (plus kTLS enabled but that shouldn’t be a big deal):

{
  config,
  pkgs,
  lib,
  ...
}:
{
  services.discourse = {
    admin = { ... };
    database.ignorePostgresqlVersion = true;
    enable = true;
    hostname = "forum.the-gdn.net";
    mail = {
      contactEmailAddress = "...";
      incoming = {
        enable = true;
      };
      outgoing = {
        authentication = "login";
        domain = "my-mailserver";
        forceTLS = true;
        port = 465;
        passwordFile = "...";
        serverAddress = "...";
        username = "...";
      };
    };
    secretKeyBaseFile = "...";
  };
  services.nginx.virtualHosts."forum.the-gdn.net" = {
    kTLS = true;
  };
}

Has anyone else experienced this particular problem? I’m using the latest NixOS 24.05 (24.05.4090.224042e9a303 (Uakari)).

There has to be something useful in the logs…

I would hope so, but the deluge of data I get if I just do journalctl -xfe makes it impossible to figure out what’s what, and I have no idea what I’m even looking for to begin with, which makes everything even more complicated. I’m not sure what to try, or what filters to apply (I’m no master Linux sysadmin lol).

Perhaps a text search for “error”?

I… feel like that’d be way too broad. Would I filter based on unit? Or search all services?

Could do based on unit. Also perhaps you’re not aware of the -b0 flag?
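For the record, a few invocations along those lines (a sketch; the unit names are my guesses for a NixOS Discourse setup, and `--grep` needs journalctl built with PCRE2 support, which NixOS has):

```shell
# Current boot only (-b0), single unit:
journalctl -b0 -u discourse.service

# Several units at once, following new entries:
journalctl -b0 -f -u discourse.service -u nginx.service -u postgresql.service

# Only messages at priority "warning" or worse:
journalctl -b0 -p warning

# Full-text search within one unit's logs (all-lowercase
# patterns match case-insensitively):
journalctl -b0 -u nginx.service --grep 'error'
```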

IIRC the journal isn’t helpful at all here; Discourse has its own error log panel in the admin UI. See also Logster and our error logging strategy at Discourse

You may want to check either /var/log or the state directory of Discourse for text files; that’s probably where this log reader reads from.

I can’t authenticate to the admin UI in the first place, so will have to check manually. Will definitely try to see if Discourse outputs any errors in there that are of use. If it doesn’t, I guess I could search the entire boot journal, but hopefully I don’t have to.

I’m hoping I’m not breaching some rule about topic resurrection, but I’m coming back to this issue after quite a break. I just tried again and monitored the journald logs with priority set to 5 (notice) and below. I saw no errors at all (other than Synapse complaining about 403s/404s and such). PostgreSQL isn’t crashing, nor is nginx or Discourse itself. From what I can see, the OOM killer isn’t being invoked at all (I would expect that to be at least alert priority, if not crit/emerg). I am seriously baffled here, as I was before. I mean, I’m not running a supported version of PostgreSQL (I’m on PostgreSQL 16), but I can’t imagine that that change alone would break all my web apps – and only my web apps, not SSH or anything else – and only when I try using Discourse.

When I played around with discourse the other day, I also ran into this. What was actually happening in my case was that the nginx worker process started spinning indefinitely.

This only happened when SSL was enabled, though, so disabling it could be a workaround to at least get to the admin panel.
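For anyone hitting the same thing, this is roughly how I confirmed the worker was spinning (sketched from memory; requires strace, and the PID is whatever `ps` reports):

```shell
# Find nginx processes and check their CPU usage;
# a spinning worker sits near 100% pcpu:
ps -o pid,pcpu,comm -C nginx

# Attach to a suspect worker to see what it's doing;
# a tight loop shows the same syscalls repeating (or none at all):
strace -p <worker-pid>
```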

Did you disable SSL on the Discourse side, or the nginx side? I can’t disable SSL/TLS because of HSTS… But I do wonder why the process spins like that. Is there some way of pulling it out of that loop? I do have kTLS enabled, but in theory that shouldn’t be the problem.

Nginx, via the relevant NixOS options.
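Roughly this, if I remember the options right (an untested sketch; it leaves the virtual host plain-HTTP, so HSTS will get in the way as you said):

```nix
services.nginx.virtualHosts."forum.the-gdn.net" = {
  # Serve plain HTTP only; no redirect to HTTPS, no certificate.
  forceSSL = false;
  addSSL = false;
  enableACME = false;
};
```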

I’ve played around with it, and kTLS was not the issue; the determining factor was whether SSL was enabled or not.

I found one mailing list thread online that seemingly described the same issue, and IIRC it was a bug in nginx. I couldn’t update my nixpkgs at the time to check whether it reproduced on the newest nginx version, which is why I didn’t pursue it further.

I just tried upgrading to the nixpkgs nginx mainline (currently 1.27.2, I believe) and it still happens, so I don’t think that bug has been fixed. I would disable nginx SSL for that subdomain, but I need HTTPS… I have Caddy for WebFinger, so I suppose I could just route it through there and use nginx as just a router of sorts… I know, it’s overly convoluted. Lol

Edit: or I could try, anyway, never configured Discourse through Caddy so…


If this is reproducible on the current NixOS unstable, please create an issue so that the maintainers of the module as well as other users can chime in.

Perhaps we’re holding it wrong or something else is the matter.

When I was debugging this I had a look at the module and it configured nginx quite a bit. Not trivial to replicate using Caddy.

The double routing route (heh) would probably be easier and more sustainable.

I’d double-route it if I knew how, but I don’t know how to configure it through Caddy. I’ll definitely open an issue.
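For the Caddy-in-front approach, something like this might be a starting point (a rough, untested sketch: the internal port is arbitrary, and the Discourse module’s own nginx tweaks stay in place since nginx still serves the vhost):

```nix
{
  # Caddy terminates TLS and proxies to nginx on localhost.
  services.caddy = {
    enable = true;
    virtualHosts."forum.the-gdn.net".extraConfig = ''
      reverse_proxy 127.0.0.1:8080
    '';
  };

  # nginx listens plain-HTTP on an internal port only.
  services.nginx.virtualHosts."forum.the-gdn.net" = {
    listen = [ { addr = "127.0.0.1"; port = 8080; } ];
    forceSSL = false;
  };
}
```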