Can we solve the nvidia situation?

So, infamously, using an nvidia GPU on NixOS is fraught with pitfalls. I’d guesstimate a good 10% of posts here are from someone with an nvidia-related problem.

Part of this is simply because the nvidia driver is the nvidia driver, but we’re not making the situation much better. In typical NixOS fashion, there are lots of options, which is great (arguably…), except that a lot of them are poorly maintained or at least misleading. Some of my favorite pet peeves:

  • hardware.nvidia.prime
    • This entire submodule is practically nonfunctional for DE users as of 25.11; it (silently) does not work on any wayland compositor, which covers both of the major DEs at this point. Whether or not it is needed on them is a different matter: in theory, wayland compositors have auto-detection and native multi-gpu support (though it’s unclear to me how well this is exposed to users, given that all documentation of this is deep inside the source code and no software seems to support an nvidia-offload like feature). That does not change the fact that most of the module doesn’t actually work in practice, making users think they have something configured that they do not.
  • hardware.nvidia.videoAcceleration
    • Defaults to true, even though this is entirely third party to the nvidia driver and requires extensive additional configuration to actually work, so we’re just installing additional third party software that isn’t even functional by default.
  • hardware.nvidia.powerManagement.finegrained
    • Is for some reason (historic misunderstanding of the docs?) a sub-option of hardware.nvidia.powerManagement, even though the two options have nothing whatsoever to do with one another (except for having a similar name); one of the two is experimental and the other is (normally) enabled by default by the driver. As a result, the module by default overrides the sensible default set by the driver, and fails to install the required udev rules even though they are inert if the driver feature is flipped off.
  • boot.kernelPackages
    • Not an nvidia option, but it seems like 90% of newbies stumble upon this option and think “oh, hey, cool, let me set this to _latest” and then get confused when two months down the line their update breaks with a build error involving nvidia. I appreciate the difficulty with supporting third party kernel modules, but we should be clearly projecting to users that they’re on their own if they do this (and that their computer may spontaneously burst into flame; using a third party kernel module against a kernel source it’s not designed for strikes me as quite risky).
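For readers who have not seen these options side by side, here is a minimal sketch of how they appear in a configuration. The values are illustrative rather than recommendations, and `open = true` is an assumption that only fits Turing or newer GPUs.

```nix
{ pkgs, ... }:
{
  # Loads the proprietary nvidia driver (despite the name, also relevant on wayland).
  services.xserver.videoDrivers = [ "nvidia" ];

  # Leaving boot.kernelPackages unset keeps the release's default kernel.
  # The line below is the setting the last bullet warns about:
  # boot.kernelPackages = pkgs.linuxPackages_latest;

  hardware.nvidia = {
    # Assumption: a Turing or newer GPU, which the open kernel modules support.
    open = true;

    # Third-party VA-API driver; per the list above it needs further setup to be useful.
    videoAcceleration = false;

    # Suspend/resume support; unrelated to the experimental `finegrained` sub-option.
    powerManagement.enable = true;
    powerManagement.finegrained = false;
  };
}
```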

This isn’t to put the spotlight on anyone in particular who may or may not have contributed to the module; the driver is an absolute PITA, and without testing and very thorough reading of the docs of both nvidia and the various compositors (as well as source code here and there) it’s bloody hard to figure this stuff out; it also changes quite significantly over time.

But it is problematic that what NixOS sets up by default is quite frankly broken. It paints a picture of very spotty maintenance to me that prime still doesn’t work on wayland in 25.11. This is probably accurate; I don’t think many people have experience with all the use cases for nvidia GPUs, let alone all the different pieces of hardware to test those use cases with. Stepping up to continuously maintain the module isn’t a light responsibility (more on this later…).

It also definitely isn’t only NixOS’ fault; different nvidia hardware setups require very different configuration. Which leads to the other side of the configuration overload coin: people can’t just cargo cult a config and get it right. Different GPU architectures require different options. What works for one GPU will not work for another, and there is no way to see at a glance which GPU a specific configuration applies to. Prime setups confuse things even more.

Still, nvidia has by a wide margin the largest market share in GPUs (~90% today, and they have dominated for the past 10-20 years), and hence we have many users with computers that have nvidia GPUs in them. The relative and absolute numbers are growing, too. I doubt the influx of nvidia problems will slow for at least another 3-5 years, and they’re bound to get quite bad for a while in the near future with the EOL of pre-turing GPUs.


So, what can we do about this?

For one, obviously we always need better maintenance for everything. I’d like to step up, but the situation isn’t great - I can’t step up for the use cases I have experience with (post-Turing desktop use) and try to improve the situation without potentially stepping on CUDA/datacenter users’ toes. I have no way to assert that any changes I’d make to the module don’t break those use cases.

Furthermore, tweaking defaults is unlikely to actually fix anything for the large majority of users who’ve copied a version of the nvidia config from a wiki or reddit post from 4 years ago; even if it did, it’s likely to break configuration for people who did not, but use a pre-turing (and sometimes pre-ampere) nvidia GPU.

So what I’d really like is to just… Throw out the entire thing and start from scratch. In my mind, a good solution would be to introduce a new hardware.nvidia.architecture option, which is set to an enum of nvidia architectures. This would condense all the mess into a simple abstraction that users can actually grok without reading the whole nvidia readme, and allow us to deal with the maintenance in a sane way that actually matches upstream recommendations, as well as simplify the whole legacy driver problem. This would necessitate deprecating all the existing nvidia options, though, since those will remain wrong for a good chunk of users simply because they are already set.
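A rough sketch of what such an option could look like as a NixOS module. Everything here is hypothetical: `hardware.nvidia.architecture` does not exist in nixpkgs, the architecture list is incomplete, and the only derived setting shown is the choice of open kernel modules (which nvidia supports on Turing and newer).

```nix
# Hypothetical module sketch; not part of nixpkgs.
{ config, lib, ... }:
let
  cfg = config.hardware.nvidia;
  # Architectures supported by the open kernel modules (Turing and newer).
  openCapable = [ "turing" "ampere" "ada" "blackwell" ];
in
{
  options.hardware.nvidia.architecture = lib.mkOption {
    # Illustrative, incomplete list of architecture names.
    type = lib.types.nullOr (lib.types.enum ([ "maxwell" "pascal" "volta" ] ++ openCapable));
    default = null;
    example = "ampere";
    description = "GPU architecture; other driver settings are derived from it.";
  };

  config = lib.mkIf (cfg.architecture != null) {
    services.xserver.videoDrivers = [ "nvidia" ];
    # mkDefault so an explicit user setting still wins.
    hardware.nvidia.open = lib.mkDefault (lib.elem cfg.architecture openCapable);
  };
}
```

A user would then only write `hardware.nvidia.architecture = "ampere";`, and a real implementation would also have to pick the matching legacy driver package for older architectures.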

This comes from a position of ignorance around all the non-desktop use cases though. It’s also the type of thing that feels like it should be in the domain of nixos-hardware or similar. It feels like it’d be far too bold of me to sit down and write all of that, then just drop a PR that throws out something used - statistically - by 90% of NixOS users.

So I’m writing this post. What do other people think? Should the hardware.nvidia module just be jettisoned from NixOS entirely, since it’s obviously broken and we can’t really fix the historic issues due to cargo culting, and arguably this module is overstepping the domain of NixOS anyway? Deprecated and replaced in a version or two like what was done e.g. with nixos-rebuild-ng? Left as-is, to generate support requests forever?

Also, is anyone actually actively maintaining the module?

22 Likes

All of the criticism here mainly applies to laptop setups it seems, so saying the module is “broken” feels like the wrong end to start from. Acknowledging it is mainly maintained by desktop and server Nvidia users is probably more productive. As a desktop Nvidia user, I can’t relate to most of the issues laid out; the defaults work mostly fine for graphical Wayland usage thanks to enabling modesetting, GSP and open kernel modules by default nowadays. Of course, the fact that power management is not enabled by default and standby/hibernate won’t work without it is something you still have to figure out yourself as a desktop user (and the wiki / manual should probably be clearer that enabling powerManagement on Wayland is not optional at all if you do want to use standby).

I agree the videoAcceleration option is… of limited use, as it does not work for Chromium at all and even Firefox requires a combination of environment variables / configuration to make it work. Long-term, I see the solution for this in the adoption of the Vulkan video APIs, a standardized API for hardware accelerated video decoding and encoding that is already supported by both Mesa and Nvidia drivers.

Removing the in-tree module in its entirety would not be an option from my perspective. NixOS is known to be very easy to use out of the box for server and desktop Nvidia users, the fact dual-GPU laptop users are currently missing out does not merit a complete removal. That being said, having an enum of presets to choose from which enable useful defaults depending on server, desktop or laptop use sounds like a great idea!

I’d also note that experimenting with a laptop specific module out-of-tree first is an option too, and it’d be fine mentioning such a module in the NixOS wiki if it becomes stable enough, and you can still consider moving it to nixpkgs or nixos-hardware later on. While the nvidia module can’t currently be disabled when nvidia is added to xserver.videoDrivers, it’s implemented in a single module file, so it can be easily disabled through the disabledModules option.
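As a sketch of that approach: the module path below is, to my knowledge, where the in-tree module lives relative to `nixos/modules` (worth verifying against your nixpkgs checkout), and `./my-nvidia.nix` is a hypothetical out-of-tree replacement.

```nix
{
  # Drop the in-tree nvidia module so its options and config no longer apply.
  # Path is relative to nixos/modules in nixpkgs; verify it for your revision.
  disabledModules = [ "hardware/video/nvidia.nix" ];

  # Hypothetical replacement module; it has to declare every
  # hardware.nvidia.* option that the rest of the configuration uses.
  imports = [ ./my-nvidia.nix ];
}
```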

5 Likes

I mean, I didn’t go over everything, and still I’d say 2/4 of these options apply to all setups. Particularly the _latest kernel use is omnipresent IME. Only desktop users without a separate iGPU (should) use the VAAPI module, so there’s even one that only applies to your use case.

A desktop user with only one, post-Turing GPU, who doesn’t take any of the wiki or other advice floating around on the internet, and ignores some of the NixOS option descriptions, is indeed arguably on the happy path. These users are rare IME, not per-se for lack of hardware dominance, but because they’re peeking at the nixos wiki and old forum posts. And still their setup lacks the advertised media acceleration.

We’re completely skipping over everyone else, and even users currently on the happy path will see their configuration suddenly breaking whenever nvidia pull the support plug, which is happening at ever shorter intervals. Maybe “broken” is a little harsh, but the module is in quite a poor state, and we see the results constantly in Help; this is why I’m gravitating towards wholesale removal - it’s precisely not easy to use out of the box.

This is definitely an option, but I’d still like to do such experimentation with a longer-term view towards making the upstream module actually work out of the box, rather than just having yet another third-party “fix”.

I’d also like to emphasise again that this applies to people with wider use cases than just “laptop” - I think the module could be much better for pretty much every use case I’ve helped someone on this forum with, and this does include pure, single-gpu desktop users (like yours truly, I want to get rid of at least the VAAPI code in my repo).

3 Likes

@TLATER

First of all, thanks for your amazing efforts helping many people (including me) with the nightmare that is Nvidia configuration.

Thank you for this very important post! You call this a “rant”, but it’s actually very accurate, and I think you are the most knowledgeable person in NixOS land on this topic.

After your suggestion, I moved away from Nvidia to AMD and that has been an excellent move! :pray:

However, for better or worse, Nvidia does have a lot of deployed hardware, so it really would be great if we could get Nvidia to “just work” on NixOS.

  • Could we form some sort of a small special interest group team?
  • The team probably needs a bunch of people with different Nvidia hardware, and we could attempt to work systematically on getting them working consistently. Then document the shizzle out of it.
  • The Flox guys have been working with Nvidia to some degree, so maybe somebody from Flox could help us to engage Nvidia for some assistance?
4 Likes

Maybe this is a good topic for the upcoming Southern California Linux Expo? We could discuss it. We could even do a bit of hack-a-thon?

The idea here being that if that is set, we just ignore all the cruft and assume the user is interested in the new and sane way of doing things–possibly issuing warnings for users of old stuff?

2 Likes

Yeah, that’s the thought; I wonder if an assertion against using the “old” config options is possible, in an ideal world I’d even say using these together should be an error.
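Such an assertion looks possible with the module system's `assertions` list. The sketch below is only an illustration: `hardware.nvidia.architecture` is the hypothetical option from the proposal (declared here as a stand-in so the fragment evaluates on its own), and `prime.offload.enable` is just one example of an “old” option to guard against.

```nix
{ config, lib, options, ... }:
{
  # Stand-in declaration of the hypothetical option; not part of nixpkgs.
  options.hardware.nvidia.architecture = lib.mkOption {
    type = lib.types.nullOr lib.types.str;
    default = null;
  };

  config.assertions = [
    {
      # Fails evaluation when the new option is used and the legacy option
      # has also been given a definition somewhere in the configuration.
      assertion =
        config.hardware.nvidia.architecture == null
        || !options.hardware.nvidia.prime.offload.enable.isDefined;
      message = ''
        hardware.nvidia.architecture is set, so hardware.nvidia.prime.offload.enable
        must not be set as well; remove the legacy option.
      '';
    }
  ];
}
```

Swapping `assertions` for `warnings` (a list of strings) would give the softer variant that only prints a message during the rebuild.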

2 Likes

One of the first steps to take could be to enhance the already existing nixos-hardware templates, in the common/gpu/nvidia directory of NixOS/nixos-hardware on GitHub.
Making the official wiki page as extensive as it can be would also help plenty of users avoid navigating through random forums with sometimes wrong or outdated information.

This is obviously not a long-term solution but would probably help users faster, and I don’t know enough about Nvidia/compositor to have an informed opinion on what to do with the module. (Just think that with great warning messages to help users, it won’t be disastrous to rebuild the module from scratch.)

2 Likes

I’ve tried, but the wiki bitrots too fast. Any time I fix something it’s overwritten within weeks; I think the problem is precisely that this stuff is too divergent, so people chuck whatever they get to work in there, thinking that what was there is wrong, introducing fun new cargo cultey stuff along the way.

That is indeed a good thought, the per-arch options didn’t exist when I last looked, and I think the gpu module was slated for removal. Nice to see that didn’t happen :slight_smile:

2 Likes

FWIW, if you have the energy, you could consider making a different wiki page called “NVIDIA new solution” or something. I imagine you could maintain a link from the main wiki page - hopefully nobody would be rude enough to remove the link.

1 Like

As someone on Xorg without a DE, this does work however (as much as one can expect from nvidia). I’m not sure whether the blame for nvidia prime falls on NixOS, Nvidia, or Wayland + Nvidia. But nonfunctional seems a bit too harsh. It can obviously still be improved.

It’s not like Xorg is EOL, just feature frozen; I’m assuming we have a lot of users on Xorg.

3 Likes

Thanks for some renewed interest on the thread!

Please, can we discuss how to make things improve :pray:

Everyone’s mileage seems to vary a little, but given Nvidia is possibly the “most valuable company ever”, there’s definitely room to make the devices “just work”

… I’m probably guilty of trying to fix the wiki, based on what worked for me. Maybe that’s part of the challenge? Do we need to break this down by different [features][categories] or other dimensions?

I’m really happy you opened this discussion because I agree the Nvidia situation can be improved a lot.

This is false mainly because a lot of users are still on X11, myself included. You later state that it does not work on Wayland compositors, which is true. However, I wouldn’t start off saying it is non-functional.

I’m torn on this, but I agree that in most cases it requires additional configuration that is not mentioned anywhere in Nix.

I agree this should come with a warning of sorts, or to have it better written in the Wiki. Which leads me to my points:

  • The Linux userspace currently is a bit conflicted mainly due to the Wayland migration of the major DEs, but a lot of software still isn’t wayland-compatible or not fully ready to be tested on Wayland. This results in other major issues down the line.
  • I’m using X11 mainly because of my setup - I’m using an eGPU and it’s just pain to get Wayland working properly without tools like all-ways-egpu which I can’t make work on NixOS at the moment (due to mainly myself and my non-complete understanding of how to migrate it to a Nix module).
  • A lot of users still use X11 due to their own reasons and shouldn’t be forced to move to Wayland.
  • X11 is still active, just not actively maintained.
  • We can set an error message that will stop a (re)build if the kernel version is too new and the Nvidia driver hasn’t been tested yet, or there are multiple issues reported with it on the internet from other distributions.
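The last point could be sketched as an assertion on the kernel version. The cutoff `newestTestedKernel` below is a made-up placeholder; in practice it would have to be maintained alongside each driver version.

```nix
{ config, lib, ... }:
let
  kernelVersion = config.boot.kernelPackages.kernel.version;
  # Hypothetical value: newest kernel series the packaged driver was tested with.
  newestTestedKernel = "6.12";
  usesNvidia = lib.elem "nvidia" config.services.xserver.videoDrivers;
in
{
  assertions = [
    {
      # Passes when nvidia is unused, or when the kernel's major.minor
      # version is not newer than the tested series.
      assertion =
        !usesNvidia
        || lib.versionAtLeast newestTestedKernel (lib.versions.majorMinor kernelVersion);
      message = ''
        Kernel ${kernelVersion} is newer than ${newestTestedKernel}, the newest
        series the nvidia driver has been tested against.
      '';
    }
  ];
}
```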

I propose we update the Wiki as best we can. There should be sections for X11 and for Wayland so that most options can be distinguished properly and hardware.nvidia.prime can be seen as X11-only.

And the options themselves can be modified, renamed, restructured in a way that is clear what they do:

  • X11-specific configuration options can be moved to hardware.nvidia.x11 instead of prime. Then prime can be an option of x11.
  • hardware.nvidia.videoAcceleration should come with a warning OR set the necessary environment variables, of which there aren’t many with the latest driver versions.
  • hardware.nvidia.powerManagement.finegrained can be moved to the new x11 option and come with a warning, since it is still experimental.

This will essentially happen partly because of the module restructuring. In my proposal the options will simply be rearranged a bit and some backwards compatibility will be left, which I see as a win.

I don’t think so. Nvidia is a PITA on Linux and will always be such, and the current module is ok-ish, but definitely needs refreshing and restructuring. It is indeed very complex, but I don’t think we should just yeet it out of existence.

Also the current way of installing GPU drivers is by adding them to the services.xserver.videoDrivers list which IMO should be changed as well nowadays especially with Wayland being the main focus of most development.

Everything can be done for 26.05 or 26.11 and be introduced as breaking changes. It all depends on the effort we can put into it and what the maintainers actually allow.

I like this idea! And I see it as the way to fix the Nvidia situation on NixOS in the long run.

My main issue with simple rearrangement is that there is sufficient cargo culting going on that this will not fix anything. People will still copy broken configs, and continue to be annoyed by them, with no signal from the module system that they’re causing their own problems.

As for the X11/wayland thing, yes, people still use X11. The “default” config you end up with when using the GUI installer (or following the manual top-to-bottom), however, is gnome, which is wayland, and therefore broken.

While I appreciate that people will stay stuck on X11 for a variety of reasons, the Linux desktop ecosystem is moving along, and we should at least be keeping pace.

I’m not suggesting the X11 config submodule should be removed or anything, I merely think it rather unhelpful that the majority of desktop users are enabling an X11 config option that is literally nonfunctional on their systems.

It’s like telling you to add services.xserver.videoDrivers = [ "amdgpu" ]; to your nvidia laptop, and calling the option hardware.gpu.enable or something. Sounds far-fetched, but it’s a really close comparison; things will still work, because nouveau gets auto-loaded, but it’s far from the intent and you end up with an unused amdgpu kernel module and miscellaneous, hopefully inert configuration. Nonfunctional is maybe a bit exaggerated, but I don’t think it’s far from true.

This is in part on the users themselves, no? NixOS isn’t a beginner distribution and requires using at least a few brain cells to install and use. I’m also kind of frustrated by the outrage of some users over the simplest problems, but most of them have the error right there and just refuse to read a bit more. We can always add warnings or incompatibilities, but this will only hurt the overall experience if Wayland users enable X11-related options that don’t harm them in any way, and this is in part why I proposed the simplest rearrangement and re-categorization of options.

Can you explain what precisely gets broken? The Nvidia drivers are enabled when adding them to xserver.videoDrivers and most of the hardware.nvidia options can be skipped because the default values are sufficient enough. The powerManagement.finegrained option is indeed in the wrong place, but it isn’t enabled by default and therefore not harming anyone.

I agree, but I believe this is already being handled as explained by my previous point. NixOS is Wayland-ready and has been for a while.

I understand what you mean here and this is, again, on the user. No distribution, other than maybe Ubuntu, will set up Nvidia by default properly, or even at all. This is a general Linux issue, not a NixOS-scoped one.

And in those laptop cases if you want to use the dGPU for something, explicitly launch it on the dGPU with the supplied environment variables or runner scripts - the option hardware.nvidia.prime.offload.enableOffloadCmd works for that, but then we get to talking about the options arrangement again :sweat_smile: I believe Lutris and Steam’s Proton work out of the box without an issue while on Linux-native games the environment variables need to be passed (this is from personal experience).
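For reference, a minimal sketch of the offload setup being described. The bus IDs are placeholders that must be replaced with the values for the actual machine (for example from `lspci`), and `open = true` assumes a Turing or newer dGPU.

```nix
{
  services.xserver.videoDrivers = [ "nvidia" ];

  hardware.nvidia = {
    # Assumption: Turing or newer dGPU.
    open = true;

    prime = {
      offload.enable = true;
      # Installs the `nvidia-offload` wrapper script mentioned above.
      offload.enableOffloadCmd = true;

      # Placeholder bus IDs; look up the real ones for your hardware.
      intelBusId = "PCI:0:2:0";
      nvidiaBusId = "PCI:1:0:0";
    };
  };
}
```

With this in place, a program is launched on the dGPU by prefixing it with the wrapper, e.g. `nvidia-offload some-game`, where `some-game` stands for whatever binary you want to run.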

All in all, as I said before, we can definitely write a better Wiki page for Nvidia, but users also need to start reading more instead of blindly following some random guy on YouTube who thinks he knows all of Linux and provides broken configurations with zero explanation and no support whatsoever.

1 Like

It shouldn’t require 10 years of nvidia history knowledge just to configure a graphics card.

Other distros’ unwillingness to touch the closed-source drivers or the nvidia mess (which is actually fair), or relative incompetence, has nothing to do with our decision-making.

5 Likes

To add to @waffle8946’s point; you seem to be arguing for the hardware.nvidia module to be completely removed from NixOS because we shouldn’t be providing a module to configure the closed-source driver, since other distros also consider it too complex.

This is a fair opinion, it’s also one of my suggestions. It doesn’t really solve the problem either, though, most likely that’d mean moving the module to a third-party repo where there would still be a need for better UX.

It would make iterating on module UX easier, though, given that we wouldn’t be tied to NixOS release stability, and that commit bits to such a third repo would have different trust implications. A third repo is definitely the road I’ll try to go down at least initially, once I get done enough with my other pet project and have some time on my hands.


I also have to say, I really dislike the statement that we shouldn’t write user-friendly modules just because NixOS already has a reputation of being hard to use. The very concept of this seems brutally toxic to me. There are plenty of modules where I benefit from others putting in the hard work of making things user-friendly (e.g. boot.initrd, I’m blissfully unaware of the majority of what that module does). Why shouldn’t I provide the same service for others, and ultimately make the OS more easily usable for more people?

I believe that’s not what you’re trying to say, to be fair. I agree people shouldn’t be copying from random youtubers, but wikis aren’t a much better source IMO - the lack of peer review makes them inherently volatile, so I don’t think they’re much more trustworthy than a random youtuber. Things done by random youtubers/bloggers/etc. will usually end up dominant on wikis because their viewerbase will keep copying what they see elsewhere into the wiki.

Rather than continuously spending the effort to clean up inadvertent vandalism, why not fix the API and its first-party documentation so that people don’t need to rely on third-party sources? Once we’ve done that, sure, I’m happy blaming someone who uses kepler when their GPU is turing, but I think we can accept some blame for the current situation.

That’s true. And in most cases nowadays it’s not required because it just works™. However, you can’t stop a person from copying stuff blindly without understanding it and creating even more problems.

I agree the Nvidia situation isn’t great, but it’s definitely better than it was 10 years ago.

That’s fair. And even now NixOS does a lot of the heavy lifting when dealing with Nvidia. And I believe this is part of the reason we’re having this discussion now - so that we aren’t like the other distros.

Oh, not at all. I’m sorry it has come out this way, this was never my intention. All I said was that I think we should restructure it and make it work better for both modern and “legacy” (X11) setups while also keeping the datacenter users happy.

I’ve never ever meant to sound this way. I’m all in for more user friendliness. And that’s why I proposed to have better documentation as well as maybe introduce warnings or errors to some of the options - so that users don’t shoot themselves in the foot by mistake.

As a person who was on Gentoo for many many years, and Arch before that, I disagree. A well-maintained Wiki is 1000 times better than anything other than well documented code.

And that’s why the edit history exists and people can be notified if they did something wrong on the Wiki so it’s not done consecutively.

There are many moving parts to a working system and it’s on us to do our best to keep it that way.

1 Like

You’re not the only one who’s used other distros :wink: I’m pretty sure a number of people in this thread have been around since before Arch existed as a distro, and we’ve all been distro hoppers at one point.

IMO NixOS is a very different beast from those two distros; neither of them provide infrastructure-as-code. Packages can’t provide anywhere near the same level of documentation, API or configuration surface a NixOS module can.

Ideally, the NixOS module docs and manual should make wikis completely superfluous; the only reason they don’t is that we don’t have the manpower working on NixOS module docs that Arch/Gentoo have for maintaining their wikis. We also don’t have that manpower available for wiki maintenance; hence the NixOS wiki’s sorry state.

Either way, I don’t think the fact that we could be writing docs implies that we shouldn’t also be writing better API surfaces. We seem to ultimately agree:

… so I don’t really see what you’re arguing against here. That’s precisely what I’m proposing.

1 Like

I thought you were implying I said somewhere that we shouldn’t write user-friendly modules, please excuse me if that wasn’t towards me but just a general statement :sweat_smile:

Overall I’m glad we’ve come to the conclusion that we agree with each other. Now we should find a way to reach a common starting ground and hopefully speak with the maintainers of Nvidia.