Troubles with `nixpkgs.config.cudaSupport` (OOM)

nixup · August 14, 2024, 9:26pm

Hello,

I recently bought a GTX 1060 to get some CUDA acceleration for Ollama.
I read that services.ollama.acceleration gets set to "cuda" if nixpkgs.config.cudaSupport is set to true, so I tried setting that in the hope of getting some system-wide benefits.

I have also added the cuda-maintainers cache via cachix, but it did not seem to speed up the build.
After some hours building (~ 10k/14k jobs done?) it OOMd.

I find that weird since I have ~31 GB of RAM and 32 GB of swap.
Right now I am building it once again but I suspect it will fail.

How can I prevent this from happening?

mightyiam · August 16, 2024, 6:20am

What OOMd exactly, please?

nixup · August 20, 2024, 11:09am

I checked the systemd-oomd logs and it did not kill anything…

I have set services.ollama.acceleration to "cuda" and something compiled.
I think the acceleration works now.
I have decided to try and enable nixpkgs.config.cudaSupport.
Here’s what happens:

# nixos-rebuild test 
building Nix...
building the system configuration...
trace: warning: cudaPackages.autoAddDriverRunpath is deprecated, use pkgs.autoAddDriverRunpath instead
trace: warning: cudaPackages.autoAddDriverRunpath is deprecated, use pkgs.autoAddDriverRunpath instead
trace: warning: cudaPackages.autoFixElfFiles is deprecated, use pkgs.autoFixElfFiles instead
trace: warning: cudaPackages.autoAddOpenGLRunpathHook is deprecated, use pkgs.autoAddDriverRunpathHook instead
activating the configuration...
setting up /etc...
reloading user units for egycobra...
restarting sysinit-reactivation.target
the following new units were started: dev-disk-by\x2duuid-<UUID>.device, srv-nfs-outer.mount

If I disable cudaSupport the warnings disappear.

Also I have possibly encountered another problem:

# nixos-option  nixpkgs.config.cudaSupport
error: error: At 'cudaSupport' in path 'nixpkgs.config.cudaSupport': error: Attribute not found
An error occurred while looking for attribute names. Are you sure that 'nixpkgs.config.cudaSupport' exists?

When I set nixpkgs.config.cudaSupport = true, services.ollama.acceleration does not seem to detect it:

# nixos-option services.ollama.acceleration
Value:
null

Default:
null

Type:
"null or one of false, \"rocm\", \"cuda\""

Example:
"rocm"

Description:
''
  What interface to use for hardware acceleration.

  - `null`: default behavior
    if `nixpkgs.config.rocmSupport` is enabled, uses `"rocm"`
    if `nixpkgs.config.cudaSupport` is enabled, uses `"cuda"`
    otherwise defaults to `false`
  - `false`: disable GPU, only use CPU
  - `"rocm"`: supported by most modern AMD GPUs
  - `"cuda"`: supported by most modern NVIDIA GPUs
''

Declared by:
[ "/nix/var/nix/profiles/per-user/root/channels/nixos/nixos/modules/services/misc/ollama.nix" ]

Defined by:
[ "/nix/var/nix/profiles/per-user/root/channels/nixos/nixos/modules/services/misc/ollama.nix" ]

eljamm · August 20, 2024, 1:11pm

You need to build and switch to your system once with the cache enabled before it takes any effect, so you should only enable CUDA for packages after you switch.

To do this, you can use nixos-rebuild test, which will switch the system but won’t add an entry to your bootloader menu.

However, even if the CUDA cache is enabled (which definitely helps), you’d still need to compile the packages that aren’t cached.
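
A minimal sketch of enabling the cache declaratively, assuming it is configured through nix.settings rather than through the cachix CLI. The public key below is deliberately a placeholder, not a real key; copy the actual one from the cache's page.

```nix
# configuration.nix fragment (sketch)
{
  nix.settings = {
    # Extra binary cache that serves prebuilt CUDA-enabled packages.
    substituters = [ "https://cuda-maintainers.cachix.org" ];
    # Placeholder, NOT a real key: replace it with the key published for the cache.
    trusted-public-keys = [ "cuda-maintainers.cachix.org-1:<public-key>" ];
  };
}
```

Applying this with nixos-rebuild test first, and only then turning on the CUDA options, lets the build that needs the cache actually use it.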

This is only the case if services.ollama.acceleration is null (which is the default). If you set it to "cuda", then that should be enough to enable CUDA for ollama.
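
As a sketch, enabling CUDA for ollama alone would look roughly like this, leaving nixpkgs.config.cudaSupport untouched:

```nix
# configuration.nix fragment (sketch)
{
  services.ollama = {
    enable = true;
    # Request CUDA for Ollama only; nothing else on the system is rebuilt.
    acceleration = "cuda";
  };
}
```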

With this enabled globally, your system will try to compile every package that has CUDA support, which can be resource-intensive and may be what caused your system to run out of memory.

What I recommend is to just set this on a per-package basis. For example, if ollama didn’t have the acceleration attribute, you’d override the package with:

  services.ollama = {
    enable = true;
    package = pkgs.ollama.override { config.cudaSupport = true; };
  };

In general, you need to check how CUDA is enabled in the derivation. Sometimes there is a cudaSupport attribute, other times it’s config.cudaSupport or you might even find an option like services.ollama.acceleration that makes this much easier.
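
A sketch of overriding a narrow package argument, assuming the ollama derivation in your nixpkgs revision accepts an acceleration argument (check its package.nix, since this is not guaranteed across revisions):

```nix
# configuration.nix fragment (sketch)
{ pkgs, ... }:
{
  environment.systemPackages = [
    # Only this one package is built with CUDA; everything else keeps
    # using the default builds.
    (pkgs.ollama.override { acceleration = "cuda"; })
  ];
}
```

Note that override replaces each named argument as a whole rather than merging it, so overriding a single flag is usually safer than passing an entire config set.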

nixup:

trace: warning: cudaPackages.autoAddDriverRunpath is deprecated, use pkgs.autoAddDriverRunpath instead
trace: warning: cudaPackages.autoAddDriverRunpath is deprecated, use pkgs.autoAddDriverRunpath instead
trace: warning: cudaPackages.autoFixElfFiles is deprecated, use pkgs.autoFixElfFiles instead
trace: warning: cudaPackages.autoAddOpenGLRunpathHook is deprecated, use pkgs.autoAddDriverRunpathHook instead

You can safely ignore these warnings. It’s something that was recently addressed in "cudaPackages: drop outdated aliases" by srhb (Pull Request #331017, NixOS/nixpkgs).

Here, the value for services.ollama.acceleration is null and nixpkgs.config.cudaSupport is false, so CUDA support is not enabled.

nixup · August 21, 2024, 10:37am

If I have a GPU that supports CUDA, I want to use it, so I would rather not do it on a per-package basis.

You can safely ignore these warnings.

I see. Thanks!

I wrote false instead of true. What you are saying is correct but it does not reflect my state.
I’ll edit the original post to fix this error.

eljamm · August 21, 2024, 12:06pm

That’s absolutely a fair point.

I see what you mean now, and indeed you’re correct that services.ollama.acceleration does not detect it, but it’s not supposed to.

If you check the ollama package, you’ll see that CUDA is enabled if one of these conditions is true:

acceleration == "cuda"
acceleration == null && nixpkgs.config.cudaSupport == true
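
The two conditions can be written as a small standalone predicate. This is only a sketch: enableCuda here is a hypothetical helper mirroring the logic described above, not the code from the package itself.

```nix
# cuda-check.nix (sketch); evaluate with:
#   nix-instantiate --eval --strict cuda-check.nix
let
  # Hypothetical helper: true if CUDA is requested explicitly, or if
  # acceleration is left at null and the global flag is on.
  enableCuda = { acceleration, cudaSupport }:
    acceleration == "cuda" || (acceleration == null && cudaSupport);
in
{
  explicit = enableCuda { acceleration = "cuda"; cudaSupport = false; };
  viaGlobal = enableCuda { acceleration = null; cudaSupport = true; };
  disabled = enableCuda { acceleration = false; cudaSupport = true; };
  neither = enableCuda { acceleration = null; cudaSupport = false; };
}
```

Only explicit and viaGlobal evaluate to true; setting acceleration to false wins over the global flag.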

We can verify this with the following config:

  services.ollama.enable = true;
  nixpkgs.config.cudaSupport = true;

We can then use the nix repl to check the attributes:

nix-repl> :lf .
Added 15 variables.

nix-repl> nixosConfigurations.nixos.config.services.ollama.acceleration
null

nix-repl> nixosConfigurations.nixos.config.nixpkgs.config
{
  cudaSupport = true;
}

nix-repl> nixosConfigurations.nixos.pkgs.ollama.nativeBuildInputs
[
  «derivation /nix/store/bbh04pw5g9w9vyadjhi43qkz8j72y1b6-go-1.22.5.drv»
  «derivation /nix/store/asg2ccwnfxfi9xy43ls3kbniacv97lkl-cmake-3.29.6.drv»
  «derivation /nix/store/vbb1b0rvyancwsl8pdxp2y7qa0ywr1w9-cuda_nvcc-12.2.140.drv»
  «derivation /nix/store/3c58cpcn5xi5i2i72xv73w77s5m58b92-make-shell-wrapper-hook.drv»
  «derivation /nix/store/7d98sajiqfv7lqqhww527xxv3r71g2c4-auto-add-driver-runpath-hook.drv»
]

If CUDA were not enabled, enableCuda would have been false and cuda_nvcc wouldn’t have been added to the nativeBuildInputs, but we can see that it’s working here.

nixup · August 21, 2024, 1:38pm

I had not heard of nix repl before, so thanks for introducing it to me.

I see that you are loading a flake there and I do not use flakes.
How should I load the variables?

I have managed to get these variables with nixos-option:

# nixos-option nixpkgs.config
Value:
{
  allowUnfree = true;
  cudaSupport = true;
}
# nixos-option services.ollama.acceleration
Value:
null

Default:
null

Type:
"null or one of false, \"rocm\", \"cuda\""

Example:
"rocm"

Description:
''
  What interface to use for hardware acceleration.

  - `null`: default behavior
    if `nixpkgs.config.rocmSupport` is enabled, uses `"rocm"`
    if `nixpkgs.config.cudaSupport` is enabled, uses `"cuda"`
    otherwise defaults to `false`
  - `false`: disable GPU, only use CPU
  - `"rocm"`: supported by most modern AMD GPUs
  - `"cuda"`: supported by most modern NVIDIA GPUs
''

I am not that good at reading nix and I appreciate nixos-option for interpreting the code that you linked.

That’s obvious and it works.

This made me rethink what I thought before.
If these conditions are met, ollama enables CUDA but leaves acceleration = null, right?

eljamm · August 21, 2024, 2:17pm

This seems useful, but I don’t think it works with flakes.

I think you can just use :l . without the f in the repl or just run nix repl -f ~/nixos-config from the commandline. Afterwards, you can just hit tab and it will show you the available attributes you can access.

Alternatively, there are useful tools that allow you to visualize options in a TUI, like nix-inspect, which also allows you to set bookmarks for frequent paths (like nixosConfigurations.nixos.config in my case).

Indeed. It just uses acceleration to check what it needs to enable, but it doesn’t change it. It’s up to the user to do that.

Having an option like this is mainly useful for people who just want to enable CUDA for ollama without globally enabling cudaSupport and without having to override the package attributes (pkgs.ollama.override { ... };).

nixup · August 21, 2024, 2:36pm

From playing around with it I notice that I cannot see more than with nixos-option.

I would like to do it but I don’t know how to load any of my config files since they begin with { ... }:. Here’s what happens:

nix-repl> :l .
error: opening file '/etc/nixos/default.nix': No such file or directory

nix-repl> :l configuration.nix
error:
       … from call site

         at «none»:0: (source not available)

       error: function 'anonymous lambda' called without required argument 'config'

       at /etc/nixos/configuration.nix:1:1:

            1| { config, lib, pkgs, ... }:
             | ^
            2|

eljamm · August 21, 2024, 3:02pm

Yeah, that would definitely be useful if more details are added, but it’s great if you just want to quickly check values.

I’m not certain how to do this for non-flake setups, but you can try using:

nix-repl> :l <nixpkgs/nixos>

Which should load the config for your system nixpkgs.
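
For a non-flake setup, the same checks can be sketched as a plain Nix expression. This assumes <nixpkgs/nixos> and <nixos-config> resolve on your NIX_PATH (run it as the user that runs nixos-rebuild); the file name is arbitrary. The set returned by <nixpkgs/nixos> includes config, options and pkgs, which are also the names to explore in the repl after :l.

```nix
# check-cuda.nix (sketch); evaluate with:
#   nix-instantiate --eval --strict check-cuda.nix
let
  # Evaluates the system configuration found via NIXOS_CONFIG or <nixos-config>.
  nixos = import <nixpkgs/nixos> { };
in
{
  acceleration = nixos.config.services.ollama.acceleration;
  cudaSupport = nixos.pkgs.config.cudaSupport or false;
  # Names of ollama's build-time tools; look for cuda_nvcc in this list.
  ollamaNativeBuildInputs =
    map (d: d.name or "<non-derivation>") nixos.pkgs.ollama.nativeBuildInputs;
}
```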

nixup · August 21, 2024, 3:47pm

This added 6 variables but none of them helped.
However you @eljamm helped very much, so thank you once again.
I am glad that all works as it should!

nixup · August 27, 2024, 3:11pm

I am afraid that the issue still persists.
The nixos-upgrade service was unsuccessful (its return value indicated that it was OOM-killed).
Now I have started the upgrade manually and it is not going well.
Right now it is frozen at [ 55%] Building CXX object modules/wechat_qrcode/CMakeFiles/opencv_wechat_qrcode.dir/src/zxing/common/decoder_result.cpp.o and htop shows 30.3/30.7G RAM and 32/32G swap in use.
Is there a way to limit the worker count? I think it could help.

This was printed to the terminal:

FAILED: CMakeFiles/magma.dir/magmablas/zgetf2_kernels_var.cu.o 
/nix/store/fa8aq5yhk7hnd8bhg2r564iz3l41hs7q-cuda_nvcc-12.2.140/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/nix/store/c6wk0nxbrwb5hamgxlfqsgi37gcn9752-gcc-wrapper-12.3.0/bin/c++  -I/build/magma-2.7.2/build/include -I/build/magma-2.7.2/include -I/build/magma-2.7.2/control -I/build/magma-2.7.2/magmablas -I/build/magma-2.7.2/sparse/include -I/build/magma-2.7.2/sparse/control -I/build/magma-2.7.2/testing -isystem /nix/store/17yql84kfrzd3pxsw6agj5dk2gqrzxlf-cuda_nvcc-12.2.140-dev/include -O3 -DNDEBUG -std=c++17 "--generate-code=arch=compute_60,code=[compute_60,sm_60]" "--generate-code=arch=compute_61,code=[compute_61,sm_61]" "--generate-code=arch=compute_70,code=[compute_70,sm_70]" "--generate-code=arch=compute_75,code=[compute_75,sm_75]" "--generate-code=arch=compute_80,code=[compute_80,sm_80]" "--generate-code=arch=compute_86,code=[compute_86,sm_86]" "--generate-code=arch=compute_89,code=[compute_89,sm_89]" "--generate-code=arch=compute_90,code=[compute_90,sm_90]" "--generate-code=arch=compute_90a,code=[compute_90a,sm_90a]" --compiler-options -fPIC,-DADD_ -MD -MT CMakeFiles/magma.dir/magmablas/zgetf2_kernels_var.cu.o -MF CMakeFiles/magma.dir/magmablas/zgetf2_kernels_var.cu.o.d -x cu -c /build/magma-2.7.2/magmablas/zgetf2_kernels_var.cu -o CMakeFiles/magma.dir/magmablas/zgetf2_kernels_var.cu.o
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
nvcc error   : 'cicc' died due to signal 11 (Invalid memory reference)
nvcc error   : 'cicc' core dumped

SergeK · August 27, 2024, 6:18pm

Try building with -j1 and maybe limit --cores (cf Nix build ate my RAM 😭 - #5 by SergeK for more ugly hacks)
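
The failing nvcc invocation in the log generates code for nine GPU architectures (compute_60 through compute_90a). A hedged sketch of narrowing that list, assuming the nixpkgs.config.cudaCapabilities option is available in your nixpkgs revision; 6.1 is the compute capability of Pascal cards such as the GTX 1060:

```nix
# configuration.nix fragment (sketch)
{
  nixpkgs.config = {
    allowUnfree = true;
    cudaSupport = true;
    # Build device code only for compute capability 6.1 instead of
    # every architecture nixpkgs targets by default.
    cudaCapabilities = [ "6.1" ];
  };
}
```

This changes the derivation hashes, so the affected packages will most likely no longer be served from a binary cache and will be built locally, just with less work per CUDA source file.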

nixup · August 27, 2024, 6:44pm

[131487.396239] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-5.scope,task=cicc,pid=173563,uid=30017
[131487.396251] Out of memory: Killed process 173563 (cicc) total-vm:4671624kB, anon-rss:410732kB, file-rss:372kB, shmem-rss:0kB, UID:30017 pgtables:9160kB oom_score_adj:0

nixup · August 27, 2024, 8:18pm

I have tried setting nix.settings.max-jobs = 1; but I still have all 12 cores running cicc.
Is there a way to limit that?

SergeK · August 27, 2024, 9:56pm

Yes, --cores (twenty characters for discourse)

waffle8946 · August 27, 2024, 10:56pm

Tuning Cores and Jobs - Nix Reference Manual for context
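
A minimal sketch of setting both limits declaratively; the value 4 is only an illustration:

```nix
# configuration.nix fragment (sketch)
{
  nix.settings = {
    # At most one derivation is built at a time (same as -j1 / --max-jobs 1).
    max-jobs = 1;
    # Value exported to each build as NIX_BUILD_CORES (same as --cores 4).
    cores = 4;
  };
}
```

Peak parallelism is roughly max-jobs times cores, and the cores value only has an effect in builds that honour NIX_BUILD_CORES.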

nixup · August 28, 2024, 11:29am

I can see that it limits the cores used for

building Nix...
building the system configuration...
these 42 derivations will be built:
...

but it has no effect on cicc or gcc, which still take up all 12 cores.
Also, the OOM killer killed cicc, so I think it won’t help.

nixup · August 28, 2024, 12:23pm

For now I have decided to scale down my config until I get nixpkgs.config.cudaSupport = true; to work.
I have also disabled limiting max-jobs and cores.

Paperless

The first compilation happened with paperless-ngx, and RAM usage stayed around 5G (under it most of the time).

It froze at

[ 97%] Linking CXX executable ../../bin/opencv_test_stitching
[ 97%] Built target opencv_test_stitching
[ 97%] Linking CXX shared library ../../lib/libopencv_cudaobjdetect.so
[ 97%] Built target opencv_cudaobjdetect
[ 97%] Building CXX object modules/cudaobjdetect/CMakeFiles/opencv_test_cudaobjdetect.dir/test/test_objdetect.cpp.o
[ 97%] Building CXX object modules/cudaobjdetect/CMakeFiles/opencv_test_cudaobjdetect.dir/test/test_main.cpp.o
[ 97%] Linking CXX executable ../../bin/opencv_test_cudaobjdetect
[ 97%] Built target opencv_test_cudaobjdetect

I got no CPU load and no disk RW for a while and then one cicc maxed out one core.

Then it finished building fine.

Ollama

No compilation; it just gets pulled in from the cuda-maintainers cache, I guess.

open-webui

Here’s the trouble.

15:27 - nixos-rebuild start time
15:27 [1/3430] - things start, 8 GB RAM & 600 MB swap used
15:30 [298/3430] - 12 GB RAM used, swap still at 600 MB
15:35 [359/3430] - 29.2 GB RAM & 9.5 GB swap used
15:45 ? - 25.2 GB RAM & 23.2 GB swap used
16:10 ? - 23.5 GB RAM & 1.9 GB swap used
17:53 [7,331/11,625] - 28.8 GB RAM & 18.4 GB swap used

Next day it was already built.
nixos-rebuild test showed errors about failing to connect to dbus but rebooting fixed this.

Summary / solution

Turning on nixpkgs.config.cudaSupport led to OOM errors during nixos-rebuild.
I managed to get a successful rebuild by turning on the components of my config one by one, so that their builds don’t run in parallel.

Thank you to all the people that engaged with this post.

waffle8946 · August 28, 2024, 12:37pm

If you’re building nix expressions and a build is not respecting the cores setting, then the build probably needs to be adjusted to respect NIX_BUILD_CORES.
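
To see what value a build actually receives, a throwaway derivation can record it. This is a sketch; the file and derivation names are made up for illustration.

```nix
# cores-demo.nix (sketch); build with:
#   nix-build --cores 4 cores-demo.nix && cat result
{ pkgs ? import <nixpkgs> { } }:
# Hypothetical demo derivation: it only writes down the value Nix hands to
# the builder, which well-behaved builds pass on to make or ninja as -j.
pkgs.runCommand "cores-demo" { } ''
  echo "NIX_BUILD_CORES=$NIX_BUILD_CORES" > "$out"
''
```

The core count is not part of the derivation hash, so a second build with a different --cores value reuses the stored result instead of rebuilding.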