Immich and CUDA-accelerated machine learning

Hi,

I’m trying to set up CUDA-acceleration for the machine learning service of Immich.

This is running on a computer with an Nvidia T500, which should be supported by Immich.

GPU is set up like this (which is more or less copy-pasted from the wiki):

{ config, lib, ... }:
{
  nixpkgs.config.allowUnfreePredicate = pkg:
    builtins.elem (lib.getName pkg) [ "nvidia-x11" "nvidia-persistenced" ];

  # Enable OpenGL
  hardware.graphics = {
    enable = true;
  };
  services.xserver.videoDrivers = [ "nvidia" ];

  hardware.nvidia = {
    nvidiaPersistenced = false;
    # Modesetting is required.
    modesetting.enable = true;

    # Nvidia power management. Experimental, and can cause sleep/suspend to fail.
    # Enable this if you have graphical corruption issues or application crashes after waking
    # up from sleep. This fixes it by saving the entire VRAM memory to /tmp/ instead 
    # of just the bare essentials.
    powerManagement.enable = false;

    # Fine-grained power management. Turns off GPU when not in use.
    # Experimental and only works on modern Nvidia GPUs (Turing or newer).
    powerManagement.finegrained = false;

    # Use the NVidia open source kernel module (not to be confused with the
    # independent third-party "nouveau" open source driver).
    # Support is limited to the Turing and later architectures. Full list of 
    # supported GPUs is at: 
    # https://github.com/NVIDIA/open-gpu-kernel-modules#compatible-gpus 
    # Only available from driver 515.43.04+
    # Currently alpha-quality/buggy, so false is currently the recommended setting.
    open = false;

    # Enable the Nvidia settings menu,
    # accessible via `nvidia-settings`.
    nvidiaSettings = false;

    # Optionally, you may need to select the appropriate driver version for your specific GPU.
    package = config.boot.kernelPackages.nvidiaPackages.stable;

    prime = {
      offload = {
        enable = true;
        enableOffloadCmd = true;
      };
      # Bus IDs as reported by `lspci` (note: lspci prints them in hex,
      # but these values must be decimal).
      intelBusId = "PCI:0:2:0";
      nvidiaBusId = "PCI:1:0:0";
    };
  };

  hardware.nvidia-container-toolkit.enable = true;
}
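
Before pointing Immich at the GPU, it may help to confirm the driver stack works on its own. A minimal sketch for getting monitoring tools onto the system (assuming a recent nixpkgs where nvtop lives under `nvtopPackages`; older channels expose it as `pkgs.nvtop`, and `nvidia-smi` ships with the driver itself):

```nix
{
  # Optional: tools to verify the driver before involving Immich.
  environment.systemPackages = [
    pkgs.nvtopPackages.full
  ];
}
```

If `nvidia-smi` lists the T500, the driver side is working and any remaining problem is on the Immich/onnxruntime side.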

Immich is configured like this:

services.immich = {
  enable = true;
  openFirewall = true;
  host = "0.0.0.0";
};

My problem is that the GPU doesn’t seem to be used. I test this by uploading a new image to Immich and watching nvtop. I expect the “smart search” machine-learning job to cause some load on the GPU, but I never see any.

I already tried setting services.immich.machine-learning.environment.DEVICE to cuda or nvidia, and adding users.users.immich.extraGroups = [ "video" "render" ];, but neither helped.

What am I missing?


Oh, it seems the package is not built with GPU support. Compare the dependencies in the package to the upstream pyproject.toml.
Can someone confirm that my interpretation is correct?

I’ve gotten immich-machine-learning to work with CUDA. Here are the things I had to do, in addition to all the work to get the NVIDIA drivers working. You probably want to verify the driver setup with a different app first; I used Plex hardware transcoding for that.

1. Enable cudaSupport for onnxruntime.

Ideally, I’d be able to do this via nixpkgs.config.cudaSupport, but mxnet-1.9.1 with cudaSupport is marked broken. onnxruntime is a transitive dependency (via at least insightface, and probably also huggingface-hub), so I used an overlay instead:

nixpkgs.overlays = [
  (final: prev: {
    onnxruntime = prev.onnxruntime.override {cudaSupport = true;};
  })
];

I read in CUDA - NixOS Wiki that adding the nix-community cache might prevent me from having to recompile things, but it didn’t (maybe because I’m not using the same nvidia driver version as others? IDK).
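
For reference, the nix-community cache is added roughly like this (the public key below is the one published on the nix-community Cachix page; double-check it there before trusting it):

```nix
{
  nix.settings = {
    substituters = [ "https://nix-community.cachix.org" ];
    trusted-public-keys = [
      "nix-community.cachix.org-1:mB9FSh9qf2dCimDSUo8Zy7bkq5CX+/rkCWyvRCYg3Fs="
    ];
  };
}
```

It only helps if the cached builds were made with the same cudaSupport and driver-related settings as yours, which may explain why it didn’t save me a rebuild.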

2. Point LD_LIBRARY_PATH to onnxruntime.

It seems like this might be a bug in how onnxruntime is packaged. I might file an issue later.

services.immich.machine-learning = {
  environment.LD_LIBRARY_PATH = "${pkgs.python312Packages.onnxruntime}/lib/python3.12/site-packages/onnxruntime/capi";
};
3. Patch immich-machine-learning to disable a broken test.

This is Build failure: immich-machine-learning-1.118.2 · Issue #352113 · NixOS/nixpkgs · GitHub. I’m not sure why it’s failing.

nixpkgs.overlays = [
  (final: prev: {
    # Work-around https://github.com/NixOS/nixpkgs/issues/352113
    immich-machine-learning = prev.immich-machine-learning.overrideAttrs (_: {patches = [./disable_cuda_test.diff];});
  })
];

disable_cuda_test.diff:

--- a/app/test_main.py	2025-02-11 21:09:09.022378668 -0800
+++ b/app/test_main.py	2025-02-11 21:09:18.327188276 -0800
@@ -241,8 +241,6 @@
         session = OrtSession("ViT-B-32__openai")
 
         assert session.sess_options.execution_mode == ort.ExecutionMode.ORT_SEQUENTIAL
-        assert session.sess_options.inter_op_num_threads == 1
-        assert session.sess_options.intra_op_num_threads == 2
         assert session.sess_options.enable_cpu_mem_arena is False
 
     def test_sets_default_sess_options_does_not_set_threads_if_non_cpu_and_default_threads(self) -> None:

Signs that everything is working:

  • This error message is no longer emitted on startup of immich-machine-learning:
    2022-10-28 19:54:16.5781916 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:1622 onnxruntime::python::CreateInferencePybindStateModule] Init provider bridge failed.
    
  • Clicking “refresh faces” causes a process to show up on nvtop ~immediately for me.
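
For anyone who wants the whole thing in one place, here is a sketch of a single module combining the three steps above. I haven’t tested it as one unit, and it assumes the disable_cuda_test.diff from this post sits next to the module file:

```nix
{ config, pkgs, ... }:
{
  nixpkgs.overlays = [
    (final: prev: {
      # Step 1: build onnxruntime with CUDA support.
      onnxruntime = prev.onnxruntime.override { cudaSupport = true; };
      # Step 3: work around https://github.com/NixOS/nixpkgs/issues/352113
      immich-machine-learning = prev.immich-machine-learning.overrideAttrs (_: {
        patches = [ ./disable_cuda_test.diff ];
      });
    })
  ];

  # Step 2: let onnxruntime find its CUDA execution provider.
  services.immich.machine-learning.environment.LD_LIBRARY_PATH =
    "${pkgs.python312Packages.onnxruntime}/lib/python3.12/site-packages/onnxruntime/capi";
}
```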

You are a real hero! By the way, I had to set the environment this way on my machine to get it working (I am running immich-machine-learning on its own):

  systemd.services.immich-machine-learning = {
    description = "Immich Machine Learning Service";
    after = ["network.target"];
    wantedBy = ["multi-user.target"];

    environment.LD_LIBRARY_PATH = "${pkgs.python312Packages.onnxruntime}/lib:${pkgs.python312Packages.onnxruntime}/lib/python3.12/site-packages/onnxruntime/capi";

    serviceConfig = {
      ExecStart = "${pkgs.immich-machine-learning}/bin/machine-learning";
      User = "immich-ml";
    };
  };

The above almost worked for me. I had to add this to /etc/nixos/configuration.nix to get it working (note that lib needs to be in your module arguments):

  systemd.services.immich-machine-learning = {
    serviceConfig = {
      PrivateDevices = lib.mkForce false;
      DeviceAllow = [
        "/dev/nvidia0"
        "/dev/nvidiactl"
        "/dev/nvidia-uvm"
      ];
    };
  };

There’s a PR that’s been merged and will allow configuring this without the override:

When that’s available, if you’re using an NVIDIA card, setting services.immich.accelerationDevices to the above DeviceAllow list should enable GPU acceleration. Otherwise you’ll get an error like this:

[E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=0 ; hostname=foo ; file=/build/source/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=65 ; expr=cudaGetDeviceCount(&num_devices);
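
Once that option is available, the systemd override above should reduce to something like this (hypothetical until the PR reaches your channel, so treat the option name and shape as unverified):

```nix
{
  services.immich.accelerationDevices = [
    "/dev/nvidia0"
    "/dev/nvidiactl"
    "/dev/nvidia-uvm"
  ];
}
```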

I updated my machine recently and had to adjust the patch. It now looks like this:

--- a/test_main.py      2025-02-11 21:09:09.022378668 -0800
+++ b/test_main.py      2025-02-11 21:09:18.327188276 -0800
@@ -285,8 +285,6 @@
         session = OrtSession("ViT-B-32__openai")

         assert session.sess_options.execution_mode == ort.ExecutionMode.ORT_SEQUENTIAL
-        assert session.sess_options.inter_op_num_threads == 1
-        assert session.sess_options.intra_op_num_threads == 2
         assert session.sess_options.enable_cpu_mem_arena is False

     def test_sets_default_sess_options_does_not_set_threads_if_non_cpu_and_default_threads(self) -> None: