Removing the `amdgpu` module from the kernel

Hello there. Yesterday, I removed the amdgpu built-in module from the kernel, and it was an adventure. Here’s the after-action report.

Bottom line up front:

# Turn off amdgpu (conflicts with NVIDIA)
boot.kernelPatches = with inputs.nixpkgs.lib; [{
  name = "disable-amdgpu";
  patch = null;
  extraStructuredConfig = {
    DRM_AMDGPU = kernel.no;
    DRM_AMDGPU_CIK = mkForce (kernel.option kernel.no);
    DRM_AMDGPU_SI = mkForce (kernel.option kernel.no);
    DRM_AMDGPU_USERPTR = mkForce (kernel.option kernel.no);
    DRM_AMD_DC_FP = mkForce (kernel.option kernel.no);
    DRM_AMD_DC_SI = mkForce (kernel.option kernel.no);
    HSA_AMD = mkForce (kernel.option kernel.no);
  };
}];

Aside: The reason I’m doing this is that I experience a boot hang when both the amdgpu and nvidia drivers are loaded. I had previously worked around this with `boot.kernelParams = [ "module_blacklist=amdgpu" ];`. I wanted to see whether I could resolve it at build time rather than at run time. It’s OK to make my machine sweat a little building the kernel.

I found that the CONFIG_DRM_AMDGPU setting is what turns the module on and off. Great! Following the Linux Kernel article on the NixOS wiki, I concocted the first bit:

boot.kernelPackages = pkgs.linuxPackages_latest;

# Turn off amdgpu (conflicts with NVIDIA)
boot.kernelPatches = with inputs.nixpkgs.lib; [{
  name = "disable-amdgpu";
  patch = null;
  extraStructuredConfig = {
    DRM_AMDGPU = kernel.no;
  };
}];

One sudo nixos-rebuild --flake /etc/nixos#zebul boot later, we get this error:

error: builder for '/nix/store/cxcxn67gziw66mg6kncl67s0h4y1sy00-linux-config-6.5.drv' failed with exit code 255;
       last 10 log lines:
       > GOT: # configuration written to .config
       > GOT: #
       > GOT: make[1]: Leaving directory '/build/linux-6.5/build'
       > GOT: make: Leaving directory '/build/linux-6.5'
       > error: unused option: DRM_AMDGPU_CIK
       > error: unused option: DRM_AMDGPU_SI
       > error: unused option: DRM_AMDGPU_USERPTR
       > error: unused option: DRM_AMD_DC_FP
       > error: unused option: DRM_AMD_DC_SI
       > error: unused option: HSA_AMD
       For full logs, run 'nix log /nix/store/cxcxn67gziw66mg6kncl67s0h4y1sy00-linux-config-6.5.drv'.

Hmm. The wiki article mentions an ignoreConfigErrors option that lets the build continue. However, most of the ways I found to set it relied on overlays or were tied to specific kernel versions. Here’s one setting that did work:

boot.kernelPackages = pkgs.linuxPackagesFor (pkgs.linux_latest.override {
  ignoreConfigErrors = true;
});

While that does work, it means that any bad configuration will now pass silently. It would be better, in my estimation, to deal with only the six settings that cause the error. Let’s try setting each of them to no explicitly:

  # Turn off amdgpu (conflicts with NVIDIA)
  boot.kernelPatches = with inputs.nixpkgs.lib; [{
    name = "disable-amdgpu";
    patch = null;
    extraStructuredConfig = {
      DRM_AMDGPU = kernel.no;
      DRM_AMDGPU_CIK = kernel.no;
      DRM_AMDGPU_SI = kernel.no;
      DRM_AMDGPU_USERPTR = kernel.no;
      DRM_AMD_DC_FP = kernel.no;
      DRM_AMD_DC_SI = kernel.no;
      HSA_AMD = kernel.no;
    };
  }];

When rebuilt, this error arises:

       error: The option `settings.DRM_AMDGPU_CIK.tristate' has conflicting definition values:
       - In `pkgs/os-specific/linux/kernel/common-config.nix': "y"
       - In `<unknown-file>': "n"
       Use `lib.mkForce value` or `lib.mkDefault value` to change the priority on any of these definitions.

OK, sure, let’s mkForce it. When we do, we’re back to the same error about unused options. That makes sense: whether they’re set to yes or no, the kernel’s config step still ignores them once DRM_AMDGPU is off, so they’re still reported as unused. How to remove them?
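
Reconstructed from the description above, the mkForce attempt presumably looked something like the fragment below. This is a sketch, not the exact text from the original config: it resolves the conflicting-definition error, but the build still stops on the unused-option errors.

```nix
# Intermediate attempt (reconstruction): mkForce wins the priority conflict
# with common-config.nix, but the options are still required, so the config
# check still fails with "unused option".
boot.kernelPatches = with inputs.nixpkgs.lib; [{
  name = "disable-amdgpu";
  patch = null;
  extraStructuredConfig = {
    DRM_AMDGPU = kernel.no;
    DRM_AMDGPU_CIK = mkForce kernel.no;
    DRM_AMDGPU_SI = mkForce kernel.no;
    DRM_AMDGPU_USERPTR = mkForce kernel.no;
    DRM_AMD_DC_FP = mkForce kernel.no;
    DRM_AMD_DC_SI = mkForce kernel.no;
    HSA_AMD = mkForce kernel.no;
  };
}];
```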

Those settings come from common-config.nix. I didn’t find a way to overlay or patch this file from my flake, short of forking the nixpkgs repository and keeping my fork up to date. That sounds onerous, so I dropped that idea.

Doing a little data flow reading, I found that these settings eventually make their way to linux/kernel/generic.nix, then are interpreted by nixos/modules/system/boot/kernel_config.nix.

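Ah! And there is a way: kernel.option, which marks a setting as optional so the config check no longer fails when the kernel build ignores it. Let’s take our mkForce kernel.no and turn it into mkForce (kernel.option kernel.no). And with that, we’ve accomplished all aims: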

  1. The amdgpu module is not in the kernel (verified with zcat /proc/config.gz | grep AMDGPU after a reboot)
  2. Config settings that aren’t read by the kernel build are still errors. (ignoreConfigErrors is still false.)
  3. No overlays or forking of the nixpkgs repo; we stay up to date with a simple nix flake update / nixos-rebuild.
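
To see what the kernel.option wrapper changes, a small standalone expression can compare the two values. This is a minimal sketch that assumes `<nixpkgs>` is on NIX_PATH (inside a flake, `inputs.nixpkgs.lib` would take the place of the import); the file name and the attribute names `required` and `optional` are made up for illustration.

```nix
# Evaluate with: nix-instantiate --eval --strict kernel-option.nix
# (kernel-option.nix is a hypothetical file name.)
let
  lib = import <nixpkgs/lib>;
in
{
  # A plain "no": the config check complains if the kernel ignores it.
  required = lib.kernel.no;

  # The same "no", wrapped so that being ignored is tolerated.
  optional = lib.kernel.option lib.kernel.no;
}
```

In the nixpkgs versions I have looked at, the two results should differ only in their `optional` attribute, which is what lets the six dependent settings disappear quietly while every other unused setting remains an error.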

I’m posting this mostly to hit the search engines so that if others encounter this same issue, they’ll have the breadcrumbs needed to get themselves to some better state.

Any feedback on this method, or other routes I could have gone, is appreciated.

Can the UX be improved? Make all the default kernel options use lib.mkDefault at least? Show a better error for kernel.mkOption?
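
To make the lib.mkDefault suggestion concrete, here is a minimal sketch of how module-system priorities behave. The option name `setting` and the three toy modules are hypothetical stand-ins, not the real kernel config modules, and the sketch assumes `<nixpkgs>` is on NIX_PATH.

```nix
# Evaluate with: nix-instantiate --eval mkdefault-sketch.nix
# (mkdefault-sketch.nix is a hypothetical file name.)
let
  lib = import <nixpkgs/lib>;
  eval = lib.evalModules {
    modules = [
      # Declare a toy option standing in for one kernel setting.
      { options.setting = lib.mkOption { type = lib.types.str; }; }
      # What a default in common-config.nix could look like under the proposal.
      { setting = lib.mkDefault "y"; }
      # A user override: a plain definition outranks mkDefault, so no mkForce.
      { setting = "n"; }
    ];
  };
in
eval.config.setting
```

With the default defined via mkDefault, the plain user definition takes precedence and the conflicting-definition error never comes up. Whether that is appropriate for every entry in common-config.nix is a separate question.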