Capturing Kernel Panic Logs (Kdump...) (Issues with amdgpu in dmesg)

Greetings

I am trying to troubleshoot an issue with my AMD card. The following shows up in dmesg:

[    4.418857] amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:158 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[    4.418861] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000006003000 from client 0x12 (VMC)
[    4.418864] amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x0000073C
[    4.418866] amdgpu 0000:03:00.0: amdgpu: 	 Faulty UTCL2 client ID: DCEDMC (0x3)
[    4.418867] amdgpu 0000:03:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[    4.418868] amdgpu 0000:03:00.0: amdgpu: 	 WALKER_ERROR: 0x6
[    4.418869] amdgpu 0000:03:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[    4.418870] amdgpu 0000:03:00.0: amdgpu: 	 MAPPING_ERROR: 0x1
[    4.418871] amdgpu 0000:03:00.0: amdgpu: 	 RW: 0x0

That is somewhat beside the point, though. Specifically, when my VMs are working hard I seem to be getting crashes. I think it's related to my iGPU and how I'm trying to set up GPU passthrough, but I can't get good logs of the crashes.

I understand it's heretical, but going by some GPT suggestions, a tool such as kdump may be useful. I'm not sure how to set it up, though.

The Red Hat docs mention:

13.1. Estimating the kdump size

When planning and building your kdump environment, it is important to know how much space the crash dump file requires.

The makedumpfile --mem-usage command estimates how much space the crash dump file requires. It generates a memory usage report. The report helps you determine the dump level and which pages are safe to be excluded.

However, I don't see a makedumpfile command, nor does anything come up when I search the nixpkgs site.

Does anyone know a good way to set up kdump, or another way to do what I'm attempting? Or perhaps how to just fix this issue with my AMD card :smiley:
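For context, NixOS seems to expose crash-dump support through `boot.crashDump` module options rather than a standalone kdump service. Below is a minimal, untested sketch of what that might look like in the system configuration, assuming the `boot.crashDump.enable` and `boot.crashDump.reservedMemory` options behave as described in the NixOS options search. The `256M` value is a guess, not something verified on this hardware.

```nix
{
  # Assumption: the NixOS boot.crashDump module. When enabled, a crash
  # kernel is loaded via kexec so that, after a panic, the memory image
  # of the crashed kernel can be read from /proc/vmcore.
  boot.crashDump.enable = true;

  # Memory reserved for the crash kernel. This value is a guess for a
  # multi-GPU machine and has not been tested.
  boot.crashDump.reservedMemory = "256M";
}
```

If the reservation is too large, dmesg should report that the crashkernel reservation failed. After a panic, the machine is expected to boot into the crash kernel, from which /proc/vmcore can be copied somewhere persistent for later analysis.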

Ryzen 9 7950X (Raphael iGPU)
RX 6800 XT
NVIDIA GTX 750 Ti (passed through)

Possibly relevant lines from config.nix:

  boot.initrd.kernelModules = [ "amdgpu" ];
  # Bootloader.
  boot.loader.systemd-boot.enable = true;
  boot.loader.efi.canTouchEfiVariables = true;
  # VFIO
  boot.kernelParams = [ "amd_iommu=on" "iommu=pt" "amdgpu.ppfeaturemask=0xfffd3fff" ];
  boot.blacklistedKernelModules = [ "nvidia" "nouveau" ];
  boot.kernelModules = [ "kvm_amd" "vfio_virqfd" "vfio_pci" "vfio_iommu_type1" "vfio" ];
  boot.extraModprobeConfig = "options vfio-pci ids=10de:1380,10de:0fbc";
  # Virt-manager
  virtualisation.libvirtd.enable = true;

Best,
