NVMe drive not detected after Calamares starts

lilHoodie · August 23, 2023, 8:13pm

Hi everyone,

I’m having issues with my NVMe drive not being detected. Strangely, the issue appeared today, even though I’ve been using the system without any problems for about a month while learning NixOS. This morning, when I tried booting my Dell XPS 9560 laptop, I got this error:

nvme nvme0: Identify Controller failed (-4)

To my surprise, I couldn’t find my NVMe drive. My initial thought was, “No worries, I’ll just rollback to a previous generation.” After attempting this, I found that the only generation that worked was the very first one the system ever created.

Deciding to take a more drastic measure, I opted to reinstall NixOS. However, I was unable to detect my NVMe drive in the Calamares installer, GParted, or lsblk during a live boot session. Intriguingly, when I booted using a Linux Mint live CD, the drive was detected perfectly. As a sanity check, I proceeded with the Linux Mint installation, and everything went smoothly without a hitch.

Here’s a list of troubleshooting steps I’ve tried:

Reinstalled NixOS.
Re-downloaded a brand new image and recreated my USB live boot. (I tried GNOME, KDE, and the minimal variants)
Reset my BIOS to default and then retested.
Carefully checked BIOS settings:
- Disabled fast boot
- Disabled secure boot
- Set SATA drives to AHCI (instead of RAID)
- Disabled C-states
- Disabled speed step
- Disabled any power-saving options
- Attempted to disable TPM 2.0
Reseated the NVMe drive, RAM, and battery.
Ran modprobe nvme during the live CD session.

Some additional observations:

The drive appears in the lspci output during the live-boot session.
The drive is visible in the lsblk output only before Calamares finishes loading. I noticed a message that said something along the lines of “loading 1 module”. Interestingly, the drive remains visible up until the Calamares installer completes loading that module. Post that, it’s just invisible.
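
A minimal sketch for checking the "visible on the PCI bus but not as a block device" state described above. The helper name `nvme_block_devices` is made up for illustration, and it assumes `lsblk` (util-linux) and `lspci` (pciutils) are available in the live session:

```shell
# Filter a list of block device names (one per line on stdin) down to
# NVMe namespaces such as nvme0n1. Prints nothing if there are none.
nvme_block_devices() {
  grep -E '^nvme[0-9]+n[0-9]+$' || true
}

# Typical use on the live system (not run here):
#   lspci -nn | grep -i 'non-volatile memory'   # is the controller on the PCI bus?
#   lsblk -dn -o NAME | nvme_block_devices      # did the kernel create a block device?
```

If the first command lists the controller and the second prints nothing, the controller is enumerated but the kernel has no usable namespace for it, which matches what is described here.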

I checked the SMART data for the drive:

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 32 Celsius
Available Spare: 92%
Available Spare Threshold: 50%
Percentage Used: 0%
Data Units Read: 6,135,172 [3.14 TB]
Data Units Written: 7,358,838 [3.76 TB]
Host Read Commands: 102,455,992
Host Write Commands: 86,450,036
Controller Busy Time: 3,882
Power Cycles: 1,021
Power On Hours: 577
Unsafe Shutdowns: 134
Media and Data Integrity Errors: 49
Error Information Log Entries: 0
Warning: 0
Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 32 Celsius
Temperature Sensor 2: 39 Celsius
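
A small sketch for pulling a single counter out of a report like the one above, assuming it was produced by `smartctl -a /dev/nvme0` from smartmontools. The helper name `smart_field` is hypothetical. The report shows 49 media and data integrity errors and 134 unsafe shutdowns; whether those relate to the detection problem is not established here, but they are easy to track over time:

```shell
# Print the value of one field from a smartctl NVMe health report on stdin.
# $1 is the field label exactly as printed, without the trailing colon.
# Thousands separators are stripped so the value can be compared numerically.
smart_field() {
  awk -F': *' -v key="$1" '$1 == key { gsub(/,/, "", $2); print $2 }'
}

# Typical use (needs root and smartmontools, not run here):
#   smartctl -a /dev/nvme0 | smart_field 'Media and Data Integrity Errors'
```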

The closest I’ve come to identifying the root cause was when I tried running fdisk. The process seemed to hang for a while before spitting out the following error:

Unable to change power state from D3cold to D0, device inaccessible.
I/O error on dev nvme0n1, logical block 0, async page read

System Info:

Dell XPS 9560
nixos-23.05.2975
CPU: i7-7700HQ
GPU: GTX 1050
RAM: 16G DDR4-2400MHz
Storage: SKHynix PC401 NVMe 1TB revision: 80002E00

I’m really at my wit’s end here and any guidance or suggestions would be immensely appreciated.

Edit: Some information I forgot to add.

When I was running lspci the drive showed up and was recognized as an SKHynix NVMe drive, however it was the wrong size. lspci said the drive was 256G but I only have a 1TB drive installed.

Lyndeno · August 23, 2023, 8:32pm

I also have an XPS 9560 running NixOS.

Here is what I have:

Fast boot enabled
Secure boot enabled (lanzaboote)
SATA/nvme set to AHCI
C states are enabled
Speed step I am pretty sure is enabled
TPM 2.0 enabled.

I am unsure about your drive.

The original Samsung drive that came with my computer worked perfectly. I wanted more capacity and switched to a 2TB WD SN770. This drive would intermittently try to enter some power-saving state, after which the computer could no longer access it: sometimes boots would fail, sometimes the filesystem would go read-only while booted, etc. The drive works fine in my desktop, so it seems to be some incompatibility between the laptop and the drive. I switched to a 1TB Timetec drive from Amazon and it works fine.

As for your errors, I think the D3cold error is related to the NVIDIA GPU, so not your drive. However, the I/O error is similar to what I got with the WD drive. Are there any other errors in dmesg related to the nvme drive?

I cannot say for sure you are having the same problem as I was, as our drives are different. Is this a new drive? What is your BIOS revision?

An aside on the BIOS: Dell’s website does not have the latest version. LVFS via fwupd has a BIOS several versions newer.

Lyndeno · August 23, 2023, 8:36pm

Here is the error snippet from when I had the WD drive in my laptop:

[ 1753.922566] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 1753.922574] nvme nvme0: Does your device have a faulty power saving mode enabled?
[ 1753.922578] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[ 1753.940085] nvme0n1: I/O Cmd(0x2) @ LBA 6278080, 187 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 1753.940103] I/O error, dev nvme0n1, sector 50224640 op 0x0:(READ) flags 0x84700 phys_seg 127 prio class 3
[ 1753.940124] nvme0n1: I/O Cmd(0x2) @ LBA 6278267, 229 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 1753.940133] I/O error, dev nvme0n1, sector 50226136 op 0x0:(READ) flags 0x84700 phys_seg 127 prio class 3
[ 1753.940143] nvme0n1: I/O Cmd(0x2) @ LBA 6278496, 3 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 1753.940149] I/O error, dev nvme0n1, sector 50227968 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 3
[ 1753.940172] nvme0n1: I/O Cmd(0x2) @ LBA 6278499, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 1753.940179] I/O error, dev nvme0n1, sector 50227992 op 0x0:(READ) flags 0x84700 phys_seg 102 prio class 3
[ 1753.940190] nvme0n1: I/O Cmd(0x2) @ LBA 6278755, 169 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 1753.940196] I/O error, dev nvme0n1, sector 50230040 op 0x0:(READ) flags 0x84700 phys_seg 127 prio class 3
[ 1753.940205] nvme0n1: I/O Cmd(0x2) @ LBA 6278924, 36 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 1753.940211] I/O error, dev nvme0n1, sector 50231392 op 0x0:(READ) flags 0x80700 phys_seg 28 prio class 3
[ 1753.940227] nvme0n1: I/O Cmd(0x2) @ LBA 6278961, 129 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 1753.940233] I/O error, dev nvme0n1, sector 50231688 op 0x0:(READ) flags 0x84700 phys_seg 127 prio class 3
[ 1753.940241] nvme0n1: I/O Cmd(0x2) @ LBA 6279090, 14 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 1753.940247] I/O error, dev nvme0n1, sector 50232720 op 0x0:(READ) flags 0x80700 phys_seg 14 prio class 3
[ 1753.940266] nvme0n1: I/O Cmd(0x2) @ LBA 6277056, 141 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 1753.940272] I/O error, dev nvme0n1, sector 50216448 op 0x0:(READ) flags 0x84700 phys_seg 127 prio class 3
[ 1753.940283] nvme0n1: I/O Cmd(0x2) @ LBA 6277197, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 1753.940289] I/O error, dev nvme0n1, sector 50217576 op 0x0:(READ) flags 0x84700 phys_seg 92 prio class 3
[ 1753.945614] nvme 0000:04:00.0: enabling device (0000 -> 0002)
[ 1753.945944] nvme nvme0: Disabling device after reset failure: -19

Compare your dmesg with this; maybe there are similarities.

lilHoodie · August 23, 2023, 8:49pm

Thanks for the reply. I’m currently about to leave work, but I’ll check my dmesg as soon as I get home. I bought this laptop secondhand from a friend, so I’m unsure if this is the OEM drive. When running a Dell diagnostic test, it states OEM, but I may be misinterpreting what that means.

Hard Drive 1-0-1
       OEM: SK hynix, product: PC401 NVMe SK hynix 1TB, revision: 80002E00,
       S/N: XXXXXXXXXXXXXXXXX, type: NVMe, size: 1 TB M.2, PPID:  XXXXXXXXXXXXXXXXXXXXXXXX

It’s very possible the previous owner replaced the drive. I was thinking about getting a 2TB upgrade anyways so I might just pull the trigger and get a Samsung drive.

As for the BIOS settings, I was just grasping at straws with the C-states, speed step, and TPM settings. Just had to try things out. I updated my BIOS about 2 months ago, so I highly doubt there is a newer version, but I’m running version 1.31.0.

lilHoodie · August 24, 2023, 3:42pm

So, I ended up replacing my NVMe with one I had lying around. The problem has gone away. I can consistently boot into NixOS and it detects the drive every time. Furthermore, Calamares no longer waits for “1 module to load”, likely because it was previously hitting a timeout while trying to get that drive to load.

Regardless, thank you for your help. I ordered a new 2TB Samsung drive, which should hopefully work without issues. I can still post my dmesg in this thread for future documentation, if that’s useful, but that means I’d have to install the problematic drive again.

valentino · September 4, 2023, 9:06am

I have the same model (Dell XPS 15 9560) and it failed with a similar message:

<<< NixOS Stage 1 >>>
loading module dm_mod...
running udev...
Starting systemd-udevd version 253.6
[   33.926607] nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Device /dev/disk/by-uuid/1dbda026-dd09-4abc-8a83-f0a5591074a3 is too small.
Key File /crypto_keyfile.bin failed!
/crypto_keyfile.bin is unavailable

Luckily, I was able to get back to a working system by rolling back to a previous NixOS revision, but upgrading always brings me back to this error. So I guess I also need to do a clean install of NixOS and/or replace my NVMe.

valentino · September 4, 2023, 8:23pm

Okay, I just tried a reinstall and I am experiencing exactly the same issues that you had with a fresh live ISO.

Calamares says “There are no partitions to install on” and the lsblk output shows only the USB stick.

I also first thought about the kernel module nvme not being loaded, but lsmod says that it is loaded.

I am lucky to still have a working old configuration and am about to buy a new SSD soon to get this fixed.

uep · September 5, 2023, 2:22am

@jade has been having some similar issues discussed elsewhere, seemingly related to power management but for some reason 100% reproducibly triggered via cryptsetup.

There seems to have been a regression in recent kernels (maybe 6.5), which was perhaps also backported to mainline/stable recently. From her testing, 6.1.44 is ok, 6.1.49 is not.

Can others having this issue help confirm kernel versions in working and failing generations, please?
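
One way to collect those versions is sketched below. It rests on two assumptions that are worth verifying on your own system: each generation under `/nix/var/nix/profiles/` has a `kernel` symlink, and that link resolves to a store path of the form `...-linux-<version>/bzImage`. The helper name `kernel_version_from_path` is made up for illustration:

```shell
# Extract the kernel version from a store path such as
# /nix/store/<hash>-linux-6.4.10/bzImage (path layout is an assumption).
kernel_version_from_path() {
  printf '%s\n' "$1" | sed -n 's|.*-linux-\([0-9][0-9.]*\).*|\1|p'
}

# Typical use on an installed system (not run here):
#   for gen in /nix/var/nix/profiles/system-*-link; do
#     printf '%s\t%s\n' "$gen" "$(kernel_version_from_path "$(readlink -f "$gen/kernel")")"
#   done
#   uname -r   # kernel of the generation that is currently booted
```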

Relevant log message that seems indicative around the problem:

Sep 04 11:56:38 localhost kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Sep 04 11:56:38 localhost kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Sep 04 11:56:38 localhost kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Sep 04 11:56:38 localhost kernel: nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Sep 04 11:56:38 localhost kernel: nvme nvme0: Disabling device after reset failure: -19
Sep 04 11:56:38 localhost systemd-cryptsetup[169]: Device /dev/disk/by-uuid/b80aedf8-ddd4-46fa-8d09-5215d5f286b9 READ lock released.
Sep 04 11:56:38 localhost systemd-cryptsetup[169]: IO error while decrypting keyslot.
Sep 04 11:56:38 localhost systemd-cryptsetup[169]: Keyslot 0 (luks2) open failed with -5.
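
A rough filter for spotting the same symptoms in your own kernel log. The patterns are copied from the messages quoted in this thread; the function name `nvme_power_symptoms` is hypothetical:

```shell
# Keep only the kernel log lines (on stdin) that match the NVMe
# power-state failure messages quoted in this thread.
nvme_power_symptoms() {
  grep -E 'controller is down; will reset|Unable to change power state from D3cold to D0|Disabling device after reset failure'
}

# Typical use (not run here):
#   dmesg | nvme_power_symptoms                 # current boot
#   journalctl -k -b -1 | nvme_power_symptoms   # previous boot, if the journal is persistent
```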

valentino · September 5, 2023, 7:21am

The last working kernel on my system is 6.4.10.

Later versions I tried are all failing (6.4.11, 6.5.1).
I tried to downgrade to 6.1.51, but this one is also failing.

jade · September 5, 2023, 8:15am

this is an XPS 9560 for me too. it’s definitely a regression then, possibly combined with firmware fuck up.

jade · September 5, 2023, 8:17am

Wait, what the heck actually: mine is a replaced drive! I don’t know about the original drive behaviour but this one is a SN750 I think (SN550? regardless). So I’m guessing it’s an ACPI/firmware/hardware bug that got re-exposed by a kernel regression.

junjihashimoto · September 5, 2023, 9:38am

Have you tried disabling Intel VMX?
In my case, it works with the USB installer.

uep · September 5, 2023, 10:26am

I also have a 9560. I just tried updating it and it has similar issues. It’s not using LUKS, but zfs hangs pretty early.

jade · September 5, 2023, 12:18pm

I have filed a nixpkgs bug so people might find it more easily and have better collected details: Storage issues on XPS 15 9560 due to kernel regression; failing cryptsetup · Issue #253418 · NixOS/nixpkgs · GitHub

zmrocze · September 5, 2023, 12:48pm

Having a related issue, though in my case:

lsblk shows both the ssd and the nvme drives
the nixos calamares installer shows only the ssd drive

I can’t install into the drive I want.

It’s not the lack of “nvme” module, that module is loaded.

awdrius · September 5, 2023, 5:25pm

Just here to say +1 on this.

I have the same NVMe drive issue, with the device being disabled after the whole “Does your device have a faulty power saving mode enabled?” message and the flood of messages that follows. I had the original M.2 drive (made by Toshiba) and, TBH, I used this as an opportunity to get a new drive (Crucial P3 Plus 2TB), which resulted in the exact same error messages. After reading the message by @jade (btw, I think the link there points back to this thread, but I assume the intended target is this), I downloaded and tried the 22.11 image and it works well: the drive is found and I can slice and dice it for installation.

emmanuelrosa · September 5, 2023, 6:17pm

I had a similar issue a long time ago. I got around it with this:

    kernelParams = [
      "nvme_core.default_ps_max_latency_us=5500"
    ];
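
After rebuilding and rebooting, it may be worth confirming that the parameter actually reached the kernel. A minimal sketch, where `kernel_param_value` is a made-up helper that reads a kernel command line (by default `/proc/cmdline`):

```shell
# Print the value of a kernel command-line parameter.
# $1 = parameter name, $2 = command line (defaults to the running kernel's).
# Returns non-zero if the parameter is not present in name=value form.
kernel_param_value() {
  local name=$1
  local cmdline=${2:-$(cat /proc/cmdline)}
  local word
  for word in $cmdline; do
    case $word in
      "$name"=*) printf '%s\n' "${word#*=}"; return 0 ;;
    esac
  done
  return 1
}

# Typical use (not run here):
#   kernel_param_value nvme_core.default_ps_max_latency_us
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```

The sysfs file shows the value the nvme_core module is actually using, which is a useful cross-check when the module is loaded from the initrd.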

uep · September 5, 2023, 10:17pm

That seems (in quick initial testing at least) to do the trick on mine too.

EDIT: Nope. It definitely works better, boots and seems to be normal at first - but eventually the pool falls apart from too many IO errors. None of these errors were in the logs after reboot, so it looks like the device failed to accept new writes and never recovered, after a bit more than an hour in operation.

valentino · September 29, 2023, 1:20pm

Today I tried 6.5.5 and it booted correctly, but after a suspend and wakeup my whole filesystem was in read-only mode. I need to investigate whether the two issues are related.

fabianh · September 30, 2023, 12:29am

Have been experiencing a similar issue as well, but my system is mostly stable (unless I am compiling a lot of nixpkgs stuff; then I can consistently get errors with some high-I/O derivations like webkitgtk).

https://github.com/NixOS/nixpkgs/issues/257159

Found this thread after searching for the “sc 0x71” and finding a lot of recent reports.