Dead SATA disk on HBA and slow boot

trumee · January 21, 2025, 6:16pm

Hello,

I have a remote server with a dead disk. Unfortunately, I will not have physical access for some time. The broken disk makes booting take very long, with errors like this:

The disk is connected to an SAS2308 controller, and its device name (/dev/sge) changes on every boot. In the screenshot it is /dev/sde, but on the next boot it shows up as /dev/sda. I want to disable this disk in the system, similar to this question.

There is no backplane in the system.

Dead disk

# lsscsi -v 
[0:0:3:0]    disk    ATA      WDC WD10EFRX-68J 1A01  /dev/sda 
  dir: /sys/bus/scsi/devices/0:0:3:0  [/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:09.0/0000:04:00.0/host0/port-0:3/end_device-0:3/target0:0:3/0:0:3:0]

HBA details

# lspci -nn -v -s 04:00.0
04:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 [1000:0087] (rev 05)
        Subsystem: Hewlett Packard Enterprise H220i [1590:0041]
        Flags: bus master, fast devsel, latency 0, IRQ 17
        I/O ports at c000 [size=256]
        Memory at dd540000 (64-bit, non-prefetchable) [size=64K]
        Memory at dd500000 (64-bit, non-prefetchable) [size=256K]
        Expansion ROM at dd400000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 3
        Capabilities: [68] Express Endpoint, IntMsgNum 0
        Capabilities: [d0] Vital Product Data
        Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [1e0] Secondary PCI Express
        Capabilities: [1c0] Power Budgeting <?>
        Capabilities: [190] Dynamic Power Allocation <?>
        Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: mpt3sas
        Kernel modules: mpt3sas

Error in dmesg

[  267.833116] mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[  267.833119] sd 0:0:3:0: [sda] tag#4251 CDB: Read(10) 28 00 18 ae c4 00 00 02 00 00
[  267.833127] sd 0:0:3:0: [sda] tag#4269 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
[  267.833126] I/O error, dev sda, sector 414106624 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
[  267.833148] sd 0:0:3:0: [sda] tag#4269 CDB: Read(10) 28 00 00 44 c7 88 00 00 68 00
[  267.835486] I/O error, dev sda, sector 4507528 op 0x0:(READ) flags 0x80700 phys_seg 13 prio class 2
[  267.837828] sd 0:0:3:0: [sda] tag#4268 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
[  267.837835] sd 0:0:3:0: [sda] tag#4268 CDB: Read(10) 28 00 00 44 c7 00 00 00 80 00
[  267.837839] I/O error, dev sda, sector 4507392 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 2
[  267.975616] sd 0:0:3:0: Power-on or device reset occurred
[  271.233140] mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

How can I configure the system or kernel to ignore this device?

Thanks

c0d33p · January 22, 2025, 12:30am

The NixOS manual has a chapter on file systems:
https://nixos.org/manual/nixos/stable/index.html#ch-file-systems

What does your system configuration look like in terms of filesystems and devices?
Could you please provide your fstab?

Please note that you can always use the UUID instead of /dev/sd{x}, for example:

➜ ~ ls -1 /dev/disk/by-uuid
4a5996f2-2c85-46e3-8bd7-ed89f6d497c2
5fa5d7f4-ab38-48ac-8946-7efda00eab61
E067-8C1F
f4a6d210-290a-40e0-9f20-a1967dde44d2
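To see which kernel name each stable name currently points to, the symlinks can be resolved in a loop. This is only a sketch: the function name list_stable_names is made up, and it defaults to /dev/disk/by-id, since a disk without a readable filesystem may not show up under by-uuid at all.

```shell
# Print "stable-name -> current kernel device" for every symlink in a
# /dev/disk/by-* directory (defaults to /dev/disk/by-id).
list_stable_names() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}
```

For example, list_stable_names /dev/disk/by-uuid prints one line per UUID with the /dev/sdX partition it resolves to on the current boot.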

trumee · January 22, 2025, 2:19am

I am using disk IDs for the system configuration. This is my /etc/fstab:

$ cat /etc/fstab 
# This is a generated file.  Do not edit!
#
# To make changes, edit the fileSystems and swapDevices NixOS options
# in your /etc/nixos/configuration.nix file.
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>

# Filesystems.
rpool/root / zfs x-initrd.mount,zfsutil 0 0
/dev/disk/by-uuid/2104-D1AE /boot vfat fmask=0022,dmask=0022 0 2
rpool/home /home zfs zfsutil 0 0
/dev/disk/by-id/ata-WDC_WD200EDGZ-11BLDS0_8LK6LSTU /mnt/ata-WDC_WD200EDGZ-11BLDS0_8LK6LSTU ext4 defaults 0 2
/dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7JKG30MC-part1 /mnt/ata-WDC_WD80EMAZ-00WJTA0_7JKG30MC-part1 ext4 defaults 0 2
/dev/zvol/ssdpool/docker1 /mnt/docker1 ext4 defaults 0 2
rpool/nix /nix zfs x-initrd.mount,zfsutil 0 0
rpool/var /var zfs x-initrd.mount,zfsutil 0 0
rpool/incus /var/lib/incus zfs zfsutil 0 0


# Swap devices.
/dev/zvol/rpool/swap none swap zfsutil
/dev/zvol/rpool/swap1 none swap zfsutil
/dev/zvol/rpool/swap2 none swap zfsutil

c0d33p · January 22, 2025, 9:30am

The fstab mentions ZFS and a pool called “rpool”.
Could you please check which disks belong to that pool?

Please list all pools with zpool list and then check the status of each one:

# zpool list
# zpool status -v <pool name>

Maybe one of them is still using the disk

trumee · January 22, 2025, 4:02pm

No, I have not referenced the disk in my config, and it is not part of any ZFS pool. On bootup the system tries to probe all present drives and ultimately times out with a failure message.

I am looking for a kernel parameter that disables this disk at boot, similar to what libata offers like this, but for SCSI.
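For reference, the kernel does expose a per-device sysfs attribute that detaches a SCSI device at runtime: writing 1 to its delete file. This is not a boot parameter, and the kernel will already have probed the disk once before anything can write to it, so it is at best a partial workaround. A minimal sketch, assuming the address 0:0:3:0 from the lsscsi output above stays the same across boots (which is not guaranteed); the function name and the optional sysfs-root argument are made up for illustration and dry runs:

```shell
# Detach one SCSI device from the kernel via its sysfs "delete" attribute.
# $1: SCSI address (host:channel:target:lun), e.g. 0:0:3:0
# $2: sysfs mount point, defaults to /sys (override only for dry runs)
remove_scsi_device() {
    hctl=$1
    sysfs_root=${2:-/sys}
    node="$sysfs_root/bus/scsi/devices/$hctl/delete"
    if [ ! -e "$node" ]; then
        echo "no SCSI device at $hctl" >&2
        return 1
    fi
    # Writing 1 asks the SCSI layer to remove the device (needs root).
    echo 1 > "$node"
}
```

Run as root, remove_scsi_device 0:0:3:0 would drop the disk until the next host rescan or reboot.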

hans4687 · January 22, 2025, 4:51pm

Hi,
I understand your question, but is there some kind of IPMI available to enter the boot software of the HBA and disable this hard disk? That way the kernel won't notice it.

c0d33p · January 22, 2025, 4:57pm

I read up on the subject. There is software called multipath that lets you blacklist specific devices; below is a link to the openSUSE documentation on how to do that.

There is also an option in NixOS to configure this.

I’m still green on this topic, though, so I’m not able to solve your problem.

trumee · January 22, 2025, 5:00pm

Yes, I do have IPMI. I can bring up the AVAGO HBA interface by pressing Ctrl+C. However, I don't see a way to disable the disk through the interface.

hans4687 · January 22, 2025, 5:57pm

Just a long shot: is it possible to set the scan ID of the hard disk to No during boot? If so, the host system will not see the device. I saw this on the Wayback Machine.

trumee · February 9, 2025, 5:53pm

I don't get the details mentioned on page 132 of the PDF (7-12 of the document). Perhaps HBA mode means that such details are not exposed in the BIOS.