An easier way to repair corrupted nix DB

mhwombat · November 23, 2023, 3:25pm

I use NixOS. My nix DB gets corrupted every few months. I want to find a way to repair this sort of problem without having to completely rebuild my system.

Note: I don’t think it’s a disk issue. I only ever have problems with the Nix store; no other files get corrupted. I’ve swapped disks several times, but the problem doesn’t go away. Might be because I’m on “unstable”.

I saw this, which looked promising…

I recently switched to btrfs, but I’m not very familiar with it yet. I guess I would use snapshots to do that?

So here’s what I did try…

❯ nix-collect-garbage
finding garbage collector roots...
deleting garbage...
0 store paths deleted, 0.00 MiB freed
error: executing SQLite statement 'delete from ValidPaths where path = '/nix/store/xfl304fnamjx1l2rj5wc0wbhkccj5lmh-context-typescripts.r60422.tar.xz.drv';': database disk image is malformed, database disk image is malformed (in '/nix/var/nix/db/db.sqlite')

❯ sudo nix-store --verify --check-contents --repair
[sudo] password for amy:
reading the Nix store...
checking path existence...
error: store path '2xw0vq0cf42p20i6h0pair49afbm6nz-text-short-0.1.5-r3.cabal.drv' contains illegal base-32 character ''

❯ sudo nix store delete /nix/store/2xw0vq0cf42p20i6h0pair49afbm6nz-text-short-0.1.5-r3.cabal.drv
path '/nix/store/2xw0vq0cf42p20i6h0pair49afbm6nz-text-short-0.1.5-r3.cabal.drv' does not contain a 'flake.nix', searching up
error: getting status of '/nix/store/2xw0vq0cf42p20i6h0pair49afbm6nz-text-short-0.1.5-r3.cabal.drv': No such file or directory

❯ ls /nix/store/2xw0vq0cf42p20i6h0pair49afbm6nz-text-short-0.1.5-r3.cabal.drv
ls: cannot access '/nix/store/2xw0vq0cf42p20i6h0pair49afbm6nz-text-short-0.1.5-r3.cabal.drv': No such file or directory

I followed the steps described in “nix-collect-garbage: error getting outputs of ... database disk image is malformed” (Issue #1353, NixOS/nix on GitHub) and “Nix store sqlite db corruption”:

❯ sudo sqlite3 /nix/var/nix/db/db.sqlite 'pragma integrity_check'
wrong # of entries in index sqlite_autoindex_Refs_1
row 47413 missing from index sqlite_autoindex_ValidPaths_1
row 62261 missing from index sqlite_autoindex_ValidPaths_1
row 62262 missing from index sqlite_autoindex_ValidPaths_1
wrong # of entries in index sqlite_autoindex_ValidPaths_1
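
Every line of that report names an index (sqlite_autoindex_Refs_1, sqlite_autoindex_ValidPaths_1), not table contents. If the rows themselves are intact, which is an assumption, SQLite's REINDEX statement can rebuild the indexes from the table data. A hedged sketch that tries this on a copy and leaves the original alone (the function and file names are made up for illustration):

```shell
#!/usr/bin/env bash
# Sketch: copy an SQLite database, rebuild all indexes in the copy,
# then re-run the integrity check on the copy.
reindex_copy() {
  local src="$1" copy="$2"
  # Never touch an existing file; always work on a fresh copy.
  if [ -e "$copy" ]; then
    echo "refusing to overwrite existing $copy" >&2
    return 1
  fi
  # .backup makes a page-level copy of the source database.
  sqlite3 "$src" ".backup '$copy'" || return 1
  # REINDEX with no argument rebuilds every index from table contents.
  sqlite3 "$copy" 'reindex' || return 1
  # A healthy database prints a single line: ok
  sqlite3 "$copy" 'pragma integrity_check'
}
```

`reindex_copy db.sqlite db.reindexed.sqlite` prints `ok` if the copy is healthy after reindexing. REINDEX fails if a table holds rows that violate one of its UNIQUE indexes, in which case this shortcut does not apply.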

# cd /nix/var/nix/db

# nix-shell -p sqlite

# sqlite3 db.sqlite ".backup 'db.bak.sqlite' "

# sqlite3 db.sqlite
SQLite version 3.43.2 2023-10-10 12:14:04
Enter ".help" for usage hints.
sqlite> .output db.sql
sqlite> .dump
sqlite> 


# sqlite3 db.new.sqlite
Parse error near line 3: table ValidPaths already exists
  CREATE TABLE ValidPaths (     id               integer primary key autoincreme
               ^--- error here
Runtime error near line 14: UNIQUE constraint failed: ValidPaths.id (19)
Runtime error near line 15: UNIQUE constraint failed: ValidPaths.id (19)
Runtime error near line 16: UNIQUE constraint failed: ValidPaths.id (19)
Runtime error near line 17: UNIQUE constraint failed: ValidPaths.id (19)
Runtime error near line 18: UNIQUE constraint failed: ValidPaths.id (19)
Runtime error near line 19: UNIQUE constraint failed: ValidPaths.id (19)
Runtime error near line 20: UNIQUE constraint failed: ValidPaths.id (19)
. . .
Runtime error near line 677463: UNIQUE constraint failed: DerivationOutputs.drv, DerivationOutputs.id (19)
Runtime error near line 677464: UNIQUE constraint failed: DerivationOutputs.drv, DerivationOutputs.id (19)
Runtime error near line 677465: UNIQUE constraint failed: DerivationOutputs.drv, DerivationOutputs.id (19)
Parse error near line 677468: index IndexReferrer already exists
Parse error near line 677469: index IndexReference already exists
Parse error near line 677470: trigger DeleteSelfRefs already exists
  CREATE TRIGGER DeleteSelfRefs before delete on ValidPaths   begin     delete f
                 ^--- error here
Parse error near line 677474: index IndexDerivationOutputs already exists
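
The “already exists” and UNIQUE constraint errors above mean db.new.sqlite already held the schema and rows when the dump was replayed. That would be consistent with loading the dump into a file left over from an earlier attempt, though the transcript does not show the load command, so this is a guess. A minimal sketch of the same dump-and-reload idea that refuses to reuse an existing target file (the function name and file names are illustrative, not part of Nix or SQLite):

```shell
#!/usr/bin/env bash
# Sketch: rebuild an SQLite file by dumping it to SQL and replaying the
# dump into a brand-new file.
rebuild_sqlite_db() {
  local src="$1" dst="$2"
  # Replaying a dump on top of existing tables yields "already exists"
  # and UNIQUE constraint errors, so insist on a fresh target.
  if [ -e "$dst" ]; then
    echo "refusing to overwrite existing $dst" >&2
    return 1
  fi
  # Dump schema and rows as SQL and load them into the new file.
  sqlite3 "$src" .dump | sqlite3 "$dst" || return 1
  # A healthy database prints a single line: ok
  sqlite3 "$dst" 'pragma integrity_check'
}
```

For the Nix database this would be run as root from /nix/var/nix/db while nothing else is writing to it, e.g. `rebuild_sqlite_db db.sqlite db.fresh.sqlite`. On a badly damaged file the dump may end with ROLLBACK instead of COMMIT, in which case the reload keeps nothing, so it may be worth writing the dump to a file and checking its last line first.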

I don’t know what else to do at this point

Atemu · November 23, 2023, 4:24pm

You really ought to find out what’s causing your Nix DB to corrupt. It’s not nixos-unstable’s fault; thousands of people use it without their DB corrupting.
This smells like data loss waiting to happen.

You have been warned.

This is assuming you already have your /nix inside a separate subvolume and your / is not the btrfs root subvolume (id=5).

You’d create a new subvolume (e.g. nix2) and then mount it under a certain location (e.g. mount <btrfs> -o subvol=nix2 /tmp/foo/nix --mkdir).
Now you should have a /tmp/foo/ directory with a nix subdirectory. Next is to bind-mount your /boot and/or efi in the same place relative to /tmp/foo/ as they are in the real system.
Then you nixos-install into /tmp/foo/ with no changes to your configuration.nix.

Now you’ve got a working Nix store with your current closure in the nix2 subvolume and a bootloader set up to boot from it. All that is left to do is swap the nix2 subvol with your regular subvol (that’s where your fstab is set up to find it) and reboot. If you’ve done everything correctly, you should boot into your current generation with a brand new Nix store.
Previous generations will be lost. You could copy those over as well, but that’s too much work to bother with IMO.
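
The steps above can be sketched as a command list. This is an untested illustration, not a verified procedure: the function only prints the commands, and the device argument, the /mnt/btrfs-top scratch mount point and the subvolume name "nix" for the existing store are all assumptions that will differ per system.

```shell
#!/usr/bin/env bash
# Prints (does not run) the commands for the fresh-store approach.
# Arguments: btrfs device, scratch mount for the top-level subvolume,
# and the temporary install root. All defaults are placeholders.
print_fresh_store_plan() {
  local dev="$1" top="${2:-/mnt/btrfs-top}" root="${3:-/tmp/foo}"
  cat <<EOF
# mount the top-level subvolume so new subvolumes can be created in it
mount --mkdir -o subvolid=5 $dev $top
btrfs subvolume create $top/nix2
# build the temporary root: new store plus the real /boot
mount --mkdir -o subvol=nix2 $dev $root/nix
mount --mkdir --bind /boot $root/boot
# assumes the system configuration is reachable under $root/etc/nixos
nixos-install --root $root
# swap names; assumes fstab mounts the store by name (subvol=nix)
mv $top/nix $top/nix.old
mv $top/nix2 $top/nix
EOF
}
```

`print_fresh_store_plan /dev/nvme0n1p2` (the partition name is a placeholder) writes the plan to stdout for review before anything is typed by hand. Keeping the old subvolume as nix.old, rather than deleting it, leaves a way back if the new store does not boot.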

After you’ve done all that, I’d recommend running memtest for a few hours. Seriously, that’s not normal.

nixinator · November 23, 2023, 4:31pm

i’d be very interested in getting to the bottom of your corruption issues.

Can you provide logs?

or increase logging from this point forward and see if it happens again, and provide logs if it does.

Run an on-boot memory test and some stress tests on your system while stressing the nix daemon, and see if you can recreate it?

I presume you have the /nix/store on BTRFS, but i read that this corruption also happened with other filesystems; can you tell me what they were?

Is there anything unusual about your system? Is the store mounted over a network, or something strange like that? Is it a remote builder? What are its load characteristics, both for general tasks and for the nix daemon?

do you think it may be a disk or SSD failure, which is incurring bit rot?

Can you reproduce it, or do you know a series of steps that can reproduce this failure? can you provide your configuration.nix for review?

mhwombat · November 23, 2023, 4:57pm

Can you provide logs?

gist.github.com

https://gist.github.com/mhwombat/53cf76ce0f2d5e95e943a1e4874d9d50

journalctl.log

Nov 18 20:57:41 wombat11k kernel: Linux version 6.1.62 (nixbld@localhost) (gcc (GCC) 12.3.0, GNU ld (GNU Binutils) 2.40) #1-NixOS SMP PREEMPT_DYNAMIC Wed Nov  8 13:11:05 UTC 2023
Nov 18 20:57:41 wombat11k kernel: Command line: initrd=\efi\nixos\98r0svy98zr2zzz0p70lwy4x04kq9lai-initrd-linux-6.1.62-initrd.efi init=/nix/store/7xkf0i34xx41x3s57hag49zdgipiv4wk-nixos-system-wombat11k-23.11pre549786.c757e9bd77b1/init loglevel=4
Nov 18 20:57:41 wombat11k kernel: BIOS-provided physical RAM map:
Nov 18 20:57:41 wombat11k kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
Nov 18 20:57:41 wombat11k kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Nov 18 20:57:41 wombat11k kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000003ffffff] usable
Nov 18 20:57:41 wombat11k kernel: BIOS-e820: [mem 0x0000000004000000-0x000000000400efff] ACPI NVS
Nov 18 20:57:41 wombat11k kernel: BIOS-e820: [mem 0x000000000400f000-0x0000000009df1fff] usable
Nov 18 20:57:41 wombat11k kernel: BIOS-e820: [mem 0x0000000009df2000-0x0000000009ffffff] reserved
Nov 18 20:57:41 wombat11k kernel: BIOS-e820: [mem 0x000000000a000000-0x00000000981bcfff] usable

This file has been truncated.

or increase logging from this point forward and see if it happens again, and provide logs if it does.

If there’s a setting you’d like me to change, let me know.

Run an on-boot memory test and some stress tests on your system while stressing the nix daemon, and see if you can recreate it?

I’ll see what I can do.

I presume you have the /nix/store on BTRFS, but i read that this corruption also happened with other filesystems; can you tell me what they were?

Yes, /nix/store is currently on BTRFS. The other disks all used ext4.

Is there anything unusual about your system? Is the store mounted over a network, or something strange like that? Is it a remote builder? What are its load characteristics, both for general tasks and for the nix daemon?

No, it’s pretty plain vanilla, apart from the fact that I’m on “unstable”. But the problem has been happening for about 1.5 years. It’s my personal machine. It’s almost never under any real CPU load. Plenty of disk space.

do you think it may be a disk or SSD failure, which is incurring bit rot?

I don’t think so. I replaced the SSD drive, but the problem persisted. Then I thought perhaps there was a problem with the M.2 interface on my mobo, so I switched to a SATA drive for a while. The problem always returns. Here’s the SMART info for my current drive.

❯ sudo smartctl -a /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.62] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 980 PRO 2TB
Serial Number:                      S69ENF0RA56775J
Firmware Version:                   3B2QGXA7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            118,522,916,864 [118 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 ba11b57918
Local Time is:                      Thu Nov 23 16:42:25 2023 GMT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.49W       -        -    0  0  0  0        0       0
 1 +     4.48W       -        -    1  1  1  1        0     200
 2 +     3.18W       -        -    2  2  2  2        0    1000
 3 -   0.0400W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    2,594,139 [1.32 TB]
Data Units Written:                 9,487,237 [4.85 TB]
Host Read Commands:                 45,123,335
Host Write Commands:                128,415,792
Controller Busy Time:               1,914
Power Cycles:                       271
Power On Hours:                     332
Unsafe Shutdowns:                   201
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               47 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Read Self-test Log failed: Invalid Field in Command (0x002)

Can you reproduce it, or do you know a series of steps that can reproduce this failure?

No. I rebuild my machine; everything’s good for a few months, and then the corruption returns. I don’t see any pattern of actions or events leading up to the problem.

can you provide your configuration.nix for review?

Yes, my config is available here: GitHub - mhwombat/nixos-config: My NixOS configuration

mhwombat · November 23, 2023, 5:04pm

Fortunately I have regular backups, and most of my projects are in GitHub or Codeberg repos.

Hardware-wise, I feel I’ve ruled out everything but the mobo itself. Replacing it is on my TODO list. (I built the system about 3 years ago.) But it’s only ever the Nix store that gets corrupted, never any of my other files.

BTW, I’m not overclocking the CPU, or using any fancy BIOS settings.

nixinator · November 23, 2023, 5:15pm

yeah, I’d like to get you stable…

nixos unstable is not as unstable as you might think.

i wish they had called it nixos-rock-and-rolling.

However, i digress. Maybe it’s an interaction between BTRFS and the nix-daemon; did you say it also got corrupted on ext4 before you changed to BTRFS?

what version of nix / nix daemon are you running?

run the memory test, both from bios boot and in the system, and do the stress testing, i.e. stress test everything while doing multiple nix builds / rebuilds etc. There are a number of disk, network and cpu stress testers in nix; run all of them concurrently?

If you can reproduce it, then filing an issue might be a good idea…

i’ve only ever run the nix daemon on ext4 and zfs…

I wonder if it could be thermal. I presume you’ve looked at the health of the SSD to see if it’s losing blocks or having to relocate sectors all the time.

It should be trim enabled, but i think all modern kernels do that now by default on SSDs.

cursed motherboard… what is it? sometimes a bios update actually fixes deep memory / pci / cpu bus timing voodoo… believe it or not.

even if your system is not overclocked, if you’re not running a matched memory pair, or the bios has not read the SPD metadata correctly from the ram chips, you can still have problems.

what memory do you have? i run all my systems with ECC ram, because i’ve been bitten too many times by ram problems…

do you think it may be related to thermal overload of some component? are you tracking your thermals?

i’ll do some more research.

mhwombat · November 23, 2023, 5:32pm

Yes, it corrupted regularly even before I changed to BTRFS.

I did wonder if it was a thermal issue, so I removed and cleaned the CPU and heat sink/fans. I examined it with a magnifying glass, and found a tiny fibre. I reapplied thermal paste and re-installed everything. This was probably 6 months ago.

Off to run stress tests.

nixinator · November 23, 2023, 5:39pm

hmmm…

can you run a dmidecode as well, and dump that…

also, what motherboard is it, and which revision?

Hmmm…

the plot thickens…could be haunted or cursed hardware…

but maybe not.

get a backup (sounds like you’ve got that covered), and hammer the system to near death and see if you can reproduce it.

When i get a new machine, i set it up as a hydra builder for a week; if it can survive that, then it becomes ‘good enough to use’.

LOL!

mhwombat · November 23, 2023, 5:44pm

Mobo: TRX-40 Aorus Master (Gigabyte)
CPU: AMD Threadripper 3960x

nixinator · November 23, 2023, 5:46pm

sifting through your config, i see

package = pkgs.Unstable;

so not only are you running unstable, but you’re also running a newer version of the nix tooling.

If you’re not requiring the cutting-edge features… this may help?

They might actually be the same, but i feel that they are different as nixpkgs and nix (the tool and daemon) are tracked and ‘released’ independently.

nixinator · November 23, 2023, 5:49pm

well, who would have thought it, that is exactly what i run ':-).

and it works well, i can’t say i’ve had any corruption, but i tend to run stable, and only use a cutting-edge kernel if my hardware needs it; i cherry-pick certain packages from unstable when i need them.

I really like your configuration repo… it’s one of the cleanest and neatest i’ve seen.

I hope we can get to the bottom of this.

mhwombat · November 23, 2023, 6:07pm

LOL! Three years ago, this little old lady spent her 60th birthday building the Wombat 11K. Which was the successor to the Wombat 7000, which replaced the Wombat 5000… I’ve lost track of how many systems I’ve built over the years. Not that I’m good at it – in fact, every time I build a system it seems like they’ve changed some fundamental things, so I never feel like I really know what I’m doing. But I still remember having to deconflict IRQs!

I also replaced the power supply earlier this year, just in case that was an issue.

So…

❯ stress-ng --cpu 20 --iomix 20 --vm 4 --vm-bytes 128M --fork 20 --timeout 300s
stress-ng: info:  [2039819] setting to a 5 mins, 0 secs run per stressor
stress-ng: info:  [2039819] dispatching hogs: 20 cpu, 20 iomix, 4 vm, 20 fork
stress-ng: info:  [2039819] skipped: 0
stress-ng: info:  [2039819] passed: 64: cpu (20) iomix (20) vm (4) fork (20)
stress-ng: info:  [2039819] failed: 0
stress-ng: info:  [2039819] metrics untrustworthy: 0
stress-ng: info:  [2039819] successful run completed in 5 mins, 0.11 secs

Meanwhile I watched the output of journalctl -f. No error messages.

t0m · November 23, 2023, 6:19pm

Hi,
I have definitely had an issue on a system before where certain files (and the file system entirely) would end up becoming corrupted because the memory was faulty. The nix db may simply be more likely to get corrupted because it’s a bigger file that is loaded into memory and modified frequently. You should try doing a memtest. It’s easy to install on NixOS if you have either GRUB or systemd-boot:

boot.loader.grub.memtest86.enable = true;

or

boot.loader.systemd-boot.memtest86.enable = true;

Then boot into it and simply wait for it to finish.

bbigras · November 23, 2023, 8:04pm

I skimmed the thread so apologies if it was already mentioned, but don’t let your disk fill up to 100% if you use btrfs, or you can get corruption.

I had my Nix db corrupted multiple times because of this.
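
A small guard against the full-disk scenario, as a sketch only: it reads the usage percentage from POSIX `df -P` and complains above a threshold. The function name and the 90% default are arbitrary choices; on btrfs the df figure is only an approximation, and `btrfs filesystem usage` gives a more detailed breakdown.

```shell
#!/usr/bin/env bash
# Sketch: warn when the filesystem holding a path is nearly full.
check_fs_usage() {
  local path="$1" limit="${2:-90}" used
  # Column 5 of `df -P` is the used capacity, e.g. "42%".
  used="$(df -P "$path" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')"
  # Bail out if df produced nothing usable.
  [ -n "$used" ] || return 2
  if [ "$used" -ge "$limit" ]; then
    echo "WARNING: $path is ${used}% full (limit ${limit}%)"
    return 1
  fi
  echo "OK: $path is ${used}% full"
}
```

Something like `check_fs_usage /nix 90 || echo "free some space before rebuilding"` could run before a large rebuild or garbage collection.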

mhwombat · November 23, 2023, 8:07pm

Running the test now, and it’s reporting some errors. When the test finishes, guess I’ll be removing memory, checking for cat hair/other debris, blowing compressed air on the socket, and reseating the memory. If that doesn’t do the trick, I may have to replace the suspect memory card.

I would like to thank everyone on this thread; you went above and beyond to help me diagnose what you (rightly) suspected was not a Nix issue.

t0m · November 23, 2023, 8:46pm

Glad to hear that you were able to get to the root of the problem!
If cleaning doesn’t help maybe you can still isolate the faulty RAM stick by testing again with just a single one and adding more until the issue comes back.

Good luck!

nixinator · November 24, 2023, 6:44am

may the wombat 7000 fly again!

Here’s to the wombat 1 million flying in the future!

I know ECC ram is expensive, but this mobo will take it and it can help with memory churn problems and even cosmic rays…

once you read this, about how our external environment can affect the closed world of your machine, you’ll never install non-ECC ram again. You’ll also install your machine in a cave or nuclear bunker LOL!

The rate of change in PC hardware is off the scale, which is a good thing as you get more power for your $, but the abstractions change rapidly and widely every revision. So it’s amazing how you go from hero to zero in what seems a matter of months! LOL.

Fun fact and history: I actually got fired once for suggesting a user run the linux mem test program on a windows machine. The management said that open source software was not allowed and i was not to recommend either linux or linux diagnostic boot disks as it ‘would confuse the customer’… Only windows could be recommended… in fact windows at the time lacked any tools, nothing was part of the base operating system, you got windows, and that was it… You paid for the OS… was the lack of diagnostic tools by design or by accident?

The customer did what i said, found that the memory was bad, and got a replacement memory, and it SOLVED THE PROBLEM.

The unwritten and secret narrative of the company was not to make the customer successful, but to fob the customer off with lies, avoid returns at all costs, try and blame the user for as many errors as possible, and extract more money from them… this was after getting the customers to buy an ‘expensive warranty’ on the machine.

So, remember folks, i got fired for using linux … that wasn’t the only time i got into corporate hot water for using open source…

I was very young, i started to realise all the free unixes, would become the hottest thing on the planet…

If unix was easier to configure, update, and had a better GUI experience…

and that’s where guix comes in.

2024 is going to be interesting…

2038 more so. (if we make it).