No hashes starting with 'e', 't', 'o' or 'u' in /nix/store?

e, t, a, o are the top four letters by frequency in English, so this seems to be a strategy for reducing the possibility of spelling any words, except that a was replaced by u… You already know why.
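The exclusion is easy to verify against the base-32 alphabet Nix uses for store hashes (as defined in its libutil, if I read it right); a quick check of which lowercase letters are missing:

```shell
# Nix's base-32 alphabet drops exactly four letters; find out which ones
nix32="0123456789abcdfghijklmnpqrsvwxyz"
for c in {a..z}; do
  case "$nix32" in
    *"$c"*) ;;          # letter is part of the alphabet
    *) echo "$c" ;;     # letter is excluded
  esac
done
# prints: e, o, t, u (one per line)
```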

Here’s an amusing list of swearwords, many of which can show up in hashes :wink:

Found a nice example of finding hashes with words you prefer: the tool “masscan” asks for donations to a bitcoin wallet with the word “MASSCAN” in it https://github.com/robertdavidgraham/masscan#masscan-mass-ip-port-scanner

Hmm…

$ curl https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt | wc -l
451
$ curl https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt | grep '[0123456789abcdefghijklmnopqrstuvwxyz]' | wc -l
451
$ curl https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt | grep -v '[eotu]' | wc -l
77
$ curl https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt | grep -v '[aket]' | wc -l
69
$ curl https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt | grep -v '[aeot]' | wc -l
67
$ curl https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt | grep -v '[aeio]' | wc -l
52
$ curl https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt | grep -v '[aiou]' | wc -l
32

So excluding e, o, t, u is pretty good, but you get fewer matches by excluding just a, i, o, u. Maybe there are some Dutch swears with a bunch of t’s I don’t know about, though.

$ curl -s https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt > badwords.txt &&
  for a in {{a..z},{0..9}}; do
    for b in {{a..z},{0..9}}; do
      for c in {{a..z},{0..9}}; do
        for d in {{a..z},{0..9}}; do
          echo -e $a$b$c$d\\t$(grep -v "[$a$b$c$d]" badwords.txt | wc -l)
        done
      done
    done
  done | sort -nrk2
Tezos also does this with protocol changes. For instance the codename is carthage and the protocol hash is “PtCarthavAMoXqbjBPVgDCRd5LgT7qqKWUPXnYii3xCaHRBMfHH” [1]

Probably the best approach would be to remove all the vowels aeiouy and add some punctuation to get back to 32 characters. Most languages don’t have words without vowels. - and _ should be allowed in all filesystems.

Of course then you can still have a file hash like grr-_-grr-_-grr :wink:
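As a quick sanity check of the arithmetic: dropping the six vowels aeiouy from the 26 letters leaves 20 consonants, and adding the 10 digits plus “-” and “_” brings the alphabet back to exactly 32 characters:

```shell
# 26 letters minus the 6 vowels (aeiouy), plus digits, '-' and '_'
consonants=$(printf '%s' {a..z} | tr -d 'aeiouy')
alphabet="0123456789${consonants}-_"
echo "${#alphabet}"   # 10 + 20 + 2 = 32
```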

> Probably the best approach would be to remove all the vowels aeiouy and add some punctuation to get back to 32 characters. Most languages don’t have words without vowels. - and _ should be allowed in all filesystems.
>
> Of course then you can still have a file hash like grr-_-grr-_-grr :wink:

fsckn-btrfs !

I guess this would be a partially incompatible change anyway, so we may as well move to a multi-level directory structure, and then we can even afford slightly longer hashes while we are at it.

At the same time, l33tsp34k teaches us o=0, i=1, e=3, a=4, t=7, so the majority of the vowels, and the currently excluded t, are back anyway.

(I guess for thesis purposes, having a literal swear word show up at an official demonstration is a small risk but awkward if it happens; with l33tsp34k cursing, on the other hand, everyone involved can always plausibly pretend not to notice that the hash is readable.)

Curses, foiled again :wink: c4f3b4b3_b00b13s_d34dc4t_etc is indeed valid under that scheme.
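The l33t mapping is just a character substitution, so `tr` can reveal what such a hash spells (using the o=0, i=1, e=3, a=4, t=7 mapping from above):

```shell
# de-leet a hash fragment: 0->o, 1->i, 3->e, 4->a, 7->t
echo "d34dc4t" | tr '01347' 'oieat'   # prints: deadcat
```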

to go on a tangent: why would you want a multi-level directory structure? Modern filesystems can handle millions of files in a single directory just fine, they use a tree structure internally. By using a multi-level directory structure you’re actually making that tree structure less efficient.

> to go on a tangent: why would you want a multi-level directory structure? Modern filesystems can handle millions of files in a single directory just fine, they use a tree structure internally. By using a multi-level directory structure you’re actually making that tree structure less efficient.

This would probably be true if Nix didn’t insist on a 0o555 store.

From time to time some program decides to readdir() the store, and maybe also stat() each result, and it is a bit annoying to keep track of what to avoid doing so as not to trigger such behaviour.

(One example is Zsh Tab completion that has a very convenient optional feature that also happens to readdir() all directories along the path being completed)

:thinking: actually, I think that’s a good thing. Regular services shouldn’t try to readdir() /nix/store, and if an interactive shell hangs because it’s listing a huge directory, that’s a bug that also slows down that shell in large directories…

> :thinking: actually, I think that’s a good thing. Regular services shouldn’t try to readdir() /nix/store, and if an interactive shell hangs because it’s listing a huge directory, that’s a bug that also slows down that shell in large directories…

Large directories are completely avoidable; there are many convenient things that require enumerating some directory (not just the Nix store); Tab-completing a Nix store path by the first 2 to 4 characters of its hash is sometimes convenient; and it is supposed to be possible to readdir() /nix/store, as evidenced by Nix explicitly asserting that the store has permissions that allow enumeration.

Hmmm still not convinced…

  • On my servers and laptop I seem to have several thousand files, which tab-complete quickly
  • On hydra storage, I would assume that if you’re looking for a certain hash it’s a quick copy and paste away
  • If you want to quickly access one of several perl packages, there could be an additional directory with a directory for each name in the store, filled with symlinks to the actual packages. So you’d visit /nix/links/perl/<tab><tab> to see all the ones you want; the symlinks could even have embedded data like the install date in their name
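That last idea can be sketched with fake data (the paths, hashes and version strings here are all made up; a real /nix/store entry is <hash>-<name>-<version>):

```shell
# Sketch: build a by-name symlink farm from a store-like directory
store=./store
links=./links
mkdir -p "$store" "$links"
touch "$store/abcd1234-perl-5.36.0" "$store/ef567890-perl-5.38.2"

for p in "$store"/*; do
  base=${p##*/}        # <hash>-<name>-<version>
  name=${base#*-}      # drop the hash prefix  -> <name>-<version>
  name=${name%-*}      # drop the version part -> <name>
  mkdir -p "$links/$name"
  ln -sfn "$PWD/${p#./}" "$links/$name/$base"
done

ls "$links/perl"       # lists both fake perl "store paths"
```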

OTOH, if we switch to a multi-layer store, all packages need rebuilding, and there will be lots of almost-empty directories. On one of my servers, looking at the distribution of the first 2 letters by running

(cd /nix/store; ls | cut -c1-2) | sort | uniq -c | sort -n

gives me 998 buckets, with the biggest having 12 items, and 76 directories with a single item.
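For context on those numbers: the first two characters of a base-32 hash give 32 × 32 = 1024 possible buckets, so 998 used buckets means the prefix space is nearly full even at a few thousand paths (the path count below is a made-up round figure, not taken from that server):

```shell
# Rough sanity check; N is a hypothetical store-path count
N=3800
echo $((32 * 32))           # possible two-character prefixes: 1024
echo $((N / (32 * 32)))     # average paths per bucket (integer division): 3
```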

Your Nix store tab-completes fast? Which shell do you use?

Just plain bash, but I use SSD everywhere. On macOS I’ve been forced onto zsh apparently.

Ah, SSD. That makes sense. I guess my drives will keep spinning every time I hit tab in that folder for now…

> Hmmm still not convinced…
>
> • On my servers and laptop I seem to have several thousand files, which tab-complete quickly

So you want to say that not only is NixOS now SSD-only (via completely garbage journalctl behaviour in some cases), but Nix is intended to be so, too?

> OTOH, if we switch to a multi-layer store, all packages need rebuilding,

We have a stdenv rebuild multiple times per NixOS release; look at the staging-next merges.

> there will be lots of almost empty directories. On one of my servers, looking at the distribution with 2 letters by running
>
> (cd /nix/store; ls | cut -c1-2) | sort | uniq -c | sort -n
>
> gives me 998 buckets, with the biggest having 12 items, and 76 directories with a single item.

We have so many symlinks, inode consumption by directories will not be that bad in comparison

Since everything accesses the store all the time, I’d expect the hot parts of the tree to be in memory.

Besides that, you can make a hybrid drive very cheaply with bcachefs.

My concern is not really with inode consumption, but mostly with making the hot lookups more inefficient. Maybe that’s fairly academic though.

> Since everything accesses the store all the time, I’d expect the hot parts of the tree to be in memory.

No, not enough stuff will be cached even for a simple readdir()

> Besides that, you can make a hybrid drive very cheaply with bcachefs.

Hybrid drive? This assumes having a slot for an extra drive, right?

> My concern is not really with inode consumption, but mostly with making the hot lookups more inefficient. Maybe that’s fairly academic though.

Hm, I operated under the assumption that for hot lookups all inodes and the relevant subset of directory contents would be in cache, and that a single cached-directory traversal is negligible compared with a full syscall (and our stdenv could indeed be improved re: the number of syscalls per dlopen). But yes, thanks for the clarification; I have not actually checked this, and I may be wrong here.

Well, I’m not sure either. I can imagine it going both ways for various layouts and filesystems. It would be worthwhile to try a few layouts on a very memory-constrained system.

I have a Fusion Drive on this computer, and tab-completing /nix/store/foo in fish took maybe 2 seconds to give me the 336 results (about half a second on retry). And of course fish is doing a substring match here, so it’s definitely reading the whole directory. Heck, printf '%s\n' /nix/store/* | wc -l also only took about half a second and gave me 58798 results (though obviously this was a warm test).

My guess is that the filesystem metadata is stored on the SSD part of the Fusion Drive, but I can’t tell you for sure. I’ve got a 121.3 GB SSD coupled with a 2 TB HDD.
