Importing nixpkgs stats unneeded packages

Why does nixpkgs stat every package, even packages that are not dependencies of the specified expression?

For example, hello is a very simple package and certainly does not depend on the rogue-like game angband and yet angband (along with every other package) gets stat’ed when evaluating hello:

$ strace -f nix eval nixpkgs#hello |& grep angband
[pid 98169] newfstatat(AT_FDCWD, "/nix/store/shgjjmqr3hfy50dd1xlxf4hzdz1fl0v2-source/pkgs/by-name/an/angband", {st_mode=S_IFDIR|0555, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0

To be clear, I know that most packages are added automatically by reading directories, and I do expect nixpkgs to read the two-character an prefix directory to identify all of the available an* packages. However, I can’t work out why it would need to read, stat, or in any way interact with the full an/angband path.

Judging from strace, nixpkgs does about twenty thousand of these ‘superfluous’ stats and some rudimentary profiling suggests that whatever is doing all this extra work may be adding several hundred milliseconds of evaluation time to import nixpkgs, so I’d like to understand if it’s strictly necessary and, if not, what I can do to avoid it.

The code that processes by-name uses builtins.readDir, which not only lists file and directory names within the given directory, but also distinguishes their type (file or directory), which requires stating them individually. There isn’t a nix language primitive that will read a directory without stating its entries, I’m afraid.

Given that the nix language is lazy, the file type could be left undetermined until required, but extra thunks have a cost, too. It’s not clear that would actually be beneficial.

“which requires stating them individually”

readDir does appear to have a fallback branch that individually stats each file if the operating system or file system doesn’t include the type directly in directory entries, but I don’t think that code will ever execute on my system.

I’m using Linux ext4 which includes the file type directly in the directory entries. I also confirmed this by dumping the directory entries using debugfs (see the third column):

$ sudo debugfs /dev/mapper/nvme0n1p3_crypt -R "ls -l $PWD" | cat
debugfs 1.47.2 (1-Jan-2025)
 24399849   40775 (2)   1000   1000    4096 30-Jun-2025 02:03 .
 24407103   40775 (2)   1000   1000   77824 30-Jun-2025 02:01 ..
 24399878  100664 (1)   1000   1000       0 30-Jun-2025 02:01 file
 24399879   40775 (2)   1000   1000    4096 30-Jun-2025 02:01 dir
 24399792   40775 (2)   1000   1000    4096 30-Jun-2025 02:01 dir2
 24399812  100664 (1)   1000   1000       0 30-Jun-2025 02:01 file2
 24399884  120777 (7)   1000   1000       4 30-Jun-2025 02:03 filelink
 24399885  120777 (7)   1000   1000       3 30-Jun-2025 02:03 dirlink

Therefore, the type should already exist in the directory entries returned by PosixSourceAccessor::readDirectory, and readDir should follow the normal branch, since it has no need to stat each file individually.
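To illustrate the point about directory entries carrying their own type, here is a minimal sketch that lists a directory using only readdir and the d_type field, with no stat or lstat call. The helper name listTypesFromDirent is hypothetical and this is not code from Nix. It assumes Linux with glibc, where d_type and the DT_* constants are available.

```cpp
#include <dirent.h> // opendir, readdir, closedir, DT_* (assumes Linux/glibc)
#include <map>
#include <string>

// Hypothetical helper for illustration only (not taken from Nix).
// Lists `dir` and reports each entry's type as recorded in the dirent itself.
// No stat/lstat call is made for any entry.
std::map<std::string, std::string> listTypesFromDirent(const std::string & dir)
{
    std::map<std::string, std::string> result;
    DIR * d = opendir(dir.c_str());
    if (!d)
        return result; // sketch only: real code would report errno

    while (dirent * e = readdir(d)) {
        std::string name = e->d_name;
        if (name == "." || name == "..")
            continue;
        switch (e->d_type) {
        case DT_REG: result[name] = "regular"; break;
        case DT_DIR: result[name] = "directory"; break;
        case DT_LNK: result[name] = "symlink"; break;
        // The file system did not supply a type; only in this case would a
        // caller have to fall back to lstat() on the entry.
        case DT_UNKNOWN: result[name] = "unknown"; break;
        default: result[name] = "other"; break;
        }
    }
    closedir(d);
    return result;
}
```

Running a program built around this helper under strace should show getdents64 calls for the directory but no per-entry stat calls, as long as the file system fills in d_type.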

This makes me think it’s not readDir causing the stats but rather something else…?

Huh. I stand corrected. Looking at the code for the by-name overlay doesn’t reveal any clear culprits either. The only thing I can guess is that nix is stating the directories when path-type values are created within those directories, but that would be strange, as nix is perfectly fine with nonexistent path-type values until you try to use them.

Eh, I’ve looked into this and it looks like DirectoryIterator is holding std::filesystem wrong.
Here (nix/src/libutil/posix-source-accessor.cc at acfdacc971bb411bb8b85a05a37b6fc7330c4370 in NixOS/nix on GitHub) it’s calling symlink_status() unconditionally, which always translates to a stat call, bypassing any caching that directory_entry might do by saving the file type from dirent (see the libstdc++ implementation).

The C++ standard proposal that actually added the caching mechanisms seems to be this paper: Directory Entry Caching

So the only actual way to make use of the value cached from dirent is to use the myriad of accessor functions is_*. SMH, it’s times like these that make me hate the C++ standard library with a burning passion.

Thank you so much for this find!

I have a draft patch that reworks this to use the caching API, and I’m seeing a speedup in the ballpark of 25% in eval time for the hello package: down from 230 ms → 179 ms. I’ll push a PR in a bit.

PR with a fix: libutil: Use caching `directory_entry` API in `PosixSourceAccessor::r… by xokdvium · Pull Request #13412 · NixOS/nix · GitHub
