With the following PR I'm trying to introduce support for (IOMMU-backed) huge memory pages for tmpfs (first step: /tmp backed by tmpfs).
NixOS:master ← paepckehh:tmpfs-huge, opened 11:18 PM, 05 May 2025 (UTC)
Introduce new option to boot.tmp (tmpfs)
`config.boot.tmp.tmpfsHugeMemoryPages = { never | always | within_size | advise }`
Rationale:
NixOS makes heavy use of tmpfs (including /tmp, ...), but with the current default setup and mount options (4 KiB memory pages) it does not benefit from modern memory controllers, which perform better with larger memory chunks. Letting users mount the /tmp tmpfs with pages larger than 4 KiB allows (depending on the workload) a large reduction in CPU load and a nice bump in throughput.
Linux tmpfs currently supports four allocation policies for huge pages (the `huge=` mount option):
- never => Do not allocate huge memory pages. This is, and remains, the default.
- always => Attempt to allocate a huge memory page every time a new page is needed.
- within_size => Only allocate huge memory pages if they will be fully within i_size. Also respects madvise(2) hints.
- advise => Only allocate huge memory pages if requested with madvise(2).
This patch introduces a new option to configure the tmpfs mount options for /tmp.
It defaults to "never" and therefore introduces no change by default!
The recommended setting is "within_size" (do not waste any memory, but use huge pages when possible).
The pure performance setting is "always". A sane future default would be "within_size".
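A minimal sketch of how this would look in a NixOS configuration, assuming the option name and values exactly as proposed in this PR:

```nix
{
  # /tmp on tmpfs (existing NixOS option, unchanged by this PR).
  boot.tmp.useTmpfs = true;

  # Option proposed here: allocate huge pages for the /tmp tmpfs whenever
  # the allocation fits within i_size. Accepted values: "never" (default),
  # "always", "within_size", "advise".
  boot.tmp.tmpfsHugeMemoryPages = "within_size";
}
```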
Benchmark impact of "never" (default) vs. "within_size":
```
# + Results: Ryzen 5 7530U / outdated DDR4 / low-to-mid range laptop (2023)
# + (nixpkgs/master_20250505: pkgs.linuxPackages_latest)
# + fd huge /sys/devices
# + /sys/devices/system/node/node0/hugepages/hugepages-2048kB
# + /sys/devices/system/node/node0/hugepages/hugepages-1048576kB
# + needs PR #404514
# + default: boot.tmp.tmpfsHugeMemoryPages = "never"; (rw,nosuid,nodev)
# + hugepages: boot.tmp.tmpfsHugeMemoryPages = "within_size"; (rw,nosuid,nodev,huge=within_size)
#
# make disk
# ( touch /tmp/test && rm /tmp/test && dd if=/dev/zero of=/tmp/test oflag=direct bs=16M count=256 && rm /tmp/test )
# default: 4294967296 bytes (4,3 GB, 4,0 GiB) copied, 2,04209 s, 2,1 GB/s
# hugepages: 4294967296 bytes (4,3 GB, 4,0 GiB) copied, 1,09119 s, 3,9 GB/s
#
# make hyperdisk
# hyperfine --warmup 2 "make disk"
# Benchmark 1: make disk
# default: Time (mean ± σ): 2.598 s ± 0.029 s [User: 0.004 s, System: 2.571 s]
# hugepages: Time (mean ± σ): 1.186 s ± 0.010 s [User: 0.005 s, System: 1.174 s]
```

With a more advanced memory controller (aarch64 / Apple Silicon / NUMA x86_64), the difference should be even larger.
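For anyone who wants to reproduce these numbers without the PR applied, a plain `fileSystems` entry with the stock tmpfs `huge=` mount option should give an equivalent mount; the path `/mnt/hugetmp` and the 8G size are only illustrative choices for a scratch benchmark mount, not part of the PR:

```nix
{
  # Scratch tmpfs mount for benchmarking only; huge=within_size is the
  # kernel's own tmpfs mount option that this PR wires up for /tmp.
  fileSystems."/mnt/hugetmp" = {
    device = "tmpfs";
    fsType = "tmpfs";
    options = [ "rw" "nosuid" "nodev" "size=8G" "huge=within_size" ];
  };
}
```

Running the same dd / hyperfine commands against that mount point should show whether a given machine benefits.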
## Things done
- Built on platform(s)
  - [X] x86_64-linux
  - [ ] aarch64-linux
  - [ ] x86_64-darwin
  - [ ] aarch64-darwin
- For non-Linux: Is sandboxing enabled in `nix.conf`? (See [Nix manual](https://nixos.org/manual/nix/stable/command-ref/conf-file.html))
  - [ ] `sandbox = relaxed`
  - [X] `sandbox = true`
- [X] Tested, as applicable:
  - [NixOS test(s)](https://nixos.org/manual/nixos/unstable/index.html#sec-nixos-tests) (look inside [nixos/tests](https://github.com/NixOS/nixpkgs/blob/master/nixos/tests))
  - and/or [package tests](https://github.com/NixOS/nixpkgs/blob/master/pkgs/README.md#package-tests)
  - or, for functions and "core" functionality, tests in [lib/tests](https://github.com/NixOS/nixpkgs/blob/master/lib/tests) or [pkgs/test](https://github.com/NixOS/nixpkgs/blob/master/pkgs/test)
  - made sure NixOS tests are [linked](https://github.com/NixOS/nixpkgs/blob/master/pkgs/README.md#linking-nixos-module-tests-to-a-package) to the relevant packages
- [ ] Tested compilation of all packages that depend on this change using `nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"`. Note: all changes have to be committed, also see [nixpkgs-review usage](https://github.com/Mic92/nixpkgs-review#usage)
- [X] Tested basic functionality of all binary files (usually in `./result/bin/`)
- [25.05 Release Notes](https://github.com/NixOS/nixpkgs/blob/master/nixos/doc/manual/release-notes/rl-2505.section.md) (or backporting [24.11](https://github.com/NixOS/nixpkgs/blob/master/nixos/doc/manual/release-notes/rl-2411.section.md) and [25.05](https://github.com/NixOS/nixpkgs/blob/master/nixos/doc/manual/release-notes/rl-2505.section.md) Release notes)
  - [ ] (Package updates) Added a release notes entry if the change is major or breaking
  - [ ] (Module updates) Added a release notes entry if the change is significant
  - [ ] (Module addition) Added a release notes entry if adding a new NixOS module
- [X] Fits [CONTRIBUTING.md](https://github.com/NixOS/nixpkgs/blob/master/CONTRIBUTING.md).
---
In my lab I can reproduce a (free) 2x speed bump, as well as a significant drop in CPU load, for large sequential data transfers from/to the tmpfs-backed /tmp across all systems.
```
# ( touch /tmp/test && rm /tmp/test && dd if=/dev/zero of=/tmp/test oflag=direct bs=16M count=256 && rm /tmp/test )
# default: 4294967296 bytes (4,3 GB, 4,0 GiB) copied, 2,04209 s, 2,1 GB/s
# hugepages: 4294967296 bytes (4,3 GB, 4,0 GiB) copied, 1,09119 s, 3,9 GB/s
```
But I'm not confident about the speed and compatibility impact on other hardware platforms. That's why my PR currently adds a new option but leaves it off by default.
Huge memory page support was introduced to Linux in the 2.5/2.6 kernel series and has been used successfully for enterprise database performance tuning, so I would consider this part of the memory allocator quite stable and battle-proven by now.
QEMU does not help with validation, because it does not correctly emulate the IOMMU hardware of the target platform.
AArch64 (e.g. Apple Silicon, Ampere), RISC-V 64, and NUMA x86_64 would be very interesting.
But results from any other hardware with huge page / IOMMU support would be welcome as well.
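If you would like to test this on /tmp itself before the PR lands, one hedged approach is to define the mount manually with the same `huge=` option; this sketch assumes `boot.tmp.useTmpfs` stays disabled so the manual entry does not collide with the module's own tmp.mount unit:

```nix
{
  # Temporary stand-in for boot.tmp.useTmpfs with huge pages enabled;
  # drop this again once boot.tmp.tmpfsHugeMemoryPages is available.
  boot.tmp.useTmpfs = false;
  fileSystems."/tmp" = {
    device = "tmpfs";
    fsType = "tmpfs";
    options = [ "rw" "nosuid" "nodev" "mode=1777" "huge=within_size" ];
  };
}
```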
Thank you!
This sounds quite useful. I’ve got a couple of x86_64 NUMA systems that I’ll spot-check this on.
I’ll look into this on Ampere Altra Max M128-26