Hi all,
I finally decided to start experimenting with llama.cpp. On my M4 MBP (64 GB), I’m seeing significantly higher tokens/sec from a source build than from the packaged version, even though both report the Metal and BLAS backends. Why might that be?
The source build uses this script:
#!/usr/bin/env nix-shell
#!nix-shell -i bash -p bash --pure
#!nix-shell -p cmake
#!nix-shell -p git
#!nix-shell -p llvmPackages.openmp
#!nix-shell -p openssl
set -Eeuf -o pipefail
set -x
main() {
  # Clean configure + Release build; llama.cpp's CMake defaults
  # enable the Metal backend and Accelerate on Apple platforms,
  # which is where the MTL,BLAS backends in the tables come from.
  rm -rf ./build
  cmake -B build
  cmake --build build --config Release -j "$(nproc)"
}
main "$@"
Benchmark results (logs intentionally kept out of code blocks so the md tables render):
$ build/bin/llama-bench \
-o md \
-m ~/Library/Caches/llama.cpp/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 55662.79 MB
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | pp512 | 803.48 ± 8.01 |
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | tg128 | 45.88 ± 0.52 |
build: 01d8eaa28 (8054)
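(One caveat before the comparison: the nixpkgs binary below turns out to be at an older commit, 8872ad2 / build 7966, versus 01d8eaa28 / 8054 here, so part of any gap could be upstream changes rather than packaging. To isolate that, I could pin the source checkout and rerun the build script above:)
# Sketch: check out the commit nixpkgs packaged, then rebuild via
# the nix-shell script above ("build.sh" is a hypothetical name for
# it) so both binaries sit at the same upstream commit.
git checkout 8872ad2
./build.sh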
Using the nixpkgs version instead:
$ nix-shell \
-I nixpkgs=flake:github:nixos/nixpkgs \
-p 'llama-cpp.override { blasSupport = true; }' \
--run \
'llama-bench -o md -m ~/Library/Caches/llama.cpp/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf'
unpacking ‘github:nixos/nixpkgs/a5b1db765309855b657b52ac29170e7898c9c96a’ into the Git cache…
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.005 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 55662.79 MB
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | pp512 | 622.26 ± 28.07 |
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | tg128 | 41.60 ± 0.07 |
build: 8872ad2 (7966)
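To see which flags the nixpkgs derivation above actually configures with, something like this should dump them (a sketch; it assumes the flags sit in the conventional cmakeFlags attribute):
# Sketch: print the CMake flags the nixpkgs llama-cpp build uses,
# to diff against the source build's defaults.
nix eval --json nixpkgs#llama-cpp.cmakeFlags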
I’ve repeated the runs a few times and the gap seems consistent: ~800 vs ~620 t/s on pp512 is a substantial difference (tg128 shows a smaller one, ~46 vs ~42 t/s). Are there build flags we could use to improve the version in nixpkgs?
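If it turns out nixpkgs disables native CPU tuning (distro packages often do, for reproducibility; GGML_NATIVE is the upstream toggle for it), here’s an untested sketch of forcing it back on via overrideAttrs:
# Untested sketch: append -DGGML_NATIVE=ON to the nixpkgs package's
# cmakeFlags. Whether nixpkgs currently sets this OFF is an
# assumption worth verifying (e.g. with the nix eval above).
nix-shell \
  -I nixpkgs=flake:github:nixos/nixpkgs \
  -p '(llama-cpp.override { blasSupport = true; }).overrideAttrs (old: {
        cmakeFlags = (old.cmakeFlags or []) ++ [ "-DGGML_NATIVE=ON" ];
      })' \
  --run 'llama-bench -o md -m ~/Library/Caches/llama.cpp/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf'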