Llama.cpp on Apple Silicon -- native build much faster than nixpkgs

Hi all,

I finally decided to start experimenting with llama.cpp. On my M4 MacBook Pro (64 GB), I’m seeing significantly faster tokens/sec when building from source than with the packaged version, even though both use Metal and BLAS. Why might that be?

Source build using this:

#!/usr/bin/env nix-shell
#!nix-shell -i bash -p bash --pure
#!nix-shell -p cmake
#!nix-shell -p git
#!nix-shell -p llvmPackages.openmp
#!nix-shell -p openssl

set -Eeuf -o pipefail
set -x

main() {
  rm -rf ./build
  cmake -B build
  cmake --build build --config Release -j "$(nproc)"
}
main "$@"

Benchmark results (intentionally kept out of codeblocks so the md table renders):

$ build/bin/llama-bench \
    -o md \
    -m ~/Library/Caches/llama.cpp/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf

ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 55662.79 MB

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | pp512 | 803.48 ± 8.01 |
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | tg128 | 45.88 ± 0.52 |

build: 01d8eaa28 (8054)

Using nixpkgs version instead:

$ nix-shell \
    -I nixpkgs=flake:github:nixos/nixpkgs \
    -p 'llama-cpp.override { blasSupport = true; }' \
    --run \
    'llama-bench -o md -m ~/Library/Caches/llama.cpp/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf'

unpacking ‘github:nixos/nixpkgs/a5b1db765309855b657b52ac29170e7898c9c96a’ into the Git cache…
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.005 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 55662.79 MB

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | pp512 | 622.26 ± 28.07 |
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | tg128 | 41.60 ± 0.07 |

build: 8872ad2 (7966)

I’ve repeated the runs a few times and the results are consistent: ~800 vs ~600 t/s on pp512 is a pretty substantial difference. Are there build flags we could use to improve the version in nixpkgs?
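One thing that might be worth experimenting with (a sketch only, untested; `GGML_NATIVE` is a real ggml CMake option, but whether nixpkgs disables it is my assumption): nixpkgs builds aim to be reproducible and so typically avoid `-march=native`-style tuning, while a plain `cmake -B build` from source picks it up by default. The derivation's flags can be extended with `overrideAttrs`:

```nix
# Sketch (untested): extend the package's CMake flags via overrideAttrs.
# GGML_NATIVE enables -march=native-style CPU tuning; the assumption is
# that nixpkgs turns it off for reproducibility. A later duplicate -D
# flag wins in CMake, so appending it here should take effect.
(llama-cpp.override { blasSupport = true; }).overrideAttrs (old: {
  cmakeFlags = (old.cmakeFlags or [ ]) ++ [ "-DGGML_NATIVE=ON" ];
})
```

This expression can be passed to `nix-shell -p` in place of the plain `llama-cpp.override { blasSupport = true; }` used in the command above.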

Hi @n8henrie, I suspect the performance discrepancy has something to do with the different build versions used to run the benchmarks. Maybe there were performance improvements between 7966 and 8054? Although, looking at the release history, it seems llama.cpp is updated frequently, and 7966 is only a week old.

source:

build: 01d8eaa28 (8054)

nixpkgs:

build: 8872ad2 (7966)

To rule this out, please build from source using the same version as nixpkgs.

I would try it myself, but I don’t have time to download a 46 GB model.

Sure enough: building from source at the same commit, the build number matches and so does its benchmark. Thanks!

$ git describe --tag
b7966
$ git rev-parse HEAD
8872ad2125336d209a9911a82101f80095a9831d
$ ./build.sh
$ # ...
$ build/bin/llama-bench \
    -o md \
    -m ~/Library/Caches/llama.cpp/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf

ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 5.628 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 55662.79 MB

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | pp512 | 635.70 ± 13.68 |
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | tg128 | 41.90 ± 0.96 |

build: 8872ad212 (7966)
