Hi all,
I finally decided to start experimenting with llama.cpp. On my M4 MBP (64 GB), I’m seeing significantly higher tokens/sec from a source build than from the packaged version, even though both report the Metal and BLAS backends. Why might that be?
The source build uses this script:
#!/usr/bin/env nix-shell
#!nix-shell -i bash -p bash --pure
#!nix-shell -p cmake
#!nix-shell -p git
#!nix-shell -p llvmPackages.openmp
#!nix-shell -p openssl
set -Eeuf -o pipefail
set -x
main() {
  # Clean configure + Release build; llama.cpp's CMake defaults
  # enable the Metal backend and Accelerate on Apple platforms,
  # which is where the MTL,BLAS backends in the tables come from.
  rm -rf ./build
  cmake -B build
  cmake --build build --config Release -j "$(nproc)"
}
main "$@"
Benchmark results (logs intentionally kept out of code blocks so the md tables render):
$ build/bin/llama-bench \
-o md \
-m ~/Library/Caches/llama.cpp/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 55662.79 MB
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | pp512 | 803.48 ± 8.01 |
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | tg128 | 45.88 ± 0.52 |
build: 01d8eaa28 (8054)
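(One caveat before the comparison: the nixpkgs binary below turns out to be at an older commit, 8872ad2 / build 7966, versus 01d8eaa28 / 8054 here, so part of any gap could be upstream changes rather than packaging. To isolate that, I could pin the source checkout and rerun the build script above:)
# Sketch: check out the commit nixpkgs packaged, then rebuild via
# the nix-shell script above ("build.sh" is a hypothetical name for
# it) so both binaries sit at the same upstream commit.
git checkout 8872ad2
./build.sh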
Using the nixpkgs version instead:
$ nix-shell \
-I nixpkgs=flake:github:nixos/nixpkgs \
-p 'llama-cpp.override { blasSupport = true; }' \
--run \
'llama-bench -o md -m ~/Library/Caches/llama.cpp/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf'
unpacking ‘github:nixos/nixpkgs/a5b1db765309855b657b52ac29170e7898c9c96a’ into the Git cache…
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.005 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 55662.79 MB
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | pp512 | 622.26 ± 28.07 |
| qwen3next 80B.A3B Q4_K - Medium | 45.08 GiB | 79.67 B | MTL,BLAS | 12 | tg128 | 41.60 ± 0.07 |
build: 8872ad2 (7966)
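To see which flags the nixpkgs derivation above actually configures with, something like this should dump them (a sketch; it assumes the flags sit in the conventional cmakeFlags attribute):
# Sketch: print the CMake flags the nixpkgs llama-cpp build uses,
# to diff against the source build's defaults.
nix eval --json nixpkgs#llama-cpp.cmakeFlags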
I’ve repeated the runs a few times and the gap seems consistent: ~800 vs ~620 t/s on pp512 is a substantial difference (tg128 shows a smaller one, ~46 vs ~42 t/s). Are there build flags we could use to improve the version in nixpkgs?
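If it turns out nixpkgs disables native CPU tuning (distro packages often do, for reproducibility; GGML_NATIVE is the upstream toggle for it), here’s an untested sketch of forcing it back on via overrideAttrs:
# Untested sketch: append -DGGML_NATIVE=ON to the nixpkgs package's
# cmakeFlags. Whether nixpkgs currently sets this OFF is an
# assumption worth verifying (e.g. with the nix eval above).
nix-shell \
  -I nixpkgs=flake:github:nixos/nixpkgs \
  -p '(llama-cpp.override { blasSupport = true; }).overrideAttrs (old: {
        cmakeFlags = (old.cmakeFlags or []) ++ [ "-DGGML_NATIVE=ON" ];
      })' \
  --run 'llama-bench -o md -m ~/Library/Caches/llama.cpp/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M_Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf'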