How to Llama on AMD Strix Halo?

woile · February 25, 2026, 10:44am

Some benchmarks on Qwen3.5-35B-A3B

llama-cli \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
    --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40

[ Prompt: 78,5 t/s | Generation: 18,9 t/s ] # vulkan

[ Prompt: 61,0 t/s | Generation: 18,3 t/s ] # rocm
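To compare the two backends more precisely, here is a small sketch that parses the summary lines above and prints the relative difference. The helper names parse_tps and pct_diff are made up for this example. It assumes the summary line keeps the format shown here, including the locale's decimal comma.

```shell
#!/bin/sh
# Extract "prompt generation" t/s from a llama-cli summary line,
# turning the decimal comma into a dot so awk can do arithmetic on it.
parse_tps() {
    printf '%s\n' "$1" |
        sed -E 's/.*Prompt: ([0-9]+)[,.]([0-9]+) t\/s \| Generation: ([0-9]+)[,.]([0-9]+) t\/s.*/\1.\2 \3.\4/'
}

# Percentage by which a is faster (positive) or slower (negative) than b.
pct_diff() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / b * 100 }'
}

vulkan=$(parse_tps '[ Prompt: 78,5 t/s | Generation: 18,9 t/s ]')
rocm=$(parse_tps '[ Prompt: 61,0 t/s | Generation: 18,3 t/s ]')

# Word-split the two "prompt generation" pairs into $1..$4.
set -- $vulkan $rocm
echo "prompt:     vulkan vs rocm $(pct_diff "$1" "$3")%"
echo "generation: vulkan vs rocm $(pct_diff "$2" "$4")%"
```

With these two runs the gap is mostly in prompt processing, while generation speed is nearly the same on both backends.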

woile · April 3, 2026, 5:49am

Running export HSA_OVERRIDE_GFX_VERSION='11.5.1' with the llama-cpp package from unstable now yields these results:

For Qwen3-Coder-Next-GGUF there’s an improvement:

[ Prompt: 44,3 t/s | Generation: 18,6 t/s ]

For unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL :

[ Prompt: 47,9 t/s | Generation: 19,1 t/s ]

Seems like t/s is getting better. Thanks @Lun for all your work!

It looks like the upcoming llama-cpp release will start using ROCm 7.2.1.

woile · April 16, 2026, 5:01pm

The new Qwen3.6-35B-A3B works okayish by default:

# coding
export HSA_OVERRIDE_GFX_VERSION='11.5.1'
llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    --predict 32768 \
    --ctx-size 128000 \
    -p "briefly explain systemd"

Yields a generous:

[ Prompt: 42,9 t/s | Generation: 15,7 t/s ]

The next step would be to have a Nix module to easily run these models pre-configured.

With llama-cpp 8770 (the latest in unstable as of now):

[ Prompt: 46,8 t/s | Generation: 16,0 t/s ]

I’ve decided to ditch Ollama in favor of llama-cpp. I think we should have a Hugging Face integration module in Nix instead of services.ollama.loadModels.

7c6f434c · April 17, 2026, 6:32pm

If we are talking llama.cpp, you might want to experiment with --no-host, and --mmap / --no-mmap
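One way to go through these flags systematically, as a sketch: the loop below only prints the llama-cli invocations (a dry run), so nothing is downloaded or loaded. The function name list_variants is made up for this example. The model and prompt are borrowed from the earlier posts, and the flags are used exactly as named above.

```shell
#!/bin/sh
# Dry run: print one llama-cli command per flag combination.
MODEL='unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL'

list_variants() {
    for host in '' '--no-host'; do
        for mmap in '--mmap' '--no-mmap'; do
            # ${host:+$host } expands to nothing for the empty variant,
            # so the printed command has no stray double space.
            printf 'llama-cli -hf %s %s%s -p "briefly explain systemd"\n' \
                "$MODEL" "${host:+$host }" "$mmap"
        done
    done
}

list_variants
```

Run each printed command and compare the prompt and generation t/s it reports.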

woile · April 19, 2026, 6:58am

Thanks for the tip!

I took a look at kyuz0/amd-strix-halo-toolboxes on GitHub, and it also recommends passing -fa 1.

Running this increases the speed a little, but not much:

llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    --predict 32768 \
    --ctx-size 128000 \
    -ngl 999 \
    --no-mmap \
    -fa 1 \
    -p "briefly explain systemd in one paragraph"

Yields:

[ Prompt: 55,4 t/s | Generation: 17,5 t/s ]

Crown · April 26, 2026, 3:42am

I noticed my speeds seem much faster than the ones being listed here.

Are we measuring differently?

Qwen3.6 35B-A3B MoE benchmark results on AMD Strix Halo / Radeon 8060S

System:

- OS: NixOS 25.11 (Xantusia), build 25.11.8107.1073dad219cb

- Kernel: Linux 6.19.9

- APU: AMD Ryzen AI MAX+ 395 w/ Radeon 8060S

- CPU: 16 cores / 32 threads

- System memory: 128GB unified memory

- GPU: AMD Strix Halo / Radeon 8060S

- PCI ID: 1002:1586

- Kernel driver: amdgpu

- Vulkan device: Radeon 8060S Graphics (RADV GFX1151)

- Vulkan notes: UMA, fp16 enabled

- VRAM sysfs total: 64 GiB / 68719476736 bytes

- GTT sysfs total: ~31.2 GiB / 33522102272 bytes
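For anyone who wants to compare their own VRAM and GTT totals, here is a sketch of how they can be read. It assumes the amdgpu driver exposes mem_info_vram_total and mem_info_gtt_total under /sys/class/drm/card*/device. The helper name bytes_to_gib is made up for this example.

```shell
#!/bin/sh
# Convert a byte count to GiB with one decimal place.
bytes_to_gib() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1073741824 }'
}

# Print the VRAM and GTT totals of every card that exposes them.
for dev in /sys/class/drm/card*/device; do
    for kind in vram gtt; do
        f="$dev/mem_info_${kind}_total"
        if [ -r "$f" ]; then
            echo "$dev $kind: $(bytes_to_gib "$(cat "$f")") GiB"
        fi
    done
done
```

On a machine without an amdgpu device the loop prints nothing.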

ROCm:

- ROCm version: 7.2.2

- GPU exposed as: gfx1151

- ROCm env used: HSA_OVERRIDE_GFX_VERSION=11.5.1

- Best ROCm run also used: ROCBLAS_USE_HIPBLASLT=1
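As a sketch, the ROCm environment above can be set up in the shell before launching the ROCm build. Only the two variables listed here are used. The final line just prints them so you can check what the process will inherit.

```shell
#!/bin/sh
# ROCm environment as listed above.
export HSA_OVERRIDE_GFX_VERSION=11.5.1
export ROCBLAS_USE_HIPBLASLT=1

# Show the values that a ROCm build started from this shell would see.
env | grep -E '^(HSA_OVERRIDE_GFX_VERSION|ROCBLAS_USE_HIPBLASLT)='
```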

llama.cpp:

- Main Vulkan benchmark build reported: build unknown (0)

- Fresh upstream comparison build used llama.cpp commit f65bc34c688f9ab68c312b5ce0c0885cca94cf1d / short f65bc34

- Separate Vulkan and ROCm builds were tested

Benchmark shape:

llama-bench -m <model.gguf> -ngl 999 -fa <0|1> -p <prompt-tokens> -n 128 -r <repetitions> -o md -t 16

Notes:

- -ngl 999 was used to offload as much as possible to GPU.

- -fa 1 enables Flash Attention.

- -n 128 was used for token generation.

- Later Q8 / Qwen3.6 35B runs explicitly used -t 16.

- Older Q4 131072-context runs did not explicitly pass -t 16 in the manifest.

Results for base Qwen3.6 35B-A3B MoE models:

(results table missing)

Takeaway:

On this AMD Strix Halo / Radeon 8060S setup, Vulkan/RADV was faster than ROCm for Qwen3.6-35B-A3B-Q8_0 in llama.cpp. Flash Attention was very important for high prompt-processing throughput at ~128k context: on the Q4_K_XL run, FA on gave 435.41 t/s PP versus 210.58 t/s with FA off, while TG stayed roughly similar.

woile · April 26, 2026, 4:40am

Thanks for sharing!

We are not measuring differently. Since starting this thread, I learned that I don’t have a Strix Halo (AMD AI 395) but a Strix Point (AMD AI HX 370). The post should actually be called “How to local AI on Strix Arch”.

I think your numbers are the correct ones for Strix Halo, as the AMD AI 395 has higher memory bandwidth:

| System | Processor | Memory Bandwidth | Est. Tokens/Sec (Qwen-35B-A3B) |
|---|---|---|---|
| AMD AI 395 Max+ | Ryzen AI MAX+ 395 | 256 GB/s | 38 – 48 T/s |
| AMD AI HX 370 | Ryzen AI 9 HX 370 | ~90 GB/s | 15 – 22 T/s |
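To see why bandwidth matters so much for generation speed, here is a rough back-of-the-envelope sketch. It is not a measurement. It assumes that every generated token has to read all active weights from memory once, that the A3B suffix means about 3 billion active parameters per token, and that the weights average 8.5 bits each (my assumption for a Q8_0-style quant). It ignores the KV cache and all other overhead, so real numbers will be lower. The helper name tps_ceiling is made up for this example.

```shell
#!/bin/sh
# Upper bound on generation t/s:
#   bandwidth (GB/s) / GB of weights read per token
# $1 = memory bandwidth in GB/s
# $2 = active parameters in billions (assumption)
# $3 = average bits per weight (assumption)
tps_ceiling() {
    LC_ALL=C awk -v bw="$1" -v p="$2" -v bpw="$3" \
        'BEGIN { printf "%.1f\n", bw / (p * bpw / 8) }'
}

# Bandwidth figures taken from the table above.
for bw in 256 90; do
    echo "$bw GB/s: at most $(tps_ceiling "$bw" 3 8.5) t/s"
done
```

The ceiling scales linearly with bandwidth, which matches the direction of the gap between the two chips, even though the absolute numbers are only an upper bound.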

Crown · April 26, 2026, 6:13am

If you haven’t tried it yet: I use pi.dev to manage my NixOS system, and it does an amazing job with Qwen3.6 35B MoE. I think you’ll find 15-20 tokens/second acceptable with it.

woile · May 19, 2026, 8:29am

I’ve added a new entry in the wiki for llama-cpp. It’s much better than Ollama, and I think it can replace Ollama in NixOS now with the new services.llama-cpp.

crertel · May 22, 2026, 7:19am

Oh shit that’s a service now? Glad to hear it!