How to Llama on AMD Strix Halo?

Some benchmarks on Qwen3.5-35B-A3B

llama-cli \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
    --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40
[ Prompt: 78,5 t/s | Generation: 18,9 t/s ] # vulkan
[ Prompt: 61,0 t/s | Generation: 18,3 t/s ] # rocm
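
To compare runs like these, here is a minimal sketch that pulls the two numbers out of a llama-cli timing line and computes the Vulkan to ROCm ratio. The helper names `parse_tps` and `ratio` are made up for illustration, and the pattern assumes the exact bracketed layout shown above, including the decimal commas.

```shell
# Hypothetical helper: extract "prompt generation" t/s from a llama-cli
# timing line, converting decimal commas (78,5) to decimal points (78.5).
parse_tps() {
    printf '%s\n' "$1" |
        sed -n 's#.*Prompt: *\([0-9.,]*\) t/s.*Generation: *\([0-9.,]*\) t/s.*#\1 \2#p' |
        tr ',' '.'
}

# Hypothetical helper: print $1 / $2 with two decimals.
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

vulkan=$(parse_tps '[ Prompt: 78,5 t/s | Generation: 18,9 t/s ]')
rocm=$(parse_tps '[ Prompt: 61,0 t/s | Generation: 18,3 t/s ]')

# ${var% *} keeps the first field (prompt), ${var#* } keeps the second (generation).
echo "prompt:     $(ratio "${vulkan% *}" "${rocm% *}")x"   # prompt:     1.29x
echo "generation: $(ratio "${vulkan#* }" "${rocm#* }")x"   # generation: 1.03x
```

So on these two runs Vulkan is ahead mostly on prompt processing, while generation speed is nearly identical.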

Setting the environment variable with export HSA_OVERRIDE_GFX_VERSION='11.5.1' and using the unstable releases of llama-cpp now yields these results:

For Qwen3-Coder-Next-GGUF there’s an improvement:

[ Prompt: 44,3 t/s | Generation: 18,6 t/s ]

For unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL :

[ Prompt: 47,9 t/s | Generation: 19,1 t/s ]

Seems like t/s is getting better. Thanks @Lun for all your work!

It looks like the upcoming llama-cpp release will start using ROCm 7.2.1.

The new Qwen3.6-35B-A3B works okay-ish by default:

# coding
export HSA_OVERRIDE_GFX_VERSION='11.5.1'
llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    --predict 32768 \
    --ctx-size 128000 \
    -p "briefly explain systemd"

Yields a generous:

[ Prompt: 42,9 t/s | Generation: 15,7 t/s ]
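
The `export` in the command above keeps the override set for the rest of the shell session. If you would rather scope it to a single command, something like the following sketch should work. The wrapper name `with_gfx_override` is hypothetical, and the demo uses a harmless `sh -c` call in place of llama-cli so it can run anywhere.

```shell
# Hypothetical helper: run one command with the gfx override set, without
# exporting the variable into the rest of the shell session.
with_gfx_override() {
    env HSA_OVERRIDE_GFX_VERSION=11.5.1 "$@"
}

# Demonstration with a harmless command; swap in llama-cli and its flags.
with_gfx_override sh -c 'echo "override inside:  ${HSA_OVERRIDE_GFX_VERSION:-unset}"'
echo "override outside: ${HSA_OVERRIDE_GFX_VERSION:-unset}"
```

This makes it easier to switch between ROCm and Vulkan builds in the same terminal without a stale variable lingering around.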

The next step would be to have a Nix module to easily run these models pre-configured.

With llama-cpp 8770 (the latest in unstable as of now):

[ Prompt: 46,8 t/s | Generation: 16,0 t/s ]

I’ve decided to ditch Ollama in favor of llama-cpp. I think we should have a Hugging Face integration module in Nix instead of services.ollama.loadModels.


If we are talking llama.cpp, you might want to experiment with --no-host and with --mmap / --no-mmap.

Thanks for the tip!

I took a look at kyuz0/amd-strix-halo-toolboxes on GitHub, and it also recommends including -fa 1.

Running this increases the speed a little, but not much:

llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    --predict 32768 \
    --ctx-size 128000 \
    -ngl 999 \
    --no-mmap \
    -fa 1 \
    -p "briefly explain systemd in one paragraph"

Yields:

[ Prompt: 55,4 t/s | Generation: 17,5 t/s ]

I noticed my speeds seem much faster than the ones being listed here.

Are we measuring differently?

Qwen3.6 35B-A3B MoE benchmark results on AMD Strix Halo / Radeon 8060S

System:

- OS: NixOS 25.11 (Xantusia), build 25.11.8107.1073dad219cb

- Kernel: Linux 6.19.9

- APU: AMD Ryzen AI MAX+ 395 w/ Radeon 8060S

- CPU: 16 cores / 32 threads

- System memory: 128GB unified memory

- GPU: AMD Strix Halo / Radeon 8060S

- PCI ID: 1002:1586

- Kernel driver: amdgpu

- Vulkan device: Radeon 8060S Graphics (RADV GFX1151)

- Vulkan notes: UMA, fp16 enabled

- VRAM sysfs total: 64 GiB / 68719476736 bytes

- GTT sysfs total: ~31.2 GiB / 33522102272 bytes

ROCm:

- ROCm version: 7.2.2

- GPU exposed as: gfx1151

- ROCm env used: HSA_OVERRIDE_GFX_VERSION=11.5.1

- Best ROCm run also used: ROCBLAS_USE_HIPBLASLT=1

llama.cpp:

- Main Vulkan benchmark build reported: build unknown (0)

- Fresh upstream comparison build used llama.cpp commit f65bc34c688f9ab68c312b5ce0c0885cca94cf1d / short f65bc34

- Separate Vulkan and ROCm builds were tested

Benchmark shape:

llama-bench -m <model.gguf> -ngl 999 -fa <0|1> -p <context> -n 128 -r <1|2> -o md -t 16

Notes:

- -ngl 999 was used to offload as much as possible to GPU.

- -fa 1 enables Flash Attention.

- -n 128 was used for token generation.

- Later Q8 / Qwen3.6 35B runs explicitly used -t 16.

- Older Q4 131072-context runs did not explicitly pass -t 16 in the manifest.
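
As a rough illustration of the benchmark shape and notes above, here is a dry-run sketch that only prints the llama-bench command lines instead of running them. The function name `bench_cmd` is hypothetical. The flags are the ones listed in this post, and the optional trailing arguments cover the difference between runs that passed -t 16 and runs that did not.

```shell
# Hypothetical helper: print (not run) a llama-bench command line.
# Usage: bench_cmd <model> <context> <repetitions> <fa: 0|1> [extra flags...]
bench_cmd() {
    model=$1; ctx=$2; reps=$3; fa=$4
    shift 4
    cmd="llama-bench -m $model -ngl 999 -fa $fa -p $ctx -n 128 -r $reps -o md"
    # Append extra flags such as "-t 16" only when they were given.
    if [ $# -gt 0 ]; then
        cmd="$cmd $*"
    fi
    echo "$cmd"
}

# Q4 runs at 131072 context, Flash Attention off and on, no explicit -t.
for fa in 0 1; do
    bench_cmd Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf 131072 2 "$fa"
done

# Q8 run at 128000 context with -t 16.
bench_cmd Qwen3.6-35B-A3B-Q8_0.gguf 128000 1 1 -t 16
```

Pipe the printed lines to `sh` (or drop the `echo`) once the model paths match your setup.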

Results for base Qwen3.6 35B-A3B MoE models:

| Model | Backend | Context | Settings | PP | TG | Memory | Status |
|---|---|---:|---|---:|---:|---|---|
| Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf | Vulkan / RADV | 131072 | -ngl 999 -fa 1 -p 131072 -n 128 -r 2 -o md | 435.41 t/s | 56.67 t/s | sys 19.59 GiB, VRAM 29.32 GiB, GTT 0.83 GiB | valid |
| Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf | Vulkan / RADV | 131072 | -ngl 999 -fa 0 -p 131072 -n 128 -r 2 -o md | 210.58 t/s | 57.53 t/s | sys 25.71 GiB, VRAM 33.03 GiB, GTT 1.09 GiB | valid, but FA off was much worse for prompt processing |
| Qwen3.6-35B-A3B-Q8_0.gguf | Vulkan / RADV | 128000 | -ngl 999 -fa 1 -p 128000 -n 128 -r 1 -o md -t 16 | 454.13 t/s | 53.05 t/s | ~40.3 GiB VRAM mid-run | valid |
| Qwen3.6-35B-A3B-Q8_0.gguf | ROCm + hipBLASLt | 128000 | HSA_OVERRIDE_GFX_VERSION=11.5.1 ROCBLAS_USE_HIPBLASLT=1, -ngl 999 -fa 1 -p 128000 -n 128 -r 1 -o md -t 16 | 445.18 t/s | 46.25 t/s | ~40.4 GiB VRAM mid-run | valid, but slower than Vulkan |

Takeaway:

On this AMD Strix Halo / Radeon 8060S setup, Vulkan/RADV was faster than ROCm for Qwen3.6-35B-A3B-Q8_0 in llama.cpp. Flash Attention was very important for high prompt-processing throughput at ~128k context: on the Q4_K_XL run, FA on gave 435.41 t/s PP versus 210.58 t/s with FA off, while TG stayed roughly similar.
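
To put numbers on that takeaway, here is a small sketch that computes the percentage changes from the table values. The helper name `pct_change` is hypothetical, and the figures are copied from the results table above.

```shell
# Hypothetical helper: percentage change going from $1 to $2, one decimal.
pct_change() {
    LC_ALL=C awk -v from="$1" -v to="$2" \
        'BEGIN { printf "%.1f%%\n", (to - from) / from * 100 }'
}

echo "PP, Q4_K_XL, FA off -> on:  $(pct_change 210.58 435.41)"   # 106.8%
echo "TG, Q4_K_XL, FA off -> on:  $(pct_change 57.53 56.67)"     # -1.5%
echo "TG, Q8_0, ROCm -> Vulkan:   $(pct_change 46.25 53.05)"     # 14.7%
```

In other words, Flash Attention roughly doubled prompt processing at this context length and left generation speed essentially unchanged, while Vulkan was about 15% ahead of ROCm on generation for Q8_0.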


Thanks for sharing!

We are not measuring differently. Since starting this thread, I learned that I don’t have a Strix Halo (Ryzen AI MAX+ 395) but a Strix Point (Ryzen AI 9 HX 370). The post should actually be called “How to local AI on Strix Arch” :sweat_smile:

I think your numbers are the correct ones for Strix Halo, as the Ryzen AI MAX+ 395 has higher memory bandwidth:

| System | Processor | Memory Bandwidth | Est. Tokens/Sec (Qwen-35B-A3B) |
|---|---|---:|---:|
| AMD AI 395 Max+ | Ryzen AI MAX+ 395 | 256 GB/s | 38 – 48 T/s |
| AMD AI HX 370 | Ryzen AI 9 HX 370 | ~90 GB/s | 15 – 22 T/s |
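
As a back-of-the-envelope check on why bandwidth matters here, the sketch below estimates an upper bound on generation speed as memory bandwidth divided by the bytes of weights read per token. Everything except the two bandwidth figures is an assumption on my part: that the A3B suffix means roughly 3 billion active parameters per token, that Q8_0 stores about 8.5 bits per weight, and that weight reads dominate (KV cache traffic and compute are ignored). The helper name `tg_ceiling` is hypothetical.

```shell
# Hypothetical helper: rough upper bound on generated tokens per second.
# $1 = memory bandwidth in GB/s
# $2 = active parameters per token, in billions (assumed, not measured)
# $3 = bits per weight of the quantization (assumed)
tg_ceiling() {
    LC_ALL=C awk -v bw="$1" -v params="$2" -v bits="$3" \
        'BEGIN { printf "%.0f\n", bw / (params * bits / 8) }'
}

echo "Ryzen AI MAX+ 395 (256 GB/s): ~$(tg_ceiling 256 3 8.5) t/s ceiling"   # ~80
echo "Ryzen AI 9 HX 370 (~90 GB/s): ~$(tg_ceiling 90 3 8.5) t/s ceiling"    # ~28
```

Real numbers land below these ceilings, as the measurements in this thread show, but the gap between the two chips follows the bandwidth gap fairly closely.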

If you’ve not tried it, I use pi.dev to manage my NixOS system, and it does an amazing job with Qwen3.6 35B MoE. I think you’ll find the 15-20 tokens/second acceptable with it.

I’ve added a new entry in the wiki for llama-cpp. It’s much better than Ollama, and I think it can now replace Ollama in NixOS with the new services.llama-cpp module.


Oh shit that’s a service now? Glad to hear it!