I noticed my speeds seem much faster than the ones being listed here.
Are we measuring differently?
Qwen3.6 35B-A3B MoE benchmark results on AMD Strix Halo / Radeon 8060S
System:
- OS: NixOS 25.11 (Xantusia), build 25.11.8107.1073dad219cb
- Kernel: Linux 6.19.9
- APU: AMD Ryzen AI MAX+ 395 w/ Radeon 8060S
- CPU: 16 cores / 32 threads
- System memory: 128GB unified memory
- GPU: AMD Strix Halo / Radeon 8060S
- PCI ID: 1002:1586
- Kernel driver: amdgpu
- Vulkan device: Radeon 8060S Graphics (RADV GFX1151)
- Vulkan notes: UMA, fp16 enabled
- VRAM sysfs total: 64 GiB / 68719476736 bytes
- GTT sysfs total: ~31.2 GiB / 33522102272 bytes
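For anyone who wants to compare their own memory split, here is a minimal Python sketch of the bytes-to-GiB conversion behind the two sysfs figures above. It assumes the totals come from the amdgpu files `mem_info_vram_total` and `mem_info_gtt_total` under `/sys/class/drm/card<N>/device/`; the post does not say which files were read, so treat that path as an assumption. The function names are hypothetical.

```python
from pathlib import Path

GIB = 1024 ** 3  # bytes per GiB


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def read_amdgpu_mem_totals(device_dir: Path) -> dict[str, int]:
    """Read VRAM and GTT totals (bytes) from one device directory.

    Only entries whose sysfs file actually exists are returned, so a
    non-amdgpu device simply yields an empty dict.
    """
    totals = {}
    for name in ("mem_info_vram_total", "mem_info_gtt_total"):
        path = device_dir / name
        if path.is_file():
            totals[name] = int(path.read_text().strip())
    return totals


if __name__ == "__main__":
    # Assumed layout: /sys/class/drm/card<N>/device/ (adjust if yours differs).
    for device_dir in sorted(Path("/sys/class/drm").glob("card*/device")):
        for name, value in read_amdgpu_mem_totals(device_dir).items():
            print(f"{device_dir}: {name} = {value} bytes "
                  f"({bytes_to_gib(value):.1f} GiB)")
```

Feeding in the two byte counts listed above gives 64 GiB for VRAM and about 31.2 GiB for GTT, which matches the rounded values in the list.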
ROCm:
- ROCm version: 7.2.2
- GPU exposed as: gfx1151
- ROCm env used: HSA_OVERRIDE_GFX_VERSION=11.5.1
- Best ROCm run also used: ROCBLAS_USE_HIPBLASLT=1
llama.cpp:
- Main Vulkan benchmark build reported: build unknown (0)
- Fresh upstream comparison build: llama.cpp commit f65bc34c688f9ab68c312b5ce0c0885cca94cf1d (short: f65bc34)
- Separate Vulkan and ROCm builds were tested
Benchmark shape:
llama-bench -m <model.gguf> -ngl 999 -fa <0|1> -p <context> -n 128 -r <reps> -o md -t 16
Notes:
- -ngl 999 was used to offload as much as possible to GPU.
- -fa 1 enables Flash Attention.
- -n 128 was used for token generation.
- Later Q8 / Qwen3.6 35B runs explicitly used -t 16.
- Older Q4 131072-context runs did not explicitly pass -t 16 in the manifest.
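To make the benchmark shape easier to reproduce, here is a small Python sketch that assembles the same llama-bench argument list from the settings in the notes above. The helper name `llama_bench_argv` and its parameter names are made up for illustration; only the flags themselves (`-m`, `-ngl`, `-fa`, `-p`, `-n`, `-r`, `-o`, `-t`) come from the runs described here.

```python
import shlex


def llama_bench_argv(model, n_prompt, flash_attn, reps,
                     n_gen=128, threads=None, ngl=999):
    """Build a llama-bench argument list in the shape used for these runs."""
    argv = [
        "llama-bench",
        "-m", model,
        "-ngl", str(ngl),                   # offload as many layers as possible
        "-fa", "1" if flash_attn else "0",  # Flash Attention on/off
        "-p", str(n_prompt),                # prompt-processing length
        "-n", str(n_gen),                   # tokens to generate
        "-r", str(reps),                    # repetitions
        "-o", "md",                         # markdown table output
    ]
    # The older Q4 runs did not pass -t, so it is only added when given.
    if threads is not None:
        argv += ["-t", str(threads)]
    return argv


if __name__ == "__main__":
    q8 = llama_bench_argv("Qwen3.6-35B-A3B-Q8_0.gguf",
                          n_prompt=128000, flash_attn=True,
                          reps=1, threads=16)
    print(shlex.join(q8))
```

The list can be handed to `subprocess.run` as-is. For the ROCm run, the two environment variables listed in the ROCm section would also have to be set in the process environment.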
Results for base Qwen3.6 35B-A3B MoE models:
| Model | Backend | Context | Settings | PP | TG | Memory | Status |
|---|---|---:|---|---:|---:|---|---|
| Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf | Vulkan / RADV | 131072 | -ngl 999 -fa 1 -p 131072 -n 128 -r 2 -o md | 435.41 t/s | 56.67 t/s | sys 19.59 GiB, VRAM 29.32 GiB, GTT 0.83 GiB | valid |
| Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf | Vulkan / RADV | 131072 | -ngl 999 -fa 0 -p 131072 -n 128 -r 2 -o md | 210.58 t/s | 57.53 t/s | sys 25.71 GiB, VRAM 33.03 GiB, GTT 1.09 GiB | valid, but FA off was much worse for prompt processing |
| Qwen3.6-35B-A3B-Q8_0.gguf | Vulkan / RADV | 128000 | -ngl 999 -fa 1 -p 128000 -n 128 -r 1 -o md -t 16 | 454.13 t/s | 53.05 t/s | ~40.3 GiB VRAM mid-run | valid |
| Qwen3.6-35B-A3B-Q8_0.gguf | ROCm + hipBLASLt | 128000 | HSA_OVERRIDE_GFX_VERSION=11.5.1 ROCBLAS_USE_HIPBLASLT=1, -ngl 999 -fa 1 -p 128000 -n 128 -r 1 -o md -t 16 | 445.18 t/s | 46.25 t/s | ~40.4 GiB VRAM mid-run | valid, but slower than Vulkan |
Takeaway:
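On this AMD Strix Halo / Radeon 8060S setup, Vulkan/RADV was faster than ROCm for Qwen3.6-35B-A3B-Q8_0 in llama.cpp: about 15% faster on TG (53.05 vs 46.25 t/s) and about 2% faster on PP (454.13 vs 445.18 t/s). Both Q8 runs used -r 1, so the small PP gap should be read with caution. Flash Attention mattered a lot for prompt-processing throughput at ~128k context: on the Q4_K_XL run, FA on gave 435.41 t/s PP versus 210.58 t/s with FA off (roughly 2.1x), while TG stayed about the same (56.67 vs 57.53 t/s).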
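The ratios behind that takeaway can be checked with a few lines of Python. This is only a sketch over the numbers copied from the table. The helper names are hypothetical, and `prompt_seconds` assumes the reported PP figure is an average over the whole prompt, which may not match how your tool reports it.

```python
# Throughput figures (tokens/second) copied from the results table.
Q4_FA_ON = {"pp": 435.41, "tg": 56.67}
Q4_FA_OFF = {"pp": 210.58, "tg": 57.53}
Q8_VULKAN = {"pp": 454.13, "tg": 53.05}
Q8_ROCM = {"pp": 445.18, "tg": 46.25}


def speedup(a_tps, b_tps):
    """How many times faster a is than b."""
    return a_tps / b_tps


def prompt_seconds(n_tokens, pp_tps):
    """Estimated wall time to process a prompt at an average PP rate."""
    return n_tokens / pp_tps


if __name__ == "__main__":
    print(f"FA on vs off, PP:   {speedup(Q4_FA_ON['pp'], Q4_FA_OFF['pp']):.2f}x")
    print(f"FA on vs off, TG:   {speedup(Q4_FA_ON['tg'], Q4_FA_OFF['tg']):.2f}x")
    print(f"Vulkan vs ROCm, PP: {speedup(Q8_VULKAN['pp'], Q8_ROCM['pp']):.2f}x")
    print(f"Vulkan vs ROCm, TG: {speedup(Q8_VULKAN['tg'], Q8_ROCM['tg']):.2f}x")
    print(f"131072-token prompt, FA on:  "
          f"{prompt_seconds(131072, Q4_FA_ON['pp']):.0f} s")
    print(f"131072-token prompt, FA off: "
          f"{prompt_seconds(131072, Q4_FA_OFF['pp']):.0f} s")
```

If you are comparing against numbers from a different setup, it helps to compare the same quantities: PP and TG at the same prompt length, with the same -fa setting.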