How to run Ollama on AMD Strix Halo?

Hi everyone!

tl;dr: as of now, running ollama-vulkan is the best option on Strix Halo.

I wanted to open this topic to discuss how to run models with Ollama on the new Strix Halo CPU/GPU, since we don’t have a topic on it here yet. I would like to hear about your configurations, discuss the current status quo, and share what I’ve tried so far.

I’m personally interested in running small models as fast as possible, and spoilers, I’m doing a terrible job at it :sweat_smile:

My laptop setup:

Kernel      -  6.18.5
Distro      -  NixOS 26.05 (Yarara) [unstable actually]
DE          -  KDE
CPU         -  AMD Ryzen AI 9 HX 370 w/ Radeon 890M (24)
Memory      -  7.4 GB / 131.0 GB
Power       -  100W

My ollama (0.13.5) config:

  services.ollama = {
    enable = true;
    package = pkgs.ollama-rocm; # or set pkgs.ollama-vulkan
    loadModels = [
      "ministral-3:14b"
      "ministral-3:8b"
    ];
    rocmOverrideGfx = "11.5.1";
    environmentVariables = {
      # Hoped this would help with offloading layers to the GPU (it didn't)
      HSA_ENABLE_SDMA = "0";
      OLLAMA_DEBUG = "1";
    };
  };

I noticed that models like ministral-3:8b were running slowly, around 11.72 tokens/s, and thus the rabbit hole began. An online calculator says it should be ~30 tokens/s, but I haven’t been able to find a proper benchmark, unfortunately.

I started with ollama-rocm, and I remembered reading a random comment on Reddit: “just use vulkan”. So I switched to ollama-vulkan and, believe it or not, it ended up being the fastest. I continued testing different ROCm options, but they all end up slower. In theory ROCm should be faster, since Vulkan is the more generic solution, but in practice ROCm on Strix Halo is still very green.

These are my benchmark notes:

Backend        Configuration / Overrides                   Eval rate
ollama-vulkan  default                                     14.11 tokens/s
ollama-rocm    rocmOverrideGfx = "11.0.2"                  12.73 tokens/s
ollama-rocm    rocmOverrideGfx = "11.5.0"                  12.65 tokens/s
ollama-rocm    rocmOverrideGfx = "11.5.1" (SDMA disabled)  12.60 tokens/s
ollama-rocm    rocmOverrideGfx = "11.5.0" (SDMA disabled)  12.38 tokens/s
ollama-rocm    rocmOverrideGfx = "11.0.0" (SDMA disabled)  12.17 tokens/s
ollama-rocm    rocmOverrideGfx = "11.0.3"                  11.88 tokens/s
ollama-rocm    rocmOverrideGfx = "11.0.0"                  11.72 tokens/s

I’ve been running this command, which outputs a report at the end:

ollama run ministral-3:8b \
  "In 2 paragraphs explain what journalctl is and what it does, no examples" \
  --verbose
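For comparing many overrides, the interesting number can be pulled out of the `--verbose` report mechanically. A small sketch (the parsing assumes the report format shown in this thread):

```shell
# Take a saved --verbose report and keep only the generation rate.
# The /^eval rate:/ anchor skips the "prompt eval rate:" line.
report='prompt eval rate:     57.15 tokens/s
eval rate:            12.67 tokens/s'

printf '%s\n' "$report" | awk '/^eval rate:/ {print $3}'
# prints: 12.67
```

In real use you would pipe `ollama run … --verbose 2>&1` into the same awk filter.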

And what I noticed by looking at the ollama logs

sudo journalctl -u ollama -f

is that ollama-rocm somehow fails to “offload layers to the GPU”:

msg="insufficient VRAM to load any model layers"
msg="new layout created" layers=[]
msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:12 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
msg="model weights" device=CPU size="5.6 GiB"
msg="kv cache" device=CPU size="544.0 MiB"
msg="compute graph" device=CPU size="765.1 MiB"
msg="total memory" size="6.9 GiB"
msg="loaded runners" count=1
msg="offloading 0 repeating layers to GPU"
msg="offloading output layer to CPU"
msg="offloaded 0/35 layers to GPU"
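To watch just the offload decisions while a model loads, the log can be filtered (this assumes the NixOS `ollama` unit name, as above):

```shell
# Follow the ollama unit and keep only the lines relevant to GPU
# offloading; the pattern matches the log lines quoted above.
sudo journalctl -u ollama -f | grep -E 'offload|VRAM|layout'
```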

After some investigation, it looks like the AMD 890M is not supported by ollama yet, although some people suggest changing the VRAM assigned to the card in the BIOS (I wanted to test this next and update the post, but I couldn’t do it in my BIOS).

I’ve also seen other tools, not available in nixpkgs yet, which allegedly support Strix Halo: lemonade-server, amd/gaia, and FastFlowLM.

Hopefully, with the outcomes of this thread, we can update the wiki.

To close the long post, I would like to hear experiences from the community.

Thanks!


I’ve also tried assigning more VRAM to the GPU in the kernel with no success.

boot.kernelParams = [
    # assign more VRAM to the GPU (32GB)
    "ttm.pages_limit=8388608"
];
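For reference, `ttm.pages_limit` counts 4 KiB pages, so the value above does work out to 32 GiB:

```shell
# 32 GiB expressed in 4 KiB pages:
echo $(( 32 * 1024 * 1024 * 1024 / 4096 ))   # 8388608
```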

I still see with ollama-rocm the error:

insufficient VRAM to load any model layers

No difference with vulkan

  1. Have you also tried llama.cpp with Vulkan enabled, which we have packaged but maybe not in cache?
  2. Have you checked what vulkaninfo says about memory available?
  3. If you manage to get iGPU offload, you might want to try something MoE, it looks like fewer active parameters does help even while the total size is larger (but you need to convince the system to treat more RAM as VRAM when requested)
  1. I haven’t, I’m not that familiar with how to use llama.cpp
  2. There’s a lot of info in vulkaninfo, and is it used by rocm? Is the following info useful?
memoryHeaps: count = 1
        memoryHeaps[0]:
                size   = 134134628352 (0x1f3b0c0000) (124.92 GiB)
                budget = 134134628352 (0x1f3b0c0000) (124.92 GiB)
                usage  = 7613366272 (0x1c5cac000) (7.09 GiB)
                flags: count = 1
                        MEMORY_HEAP_DEVICE_LOCAL_BIT
  3. I will, I had some hope ttm.pages_limit was gonna work :sad_but_relieved_face:

Some more outputs to review:

$ sudo dmesg | grep -i "amdgpu" | grep -i "GTT"
[    1.616011] amdgpu 0000:65:00.0: amdgpu: amdgpu: 32768M of GTT memory ready.
$ sudo cat /sys/module/amdgpu/parameters/gttsize
-1

AFAIK -1 here means that it’s in “Auto”.
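The effective split can also be read straight from the amdgpu sysfs counters; a sketch, assuming the iGPU is card0 (adjust the card number if needed):

```shell
# Report total GTT and VRAM as seen by the kernel driver.
base=/sys/class/drm/card0/device
for f in mem_info_gtt_total mem_info_vram_total; do
  printf '%s: %s MiB\n' "$f" "$(( $(cat "$base/$f") / 1048576 ))"
done
```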

Rocminfo:

nix run nixpkgs#"rocmPackages.rocminfo" -- --run "rocminfo" | grep "gfx"
  Name:                    gfx1150
      Name:                    amdgcn-amd-amdhsa--gfx1150
      Name:                    amdgcn-amd-amdhsa--gfx11-generic

Then amdgpu_top shows:

$ nix-shell -p amdgpu_top --run "amdgpu_top --dump"
Device Name              : [AMD Radeon 890M Graphics]
PCI (domain:bus:dev.func): 0000:65:00.0
DeviceID.RevID           : 0x150E.0xC1
gfx_target_version       : gfx1150

GPU Type  : APU
Family    : GC 11.5.0
ASIC Name : GFX1150/Strix Point
Chip Class: GFX11_5
# ...
VRAM              : usage   442 MiB, total   512 MiB (usable   338 MiB)
CPU-Visible VRAM  : usage   442 MiB, total   512 MiB (usable   338 MiB)
GTT               : usage  7994 MiB, total 32768 MiB (usable 32754 MiB)

There’s a new version of ollama, 0.14.1, in unstable. It seems to solve the offloading issue:

I now see:

msg="offloaded 35/35 layers to GPU"

The performance is a tiny bit better. Gonna do some checks again.

Edit:

It seems to detect the layers with rocmOverrideGfx = "11.5.1"; and rocmOverrideGfx = "11.0.0";. With 11.5.0 it goes back to 0/35 layers used.

There are some performance gains, I guess… but only in the “prompt eval rate”; the “eval rate” remains the same.

rocmOverrideGfx = "11.5.0";

$ ollama run ministral-3:8b "In 2 paragraphs explain what journalctl is and what it does, no examples" --verbose
total duration:       33.322368575s
load duration:        4.121463777s
prompt eval count:    569 token(s)
prompt eval duration: 9.955842628s
prompt eval rate:     57.15 tokens/s
eval count:           243 token(s)
eval duration:        19.175080554s
eval rate:            12.67 tokens/s

rocmOverrideGfx = "11.5.1";

$ ollama run ministral-3:8b "In 2 paragraphs explain what journalctl is and what it does, no examples" --verbose
total duration:       36.117319075s
load duration:        7.890915699s
prompt eval count:    569 token(s)
prompt eval duration: 1.40330411s
prompt eval rate:     405.47 tokens/s
eval count:           339 token(s)
eval duration:        26.739219768s
eval rate:            12.68 tokens/s

Edit:

rocmOverrideGfx = "11.0.0"; and without ttm.pages_limit, as described below

total duration:       26.806015504s
load duration:        7.981355354s
prompt eval count:    569 token(s)
prompt eval duration: 1.396062768s
prompt eval rate:     407.57 tokens/s
eval count:           229 token(s)
eval duration:        17.359498045s
eval rate:            13.19 tokens/s

Removing "ttm.pages_limit=8388608" from the kernel params actually improves the “eval rate” a bit:

eval rate:            13.49 tokens/s

ollama-vulkan remains on top:

eval rate:            14.28 tokens/s

Hmm, do you see the RAM and usage spike in nvtop? (It does work with AMD cards: nix-shell -p nvtopPackages.amd.) The numbers feel a little low for an 8b model.

Here’s nvtop on the left and amdgpu_top on the right.

For vulkan: (screenshot)

For rocm: (screenshot)

Running a MoE model, as suggested by @7c6f434c, like hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-1M-GGUF:Q4_K_M (which I think is a MoE), makes an incredible jump. Any other models to recommend?

rocmOverrideGfx = "11.0.0";

total duration:       17.983895585s
load duration:        10.478232941s
prompt eval count:    24 token(s)
prompt eval duration: 257.942964ms
prompt eval rate:     93.04 tokens/s
eval count:           201 token(s)
eval duration:        7.184206604s
eval rate:            27.98 tokens/s

rocmOverrideGfx = "11.5.1";

total duration:       16.368635185s
load duration:        8.934547421s
prompt eval count:    24 token(s)
prompt eval duration: 261.230628ms
prompt eval rate:     91.87 tokens/s
eval count:           204 token(s)
eval duration:        7.119764094s
eval rate:            28.65 tokens/s

However, vulkan is still on top:

total duration:       15.345436454s
load duration:        8.017318758s
prompt eval count:    24 token(s)
prompt eval duration: 727.240871ms
prompt eval rate:     33.00 tokens/s
eval count:           206 token(s)
eval duration:        6.549346847s
eval rate:            31.45 tokens/s

I assume the issue with ministral might be that it’s also a vision model (?)

Note that if you believe a 14B model should do more than 14 tok/s, then 31 tok/s from an A3B MoE model is probably still underperformance. The point of MoE is to be faster.

Performance optimisation for GPGPU does look like a black art, though, and estimating the outcomes from the first principles might just fail to work.

I don’t think llama.cpp is too hard to run. You do need to launch stuff manually from the command line, but it is not too complex for the basic functionality, although you’ll need to read the documentation. My expectations are around «same speed», though, as Ollama seems to run a fork and import the core library changes. On the other hand, you get the ability to run Q5 and Q6 quantisations, which seem to be good speed/quality trade-offs on iGPU.

If we are talking STEM, including programming, I have the impression that Qwen3 MoE models are a good overall choice (and I would expect the VL model of the series to work fine too). Of course you need to consider which of them fits better; as far as I understand, the current top-of-the-line models there are Instruct, Thinking, Coder, and VL, and they are not all of the same age.

For some other topics one probably needs to check whether issues of data availability and training priorities have been made worse by the political context at both country and organisation levels.

Although I guess the degree of preference between programming languages might also count as an example of organisation-level politics! If you are asking about neither Python nor JS, maybe it is worth checking where the quality/performance trade-off looks best.

In general, I wouldn’t expect any specific token/s problems from vision models. Quality fine-tuning priorities slightly differ, and probably at the small or MoE models size these trade-offs can become visible; I have not tried to check this systematically.


If you want to compare with llama-cpp directly, it should be pretty easy to run as a one-off; you can give it a Hugging Face model ID:

nix run github:nixos/nixpkgs/nixos-unstable#pkgsRocm.llama-cpp -- -hf ggml-org/functiongemma-270m-it-GGUF -c 0 -fa 1 -p "hello"

The ROCm 7.0.2 bump was just merged; I’m curious whether you’ll see any speedup once it reaches the unstable channel. 7 was the first version with official Strix Halo support.

edit: Found an issue for a big perf regression on Strix Halo in llama-server. I’d guess it will be a problem for ollama too, but it depends on whether their backports of llama-server changes are ahead of the bug: Misc. bug: Performance regression using ROCm on Strix Halo · Issue #17917 · ggml-org/llama.cpp · GitHub


rocm 7 just hit unstable :tada:

Unfortunately, I don’t see any improvements (ministral-3:8b); only 11.5.1 gains about one second:

total duration:       25.109305128s
load duration:        4.025237496s
prompt eval count:    569 token(s)
prompt eval duration: 1.413796763s
prompt eval rate:     402.46 tokens/s
eval count:           266 token(s)
eval duration:        19.595268594s
eval rate:            13.57 tokens/s

vulkan, 11.5.0 and 11.0.0 remain about the same.

MoEs (nemotron, Qwen3-Coder-30B-A3B) remain around the same speed.

With llama-cpp I get this error:

ggml_cuda_compute_forward: SCALE failed
ROCm error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at /build/source/ggml/src/ggml-cuda/ggml-cuda.cu:2751
  err
# ... nothing relevant ...
Aborted                    (core dumped)
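“invalid device function” usually means the compiled kernels don’t match the gfx target, so it might be worth retrying with the same override that `rocmOverrideGfx` sets under the hood (just an assumption that the failure mode is the same here):

```shell
# HSA_OVERRIDE_GFX_VERSION is the env var behind rocmOverrideGfx;
# here it is applied directly to the one-off llama-cpp run.
HSA_OVERRIDE_GFX_VERSION=11.5.1 \
  nix run github:nixos/nixpkgs/nixos-unstable#pkgsRocm.llama-cpp -- \
  -hf ggml-org/functiongemma-270m-it-GGUF -c 0 -fa 1 -p "hello"
```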

Edit:

Looks like ROCm 7 is not supported by ollama yet.

Hi there.

Somehow I squeezed some performance out of ROCm.
My system is based on a Ryzen AI Max+ 395, and it reached these times with the benchmark mentioned at the start:

total duration: 11.75302112s
load duration: 1.613106275s
prompt eval count: 26 token(s)
prompt eval duration: 412.893809ms
prompt eval rate: 62.97 tokens/s
eval count: 159 token(s)
eval duration: 9.629045093s
eval rate: 16.51 tokens/s

The Nix Config looks like this:

{
  services.ollama = {
    enable = true;
    package = pkgs.ollama-rocm;
    #package = pkgs.ollama-vulkan;
    #package = pkgs.llama-cpp-rocm;
    #acceleration = "rocm";
    loadModels = [
      "llama2"
      "llama3.1:8b"
      "mistral:7b"
      "mistral-small3.2:24b"
      "ministral-3:14b" # Version > 14.0 needed!
    ];
    environmentVariables = {
      HSA_OVERRIDE_GFX_VERSION = "11.5.1";
      HIP_VISIBLE_DEVICES = "1";
      OLLAMA_LLM_LIBRARY = "rocm";
      HCC_AMDGPU_TARGET = "gfx1151";
      HSA_ENABLE_SDMA = "1";
      OLLAMA_DEBUG = "1";
    };
    rocmOverrideGfx = "11.5.1";
  };
}


@knix I tried your config, and the result is the same; you might be getting slightly better numbers because it’s a Max+ 395.

If you use vulkan do you also see a performance increase?

ollama 0.14.3 is out in unstable, I’m gonna try it. Update: no gains

I recently found out that the llama.cpp repo has something called llama-server. Has anyone played with it? Could it work as a replacement for ollama?

I sometimes use llama-server. In itself, it is not for managing downloaded models; it is primarily for configuring and launching a model from the command line, then exposing an API to it, with a small Web UI. The UI is minimalistic but it works pretty well (and it doesn’t need to do too much).

Possibly you could use it together with llama-swap to choose from a Web UI which model is currently loaded.

For managing the model downloads you need something else, I think.

I recently figured out that my system is using the CPU only, and in some configs, when ollama sees the GPU (8060S), it crashes. That is the case for 25.11 and for unstable.
So these numbers were actually not representative …

I will post the result again when it runs correctly - this might take a while.

With llama-server, do you mean the systemd service that runs ollama?
That is what I actually use, with a lot of trying what works and what doesn’t, hence the comments …

{
  services.ollama = {
    enable = true;
    #package = pkgs.ollama;
    #package = pkgs.ollama-rocm;
    #package = pkgs.ollama-vulkan;
    #package = pkgs.llama-cpp-rocm;
    #acceleration = "rocm";
    #acceleration = "vulkan";
    #loadModels = [
      #"llama2"
      #"llama3.1:8b"
      #"mistral:7b"
      #"mistral-small3.2:24b"
      #"ministral-3:14b" # Version > 14.0 needed!
    #];
    #environmentVariables = {
    #  OLLAMA_HOST="0.0.0.0:11434";
    #  #HSA_OVERRIDE_GFX_VERSION = "11.5.1";
    #  #HIP_VISIBLE_DEVICES = "1";
    #  #ROCR_VISIBLE_DEVICES = "1";
    #  #OLLAMA_LLM_LIBRARY = "rocm";
    #  #HCC_AMDGPU_TARGET = "gfx1151";
    #  #HSA_ENABLE_SDMA = "0";
    #  OLLAMA_DEBUG = "1";
    #  OLLAMA_CONTEXT_LENGTH = "131072";
    #  OLLAMA_KV_CACHE_TYPE = "q8_0";
    #};
    #rocmOverrideGfx = "11.5.1";
  };

  users.groups.render.members = [ "knix" ];
}

No, it’s a tool in the llama-cpp package.

Do you know if it’s available in nix? I couldn’t find it.

I use ollama mainly as an API, to integrate with my code editor. It would be nice if we had more alternatives in nix.

llama-cpp is in Nixpkgs under that exact name; it provides the llama-server executable. (Upstream calls the package llama.cpp.)


I’ve tried llama-server with both llama-cpp-rocm and llama-cpp-vulkan.

The results are the same as reported in the first post.

llama-server --host 0.0.0.0 --port 2000 --no-warmup \
  -hf mistralai/Ministral-3-8B-Reasoning-2512-GGUF:Q4_K_M \
  --jinja
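Once llama-server is up, it speaks an OpenAI-compatible API, so a quick smoke test from another terminal could look like this (port as configured above):

```shell
# Minimal chat completion request against the local llama-server.
curl -s http://localhost:2000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"hello"}]}'
```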

I take this as a signal that it’s not the tool per se; proper ROCm support just isn’t there yet.


By the way, I’ve tried the new Qwen3-Coder-Next-GGUF and it produces tokens at the same rate as ministral-3-8b, around 14 tokens/second. Seems like a good option for now.

nix-shell -p llama-cpp-vulkan
llama-cli \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --fit on --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 \
    --jinja
[ Prompt: 42,7 t/s | Generation: 14,0 t/s ]