Properly dealing with ML / CUDA / Python

I have now implemented quite a lot of the above goal. (Yes, I use LLMs, but I do read the code afterwards. Still, this is WIP right now, so that review is only on a best-effort basis.)

I am now able to build, from binaries or from source, a compatible set of torch, flash-attn, mamba-ssm, and causal-conv1d, while being able to specify the Python, CUDA, and torch versions, plus a special cuda-packages version for Pascal GPUs.
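
A quick way to sanity-check what actually ended up in an environment could look like the sketch below. It is not part of the build setup described here. The helper names (`installed_versions`, `is_pascal`) are made up for illustration, and it assumes the packages are installed under their usual PyPI distribution names. It only reports versions and does not verify that the set is compatible.

```python
from importlib import metadata

# Assumed PyPI distribution names for the packages mentioned above.
PACKAGES = ["torch", "flash-attn", "mamba-ssm", "causal-conv1d"]


def installed_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def is_pascal(capability):
    """True for compute capability 6.x (Pascal), given a (major, minor) tuple."""
    major, _minor = capability
    return major == 6


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")

    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None:
        # CUDA version torch was built against (None for CPU-only builds).
        print("torch built against CUDA:", torch.version.cuda)
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            note = " (Pascal)" if is_pascal(cap) else ""
            print(f"device 0 compute capability: {cap}{note}")
```

Running it inside the built environment prints one line per package, plus the CUDA version torch reports and whether device 0 is a Pascal card.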

(The comment says otherwise, but CUDA 13.0 is technically supported; the only thing missing is a cuda-packages v13 to be provided.)

For now I will just keep this working for myself, meaning only for the packages that I use.