Originally I was planning to use JupyterLab to schedule jobs on two machines at home, but I have hundreds of jobs to run; some will fail, and running them all will take several weeks. It seems to me that the best solution is to use a job scheduler of some sort on the two nodes I have at home, each with 4 cores. The jobs run a Python framework that I have written and write to local hard disks, with no distributed filesystem or NFS. I am currently using NixOS and NixOps, so deployment should be relatively easy. Is Slurm the right approach, and what is the simplest sane setup I could have?
@markuskowa I know that you have worked heavily on Slurm in NixOS. If you don't have time to help, I understand. I don't care about accounts. I really only want to view the queue and the status of jobs, and possibly persist jobs if a node goes down. Also, is it possible to run Slurm on nodes that are not used solely for jobs, for example nodes that also run an nginx server?
Here is what I have worked out for a NixOps config. From what I can tell, this will set up a debug queue that allows jobs to run forever.
let
  mungekey = "rootmepleasesupersecretpassword"; # set to /etc/munge/munge.key with 0400 permissions
  slurmconfig = {
    client.enable = true;
    controlMachine = "master1";
    nodeName = ''
      master1
      NodeName=node[1-2] CPUs=4 State=UNKNOWN
    '';
    partitionName = "debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP";
  };
  computeNode = {
    services.slurm = slurmconfig;
  };
in {
  master1 =
    { config, pkgs, ... }:
    {
      networking.firewall.enable = false;
      services.slurm = {
        server.enable = true;
      } // slurmconfig;
    };
  worker1 = computeNode;
  worker2 = computeNode;
}
Hi Christopher,
It sounds to me like you should be fine with a minimalist Slurm setup. Slurm itself does not use many resources, so it should be fine if nginx is running on one of the nodes. You could start your jobs with an increased nice value to avoid choking the web server (use the nice command inside the submission script).
Your configuration file looks fine. It is also possible to run a Slurm client and the master on the same host.
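As a rough sketch of the nice suggestion, a submission script could look something like the one below. The niceness of 10, the NICE_LEVEL variable, the run_niced helper, and the file names job.sh and run_job.py are illustrative assumptions and do not come from this thread. The job command is taken from the arguments given to the script.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --ntasks=1

# Niceness added to the job command; 10 is an illustrative value (assumption).
NICE_LEVEL="${NICE_LEVEL:-10}"

# Run a command at reduced CPU priority so that other services on the node
# (for example nginx) are favoured by the kernel scheduler under load.
run_niced() {
  nice -n "$NICE_LEVEL" "$@"
}

# The job command comes from the script arguments, e.g.
#   sbatch job.sh python3 run_job.py
# where run_job.py is a hypothetical entry point for the job.
if [ "$#" -gt 0 ]; then
  run_niced "$@"
fi
```

Submitted with something like "sbatch job.sh python3 run_job.py", the job itself would run at the higher nice value, while Slurm's own daemons keep their normal priority.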
Thank you! In that case I will have a Slurm setup at home soon. I appreciate the help.
I realize that others' configurations may be different, but here is what I did to get a two-node Slurm cluster with NixOps. secrets.nix is a file where I store passwords and special configuration information. This configuration will scale to N nodes.
{ config, pkgs, ... }:
let
  secrets = import ../../secrets.nix;
  isMaster = (secrets.slurm.master == config.networking.hostName);
  slurmconfig = {
    client.enable = true;
    controlMachine = secrets.slurm.master;
    nodeName = [
      "worker[1-2] CPUs=4 State=UNKNOWN"
    ];
    partitionName = [
      "standard Nodes=worker[1-2] Default=YES MaxTime=INFINITE State=UP"
    ];
  };
in
{
  services.slurm = (if isMaster then { server.enable = true; } else { }) // slurmconfig;
  services.munge = {
    enable = true;
    password = "/var/run/keys/munge-key";
  };
  users.users.munge.extraGroups = ["keys"];
  deployment.keys.munge-key = {
    text = secrets.slurm.mungekey;
    user = "munge";
    group = "munge";
    permissions = "0400";
  };
}
And now I have a working cluster!
$ squeue -u costrouc
JOBID PARTITION NAME USER     ST TIME NODES NODELIST(REASON)
15    standard  test costrouc PD 0:00 1     (Resources)
16    standard  test costrouc PD 0:00 1     (Priority)
17    standard  test costrouc PD 0:00 1     (Priority)
18    standard  test costrouc PD 0:00 1     (Priority)
19    standard  test costrouc PD 0:00 1     (Priority)
20    standard  test costrouc PD 0:00 1     (Priority)
21    standard  test costrouc PD 0:00 1     (Priority)
22    standard  test costrouc PD 0:00 1     (Priority)
23    standard  test costrouc PD 0:00 1     (Priority)
13    standard  test costrouc R  0:58 1     worker1
14    standard  test costrouc R  0:36 1     worker2