Advice for a simple NixOS setup of a local Slurm cluster at home

Originally I was planning to use JupyterLab to schedule jobs on two machines at home, but I have hundreds of jobs that I need to run … some will fail, and it will take several weeks to run them all. To me it seems like the best solution is to use a job scheduler of some sort on the two nodes that I have at home, with 4 cores each. These jobs would run a Python framework that I have written and write locally to hard disks, not using a distributed filesystem or NFS. I am currently using NixOS and NixOps, so deployment should be relatively easy. I wanted to ask: is Slurm the correct approach, and what is the absolute simplest and sane setup that I could have?

@markuskowa I know that you have worked heavily on Slurm in NixOS. If you don’t have time to help, I understand. I don’t care about accounts. I really would only like to view the queue and the status of jobs, and possibly persist jobs if a node goes down. Also, is it possible to run Slurm on nodes that are not solely used for jobs, e.g. ones also running an nginx server?

Here is what I have worked out for a NixOps config. From what I can tell, this will set up a debug queue that allows jobs to run forever.

let mungekey = "rootmepleasesupersecretpassword"; # set to /etc/munge/munge.key with 0400 permissions
    slurmconfig = {
      client.enable = true;
      controlMachine = "master1";
      nodeName = ''
        NodeName=node[1-2] CPUs=4 State=UNKNOWN
      '';
      partitionName = "debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP";
    };
    computeNode =
      { config, pkgs, ... }:
      {
        services.slurm = slurmconfig;
      };
in {
    master1 =
      { config, pkgs, ... }:
      {
        networking.firewall.enable = false;
        services.slurm = {
          server.enable = true;
        } // slurmconfig;
      };
    worker1 = computeNode;
    worker2 = computeNode;
}

Hi Christopher,

That sounds to me like you should be fine with a minimalist Slurm setup. Slurm itself does not use many resources. It should be fine if nginx is running on one of the nodes. You could start your jobs with an increased nice value to avoid choking the web server (use the nice command inside the submission script).
Your configuration file looks fine. It is also possible to run a Slurm client and the master on the same host.
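A sketch of such a submission script — the framework entry point `my_framework.py`, the nice level, and the resource values are placeholders, not something from this thread:

```shell
#!/usr/bin/env bash
#SBATCH --partition=debug
#SBATCH --job-name=test
#SBATCH --cpus-per-task=1

# Run the job at a lower scheduling priority so an nginx server
# on the same node stays responsive.
nice -n 10 python my_framework.py
```

Submitted with `sbatch test.sh`; Slurm reads the `#SBATCH` comment lines as job options, and the `nice` applies only to the job's processes, not to slurmd itself.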

Thank you! In that case I will have a Slurm setup at home soon. I appreciate the help.

I realize that others’ configurations may differ, but here is what I did to get a two-node Slurm cluster with NixOps. secrets.nix is a file where I store passwords and other special configuration information. This configuration will scale to N nodes.

{ config, pkgs, ... }:

let secrets = import ../../secrets.nix;
    isMaster = (secrets.slurm.master == config.networking.hostName);
    slurmconfig = {
      client.enable = true;
      controlMachine = secrets.slurm.master;
      nodeName = [
        "worker[1-2] CPUs=4 State=UNKNOWN"
      ];
      partitionName = [
        "standard Nodes=worker[1-2] Default=YES MaxTime=INFINITE State=UP"
      ];
    };
in {
  services.slurm = (if isMaster then { server.enable = true; } else { }) // slurmconfig;

  services.munge = {
    enable = true;
    password = "/var/run/keys/munge-key";
  };

  users.users.munge.extraGroups = ["keys"];

  deployment.keys.munge-key = {
    text = secrets.slurm.mungekey;
    user = "munge";
    group = "munge";
    permissions = "0400";
  };
}
And now I have a working cluster!

$ squeue -u costrouc
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                15  standard     test costrouc PD       0:00      1 (Resources)
                16  standard     test costrouc PD       0:00      1 (Priority)
                17  standard     test costrouc PD       0:00      1 (Priority)
                18  standard     test costrouc PD       0:00      1 (Priority)
                19  standard     test costrouc PD       0:00      1 (Priority)
                20  standard     test costrouc PD       0:00      1 (Priority)
                21  standard     test costrouc PD       0:00      1 (Priority)
                22  standard     test costrouc PD       0:00      1 (Priority)
                23  standard     test costrouc PD       0:00      1 (Priority)
                13  standard     test costrouc  R       0:58      1 worker1
                14  standard     test costrouc  R       0:36      1 worker2
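The test jobs above were queued by hand; for the hundreds of real jobs, a small wrapper loop can do the submitting and let Slurm work through the backlog. A hedged sketch — `run.py`, the case names, and the dry-run fallback are illustrative assumptions, not from this setup:

```shell
#!/usr/bin/env bash
# Submit one Slurm job per case. Off the cluster (no sbatch in PATH)
# the commands are only printed, which makes the script safe to dry-run.
submit() {
    if command -v sbatch >/dev/null 2>&1; then
        sbatch "$@"
    else
        echo "would run: sbatch $*"
    fi
}

# run.py stands in for the real Python framework entry point.
for case in alpha beta gamma; do
    submit --partition=standard --job-name "test-$case" \
           --wrap "nice -n 10 python run.py --case $case"
done
```

Since each job is independent and some will fail, letting the queue track them beats babysitting a notebook; `squeue` shows what is pending and failed jobs can simply be resubmitted.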