Series: Tech
If you work in high-performance computing (HPC), you already know SLURM. It’s the industry-standard job scheduler used at supercomputing centers and labs around the world. But what if you could bring that same workload orchestration to your own datacenter, in minutes, on your own hardware? And what if you could also deploy the next-generation Flux scheduler alongside it?
This post walks through how we built oxide-slurm, a complete Infrastructure as Code solution that deploys both SLURM and Flux clusters on an Oxide rack with full automation.
There are lighter-weight orchestrators out there, but mature HPC workloads demand proven schedulers. SLURM remains the gold standard for high-throughput job management with queueing, accounting, partitioning, multi-user policies, and a thriving community. But Flux represents the next generation—built from the ground up for modern HPC with hierarchical scheduling, graph-based resource models, and sub-second job turnaround.
This project gives you both options with identical infrastructure and deployment patterns.
The oxide-slurm project demonstrates a complete end-to-end HPC deployment built on two key technologies: Terraform for infrastructure provisioning and Ansible for cluster configuration.
The Terraform module provisions the complete cloud infrastructure:
resource "oxide_instance" "compute" {
for_each = { for i in range(var.instance_count) : i => "${var.instance_prefix}-${i + 1}" }
project_id = data.oxide_project.slurm.id
boot_disk_id = oxide_disk.compute_disks[each.key].id
memory = var.memory
ncpus = var.ncpus
start_on_create = true
external_ips = [{ type = "ephemeral" }]
# ... network interfaces and user data
}
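The Makefile shown later wraps all of this, but driving the module by hand looks roughly like the sketch below. The variable names come from the resource above; the values are illustrative, and the byte-based memory unit is an assumption.

# Manual provisioning sketch; assumes Oxide provider credentials are already configured
terraform init
terraform apply \
  -var 'instance_count=3' \
  -var 'instance_prefix=slurm-node' \
  -var 'ncpus=8' \
  -var 'memory=34359738368'   # 32 GiB, assuming memory is expressed in bytes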
The Ansible layer uses a role-based architecture supporting both schedulers:
Shared Infrastructure Roles:

- common: Base OS configuration, users, and system setup
- nfs: Shared filesystem for package distribution
- munge: Cluster-wide authentication
- verification: Health checks and cluster validation

Scheduler-Specific Roles:

- slurm: Traditional SLURM deployment with local .deb packaging
- flux: Next-generation Flux scheduler with Spack-based builds

This project supports both SLURM and Flux deployments with minimal changes to the Ansible playbooks. You can choose to install either scheduler based on your needs, or even run both in parallel for different workloads.
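If you prefer to skip the Makefile, the playbooks can be run directly. The playbook paths below come from the snippets that follow; the inventory filename is an assumption.

# Run either scheduler's playbook against the generated inventory
ansible-playbook -i ansible/inventory.ini ansible/playbooks/slurm-cluster.yml
ansible-playbook -i ansible/inventory.ini ansible/playbooks/flux-cluster.yml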
The traditional choice for production HPC environments:
# ansible/playbooks/slurm-cluster.yml
- name: Setup Slurm Cluster with Local Configs and NFS
  hosts: all
  become: true
  roles:
    - common
    - nfs
    - munge
    - slurm
    - verification
Our SLURM role builds custom .deb packages locally to avoid upstream dependency issues and distributes them via NFS for consistent installations across the cluster.
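As a rough sketch, that packaging step follows the standard Debian build flow for recent SLURM releases. The version number, package list, and NFS directory layout here are assumptions, not the role's literal tasks.

# Build SLURM .deb packages from the upstream tarball, then stage them on NFS
sudo apt-get install -y build-essential devscripts equivs fakeroot
tar -xaf slurm-23.11.1.tar.bz2 && cd slurm-23.11.1
sudo mk-build-deps -i debian/control   # install the declared build dependencies
debuild -b -uc -us                     # emits ../*.deb for slurmctld, slurmd, slurmdbd, ...
cp ../*.deb /data/slurm-debs/          # /data is the NFS export; directory name assumed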
The next-generation scheduler with modern architecture:
# ansible/playbooks/flux-cluster.yml
- name: Setup Flux Cluster with Local Configs and NFS
  hosts: all
  become: true
  roles:
    - common
    - nfs
    - munge
    - flux
    - verification
The Flux role builds the complete Flux stack with the Spack package manager.
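A minimal sketch of that Spack-driven build is below. flux-core and flux-sched are real Spack packages; installing Spack under the NFS export is an assumption about the role's layout.

# Bootstrap Spack on the shared filesystem and build the Flux stack once
git clone --depth 1 https://github.com/spack/spack.git /data/spack
. /data/spack/share/spack/setup-env.sh
spack install flux-sched              # builds flux-core and its dependencies too
spack load flux-core flux-sched
flux version                          # sanity check that the CLI resolves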
The project includes a comprehensive Makefile that orchestrates the entire deployment:
# One-command infrastructure provisioning
make infra-up
# Provisions VPC, instances, waits for connectivity
# Deploy your scheduler of choice
make deploy-slurm # Traditional SLURM cluster
make deploy-flux # Next-gen Flux cluster
make deploy # Defaults to SLURM
# Full teardown
make destroy
The Makefile handles the full lifecycle: provisioning, waiting for node connectivity, scheduler deployment, and teardown.
We deliberately made an architectural choice around NFS usage. Unlike many HPC deployments that put configuration files on shared storage, we use NFS exclusively for build artifacts: locally built .deb files and Spack stacks. All configuration files (slurm.conf, flux-config.json, etc.) are written to local disk via Ansible templates. This makes the system more resilient to NFS issues and simplifies debugging when nodes misbehave.
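On a compute node, that split looks roughly like this; the mount point, directory names, and config path are assumptions based on the layout described here.

# Shared artifacts arrive over NFS...
mount | grep ' /data '            # e.g. head-node:/data on /data type nfs4
ls /data/slurm-debs /data/spack   # packages and the Spack tree live on the export
# ...while configuration is plain local files rendered by Ansible
ls -l /etc/slurm/slurm.conf       # local disk, not a symlink into NFS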
Concretely, the SLURM deployment ends up with:

- .deb packages for slurmctld, slurmd, and slurmdbd
- slurm.conf with automatic node discovery
- a /data export for shared build artifacts

Once deployed, both schedulers provide familiar HPC workflows:
# Check cluster status
sinfo -Nel
# Interactive job
srun -N2 -n4 --pty /bin/bash
# Batch submission
sbatch job.sbatch
# Queue monitoring
squeue -u $USER
# Resource information
flux resource list
# Submit parallel job
flux run -N2 -n4 hostname
# Job status
flux jobs -a
Both provide the essential HPC functionality: job queueing, resource management, parallel execution, and cluster monitoring.
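For reference, the job.sbatch referenced in the SLURM block above might look like this minimal example; the contents are illustrative, not taken from the repo.

#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --time=00:05:00
#SBATCH --output=hello_%j.out

srun hostname    # each allocated task reports the node it landed on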
Terraform automatically generates Ansible inventory from deployed infrastructure:
[head_node]
slurm-node-1 ansible_host=203.0.113.10 ansible_user=ubuntu
[compute]
slurm-node-2 ansible_host=203.0.113.11 ansible_user=ubuntu
slurm-node-3 ansible_host=203.0.113.12 ansible_user=ubuntu
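One common way to wire this up is to have Terraform expose the inventory as an output and redirect it to a file after provisioning; the output name here is hypothetical, and the module may instead render the file with a local_file resource.

# After `make infra-up`, materialize the inventory for Ansible
terraform output -raw ansible_inventory > ansible/inventory.ini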
Each instance receives cloud-init configuration for initial setup:
# Configure sudo access
echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu
# Ensure SSH availability
systemctl enable ssh && systemctl start ssh
The complete deployment follows this pattern:
- make infra-up creates the VPC, instances, and disks, then waits for SSH connectivity
- make deploy-slurm or make deploy-flux runs the corresponding Ansible playbooks
Total deployment time: Under 15 minutes for a complete HPC cluster.
With your scheduler running on Oxide infrastructure, you can layer on additional HPC capabilities.
The complete infrastructure and automation code is available on GitHub: oxide-slurm
Requirements: access to an Oxide rack, plus Terraform and Ansible on the machine you deploy from.
Clone the repo, configure your terraform.tfvars, and you're 15 minutes away from production HPC scheduling on your own hardware.
Whether you need the proven stability of SLURM or want to explore the cutting-edge capabilities of Flux, this project gives you both with enterprise-grade automation and Infrastructure as Code practices.