
If you work in high-performance computing (HPC), you already know SLURM. It’s the industry-standard job scheduler used at supercomputing centers and labs around the world. But what if you could bring that same workload orchestration to your own datacenter, in minutes, on your own hardware? And what if you could also deploy the next-generation Flux scheduler alongside it?

This post walks through how we built oxide-slurm, a complete Infrastructure as Code solution that deploys both SLURM and Flux clusters on an Oxide rack with full automation.

There are lighter-weight orchestrators out there, but mature HPC workloads demand proven schedulers. SLURM remains the gold standard for high-throughput job management, with queueing, accounting, partitioning, multi-user policies, and a thriving community. Flux, meanwhile, represents the next generation: built from the ground up for modern HPC, with hierarchical scheduling, graph-based resource models, and sub-second job turnaround.

This project gives you both options with identical infrastructure and deployment patterns.

The oxide-slurm project demonstrates an end-to-end HPC deployment built on two key technologies: Terraform (OpenTofu by default) for infrastructure provisioning and Ansible for cluster configuration.

The Terraform module provisions the complete cloud infrastructure:

  • VPC and Networking: Creates isolated network environments with automatic subnet configuration
  • Compute Instances: Provisions 1 to N nodes, with the first designated as the head node
  • Storage: Attaches persistent disks with Ubuntu base images
  • Dynamic Inventory: Automatically generates Ansible inventory from instance IPs
resource "oxide_instance" "compute" {
  for_each = { for i in range(var.instance_count) : i => "${var.instance_prefix}-${i + 1}" }
  
  project_id       = data.oxide_project.slurm.id
  boot_disk_id     = oxide_disk.compute_disks[each.key].id
  memory           = var.memory
  ncpus            = var.ncpus
  start_on_create  = true
  
  external_ips = [{ type = "ephemeral" }]
  # ... network interfaces and user data
}

The Ansible layer uses a role-based architecture supporting both schedulers:

Shared Infrastructure Roles:

  • common: Base OS configuration, users, and system setup
  • nfs: Shared filesystem for package distribution
  • munge: Cluster-wide authentication
  • verification: Health checks and cluster validation

Scheduler-Specific Roles:

  • slurm: Traditional SLURM deployment with local .deb packaging
  • flux: Next-generation Flux scheduler with Spack-based builds
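
Putting the playbooks and roles together, the Ansible tree looks roughly like this (reconstructed from the names above; the repo's exact layout may differ):

ansible/
├── playbooks/
│   ├── slurm-cluster.yml
│   └── flux-cluster.yml
└── roles/
    ├── common/
    ├── flux/
    ├── munge/
    ├── nfs/
    ├── slurm/
    └── verification/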

This project supports both SLURM and Flux deployments with minimal changes to the Ansible playbooks. You can choose to install either scheduler based on your needs, or even run both in parallel for different workloads.

SLURM is the traditional choice for production HPC environments:

# ansible/playbooks/slurm-cluster.yml
- name: Setup Slurm Cluster with Local Configs and NFS
  hosts: all
  become: true
  roles:
    - common
    - nfs
    - munge
    - slurm
    - verification

Our SLURM role builds custom .deb packages locally to avoid upstream dependency issues and distributes them via NFS for consistent installations across the cluster.
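
The install step on each node then boils down to pulling the pre-built packages off the share. A sketch of that pattern (package file names and paths are illustrative, not the role's actual artifacts):

- name: Install locally built SLURM packages from the NFS share
  ansible.builtin.apt:
    deb: "/data/packages/{{ item }}"   # /data is the cluster's NFS export
  loop:
    - slurmctld.deb   # illustrative file names
    - slurmd.deb
    - slurmdbd.deb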

Flux is the next-generation scheduler, with a more modern architecture:

# ansible/playbooks/flux-cluster.yml  
- name: Setup Flux Cluster with Local Configs and NFS
  hosts: all
  become: true
  roles:
    - common
    - nfs
    - munge
    - flux
    - verification

The Flux role builds a complete stack using the Spack package manager:

  • GCC 13 toolchain
  • PMIx process management
  • OpenMPI for distributed computing
  • Flux Core v0.76.0 with hierarchical scheduling
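
Conceptually, the role drives Spack along these lines (the exact specs, version pins, and flags in the repo may differ):

# Build the toolchain first, then the HPC stack on top of it.
spack install gcc@13
spack compiler find "$(spack location -i gcc@13)"
spack install pmix %gcc@13
spack install openmpi %gcc@13
spack install flux-core@0.76.0 %gcc@13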

The project includes a comprehensive Makefile that orchestrates the entire deployment:

# One-command infrastructure provisioning
make infra-up
# Provisions VPC, instances, waits for connectivity

# Deploy your scheduler of choice  
make deploy-slurm    # Traditional SLURM cluster
make deploy-flux     # Next-gen Flux cluster  
make deploy          # Defaults to SLURM

# Full teardown
make destroy

The Makefile handles:

  • Python virtual environment setup
  • Terraform infrastructure provisioning with connectivity verification
  • Ansible deployment with proper working directory handling
  • Linting and validation across both Terraform and Ansible code
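
As a rough sketch, a deploy target ties these pieces together like so (the venv path, inventory name, and flags are illustrative, not copied from the repo's Makefile):

# Deploy SLURM once the infrastructure is up and reachable.
deploy-slurm:
	cd ansible && \
	../.venv/bin/ansible-playbook -i inventory.ini playbooks/slurm-cluster.yml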

We made a deliberate architectural choice around NFS. Unlike many HPC deployments that put configuration files on shared storage, we use NFS exclusively for:

  • Package Distribution: Built .deb files and Spack stacks
  • Authentication Keys: Munge keys and SSH keys for cluster communication
  • Build Artifacts: Compiled software stacks

All configuration files (slurm.conf, flux-config.json, etc.) are written to local disk via Ansible templates. This makes the system more resilient to NFS issues and simplifies debugging when nodes misbehave.
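
In Ansible terms that's just an ordinary template task writing to the local filesystem (names here are illustrative):

- name: Write slurm.conf to local disk on every node
  ansible.builtin.template:
    src: slurm.conf.j2        # template name is illustrative
    dest: /etc/slurm/slurm.conf
    mode: "0644"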

Here's what each role sets up:

SLURM role:

  • SLURM built from source
  • Custom .deb packages for slurmctld, slurmd, and slurmdbd
  • Templates for slurm.conf with automatic node discovery
  • Cgroup configuration for resource management
  • Systemd services with proper dependencies

Flux role:

  • Complete Spack-based build environment
  • Flux Core with hierarchical resource management
  • JSON-based cluster resource discovery
  • TCP-based broker communication between nodes

Shared infrastructure:

  • NFS server on the head node exporting /data (sketched below)
  • Munge authentication with shared secret keys
  • SSH key distribution for passwordless cluster communication
  • Verification tasks ensuring cluster health
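
On the head node, the export boils down to a single /etc/exports entry along these lines (the client range is a placeholder for your rack's subnet):

/data  10.0.0.0/24(rw,sync,no_subtree_check)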

Once deployed, both schedulers provide familiar HPC workflows:

With SLURM:

# Check cluster status
sinfo -Nel

# Interactive job
srun -N2 -n4 --pty /bin/bash

# Batch submission  
sbatch job.sbatch

# Queue monitoring
squeue -u $USER

With Flux:

# Resource information
flux resource list

# Submit parallel job
flux run -N2 -n4 hostname

# Job status
flux jobs -a

Both provide the essential HPC functionality: job queueing, resource management, parallel execution, and cluster monitoring.

Terraform automatically generates Ansible inventory from deployed infrastructure:

[head_node]  
slurm-node-1 ansible_host=203.0.113.10 ansible_user=ubuntu

[compute]
slurm-node-2 ansible_host=203.0.113.11 ansible_user=ubuntu  
slurm-node-3 ansible_host=203.0.113.12 ansible_user=ubuntu
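
That file is produced with ordinary Terraform templating. A sketch of what the inventory template could look like (file and variable names are hypothetical, not the module's actual ones):

[head_node]
${head_name} ansible_host=${head_ip} ansible_user=ubuntu

[compute]
%{ for name, ip in compute_nodes ~}
${name} ansible_host=${ip} ansible_user=ubuntu
%{ endfor ~}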

Each instance receives cloud-init configuration for initial setup:

# Configure sudo access
echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu

# Ensure SSH availability  
systemctl enable ssh && systemctl start ssh

The complete deployment follows this pattern:

  1. Infrastructure Provisioning: make infra-up creates VPC, instances, disks
  2. Connectivity Verification: Automated Ansible ping tests ensure readiness
  3. Scheduler Deployment: make deploy-slurm or make deploy-flux
  4. Health Verification: Built-in checks validate cluster functionality
  5. Ready for Workloads: Submit jobs via standard HPC interfaces

Total deployment time: Under 15 minutes for a complete HPC cluster.

With your scheduler running on Oxide infrastructure, you can layer on additional HPC capabilities:

  • Jupyter Integration: Schedule interactive notebooks via SLURM or Flux
  • Container Workloads: Run Singularity/Apptainer containers through the schedulers
  • Multi-User Environments: Configure quotas, partitions, and access controls
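
For example, Jupyter integration can start as simply as a batch script submitted with sbatch that runs a notebook server on a compute node (a hedged sketch; ports, paths, and environment setup will vary):

#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --nodes=1
#SBATCH --time=04:00:00
#SBATCH --output=jupyter-%j.log

# Start a notebook server on the allocated node; reach it from your
# workstation with an SSH tunnel to the node and port printed in the log.
jupyter lab --no-browser --ip="$(hostname -s)" --port=8888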

The complete infrastructure and automation code is available on GitHub: oxide-slurm

Requirements:

  • Access to an Oxide rack
  • OpenTofu (default) or Terraform (requires configuration changes)
  • Local Ansible installation
  • Ubuntu base image uploaded to your rack

Clone the repo, configure your terraform.tfvars, and you’re 15 minutes away from production HPC scheduling on your own hardware.

Whether you need the proven stability of SLURM or want to explore the cutting-edge capabilities of Flux, this project gives you both with enterprise-grade automation and Infrastructure as Code practices.