
If you work in high-performance computing (HPC), you already know SLURM. It’s the industry-standard job scheduler used at supercomputing centers and labs around the world. But what if you could bring that same workload orchestration to your own datacenter, in minutes, on your own hardware? And what if you could also deploy the next-generation Flux scheduler alongside it?

This post walks through how we built oxide-slurm, a complete Infrastructure as Code solution that deploys both SLURM and Flux clusters on an Oxide rack with full automation.

There are lighter-weight orchestrators out there, but mature HPC workloads demand proven schedulers. SLURM remains the gold standard for high-throughput job management, with queueing, accounting, partitioning, multi-user policies, and a thriving community. Flux, meanwhile, represents the next generation: built from the ground up for modern HPC, with hierarchical scheduling, graph-based resource models, and sub-second job turnaround.

This project gives you both options with identical infrastructure and deployment patterns.

The oxide-slurm project demonstrates an end-to-end HPC deployment built on two key technologies: Terraform (OpenTofu by default) for infrastructure provisioning and Ansible for cluster configuration.

The Terraform module provisions the complete cloud infrastructure:

  • VPC and Networking: Creates isolated network environments with automatic subnet configuration
  • Compute Instances: Provisions 1 to N nodes, with the first designated as the head node
  • Storage: Attaches persistent disks with Ubuntu base images
  • Dynamic Inventory: Automatically generates Ansible inventory from instance IPs
resource "oxide_instance" "compute" {
  for_each = { for i in range(var.instance_count) : i => "${var.instance_prefix}-${i + 1}" }
  
  project_id       = data.oxide_project.slurm.id
  boot_disk_id     = oxide_disk.compute_disks[each.key].id
  memory           = var.memory
  ncpus            = var.ncpus
  start_on_create  = true
  
  external_ips = [{ type = "ephemeral" }]
  # ... network interfaces and user data
}

The Ansible layer uses a role-based architecture supporting both schedulers:

Shared Infrastructure Roles:

  • common: Base OS configuration, users, and system setup
  • nfs: Shared filesystem for package distribution
  • munge: Cluster-wide authentication
  • verification: Health checks and cluster validation

Scheduler-Specific Roles:

  • slurm: Traditional SLURM deployment with local .deb packaging
  • flux: Next-generation Flux scheduler with Spack-based builds
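
Putting the playbooks and roles together, the Ansible tree looks roughly like this (reconstructed from the names above; the repo's exact layout may differ):

ansible/
├── playbooks/
│   ├── slurm-cluster.yml
│   └── flux-cluster.yml
└── roles/
    ├── common/
    ├── flux/
    ├── munge/
    ├── nfs/
    ├── slurm/
    └── verification/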

This project supports both SLURM and Flux deployments with minimal changes to the Ansible playbooks. You can choose to install either scheduler based on your needs, or even run both in parallel for different workloads.

SLURM is the traditional choice for production HPC environments:

# ansible/playbooks/slurm-cluster.yml
- name: Setup Slurm Cluster with Local Configs and NFS
  hosts: all
  become: true
  roles:
    - common
    - nfs
    - munge
    - slurm
    - verification

Our SLURM role builds custom .deb packages locally to avoid upstream dependency issues and distributes them via NFS for consistent installations across the cluster.
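
The install step on each node then boils down to pulling the pre-built packages off the share. A sketch of that pattern (package file names and paths are illustrative, not the role's actual artifacts):

- name: Install locally built SLURM packages from the NFS share
  ansible.builtin.apt:
    deb: "/data/packages/{{ item }}"   # /data is the cluster's NFS export
  loop:
    - slurmctld.deb   # illustrative file names
    - slurmd.deb
    - slurmdbd.deb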

Flux is the next-generation scheduler, with a more modern architecture:

# ansible/playbooks/flux-cluster.yml  
- name: Setup Flux Cluster with Local Configs and NFS
  hosts: all
  become: true
  roles:
    - common
    - nfs
    - munge
    - flux
    - verification

The Flux role builds a complete stack using the Spack package manager:

  • GCC 13 toolchain
  • PMIx process management
  • OpenMPI for distributed computing
  • Flux Core v0.76.0 with hierarchical scheduling
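
Conceptually, the role drives Spack along these lines (the exact specs, version pins, and flags in the repo may differ):

# Build the toolchain first, then the HPC stack on top of it.
spack install gcc@13
spack compiler find "$(spack location -i gcc@13)"
spack install pmix %gcc@13
spack install openmpi %gcc@13
spack install flux-core@0.76.0 %gcc@13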

The project includes a comprehensive Makefile that orchestrates the entire deployment:

# One-command infrastructure provisioning
make infra-up
# Provisions VPC, instances, waits for connectivity

# Deploy your scheduler of choice  
make deploy-slurm    # Traditional SLURM cluster
make deploy-flux     # Next-gen Flux cluster  
make deploy          # Defaults to SLURM

# Full teardown
make destroy

The Makefile handles:

  • Python virtual environment setup
  • Terraform infrastructure provisioning with connectivity verification
  • Ansible deployment with proper working directory handling
  • Linting and validation across both Terraform and Ansible code
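
As a rough sketch, a deploy target ties these pieces together like so (the venv path, inventory name, and flags are illustrative, not copied from the repo's Makefile):

# Deploy SLURM once the infrastructure is up and reachable.
deploy-slurm:
	cd ansible && \
	../.venv/bin/ansible-playbook -i inventory.ini playbooks/slurm-cluster.yml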

We made a deliberate architectural choice around NFS. Unlike many HPC deployments that put configuration files on shared storage, we use NFS exclusively for:

  • Package Distribution: Built .deb files and Spack stacks
  • Authentication Keys: Munge keys and SSH keys for cluster communication
  • Build Artifacts: Compiled software stacks

All configuration files (slurm.conf, flux-config.json, etc.) are written to local disk via Ansible templates. This makes the system more resilient to NFS issues and simplifies debugging when nodes misbehave.
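
In Ansible terms that's just an ordinary template task writing to the local filesystem (names here are illustrative):

- name: Write slurm.conf to local disk on every node
  ansible.builtin.template:
    src: slurm.conf.j2        # template name is illustrative
    dest: /etc/slurm/slurm.conf
    mode: "0644"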

Here's what each role sets up:

SLURM role:

  • SLURM built from source
  • Custom .deb packages for slurmctld, slurmd, and slurmdbd
  • Templates for slurm.conf with automatic node discovery
  • Cgroup configuration for resource management
  • Systemd services with proper dependencies

Flux role:

  • Complete Spack-based build environment
  • Flux Core with hierarchical resource management
  • JSON-based cluster resource discovery
  • TCP-based broker communication between nodes

Shared infrastructure:

  • NFS server on the head node exporting /data (sketched below)
  • Munge authentication with shared secret keys
  • SSH key distribution for passwordless cluster communication
  • Verification tasks ensuring cluster health
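
On the head node, the export boils down to a single /etc/exports entry along these lines (the client range is a placeholder for your rack's subnet):

/data  10.0.0.0/24(rw,sync,no_subtree_check)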

Once deployed, both schedulers provide familiar HPC workflows:

With SLURM:

# Check cluster status
sinfo -Nel

# Interactive job
srun -N2 -n4 --pty /bin/bash

# Batch submission  
sbatch job.sbatch

# Queue monitoring
squeue -u $USER

With Flux:

# Resource information
flux resource list

# Submit parallel job
flux run -N2 -n4 hostname

# Job status
flux jobs -a

Both provide the essential HPC functionality: job queueing, resource management, parallel execution, and cluster monitoring.

Terraform automatically generates Ansible inventory from deployed infrastructure:

[head_node]  
slurm-node-1 ansible_host=203.0.113.10 ansible_user=ubuntu

[compute]
slurm-node-2 ansible_host=203.0.113.11 ansible_user=ubuntu  
slurm-node-3 ansible_host=203.0.113.12 ansible_user=ubuntu
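
That file is produced with ordinary Terraform templating. A sketch of what the inventory template could look like (file and variable names are hypothetical, not the module's actual ones):

[head_node]
${head_name} ansible_host=${head_ip} ansible_user=ubuntu

[compute]
%{ for name, ip in compute_nodes ~}
${name} ansible_host=${ip} ansible_user=ubuntu
%{ endfor ~}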

Each instance receives cloud-init configuration for initial setup:

# Configure sudo access
echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu

# Ensure SSH availability  
systemctl enable ssh && systemctl start ssh

The complete deployment follows this pattern:

  1. Infrastructure Provisioning: make infra-up creates VPC, instances, disks
  2. Connectivity Verification: Automated Ansible ping tests ensure readiness
  3. Scheduler Deployment: make deploy-slurm or make deploy-flux
  4. Health Verification: Built-in checks validate cluster functionality
  5. Ready for Workloads: Submit jobs via standard HPC interfaces

Total deployment time: Under 15 minutes for a complete HPC cluster.

With your scheduler running on Oxide infrastructure, you can layer on additional HPC capabilities:

  • Jupyter Integration: Schedule interactive notebooks via SLURM or Flux
  • Container Workloads: Run Singularity/Apptainer containers through the schedulers
  • Multi-User Environments: Configure quotas, partitions, and access controls
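
For example, Jupyter integration can start as simply as a batch script submitted with sbatch that runs a notebook server on a compute node (a hedged sketch; ports, paths, and environment setup will vary):

#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --nodes=1
#SBATCH --time=04:00:00
#SBATCH --output=jupyter-%j.log

# Start a notebook server on the allocated node; reach it from your
# workstation with an SSH tunnel to the node and port printed in the log.
jupyter lab --no-browser --ip="$(hostname -s)" --port=8888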

The complete infrastructure and automation code is available on GitHub: oxide-slurm

Requirements:

  • Access to an Oxide rack
  • OpenTofu (default) or Terraform (requires configuration changes)
  • Local Ansible installation
  • Ubuntu base image uploaded to your rack

Clone the repo, configure your terraform.tfvars, and you’re 15 minutes away from production HPC scheduling on your own hardware.

Whether you need the proven stability of SLURM or want to explore the cutting-edge capabilities of Flux, this project gives you both with enterprise-grade automation and Infrastructure as Code practices.