HPC & Apptainer Deployment¶
Run Bazzite Pods on HPC clusters using Apptainer (formerly Singularity).
Prerequisites¶
- Apptainer installed on the cluster (check: apptainer --version)
- Access to a shared filesystem for image storage
- GPU nodes with NVIDIA drivers (for GPU pods)
Apptainer Installation¶
Most HPC systems have Apptainer pre-installed. If not:
# Ubuntu/Debian
sudo apt-get install apptainer
# RHEL/Fedora
sudo dnf install apptainer
# From source: https://apptainer.org/docs/admin/main/installation.html
Quick Start¶
# Pull image (creates .sif file)
apptainer pull docker://ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
# Interactive shell with GPU
apptainer exec --nv bazzite-ai-pod-nvidia-python_stable.sif bash
# Run a command
apptainer exec --nv bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml python train.py
Image Management¶
Pull Images to Shared Storage¶
Store images in a shared location accessible from all nodes:
# Set image cache location
export APPTAINER_CACHEDIR=/shared/containers/cache
mkdir -p $APPTAINER_CACHEDIR
# Pull to shared storage
cd /shared/containers
apptainer pull docker://ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
apptainer pull docker://ghcr.io/atrawog/bazzite-ai-pod-jupyter:stable
apptainer pull docker://ghcr.io/atrawog/bazzite-ai-pod-devops:stable
Image Naming¶
Apptainer creates .sif files from Docker images:
docker://ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
→ bazzite-ai-pod-nvidia-python_stable.sif
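You can also pass an explicit output filename to apptainer pull, which keeps paths in job scripts predictable (python-pod.sif below is just an example name):
# Pull to an explicitly named .sif file (example filename)
apptainer pull python-pod.sif \
    docker://ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable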
Slurm Examples¶
Interactive Session¶
# Request GPU node with interactive shell
srun --partition=gpu --gres=gpu:1 --time=2:00:00 --pty \
apptainer exec --nv /shared/containers/bazzite-ai-pod-nvidia-python_stable.sif bash
Batch Job Script¶
#!/bin/bash
#SBATCH --job-name=pytorch-training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --output=train_%j.out
#SBATCH --error=train_%j.err
# Load modules if needed
module load apptainer
# Set working directory
cd $SLURM_SUBMIT_DIR
# Run training
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml python /workspace/train.py
Submit with: sbatch train.slurm
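After submission, the usual Slurm tools show job status (replace <jobid> with the ID printed by sbatch):
# Jobs queued or running for your user
squeue -u $USER
# Accounting details after the job finishes
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS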
Array Job (Hyperparameter Sweep)¶
#!/bin/bash
#SBATCH --job-name=hyperparam
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=4:00:00
#SBATCH --array=0-9
#SBATCH --output=sweep_%A_%a.out
# Learning rates to sweep
LR_VALUES=(0.001 0.0005 0.0001 0.00005 0.00001 0.005 0.01 0.02 0.05 0.1)
LR=${LR_VALUES[$SLURM_ARRAY_TASK_ID]}
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml \
python /workspace/train.py --lr $LR --run-id $SLURM_ARRAY_TASK_ID
Multi-GPU Job¶
#!/bin/bash
#SBATCH --job-name=multi-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --time=48:00:00
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml \
torchrun --nproc_per_node=4 /workspace/train_distributed.py
PBS/SGE Examples¶
PBS Script¶
#!/bin/bash
#PBS -N pytorch-training
#PBS -q gpu
#PBS -l select=1:ncpus=8:mem=32gb:ngpus=1
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml python /workspace/train.py
SGE Script¶
#!/bin/bash
#$ -N pytorch-training
#$ -q gpu.q
#$ -l gpu=1
#$ -l h_rt=24:00:00
#$ -cwd
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml python /workspace/train.py
Bind Mounts¶
Common Patterns¶
# Working directory
--bind $(pwd):/workspace
# Data directory
--bind /shared/datasets:/data:ro
# Output directory
--bind /scratch/$USER:/output
# Multiple binds
--bind $(pwd):/workspace,/shared/datasets:/data:ro,/scratch/$USER:/output
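If the same binds are used in every job, they can be set once through the APPTAINER_BIND environment variable instead of repeating --bind on each command; a minimal sketch:
# Equivalent to passing --bind on every apptainer command in this shell or job
export APPTAINER_BIND="$(pwd):/workspace,/shared/datasets:/data:ro"
apptainer exec --nv image.sif ls /data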
Home Directory¶
# Use host home (default behavior)
apptainer exec ... bash
# Isolated home
apptainer exec --no-home --bind $(pwd):/workspace ... bash
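For stricter isolation, Apptainer's --containall flag also contains PID, IPC, and the environment, so only explicitly bound host paths are visible inside the container; a sketch:
# Fully contained run: no host home, isolated PID/IPC namespaces and environment
apptainer exec --containall --nv --bind $(pwd):/workspace image.sif bash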
GPU Access¶
NVIDIA¶
# Enable NVIDIA GPU access
apptainer exec --nv image.sif ...
# Verify inside container
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"
AMD ROCm¶
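Apptainer exposes AMD GPUs through the --rocm flag (the host must have ROCm drivers installed); usage mirrors --nv:
# Enable AMD GPU access
apptainer exec --rocm image.sif ...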
Without GPU¶
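On CPU-only nodes, simply omit the GPU flag; mirroring the Quick Start command:
# CPU-only: no --nv / --rocm
apptainer exec bazzite-ai-pod-nvidia-python_stable.sif \
    pixi run --manifest-path /opt/pixi/pixi.toml python train.py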
JupyterLab on HPC¶
Start with Port Forwarding¶
On compute node:
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-jupyter_stable.sif \
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser
From your local machine, open an SSH tunnel through the login node (substitute your compute node and login node hostnames):
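# <compute-node> and <login-node> are placeholders for your cluster's hostnames
ssh -L 8888:<compute-node>:8888 $USER@<login-node>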
Access at http://localhost:8888
Slurm Script for JupyterLab¶
#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=8:00:00
#SBATCH --output=jupyter_%j.out
# Get compute node hostname
HOSTNAME=$(hostname)
PORT=8888
echo "JupyterLab starting on $HOSTNAME:$PORT"
echo "SSH tunnel command: ssh -L $PORT:$HOSTNAME:$PORT $USER@login-node"
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-jupyter_stable.sif \
jupyter lab --ip=0.0.0.0 --port=$PORT --no-browser
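Once the job starts, JupyterLab prints its access URL (including the token, if token authentication is enabled) to the job's output file; jupyter_<jobid>.out below is the file from the #SBATCH --output line above:
# Find the access URL/token in the job output (<jobid> is the Slurm job ID)
grep -m1 token jupyter_<jobid>.out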
Troubleshooting¶
Image Pull Fails¶
# Check disk space
df -h
# Set cache directory
export APPTAINER_CACHEDIR=/tmp/apptainer
# Pull with verbose output
apptainer pull --debug docker://ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
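Pulls also need temporary space to build the .sif file, which defaults to /tmp and can be too small on shared nodes; pointing APPTAINER_TMPDIR at scratch space helps (the path below is just an example):
# Use scratch space for the temporary build during pull
export APPTAINER_TMPDIR=/scratch/$USER/apptainer-tmp
mkdir -p $APPTAINER_TMPDIR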
GPU Not Found¶
# Check NVIDIA driver on host
nvidia-smi
# Verify --nv flag is used
apptainer exec --nv image.sif nvidia-smi
Permission Denied¶
# Check if fakeroot is needed
apptainer exec --fakeroot image.sif ...
# Check bind mount permissions
ls -la /shared/containers/
Module Conflicts¶
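Host environment modules (for example a host CUDA or Python module) can leak into the container through environment variables; assuming your site uses environment modules, a common mitigation is to purge them and strip the host environment with --cleanenv:
# Start from a clean module set and do not pass host environment variables through
module purge
module load apptainer
apptainer exec --cleanenv --nv image.sif nvidia-smi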
See Also¶
- Docker/Podman Guide - Local development
- Kubernetes Guide - Cloud deployment
- Apptainer Documentation