HPC & Apptainer Deployment¶
Run Bazzite Pods on HPC clusters using Apptainer (formerly Singularity).
Prerequisites¶
- Apptainer installed on the cluster (check: apptainer --version)
- Access to a shared filesystem for image storage
- GPU nodes with NVIDIA drivers (for GPU pods)
Apptainer Installation¶
Most HPC systems have Apptainer pre-installed. If not:
# Ubuntu/Debian
sudo apt-get install apptainer
# RHEL/Fedora
sudo dnf install apptainer
# From source: https://apptainer.org/docs/admin/main/installation.html
Quick Start¶
# Pull image (creates .sif file)
apptainer pull docker://ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
# Interactive shell with GPU
apptainer exec --nv bazzite-ai-pod-nvidia-python_stable.sif bash
# Run a command
apptainer exec --nv bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml python train.py
Image Management¶
Pull Images to Shared Storage¶
Store images in a shared location accessible from all nodes:
# Set image cache location
export APPTAINER_CACHEDIR=/shared/containers/cache
mkdir -p $APPTAINER_CACHEDIR
# Pull to shared storage
cd /shared/containers
apptainer pull docker://ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
apptainer pull docker://ghcr.io/atrawog/bazzite-ai-pod-jupyter:stable
apptainer pull docker://ghcr.io/atrawog/bazzite-ai-pod-devops:stable
Image Naming¶
Apptainer creates .sif files from Docker images:
docker://ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
→ bazzite-ai-pod-nvidia-python_stable.sif
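You can also pass an explicit output filename to apptainer pull, which keeps paths in job scripts predictable (python-pod.sif below is just an example name):
# Pull to an explicitly named .sif file (example filename)
apptainer pull python-pod.sif \
    docker://ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable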
Slurm Examples¶
Interactive Session¶
# Request GPU node with interactive shell
srun --partition=gpu --gres=gpu:1 --time=2:00:00 --pty \
apptainer exec --nv /shared/containers/bazzite-ai-pod-nvidia-python_stable.sif bash
Batch Job Script¶
#!/bin/bash
#SBATCH --job-name=pytorch-training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --output=train_%j.out
#SBATCH --error=train_%j.err
# Load modules if needed
module load apptainer
# Set working directory
cd $SLURM_SUBMIT_DIR
# Run training
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml python /workspace/train.py
Submit with: sbatch train.slurm
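After submission, the usual Slurm tools show job status (replace <jobid> with the ID printed by sbatch):
# Jobs queued or running for your user
squeue -u $USER
# Accounting details after the job finishes
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS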
Array Job (Hyperparameter Sweep)¶
#!/bin/bash
#SBATCH --job-name=hyperparam
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=4:00:00
#SBATCH --array=0-9
#SBATCH --output=sweep_%A_%a.out
# Learning rates to sweep
LR_VALUES=(0.001 0.0005 0.0001 0.00005 0.00001 0.005 0.01 0.02 0.05 0.1)
LR=${LR_VALUES[$SLURM_ARRAY_TASK_ID]}
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml \
python /workspace/train.py --lr $LR --run-id $SLURM_ARRAY_TASK_ID
Multi-GPU Job¶
#!/bin/bash
#SBATCH --job-name=multi-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --time=48:00:00
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml \
torchrun --nproc_per_node=4 /workspace/train_distributed.py
PBS/SGE Examples¶
PBS Script¶
#!/bin/bash
#PBS -N pytorch-training
#PBS -q gpu
#PBS -l select=1:ncpus=8:mem=32gb:ngpus=1
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml python /workspace/train.py
SGE Script¶
#!/bin/bash
#$ -N pytorch-training
#$ -q gpu.q
#$ -l gpu=1
#$ -l h_rt=24:00:00
#$ -cwd
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-nvidia-python_stable.sif \
pixi run --manifest-path /opt/pixi/pixi.toml python /workspace/train.py
Bind Mounts¶
Common Patterns¶
# Working directory
--bind $(pwd):/workspace
# Data directory
--bind /shared/datasets:/data:ro
# Output directory
--bind /scratch/$USER:/output
# Multiple binds
--bind $(pwd):/workspace,/shared/datasets:/data:ro,/scratch/$USER:/output
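If the same binds are used in every job, they can be set once through the APPTAINER_BIND environment variable instead of repeating --bind on each command; a minimal sketch:
# Equivalent to passing --bind on every apptainer command in this shell or job
export APPTAINER_BIND="$(pwd):/workspace,/shared/datasets:/data:ro"
apptainer exec --nv image.sif ls /data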
Home Directory¶
# Use host home (default behavior)
apptainer exec ... bash
# Isolated home
apptainer exec --no-home --bind $(pwd):/workspace ... bash
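For stricter isolation, Apptainer's --containall flag also contains PID, IPC, and the environment, so only explicitly bound host paths are visible inside the container; a sketch:
# Fully contained run: no host home, isolated PID/IPC namespaces and environment
apptainer exec --containall --nv --bind $(pwd):/workspace image.sif bash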
GPU Access¶
NVIDIA¶
# Enable NVIDIA GPU access
apptainer exec --nv image.sif ...
# Verify inside container
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"
AMD ROCm¶
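Apptainer exposes AMD GPUs through the --rocm flag (the host must have ROCm drivers installed); usage mirrors --nv:
# Enable AMD GPU access
apptainer exec --rocm image.sif ...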
Without GPU¶
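On CPU-only nodes, simply omit the GPU flag; mirroring the Quick Start command:
# CPU-only: no --nv / --rocm
apptainer exec bazzite-ai-pod-nvidia-python_stable.sif \
    pixi run --manifest-path /opt/pixi/pixi.toml python train.py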
JupyterLab on HPC¶
Start with Port Forwarding¶
On compute node:
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-jupyter_stable.sif \
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser
From your local machine, open an SSH tunnel through the login node (substitute your compute node and login node hostnames):
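# <compute-node> and <login-node> are placeholders for your cluster's hostnames
ssh -L 8888:<compute-node>:8888 $USER@<login-node>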
Access at http://localhost:8888
Slurm Script for JupyterLab¶
#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=8:00:00
#SBATCH --output=jupyter_%j.out
# Get compute node hostname
HOSTNAME=$(hostname)
PORT=8888
echo "JupyterLab starting on $HOSTNAME:$PORT"
echo "SSH tunnel command: ssh -L $PORT:$HOSTNAME:$PORT $USER@login-node"
apptainer exec --nv \
--bind $(pwd):/workspace \
/shared/containers/bazzite-ai-pod-jupyter_stable.sif \
jupyter lab --ip=0.0.0.0 --port=$PORT --no-browser
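Once the job starts, JupyterLab prints its access URL (including the token, if token authentication is enabled) to the job's output file; jupyter_<jobid>.out below is the file from the #SBATCH --output line above:
# Find the access URL/token in the job output (<jobid> is the Slurm job ID)
grep -m1 token jupyter_<jobid>.out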
Troubleshooting¶
Image Pull Fails¶
# Check disk space
df -h
# Set cache directory
export APPTAINER_CACHEDIR=/tmp/apptainer
# Pull with verbose output
apptainer pull --debug docker://ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
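Pulls also need temporary space to build the .sif file, which defaults to /tmp and can be too small on shared nodes; pointing APPTAINER_TMPDIR at scratch space helps (the path below is just an example):
# Use scratch space for the temporary build during pull
export APPTAINER_TMPDIR=/scratch/$USER/apptainer-tmp
mkdir -p $APPTAINER_TMPDIR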
GPU Not Found¶
# Check NVIDIA driver on host
nvidia-smi
# Verify --nv flag is used
apptainer exec --nv image.sif nvidia-smi
Permission Denied¶
# Check if fakeroot is needed
apptainer exec --fakeroot image.sif ...
# Check bind mount permissions
ls -la /shared/containers/
Module Conflicts¶
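Host environment modules (for example a host CUDA or Python module) can leak into the container through environment variables; assuming your site uses environment modules, a common mitigation is to purge them and strip the host environment with --cleanenv:
# Start from a clean module set and do not pass host environment variables through
module purge
module load apptainer
apptainer exec --cleanenv --nv image.sif nvidia-smi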
See Also¶
- Docker/Podman Guide - Local development
- Kubernetes Guide - Cloud deployment
- Apptainer Documentation