Kubernetes Deployment¶
Deploy Bazzite Pods to any Kubernetes cluster - EKS, GKE, AKS, or on-premises.
Prerequisites¶
- Kubernetes cluster (1.24+)
- kubectl configured
- For GPU: NVIDIA Device Plugin installed
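A quick sanity check before deploying anything (assuming kubectl already points at the target cluster):

kubectl version
kubectl get nodes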
GPU Plugin Installation¶
# NVIDIA Device Plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify
kubectl get pods -n kube-system | grep nvidia
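Once the plugin is running, GPU nodes advertise an nvidia.com/gpu resource in their capacity and allocatable sections:

kubectl describe nodes | grep nvidia.com/gpu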
Quick Examples¶
Job (Batch Training)¶
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  template:
    spec:
      containers:
      - name: pytorch
        image: ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
        command: ["pixi", "run", "--manifest-path", "/opt/pixi/pixi.toml", "python", "/workspace/train.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
        volumeMounts:
        - name: workspace
          mountPath: /workspace
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: ml-workspace
      restartPolicy: OnFailure
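Assuming the manifest is saved as pytorch-job.yaml (filename is illustrative), submit and follow it with:

kubectl apply -f pytorch-job.yaml
kubectl logs -f job/pytorch-training
kubectl get job pytorch-training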
Deployment (JupyterLab Service)¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyterlab
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyterlab
  template:
    metadata:
      labels:
        app: jupyterlab
    spec:
      containers:
      - name: jupyter
        image: ghcr.io/atrawog/bazzite-ai-pod-jupyter:stable
        ports:
        - containerPort: 8888
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
        volumeMounts:
        - name: workspace
          mountPath: /workspace
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: jupyter-workspace
---
apiVersion: v1
kind: Service
metadata:
  name: jupyterlab
spec:
  type: LoadBalancer
  selector:
    app: jupyterlab
  ports:
  - port: 8888
    targetPort: 8888
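After applying both manifests, JupyterLab is reachable via the LoadBalancer address, or via a local port-forward if no external load balancer is available:

kubectl get service jupyterlab                          # note the EXTERNAL-IP, then open http://<EXTERNAL-IP>:8888
kubectl port-forward deployment/jupyterlab 8888:8888    # alternative: http://localhost:8888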
CronJob (Scheduled Tasks)¶
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: devops
            image: ghcr.io/atrawog/bazzite-ai-pod-devops:stable
            command: ["/bin/bash", "-c", "aws s3 sync /workspace s3://my-bucket/backup/"]
            volumeMounts:
            - name: workspace
              mountPath: /workspace
            - name: aws-credentials
              mountPath: /home/jovian/.aws
              readOnly: true
          volumes:
          - name: workspace
            persistentVolumeClaim:
              claimName: workspace
          - name: aws-credentials
            secret:
              secretName: aws-credentials
          restartPolicy: OnFailure
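The schedule runs the backup daily at 02:00. To test it without waiting, create a one-off Job from the CronJob (test-backup is an arbitrary name):

kubectl create job --from=cronjob/daily-backup test-backup
kubectl logs -f job/test-backup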
Pod (Interactive Shell)¶
apiVersion: v1
kind: Pod
metadata:
  name: devops-shell
spec:
  containers:
  - name: devops
    image: ghcr.io/atrawog/bazzite-ai-pod-devops:stable
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: kubeconfig
      mountPath: /home/jovian/.kube
      readOnly: true
  volumes:
  - name: kubeconfig
    secret:
      secretName: kubeconfig
Access with: kubectl exec -it devops-shell -- bash
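Files can be copied in and out of the running pod with kubectl cp (paths are illustrative):

kubectl cp ./script.sh devops-shell:/tmp/script.sh
kubectl cp devops-shell:/tmp/results.txt ./results.txt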
Storage Configuration¶
PersistentVolumeClaim¶
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-workspace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: standard  # Or gp3, premium-rwo, etc.
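The right storageClassName is cluster-specific; list the available classes and the default with:

kubectl get storageclass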
Credentials Secret¶
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
type: Opaque
stringData:
  credentials: |
    [default]
    aws_access_key_id = YOUR_ACCESS_KEY
    aws_secret_access_key = YOUR_SECRET_KEY
  config: |
    [default]
    region = us-east-1
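Instead of pasting keys into a manifest, the same Secret can be created from existing AWS config files (assuming the standard ~/.aws layout):

kubectl create secret generic aws-credentials \
  --from-file=credentials=$HOME/.aws/credentials \
  --from-file=config=$HOME/.aws/config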
GPU Scheduling¶
Node Selector¶
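A nodeSelector pins workloads to labeled GPU nodes. The label below is an example; use whatever label your GPU nodes actually carry (or apply one with kubectl label nodes <gpu-node> gpu=true):

spec:
  nodeSelector:
    gpu: "true"  # example label; must match your GPU node labels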
Tolerations¶
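If GPU nodes are tainted (the AKS example below taints them with sku=gpu:NoSchedule), pods need a matching toleration in their spec:

spec:
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"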
Resource Requests¶
resources:
  requests:
    nvidia.com/gpu: 1
    memory: "8Gi"
    cpu: "2"
  limits:
    nvidia.com/gpu: 1
    memory: "16Gi"
    cpu: "4"
Cloud-Specific Notes¶
AWS EKS¶
# Install NVIDIA plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml
# For GPU nodes, use managed node group with GPU AMI
eksctl create nodegroup --cluster my-cluster --name gpu-nodes \
  --node-type g4dn.xlarge --nodes 2 --nodes-min 0 --nodes-max 4
Google GKE¶
# Enable GPU node pool
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 2
# Install device plugin (automatically managed in GKE 1.25+)
Azure AKS¶
# Add GPU node pool
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunodepool \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3 \
  --node-taints sku=gpu:NoSchedule
Troubleshooting¶
GPU Not Allocated¶
# Check device plugin
kubectl get pods -n kube-system | grep nvidia
# Check node capacity
kubectl describe node <gpu-node> | grep nvidia
# Check pod events
kubectl describe pod <pod-name>
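If the node reports GPUs but pods still stay Pending, the device plugin logs usually explain why (the label selector below matches the upstream static manifest; adjust it if the plugin was installed another way):

kubectl logs -n kube-system -l name=nvidia-device-plugin-ds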
Image Pull Issues¶
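Pull failures appear as ErrImagePull or ImagePullBackOff in the pod events (kubectl describe pod). If the registry requires authentication, for example a private mirror of these images, create a pull secret (values below are placeholders) and reference it from the pod spec:

kubectl create secret docker-registry ghcr-pull \
  --docker-server=ghcr.io \
  --docker-username=<username> \
  --docker-password=<token>

Then add to the pod spec:

spec:
  imagePullSecrets:
  - name: ghcr-pull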
Container Crashes¶
# Check logs
kubectl logs <pod-name>
# Check events
kubectl describe pod <pod-name>
# Debug shell
kubectl run debug --rm -it --image=ghcr.io/atrawog/bazzite-ai-pod-base:stable -- bash
See Also¶
- Docker/Podman Guide - Local development
- HPC Guide - Research clusters
- NVIDIA Device Plugin