Kubernetes Deployment¶
Deploy Bazzite Pods to any Kubernetes cluster - EKS, GKE, AKS, or on-premises.
Prerequisites¶
- Kubernetes cluster (1.24+)
- kubectl configured
- For GPU: NVIDIA Device Plugin installed
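A quick sanity check before deploying anything (assuming kubectl already points at the target cluster):

kubectl version
kubectl get nodes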
GPU Plugin Installation¶
# NVIDIA Device Plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify
kubectl get pods -n kube-system | grep nvidia
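Once the plugin is running, GPU nodes advertise an nvidia.com/gpu resource in their capacity and allocatable sections:

kubectl describe nodes | grep nvidia.com/gpu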
Quick Examples¶
Job (Batch Training)¶
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  template:
    spec:
      containers:
      - name: pytorch
        image: ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
        command: ["pixi", "run", "--manifest-path", "/opt/pixi/pixi.toml", "python", "/workspace/train.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
        volumeMounts:
        - name: workspace
          mountPath: /workspace
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: ml-workspace
      restartPolicy: OnFailure
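Assuming the manifest is saved as pytorch-job.yaml (filename is illustrative), submit and follow it with:

kubectl apply -f pytorch-job.yaml
kubectl logs -f job/pytorch-training
kubectl get job pytorch-training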
Deployment (JupyterLab Service)¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyterlab
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyterlab
  template:
    metadata:
      labels:
        app: jupyterlab
    spec:
      containers:
      - name: jupyter
        image: ghcr.io/atrawog/bazzite-ai-pod-jupyter:stable
        ports:
        - containerPort: 8888
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
        volumeMounts:
        - name: workspace
          mountPath: /workspace
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: jupyter-workspace
---
apiVersion: v1
kind: Service
metadata:
  name: jupyterlab
spec:
  type: LoadBalancer
  selector:
    app: jupyterlab
  ports:
  - port: 8888
    targetPort: 8888
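After applying both manifests, JupyterLab is reachable via the LoadBalancer address, or via a local port-forward if no external load balancer is available:

kubectl get service jupyterlab                          # note the EXTERNAL-IP, then open http://<EXTERNAL-IP>:8888
kubectl port-forward deployment/jupyterlab 8888:8888    # alternative: http://localhost:8888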
CronJob (Scheduled Tasks)¶
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: devops
            image: ghcr.io/atrawog/bazzite-ai-pod-devops:stable
            command: ["/bin/bash", "-c", "aws s3 sync /workspace s3://my-bucket/backup/"]
            volumeMounts:
            - name: workspace
              mountPath: /workspace
            - name: aws-credentials
              mountPath: /home/jovian/.aws
              readOnly: true
          volumes:
          - name: workspace
            persistentVolumeClaim:
              claimName: workspace
          - name: aws-credentials
            secret:
              secretName: aws-credentials
          restartPolicy: OnFailure
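The schedule runs the backup daily at 02:00. To test it without waiting, create a one-off Job from the CronJob (test-backup is an arbitrary name):

kubectl create job --from=cronjob/daily-backup test-backup
kubectl logs -f job/test-backup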
Pod (Interactive Shell)¶
apiVersion: v1
kind: Pod
metadata:
  name: devops-shell
spec:
  containers:
  - name: devops
    image: ghcr.io/atrawog/bazzite-ai-pod-devops:stable
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: kubeconfig
      mountPath: /home/jovian/.kube
      readOnly: true
  volumes:
  - name: kubeconfig
    secret:
      secretName: kubeconfig
Access with: kubectl exec -it devops-shell -- bash
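Files can be copied in and out of the running pod with kubectl cp (paths are illustrative):

kubectl cp ./script.sh devops-shell:/tmp/script.sh
kubectl cp devops-shell:/tmp/results.txt ./results.txt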
Storage Configuration¶
PersistentVolumeClaim¶
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-workspace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: standard  # Or gp3, premium-rwo, etc.
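The right storageClassName is cluster-specific; list the available classes and the default with:

kubectl get storageclass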
Credentials Secret¶
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
type: Opaque
stringData:
  credentials: |
    [default]
    aws_access_key_id = YOUR_ACCESS_KEY
    aws_secret_access_key = YOUR_SECRET_KEY
  config: |
    [default]
    region = us-east-1
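Instead of pasting keys into a manifest, the same Secret can be created from existing AWS config files (assuming the standard ~/.aws layout):

kubectl create secret generic aws-credentials \
  --from-file=credentials=$HOME/.aws/credentials \
  --from-file=config=$HOME/.aws/config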
GPU Scheduling¶
Node Selector¶
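A nodeSelector pins workloads to labeled GPU nodes. The label below is an example; use whatever label your GPU nodes actually carry (or apply one with kubectl label nodes <gpu-node> gpu=true):

spec:
  nodeSelector:
    gpu: "true"  # example label; must match your GPU node labels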
Tolerations¶
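If GPU nodes are tainted (the AKS example below taints them with sku=gpu:NoSchedule), pods need a matching toleration in their spec:

spec:
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"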
Resource Requests¶
resources:
  requests:
    nvidia.com/gpu: 1
    memory: "8Gi"
    cpu: "2"
  limits:
    nvidia.com/gpu: 1
    memory: "16Gi"
    cpu: "4"
Cloud-Specific Notes¶
AWS EKS¶
# Install NVIDIA plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml
# For GPU nodes, use managed node group with GPU AMI
eksctl create nodegroup --cluster my-cluster --name gpu-nodes \
  --node-type g4dn.xlarge --nodes 2 --nodes-min 0 --nodes-max 4
Google GKE¶
# Enable GPU node pool
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 2
# Install device plugin (automatically managed in GKE 1.25+)
Azure AKS¶
# Add GPU node pool
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunodepool \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3 \
  --node-taints sku=gpu:NoSchedule
Troubleshooting¶
GPU Not Allocated¶
# Check device plugin
kubectl get pods -n kube-system | grep nvidia
# Check node capacity
kubectl describe node <gpu-node> | grep nvidia
# Check pod events
kubectl describe pod <pod-name>
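If the node reports GPUs but pods still stay Pending, the device plugin logs usually explain why (the label selector below matches the upstream static manifest; adjust it if the plugin was installed another way):

kubectl logs -n kube-system -l name=nvidia-device-plugin-ds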
Image Pull Issues¶
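Pull failures appear as ErrImagePull or ImagePullBackOff in the pod events (kubectl describe pod). If the registry requires authentication, for example a private mirror of these images, create a pull secret (values below are placeholders) and reference it from the pod spec:

kubectl create secret docker-registry ghcr-pull \
  --docker-server=ghcr.io \
  --docker-username=<username> \
  --docker-password=<token>

Then add to the pod spec:

spec:
  imagePullSecrets:
  - name: ghcr-pull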
Container Crashes¶
# Check logs
kubectl logs <pod-name>
# Check events
kubectl describe pod <pod-name>
# Debug shell
kubectl run debug --rm -it --image=ghcr.io/atrawog/bazzite-ai-pod-base:stable -- bash
See Also¶
- Docker/Podman Guide - Local development
- HPC Guide - Research clusters
- NVIDIA Device Plugin