
Kubernetes Deployment

Deploy Bazzite Pods to any Kubernetes cluster - EKS, GKE, AKS, or on-premises.

Prerequisites

GPU Plugin Installation

# NVIDIA Device Plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify
kubectl get pods -n kube-system | grep nvidia
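
Once the plugin pod is running, a quick smoke test confirms that GPUs are actually schedulable. A minimal sketch, assuming nvidia-smi is available on the image's PATH (the NVIDIA container runtime normally injects it):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: smoke-test
    image: ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
    command: ["nvidia-smi"]   # prints the driver/GPU table, then exits
    resources:
      limits:
        nvidia.com/gpu: 1

Check the result with kubectl logs gpu-smoke-test, then delete the pod.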

Quick Examples

Job (Batch Training)

apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  template:
    spec:
      containers:
      - name: pytorch
        image: ghcr.io/atrawog/bazzite-ai-pod-nvidia-python:stable
        command: ["pixi", "run", "--manifest-path", "/opt/pixi/pixi.toml", "python", "/workspace/train.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
        volumeMounts:
        - name: workspace
          mountPath: /workspace
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: ml-workspace
      restartPolicy: OnFailure
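
Submit with kubectl apply -f <file>.yaml and follow training output with kubectl logs -f job/pytorch-training.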

Deployment (JupyterLab Service)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyterlab
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyterlab
  template:
    metadata:
      labels:
        app: jupyterlab
    spec:
      containers:
      - name: jupyter
        image: ghcr.io/atrawog/bazzite-ai-pod-jupyter:stable
        ports:
        - containerPort: 8888
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
        volumeMounts:
        - name: workspace
          mountPath: /workspace
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: jupyter-workspace
---
apiVersion: v1
kind: Service
metadata:
  name: jupyterlab
spec:
  type: LoadBalancer
  selector:
    app: jupyterlab
  ports:
  - port: 8888
    targetPort: 8888
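
If the cluster uses an ingress controller instead of a cloud LoadBalancer, set the Service type to ClusterIP and route to it with an Ingress. A minimal sketch, assuming an nginx ingress class and a placeholder hostname:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: jupyterlab
spec:
  ingressClassName: nginx        # assumption: an nginx ingress controller is installed
  rules:
  - host: jupyter.example.com    # placeholder hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: jupyterlab
            port:
              number: 8888

Add TLS termination before exposing this publicly, since the JupyterLab token travels over this connection.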

CronJob (Scheduled Tasks)

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: devops
            image: ghcr.io/atrawog/bazzite-ai-pod-devops:stable
            command: ["/bin/bash", "-c", "aws s3 sync /workspace s3://my-bucket/backup/"]
            volumeMounts:
            - name: workspace
              mountPath: /workspace
            - name: aws-credentials
              mountPath: /home/jovian/.aws
              readOnly: true
          volumes:
          - name: workspace
            persistentVolumeClaim:
              claimName: workspace
          - name: aws-credentials
            secret:
              secretName: aws-credentials
          restartPolicy: OnFailure
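
Trigger a one-off run without waiting for the schedule: kubectl create job --from=cronjob/daily-backup daily-backup-manual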

Pod (Interactive Shell)

apiVersion: v1
kind: Pod
metadata:
  name: devops-shell
spec:
  containers:
  - name: devops
    image: ghcr.io/atrawog/bazzite-ai-pod-devops:stable
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: kubeconfig
      mountPath: /home/jovian/.kube
      readOnly: true
  volumes:
  - name: kubeconfig
    secret:
      secretName: kubeconfig

Access with: kubectl exec -it devops-shell -- bash
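
The kubeconfig secret mounted above is not defined in this guide. A minimal sketch, assuming a single key named config so the file appears as ~/.kube/config inside the pod:

apiVersion: v1
kind: Secret
metadata:
  name: kubeconfig
type: Opaque
stringData:
  config: |
    # paste the contents of the kubeconfig the pod should use

Alternatively, create it from an existing file: kubectl create secret generic kubeconfig --from-file=config=$HOME/.kube/config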

Storage Configuration

PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-workspace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: standard  # Or gp3, premium-rwo, etc.
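
A ReadWriteOnce claim can only be attached to a single node at a time. To share one workspace between several pods (for example, a training Job and a JupyterLab Deployment), use a storage class that supports ReadWriteMany, such as EFS on EKS, Filestore on GKE, or Azure Files on AKS. A sketch with a placeholder class name:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-workspace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: rwx-storage  # placeholder; use your cluster's RWX-capable class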

Credentials Secret

apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
type: Opaque
stringData:
  credentials: |
    [default]
    aws_access_key_id = YOUR_ACCESS_KEY
    aws_secret_access_key = YOUR_SECRET_KEY
  config: |
    [default]
    region = us-east-1

GPU Scheduling

Node Selector

spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB

Tolerations

spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Resource Requests

resources:
  requests:
    nvidia.com/gpu: 1
    memory: "8Gi"
    cpu: "2"
  limits:
    nvidia.com/gpu: 1
    memory: "16Gi"
    cpu: "4"

Cloud-Specific Notes

AWS EKS

# Install the NVIDIA device plugin (same manifest as in Prerequisites)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# For GPU nodes, use managed node group with GPU AMI
eksctl create nodegroup --cluster my-cluster --name gpu-nodes \
  --node-type g4dn.xlarge --nodes 2 --nodes-min 0 --nodes-max 4

Google GKE

# Enable GPU node pool
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 2

# The device plugin is managed automatically by GKE (1.25+); no manual installation is needed

Azure AKS

# Add GPU node pool
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunodepool \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3 \
  --node-taints sku=gpu:NoSchedule
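
Because the node pool above is tainted with sku=gpu:NoSchedule, pods need a matching toleration to be scheduled onto it:

spec:
  tolerations:
  - key: sku
    operator: Equal
    value: gpu
    effect: NoSchedule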

Troubleshooting

GPU Not Allocated

# Check device plugin
kubectl get pods -n kube-system | grep nvidia

# Check node capacity
kubectl describe node <gpu-node> | grep nvidia

# Check pod events
kubectl describe pod <pod-name>

Image Pull Issues

# For private registries
spec:
  imagePullSecrets:
  - name: ghcr-credentials
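
The ghcr-credentials secret can be created from a GitHub personal access token with the read:packages scope, for example:

kubectl create secret docker-registry ghcr-credentials \
  --docker-server=ghcr.io \
  --docker-username=YOUR_GITHUB_USERNAME \
  --docker-password=YOUR_GITHUB_TOKEN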

Container Crashes

# Check logs
kubectl logs <pod-name>

# Check events
kubectl describe pod <pod-name>

# Debug shell
kubectl run debug --rm -it --image=ghcr.io/atrawog/bazzite-ai-pod-base:stable -- bash

See Also