
Ollama - Local LLM Inference Server

Overview

The ollama command manages the Ollama LLM inference server using Podman Quadlet containers. It provides a single-instance server for running local LLMs with GPU acceleration.

Key Concept: Unlike Jupyter, Ollama uses a single-instance design because GPU memory is shared across all loaded models. The HTTP API listens on port 11434 by default.

Quick Reference

Action    Command                                              Description
Config    ujust ollama config [--port=...] [--gpu-type=...]   Configure server
Start     ujust ollama start                                   Start server
Stop      ujust ollama stop                                    Stop server
Restart   ujust ollama restart                                 Restart server
Logs      ujust ollama logs [--lines=...]                      View logs
Status    ujust ollama status                                  Show server status
Pull      ujust ollama pull --model=<MODEL>                    Download a model
List      ujust ollama list                                    List installed models
Run       ujust ollama run --model=<MODEL> [--prompt=...]      Run model
Shell     ujust ollama shell [-- CMD...]                       Open container shell
Delete    ujust ollama delete                                  Remove server and images

Parameters

Parameter        Long Flag          Short   Default              Description
Port             --port             -p      11434                API port
GPU Type         --gpu-type         -g      auto                 GPU type: nvidia, amd, intel, none, auto
Image            --image            -i      (default image)      Container image
Tag              --tag              -t      stable               Image tag
Config Dir       --config-dir       -c      ~/.config/ollama/1   Config/data directory
Workspace        --workspace-dir    -w      (empty)              Optional mount to /workspace
Bind             --bind             -b      127.0.0.1            Bind address
Lines            --lines            -l      50                   Log lines to show
Model            --model            -m      qwen3:4b             Model for pull/run actions
Prompt           --prompt           -       say hi               Prompt for run action
Context Length   --context-length   -       8192                 Context window size
Instance         --instance         -n      1                    Instance number

Configuration

# Default: Port 11434, auto-detect GPU
ujust ollama config

# Custom port with NVIDIA GPU (long form)
ujust ollama config --port=11435 --gpu-type=nvidia

# Custom port with NVIDIA GPU (short form)
ujust ollama config -p 11435 -g nvidia

# CPU only
ujust ollama config --gpu-type=none

# With workspace mount
ujust ollama config --gpu-type=nvidia --workspace-dir=/home/user/projects

# Custom context length
ujust ollama config --context-length=16384

# Network-wide access (exposes the API beyond localhost)
ujust ollama config --bind=0.0.0.0

# Combine multiple options
ujust ollama config -p 11435 -g nvidia -b 0.0.0.0 --context-length=16384

Update Existing Configuration

Running config when already configured will update the existing configuration, preserving values not explicitly changed.
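
For example, to change only the context length on an already-configured server while keeping the existing port, GPU type, and bind address (a sketch assuming the server was configured as shown above; a restart is likely needed for a running server to pick up the change):

# Only --context-length changes; other settings are preserved
ujust ollama config --context-length=16384
ujust ollama restart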

Shell Access

# Interactive bash shell
ujust ollama shell

# Run specific command (use -- separator)
ujust ollama shell -- nvidia-smi
ujust ollama shell -- df -h
ujust ollama shell -- ls -la /root/.ollama

Model Management

Pull Models

# Download popular models (long form)
ujust ollama pull --model=llama3.2
ujust ollama pull --model=codellama
ujust ollama pull --model=mistral
ujust ollama pull --model=phi3

# Short form
ujust ollama pull -m llama3.2
ujust ollama pull -m codellama

# Specific sizes (llama3.2 is published in 1b and 3b variants)
ujust ollama pull -m llama3.2:1b
ujust ollama pull -m llama3.2:3b

List Models

ujust ollama list

Output:

NAME              SIZE      MODIFIED
llama3.2:latest   2.0 GB    2 hours ago
codellama:latest  3.8 GB    1 day ago

Run Models

# Interactive chat (long form)
ujust ollama run --model=llama3.2

# Interactive chat (short form)
ujust ollama run -m llama3.2

# Single prompt
ujust ollama run -m llama3.2 --prompt="Explain quantum computing"

# Code generation
ujust ollama run -m codellama --prompt="Write a Python function to sort a list"

API Access

Default Endpoint

http://localhost:11434

API Examples

# Generate completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello, how are you?"
}'

# Chat
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

# List models
curl http://localhost:11434/api/tags
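
The generate and chat endpoints stream JSON lines by default; setting "stream": false returns a single JSON object instead. A sketch that extracts just the generated text, assuming jq is installed on the host:

# Non-streaming completion, printing only the response text
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello, how are you?",
  "stream": false
}' | jq -r '.response'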

Integration with Tools

# Claude Code with Ollama
export OLLAMA_HOST=http://localhost:11434

# LangChain (Python)
from langchain_community.llms import Ollama
llm = Ollama(model="llama3.2", base_url="http://localhost:11434")
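
Ollama also exposes an OpenAI-compatible API under /v1, so tools built against the OpenAI SDK can usually point at the local server by changing only the base URL (and passing any placeholder API key). A minimal curl sketch:

# OpenAI-compatible chat completions endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'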

Volume Mounts

Container Path   Host Path   Purpose
/root/.ollama    ~/.ollama   Model storage

Models are persisted in ~/.ollama and survive container restarts.
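
To see how much disk space downloaded models occupy, either side of the mount can be checked (a quick sketch based on the paths above):

# From the host
du -sh ~/.ollama

# From inside the container
ujust ollama shell -- du -sh /root/.ollama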

Common Workflows

Initial Setup

# 1. Configure Ollama with GPU
ujust ollama config --gpu-type=nvidia

# 2. Start the server
ujust ollama start

# 3. Pull a model
ujust ollama pull -m llama3.2

# 4. Test it
ujust ollama run -m llama3.2 --prompt="Hello!"

Development with Local LLM

# Start Ollama
ujust ollama start

# In your code, use:
# OLLAMA_HOST=http://localhost:11434
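
A development script can wait for the API to become reachable before sending requests; a small sketch assuming curl is available:

# Block until the Ollama API answers, then continue
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1
done
echo "Ollama is up"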

Model Comparison

# Pull multiple models
ujust ollama pull -m llama3.2
ujust ollama pull -m mistral
ujust ollama pull -m phi3

# Compare responses
ujust ollama run -m llama3.2 --prompt="Explain REST APIs"
ujust ollama run -m mistral --prompt="Explain REST APIs"
ujust ollama run -m phi3 --prompt="Explain REST APIs"
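
The same comparison can be scripted with a shell loop, which keeps the prompt identical across models:

# Run one prompt against several installed models
for m in llama3.2 mistral phi3; do
  echo "=== $m ==="
  ujust ollama run -m "$m" --prompt="Explain REST APIs"
done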

GPU Support

Automatic Detection

ujust ollama config  # Auto-detects GPU

Manual Selection

GPU Type   Flag                             Acceleration
NVIDIA     --gpu-type=nvidia or -g nvidia   Full GPU acceleration
AMD        --gpu-type=amd or -g amd         ROCm acceleration
Intel      --gpu-type=intel or -g intel     oneAPI acceleration
None       --gpu-type=none or -g none       CPU only (slower)
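
If you are unsure which value to pass, the host's GPU can be identified before configuring (a sketch assuming pciutils is installed):

# List GPUs visible to the host
lspci -nn | grep -Ei 'vga|3d|display'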

Check GPU Status

ujust ollama shell -- nvidia-smi  # NVIDIA
ujust ollama shell -- rocm-smi    # AMD

Model Size Guide

Model          Parameters   VRAM Needed   Quality
phi3           3.8B         4GB           Fast, basic
llama3.2       3B           4GB           Balanced
mistral        7B           8GB           Good general purpose
codellama      7B           8GB           Code-focused
llama3.1:70b   70B          48GB+         Best quality
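
Before pulling a large model, it helps to confirm how much VRAM is actually free. On NVIDIA hardware:

# Total, used, and free GPU memory
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv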

Troubleshooting

Server Won't Start

Check:

systemctl --user status ollama
ujust ollama logs --lines=50

Common causes:

  • Port 11434 already in use (see the check below)
  • GPU driver issues
  • Image not pulled
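
To confirm whether another process already holds the port, a host-side check (a sketch assuming ss from iproute2):

# Show any process listening on 11434
ss -ltnp | grep 11434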

Model Loading Fails

Symptom: "out of memory" or slow loading

Cause: Model too large for GPU VRAM

Fix:

# Use smaller model
ujust ollama pull -m phi3  # Only 4GB VRAM

# Or pull a more heavily quantized tag of the model
# (check the model's available tags in the Ollama library)

GPU Not Used

Symptom: Inference very slow

Check:

ujust ollama status
ujust ollama shell -- nvidia-smi
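
Inside the container, the ollama CLI can also report whether a loaded model is running on the GPU or the CPU (the PROCESSOR column); this assumes the standard Ollama image, which ships the CLI:

# Show loaded models and whether they run on GPU or CPU
ujust ollama shell -- ollama ps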

Fix:

# Reconfigure with explicit GPU, then start again
ujust ollama delete
ujust ollama config --gpu-type=nvidia
ujust ollama start

API Not Responding

Symptom: curl localhost:11434 fails

Check:

ujust ollama status
ujust ollama logs

Fix:

ujust ollama restart
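
After the restart, a quick health check confirms the API is answering again:

# Returns the server version as JSON when the API is healthy
curl -s http://localhost:11434/api/version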

Cross-References

When to Use This Skill

Use when the user asks about:

  • "install ollama", "setup local LLM", "run LLM locally"
  • "pull model", "download llama", "get mistral"
  • "ollama not working", "model won't load"
  • "ollama GPU", "ollama cuda", "ollama slow"
  • "ollama API", "integrate with ollama"