Ollama OpenAI Compatibility API¶
This notebook demonstrates Ollama's OpenAI-compatible API using the official openai Python library.
Features Covered¶
- List models
- Generate response (completions)
- Chat completion
- Streaming responses
- Generate embeddings
Limitations¶
The OpenAI compatibility layer does not support:
- Show model details (/api/show)
- List running models (/api/ps)
- Copy model (/api/copy)
- Delete model (/api/delete)
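For those operations, call Ollama's native REST API directly. A minimal sketch using requests against Ollama's documented native endpoints (host and model names match the configuration used later in this notebook; adjust as needed):

import requests

OLLAMA_HOST = "http://ollama:11434"

# Show model details (no /v1 equivalent)
details = requests.post(f"{OLLAMA_HOST}/api/show", json={"model": "llama3.2:latest"}, timeout=10)
print(details.json().get("details", {}))

# List running models (no /v1 equivalent)
running = requests.get(f"{OLLAMA_HOST}/api/ps", timeout=10)
print([m.get("name") for m in running.json().get("models", [])])

# /api/copy and /api/delete follow the same pattern with POST and DELETE requests.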
Prerequisites¶
- Ollama pod running: ujust ollama start
- Model pulled: ujust ollama pull llama3.2
1. Setup & Configuration¶
In [17]:
import os
import time
import requests
from openai import OpenAI
# === Configuration ===
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
DEFAULT_MODEL = "llama3.2:latest"
# Initialize OpenAI client pointing to Ollama
client = OpenAI(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama" # Required by library but ignored by Ollama
)
print(f"Ollama host: {OLLAMA_HOST}")
print(f"OpenAI base URL: {OLLAMA_HOST}/v1")
print(f"Default model: {DEFAULT_MODEL}")
Out[17]:
Ollama host: http://ollama:11434
OpenAI base URL: http://ollama:11434/v1
Default model: llama3.2:latest
2. Connection Health Check¶
In [18]:
def check_ollama_health() -> tuple[bool, bool]:
"""Check if Ollama server is running and model is available.
Returns:
tuple: (server_healthy, model_available)
"""
try:
response = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=5)
if response.status_code == 200:
print("✓ Ollama server is running!")
models = response.json()
model_names = [m.get("name", "") for m in models.get("models", [])]
if DEFAULT_MODEL in model_names:
print(f"✓ Model '{DEFAULT_MODEL}' is available")
return True, True
else:
print(f"✗ Model '{DEFAULT_MODEL}' not found!")
print()
if model_names:
print("Available models:")
for name in model_names:
print(f" - {name}")
else:
print("No models installed.")
print()
print("To fix this, run:")
print(f" ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
return True, False
else:
print(f"Ollama returned unexpected status: {response.status_code}")
return False, False
except requests.exceptions.ConnectionError:
print("✗ Cannot connect to Ollama server!")
print("To fix this, run: ujust ollama start")
return False, False
except requests.exceptions.Timeout:
print("✗ Connection to Ollama timed out!")
return False, False
ollama_healthy, model_available = check_ollama_health()
Out[18]:
✓ Ollama server is running!
✓ Model 'llama3.2:latest' is available
3. List Models¶
Endpoint: GET /v1/models
In [19]:
print("=== List Available Models ===")
models = client.models.list()
for model in models.data:
print(f" - {model.id}")
print("=== List Available Models ===") models = client.models.list() for model in models.data: print(f" - {model.id}")
Out[19]:
=== List Available Models ===
 - hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M
 - llama3.2:latest
4. Generate Response (Completions)¶
Endpoint: POST /v1/completions
In [20]:
print("=== Generate Response ===")
if not model_available:
print()
print("⚠ Skipping - model not available")
print(f" Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
else:
prompt = "Why is the sky blue? Answer in one sentence."
print(f"Prompt: {prompt}")
print()
try:
start_time = time.perf_counter()
response = client.completions.create(
model=DEFAULT_MODEL,
prompt=prompt,
max_tokens=100
)
end_time = time.perf_counter()
print(f"Response: {response.choices[0].text}")
print()
print(f"Latency: {end_time - start_time:.2f}s")
print(f"Completion tokens: {response.usage.completion_tokens}")
except Exception as e:
print(f"✗ Error: {e}")
print("=== Generate Response ===") if not model_available: print() print("⚠ Skipping - model not available") print(f" Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}") else: prompt = "Why is the sky blue? Answer in one sentence." print(f"Prompt: {prompt}") print() try: start_time = time.perf_counter() response = client.completions.create( model=DEFAULT_MODEL, prompt=prompt, max_tokens=100 ) end_time = time.perf_counter() print(f"Response: {response.choices[0].text}") print() print(f"Latency: {end_time - start_time:.2f}s") print(f"Completion tokens: {response.usage.completion_tokens}") except Exception as e: print(f"✗ Error: {e}")
Out[20]:
=== Generate Response ===
Prompt: Why is the sky blue? Answer in one sentence.

Response: The sky appears blue because when sunlight enters Earth's atmosphere, it encounters tiny molecules of gases such as nitrogen and oxygen, which scatter the shorter, blue wavelengths of light more than the longer, red wavelengths.

Latency: 0.26s
Completion tokens: 42
5. Chat Completion¶
Endpoint: POST /v1/chat/completions
In [21]:
print("=== Chat Completion ===")
if not model_available:
print()
print("⚠ Skipping - model not available")
print(f" Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
else:
try:
response = client.chat.completions.create(
model=DEFAULT_MODEL,
messages=[
{"role": "system", "content": "You are a helpful assistant. Keep responses brief."},
{"role": "user", "content": "Explain machine learning in one sentence."}
],
temperature=0.7,
max_tokens=100
)
print(f"Assistant: {response.choices[0].message.content}")
print(f"\nTokens used: {response.usage.total_tokens}")
except Exception as e:
print(f"✗ Error: {e}")
print("=== Chat Completion ===") if not model_available: print() print("⚠ Skipping - model not available") print(f" Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}") else: try: response = client.chat.completions.create( model=DEFAULT_MODEL, messages=[ {"role": "system", "content": "You are a helpful assistant. Keep responses brief."}, {"role": "user", "content": "Explain machine learning in one sentence."} ], temperature=0.7, max_tokens=100 ) print(f"Assistant: {response.choices[0].message.content}") print(f"\nTokens used: {response.usage.total_tokens}") except Exception as e: print(f"✗ Error: {e}")
Out[21]:
=== Chat Completion ===
Assistant: Machine learning is a type of artificial intelligence that enables computers to learn, make decisions, and improve their performance on tasks without being explicitly programmed.

Tokens used: 72
6. Multi-turn Conversation¶
In [22]:
print("=== Multi-turn Conversation ===")
if not model_available:
print()
print("⚠ Skipping - model not available")
print(f" Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
else:
try:
messages = [
{"role": "system", "content": "You are a helpful math tutor."}
]
# Turn 1
messages.append({"role": "user", "content": "What is 2 + 2?"})
response = client.chat.completions.create(
model=DEFAULT_MODEL,
messages=messages,
max_tokens=50
)
assistant_msg = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_msg})
print(f"User: What is 2 + 2?")
print(f"Assistant: {assistant_msg}")
# Turn 2
messages.append({"role": "user", "content": "And what is that multiplied by 3?"})
response = client.chat.completions.create(
model=DEFAULT_MODEL,
messages=messages,
max_tokens=50
)
print(f"User: And what is that multiplied by 3?")
print(f"Assistant: {response.choices[0].message.content}")
except Exception as e:
print(f"✗ Error: {e}")
print("=== Multi-turn Conversation ===") if not model_available: print() print("⚠ Skipping - model not available") print(f" Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}") else: try: messages = [ {"role": "system", "content": "You are a helpful math tutor."} ] # Turn 1 messages.append({"role": "user", "content": "What is 2 + 2?"}) response = client.chat.completions.create( model=DEFAULT_MODEL, messages=messages, max_tokens=50 ) assistant_msg = response.choices[0].message.content messages.append({"role": "assistant", "content": assistant_msg}) print(f"User: What is 2 + 2?") print(f"Assistant: {assistant_msg}") # Turn 2 messages.append({"role": "user", "content": "And what is that multiplied by 3?"}) response = client.chat.completions.create( model=DEFAULT_MODEL, messages=messages, max_tokens=50 ) print(f"User: And what is that multiplied by 3?") print(f"Assistant: {response.choices[0].message.content}") except Exception as e: print(f"✗ Error: {e}")
Out[22]:
=== Multi-turn Conversation ===
User: What is 2 + 2?
Assistant: A simple but classic math question! The answer to 2 + 2 is... 4! Would you like help with anything else or want to move on to something more challenging?
User: And what is that multiplied by 3?
Assistant: Since we know the result of 2 + 2 is 4, let's multiply 4 by 3... 4 × 3 = 12! So, the answer is 12! Easy peasy! What's your next math question
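The same append-then-send pattern can be wrapped in a small helper so every turn automatically extends the shared history. A minimal sketch reusing the client and DEFAULT_MODEL from the setup cell (the ask helper and its parameters are illustrative, not part of the cells above):

def ask(history: list[dict], user_msg: str, max_tokens: int = 100) -> str:
    """Append a user turn, call the chat endpoint, and record the reply in history."""
    history.append({"role": "user", "content": user_msg})
    reply = client.chat.completions.create(
        model=DEFAULT_MODEL,
        messages=history,
        max_tokens=max_tokens,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

history = [{"role": "system", "content": "You are a helpful math tutor."}]
print(ask(history, "What is 2 + 2?"))
print(ask(history, "And what is that multiplied by 3?"))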
7. Streaming Response¶
Endpoint: POST /v1/chat/completions with stream: true
In [23]:
print("=== Streaming Response ===")
if not model_available:
print()
print("⚠ Skipping - model not available")
print(f" Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
else:
try:
print()
stream = client.chat.completions.create(
model=DEFAULT_MODEL,
messages=[{"role": "user", "content": "Count from 1 to 5."}],
stream=True
)
collected = []
for chunk in stream:
if chunk.choices[0].delta.content:
collected.append(chunk.choices[0].delta.content)
print(f"Response: {''.join(collected)}")
except Exception as e:
print(f"✗ Error: {e}")
print("=== Streaming Response ===") if not model_available: print() print("⚠ Skipping - model not available") print(f" Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}") else: try: print() stream = client.chat.completions.create( model=DEFAULT_MODEL, messages=[{"role": "user", "content": "Count from 1 to 5."}], stream=True ) collected = [] for chunk in stream: if chunk.choices[0].delta.content: collected.append(chunk.choices[0].delta.content) print(f"Response: {''.join(collected)}") except Exception as e: print(f"✗ Error: {e}")
Out[23]:
=== Streaming Response ===

Response: Here we go: 1, 2, 3, 4, 5!
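The cell above collects the chunks and prints the joined text once the stream ends. To display tokens as they arrive, print each delta immediately instead of collecting it; a minimal variation on the same call:

stream = client.chat.completions.create(
    model=DEFAULT_MODEL,
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        # flush so each token appears as soon as it is received
        print(delta, end="", flush=True)
print()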
8. Generate Embeddings¶
Endpoint: POST /v1/embeddings
In [24]:
print("=== Generate Embeddings ===")
if not model_available:
print()
print("⚠ Skipping - model not available")
print(f" Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
else:
try:
test_text = "Ollama makes running LLMs locally easy and efficient."
response = client.embeddings.create(
model=DEFAULT_MODEL,
input=test_text
)
embedding = response.data[0].embedding
print(f"Input: '{test_text}'")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
print(f"Last 5 values: {embedding[-5:]}")
except Exception as e:
print(f"✗ Error: {e}")
print("=== Generate Embeddings ===") if not model_available: print() print("⚠ Skipping - model not available") print(f" Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}") else: try: test_text = "Ollama makes running LLMs locally easy and efficient." response = client.embeddings.create( model=DEFAULT_MODEL, input=test_text ) embedding = response.data[0].embedding print(f"Input: '{test_text}'") print(f"Embedding dimensions: {len(embedding)}") print(f"First 5 values: {embedding[:5]}") print(f"Last 5 values: {embedding[-5:]}") except Exception as e: print(f"✗ Error: {e}")
Out[24]:
=== Generate Embeddings ===
Input: 'Ollama makes running LLMs locally easy and efficient.'
Embedding dimensions: 3072
First 5 values: [-0.026683127507567406, -0.0028091324493288994, -0.02738499455153942, -0.009667067788541317, -0.017405545338988304]
Last 5 values: [-0.028065813705325127, 0.010568944737315178, -0.028453463688492775, 0.014874468557536602, -0.02971256710588932]
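Embeddings become useful when compared against each other. A minimal sketch computing cosine similarity between two embedded sentences, reusing the client above (the example sentences are illustrative):

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

texts = ["Ollama runs models locally.", "Local LLM inference with Ollama."]
vectors = [d.embedding for d in client.embeddings.create(model=DEFAULT_MODEL, input=texts).data]
print(f"Similarity: {cosine_similarity(vectors[0], vectors[1]):.3f}")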
9. Error Handling¶
In [25]:
print("=== Error Handling ===")
# Test: Non-existent model
print("\n1. Testing non-existent model...")
try:
response = client.chat.completions.create(
model="invalid-model",
messages=[{"role": "user", "content": "Hello"}]
)
print(f" Unexpected success")
except Exception as e:
print(f" Expected error: {type(e).__name__}")
# Test: Empty messages
print("\n2. Testing empty messages...")
try:
response = client.chat.completions.create(
model=DEFAULT_MODEL,
messages=[]
)
print(f" Empty messages allowed")
except Exception as e:
print(f" Error: {type(e).__name__}")
print("\nError handling tests completed!")
print("=== Error Handling ===") # Test: Non-existent model print("\n1. Testing non-existent model...") try: response = client.chat.completions.create( model="invalid-model", messages=[{"role": "user", "content": "Hello"}] ) print(f" Unexpected success") except Exception as e: print(f" Expected error: {type(e).__name__}") # Test: Empty messages print("\n2. Testing empty messages...") try: response = client.chat.completions.create( model=DEFAULT_MODEL, messages=[] ) print(f" Empty messages allowed") except Exception as e: print(f" Error: {type(e).__name__}") print("\nError handling tests completed!")
Out[25]:
=== Error Handling ===

1. Testing non-existent model...
 Expected error: NotFoundError

2. Testing empty messages...
 Error: BadRequestError

Error handling tests completed!
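In application code it is usually better to catch the specific openai exception classes shown above rather than a bare Exception. A minimal sketch reusing the client and DEFAULT_MODEL from the setup cell (the retry count is an arbitrary choice for illustration):

import openai

def safe_chat(messages: list[dict], retries: int = 2):
    for attempt in range(retries + 1):
        try:
            return client.chat.completions.create(model=DEFAULT_MODEL, messages=messages)
        except openai.NotFoundError:
            print("Model not found - pull it first (ujust ollama pull ...)")
            return None
        except openai.BadRequestError as e:
            print(f"Invalid request: {e}")
            return None
        except openai.APIConnectionError:
            # Server may still be starting; wait briefly and retry
            if attempt < retries:
                time.sleep(2)
            else:
                print("Cannot reach Ollama - is it running? (ujust ollama start)")
                return None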
Summary¶
This notebook demonstrated Ollama's OpenAI-compatible API with the official openai Python client: listing models, text and chat completions, multi-turn conversation, streaming, embeddings, and error handling.
API Endpoints Used¶
| Endpoint | Method | Purpose |
|---|---|---|
| /v1/models | GET | List models |
| /v1/completions | POST | Generate text |
| /v1/chat/completions | POST | Chat completion |
| /v1/embeddings | POST | Generate embeddings |
Quick Reference¶
from openai import OpenAI
client = OpenAI(
base_url="http://ollama:11434/v1",
api_key="ollama"
)
# Chat
response = client.chat.completions.create(
model="llama3.2:latest",
messages=[{"role": "user", "content": "Hello!"}]
)
Why Use OpenAI Compatibility?¶
- Migration - Drop-in replacement for OpenAI API
- Tool ecosystem - Works with LangChain, LlamaIndex, and other OpenAI-compatible tooling (see the sketch below)
- Familiar interface - Standard OpenAI patterns
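For example, LangChain's OpenAI chat wrapper can point at the same endpoint. A minimal sketch, assuming the langchain-openai package is installed (parameter names follow that library's standard OpenAI configuration):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://ollama:11434/v1",
    api_key="ollama",        # required by the client, ignored by Ollama
    model="llama3.2:latest",
)
print(llm.invoke("Hello!").content)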