Prompt Engineering Essentials¶
The D1 notebooks cover the essential topics of prompt engineering, beginning with inference in general and an introduction to LangChain. We then move on to prompt templates and output parsing, and finally to creating chains and connecting them in different ways to build more sophisticated constructs that make the most of LLMs.
Bazzite-AI Setup Required
Run D0_00_Bazzite_AI_Setup.ipynb first to configure Ollama and verify GPU access.
API vs. Locally Hosted LLM¶
Using an API-hosted LLM (e.g. OpenAI) is like renting a powerful car — it's ready to go, but you can't tinker with the inner workings of the engine and you pay each time you drive. Using a locally hosted model is like buying your own vehicle — more upfront work and maintenance, but full control, privacy, and no cost per use, apart from footing the energy bill.
| Aspect | API-based (e.g. OpenAI) | Local Model (e.g. Mistral, PyTorch + LangChain) |
|---|---|---|
| Setup time | Minimal – just an API key | Requires downloading and managing the model |
| Hardware requirement | None (runs in the cloud) | Requires a GPU (sometimes large memory) |
| Latency | Network-dependent | Faster inference (once model is loaded) |
| Privacy / Data control | Data sent to external servers | Data stays on your infrastructure |
| Cost | Pay-per-use (based on tokens) | Free at inference (after download), but uses your compute |
| Scalability | Handled by provider | You manage and scale infrastructure |
| Flexibility | Limited to provider's models and settings | Full control: quantization, fine-tuning, prompt handling |
| Offline use | Not possible | Yes, after initial download |
| Customizability | No access to internals | You can modify and extend anything |
Using an API (e.g. OpenAI)
- You use the OpenAI or ChatOpenAI class from LangChain
- LangChain sends your prompt to api.openai.com
- You don't manage the model, only the request and response
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(api_key="...", model="gpt-4")
response = llm.invoke("Summarize this legal clause...")
📝 Managing API Keys Securely
The recommended approach is to use a .env file with python-dotenv:
# .env file (add to .gitignore!)
OPENAI_API_KEY=sk-your-key-here
ANTHROPIC_API_KEY=sk-ant-your-key-here
from dotenv import load_dotenv
import os
load_dotenv() # Load variables from .env file
api_key = os.getenv("OPENAI_API_KEY")
llm = ChatOpenAI(api_key=api_key, model="gpt-4")
Note that LangChain automatically looks up OPENAI_API_KEY from environment variables, so you can also just do:
from dotenv import load_dotenv
load_dotenv()
llm = ChatOpenAI(model="gpt-4") # API key loaded automatically
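The same pattern works for other providers. For instance, here is a minimal sketch with Anthropic, assuming the langchain-anthropic package is installed and ANTHROPIC_API_KEY is set in your .env file (the model name is just an example):
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
load_dotenv()  # ANTHROPIC_API_KEY is picked up from the environment
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")  # example model name
response = llm.invoke("Summarize this legal clause...")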
Using a Local Model (e.g. Mistral, LLaMA)
- You load the model and tokenizer using Hugging Face Transformers
- You wrap the pipeline using HuggingFacePipeline or similar in LangChain
- You manage memory, GPU allocation, quantization, etc.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline
model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # any causal LM from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = ChatHuggingFace(llm=HuggingFacePipeline(pipeline=pipe))
Basic Setup for Inference¶
Apart from the usual suspects from the PyTorch and Hugging Face libraries, we get our first imports of the LangChain library and some of its classes.
Since we want to show you how to work with LLMs that are not part of the closed OpenAI and Anthropic world, we are going to show you how to work with open and downloadable models. As it makes no sense for all of us to download the models and store them in our home directories, we've done that for you before the start of the course. You can find the path to the models below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_huggingface.llms import HuggingFacePipeline
from langchain_huggingface import ChatHuggingFace
If you choose to work with a model such as meta-llama/Llama-3.3-70B-Instruct, you will have to use quantization in order to fit the model into the memory of one GPU. It is advisable to use BitsAndBytes for quantization and write a short config for it, e.g.:
# Define quantization config
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, # Enable 4-bit quantization
bnb_4bit_compute_dtype=torch.float16, # Use float16 for computation
bnb_4bit_use_double_quant=True # Double quantization for efficiency
)
However, beware: a model of that size takes roughly 30 minutes to load... In this course we do not want to wait around that long, so we will use a smaller model called Nous-Hermes-2-Mistral-7B-DPO.
# Download model from HuggingFace (same base model as Ollama GGUF version)
HF_LLM_MODEL = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO"
# Use this if you have an API key for a model hosted in the cloud:
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key (input is hidden): ")
# Alternative models to try:
#HF_LLM_MODEL = "meta-llama/Llama-3.3-70B-Instruct"
#HF_LLM_MODEL = "mistralai/Mistral-7B-Instruct-v0.3"
# 4-bit quantization config for efficient loading
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(HF_LLM_MODEL)
# Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
HF_LLM_MODEL,
device_map="auto",
quantization_config=quantization_config,
)
# Verify model config
print(model.config)
MistralConfig {
"architectures": [
"MistralForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 1,
"dtype": "float16",
"eos_token_id": 32000,
"head_dim": null,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mistral",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"quantization_config": {
"_load_in_4bit": true,
"_load_in_8bit": false,
"bnb_4bit_compute_dtype": "bfloat16",
"bnb_4bit_quant_storage": "uint8",
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_use_double_quant": false,
"llm_int8_enable_fp32_cpu_offload": false,
"llm_int8_has_fp16_weight": false,
"llm_int8_skip_modules": null,
"llm_int8_threshold": 6.0,
"load_in_4bit": true,
"load_in_8bit": false,
"quant_method": "bitsandbytes"
},
"rms_norm_eps": 1e-05,
"rope_theta": 10000.0,
"sliding_window": 4096,
"tie_word_embeddings": false,
"transformers_version": "4.57.2",
"use_cache": false,
"vocab_size": 32002
}
Now, let's try out a prompt or two:
prompt = "What is the capital of France? Can you give me some facts about it?"
# Use the device where the model is loaded (works with both CPU and GPU)
device = next(model.parameters()).device
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=250)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
What is the capital of France? Can you give me some facts about it? The capital of France is Paris. It is located in the northern part of the country, along the Seine River. Paris is known for its rich history, culture, and architecture. Some interesting facts about Paris include: 1. Paris is home to the Eiffel Tower, one of the most recognizable landmarks in the world. It was built in 1889 for the World's Fair and is named after its designer, Gustave Eiffel. 2. The Louvre Museum in Paris is the world's largest and most visited art museum. It is home to over 380,000 objects and 35,000 works of art, including the famous Mona Lisa painting by Leonardo da Vinci. 3. Notre-Dame Cathedral is a famous Gothic cathedral located on the Île de la Cité in the heart of Paris. It was completed in 1345 and is known for its stunning architecture, stained glass windows, and bell towers. 4. Paris is also known for its fashion industry, with many famous designers and luxury brands having their headquarters in the city.
Not bad, but we can do better!
import os
# Ollama configuration (no API key needed!)
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
# === Model Configuration ===
HF_LLM_MODEL = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF"
OLLAMA_LLM_MODEL = f"hf.co/{HF_LLM_MODEL}:Q4_K_M"
print(f"Ollama host: {OLLAMA_HOST}")
print(f"Model: {OLLAMA_LLM_MODEL}")
Ollama host: http://ollama:11434 Model: hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M
from openai import OpenAI
# Point OpenAI client to Ollama (drop-in replacement!)
client = OpenAI(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama" # Required by library but ignored by Ollama
)
# Simple one-shot prompt, no roles
response = client.chat.completions.create(
model=OLLAMA_LLM_MODEL,
messages=[{"role": "user", "content": "What is the capital of France? Can you give me some facts about it?"}],
max_tokens=250
)
print(response.choices[0].message.content)
The capital of France is Paris. Some interesting facts about Paris include: 1. Population: As of 2021, the population of Paris is approximately 2.14 million people, making it one of the largest cities in Europe. 2. Language: The official language of Paris (and all of France) is French. However, many locals in tourist areas will understand and speak English to assist visitors. 3. Geography: Paris is located in northern-central France along the Seine River. It covers an area of about 105 square kilometers (40.5 square miles). 4. History: Founded by the Gauls, Paris has a rich and storied history spanning many centuries. In 52 BC, it was conquered by Julius Caesar and became a Roman city called Lutetia. Throughout the Middle Ages, it served as an important center of learning and culture, with famous universities like the Sorbonne attracting scholars from across Europe. 5. Architecture: Paris is world-renowned for its beautiful architecture, particularly its historic buildings in the romanesque, gothic, and bar
Enter LangChain¶
LangChain is a powerful open-source framework designed to help developers build applications using LLMs. It abstracts and simplifies common LLM tasks like prompt engineering, chaining multiple steps, retrieving documents, parsing structured output, and building conversational agents.
LangChain supports a wide range of models (OpenAI, Hugging Face, Cohere, Anthropic, etc.) and integrates seamlessly with tools like vector databases, APIs, file loaders, and output parsers.
LangChain Building Blocks¶
+-------------------+
| PromptTemplate | ← Create structured prompts
+-------------------+
↓
+-------------------+
| LLM | ← Connect to local or remote LLM
+-------------------+
↓
+-------------------+
| Output Parsers | ← Extract structured results (e.g. JSON)
+-------------------+
↓
+-------------------+
| Chains / Agents | ← Combine steps into flows
+-------------------+
↓
+-------------------+
| Memory / Tools | ← Use search, APIs, databases, etc.
+-------------------+
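To preview how these blocks snap together (chains and output parsers are covered in detail in later notebooks), here is a minimal sketch, assuming the Ollama host and model configured above:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
prompt = ChatPromptTemplate.from_template("Give one fun fact about {topic}.")
llm = ChatOpenAI(base_url=f"{OLLAMA_HOST}/v1", api_key="ollama", model=OLLAMA_LLM_MODEL)
chain = prompt | llm | StrOutputParser()  # PromptTemplate -> LLM -> Output Parser
print(chain.invoke({"topic": "Saturn"}))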
Core LLM/ChatModel Methods in LangChain¶
How to do inference with LangChain:
| Method | Purpose | Input Type | Output Type |
|---|---|---|---|
| `invoke()` | Handles a single input, returns one response | str or Message(s) | str / AIMessage |
| `generate()` | Handles a batch of inputs, returns multiple outputs | list[str] | LLMResult |
| `batch()` | Batched input, returns a flat list of outputs | list[str] | list[str] / Messages |
| `stream()` | Streams the output as tokens are generated | str / Message(s) | Generator (streamed text) |
| `ainvoke()` | Async version of invoke() | str / Message(s) | Awaitable result |
| `agenerate()` | Async version of generate() | list[str] | Awaitable result |
Before we use one of these methods, we need to create a pipeline and apply the LangChain wrapper to it, so that we get an object LangChain can call with .invoke(), .generate(), etc. If we use a remotely hosted LLM, which we access through an API, we do not need the pipeline.
This is how you use Ollama's OpenAI-compatible API with LangChain:
from langchain_openai import ChatOpenAI
# Create an LLM that talks to Ollama (using OpenAI-compatible API)
llm = ChatOpenAI(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama", # Required but ignored by Ollama
model=OLLAMA_LLM_MODEL,
temperature=0.7, # like HF's temperature
max_tokens=150 # analogous to HF's max_new_tokens
)
# Use it just like your HuggingFacePipeline example:
print(llm.invoke("Here is a fun fact about Mars:").content)
Mars has the tallest volcano in our solar system, Olympus Mons, which is three times the height of Mount Everest and over 14 miles high.
For a locally hosted model, use a Hugging Face text-generation pipeline:
# Create a text generation pipeline
text_pipeline = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=150,
device_map="auto",
return_full_text=False,
eos_token_id=tokenizer.eos_token_id,
skip_special_tokens=True,
)
# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline)
Device set to use cuda:0
llm.invoke()¶
print(llm.invoke('Here is a fun fact about Mars:'))
The planet’s surface area is about the same as Earth’s. Of course, Mars is much smaller than Earth, which means it is less dense. But that is a topic for another day. In the meantime, scientists have discovered a strange structure on Mars. It’s a huge face on one of the cliffs. The photo of the Martian face was taken by NASA’s Mars Reconnaissance Orbiter (MRO) on February 11th, 2015. The image was first posted on a forum called ‘The World of Mysteries’. However, it seems the image is from 2014. The face is 1.25 kilometers
llm.batch()¶
results = llm.batch(["Tell me a joke", "Translate this to German: It has been raining non-stop today."])
print(results)
[', please.\n\nWhy don’t scientists trust atoms?\n\nBecause they make up everything.\n\nWhat is the best thing about Jelly Babies?\n\nTheir ability to be happy in a jar.\n\nWhy don’t some couples go to the gym?\n\nBecause some relationships can’t handle the weight.\n\nWhy was the math book sad?\n\nIt had too many problems.\n\nWhy do we never see the sun at night?\n\nBecause it’s in bed by 8 Assistance\n\nWhy did the calculator break up with its girlfriend?\n\nIt found a bug in its system.\n\nWhy did the computer go to the doctor?\n\n', '\n\nÜbersetzung: Es hat heute ununterbrochen geregnet.\n\nThe German verb "regnen" means "to rain." In this sentence, "es" is the subject and "heute" is an adverb of time that means "today." The adjective "ununterbrochen" means "non-stop."\n\nSo, this sentence is translated as "It has been raining non-stop today" in English.']
Let's make that more structured and also format the output nicely:
prompts = [
"Tell me a joke",
"Translate this to German: 'It has been raining non-stop today.'"
]
# Run batch generation
results = llm.batch(prompts)
# Nicely format the output
for i, (prompt, response) in enumerate(zip(prompts, results), 1):
print(f"\nPrompt {i}: {prompt}")
print(f"Response:\n{response}")
Prompt 1: Tell me a joke Response: , I’ll tell you a story. TDM, why did you join the circus? Because I heard the Toot is a big deal there. Well, that was a joke. Now for my story. The Toot is a small town located in the foothills of the Rocky Mountains. It’s a quiet little place, where everyone knows everyone and secrets are hard to keep. But the Toot was about to experience something it had never seen before. In the fall of 1995, a traveling circus came to town. The townspeople were curious, but also a bit nervous. They had never seen anything like it before and were not sure what to Prompt 2: Translate this to German: 'It has been raining non-stop today.' Response: Translation: 'Es hat heute ständig geregnet.' Explanation: In this sentence, "It has been raining non-stop today" is translated to "Es hat heute ständig geregnet." Let's break down the translation: 1. 'It has been raining' translates to 'Es hat geregnet.' In German, 'Es' refers to 'it', 'hat' is the past form of 'haben' (to have), and 'geregnet' is the past participle of'regnen' (to rain). 2. The phrase 'non-stop' is translated to'ständig' in this context, which means '
llm.generate()¶
llm.generate() yields much more output than llm.batch() and is used when you actually want the extra metadata, such as the token count.
results = llm.generate(["Where should my customer go for a luxurious Safari?",
"What are your top three suggestions for backpacking destinations?"])
print(results)
generations=[[Generation(text='\n\nSafari is a word that stems from the Swahili word for journey. The quintessential African Safari is a once in a lifetime experience that can be found in many parts of Africa. However, the most sought-after destinations are found in East Africa, specifically Kenya and Tanzania. Both countries are well-known for their national parks and reserves that offer a diverse array of wildlife. If your customer is seeking a luxurious safari, there are several options that can be tailored to their preferences.\n\nHere are some of the most popular luxury safari options:\n\n1. Singita Grumeti Reserves in Tanzania: This reserve is part of the Serenget')], [Generation(text='\n\n1. The Pacific Crest Trail: This 2,650-mile trail from Mexico to Canada is one of the most popular backpacking routes in the United States and offers stunning scenery, diverse ecosystems, and a wide variety of flora and fauna.\n\n2. The Tour du Mont Blanc: This classic 110-mile circuit around Mont Blanc in the French, Swiss, and Italian Alps is one of the most popular multi-day hikes in Europe. It offers stunning mountain vistas, alpine meadows, and crystal-clear lakes.\n\n3. The Annapurna Circuit: This 100-mile trek in Nep')]] llm_output=None run=[RunInfo(run_id=UUID('019b66a7-71d6-77c0-a542-50678a183a2b')), RunInfo(run_id=UUID('019b66a7-71d6-77c0-a542-507700d1a222'))] type='LLMResult'
We need to prettify the output:
for gen in results.generations:
print(gen[0].text)
Safari is a word that stems from the Swahili word for journey. The quintessential African Safari is a once in a lifetime experience that can be found in many parts of Africa. However, the most sought-after destinations are found in East Africa, specifically Kenya and Tanzania. Both countries are well-known for their national parks and reserves that offer a diverse array of wildlife. If your customer is seeking a luxurious safari, there are several options that can be tailored to their preferences. Here are some of the most popular luxury safari options: 1. Singita Grumeti Reserves in Tanzania: This reserve is part of the Serenget 1. The Pacific Crest Trail: This 2,650-mile trail from Mexico to Canada is one of the most popular backpacking routes in the United States and offers stunning scenery, diverse ecosystems, and a wide variety of flora and fauna. 2. The Tour du Mont Blanc: This classic 110-mile circuit around Mont Blanc in the French, Swiss, and Italian Alps is one of the most popular multi-day hikes in Europe. It offers stunning mountain vistas, alpine meadows, and crystal-clear lakes. 3. The Annapurna Circuit: This 100-mile trek in Nep
llm.stream()¶
for chunk in llm.stream("Tell me a story about a cat."):
print(chunk, end="")
Tell me a story about a dog. Tell me about your pet.
I grew up with a cat named Muffin. She was a black and white
tuxedo cat who was the sweetest and most loving animal I have ever known. She
would climb into bed with me every night and cuddle with me until I fell
asleep. She had a habit of nuzzling her head into my hand as a sign
of affection. I still miss her to this day. I also had a dog
named Spot. He was a lovable and friendly golden retriever who loved to play
fetch and go on walks. He was always happy to see me when I came home and
would wag his tail non-stop
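llm.ainvoke()¶
The async variants listed in the methods table above (ainvoke(), agenerate()) are not demonstrated elsewhere in this notebook. Here is a minimal sketch using the same llm wrapper, written for a script; inside a notebook cell you can simply await llm.ainvoke(...) directly:
import asyncio
async def ask_async(prompt: str) -> str:
    # ainvoke() is the asynchronous counterpart of invoke()
    return await llm.ainvoke(prompt)
print(asyncio.run(ask_async("Here is a fun fact about Venus:")))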
Model Types in LangChain¶
LangChain supports two main types of language models:
| Model Type | Description | Examples |
|---|---|---|
| LLMs | Models that take a plain text string as input and return generated text | GPT-2, Falcon, LLaMA, Mistral (raw) |
| Chat Models | Models that work with structured chat messages (system, user, assistant) | GPT-4, Claude, LLaMA-Instruct, Mistral-Instruct |
Why the distinction?
Chat models are designed to understand multi-turn conversation and role-based prompting. Their input format includes a structured message history, making them ideal for:
- Instruction following
- Contextual reasoning
- Assistant-like behavior
LLMs, on the other hand, expect a single flat prompt string. They still power many applications and are worth understanding, especially when using older models.
Do Chat Models matter more now?
Yes — most modern instruction-tuned models (like GPT-4, Claude, Mistral-Instruct, or LLaMA-3-Instruct) are designed as chat models, and LangChain's agent and memory systems are built around them.
However, LLMs are still important:
- Some models only support the LLM interface
- LLMs are useful in batch processing and structured generation
- Understanding their behavior helps you build better prompts
# Plain LLM (single prompt string)
llm = HuggingFacePipeline(pipeline=text_pipeline)
print("--- LLM-style output ---\n")
print(llm.invoke("Explain LangChain in one sentence."))
# Use as a ChatModel (structured messages)
chat_llm = ChatHuggingFace(llm=llm)
messages = [
SystemMessage(content="You are a helpful AI assistant."),
HumanMessage(content="Explain LangChain in one sentence.")
]
print("\n--- Chat-style output ---\n")
print(chat_llm.invoke(messages).content)
--- LLM-style output ---
LangChain is a Python library that allows developers to easily build AI-powered applications that interact with natural language data. ## What is LangChain? LangChain is a Python library that provides tools and frameworks for building applications that leverage natural language processing and generation. It allows developers to quickly and easily integrate AI capabilities into their applications, enabling them to perform tasks such as natural language querying, document summarization, and text generation. LangChain is designed to work with a wide range of data sources, including structured databases, unstructured data, and semi-structured data. It supports multiple programming languages, including Python, JavaScript, and Java, and can be used in a variety of applications, such
--- Chat-style output ---
LangChain is a Python library that facilitates working with natural language processing, enabling the interaction between AI models and human-generated text in a variety of applications, such as language translation, sentiment analysis, and question-answering systems.
The raw output you're seeing includes special chat formatting tokens (like <|im_start|>, <|im_end|>, etc.) which are used internally by the model (e.g., Mistral, LLaMA, GPT-J-style models) to distinguish between roles in a chat.
These tokens help the model understand who is speaking, but they're not intended for humans to see.
So, to prettify the output we will define a function:
def clean_output(raw: str) -> str:
# If the assistant marker is in the output, split on it and take the last part
if "<|im_start|>assistant" in raw:
return raw.split("<|im_start|>assistant")[-1].replace("<|im_end|>", "").strip()
return raw.strip()
raw_output = chat_llm.invoke(messages).content
cleaned = clean_output(raw_output)
print("Cleaned Response:\n",cleaned)
Cleaned Response: LangChain is an open-source Python library that enables developers to build and deploy large-scale natural language processing and artificial intelligence applications.
An even simpler approach would be to pass the following argument earlier on:
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"clean_up_tokenization_spaces": True})
Confused?
You are not alone. Older versions of LangChain had separate wrappers for LLMs and Chat Models, but in recent versions the HuggingFacePipeline class implements the ChatModel interface under the hood — it can accept structured chat messages (SystemMessage, HumanMessage, etc.) even though it wasn't originally designed to.
So yes: You can now do:
llm = HuggingFacePipeline(pipeline=text_pipeline)
response = llm.invoke([
SystemMessage(content="You are a helpful legal assistant."),
HumanMessage(content="Simplify this clause: ...")
])
Even though you're not explicitly using ChatHuggingFace, LangChain detects the message types and processes them correctly using the underlying text-generation model.
The same would apply if you used a remotely hosted LLM/Chat Model through an API:
from langchain_openai import ChatOpenAI
chat = ChatOpenAI(openai_api_key=api_key)
result = chat.invoke([HumanMessage(content="Can you tell me a fact about Dolphins?")])
from langchain_core.messages import (AIMessage, HumanMessage, SystemMessage)
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"clean_up_tokenization_spaces": True})
chat_llm = ChatHuggingFace(llm=llm)
result = chat_llm.invoke([HumanMessage(content="Can you tell me a fact about dolphins?")])
result
AIMessage(content='Dolphins are highly intelligent and social mammals, known for their playful behavior and complex communication systems. They have a remarkable ability to learn and solve problems, and some species have been observed using tools and even teaching their young.', additional_kwargs={}, response_metadata={}, id='lc_run--019b66a7-ddd2-7db2-a82a-8a0280fcb428-0')
print(clean_output(result.content))
Dolphins are highly intelligent and social mammals, known for their playful behavior and complex communication systems. They have a remarkable ability to learn and solve problems, and some species have been observed using tools and even teaching their young.
result = chat_llm.invoke([SystemMessage(content='You are a grumpy 5-year old child who only wants to get new toys and not answer questions'),
HumanMessage(content='Can you tell me a fact about dolphins?')])
print(clean_output(result.content))
Ugh, fine. Dolphins are intelligent mammals. They can recognize themselves in a mirror, which shows they have a sense of self-awareness.
result = chat_llm.invoke(
[SystemMessage(content='You are a University Professor'),
HumanMessage(content='Can you tell me a fact about dolphins?')]
)
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
print(clean_output(result.content))
Dolphins are highly intelligent marine mammals that are known for their playful and social behavior. They are known to have large brains and are capable of problem-solving, communication, and even exhibiting self-awareness.
result = chat_llm.generate([
[
SystemMessage(content='You are a University Professor.'),
HumanMessage(content='Can you tell me a fact about dolphins?')
],
[
SystemMessage(content='You are a University Professor.'),
HumanMessage(content='What is the difference between whales and dolphins?')
]
])
for i, generation in enumerate(result.generations, 1):
raw = generation[0].text
cleaned = clean_output(raw)
print(f"\nPrompt {i}:\n{cleaned}")
Prompt 1: Dolphins are highly intelligent marine mammals that belong to the family Delphinidae. They have a complex social structure, often living in groups called pods and exhibiting various forms of communication, including vocalizations and body posturing. They are also known for their playful behavior and their ability to form strong bonds with one another. Prompt 2: Whales and dolphins are both mammals that belong to the group called cetaceans. They share many similarities, but there are also some key differences between them. 1. Taxonomy: Whales are classified under the suborder Balaenoptera, while dolphins belong to the suborder Odontoceti. 2. Size: Whales are generally larger than dolphins. The blue whale is the largest animal on earth, growing up to 100 feet long, while dolphins typically range from 4 to 12 feet in length. 3. Physical Characteristics: Whales have a streamlined body, a horizontal tail, and large flippers,
# Create a text generation pipeline
text_pipeline = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=512,
device_map="auto",
return_full_text=False,
eos_token_id=tokenizer.eos_token_id,
skip_special_tokens=True,
)
# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"clean_up_tokenization_spaces": True})
chat_llm = ChatHuggingFace(llm=llm)
Device set to use cuda:0
eos_token_id = tokenizer.eos_token_id
result = chat_llm.generate([
[
SystemMessage(content='You are a University Professor.'),
HumanMessage(content='Can you tell me a fact about dolphins?')
],
[
SystemMessage(content='You are a University Professor.'),
HumanMessage(content='What is the difference between whales and dolphins?')
]
], eos_token_id=eos_token_id)
for i, generation in enumerate(result.generations, 1):
raw = generation[0].text
cleaned = clean_output(raw)
print(f"\nPrompt {i}:\n{cleaned}")
Prompt 1: Dolphins are highly intelligent animals, known for their problem-solving abilities and complex social behaviors. They have a large brain-to-body size ratio, equivalent to that of humans, and are capable of experiencing emotions such as joy, sorrow, and even grief. Prompt 2: Although whales and dolphins are both marine mammals and belong to the infraorder Cetacea, there are several notable differences between them. Here are a few key differences: 1. Evolutionary classification: Whales are more closely related to even-toed ungulates (hoofed mammals) like cows, whereas dolphins are more closely related to hippos and manatees. 2. Physical characteristics: Whales are generally larger than dolphins, with the largest whale, the blue whale, reaching lengths of up to 100 feet. Dolphins, on the other hand, are smaller, with the largest dolphin species, the killer whale, reaching up to 32 feet. Whales also have a more streamlined body shape, while dolphins have a more robust build. 3. Adaptations: Whales have adapted to life in deep ocean environments and have a thick layer of blubber for insulation, while dolphins, which inhabit a wider range of marine environments, have a thinner layer of fat and smaller flippers. 4. Feeding habits: Whales feed mostly by filtering small crustaceans and plankton from the water using baleen plates, while dolphins use echolocation to locate prey and have teeth designed for grabbing and holding onto fish. 5. Behavior: Whales are generally more solitary creatures, while dolphins are known for their social behavior, often forming pods or groups.
This code connects Hugging Face Transformers to LangChain’s prompt management:
- Load model into Hugging Face pipeline.
- Wrap it in LangChain (HuggingFacePipeline).
- Build structured prompts (system + user).
- Format prompt with user input.
- Send it to the model and get a response.
Feel free to experiment with different system and human prompts!
# Create a text generation pipeline
text_pipeline = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=512,
device_map="auto",
return_full_text=False,
eos_token_id=tokenizer.eos_token_id,
skip_special_tokens=True,
)
# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline)
# Define the system and user messages
system_message_1 = SystemMessagePromptTemplate.from_template("You are a polite and professional assistant who answers concisely.")
system_message_2 = SystemMessagePromptTemplate.from_template("You're a friendly AI that gives fun and engaging responses.")
system_message_3 = SystemMessagePromptTemplate.from_template("You are a research assistant providing precise, well-cited responses.")
user_message = HumanMessagePromptTemplate.from_template("{question}")
# Create a prompt template
chat_prompt = ChatPromptTemplate.from_messages([system_message_3, user_message])
# Format the prompt
formatted_prompt = chat_prompt.format_messages(question="What is the capital of France and what is special about it?")
# Run inference
response = llm.invoke(formatted_prompt)
print(response)
Device set to use cuda:0
The capital of France is Paris. Paris is a global center for art, fashion, gastronomy, and culture. It is also known for its iconic landmarks, such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a rich history, including roles in the French Revolution and the two World Wars. Paris is also home to numerous world-renowned universities, including the Sorbonne and the École Normale Supérieure.
Extra Parameters and Args¶
Here we add some extra parameters and arguments to get the model to respond in a certain way.
Some of the most important parameters are:
| Parameter | Purpose | Range / Default | Analogy / Effect |
|---|---|---|---|
| `do_sample` | Enables random sampling instead of greedy or beam-based decoding | True / False | 🎲 Adds randomness to output |
| `temperature` | Controls randomness of token selection | > 0, typically 0.7–1.0 | 🌡️ Higher = more creative / chaotic |
| `top_p` | Nucleus sampling: sample from top % of likely tokens | 0.0–1.0, default 1.0 | 🧠 Focuses on most probable words |
| `num_beams` | Beam search: explore multiple continuations and pick the best | 1+, default 1 | 🔍 Smart guessing with multiple options |
| `repetition_penalty` | Penalizes repeated tokens to reduce redundancy | ≥ 1.0, e.g. 1.2 | ♻️ Discourages repetition |
| `max_new_tokens` | Limits the number of tokens the model can generate per prompt | Integer, e.g. 300 | ✂️ Controls response length |
| `eos_token_id` | Token ID that forces the model to stop when encountered | Integer | 🛑 Defines end of output (if supported) |
Detailed Explanation of Generation Parameters¶
do_sample=True¶
- If `do_sample=False`: the model always picks the most likely next token (deterministic, greedy decoding).
- If `do_sample=True`: the model randomly samples from a probability distribution over tokens (non-deterministic).
- Required if you want `temperature` or `top_p` to have any effect.
✅ Enables creativity and variation
❌ Disables reproducibility (unless random seed is fixed)
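For reproducible sampling you can fix the random seed. A minimal sketch, assuming the text_pipeline defined earlier in this notebook:
from transformers import set_seed
set_seed(42)  # fixes the Python, NumPy and PyTorch RNGs
print(text_pipeline("Here is a fun fact about Mars:", do_sample=True, temperature=0.9)[0]["generated_text"])
set_seed(42)  # same seed again -> the sampled continuation should match
print(text_pipeline("Here is a fun fact about Mars:", do_sample=True, temperature=0.9)[0]["generated_text"])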
temperature=1.0¶
- Controls the randomness or "creativity" of the output.
- Lower values → more predictable (safe), higher values → more diverse (risky).
- Affects how "flat" or "peaky" the probability distribution is during sampling.
Typical values:
- `0.0` → deterministic (most likely token only)
- `0.7–1.0` → balanced
- `>1.5` → chaotic, often incoherent
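To get a feel for the effect, try the same prompt at a low and a high temperature. A minimal sketch, again assuming the text_pipeline from earlier:
for temp in (0.2, 1.2):
    out = text_pipeline(
        "Invent a name for a coffee shop on Mars:",
        do_sample=True,
        temperature=temp,
        max_new_tokens=20,
    )[0]["generated_text"]
    print(f"temperature={temp}: {out}")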
top_p=0.9 (a.k.a. nucleus sampling)¶
- The model samples only from the top tokens whose cumulative probability ≥ `p`.
- Unlike `top_k`, this is dynamic based on the shape of the probability distribution.
- Often used in combination with `temperature`.
✅ Focuses output on high-probability words
❌ Too low → model may miss useful words
num_beams=4 (beam search)¶
- Explores multiple candidate completions and picks the best one based on likelihood.
- Slower, but often more optimal (when `do_sample=False`).
- Does not work with sampling (`do_sample=True`).
Typical values:
- `1` = greedy decoding
- `3–5` = moderate beam search
- `>10` = can become very slow
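None of the code in this notebook uses beam search, so here is a minimal sketch of deterministic beam decoding (with a repetition penalty, covered next), assuming the model and tokenizer loaded earlier:
inputs = tokenizer("The three largest cities in France are", return_tensors="pt").to(next(model.parameters()).device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=60,
        do_sample=False,         # beam search is incompatible with sampling
        num_beams=4,             # keep 4 candidate continuations in parallel
        repetition_penalty=1.2,  # discourage repeating the same tokens
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))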
repetition_penalty=1.2¶
- Penalizes tokens that have already been generated, making the model less likely to repeat itself.
- Higher values reduce repetition but may hurt fluency.
✅ Helps avoid "looping" or redundant outputs
📝 Use with long-form or factual responses
max_new_tokens=300¶
- Sets the maximum number of tokens the model is allowed to generate in the response.
- Does not include input prompt tokens.
✅ Controls output length
✅ Prevents runaway generation or memory issues
✅ Set it high enough to avoid truncated responses
eos_token_id¶
- Tells the model to stop generation once it emits this token ID.
- Useful for enforcing custom stopping conditions.
Optional — most models use their own <eos> or </s> tokens by default.
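As a sketch of a custom stopping condition, you can pass a list of stop-token IDs to generate(). This assumes the ChatML vocabulary of the Nous-Hermes model loaded earlier, where <|im_end|> is a regular (added) token:
im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")  # ChatML end-of-turn marker
base_eos_id = tokenizer.convert_tokens_to_ids("</s>")      # the base Mistral end-of-sequence token
inputs = tokenizer("What is the capital of Italy?", return_tensors="pt").to(next(model.parameters()).device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=100,
        eos_token_id=[base_eos_id, im_end_id],  # stop at whichever token appears first
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))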
Feel free to experiment with these parameters!
# Create a text generation pipeline
text_pipeline = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
return_full_text=False,
do_sample=True,
temperature=0.7, # Balanced creativity (was 5.0 - too chaotic)
top_p=0.9,
max_new_tokens=300,
device_map="auto"
)
# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline)
chat_llm = ChatHuggingFace(llm=llm)
Device set to use cuda:0
result = chat_llm.invoke([HumanMessage(content='Can you tell me a fact about Earth?')])
print(clean_output(result.content))
Sure, Earth is the third planet from the Sun and the only known celestial body with life. It is the fifth largest of the eight planets in the Solar System and has one natural satellite, the Moon. Earth's diameter is approximately 12,742 kilometers, and it takes about 23 hours, 56 minutes, and 4 seconds to fully rotate on its axis. Earth's year, or orbit around the Sun, takes approximately 365.24 days to complete. Its average distance from the Sun is about 150 million kilometers, and its surface area is approximately 510 million square kilometers. Earth's atmosphere is composed of 78% nitrogen, 21% oxygen, 0.9% argon, and traces of other gases. The planet's average surface temperature is 14 °C (57.2 °F), and its mass is approximately 5.97 x 10^24 kilograms. Earth's gravity is approximately 9.8 m/s².
Caching¶
Making the exact same request often? You can use a cache to store results. Note: you should only do this if the prompt is exactly the same and it is okay to return the historical reply.
import langchain
from langchain_community.cache import InMemoryCache
langchain.llm_cache = InMemoryCache()
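Note that langchain.llm_cache is the older global setting; recent LangChain releases expose a set_llm_cache helper, and you can swap in a persistent cache instead of the in-memory one. A minimal sketch, assuming these helpers are available in your installed versions:
from langchain.globals import set_llm_cache
from langchain_community.cache import SQLiteCache
# Cache responses on disk so they survive kernel restarts
set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))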
# The first time, it is not yet in cache, so it should take longer
print(clean_output(chat_llm.invoke("Tell me a fact about Mars").content))
Mars is the fourth planet from the Sun and is often referred to as the "Red Planet" due to its reddish appearance caused by iron oxide (rust) on its surface. It is the second-smallest planet in the Solar System and is about half the size of Earth. It has the tallest volcano and the deepest canyon in the Solar System. Mars has the longest day of any planet in the Solar System, with one day being only 24 hours and 37 minutes long.
# You will notice this reply is instant!
print(clean_output(chat_llm.invoke("Tell me a fact about Mars").content))
Mars is the fourth planet from the Sun and is often referred to as the "Red Planet" due to its reddish appearance caused by iron oxide (rust) on its surface. It is the second smallest planet in the solar system, with a diameter of about 4,212 miles (6,779 kilometers). Mars has the tallest volcano in the solar system, called Olympus Mons, which is 14 miles (22 kilometers) high, three times the height of Mount Everest. The planet has the largest canyon in the solar system, Valles Marineris, which is about 4,000 km long, 150-250 km wide, and up to 10 km deep. Mars has a thin atmosphere composed mainly of carbon dioxide, with traces of nitrogen and argon. It has a cold, dry, and thin atmosphere, with surface temperatures ranging from about -195°F (-125°C) at the poles to 70°F (20°C) at the equator. The planet has a weak magnetic field, about 1% as strong as Earth's, which offers little protection from solar wind and cosmic radiation.
# === Unload Ollama Model & Shutdown Kernel ===
# Unloads the model from GPU memory before shutting down
try:
import ollama
print(f"Unloading Ollama model: {OLLAMA_LLM_MODEL}")
ollama.generate(model=OLLAMA_LLM_MODEL, prompt="", keep_alive=0)
print("Model unloaded from GPU memory")
except Exception as e:
print(f"Model unload skipped: {e}")
# Shut down the kernel to fully release resources
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Unloading Ollama model: hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M Model unloaded from GPU memory
{'status': 'ok', 'restart': False}