Fast Inference

Overview

Unsloth provides optimized inference through the vLLM backend, enabling 2x faster generation compared to standard HuggingFace inference. This skill covers fast inference setup, thinking model output parsing, and memory management.

Quick Reference

Component | Purpose
fast_inference=True | Enable vLLM backend for 2x speedup
model.fast_generate() | vLLM-accelerated generation
SamplingParams | Control generation (temperature, top_p, etc.)
FastLanguageModel.for_inference() | Prepare the model (and any attached LoRA adapters) for inference
Token ID 151668 | </think> boundary for Qwen3-Thinking models
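
Taken together, these pieces form the minimal round trip sketched below. The model name and generation values are illustrative; each step is covered in detail in the sections that follow.

from unsloth import FastLanguageModel
from vllm import SamplingParams

# Minimal sketch: load with the vLLM backend, then generate (illustrative values)
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,   # vLLM backend
)
FastLanguageModel.for_inference(model)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
print(model.fast_generate([prompt], sampling_params=params)[0].outputs[0].text)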

Critical Environment Setup

import os
from dotenv import load_dotenv

# Load environment variables (e.g., API tokens) from a .env file into os.environ
load_dotenv()

Critical Import Order

# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

import torch
import vllm
from vllm import SamplingParams

Environment Verification

Before inference, verify your environment is correctly configured:

import unsloth
from unsloth import FastLanguageModel
import torch
import vllm

# Check versions
print(f"unsloth: {unsloth.__version__}")
print(f"vLLM: {vllm.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")

Standard Inference (No vLLM)

Load Model

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Prepare for inference (enables Unsloth's optimized inference mode; works with attached LoRA adapters)
FastLanguageModel.for_inference(model)

Generate Response

messages = [{"role": "user", "content": "What is machine learning?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
)

# Decode only new tokens
input_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
print(response)

Fast Inference (vLLM Backend)

Load Model with Fast Inference

from unsloth import FastLanguageModel
from vllm import SamplingParams

MODEL_NAME = "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    MODEL_NAME,
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,  # Enable vLLM backend
)

Fast Generate

FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "What is 15 + 27? Show your thinking."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

sampling_params = SamplingParams(
    temperature=0.6,      # Recommended for thinking models
    top_p=0.95,
    top_k=20,
    max_tokens=2048,      # Increased for thinking + response
)

# Use fast_generate instead of generate
outputs = model.fast_generate([prompt], sampling_params=sampling_params)

# Extract output
raw_output = outputs[0].outputs[0].text
output_token_ids = outputs[0].outputs[0].token_ids
print(raw_output)

Sampling Parameters

from vllm import SamplingParams

# Conservative (factual responses)
conservative = SamplingParams(
    temperature=0.3,
    top_p=0.9,
    max_tokens=512,
)

# Balanced (general use)
balanced = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    max_tokens=1024,
)

# Creative (diverse outputs)
creative = SamplingParams(
    temperature=0.9,
    top_p=0.95,
    top_k=50,
    max_tokens=2048,
)

# Thinking models (allow long reasoning)
thinking = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    max_tokens=2048,  # Extra space for <think> content
)

Thinking Model Output Parsing

Qwen3-Thinking models use <think>...</think> tags to separate reasoning from final responses. Use token-based parsing for accuracy.

THINK_END_TOKEN_ID = 151668  # </think> token for Qwen3-Thinking models

def parse_thinking_response(token_ids, tokenizer):
    """
    Parse thinking model output using token ID boundary.

    With Thinking models + add_generation_prompt=True:
    - Template adds <think> to prompt
    - Model output starts with thinking content
    - Model outputs </think> (token 151668) when done
    - Final response follows </think>

    Args:
        token_ids: Output token IDs from generation
        tokenizer: Model tokenizer

    Returns:
        tuple: (thinking_content, response_content)
    """
    token_list = list(token_ids)

    if THINK_END_TOKEN_ID in token_list:
        end_idx = token_list.index(THINK_END_TOKEN_ID)
        thinking_tokens = token_list[:end_idx]
        response_tokens = token_list[end_idx + 1:]

        thinking = tokenizer.decode(thinking_tokens, skip_special_tokens=True).strip()
        response = tokenizer.decode(response_tokens, skip_special_tokens=True).strip()
    else:
        # No </think> found - model may still be thinking
        thinking = tokenizer.decode(token_list, skip_special_tokens=True).strip()
        response = "(Model did not complete thinking - increase max_tokens)"

    return thinking, response

Usage Example

# Generate with fast_inference
outputs = model.fast_generate([prompt], sampling_params=sampling_params)
output_token_ids = outputs[0].outputs[0].token_ids

# Parse thinking and response
thinking, response = parse_thinking_response(output_token_ids, tokenizer)

print("=== THINKING ===")
print(thinking)
print("\n=== RESPONSE ===")
print(response)

Verification

# Verify parsing worked correctly
think_tag_found = THINK_END_TOKEN_ID in list(output_token_ids)
has_thinking = bool(thinking) and "did not complete" not in response
has_response = bool(response) and "did not complete" not in response

print(f"</think> token found: {'Yes' if think_tag_found else 'No'}")
print(f"Thinking extracted: {'Yes' if has_thinking else 'No'}")
print(f"Response extracted: {'Yes' if has_response else 'No'}")

if not think_tag_found:
    print("Tip: Increase max_tokens in SamplingParams")

Batch Inference

Multiple Prompts

prompts = [
    "What is recursion?",
    "Explain machine learning in simple terms.",
    "What is the difference between Python and JavaScript?",
]

# Format all prompts
formatted_prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True
    )
    for p in prompts
]

# Batch generate (vLLM handles parallelization)
sampling_params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = model.fast_generate(formatted_prompts, sampling_params=sampling_params)

# Process results
for i, output in enumerate(outputs):
    print(f"\n=== Prompt {i+1} ===")
    print(f"Q: {prompts[i]}")
    print(f"A: {output.outputs[0].text}")

Memory Management

GPU Memory Monitoring

import subprocess

def measure_gpu_memory():
    """Measure current GPU memory usage in MB."""
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )
    return int(result.stdout.strip().split('\n')[0])

# Usage
print(f"GPU memory used: {measure_gpu_memory()} MB")

Memory Cleanup

import gc
import torch

def cleanup_memory():
    """Force garbage collection and clear CUDA cache."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

# Usage after inference
cleanup_memory()
print(f"GPU memory after cleanup: {measure_gpu_memory()} MB")

Jupyter Kernel Shutdown (Critical for vLLM)

vLLM does NOT release GPU memory within a Jupyter session. Kernel restart is required between model tests:

import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

Important: Always run this at the end of notebooks that use fast_inference=True. Without a kernel shutdown, loading a different model in the same session will typically fail with OOM.

Notebook pattern: All finetuning notebooks end with a shutdown cell.

Model Loading Patterns

Pre-Quantized Models

# Fast loading with pre-quantized models
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",  # Pre-quantized
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
)

On-Demand Quantization

# Quantize during loading (slower initial load)
model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3-4B-Thinking-2507",  # Full precision
    max_seq_length=1024,
    load_in_4bit=True,  # Quantize on load
    fast_inference=True,
)

Post-Training Inference

# After SFT/GRPO/DPO training
FastLanguageModel.for_inference(model)  # Prepare the model and its LoRA adapters for inference

# Then generate as normal
outputs = model.generate(**inputs, max_new_tokens=512)

Supported Models

Model | Path | Parameters | Use Case
Qwen3-4B-Thinking | unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit | 4B | Reasoning, chain-of-thought
Ministral-3B-Reasoning | unsloth/Ministral-3-3B-Reasoning-2512 | 3B | Fast reasoning
Qwen3-4B | unsloth/Qwen3-4B-unsloth-bnb-4bit | 4B | General instruction following
Llama-3.2-3B | unsloth/Llama-3.2-3B-Instruct-bnb-4bit | 3B | General instruction following
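
Any path from the table can be swapped into the loading pattern shown earlier; a brief sketch:

from unsloth import FastLanguageModel

# Swap MODEL_NAME for any path in the table above; the other arguments
# mirror the loading examples earlier in this skill
MODEL_NAME = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    MODEL_NAME,
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
)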

Troubleshooting

vLLM Not Available

Symptom: fast_inference=True fails or falls back to standard inference

Fix:

# Check whether this Unsloth build exposes the vLLM-backed fast_inference option
import inspect
from unsloth import FastLanguageModel

sig = inspect.signature(FastLanguageModel.from_pretrained)
if 'fast_inference' in sig.parameters:
    print("fast_inference parameter available")
else:
    print("vLLM not available - using standard inference")

Out of Memory

Symptom: CUDA out of memory during inference

Fix:

  • Use 4-bit quantization (load_in_4bit=True)
  • Reduce max_seq_length
  • Reduce max_tokens in SamplingParams
  • Use cleanup_memory() between batches
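
A memory-conservative configuration that applies these fixes might look like the following sketch. The values are illustrative starting points, and cleanup_memory() is the helper defined in Memory Cleanup above.

from unsloth import FastLanguageModel
from vllm import SamplingParams

# Illustrative low-memory settings: 4-bit weights, shorter context, smaller output budget
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=512,    # reduced from 1024
    load_in_4bit=True,
    fast_inference=True,
)

low_mem_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize LoRA in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)
outputs = model.fast_generate([prompt], sampling_params=low_mem_params)

cleanup_memory()  # helper from the Memory Cleanup section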

Incomplete Thinking

Symptom: </think> token not found in output

Fix:

  • Increase max_tokens in SamplingParams (try 2048+)
  • Check that the model is a Thinking variant
  • Verify add_generation_prompt=True in the chat template
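
Assuming the model, prompt, THINK_END_TOKEN_ID, and parse_thinking_response from the earlier sections, one recovery pattern is to retry with a larger token budget and re-check the boundary:

from vllm import SamplingParams

# Retry with a larger budget so the model can finish its <think> block
retry_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    max_tokens=4096,   # roughly double the earlier 2048 budget
)

outputs = model.fast_generate([prompt], sampling_params=retry_params)
token_ids = outputs[0].outputs[0].token_ids

if THINK_END_TOKEN_ID in list(token_ids):
    thinking, response = parse_thinking_response(token_ids, tokenizer)
    print(response)
else:
    print("Still no </think> - confirm the model is a Thinking variant "
          "and that add_generation_prompt=True was used.")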

GPU Memory Not Released

Symptom: Memory stays high after inference

Fix:

  • Call cleanup_memory()
  • Restart the Jupyter kernel between model tests
  • Use del model followed by cleanup_memory()
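
A typical release sequence, reusing cleanup_memory() and measure_gpu_memory() from the Memory Management section:

# Drop references first so the allocations become unreachable, then clear caches
del model, tokenizer
cleanup_memory()
print(f"GPU memory after release: {measure_gpu_memory()} MB")

# If memory is still held (common with vLLM), restart the kernel as shown in
# the Jupyter Kernel Shutdown section above.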

When to Use This Skill

Use when:

  • Running inference on fine-tuned models
  • Need fast batch inference
  • Working with thinking/reasoning models
  • Optimizing inference latency
  • Parsing chain-of-thought outputs

Cross-References

  • bazzite-ai-jupyter:sft - Supervised fine-tuning (train before inference)
  • bazzite-ai-jupyter:peft - LoRA adapter loading
  • bazzite-ai-jupyter:quantization - Quantization options
  • bazzite-ai-jupyter:transformers - Transformer architecture background
  • bazzite-ai-ollama:api - Ollama deployment for production