DPO Training Test: Qwen3-4B¶
Tests Direct Preference Optimization (DPO) with Unsloth on Qwen3-4B.
Key features tested:
- FastLanguageModel loading with 4-bit quantization
- LoRA adapter configuration
- DPOTrainer with synthetic preference pairs
- Post-training inference verification
DPO Overview: DPO learns from preference pairs (chosen vs. rejected responses) without training an explicit reward model. It optimizes the policy directly with a loss derived from the Bradley-Terry preference model.
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
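To make the objective concrete, here is a minimal sketch of the DPO loss in plain PyTorch. This is illustrative only and not part of the notebook; the real loss is computed internally by TRL's DPOTrainer, and the log-probabilities below are made up.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO loss: -log sigmoid(beta * ((log pi/pi_ref)_chosen - (log pi/pi_ref)_rejected))
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy example: summed response log-probabilities under the policy and a frozen reference
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-20.0]),
                torch.tensor([-13.0]), torch.tensor([-19.0]))
print(f"toy DPO loss: {loss.item():.4f}")  # lower when the policy favors the chosen response more than the reference does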
In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
from trl import DPOConfig, DPOTrainer
from datasets import Dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
In [2]:
# Load Qwen3-4B with 4-bit quantization
MODEL_NAME = "unsloth/Qwen3-4B-unsloth-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]}...")
model, tokenizer = FastLanguageModel.from_pretrained(
MODEL_NAME,
max_seq_length=512,
load_in_4bit=True,
dtype=None, # Auto-detect
)
# Ensure pad token is set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Model loaded: {type(model).__name__}")
Loading Qwen3-4B-unsloth-bnb-4bit...
==((====))==  Unsloth 2025.12.10: Fast Qwen3 patching. Transformers: 5.0.0.1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model loaded: Qwen3ForCausalLM
In [3]:
# Apply LoRA adapters for DPO training
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Unsloth 2025.12.10 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
LoRA applied: 33,030,144 trainable / 2,541,616,640 total (1.30%)
In [4]:
# Create minimal synthetic preference dataset (5 samples)
# DPO requires: prompt, chosen response, rejected response
preference_data = [
{
"prompt": "Explain recursion in programming.",
"chosen": "Recursion is when a function calls itself with a simpler version of the problem, including a base case to stop infinite loops.",
"rejected": "Recursion is just loops."
},
{
"prompt": "What is an API?",
"chosen": "An API (Application Programming Interface) is a set of protocols that allows different software applications to communicate with each other.",
"rejected": "API is code."
},
{
"prompt": "Describe version control.",
"chosen": "Version control is a system that records changes to files over time, allowing you to recall specific versions and collaborate with others.",
"rejected": "Version control saves files."
},
{
"prompt": "What is a database?",
"chosen": "A database is an organized collection of structured data stored electronically, typically managed by a database management system (DBMS).",
"rejected": "A database stores stuff."
},
{
"prompt": "Explain object-oriented programming.",
"chosen": "Object-oriented programming (OOP) is a paradigm that organizes code into objects containing data (attributes) and behavior (methods).",
"rejected": "OOP uses objects."
},
]
# Format for DPO
def format_for_dpo(sample):
prompt = tokenizer.apply_chat_template(
[{"role": "user", "content": sample["prompt"]}],
tokenize=False,
add_generation_prompt=True
)
return {
"prompt": prompt,
"chosen": sample["chosen"],
"rejected": sample["rejected"],
}
dataset = Dataset.from_list(preference_data)
dataset = dataset.map(format_for_dpo)
print(f"Dataset created: {len(dataset)} preference pairs")
print(f"Sample prompt: {dataset[0]['prompt'][:80]}...")
Dataset created: 5 preference pairs
Sample prompt: <|im_start|>user
Explain recursion in programming.<|im_end|>
<|im_start|>assista...
In [ ]:
# DPO Training Configuration (minimal steps for testing)
dpo_config = DPOConfig(
output_dir="outputs_dpo_qwen_test",
per_device_train_batch_size=1,
gradient_accumulation_steps=1,
max_steps=2, # Minimal steps for testing
warmup_steps=0,
learning_rate=5e-6, # Lower LR for DPO
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
beta=0.1, # DPO temperature
max_length=512,
max_prompt_length=256,
seed=42,
)
# Initialize DPO Trainer
trainer = DPOTrainer(
model=model,
args=dpo_config,
train_dataset=dataset,
processing_class=tokenizer,
)
print("Starting DPO training (2 steps)...")
trainer_stats = trainer.train()
print(f"DPO training completed!")
In [6]:
# Post-training inference test
FastLanguageModel.for_inference(model)
test_prompt = "What is machine learning?"
messages = [{"role": "user", "content": test_prompt}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=64,
temperature=0.7,
top_p=0.9,
do_sample=True,
pad_token_id=tokenizer.pad_token_id,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("=" * 60)
print("DPO Training Pipeline Test PASSED")
print("=" * 60)
print(f"Sample generation:\n{response[-200:]}")
============================================================
DPO Training Pipeline Test PASSED
============================================================
Sample generation:
clear and concise way. Let me start by recalling the basics. First, machine learning is a subset of artificial intelligence. So, I should mention that. Then, I should define it as the field of study
Test Complete¶
The DPO Training Pipeline test has completed successfully. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- FastLanguageModel loading with 4-bit quantization (Qwen3-4B)
- LoRA adapter configuration for preference learning
- Synthetic preference dataset creation (chosen vs rejected)
- DPOTrainer training loop (2 steps)
- Post-training inference generation
DPO Concepts Demonstrated¶
- Direct Preference Optimization: learns directly from preference pairs (chosen vs. rejected)
- Implicit Reward Model: the reward is implied by the policy-to-reference log-probability ratio, so no separate reward model is trained
- Beta Parameter: scales that log-ratio and controls how strongly the policy is pushed away from the reference model (see the sketch below)
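As a minimal illustration of the last two points (toy numbers, not values from this run): DPO's implicit reward for a response is the beta-scaled log-ratio between the policy and the frozen reference model, and training pushes the chosen-vs-rejected reward margin to be positive.

import torch

beta = 0.1  # same value as DPOConfig(beta=0.1) above
# Made-up summed log-probabilities for one chosen and one rejected response
policy_logp_chosen, ref_logp_chosen = torch.tensor(-12.0), torch.tensor(-13.0)
policy_logp_rejected, ref_logp_rejected = torch.tensor(-20.0), torch.tensor(-19.0)

# Implicit reward: r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x))
reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
print(f"reward margin: {(reward_chosen - reward_rejected).item():.3f}")  # positive means the policy prefers the chosen response

During training, TRL's DPOTrainer reports these quantities in its logs (e.g. rewards/chosen, rewards/rejected, rewards/margins), which is the main signal to watch beyond the raw loss.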
Ready for Production¶
If this test passed, your environment is ready for:
- DPO training with real preference data (see the dataset-loading sketch after this list)
- Human preference alignment
- Post-SFT preference optimization
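As a starting point for the first item above, here is a hedged sketch of swapping in a real preference dataset. The dataset name and column layout are placeholders (assumptions, not part of this test); adapt the mapping to whatever schema your data actually uses.

from datasets import load_dataset

# Placeholder dataset id: substitute a real preference dataset that provides
# plain-text "prompt", "chosen", and "rejected" columns (an assumption here).
raw = load_dataset("your-org/your-preference-dataset", split="train")

def to_dpo_format(sample):
    # Reuse the notebook's chat template so prompts match the base model's format
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": sample["prompt"]}],
        tokenize=False,
        add_generation_prompt=True,
    )
    return {"prompt": prompt, "chosen": sample["chosen"], "rejected": sample["rejected"]}

real_dataset = raw.map(to_dpo_format)
# Then pass real_dataset as train_dataset to DPOTrainer exactly as above.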
In [ ]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)