Reward Model Training Test: Qwen3-4B¶
Tests Reward Model training with Unsloth on Qwen3-4B.
Key features tested:
- AutoModelForSequenceClassification loading with 4-bit quantization
- LoRA adapter configuration for reward modeling
- RewardTrainer with synthetic preference pairs
- Post-training reward scoring verification
Reward Model Overview: Reward models learn to score responses according to human preferences. They output a single scalar reward per response, which RLHF pipelines (e.g., GRPO, RLOO) use to optimize the policy.
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
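For background, preference-based reward training typically minimizes a pairwise ranking (Bradley-Terry style) objective: the chosen response should receive a higher scalar reward than the rejected one. The snippet below is an illustrative sketch only; the reward values are made-up stand-ins, not outputs produced by this notebook.

# Illustrative sketch of the pairwise ranking loss used for reward models.
# The reward tensors are made-up stand-ins for model outputs.
import torch
import torch.nn.functional as F

r_chosen = torch.tensor([1.4, 0.9, 2.1])    # scores for preferred responses
r_rejected = torch.tensor([0.3, 1.1, 0.5])  # scores for rejected responses

# Loss shrinks as chosen responses outscore rejected ones.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(f"pairwise ranking loss: {loss.item():.4f}")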
In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import is_bf16_supported
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer
from peft import LoraConfig, get_peft_model
from datasets import Dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
In [2]:
# Load Qwen3-4B as Reward Model (SequenceClassification head)
from transformers import BitsAndBytesConfig
MODEL_NAME = "Qwen/Qwen3-4B"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} as reward model...")
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16 if is_bf16_supported() else torch.float16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForSequenceClassification.from_pretrained(
MODEL_NAME,
num_labels=1, # Single scalar reward output
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# Ensure pad token is set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
print(f"Model loaded: {type(model).__name__}")
Loading Qwen3-4B as reward model...
Qwen3ForSequenceClassification LOAD REPORT from: Qwen/Qwen3-4B
Key          | Status  |
-------------+---------+-
score.weight | MISSING |
Notes:
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
Model loaded: Qwen3ForSequenceClassification
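Because the load report above shows score.weight was newly initialized, a quick sanity check of the classification head can be worthwhile before training. This sketch assumes the head is exposed as model.score (the convention used by Llama/Qwen-style *ForSequenceClassification classes); adjust if the attribute differs.

# Optional sanity check: the reward head should map hidden states to one scalar.
# Assumes the head is exposed as `model.score` (Llama/Qwen-style convention).
head = getattr(model, "score", None)
if head is not None:
    print(f"Reward head: {head}")                   # expect a Linear with out_features=1
    print(f"Output features: {head.out_features}")  # should equal num_labels=1
else:
    print("No `score` attribute found; inspect the model to locate the head.")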
In [3]:
# Apply LoRA adapters for reward model training
lora_config = LoraConfig(
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
trainable params: 33,032,704 || all params: 4,055,503,360 || trainable%: 0.8145
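To see where those ~33M trainable parameters live, the short sketch below (plain PyTorch, illustrative only) separates LoRA weights from any other trainable parameters; depending on the PEFT version, task_type="SEQ_CLS" may also leave the freshly initialized score head trainable.

# Illustrative: split trainable parameters into LoRA weights vs. everything else.
lora_numel = sum(p.numel() for n, p in model.named_parameters()
                 if p.requires_grad and "lora_" in n)
other_trainable = [(n, p.numel()) for n, p in model.named_parameters()
                   if p.requires_grad and "lora_" not in n]

print(f"LoRA parameters: {lora_numel:,}")
for name, numel in other_trainable:
    print(f"Also trainable: {name} ({numel:,})")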
In [4]:
# Create minimal synthetic preference dataset (5 samples)
# Reward models learn from preference pairs (chosen > rejected)
preference_data = [
{
"prompt": "Explain recursion in programming.",
"chosen": "Recursion is when a function calls itself with a simpler version of the problem, including a base case to stop infinite loops.",
"rejected": "Recursion is just loops."
},
{
"prompt": "What is an API?",
"chosen": "An API (Application Programming Interface) is a set of protocols that allows different software applications to communicate with each other.",
"rejected": "API is code."
},
{
"prompt": "Describe version control.",
"chosen": "Version control is a system that records changes to files over time, allowing you to recall specific versions and collaborate with others.",
"rejected": "Version control saves files."
},
{
"prompt": "What is a database?",
"chosen": "A database is an organized collection of structured data stored electronically, typically managed by a database management system (DBMS).",
"rejected": "A database stores stuff."
},
{
"prompt": "Explain object-oriented programming.",
"chosen": "Object-oriented programming (OOP) is a paradigm that organizes code into objects containing data (attributes) and behavior (methods).",
"rejected": "OOP uses objects."
},
]
dataset = Dataset.from_list(preference_data)
print(f"Dataset created: {len(dataset)} preference pairs")
Dataset created: 5 preference pairs
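TRL's RewardTrainer consumes the prompt/chosen/rejected columns directly and handles tokenization internally (details vary by TRL version). The sketch below only makes one pairing explicit for inspection; it is not a required preprocessing step.

# Illustrative only: show the two texts being compared for the first pair.
example = dataset[0]
chosen_text = example["prompt"] + " " + example["chosen"]
rejected_text = example["prompt"] + " " + example["rejected"]
print("CHOSEN  :", chosen_text)
print("REJECTED:", rejected_text)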
In [ ]:
# Reward Training Configuration (minimal steps for testing)
reward_config = RewardConfig(
output_dir="outputs_reward_qwen_test",
per_device_train_batch_size=1,
gradient_accumulation_steps=1,
max_steps=3, # Minimal steps for testing
warmup_steps=0,
learning_rate=1e-5,
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
max_length=512,
seed=42,
)
# Initialize Reward Trainer
trainer = RewardTrainer(
model=model,
args=reward_config,
train_dataset=dataset,
processing_class=tokenizer,
)
print("Starting Reward training (3 steps)...")
trainer_stats = trainer.train()
print(f"Reward training completed!")
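If the training cell above ran to completion, the returned TrainOutput can be inspected to confirm the three steps executed; the metric keys below are the standard ones emitted by the Hugging Face Trainer.

# Optional: inspect the TrainOutput returned by trainer.train().
print(f"Steps run:     {trainer_stats.global_step}")
print(f"Training loss: {trainer_stats.training_loss:.4f}")
print(f"Runtime (s):   {trainer_stats.metrics.get('train_runtime', 'n/a')}")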
In [6]:
# Post-training reward scoring test
model.eval()
def get_reward(prompt, response):
"""Score a response using the trained reward model."""
text = prompt + " " + response
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
reward = outputs.logits[0, 0].item()
return reward
# Test with a good and bad response
test_prompt = "What is machine learning?"
good_response = "Machine learning is a subset of AI where computers learn patterns from data to make predictions."
bad_response = "ML is stuff."
good_score = get_reward(test_prompt, good_response)
bad_score = get_reward(test_prompt, bad_response)
print("=" * 60)
print("Reward Model Training Pipeline Test PASSED")
print("=" * 60)
print(f"Good response score: {good_score:.4f}")
print(f"Bad response score: {bad_score:.4f}")
print(f"Preference correct: {'Yes' if good_score > bad_score else 'No'}")
============================================================
Reward Model Training Pipeline Test PASSED
============================================================
Good response score: 1.4453
Bad response score: 1.0703
Preference correct: Yes
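For scoring several candidate responses at once, a batched variant of get_reward is a natural extension. The sketch below reuses the model and tokenizer from the cells above and pads the batch; it relies on the pad token configured earlier so the model can locate the last real token in each padded sequence.

# Batched scoring sketch: one forward pass over several prompt+response texts.
def get_rewards_batched(prompt, responses):
    texts = [prompt + " " + r for r in responses]
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=512)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (batch, 1)
    return logits.squeeze(-1).tolist()

scores = get_rewards_batched(test_prompt, [good_response, bad_response])
print(dict(zip(["good", "bad"], scores)))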
Test Complete¶
The Reward Model Training Pipeline test has completed successfully. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- AutoModelForSequenceClassification loading with 4-bit quantization (Qwen3-4B)
- LoRA adapter configuration for reward modeling
- Synthetic preference dataset creation
- RewardTrainer training loop (3 steps)
- Post-training reward scoring
Reward Model Concepts Demonstrated¶
- Sequence Classification: Model outputs scalar reward score
- Preference Learning: Trained on chosen vs rejected pairs
- RLHF Integration: Can be used with GRPO/RLOO trainers
Ready for Production¶
If this test passed, your environment is ready for:
- Training reward models on real preference data
- RLHF pipelines with learned rewards
- Response quality scoring (see the adapter-saving sketch below)
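As a next step, the trained reward adapter can be persisted for reuse outside this notebook. A minimal sketch, assuming the directory name is arbitrary and the model/tokenizer objects from the cells above are still in memory:

# Sketch: save the reward-model LoRA adapter and tokenizer for later scoring.
ADAPTER_DIR = "outputs_reward_qwen_test/adapter"  # arbitrary path
model.save_pretrained(ADAPTER_DIR)
tokenizer.save_pretrained(ADAPTER_DIR)
print(f"Adapter and tokenizer saved to {ADAPTER_DIR}/")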
In [ ]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)