Reward Model Training Test: Qwen3-4B-Thinking-2507¶
Tests Reward Model training with Unsloth on Qwen3-4B-Thinking-2507.
Key features tested:
- AutoModelForSequenceClassification loading with 4-bit quantization
- LoRA adapter configuration for reward modeling
- RewardTrainer with preference pairs that score thinking quality
- Post-training reward scoring verification
Reward Model Overview: Reward models learn to score responses according to human preferences. They output a scalar reward for each response, which RLHF pipelines (GRPO, RLOO) then use for policy optimization.
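To make that objective concrete, here is a minimal sketch (illustrative, not taken from this notebook) of the pairwise Bradley-Terry style loss that reward-model trainers such as TRL's RewardTrainer typically minimize: the model scores both responses of a preference pair, and the loss pushes the chosen score above the rejected one.

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: -log sigmoid(r_chosen - r_rejected),
    # minimized when the chosen response outscores the rejected one.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy check with made-up scalar scores (one per response)
chosen = torch.tensor([0.8, 1.2])
rejected = torch.tensor([0.1, -0.3])
print(pairwise_reward_loss(chosen, rejected))  # small loss: chosen already outscores rejected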
Thinking Preference:
- Higher reward: Responses with quality self-questioning <think> blocks
- Lower reward: Responses with poor, minimal, or no thinking
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import is_bf16_supported
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer
from peft import LoraConfig, get_peft_model
from datasets import Dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
Out[1]:
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Out[1]:
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
Out[1]:
🦥 Unsloth Zoo will now patch everything to make training faster!
Out[1]:
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
In [2]:
# Load Qwen3-4B-Thinking-2507 as Reward Model (SequenceClassification head)
from transformers import BitsAndBytesConfig
MODEL_NAME = "Qwen/Qwen3-4B-Thinking-2507"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} as reward model...")
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16 if is_bf16_supported() else torch.float16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForSequenceClassification.from_pretrained(
MODEL_NAME,
num_labels=1, # Single scalar reward output
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# Ensure pad token is set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
print(f"Model loaded: {type(model).__name__}")
Out[2]:
Loading Qwen3-4B-Thinking-2507 as reward model...
Out[2]:
Qwen3ForSequenceClassification LOAD REPORT from: Qwen/Qwen3-4B-Thinking-2507
Key          | Status  |
-------------+---------+-
score.weight | MISSING |
Notes:
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
Out[2]:
Model loaded: Qwen3ForSequenceClassification
In [3]:
# Apply LoRA adapters for reward model training
lora_config = LoraConfig(
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
Out[3]:
trainable params: 33,032,704 || all params: 4,055,503,360 || trainable%: 0.8145
In [4]:
# Create minimal synthetic preference dataset with thinking quality contrast (5 samples)
# Reward models learn: good thinking (chosen) > poor/no thinking (rejected)
preference_data = [
{
"prompt": "Explain recursion in programming.",
"chosen": "<think>\nWhat is recursion exactly? It's when a function calls itself. But why would you do that? To break down problems into smaller pieces. What's the key thing users need to understand? The base case.\n</think>\n\nRecursion is when a function calls itself with a simpler version of the problem, including a base case to stop infinite loops.",
"rejected": "Recursion is just loops."
},
{
"prompt": "What is an API?",
"chosen": "<think>\nHow do I explain API to someone? What does it stand for? Application Programming Interface. What's a good analogy? Like a waiter taking orders between you and the kitchen.\n</think>\n\nAn API (Application Programming Interface) is a set of protocols that allows different software applications to communicate with each other.",
"rejected": "<think>\n\n</think>\n\nAPI is code."
},
{
"prompt": "Describe version control.",
"chosen": "<think>\nWhat's the core purpose of version control? Tracking changes over time. Why is that useful? You can go back to previous versions. What systems exist? Git is the most popular.\n</think>\n\nVersion control is a system that records changes to files over time, allowing you to recall specific versions and collaborate with others.",
"rejected": "Version control saves files."
},
{
"prompt": "What is a database?",
"chosen": "<think>\nWhat is the essential definition of a database? It stores data, but that's too simple. What makes it different from just files? It's organized and structured. What manages it? A DBMS.\n</think>\n\nA database is an organized collection of structured data stored electronically, typically managed by a database management system (DBMS).",
"rejected": "<think>\nDatabase.\n</think>\n\nA database stores stuff."
},
{
"prompt": "Explain object-oriented programming.",
"chosen": "<think>\nWhat are the key concepts of OOP? Objects, classes, encapsulation. But what's the core idea? Organizing code around objects that have both data and behavior. How do I explain this simply?\n</think>\n\nObject-oriented programming (OOP) is a paradigm that organizes code into objects containing data (attributes) and behavior (methods).",
"rejected": "OOP uses objects."
},
]
dataset = Dataset.from_list(preference_data)
print(f"Dataset created: {len(dataset)} preference pairs")
print(f"Chosen: quality self-questioning thinking | Rejected: poor/no thinking")
Out[4]:
Dataset created: 5 preference pairs
Chosen: quality self-questioning thinking | Rejected: poor/no thinking
In [ ]:
# Reward Training Configuration (minimal steps for testing)
reward_config = RewardConfig(
output_dir="outputs_reward_qwen_think_test",
per_device_train_batch_size=1,
gradient_accumulation_steps=1,
max_steps=3, # Minimal steps for testing
warmup_steps=0,
learning_rate=1e-5,
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
max_length=1024, # Increased for thinking content
seed=42,
)
# Initialize Reward Trainer
trainer = RewardTrainer(
model=model,
args=reward_config,
train_dataset=dataset,
processing_class=tokenizer,
)
print("Starting Reward training with thinking preferences (3 steps)...")
trainer_stats = trainer.train()
print(f"Reward training completed!")
In [6]:
# Post-training reward scoring test
model.eval()
def get_reward(prompt, response):
"""Score a response using the trained reward model."""
text = prompt + " " + response
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
reward = outputs.logits[0, 0].item()
return reward
# Test with good thinking vs poor thinking
test_prompt = "What is machine learning?"
good_response = """<think>
What is the user asking here? They want to understand machine learning. What are the key concepts? It's a subset of AI, involves learning from data. How should I explain this clearly?
</think>
Machine learning is a subset of AI where computers learn patterns from data to make predictions."""
poor_response = "ML is stuff."
good_score = get_reward(test_prompt, good_response)
poor_score = get_reward(test_prompt, poor_response)
print("=" * 60)
print("Reward Model Training Pipeline Test (Thinking Mode) PASSED")
print("=" * 60)
print(f"Good thinking response score: {good_score:.4f}")
print(f"Poor/no thinking score: {poor_score:.4f}")
print(f"Prefers good thinking: {'Yes' if good_score > poor_score else 'No'}")
Out[6]:
============================================================
Reward Model Training Pipeline Test (Thinking Mode) PASSED
============================================================
Good thinking response score: -4.5000
Poor/no thinking score: -5.6875
Prefers good thinking: Yes
Test Complete¶
The Reward Model Training Pipeline test with thinking preferences has completed successfully. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- AutoModelForSequenceClassification loading with 4-bit quantization (Qwen3-4B-Thinking-2507)
- LoRA adapter configuration for reward modeling
- Preference dataset with thinking quality contrast
- RewardTrainer training loop (3 steps)
- Post-training reward scoring prefers good thinking
Reward Model Concepts with Thinking¶
- Thinking Quality Scoring: Model learns to prefer self-questioning reasoning
- Preference Learning: Good <think> blocks score higher than poor/empty ones
- RLHF Integration: Use with GRPO/RLOO to optimize thinking quality (see the sketch below this list)
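As a concrete integration sketch (an assumption for illustration, not code from this notebook), the trained scorer can be wrapped as a reward function in the style TRL's GRPOTrainer accepts, i.e. a callable that receives prompts and completions and returns one float per completion. The thinking_reward name and the plain-text prompt/completion format are assumptions; it reuses the model and tokenizer objects from the cells above.

import torch

def thinking_reward(prompts, completions, **kwargs):
    # Score each (prompt, completion) pair with the reward model trained above;
    # same simple concatenation as get_reward() in this notebook (no chat template).
    rewards = []
    for prompt, completion in zip(prompts, completions):
        text = prompt + " " + completion
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            score = model(**inputs).logits[0, 0].item()
        rewards.append(score)
    return rewards

# e.g. GRPOTrainer(..., reward_funcs=[thinking_reward]) would then reward richer <think> blocks.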
Ready for Production¶
If this test passed, your environment is ready for:
- Training reward models that score thinking quality
- RLHF pipelines optimizing chain-of-thought reasoning
- Response quality scoring based on reasoning depth
In [7]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Out[7]:
Shutting down kernel to release GPU memory...
Out[7]:
{'status': 'ok', 'restart': False}