Reward Model Training Test: Qwen3-4B¶
Tests Reward Model training with Unsloth on Qwen3-4B.
Key features tested:
- AutoModelForSequenceClassification loading with 4-bit quantization
- LoRA adapter configuration for reward modeling
- RewardTrainer with synthetic preference pairs
- Post-training reward scoring verification
Reward Model Overview: Reward models learn to score responses according to human preferences. They output a single scalar reward per response, which RLHF pipelines (e.g., GRPO, RLOO) use to optimize the policy.
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
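For background, preference-based reward training typically minimizes a pairwise ranking (Bradley-Terry style) objective: the chosen response should receive a higher scalar reward than the rejected one. The snippet below is an illustrative sketch only; the reward values are made-up stand-ins, not outputs produced by this notebook.

# Illustrative sketch of the pairwise ranking loss used for reward models.
# The reward tensors are made-up stand-ins for model outputs.
import torch
import torch.nn.functional as F

r_chosen = torch.tensor([1.4, 0.9, 2.1])    # scores for preferred responses
r_rejected = torch.tensor([0.3, 1.1, 0.5])  # scores for rejected responses

# Loss shrinks as chosen responses outscore rejected ones.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(f"pairwise ranking loss: {loss.item():.4f}")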
In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import is_bf16_supported
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer
from peft import LoraConfig, get_peft_model
from datasets import Dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
In [2]:
# Load Qwen3-4B as Reward Model (SequenceClassification head)
from transformers import BitsAndBytesConfig
MODEL_NAME = "Qwen/Qwen3-4B"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} as reward model...")
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16 if is_bf16_supported() else torch.float16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForSequenceClassification.from_pretrained(
MODEL_NAME,
num_labels=1, # Single scalar reward output
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# Ensure pad token is set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
print(f"Model loaded: {type(model).__name__}")
Loading Qwen3-4B as reward model...
Qwen3ForSequenceClassification LOAD REPORT from: Qwen/Qwen3-4B
Key          | Status  |
-------------+---------+-
score.weight | MISSING |
Notes:
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
Model loaded: Qwen3ForSequenceClassification
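Because the load report above shows score.weight was newly initialized, a quick sanity check of the classification head can be worthwhile before training. This sketch assumes the head is exposed as model.score (the convention used by Llama/Qwen-style *ForSequenceClassification classes); adjust if the attribute differs.

# Optional sanity check: the reward head should map hidden states to one scalar.
# Assumes the head is exposed as `model.score` (Llama/Qwen-style convention).
head = getattr(model, "score", None)
if head is not None:
    print(f"Reward head: {head}")                   # expect a Linear with out_features=1
    print(f"Output features: {head.out_features}")  # should equal num_labels=1
else:
    print("No `score` attribute found; inspect the model to locate the head.")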
In [3]:
# Apply LoRA adapters for reward model training
lora_config = LoraConfig(
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
trainable params: 33,032,704 || all params: 4,055,503,360 || trainable%: 0.8145
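To see where those ~33M trainable parameters live, the short sketch below (plain PyTorch, illustrative only) separates LoRA weights from any other trainable parameters; depending on the PEFT version, task_type="SEQ_CLS" may also leave the freshly initialized score head trainable.

# Illustrative: split trainable parameters into LoRA weights vs. everything else.
lora_numel = sum(p.numel() for n, p in model.named_parameters()
                 if p.requires_grad and "lora_" in n)
other_trainable = [(n, p.numel()) for n, p in model.named_parameters()
                   if p.requires_grad and "lora_" not in n]

print(f"LoRA parameters: {lora_numel:,}")
for name, numel in other_trainable:
    print(f"Also trainable: {name} ({numel:,})")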
In [4]:
# Create minimal synthetic preference dataset (5 samples)
# Reward models learn from preference pairs (chosen > rejected)
preference_data = [
{
"prompt": "Explain recursion in programming.",
"chosen": "Recursion is when a function calls itself with a simpler version of the problem, including a base case to stop infinite loops.",
"rejected": "Recursion is just loops."
},
{
"prompt": "What is an API?",
"chosen": "An API (Application Programming Interface) is a set of protocols that allows different software applications to communicate with each other.",
"rejected": "API is code."
},
{
"prompt": "Describe version control.",
"chosen": "Version control is a system that records changes to files over time, allowing you to recall specific versions and collaborate with others.",
"rejected": "Version control saves files."
},
{
"prompt": "What is a database?",
"chosen": "A database is an organized collection of structured data stored electronically, typically managed by a database management system (DBMS).",
"rejected": "A database stores stuff."
},
{
"prompt": "Explain object-oriented programming.",
"chosen": "Object-oriented programming (OOP) is a paradigm that organizes code into objects containing data (attributes) and behavior (methods).",
"rejected": "OOP uses objects."
},
]
dataset = Dataset.from_list(preference_data)
print(f"Dataset created: {len(dataset)} preference pairs")
Dataset created: 5 preference pairs
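TRL's RewardTrainer consumes the prompt/chosen/rejected columns directly and handles tokenization internally (details vary by TRL version). The sketch below only makes one pairing explicit for inspection; it is not a required preprocessing step.

# Illustrative only: show the two texts being compared for the first pair.
example = dataset[0]
chosen_text = example["prompt"] + " " + example["chosen"]
rejected_text = example["prompt"] + " " + example["rejected"]
print("CHOSEN  :", chosen_text)
print("REJECTED:", rejected_text)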
In [ ]:
# Reward Training Configuration (minimal steps for testing)
reward_config = RewardConfig(
output_dir="outputs_reward_qwen_test",
per_device_train_batch_size=1,
gradient_accumulation_steps=1,
max_steps=3, # Minimal steps for testing
warmup_steps=0,
learning_rate=1e-5,
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
max_length=512,
seed=42,
)
# Initialize Reward Trainer
trainer = RewardTrainer(
model=model,
args=reward_config,
train_dataset=dataset,
processing_class=tokenizer,
)
print("Starting Reward training (3 steps)...")
trainer_stats = trainer.train()
print(f"Reward training completed!")
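If the training cell above ran to completion, the returned TrainOutput can be inspected to confirm the three steps executed; the metric keys below are the standard ones emitted by the Hugging Face Trainer.

# Optional: inspect the TrainOutput returned by trainer.train().
print(f"Steps run:     {trainer_stats.global_step}")
print(f"Training loss: {trainer_stats.training_loss:.4f}")
print(f"Runtime (s):   {trainer_stats.metrics.get('train_runtime', 'n/a')}")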
In [6]:
# Post-training reward scoring test
model.eval()
def get_reward(prompt, response):
"""Score a response using the trained reward model."""
text = prompt + " " + response
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
reward = outputs.logits[0, 0].item()
return reward
# Test with a good and bad response
test_prompt = "What is machine learning?"
good_response = "Machine learning is a subset of AI where computers learn patterns from data to make predictions."
bad_response = "ML is stuff."
good_score = get_reward(test_prompt, good_response)
bad_score = get_reward(test_prompt, bad_response)
print("=" * 60)
print("Reward Model Training Pipeline Test PASSED")
print("=" * 60)
print(f"Good response score: {good_score:.4f}")
print(f"Bad response score: {bad_score:.4f}")
print(f"Preference correct: {'Yes' if good_score > bad_score else 'No'}")
============================================================
Reward Model Training Pipeline Test PASSED
============================================================
Good response score: 1.4453
Bad response score: 1.0703
Preference correct: Yes
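For scoring several candidate responses at once, a batched variant of get_reward is a natural extension. The sketch below reuses the model and tokenizer from the cells above and pads the batch; it relies on the pad token configured earlier so the model can locate the last real token in each padded sequence.

# Batched scoring sketch: one forward pass over several prompt+response texts.
def get_rewards_batched(prompt, responses):
    texts = [prompt + " " + r for r in responses]
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=512)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (batch, 1)
    return logits.squeeze(-1).tolist()

scores = get_rewards_batched(test_prompt, [good_response, bad_response])
print(dict(zip(["good", "bad"], scores)))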
Test Complete¶
The Reward Model Training Pipeline test has completed successfully. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- AutoModelForSequenceClassification loading with 4-bit quantization (Qwen3-4B)
- LoRA adapter configuration for reward modeling
- Synthetic preference dataset creation
- RewardTrainer training loop (3 steps)
- Post-training reward scoring
Reward Model Concepts Demonstrated¶
- Sequence Classification: Model outputs scalar reward score
- Preference Learning: Trained on chosen vs rejected pairs
- RLHF Integration: Can be used with GRPO/RLOO trainers
Ready for Production¶
If this test passed, your environment is ready for:
- Training reward models on real preference data
- RLHF pipelines with learned rewards
- Response quality scoring (see the adapter-saving sketch below)
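As a next step, the trained reward adapter can be persisted for reuse outside this notebook. A minimal sketch, assuming the directory name is arbitrary and the model/tokenizer objects from the cells above are still in memory:

# Sketch: save the reward-model LoRA adapter and tokenizer for later scoring.
ADAPTER_DIR = "outputs_reward_qwen_test/adapter"  # arbitrary path
model.save_pretrained(ADAPTER_DIR)
tokenizer.save_pretrained(ADAPTER_DIR)
print(f"Adapter and tokenizer saved to {ADAPTER_DIR}/")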
In [ ]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)