RLOO Training Test: Qwen3-4B-Thinking-2507¶
Tests Reinforcement Learning with Leave-One-Out (RLOO) optimization with Unsloth on Qwen3-4B-Thinking-2507.
Key features tested:
- FastLanguageModel loading with 4-bit quantization
- LoRA adapter configuration
- RLOOTrainer with thinking-aware reward function
- Rewards self-questioning reasoning in `<think>` blocks
- Post-training inference verification
RLOO Overview: RLOO uses leave-one-out baseline estimation for variance reduction in policy gradients. For each completion, the baseline is computed as the mean reward of all other completions, providing more stable training than single-sample estimates.
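The sketch below is a minimal, self-contained illustration of this baseline (plain Python with made-up reward values, not TRL's internal implementation):

```python
# Minimal illustration of the leave-one-out baseline (not TRL's internal code).
# For K completions of one prompt, each completion's baseline is the mean
# reward of the other K - 1 completions; the advantage is reward - baseline.
rewards = [1.1, 0.3, 1.0, 0.7]  # example rewards for K = 4 completions
K = len(rewards)
advantages = []
for r in rewards:
    baseline = (sum(rewards) - r) / (K - 1)  # mean of the other K - 1 rewards
    advantages.append(r - baseline)
print(advantages)  # first completion: 1.1 - mean(0.3, 1.0, 0.7) ≈ 0.433
```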
Thinking Reward: The reward function evaluates:
- Presence of `<think>...</think>` tags
- Quality and length of reasoning
- Bonus for self-questioning (question marks in thinking)
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
In [9]:
# Environment Setup
import os
# FIX: Set ACCELERATE_MIXED_PRECISION BEFORE importing unsloth
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
from trl import RLOOConfig, RLOOTrainer
from datasets import Dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"ACCELERATE_MIXED_PRECISION: {os.environ.get('ACCELERATE_MIXED_PRECISION', 'not set')}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
Out[9]:
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Out[9]:
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues. if is_vllm_available():
Out[9]:
🦥 Unsloth Zoo will now patch everything to make training faster!
Out[9]:
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
ACCELERATE_MIXED_PRECISION: bf16
HF_TOKEN loaded: Yes
In [10]:
# Load Qwen3-4B-Thinking-2507 with 4-bit quantization
MODEL_NAME = "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]}...")
model, tokenizer = FastLanguageModel.from_pretrained(
MODEL_NAME,
max_seq_length=1024, # Increased for thinking content
load_in_4bit=True,
dtype=None,
)
# Ensure pad token is set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Model loaded: {type(model).__name__}")
Out[10]:
Loading Qwen3-4B-Thinking-2507-unsloth-bnb-4bit...
Out[10]:
==((====))==  Unsloth 2025.12.10: Fast Qwen3 patching. Transformers: 5.0.0.1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Out[10]:
Model loaded: Qwen3ForCausalLM
In [11]:
# Apply LoRA adapters for RLOO training
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Out[11]:
Unsloth 2025.12.10 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
Out[11]:
LoRA applied: 33,030,144 trainable / 2,526,543,360 total (1.31%)
In [12]:
# Create minimal synthetic prompt dataset for RLOO (5 prompts)
# RLOO requires prompts only - completions are generated during training
prompts = [
"Explain the concept of recursion in programming.",
"What are the benefits of using version control?",
"Describe how a hash table works.",
"What is the difference between a stack and a queue?",
"Explain what an API is to a beginner.",
]
# Format prompts for RLOO (requires "prompt" field)
dataset = Dataset.from_dict({
"prompt": [
tokenizer.apply_chat_template(
[{"role": "user", "content": p}],
tokenize=False,
add_generation_prompt=True
) for p in prompts
]
})
print(f"Dataset created: {len(dataset)} prompts")
print(f"Sample prompt:\n{dataset[0]['prompt'][:150]}...")
Out[12]:
Dataset created: 5 prompts
Sample prompt:
<|im_start|>user
Explain the concept of recursion in programming.<|im_end|>
<|im_start|>assistant
<think>
...
In [13]:
# Define thinking-aware reward function using token IDs
# TRL passes completion_ids directly - no re-tokenization needed!
THINK_END_TOKEN_ID = 151668 # </think> token for Qwen3-Thinking models
def thinking_reward_fn(completions, prompts=None, completion_ids=None, **kwargs):
"""
Token-based reward function using completion_ids provided by TRL.
- Uses token ID 151668 for </think> boundary detection
- Rewards longer, more detailed reasoning (measured in tokens)
- Bonus for self-questioning (question marks in thinking content)
Benefits over string matching:
- No re-tokenization overhead (faster training)
- Exact token boundaries (no regex edge cases)
- Consistent with inference code pattern
"""
rewards = []
for completion, comp_ids in zip(completions, completion_ids):
# Token-based detection: check for </think> token (ID 151668)
if THINK_END_TOKEN_ID in comp_ids:
end_idx = comp_ids.index(THINK_END_TOKEN_ID)
thinking_length = end_idx # Token count before </think>
# String-based content analysis for question detection
# (using string here is fine since we already know boundary from tokens)
thinking_content = completion.split('</think>')[0]
question_marks = thinking_content.count('?')
has_self_questions = question_marks >= 1
# Reward based on thinking token count
if thinking_length < 10:
reward = 0.3 # Minimal thinking
elif thinking_length < 30:
reward = 0.7 + (0.1 if has_self_questions else 0)
else:
reward = 1.0 + (0.1 if has_self_questions else 0)
else:
reward = -1.0 # No </think> token found
rewards.append(reward)
return rewards
print("Token-based thinking reward function defined")
print(f"Using THINK_END_TOKEN_ID = {THINK_END_TOKEN_ID}")
print("Rewards: thinking quality (token count) + self-questioning bonus")
Out[13]:
Token-based thinking reward function defined
Using THINK_END_TOKEN_ID = 151668
Rewards: thinking quality (token count) + self-questioning bonus
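As a quick sanity check, the reward function can be exercised with hand-built toy inputs before training (the token IDs below are made up, apart from 151668; during training, TRL supplies the real completions and completion_ids):

```python
# Toy sanity check for thinking_reward_fn (illustrative inputs, not real generations).
# Relies on thinking_reward_fn and THINK_END_TOKEN_ID defined in the cell above.
toy_completions = [
    "Hmm, what does the user actually need? Let me reason step by step.</think>Recursion is ...",
    "A stack is LIFO and a queue is FIFO.",  # never emits the </think> token
]
toy_completion_ids = [
    [101] * 40 + [THINK_END_TOKEN_ID] + [102] * 5,  # 40 "thinking" tokens, then </think>
    [103] * 20,                                     # no </think> token at all
]
print(thinking_reward_fn(toy_completions, completion_ids=toy_completion_ids))
# Expected with the thresholds above: [1.1, -1.0]
```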
In [ ]:
# RLOO Training Configuration (minimal steps for testing)
rloo_config = RLOOConfig(
output_dir="outputs_rloo_qwen_think_test",
per_device_train_batch_size=4,
gradient_accumulation_steps=1,
max_steps=2, # Minimal steps for testing
warmup_steps=0,
learning_rate=1e-5,
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
num_generations=4, # Completions per prompt for leave-one-out
max_completion_length=128, # Increased for thinking content
beta=0.05,
seed=42,
)
# Initialize RLOO Trainer
trainer = RLOOTrainer(
model=model,
args=rloo_config,
train_dataset=dataset,
processing_class=tokenizer,
reward_funcs=thinking_reward_fn,
)
print("Starting RLOO training with thinking rewards (2 steps)...")
trainer_stats = trainer.train()
print(f"RLOO training completed!")
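A brief note on what this configuration implies for the RLOO groups (illustrative arithmetic over the config values; consult the installed TRL version's documentation for its exact batching rules):

```python
# Illustrative only: how the config above maps onto RLOO groups.
# Uses rloo_config from the previous cell.
k = rloo_config.num_generations
print(f"{k} completions are sampled per prompt; each completion's leave-one-out "
      f"baseline is the mean reward of the other {k - 1}.")
print(f"Completions are capped at {rloo_config.max_completion_length} tokens, and "
      f"beta={rloo_config.beta} weights the KL penalty against the reference policy.")
```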
In [15]:
# Post-training inference test
FastLanguageModel.for_inference(model)
test_prompt = "What is machine learning?"
messages = [{"role": "user", "content": test_prompt}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=1024, # Increased to allow full thinking + response
temperature=0.6,
top_p=0.95,
do_sample=True,
pad_token_id=tokenizer.pad_token_id,
)
# Get generated token IDs only (exclude prompt)
input_length = inputs["input_ids"].shape[1]
generated_ids = outputs[0][input_length:].tolist()
# Token-based parsing using </think> token ID
THINK_END_TOKEN_ID = 151668
if THINK_END_TOKEN_ID in generated_ids:
end_idx = generated_ids.index(THINK_END_TOKEN_ID)
thinking = tokenizer.decode(generated_ids[:end_idx], skip_special_tokens=True).strip()
final_resp = tokenizer.decode(generated_ids[end_idx + 1:], skip_special_tokens=True).strip()
think_tag_found = True
else:
thinking = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
final_resp = "(Model did not complete thinking - increase max_new_tokens)"
think_tag_found = False
print("=" * 60)
print("RLOO Training Pipeline Test (Thinking Mode)")
print("=" * 60)
print(f"</think> token found: {'✅ YES' if think_tag_found else '❌ NO'}")
print(f"Output tokens: {len(generated_ids)}")
print(f"\nTHINKING: {thinking[:300]}..." if len(thinking) > 300 else f"\nTHINKING: {thinking}")
print(f"\nRESPONSE: {final_resp[:200]}..." if len(final_resp) > 200 else f"\nRESPONSE: {final_resp}")
if think_tag_found and thinking and final_resp:
print("\n✅ RLOO Training Pipeline Test PASSED")
else:
print("\n⚠️ Test completed but output may need review")
Out[15]:
============================================================
RLOO Training Pipeline Test (Thinking Mode)
============================================================
</think> token found: ✅ YES
Output tokens: 1024

THINKING: Okay, the user is asking "What is machine learning?" Hmm, this is a pretty basic question but super important. Let me think about how to approach this. First, I should consider who might be asking this. Could be a complete beginner - maybe a student, a curious professional, or someone who just hear...

RESPONSE: That's a great question—and one of the most important in tech today! Let me break it down **simply, clearly, and with real-world examples** (no jargon overload). Here's what you need to know: --- ##...

✅ RLOO Training Pipeline Test PASSED
Test Complete¶
The RLOO Training Pipeline test with thinking rewards has completed successfully. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- FastLanguageModel loading with 4-bit quantization (Qwen3-4B-Thinking-2507)
- LoRA adapter configuration for RL training
- Synthetic prompt dataset creation
- Thinking-aware reward function (rewards reasoning quality + self-questioning)
- RLOOTrainer training loop (2 steps)
- Post-training inference with thinking output
RLOO Concepts with Thinking¶
- Leave-One-Out Baseline: Each completion's baseline is the mean of the other K-1 rewards
- Thinking Reward: Evaluates the quality of `<think>` content
- Stable Optimization: Variance reduction enables reliable optimization of thinking quality
- KL Penalty: Prevents policy from diverging too far from reference
Ready for Production¶
If this test passed, your environment is ready for:
- RLOO training with thinking-focused reward models
- RLHF pipelines optimizing reasoning quality with stable gradients
- Chain-of-thought policy refinement
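Once a full training run finishes, the LoRA adapters can be persisted for reuse. A minimal sketch (the directory names below are illustrative):

```python
# Save only the LoRA adapter weights plus the tokenizer (directory names are illustrative).
model.save_pretrained("qwen3_thinking_rloo_lora")
tokenizer.save_pretrained("qwen3_thinking_rloo_lora")

# Optionally merge adapters into full 16-bit weights for standalone serving;
# Unsloth exposes save_pretrained_merged for this (see the Unsloth docs for options).
# model.save_pretrained_merged("qwen3_thinking_rloo_merged", tokenizer, save_method="merged_16bit")
```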
In [16]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Out[16]:
Shutting down kernel to release GPU memory...
Out[16]:
{'status': 'ok', 'restart': False}