RLOO Training Test: Qwen3-4B-Thinking-2507¶
Tests Reinforcement Learning with Leave-One-Out (RLOO) optimization with Unsloth on Qwen3-4B-Thinking-2507.
Key features tested:
- FastLanguageModel loading with 4-bit quantization
- LoRA adapter configuration
- RLOOTrainer with thinking-aware reward function
- Rewards self-questioning reasoning in `<think>` blocks
- Post-training inference verification
RLOO Overview: RLOO uses leave-one-out baseline estimation for variance reduction in policy gradients. For each completion, the baseline is computed as the mean reward of all other completions, providing more stable training than single-sample estimates.
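The sketch below is a minimal, self-contained illustration of this baseline (plain Python with made-up reward values, not TRL's internal implementation):

```python
# Minimal illustration of the leave-one-out baseline (not TRL's internal code).
# For K completions of one prompt, each completion's baseline is the mean
# reward of the other K - 1 completions; the advantage is reward - baseline.
rewards = [1.1, 0.3, 1.0, 0.7]  # example rewards for K = 4 completions
K = len(rewards)
advantages = []
for r in rewards:
    baseline = (sum(rewards) - r) / (K - 1)  # mean of the other K - 1 rewards
    advantages.append(r - baseline)
print(advantages)  # first completion: 1.1 - mean(0.3, 1.0, 0.7) ≈ 0.433
```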
Thinking Reward: The reward function evaluates:
- Presence of `<think>...</think>` tags
- Quality and length of reasoning
- Bonus for self-questioning (question marks in thinking)
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
In [9]:
# Environment Setup
import os
# FIX: Set ACCELERATE_MIXED_PRECISION BEFORE importing unsloth
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
from trl import RLOOConfig, RLOOTrainer
from datasets import Dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"ACCELERATE_MIXED_PRECISION: {os.environ.get('ACCELERATE_MIXED_PRECISION', 'not set')}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
Out[9]:
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Out[9]:
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues. if is_vllm_available():
Out[9]:
🦥 Unsloth Zoo will now patch everything to make training faster!
Out[9]:
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
ACCELERATE_MIXED_PRECISION: bf16
HF_TOKEN loaded: Yes
In [10]:
# Load Qwen3-4B-Thinking-2507 with 4-bit quantization
MODEL_NAME = "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]}...")
model, tokenizer = FastLanguageModel.from_pretrained(
MODEL_NAME,
max_seq_length=1024, # Increased for thinking content
load_in_4bit=True,
dtype=None,
)
# Ensure pad token is set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Model loaded: {type(model).__name__}")
Out[10]:
Loading Qwen3-4B-Thinking-2507-unsloth-bnb-4bit...
Out[10]:
==((====))==  Unsloth 2025.12.10: Fast Qwen3 patching. Transformers: 5.0.0.1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Out[10]:
Model loaded: Qwen3ForCausalLM
In [11]:
# Apply LoRA adapters for RLOO training
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Out[11]:
Unsloth 2025.12.10 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
Out[11]:
LoRA applied: 33,030,144 trainable / 2,526,543,360 total (1.31%)
In [12]:
# Create minimal synthetic prompt dataset for RLOO (5 prompts)
# RLOO requires prompts only - completions are generated during training
prompts = [
"Explain the concept of recursion in programming.",
"What are the benefits of using version control?",
"Describe how a hash table works.",
"What is the difference between a stack and a queue?",
"Explain what an API is to a beginner.",
]
# Format prompts for RLOO (requires "prompt" field)
dataset = Dataset.from_dict({
"prompt": [
tokenizer.apply_chat_template(
[{"role": "user", "content": p}],
tokenize=False,
add_generation_prompt=True
) for p in prompts
]
})
print(f"Dataset created: {len(dataset)} prompts")
print(f"Sample prompt:\n{dataset[0]['prompt'][:150]}...")
Out[12]:
Dataset created: 5 prompts
Sample prompt:
<|im_start|>user
Explain the concept of recursion in programming.<|im_end|>
<|im_start|>assistant
<think>
...
In [13]:
# Define thinking-aware reward function using token IDs
# TRL passes completion_ids directly - no re-tokenization needed!
THINK_END_TOKEN_ID = 151668 # </think> token for Qwen3-Thinking models
def thinking_reward_fn(completions, prompts=None, completion_ids=None, **kwargs):
"""
Token-based reward function using completion_ids provided by TRL.
- Uses token ID 151668 for </think> boundary detection
- Rewards longer, more detailed reasoning (measured in tokens)
- Bonus for self-questioning (question marks in thinking content)
Benefits over string matching:
- No re-tokenization overhead (faster training)
- Exact token boundaries (no regex edge cases)
- Consistent with inference code pattern
"""
rewards = []
for completion, comp_ids in zip(completions, completion_ids):
# Token-based detection: check for </think> token (ID 151668)
if THINK_END_TOKEN_ID in comp_ids:
end_idx = comp_ids.index(THINK_END_TOKEN_ID)
thinking_length = end_idx # Token count before </think>
# String-based content analysis for question detection
# (using string here is fine since we already know boundary from tokens)
thinking_content = completion.split('</think>')[0]
question_marks = thinking_content.count('?')
has_self_questions = question_marks >= 1
# Reward based on thinking token count
if thinking_length < 10:
reward = 0.3 # Minimal thinking
elif thinking_length < 30:
reward = 0.7 + (0.1 if has_self_questions else 0)
else:
reward = 1.0 + (0.1 if has_self_questions else 0)
else:
reward = -1.0 # No </think> token found
rewards.append(reward)
return rewards
print("Token-based thinking reward function defined")
print(f"Using THINK_END_TOKEN_ID = {THINK_END_TOKEN_ID}")
print("Rewards: thinking quality (token count) + self-questioning bonus")
Out[13]:
Token-based thinking reward function defined
Using THINK_END_TOKEN_ID = 151668
Rewards: thinking quality (token count) + self-questioning bonus
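As a quick sanity check, the reward function can be exercised with hand-built toy inputs before training (the token IDs below are made up, apart from 151668; during training, TRL supplies the real completions and completion_ids):

```python
# Toy sanity check for thinking_reward_fn (illustrative inputs, not real generations).
# Relies on thinking_reward_fn and THINK_END_TOKEN_ID defined in the cell above.
toy_completions = [
    "Hmm, what does the user actually need? Let me reason step by step.</think>Recursion is ...",
    "A stack is LIFO and a queue is FIFO.",  # never emits the </think> token
]
toy_completion_ids = [
    [101] * 40 + [THINK_END_TOKEN_ID] + [102] * 5,  # 40 "thinking" tokens, then </think>
    [103] * 20,                                     # no </think> token at all
]
print(thinking_reward_fn(toy_completions, completion_ids=toy_completion_ids))
# Expected with the thresholds above: [1.1, -1.0]
```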
In [ ]:
# RLOO Training Configuration (minimal steps for testing)
rloo_config = RLOOConfig(
output_dir="outputs_rloo_qwen_think_test",
per_device_train_batch_size=4,
gradient_accumulation_steps=1,
max_steps=2, # Minimal steps for testing
warmup_steps=0,
learning_rate=1e-5,
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
num_generations=4, # Completions per prompt for leave-one-out
max_completion_length=128, # Increased for thinking content
beta=0.05,
seed=42,
)
# Initialize RLOO Trainer
trainer = RLOOTrainer(
model=model,
args=rloo_config,
train_dataset=dataset,
processing_class=tokenizer,
reward_funcs=thinking_reward_fn,
)
print("Starting RLOO training with thinking rewards (2 steps)...")
trainer_stats = trainer.train()
print(f"RLOO training completed!")
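A brief note on what this configuration implies for the RLOO groups (illustrative arithmetic over the config values; consult the installed TRL version's documentation for its exact batching rules):

```python
# Illustrative only: how the config above maps onto RLOO groups.
# Uses rloo_config from the previous cell.
k = rloo_config.num_generations
print(f"{k} completions are sampled per prompt; each completion's leave-one-out "
      f"baseline is the mean reward of the other {k - 1}.")
print(f"Completions are capped at {rloo_config.max_completion_length} tokens, and "
      f"beta={rloo_config.beta} weights the KL penalty against the reference policy.")
```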
In [15]:
# Post-training inference test
FastLanguageModel.for_inference(model)
test_prompt = "What is machine learning?"
messages = [{"role": "user", "content": test_prompt}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=1024, # Increased to allow full thinking + response
temperature=0.6,
top_p=0.95,
do_sample=True,
pad_token_id=tokenizer.pad_token_id,
)
# Get generated token IDs only (exclude prompt)
input_length = inputs["input_ids"].shape[1]
generated_ids = outputs[0][input_length:].tolist()
# Token-based parsing using </think> token ID
THINK_END_TOKEN_ID = 151668
if THINK_END_TOKEN_ID in generated_ids:
end_idx = generated_ids.index(THINK_END_TOKEN_ID)
thinking = tokenizer.decode(generated_ids[:end_idx], skip_special_tokens=True).strip()
final_resp = tokenizer.decode(generated_ids[end_idx + 1:], skip_special_tokens=True).strip()
think_tag_found = True
else:
thinking = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
final_resp = "(Model did not complete thinking - increase max_new_tokens)"
think_tag_found = False
print("=" * 60)
print("RLOO Training Pipeline Test (Thinking Mode)")
print("=" * 60)
print(f"</think> token found: {'✅ YES' if think_tag_found else '❌ NO'}")
print(f"Output tokens: {len(generated_ids)}")
print(f"\nTHINKING: {thinking[:300]}..." if len(thinking) > 300 else f"\nTHINKING: {thinking}")
print(f"\nRESPONSE: {final_resp[:200]}..." if len(final_resp) > 200 else f"\nRESPONSE: {final_resp}")
if think_tag_found and thinking and final_resp:
print("\n✅ RLOO Training Pipeline Test PASSED")
else:
print("\n⚠️ Test completed but output may need review")
Out[15]:
============================================================
RLOO Training Pipeline Test (Thinking Mode)
============================================================
</think> token found: ✅ YES
Output tokens: 1024

THINKING: Okay, the user is asking "What is machine learning?" Hmm, this is a pretty basic question but super important. Let me think about how to approach this. First, I should consider who might be asking this. Could be a complete beginner - maybe a student, a curious professional, or someone who just hear...

RESPONSE: That's a great question—and one of the most important in tech today! Let me break it down **simply, clearly, and with real-world examples** (no jargon overload). Here's what you need to know: --- ##...

✅ RLOO Training Pipeline Test PASSED
Test Complete¶
The RLOO Training Pipeline test with thinking rewards has completed successfully. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- FastLanguageModel loading with 4-bit quantization (Qwen3-4B-Thinking-2507)
- LoRA adapter configuration for RL training
- Synthetic prompt dataset creation
- Thinking-aware reward function (rewards reasoning quality + self-questioning)
- RLOOTrainer training loop (2 steps)
- Post-training inference with thinking output
RLOO Concepts with Thinking¶
- Leave-One-Out Baseline: Each completion's baseline is the mean of the other K-1 rewards
- Thinking Reward: Evaluates the quality of `<think>` content
- Stable Optimization: Variance reduction enables reliable optimization of thinking quality
- KL Penalty: Prevents policy from diverging too far from reference
Ready for Production¶
If this test passed, your environment is ready for:
- RLOO training with thinking-focused reward models
- RLHF pipelines optimizing reasoning quality with stable gradients
- Chain-of-thought policy refinement
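Once a full training run finishes, the LoRA adapters can be persisted for reuse. A minimal sketch (the directory names below are illustrative):

```python
# Save only the LoRA adapter weights plus the tokenizer (directory names are illustrative).
model.save_pretrained("qwen3_thinking_rloo_lora")
tokenizer.save_pretrained("qwen3_thinking_rloo_lora")

# Optionally merge adapters into full 16-bit weights for standalone serving;
# Unsloth exposes save_pretrained_merged for this (see the Unsloth docs for options).
# model.save_pretrained_merged("qwen3_thinking_rloo_merged", tokenizer, save_method="merged_16bit")
```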
In [16]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Out[16]:
Shutting down kernel to release GPU memory...
Out[16]:
{'status': 'ok', 'restart': False}