GRPO Training Test: Qwen3-4B-Thinking-2507¶

Tests Group Relative Policy Optimization (GRPO) reinforcement learning with Unsloth on Qwen3-4B-Thinking-2507.

Key features tested:

FastLanguageModel loading with 4-bit quantization
LoRA adapter configuration
GRPOTrainer with thinking-aware reward function
Rewards self-questioning reasoning in <think> blocks
Post-training inference verification

GRPO Overview: GRPO is a reinforcement learning method that optimizes language models using relative policy gradients. It compares multiple completions per prompt and learns from their relative rewards.

Thinking Reward: The reward function evaluates:

Presence of <think>...</think> tags
Quality and length of reasoning
Bonus for self-questioning (question marks in thinking)

Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.

In [1]:

  Copied!     
 
# Environment Setup
import os

# FIX: Set ACCELERATE_MIXED_PRECISION BEFORE importing unsloth
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'

from dotenv import load_dotenv
load_dotenv()

# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"

# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

import torch
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset

# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"ACCELERATE_MIXED_PRECISION: {os.environ.get('ACCELERATE_MIXED_PRECISION', 'not set')}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
# Environment Setup import os # FIX: Set ACCELERATE_MIXED_PRECISION BEFORE importing unsloth os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16' from dotenv import load_dotenv load_dotenv() # Force text-based progress instead of HTML widgets os.environ["TQDM_NOTEBOOK"] = "false" # CRITICAL: Import unsloth FIRST for proper TRL patching import unsloth from unsloth import FastLanguageModel, is_bf16_supported import torch from trl import GRPOConfig, GRPOTrainer from datasets import Dataset # Environment summary gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU" print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}") print(f"ACCELERATE_MIXED_PRECISION: {os.environ.get('ACCELERATE_MIXED_PRECISION', 'not set')}") print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.

/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():

🦥 Unsloth Zoo will now patch everything to make training faster!Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
ACCELERATE_MIXED_PRECISION: bf16
HF_TOKEN loaded: Yes

In [2]:

  Copied!     
 
# Load Qwen3-4B-Thinking-2507 with 4-bit quantization
MODEL_NAME = "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]}...")

model, tokenizer = FastLanguageModel.from_pretrained(
    MODEL_NAME,
    max_seq_length=1024,  # Increased for thinking content
    load_in_4bit=True,
    dtype=None,
)

# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

print(f"Model loaded: {type(model).__name__}")
# Load Qwen3-4B-Thinking-2507 with 4-bit quantization MODEL_NAME = "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit" print(f"\nLoading {MODEL_NAME.split('/')[-1]}...") model, tokenizer = FastLanguageModel.from_pretrained( MODEL_NAME, max_seq_length=1024, # Increased for thinking content load_in_4bit=True, dtype=None, ) # Ensure pad token is set if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token tokenizer.pad_token_id = tokenizer.eos_token_id print(f"Model loaded: {type(model).__name__}")

Loading Qwen3-4B-Thinking-2507-unsloth-bnb-4bit...==((====))==  Unsloth 2025.12.10: Fast Qwen3 patching. Transformers: 5.0.0.1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Loading weights:   0%|          | 0/398 [00:00<?, ?it/s]

Model loaded: Qwen3ForCausalLM

In [3]:

  Copied!     
 
# Apply LoRA adapters for GRPO training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
# Apply LoRA adapters for GRPO training model = FastLanguageModel.get_peft_model( model, r=16, lora_alpha=16, lora_dropout=0, target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], bias="none", use_gradient_checkpointing="unsloth", random_state=42, ) trainable = sum(p.numel() for p in model.parameters() if p.requires_grad) total = sum(p.numel() for p in model.parameters()) print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")

Unsloth 2025.12.10 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.

LoRA applied: 33,030,144 trainable / 2,526,543,360 total (1.31%)

In [4]:

  Copied!     
 
# Create minimal synthetic prompt dataset for GRPO (5 prompts)
# GRPO requires prompts only - completions are generated during training

prompts = [
    "Explain the concept of recursion in programming.",
    "What are the benefits of using version control?",
    "Describe how a hash table works.",
    "What is the difference between a stack and a queue?",
    "Explain what an API is to a beginner.",
]

# Format prompts for GRPO (requires "prompt" field)
dataset = Dataset.from_dict({
    "prompt": [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": p}],
            tokenize=False,
            add_generation_prompt=True
        ) for p in prompts
    ]
})

print(f"Dataset created: {len(dataset)} prompts")
print(f"Sample prompt:\n{dataset[0]['prompt'][:150]}...")
# Create minimal synthetic prompt dataset for GRPO (5 prompts) # GRPO requires prompts only - completions are generated during training prompts = [ "Explain the concept of recursion in programming.", "What are the benefits of using version control?", "Describe how a hash table works.", "What is the difference between a stack and a queue?", "Explain what an API is to a beginner.", ] # Format prompts for GRPO (requires "prompt" field) dataset = Dataset.from_dict({ "prompt": [ tokenizer.apply_chat_template( [{"role": "user", "content": p}], tokenize=False, add_generation_prompt=True ) for p in prompts ] }) print(f"Dataset created: {len(dataset)} prompts") print(f"Sample prompt:\n{dataset[0]['prompt'][:150]}...")

Dataset created: 5 prompts
Sample prompt:
<|im_start|>user
Explain the concept of recursion in programming.<|im_end|>
<|im_start|>assistant
<think>
...

In [5]:

  Copied!     
 
# Define thinking-aware reward function using token IDs
# TRL passes completion_ids directly - no re-tokenization needed!

THINK_END_TOKEN_ID = 151668  # </think> token for Qwen3-Thinking models

def thinking_reward_fn(completions, prompts=None, completion_ids=None, **kwargs):
    """
    Token-based reward function using completion_ids provided by TRL.
    
    - Uses token ID 151668 for </think> boundary detection
    - Rewards longer, more detailed reasoning (measured in tokens)
    - Bonus for self-questioning (question marks in thinking content)
    
    Benefits over string matching:
    - No re-tokenization overhead (faster training)
    - Exact token boundaries (no regex edge cases)
    - Consistent with inference code pattern
    """
    rewards = []
    
    for completion, comp_ids in zip(completions, completion_ids):
        # Token-based detection: check for </think> token (ID 151668)
        if THINK_END_TOKEN_ID in comp_ids:
            end_idx = comp_ids.index(THINK_END_TOKEN_ID)
            thinking_length = end_idx  # Token count before </think>
            
            # String-based content analysis for question detection
            # (using string here is fine since we already know boundary from tokens)
            thinking_content = completion.split('</think>')[0]
            question_marks = thinking_content.count('?')
            has_self_questions = question_marks >= 1
            
            # Reward based on thinking token count
            if thinking_length < 10:
                reward = 0.3  # Minimal thinking
            elif thinking_length < 30:
                reward = 0.7 + (0.1 if has_self_questions else 0)
            else:
                reward = 1.0 + (0.1 if has_self_questions else 0)
        else:
            reward = -1.0  # No </think> token found
        
        rewards.append(reward)
    
    return rewards

print("Token-based thinking reward function defined")
print(f"Using THINK_END_TOKEN_ID = {THINK_END_TOKEN_ID}")
print("Rewards: thinking quality (token count) + self-questioning bonus")
# Define thinking-aware reward function using token IDs # TRL passes completion_ids directly - no re-tokenization needed! THINK_END_TOKEN_ID = 151668 # token for Qwen3-Thinking models def thinking_reward_fn(completions, prompts=None, completion_ids=None, **kwargs): """ Token-based reward function using completion_ids provided by TRL. - Uses token ID 151668 for boundary detection - Rewards longer, more detailed reasoning (measured in tokens) - Bonus for self-questioning (question marks in thinking content) Benefits over string matching: - No re-tokenization overhead (faster training) - Exact token boundaries (no regex edge cases) - Consistent with inference code pattern """ rewards = [] for completion, comp_ids in zip(completions, completion_ids): # Token-based detection: check for token (ID 151668) if THINK_END_TOKEN_ID in comp_ids: end_idx = comp_ids.index(THINK_END_TOKEN_ID) thinking_length = end_idx # Token count before # String-based content analysis for question detection # (using string here is fine since we already know boundary from tokens) thinking_content = completion.split('')[0] question_marks = thinking_content.count('?') has_self_questions = question_marks >= 1 # Reward based on thinking token count if thinking_length < 10: reward = 0.3 # Minimal thinking elif thinking_length < 30: reward = 0.7 + (0.1 if has_self_questions else 0) else: reward = 1.0 + (0.1 if has_self_questions else 0) else: reward = -1.0 # No token found rewards.append(reward) return rewards print("Token-based thinking reward function defined") print(f"Using THINK_END_TOKEN_ID = {THINK_END_TOKEN_ID}") print("Rewards: thinking quality (token count) + self-questioning bonus")

Token-based thinking reward function defined
Using THINK_END_TOKEN_ID = 151668
Rewards: thinking quality (token count) + self-questioning bonus

In [ ]:

  Copied!     
 
# GRPO Training Configuration (minimal steps for testing)
grpo_config = GRPOConfig(
    output_dir="outputs_grpo_qwen_think_test",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=2,  # Minimal steps for testing
    warmup_steps=0,
    learning_rate=1e-5,
    logging_steps=1,
    fp16=not is_bf16_supported(),
    bf16=is_bf16_supported(),
    optim="adamw_8bit",
    max_completion_length=128,  # Increased for thinking content
    num_generations=2,
    beta=0.1,
    seed=42,
)

# Initialize GRPO Trainer
trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    reward_funcs=thinking_reward_fn,
)

print("Starting GRPO training with thinking rewards (2 steps)...")
trainer_stats = trainer.train()
print(f"GRPO training completed!")
# GRPO Training Configuration (minimal steps for testing) grpo_config = GRPOConfig( output_dir="outputs_grpo_qwen_think_test", per_device_train_batch_size=2, gradient_accumulation_steps=1, max_steps=2, # Minimal steps for testing warmup_steps=0, learning_rate=1e-5, logging_steps=1, fp16=not is_bf16_supported(), bf16=is_bf16_supported(), optim="adamw_8bit", max_completion_length=128, # Increased for thinking content num_generations=2, beta=0.1, seed=42, ) # Initialize GRPO Trainer trainer = GRPOTrainer( model=model, args=grpo_config, train_dataset=dataset, processing_class=tokenizer, reward_funcs=thinking_reward_fn, ) print("Starting GRPO training with thinking rewards (2 steps)...") trainer_stats = trainer.train() print(f"GRPO training completed!")

In [7]:

  Copied!     
 
# Post-training inference test
FastLanguageModel.for_inference(model)

test_prompt = "Explain what machine learning is in simple terms."
messages = [{"role": "user", "content": test_prompt}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,  # Increased to allow full thinking + response
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )

# Get generated token IDs only (exclude prompt)
input_length = inputs["input_ids"].shape[1]
generated_ids = outputs[0][input_length:].tolist()

# Token-based parsing using </think> token ID
THINK_END_TOKEN_ID = 151668

if THINK_END_TOKEN_ID in generated_ids:
    end_idx = generated_ids.index(THINK_END_TOKEN_ID)
    thinking = tokenizer.decode(generated_ids[:end_idx], skip_special_tokens=True).strip()
    final_resp = tokenizer.decode(generated_ids[end_idx + 1:], skip_special_tokens=True).strip()
    think_tag_found = True
else:
    thinking = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
    final_resp = "(Model did not complete thinking - increase max_new_tokens)"
    think_tag_found = False

print("=" * 60)
print("GRPO Training Pipeline Test (Thinking Mode)")
print("=" * 60)
print(f"</think> token found: {'✅ YES' if think_tag_found else '❌ NO'}")
print(f"Output tokens: {len(generated_ids)}")
print(f"\nTHINKING: {thinking[:300]}..." if len(thinking) > 300 else f"\nTHINKING: {thinking}")
print(f"\nRESPONSE: {final_resp[:200]}..." if len(final_resp) > 200 else f"\nRESPONSE: {final_resp}")

if think_tag_found and thinking and final_resp:
    print("\n✅ GRPO Training Pipeline Test PASSED")
else:
    print("\n⚠️ Test completed but output may need review")
# Post-training inference test FastLanguageModel.for_inference(model) test_prompt = "Explain what machine learning is in simple terms." messages = [{"role": "user", "content": test_prompt}] prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) inputs = tokenizer(prompt, return_tensors="pt").to("cuda") with torch.no_grad(): outputs = model.generate( **inputs, max_new_tokens=1024, # Increased to allow full thinking + response temperature=0.6, top_p=0.95, do_sample=True, pad_token_id=tokenizer.pad_token_id, ) # Get generated token IDs only (exclude prompt) input_length = inputs["input_ids"].shape[1] generated_ids = outputs[0][input_length:].tolist() # Token-based parsing using token ID THINK_END_TOKEN_ID = 151668 if THINK_END_TOKEN_ID in generated_ids: end_idx = generated_ids.index(THINK_END_TOKEN_ID) thinking = tokenizer.decode(generated_ids[:end_idx], skip_special_tokens=True).strip() final_resp = tokenizer.decode(generated_ids[end_idx + 1:], skip_special_tokens=True).strip() think_tag_found = True else: thinking = tokenizer.decode(generated_ids, skip_special_tokens=True).strip() final_resp = "(Model did not complete thinking - increase max_new_tokens)" think_tag_found = False print("=" * 60) print("GRPO Training Pipeline Test (Thinking Mode)") print("=" * 60) print(f" token found: {'✅ YES' if think_tag_found else '❌ NO'}") print(f"Output tokens: {len(generated_ids)}") print(f"\nTHINKING: {thinking[:300]}..." if len(thinking) > 300 else f"\nTHINKING: {thinking}") print(f"\nRESPONSE: {final_resp[:200]}..." if len(final_resp) > 200 else f"\nRESPONSE: {final_resp}") if think_tag_found and thinking and final_resp: print("\n✅ GRPO Training Pipeline Test PASSED") else: print("\n⚠️ Test completed but output may need review")

============================================================
GRPO Training Pipeline Test (Thinking Mode)
============================================================
</think> token found: ✅ YES
Output tokens: 993

THINKING: Okay, the user asked for a simple explanation of machine learning. Hmm, they probably want something super basic since they said "in simple terms." Maybe they're a total beginner, or just someone curious without a tech background. 

First, I should avoid jargon completely. No "algorithms" or "neural...

RESPONSE: Here's a super simple explanation of **machine learning (ML)** — no tech jargon, just plain English:

---

### Imagine you're teaching a child to recognize cats 🐱
You show them **10 pictures of cats**...

✅ GRPO Training Pipeline Test PASSED

Test Complete¶

The GRPO Training Pipeline test with thinking rewards has completed successfully. The kernel will now shut down to release all GPU memory.

What Was Verified¶

FastLanguageModel loading with 4-bit quantization (Qwen3-4B-Thinking-2507)
LoRA adapter configuration for RL training
Synthetic prompt dataset creation
Thinking-aware reward function (rewards reasoning quality + self-questioning)
GRPOTrainer training loop (2 steps)
Post-training inference with thinking output

GRPO Concepts with Thinking¶

Thinking Reward: Evaluates <think> content quality
Self-Questioning Bonus: Extra reward for question marks in reasoning
KL Penalty (beta): Prevents policy from diverging too far from reference

Ready for Production¶

If this test passed, your environment is ready for:

GRPO training with thinking-focused reward models
RLHF pipelines that optimize reasoning quality
Chain-of-thought preference optimization

In [8]:

  Copied!     
 
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
# Shutdown kernel to release all GPU memory import IPython print("Shutting down kernel to release GPU memory...") app = IPython.Application.instance() app.kernel.do_shutdown(restart=False)

Shutting down kernel to release GPU memory...

Out[8]:

{'status': 'ok', 'restart': False}