GRPO Training Test: Ministral (Vision)¶
Tests Group Relative Policy Optimization (GRPO) reinforcement learning with Unsloth on Ministral-3B using vision mode.
Model Variant: Vision (FastVisionModel)
Expected Result: Experimental - GRPO may not have native vision support
Key features tested:
- FastVisionModel loading with 4-bit quantization
- LoRA adapter configuration (vision + language layers)
- GRPOTrainer with vision model and reward function
- Post-training inference verification
GRPO Overview: GRPO is a reinforcement learning method that samples a group of completions per prompt, scores them with a reward function, and normalizes the rewards within each group to form relative advantages for the policy update, so no separate value model is needed. This notebook tests whether it works with vision models; a minimal sketch of the group-relative advantage idea is shown below.
Note: GRPOTrainer may not have native vision support. This notebook documents whether the combination works.
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
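A minimal sketch of the group-relative advantage computation at the core of GRPO (illustrative only; the function name, epsilon, and example rewards are assumptions and are not used by the test):

import torch

def group_relative_advantages(rewards, eps=1e-4):
    # Normalize rewards within one prompt's group of sampled completions;
    # completions above the group mean get positive advantages, below negative.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four completions for one prompt, scored by a reward function
print(group_relative_advantages(torch.tensor([0.5, 1.0, -1.0, 0.8])))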
In [1]:
# Environment Setup
import os
# FIX: Set ACCELERATE_MIXED_PRECISION BEFORE importing unsloth
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastVisionModel, is_bf16_supported
import torch
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset, load_dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"ACCELERATE_MIXED_PRECISION: {os.environ.get('ACCELERATE_MIXED_PRECISION', 'not set')}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
ACCELERATE_MIXED_PRECISION: bf16
HF_TOKEN loaded: Yes
In [2]:
# Load Ministral-3B with FastVisionModel for vision capabilities
MODEL_NAME = "unsloth/Ministral-3-3B-Reasoning-2512"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} with FastVisionModel...")
model, tokenizer = FastVisionModel.from_pretrained(
MODEL_NAME,
load_in_4bit=True,
use_gradient_checkpointing="unsloth",
)
# Ensure pad token is set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Model loaded: {type(model).__name__}")
Loading Ministral-3-3B-Reasoning-2512 with FastVisionModel...
==((====))==  Unsloth 2025.12.10: Fast Ministral3 patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model loaded: Mistral3ForConditionalGeneration
In [3]:
# Apply LoRA adapters for vision training
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers=True,
finetune_language_layers=True,
finetune_attention_modules=True,
finetune_mlp_modules=True,
r=16,
lora_alpha=16,
lora_dropout=0,
bias="none",
random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Unsloth: Making `model.base_model.model.model.vision_tower.transformer` require gradients
LoRA applied: 33,751,040 trainable / 2,160,030,720 total (1.56%)
In [4]:
# Create prompt dataset for GRPO with vision
# Note: GRPO typically uses text prompts - testing if vision context works
# Using text-only prompts as GRPO generates completions during training
prompts = [
"Describe what you see in the image.",
"What mathematical expression is shown?",
"Explain the visual content briefly.",
"What is the main element in this image?",
"Describe the pattern or structure shown.",
]
# Format prompts for GRPO (text-only prompts, vision model)
dataset = Dataset.from_dict({
"prompt": [
tokenizer.apply_chat_template(
[{"role": "user", "content": [{"type": "text", "text": p}]}],
tokenize=False,
add_generation_prompt=True
) for p in prompts
]
})
print(f"Dataset created: {len(dataset)} prompts")
print(f"Note: Using text prompts with vision model (GRPO generates completions)")
Dataset created: 5 prompts
Note: Using text prompts with vision model (GRPO generates completions)
In [5]:
# Define a simple reward function for testing
def simple_reward_fn(completions, prompts=None, **kwargs):
"""Simple reward function for testing GRPO with vision model."""
rewards = []
for completion in completions:
length = len(completion.split())
if length < 5:
reward = -1.0
elif length < 20:
reward = 0.5
elif length < 50:
reward = 1.0
else:
reward = 0.8
rewards.append(reward)
return rewards
print("Reward function defined: simple_reward_fn")
Reward function defined: simple_reward_fn
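As a quick sanity check of the thresholds above, the reward function can be called directly on made-up completions, mirroring how GRPOTrainer passes generated completions during training (the sample strings below are assumptions for illustration):

sample_completions = [
    "Short.",                                          # 1 word  -> -1.0
    "The image shows a short mathematical fraction.",  # 7 words -> 0.5
    " ".join(["word"] * 30),                           # 30 words -> 1.0
]
print(simple_reward_fn(sample_completions))  # [-1.0, 0.5, 1.0]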
In [ ]:
# GRPO Training with Vision Model (experimental)
grpo_config = GRPOConfig(
output_dir="outputs_grpo_ministral_vision_test",
per_device_train_batch_size=2,
gradient_accumulation_steps=1,
max_steps=2,
warmup_steps=0,
learning_rate=1e-5,
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
max_completion_length=64,
num_generations=2,
beta=0.1,
seed=42,
)
print("Attempting GRPO training with FastVisionModel...")
print("Note: This is experimental - GRPO may not natively support vision models")
try:
trainer = GRPOTrainer(
model=model,
args=grpo_config,
train_dataset=dataset,
processing_class=tokenizer,
reward_funcs=simple_reward_fn,
)
trainer_stats = trainer.train()
print(f"GRPO Vision training completed!")
GRPO_VISION_SUPPORTED = True
except Exception as e:
print(f"GRPO Vision training failed: {e}")
GRPO_VISION_SUPPORTED = False
In [ ]:
# Post-training inference test with vision
FastVisionModel.for_inference(model)
# Load a test image for vision inference
vision_dataset = load_dataset("unsloth/LaTeX_OCR", split="train[:1]")
test_image = vision_dataset[0]["image"]
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe what you see."}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(test_image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=64, temperature=1.5, min_p=0.1)
response = tokenizer.decode(output[0], skip_special_tokens=True)
# Clean up BPE artifacts from Ministral tokenizer (Ġ=space, Ċ=newline)
response = response.replace('Ġ', ' ').replace('Ċ', '\n').strip()
# Clear success/failure banner
print("=" * 60)
if GRPO_VISION_SUPPORTED:
print("GRPO Training: SUPPORTED for Ministral (Vision)")
print("Model: FastVisionModel + Ministral-3-3B-Reasoning-2512")
else:
print("GRPO Training: NOT SUPPORTED for Ministral (Vision)")
print("Reason: GRPOTrainer may not have native vision support")
print("=" * 60)
print(f"Sample generation:\n{response[-200:]}")
Test Complete¶
The GRPO training test for Ministral (Vision) has completed. The kernel will now shut down to release all GPU memory.
What Was Tested¶
- FastVisionModel loading with 4-bit quantization (Ministral-3B)
- LoRA adapter configuration (vision + language layers)
- GRPOTrainer with vision model (experimental)
- Post-training vision inference
Vision GRPO Notes¶
- GRPOTrainer may not have native vision data support
- Text prompts were used with vision model architecture
- Vision inference after training still works
Comparison with Text-Only¶
| Aspect | Text-Only | Vision |
|---|---|---|
| Model Class | FastLanguageModel | FastVisionModel |
| GRPO Support | Native | Experimental |
| Training Data | Text prompts | Text prompts (vision inference) |
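For reference, a text-only counterpart of this test would swap the model class while reusing the same GRPOConfig, dataset, and reward function. The sketch below is an assumption of how that would look, not something run in this notebook (in practice a text-only checkpoint would replace MODEL_NAME):

from unsloth import FastLanguageModel

# Hypothetical text-only setup for comparison (not executed here)
text_model, text_tokenizer = FastLanguageModel.from_pretrained(
    MODEL_NAME,  # a text-only checkpoint would normally be used instead
    load_in_4bit=True,
)
text_model = FastLanguageModel.get_peft_model(text_model, r=16, lora_alpha=16)
text_trainer = GRPOTrainer(
    model=text_model,
    args=grpo_config,
    train_dataset=dataset,
    processing_class=text_tokenizer,
    reward_funcs=simple_reward_fn,
)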
In [8]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Shutting down kernel to release GPU memory...
Out[8]:
{'status': 'ok', 'restart': False}