RLOO Training Test: Pixtral (Vision)¶
Tests REINFORCE Leave-One-Out (RLOO) optimization with Unsloth on Pixtral-12B in vision mode.
Model: unsloth/pixtral-12b-2409-bnb-4bit (pre-quantized 4-bit)
Expected Result: Experimental - RLOO may not have native vision support
Key features tested:
- FastVisionModel loading with pre-quantized 4-bit
- LoRA adapter configuration (vision + language layers)
- RLOOTrainer with vision model and reward function
- Post-training inference verification
RLOO Overview: RLOO reduces policy-gradient variance with a leave-one-out baseline: for each prompt it samples k completions, and each completion's reward is compared against the mean reward of the other k-1 completions. This notebook tests whether that setup works with vision models (see the sketch below).
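A minimal sketch of that baseline (illustrative only; rloo_advantages is a hypothetical helper, not TRL code):
# For k completions of one prompt, each completion's baseline is the mean
# reward of the other k-1 completions; the advantage is reward minus baseline.
def rloo_advantages(rewards):
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

print(rloo_advantages([1.5, 0.5]))  # with k=2 this reduces to reward differences: [1.0, -1.0]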
Note: RLOOTrainer may not have native vision data support. This notebook documents whether the combination works.
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
# Environment Setup
import os
# FIX: Set ACCELERATE_MIXED_PRECISION BEFORE importing unsloth
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastVisionModel, is_bf16_supported
import torch
from trl import RLOOConfig, RLOOTrainer
from datasets import Dataset, load_dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"ACCELERATE_MIXED_PRECISION: {os.environ.get('ACCELERATE_MIXED_PRECISION', 'not set')}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
ACCELERATE_MIXED_PRECISION: bf16
HF_TOKEN loaded: Yes
# Load Pixtral-12B with FastVisionModel for vision capabilities
MODEL_NAME = "unsloth/pixtral-12b-2409-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} with FastVisionModel...")
model, tokenizer = FastVisionModel.from_pretrained(
    MODEL_NAME,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Model loaded: {type(model).__name__}")
Loading pixtral-12b-2409-bnb-4bit with FastVisionModel...
Unsloth 2025.12.10: Fast Llava patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1.
Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False].
Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
The tied weights mapping and config for this model specifies to tie model.language_model.embed_tokens.weight to lm_head.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning
The tokenizer you are loading from 'unsloth/pixtral-12b-2409-bnb-4bit' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Model loaded: LlavaForConditionalGeneration
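The regex warning above points at a known Mistral-tokenizer issue; a minimal sketch of the suggested fix, assuming the flag is accepted by the tokenizer loader as the warning text indicates (not applied in this run):
# Assumption: `fix_mistral_regex=True` is accepted when reloading the tokenizer,
# per the transformers warning above. Not applied in this test run.
from transformers import AutoTokenizer
fixed_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, fix_mistral_regex=True)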
# Apply LoRA adapters for vision training
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Unsloth: Making `model.base_model.model.model.vision_tower.transformer` require gradients
LoRA applied: 66,060,288 trainable / 7,079,149,568 total (0.93%)
# Create prompt dataset for RLOO with vision
# Note: RLOO typically uses text-only prompts - testing whether they work with a vision model
prompts = [
    "Describe what you see in the image.",
    "What mathematical expression is shown?",
    "Explain the visual content briefly.",
    "What is the main element in this image?",
    "Describe the pattern or structure shown.",
]
# Format prompts for RLOO (text-only prompts, vision model)
dataset = Dataset.from_dict({
    "prompt": [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": [{"type": "text", "text": p}]}],
            tokenize=False,
            add_generation_prompt=True,
        ) for p in prompts
    ]
})
print(f"Dataset created: {len(dataset)} prompts")
print("Note: Using text prompts with vision model (RLOO generates completions)")
Dataset created: 5 prompts
Note: Using text prompts with vision model (RLOO generates completions)
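To see what the template produced, you can inspect the first formatted prompt (exact markup depends on Pixtral's chat template):
# Inspect the first formatted prompt string
print(dataset[0]["prompt"])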
# Define a simple reward function for testing
def simple_reward_fn(completions, prompts=None, **kwargs):
    """Simple reward function for testing RLOO with vision model."""
    rewards = []
    for completion in completions:
        length = len(completion.split())
        score = 0.0
        if 10 <= length <= 50:
            score += 1.0
        elif length < 10:
            score -= 0.5
        if completion.strip().endswith("."):
            score += 0.5
        rewards.append(score)
    return rewards
print("Reward function defined: simple_reward_fn")
Reward function defined: simple_reward_fn
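A quick illustrative call (not part of the test run) confirms the shaping: a mid-length sentence ending in a period earns 1.5, while a short fragment earns -0.5.
# Sanity-check the reward shaping on two hand-written completions (illustrative).
sample_completions = [
    "The image shows a simple mathematical expression with two variables involved.",  # 11 words, ends with "."
    "Short answer here",  # 3 words, no period
]
print(simple_reward_fn(sample_completions))  # [1.5, -0.5]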
# RLOO Training with Vision Model (experimental)
# Reduced batch size and num_generations for Pixtral-12B memory requirements
rloo_config = RLOOConfig(
    output_dir="outputs_rloo_pixtral_vision_test",
    per_device_train_batch_size=2,  # Reduced from 4 for memory
    gradient_accumulation_steps=1,
    max_steps=2,
    warmup_steps=0,
    learning_rate=1e-5,
    logging_steps=1,
    fp16=not is_bf16_supported(),
    bf16=is_bf16_supported(),
    optim="adamw_8bit",
    num_generations=2,  # Reduced from 4 for memory
    max_completion_length=64,
    beta=0.05,
    seed=42,
)
print("Attempting RLOO training with FastVisionModel...")
print("Note: This is experimental - RLOO may not natively support vision models")
try:
    trainer = RLOOTrainer(
        model=model,
        args=rloo_config,
        train_dataset=dataset,
        processing_class=tokenizer,
        reward_funcs=simple_reward_fn,
    )
    trainer_stats = trainer.train()
    print("RLOO Vision training completed!")
    RLOO_VISION_SUPPORTED = True
except Exception as e:
    print(f"RLOO Vision training failed: {e}")
    RLOO_VISION_SUPPORTED = False
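One constraint worth noting: group-based TRL trainers keep the num_generations completions for each prompt together when computing the baseline. A hedged sanity check, assuming RLOOTrainer shares GRPOTrainer's divisibility requirement on the generation batch (single-GPU run):
# Assumption: like GRPOTrainer, RLOOTrainer requires the generation batch to split
# evenly into groups of num_generations completions per prompt.
assert rloo_config.per_device_train_batch_size % rloo_config.num_generations == 0, \
    "batch size must be divisible by num_generations"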
# Post-training inference test with vision
FastVisionModel.for_inference(model)
# Load a test image for vision inference
vision_dataset = load_dataset("unsloth/LaTeX_OCR", split="train[:1]")
test_image = vision_dataset[0]["image"]
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe what you see."}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(test_image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
with torch.no_grad():
    # do_sample=True is required for temperature/min_p to take effect
    output = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=1.5, min_p=0.1)
response = tokenizer.decode(output[0], skip_special_tokens=True)
# Clean up BPE artifacts from Mistral tokenizer family (Ġ=space, Ċ=newline)
response = response.replace('Ġ', ' ').replace('Ċ', '\n').strip()
# Clear success/failure banner
print("=" * 60)
if RLOO_VISION_SUPPORTED:
    print("RLOO Training: SUPPORTED for Pixtral (Vision)")
    print("Model: FastVisionModel + pixtral-12b-2409-bnb-4bit")
else:
    print("RLOO Training: NOT SUPPORTED for Pixtral (Vision)")
    print("Reason: RLOOTrainer may not have native vision support")
print("=" * 60)
print(f"Sample generation:\n{response[-200:]}")
============================================================
RLOO Training: SUPPORTED for Pixtral (Vision)
Model: FastVisionModel + pixtral-12b-2409-bnb-4bit
============================================================
Sample generation:
thematical equation or logical statement, please provide additional context or clarify the variables and operators involved. This will help in providing a precise and accurate explanation or solution.
Test Complete¶
The RLOO Training Pipeline test for Pixtral (Vision) has completed. The kernel will now shut down to release all GPU memory.
What Was Tested¶
- FastVisionModel loading with pre-quantized 4-bit (Pixtral-12B)
- LoRA adapter configuration (vision + language layers)
- RLOOTrainer with vision model (experimental)
- Post-training vision inference
Vision RLOO Notes¶
- RLOOTrainer may not have native vision data support
- Text prompts were used with vision model architecture
- Vision inference after training still works
Comparison with Text-Only¶
| Aspect | Text-Only | Vision |
|---|---|---|
| Model Class | FastLanguageModel | FastVisionModel |
| RLOO Support | Native | Experimental |
| Training Data | Text prompts | Text prompts (vision inference) |
Complete Pixtral Training Method Summary¶
| Method | Text-Only | Vision |
|---|---|---|
| SFT | Supported | Supported (UnslothVisionDataCollator) |
| GRPO | Testing | Experimental |
| DPO | Testing | Experimental |
| Reward | Testing | NOT SUPPORTED (no SequenceClassification) |
| RLOO | Testing | Experimental |
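If the trained adapter should survive the shutdown below, save it first (a minimal sketch; the output path is an assumption):
# Persist the LoRA adapter and tokenizer before the kernel shuts down.
# The directory name is illustrative.
model.save_pretrained("outputs_rloo_pixtral_vision_test/lora_adapter")
tokenizer.save_pretrained("outputs_rloo_pixtral_vision_test/lora_adapter")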
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Shutting down kernel to release GPU memory...
{'status': 'ok', 'restart': False}