RLOO Training Test: Pixtral (Vision)¶
Tests REINFORCE Leave-One-Out (RLOO) optimization with Unsloth on Pixtral-12B in vision mode.
Model: unsloth/pixtral-12b-2409-bnb-4bit (pre-quantized 4-bit)
Expected Result: Experimental - RLOO may not have native vision support
Key features tested:
- FastVisionModel loading with pre-quantized 4-bit
- LoRA adapter configuration (vision + language layers)
- RLOOTrainer with vision model and reward function
- Post-training inference verification
RLOO Overview: RLOO reduces policy-gradient variance with a leave-one-out baseline: for each prompt it samples k completions, and each completion's reward is compared against the mean reward of the other k-1 completions. This notebook tests whether that setup works with vision models (see the sketch below).
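A minimal sketch of that baseline (illustrative only; rloo_advantages is a hypothetical helper, not TRL code):
# For k completions of one prompt, each completion's baseline is the mean
# reward of the other k-1 completions; the advantage is reward minus baseline.
def rloo_advantages(rewards):
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

print(rloo_advantages([1.5, 0.5]))  # with k=2 this reduces to reward differences: [1.0, -1.0]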
Note: RLOOTrainer may not have native vision data support. This notebook documents whether the combination works.
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
# Environment Setup
import os
# FIX: Set ACCELERATE_MIXED_PRECISION BEFORE importing unsloth
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastVisionModel, is_bf16_supported
import torch
from trl import RLOOConfig, RLOOTrainer
from datasets import Dataset, load_dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"ACCELERATE_MIXED_PRECISION: {os.environ.get('ACCELERATE_MIXED_PRECISION', 'not set')}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
ACCELERATE_MIXED_PRECISION: bf16
HF_TOKEN loaded: Yes
# Load Pixtral-12B with FastVisionModel for vision capabilities
MODEL_NAME = "unsloth/pixtral-12b-2409-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} with FastVisionModel...")
model, tokenizer = FastVisionModel.from_pretrained(
    MODEL_NAME,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Model loaded: {type(model).__name__}")
Loading pixtral-12b-2409-bnb-4bit with FastVisionModel...
Unsloth 2025.12.10: Fast Llava patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1.
Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False].
Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
The tied weights mapping and config for this model specifies to tie model.language_model.embed_tokens.weight to lm_head.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning
The tokenizer you are loading from 'unsloth/pixtral-12b-2409-bnb-4bit' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Model loaded: LlavaForConditionalGeneration
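The regex warning above points at a known Mistral-tokenizer issue; a minimal sketch of the suggested fix, assuming the flag is accepted by the tokenizer loader as the warning text indicates (not applied in this run):
# Assumption: `fix_mistral_regex=True` is accepted when reloading the tokenizer,
# per the transformers warning above. Not applied in this test run.
from transformers import AutoTokenizer
fixed_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, fix_mistral_regex=True)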
# Apply LoRA adapters for vision training
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Unsloth: Making `model.base_model.model.model.vision_tower.transformer` require gradients
LoRA applied: 66,060,288 trainable / 7,079,149,568 total (0.93%)
# Create prompt dataset for RLOO with vision
# Note: RLOO typically uses text-only prompts - testing whether they work with a vision model
prompts = [
    "Describe what you see in the image.",
    "What mathematical expression is shown?",
    "Explain the visual content briefly.",
    "What is the main element in this image?",
    "Describe the pattern or structure shown.",
]
# Format prompts for RLOO (text-only prompts, vision model)
dataset = Dataset.from_dict({
    "prompt": [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": [{"type": "text", "text": p}]}],
            tokenize=False,
            add_generation_prompt=True,
        ) for p in prompts
    ]
})
print(f"Dataset created: {len(dataset)} prompts")
print("Note: Using text prompts with vision model (RLOO generates completions)")
Dataset created: 5 prompts
Note: Using text prompts with vision model (RLOO generates completions)
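To see what the template produced, you can inspect the first formatted prompt (exact markup depends on Pixtral's chat template):
# Inspect the first formatted prompt string
print(dataset[0]["prompt"])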
# Define a simple reward function for testing
def simple_reward_fn(completions, prompts=None, **kwargs):
    """Simple reward function for testing RLOO with vision model."""
    rewards = []
    for completion in completions:
        length = len(completion.split())
        score = 0.0
        if 10 <= length <= 50:
            score += 1.0
        elif length < 10:
            score -= 0.5
        if completion.strip().endswith("."):
            score += 0.5
        rewards.append(score)
    return rewards
print("Reward function defined: simple_reward_fn")
Reward function defined: simple_reward_fn
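A quick illustrative call (not part of the test run) confirms the shaping: a mid-length sentence ending in a period earns 1.5, while a short fragment earns -0.5.
# Sanity-check the reward shaping on two hand-written completions (illustrative).
sample_completions = [
    "The image shows a simple mathematical expression with two variables involved.",  # 11 words, ends with "."
    "Short answer here",  # 3 words, no period
]
print(simple_reward_fn(sample_completions))  # [1.5, -0.5]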
# RLOO Training with Vision Model (experimental)
# Reduced batch size and num_generations for Pixtral-12B memory requirements
rloo_config = RLOOConfig(
    output_dir="outputs_rloo_pixtral_vision_test",
    per_device_train_batch_size=2,  # Reduced from 4 for memory
    gradient_accumulation_steps=1,
    max_steps=2,
    warmup_steps=0,
    learning_rate=1e-5,
    logging_steps=1,
    fp16=not is_bf16_supported(),
    bf16=is_bf16_supported(),
    optim="adamw_8bit",
    num_generations=2,  # Reduced from 4 for memory
    max_completion_length=64,
    beta=0.05,
    seed=42,
)
print("Attempting RLOO training with FastVisionModel...")
print("Note: This is experimental - RLOO may not natively support vision models")
try:
    trainer = RLOOTrainer(
        model=model,
        args=rloo_config,
        train_dataset=dataset,
        processing_class=tokenizer,
        reward_funcs=simple_reward_fn,
    )
    trainer_stats = trainer.train()
    print("RLOO Vision training completed!")
    RLOO_VISION_SUPPORTED = True
except Exception as e:
    print(f"RLOO Vision training failed: {e}")
    RLOO_VISION_SUPPORTED = False
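One constraint worth noting: group-based TRL trainers keep the num_generations completions for each prompt together when computing the baseline. A hedged sanity check, assuming RLOOTrainer shares GRPOTrainer's divisibility requirement on the generation batch (single-GPU run):
# Assumption: like GRPOTrainer, RLOOTrainer requires the generation batch to split
# evenly into groups of num_generations completions per prompt.
assert rloo_config.per_device_train_batch_size % rloo_config.num_generations == 0, \
    "batch size must be divisible by num_generations"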
# Post-training inference test with vision
FastVisionModel.for_inference(model)
# Load a test image for vision inference
vision_dataset = load_dataset("unsloth/LaTeX_OCR", split="train[:1]")
test_image = vision_dataset[0]["image"]
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe what you see."}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(test_image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
with torch.no_grad():
    # do_sample=True is required for temperature/min_p to take effect
    output = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=1.5, min_p=0.1)
response = tokenizer.decode(output[0], skip_special_tokens=True)
# Clean up BPE artifacts from Mistral tokenizer family (Ġ=space, Ċ=newline)
response = response.replace('Ġ', ' ').replace('Ċ', '\n').strip()
# Clear success/failure banner
print("=" * 60)
if RLOO_VISION_SUPPORTED:
    print("RLOO Training: SUPPORTED for Pixtral (Vision)")
    print("Model: FastVisionModel + pixtral-12b-2409-bnb-4bit")
else:
    print("RLOO Training: NOT SUPPORTED for Pixtral (Vision)")
    print("Reason: RLOOTrainer may not have native vision support")
print("=" * 60)
print(f"Sample generation:\n{response[-200:]}")
============================================================
RLOO Training: SUPPORTED for Pixtral (Vision)
Model: FastVisionModel + pixtral-12b-2409-bnb-4bit
============================================================
Sample generation:
thematical equation or logical statement, please provide additional context or clarify the variables and operators involved. This will help in providing a precise and accurate explanation or solution.
Test Complete¶
The RLOO Training Pipeline test for Pixtral (Vision) has completed. The kernel will now shut down to release all GPU memory.
What Was Tested¶
- FastVisionModel loading with pre-quantized 4-bit (Pixtral-12B)
- LoRA adapter configuration (vision + language layers)
- RLOOTrainer with vision model (experimental)
- Post-training vision inference
Vision RLOO Notes¶
- RLOOTrainer may not have native vision data support
- Text prompts were used with vision model architecture
- Vision inference after training still works
Comparison with Text-Only¶
| Aspect | Text-Only | Vision |
|---|---|---|
| Model Class | FastLanguageModel | FastVisionModel |
| RLOO Support | Native | Experimental |
| Training Data | Text prompts | Text prompts (vision inference) |
Complete Pixtral Training Method Summary¶
| Method | Text-Only | Vision |
|---|---|---|
| SFT | Supported | Supported (UnslothVisionDataCollator) |
| GRPO | Testing | Experimental |
| DPO | Testing | Experimental |
| Reward | Testing | NOT SUPPORTED (no SequenceClassification) |
| RLOO | Testing | Experimental |
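If the trained adapter should survive the shutdown below, save it first (a minimal sketch; the output path is an assumption):
# Persist the LoRA adapter and tokenizer before the kernel shuts down.
# The directory name is illustrative.
model.save_pretrained("outputs_rloo_pixtral_vision_test/lora_adapter")
tokenizer.save_pretrained("outputs_rloo_pixtral_vision_test/lora_adapter")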
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Shutting down kernel to release GPU memory...
{'status': 'ok', 'restart': False}