SFT Training Test: Pixtral (Vision)¶
Tests Supervised Fine-Tuning with Unsloth's optimized SFTTrainer on Pixtral-12B using vision mode.
Model: unsloth/pixtral-12b-2409-bnb-4bit (pre-quantized 4-bit)
Expected Result: Works; uses the model's native vision capabilities
Key features tested:
- FastVisionModel loading with pre-quantized 4-bit
- LoRA adapter configuration (vision + language layers)
- SFTTrainer with UnslothVisionDataCollator
- Vision dataset (LaTeX_OCR) with image inputs
- Post-training inference verification
Key Differences from Text-Only:

- Uses `FastVisionModel` instead of `FastLanguageModel`
- Uses `UnslothVisionDataCollator` for vision data
- Dataset includes actual images
- Chat format includes `{"type": "image"}` elements
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
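If you would rather keep the kernel alive, the usual alternative is to drop every reference to the model and empty the CUDA cache; a minimal sketch (not executed in this notebook, and it assumes `model`, `tokenizer`, and `trainer` exist in the session):

```python
# Alternative cleanup without a kernel shutdown (illustrative sketch).
import gc
import torch

del model, tokenizer, trainer  # assumes these names are defined in the session
gc.collect()                   # drop Python-side references
torch.cuda.empty_cache()       # return cached blocks to the driver
print(f"Allocated after cleanup: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```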
In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
import torch
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
In [2]:
# Load Pixtral-12B with FastVisionModel for vision capabilities
MODEL_NAME = "unsloth/pixtral-12b-2409-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} with FastVisionModel...")
model, tokenizer = FastVisionModel.from_pretrained(
    MODEL_NAME,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
print(f"Model loaded: {type(model).__name__}")
Loading pixtral-12b-2409-bnb-4bit with FastVisionModel...
==((====))==  Unsloth 2025.12.10: Fast Llava patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
The tied weights mapping and config for this model specifies to tie model.language_model.embed_tokens.weight to lm_head.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning.
The tokenizer you are loading from 'unsloth/pixtral-12b-2409-bnb-4bit' has an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Model loaded: LlavaForConditionalGeneration
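As an optional sanity check right after loading, you can compare the reported weight footprint against actual CUDA allocation; a minimal sketch using the standard `transformers` method `get_memory_footprint` (not part of the recorded test):

```python
# Optional: rough VRAM sanity check after loading (illustrative sketch).
footprint_gb = model.get_memory_footprint() / 1e9   # weight memory reported by transformers
allocated_gb = torch.cuda.memory_allocated() / 1e9  # what PyTorch has actually allocated
print(f"Model weights: {footprint_gb:.2f} GB, CUDA allocated: {allocated_gb:.2f} GB")
```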
In [3]:
# Apply LoRA adapters for vision training
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Unsloth: Making `model.base_model.model.model.vision_tower.transformer` require gradients
LoRA applied: 66,060,288 trainable / 7,079,149,568 total (0.93%)
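To confirm that adapters landed on both the vision tower and the language layers, you can scan the module tree for PEFT's LoRA layers; a minimal sketch (PEFT exposes the low-rank factors as a `lora_A` attribute on wrapped modules):

```python
# Optional: list modules that received LoRA adapters (illustrative sketch).
lora_modules = [name for name, module in model.named_modules() if hasattr(module, "lora_A")]
vision_hits = [name for name in lora_modules if "vision" in name]
print(f"{len(lora_modules)} LoRA modules total, {len(vision_hits)} in the vision tower")
print("Examples:", lora_modules[:3])
```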
In [4]:
# Load vision dataset (LaTeX_OCR - 5 samples for testing)
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
dataset = load_dataset("unsloth/LaTeX_OCR", split="train[:5]")
instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": sample["text"]},
            ]},
        ]
    }
converted_dataset = [convert_to_conversation(s) for s in dataset]
print(f"Dataset loaded: {len(converted_dataset)} vision samples")
Dataset loaded: 5 vision samples
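A quick structural check on one converted sample catches format mistakes before they surface as collator errors; a minimal sketch:

```python
# Optional: verify the vision chat structure of the first converted sample.
sample = converted_dataset[0]
for msg in sample["messages"]:
    types = [part["type"] for part in msg["content"]]
    print(f"{msg['role']}: {types}")
# Expected:
#   user: ['text', 'image']
#   assistant: ['text']
```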
In [ ]:
# SFT Training with Vision (minimal steps for testing)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=converted_dataset,
    args=SFTConfig(
        per_device_train_batch_size=1,
        max_steps=3,  # Minimal steps for testing
        warmup_steps=1,
        learning_rate=2e-4,
        logging_steps=1,
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        output_dir="outputs_sft_pixtral_vision_test",
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=1024,
    ),
)
print("Starting SFT Vision training (3 steps)...")
try:
    trainer_stats = trainer.train()
    final_loss = trainer_stats.metrics.get('train_loss')
    loss_str = f"{final_loss:.4f}" if final_loss is not None else "N/A"
    print(f"Training completed. Final loss: {loss_str}")
    SFT_VISION_SUPPORTED = True
except Exception as e:
    print(f"Training failed: {e}")
    SFT_VISION_SUPPORTED = False
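Since vision SFT is considerably heavier than text-only training, it can be worth recording peak GPU memory after the run; a minimal sketch using standard `torch.cuda` counters (not part of the recorded test):

```python
# Optional: report peak GPU memory after training (illustrative sketch).
# Note: the counter tracks the peak since process start unless reset earlier
# with torch.cuda.reset_peak_memory_stats().
peak_gb = torch.cuda.max_memory_reserved() / 1e9
total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"Peak reserved GPU memory: {peak_gb:.2f} GB of {total_gb:.2f} GB")
```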
In [6]:
# Post-training inference test with vision
FastVisionModel.for_inference(model)
test_image = dataset[0]["image"]
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instruction}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(test_image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, temperature=1.5, min_p=0.1)
response = tokenizer.decode(output[0], skip_special_tokens=True)
# Clean up BPE artifacts from Mistral tokenizer family (Ġ=space, Ċ=newline)
response = response.replace('Ġ', ' ').replace('Ċ', '\n').strip()
# Clear success/failure banner
print("=" * 60)
if SFT_VISION_SUPPORTED:
    print("SFT Training: SUPPORTED for Pixtral (Vision)")
    print("Model: FastVisionModel + pixtral-12b-2409-bnb-4bit")
    print("Components: UnslothVisionDataCollator, LaTeX_OCR dataset")
else:
    print("SFT Training: NOT SUPPORTED for Pixtral (Vision)")
    print("Reason: See error above")
print("=" * 60)
print(f"Sample generation:\n{response[-200:]}")
============================================================
SFT Training: SUPPORTED for Pixtral (Vision)
Model: FastVisionModel + pixtral-12b-2409-bnb-4bit
Components: UnslothVisionDataCollator, LaTeX_OCR dataset
============================================================
Sample generation:
LaTeX representation for the given mathematical expression is:
\[
N = \frac{Z}{M} \leq \frac{P}{Q} \leq \frac{R}{S} \leq \frac{T}{U}
\]
Where:
- \( N \) is the numerator of the first fraction.
- \(

Test Complete¶
The SFT Training Pipeline test for Pixtral (Vision) has completed. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- FastVisionModel loading with pre-quantized 4-bit (Pixtral-12B)
- LoRA adapter configuration (vision + language layers)
- Vision dataset loading (LaTeX_OCR)
- UnslothVisionDataCollator integration
- SFTTrainer training loop (3 steps)
- Post-training vision inference
Pixtral Vision Notes¶
- Uses FastVisionModel for native multimodal support
- Requires UnslothVisionDataCollator for vision data
- Dataset must include actual images in the `{"type": "image"}` format
- Pre-quantized to 4-bit for memory efficiency
Comparison with Text-Only¶
| Aspect | Text-Only | Vision |
|---|---|---|
| Model Class | FastLanguageModel | FastVisionModel |
| Data Collator | None | UnslothVisionDataCollator |
| Dataset | Synthetic text | LaTeX_OCR (images) |
| LoRA Layers | Language only | Vision + Language |
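This test discards the trained adapters when the kernel shuts down. If you wanted to keep them, the standard PEFT/Unsloth pattern is to save before the shutdown cell; a minimal sketch (not executed here; the directory name is illustrative):

```python
# Optional: persist the LoRA adapters and tokenizer before shutdown
# ("pixtral_lora_adapters" is an arbitrary example directory).
model.save_pretrained("pixtral_lora_adapters")
tokenizer.save_pretrained("pixtral_lora_adapters")
```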
In [7]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Shutting down kernel to release GPU memory...
Out[7]:
{'status': 'ok', 'restart': False}