SFT Training Test: Pixtral (Vision)¶
Tests Supervised Fine-Tuning with Unsloth's optimized SFTTrainer on Pixtral-12B using vision mode.
Model: unsloth/pixtral-12b-2409-bnb-4bit (pre-quantized 4-bit)
Expected Result: Works; uses the model's native vision capabilities
Key features tested:
- FastVisionModel loading with pre-quantized 4-bit
- LoRA adapter configuration (vision + language layers)
- SFTTrainer with UnslothVisionDataCollator
- Vision dataset (LaTeX_OCR) with image inputs
- Post-training inference verification
Key Differences from Text-Only:

- Uses `FastVisionModel` instead of `FastLanguageModel`
- Uses `UnslothVisionDataCollator` for vision data
- Dataset includes actual images
- Chat format includes `{"type": "image"}` elements
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
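If you would rather keep the kernel alive, the usual alternative is to drop every reference to the model and empty the CUDA cache; a minimal sketch (not executed in this notebook, and it assumes `model`, `tokenizer`, and `trainer` exist in the session):

```python
# Alternative cleanup without a kernel shutdown (illustrative sketch).
import gc
import torch

del model, tokenizer, trainer  # assumes these names are defined in the session
gc.collect()                   # drop Python-side references
torch.cuda.empty_cache()       # return cached blocks to the driver
print(f"Allocated after cleanup: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```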
In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
import torch
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
In [2]:
# Load Pixtral-12B with FastVisionModel for vision capabilities
MODEL_NAME = "unsloth/pixtral-12b-2409-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} with FastVisionModel...")
model, tokenizer = FastVisionModel.from_pretrained(
    MODEL_NAME,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
print(f"Model loaded: {type(model).__name__}")
Loading pixtral-12b-2409-bnb-4bit with FastVisionModel...
==((====))==  Unsloth 2025.12.10: Fast Llava patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
The tied weights mapping and config for this model specifies to tie model.language_model.embed_tokens.weight to lm_head.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning.
The tokenizer you are loading from 'unsloth/pixtral-12b-2409-bnb-4bit' has an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Model loaded: LlavaForConditionalGeneration
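As an optional sanity check right after loading, you can compare the reported weight footprint against actual CUDA allocation; a minimal sketch using the standard `transformers` method `get_memory_footprint` (not part of the recorded test):

```python
# Optional: rough VRAM sanity check after loading (illustrative sketch).
footprint_gb = model.get_memory_footprint() / 1e9   # weight memory reported by transformers
allocated_gb = torch.cuda.memory_allocated() / 1e9  # what PyTorch has actually allocated
print(f"Model weights: {footprint_gb:.2f} GB, CUDA allocated: {allocated_gb:.2f} GB")
```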
In [3]:
# Apply LoRA adapters for vision training
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Unsloth: Making `model.base_model.model.model.vision_tower.transformer` require gradients
LoRA applied: 66,060,288 trainable / 7,079,149,568 total (0.93%)
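To confirm that adapters landed on both the vision tower and the language layers, you can scan the module tree for PEFT's LoRA layers; a minimal sketch (PEFT exposes the low-rank factors as a `lora_A` attribute on wrapped modules):

```python
# Optional: list modules that received LoRA adapters (illustrative sketch).
lora_modules = [name for name, module in model.named_modules() if hasattr(module, "lora_A")]
vision_hits = [name for name in lora_modules if "vision" in name]
print(f"{len(lora_modules)} LoRA modules total, {len(vision_hits)} in the vision tower")
print("Examples:", lora_modules[:3])
```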
In [4]:
# Load vision dataset (LaTeX_OCR - 5 samples for testing)
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
dataset = load_dataset("unsloth/LaTeX_OCR", split="train[:5]")
instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": sample["text"]},
            ]},
        ]
    }
converted_dataset = [convert_to_conversation(s) for s in dataset]
print(f"Dataset loaded: {len(converted_dataset)} vision samples")
Dataset loaded: 5 vision samples
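A quick structural check on one converted sample catches format mistakes before they surface as collator errors; a minimal sketch:

```python
# Optional: verify the vision chat structure of the first converted sample.
sample = converted_dataset[0]
for msg in sample["messages"]:
    types = [part["type"] for part in msg["content"]]
    print(f"{msg['role']}: {types}")
# Expected:
#   user: ['text', 'image']
#   assistant: ['text']
```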
In [ ]:
# SFT Training with Vision (minimal steps for testing)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=converted_dataset,
    args=SFTConfig(
        per_device_train_batch_size=1,
        max_steps=3,  # Minimal steps for testing
        warmup_steps=1,
        learning_rate=2e-4,
        logging_steps=1,
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        output_dir="outputs_sft_pixtral_vision_test",
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=1024,
    ),
)
print("Starting SFT Vision training (3 steps)...")
try:
    trainer_stats = trainer.train()
    final_loss = trainer_stats.metrics.get('train_loss')
    loss_str = f"{final_loss:.4f}" if final_loss is not None else "N/A"
    print(f"Training completed. Final loss: {loss_str}")
    SFT_VISION_SUPPORTED = True
except Exception as e:
    print(f"Training failed: {e}")
    SFT_VISION_SUPPORTED = False
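Since vision SFT is considerably heavier than text-only training, it can be worth recording peak GPU memory after the run; a minimal sketch using standard `torch.cuda` counters (not part of the recorded test):

```python
# Optional: report peak GPU memory after training (illustrative sketch).
# Note: the counter tracks the peak since process start unless reset earlier
# with torch.cuda.reset_peak_memory_stats().
peak_gb = torch.cuda.max_memory_reserved() / 1e9
total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"Peak reserved GPU memory: {peak_gb:.2f} GB of {total_gb:.2f} GB")
```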
In [6]:
# Post-training inference test with vision
FastVisionModel.for_inference(model)
test_image = dataset[0]["image"]
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instruction}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(test_image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, temperature=1.5, min_p=0.1)
response = tokenizer.decode(output[0], skip_special_tokens=True)
# Clean up BPE artifacts from Mistral tokenizer family (Ġ=space, Ċ=newline)
response = response.replace('Ġ', ' ').replace('Ċ', '\n').strip()
# Clear success/failure banner
print("=" * 60)
if SFT_VISION_SUPPORTED:
    print("SFT Training: SUPPORTED for Pixtral (Vision)")
    print("Model: FastVisionModel + pixtral-12b-2409-bnb-4bit")
    print("Components: UnslothVisionDataCollator, LaTeX_OCR dataset")
else:
    print("SFT Training: NOT SUPPORTED for Pixtral (Vision)")
    print("Reason: See error above")
print("=" * 60)
print(f"Sample generation:\n{response[-200:]}")
============================================================
SFT Training: SUPPORTED for Pixtral (Vision)
Model: FastVisionModel + pixtral-12b-2409-bnb-4bit
Components: UnslothVisionDataCollator, LaTeX_OCR dataset
============================================================
Sample generation:
LaTeX representation for the given mathematical expression is:
\[
N = \frac{Z}{M} \leq \frac{P}{Q} \leq \frac{R}{S} \leq \frac{T}{U}
\]
Where:
- \( N \) is the numerator of the first fraction.
- \(

Test Complete¶
The SFT Training Pipeline test for Pixtral (Vision) has completed. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- FastVisionModel loading with pre-quantized 4-bit (Pixtral-12B)
- LoRA adapter configuration (vision + language layers)
- Vision dataset loading (LaTeX_OCR)
- UnslothVisionDataCollator integration
- SFTTrainer training loop (3 steps)
- Post-training vision inference
Pixtral Vision Notes¶
- Uses FastVisionModel for native multimodal support
- Requires UnslothVisionDataCollator for vision data
- Dataset must include actual images in the `{"type": "image"}` format
- Pre-quantized to 4-bit for memory efficiency
Comparison with Text-Only¶
| Aspect | Text-Only | Vision |
|---|---|---|
| Model Class | FastLanguageModel | FastVisionModel |
| Data Collator | None | UnslothVisionDataCollator |
| Dataset | Synthetic text | LaTeX_OCR (images) |
| LoRA Layers | Language only | Vision + Language |
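This test discards the trained adapters when the kernel shuts down. If you wanted to keep them, the standard PEFT/Unsloth pattern is to save before the shutdown cell; a minimal sketch (not executed here; the directory name is illustrative):

```python
# Optional: persist the LoRA adapters and tokenizer before shutdown
# ("pixtral_lora_adapters" is an arbitrary example directory).
model.save_pretrained("pixtral_lora_adapters")
tokenizer.save_pretrained("pixtral_lora_adapters")
```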
In [7]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Shutting down kernel to release GPU memory...
Out[7]:
{'status': 'ok', 'restart': False}