QLoRA Parameter Test: LoRA Rank Comparison - Ministral-3B-Reasoning¶
Tests different LoRA rank values (r=4, 8, 16, 32, 64) to demonstrate parameter efficiency tradeoffs on Ministral-3B-Reasoning.
Key features tested:
- Side-by-side rank comparison
- Trainable parameter counting
- GPU memory usage measurement
- Training loss comparison
- Inference quality with [THINK] reasoning output
Expected outcomes:
- r=4: Minimal parameters (~8M) but may underfit
- r=16: Sweet spot (~33M, ~1.5% trainable)
- r=64: Highest capacity (~135M) but diminishing returns
Model: Ministral-3B-Reasoning with [THINK]...[/THINK] reasoning format
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
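A quick note on the LoRA update itself, since the comparison below keeps lora_alpha equal to r: the adapted weight is W + (alpha / r) * B @ A, so the update is scaled by alpha / r and adds r * (d_in + d_out) parameters per adapted matrix. The toy sketch below uses illustrative dimensions only (not the real Ministral-3B shapes) to show that with alpha = r the scale factor stays at 1.0 for every rank, so only capacity changes.

# Toy sketch: LoRA update scale (alpha / r) and parameter count per adapted matrix.
# Dimensions here are illustrative assumptions, not Ministral-3B's real shapes.
import torch

d_out, d_in = 64, 64
for r in (4, 8, 16, 32, 64):
    alpha = r                       # same choice the comparison loop makes
    A = torch.randn(r, d_in) * 0.01
    B = torch.zeros(d_out, r)       # LoRA initializes B to zero
    delta = (alpha / r) * (B @ A)   # low-rank update added to the frozen weight (zero at init)
    print(f"r={r:>2}: scale alpha/r = {alpha / r:.1f}, adapter params = {A.numel() + B.numel():,}")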
In [5]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
import gc
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
Out[5]:
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues. if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
In [6]:
# Benchmark Helper Functions
import subprocess
def measure_gpu_memory():
"""Measure current GPU memory usage in MB using nvidia-smi"""
try:
result = subprocess.run(
['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
capture_output=True, text=True
)
return int(result.stdout.strip().split('\n')[0])
except Exception:
return 0
def count_parameters(model):
"""Count trainable vs total parameters"""
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
return {
"trainable": trainable,
"total": total,
"pct": 100 * trainable / total
}
def cleanup_memory():
"""Force garbage collection and clear CUDA cache"""
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.synchronize()
print("Benchmark functions defined.")
print(f"Initial GPU memory: {measure_gpu_memory()} MB")
Out[6]:
Benchmark functions defined.
Initial GPU memory: 1896 MB
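Note that nvidia-smi reports device-wide usage, which includes other processes and the CUDA context. If a process-local view is also useful, PyTorch's allocator counters can complement measure_gpu_memory(); a minimal sketch using standard torch.cuda calls:

# Optional complement: process-local GPU memory from PyTorch's allocator.
# Unlike nvidia-smi, this excludes other processes and the CUDA context.
def measure_torch_memory():
    """Return (currently allocated MB, peak allocated MB) for this process."""
    if not torch.cuda.is_available():
        return 0.0, 0.0
    return (torch.cuda.memory_allocated() / 1024**2,
            torch.cuda.max_memory_allocated() / 1024**2)

# Call torch.cuda.reset_peak_memory_stats() before a phase you want to profile.
print("torch allocator (current MB, peak MB):", measure_torch_memory())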
In [7]:
# Create minimal synthetic instruction dataset with thinking content (5 samples)
from datasets import Dataset
# Same dataset across all ranks for fair comparison
synthetic_data = [
{
"instruction": "What is machine learning?",
"thinking": "What is the user asking here? They want to understand machine learning. What are the key concepts I should cover? It's a subset of AI... and it involves learning from data. How should I keep this accessible? Short and clear definition.",
"response": "Machine learning is a subset of artificial intelligence where computers learn patterns from data."
},
{
"instruction": "Explain Python in one sentence.",
"thinking": "One sentence only - what's most important about Python? Its readability and versatility are the defining features. How do I capture both in one sentence?",
"response": "Python is a high-level programming language known for its readability and versatility."
},
{
"instruction": "What is a neural network?",
"thinking": "How do I explain neural networks simply? What's the core concept? They're inspired by biological neurons... they process information in layers. Should I mention deep learning? Maybe keep it basic for now.",
"response": "A neural network is a computational model inspired by biological neurons that processes information through connected layers."
},
{
"instruction": "Define supervised learning.",
"thinking": "What makes supervised learning 'supervised'? It's the labels! The data has known outputs. How do I explain this clearly? Focus on the training process with labeled data.",
"response": "Supervised learning is training a model on labeled data to predict outcomes for new inputs."
},
{
"instruction": "What is gradient descent?",
"thinking": "This is a bit technical. What's the intuition behind gradient descent? It's like walking downhill to find the lowest point. The gradient tells us which direction is 'down'. Keep it conceptual.",
"response": "Gradient descent is an optimization algorithm that minimizes loss by iteratively adjusting parameters in the direction of steepest descent."
},
]
print(f"Synthetic dataset prepared: {len(synthetic_data)} samples")
Out[7]:
Synthetic dataset prepared: 5 samples
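To make the [THINK] format concrete, here is how the first sample is assembled into the assistant turn. This mirrors the format_conversation helper used inside the training loop below, before the chat template is applied; it is a preview only, not a separate training step.

# Preview: the thinking trace is wrapped in [THINK]...[/THINK], then the answer follows.
sample = synthetic_data[0]
assistant_content = f"[THINK]\n{sample['thinking']}\n[/THINK]\n\n{sample['response']}"
print("User:", sample["instruction"])
print("Assistant:")
print(assistant_content)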
In [ ]:
# LoRA Rank Comparison Loop
from trl import SFTTrainer, SFTConfig
MODEL_NAME = "unsloth/Ministral-3-3B-Reasoning-2512"
RANKS = [4, 8, 16, 32, 64]
results = []
for rank in RANKS:
print(f"\n{'='*60}")
print(f"Testing LoRA rank r={rank}")
print(f"{'='*60}")
# Measure baseline memory
cleanup_memory()
mem_before = measure_gpu_memory()
# Load fresh model
print(f"Loading model...")
model, tokenizer = FastLanguageModel.from_pretrained(
MODEL_NAME,
max_seq_length=512,
load_in_4bit=True,
dtype=None,
)
mem_after_load = measure_gpu_memory()
# Apply LoRA with current rank (alpha = rank for fair comparison)
print(f"Applying LoRA with r={rank}, alpha={rank}...")
model = FastLanguageModel.get_peft_model(
model,
r=rank,
lora_alpha=rank, # Keep alpha=r for fair comparison
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
mem_after_lora = measure_gpu_memory()
params = count_parameters(model)
print(f"Trainable: {params['trainable']:,} ({params['pct']:.2f}%)")
# Format dataset with tokenizer - using Ministral's [THINK] format
def format_conversation(sample):
assistant_content = f"[THINK]\n{sample['thinking']}\n[/THINK]\n\n{sample['response']}"
messages = [
{"role": "user", "content": [{"type": "text", "text": sample["instruction"]}]},
{"role": "assistant", "content": [{"type": "text", "text": assistant_content}]}
]
return {"text": tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)}
dataset = Dataset.from_list(synthetic_data)
dataset = dataset.map(format_conversation, remove_columns=["instruction", "thinking", "response"])
# Training config
sft_config = SFTConfig(
output_dir=f"outputs_qlora_rank_ministral/rank_{rank}",
per_device_train_batch_size=1,
gradient_accumulation_steps=1,
max_steps=3, # Minimal steps for testing
warmup_steps=1,
learning_rate=2e-4,
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
weight_decay=0.01,
max_seq_length=512,
seed=42,
report_to="none",
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
args=sft_config,
)
# Train
print(f"Training (3 steps)...")
trainer_stats = trainer.train()
final_loss = trainer_stats.metrics.get('train_loss', 0)
mem_after_train = measure_gpu_memory()
# Quick inference test
FastLanguageModel.for_inference(model)
messages = [{"role": "user", "content": [{"type": "text", "text": "What is deep learning?"}]}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Use the text tokenizer directly for text-only inference (the checkpoint may load a processor that wraps one)
text_tokenizer = tokenizer.tokenizer if hasattr(tokenizer, 'tokenizer') else tokenizer
inputs = text_tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=100,
temperature=0.6,
do_sample=True,
pad_token_id=text_tokenizer.pad_token_id,
)
response = text_tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# Store results
results.append({
"rank": rank,
"trainable_params": params["trainable"],
"trainable_pct": params["pct"],
"gpu_load_mb": mem_after_load - mem_before,
"gpu_lora_mb": mem_after_lora - mem_after_load,
"gpu_peak_mb": mem_after_train,
"final_loss": final_loss,
"sample_response": response[:150],
})
print(f"Final loss: {final_loss:.4f}")
print(f"GPU peak: {mem_after_train} MB")
# Cleanup
del model, tokenizer, trainer, dataset
cleanup_memory()
print(f"\n{'='*60}")
print("All ranks tested!")
print(f"{'='*60}")
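After the loop finishes, the stored sample_response strings give a quick qualitative look at whether each adapter still produces the [THINK] reasoning format; a small sketch over the collected results:

# Qualitative spot check: print each rank's truncated sample generation.
for row in results:
    print(f"\n--- r={row['rank']} (final loss {row['final_loss']:.4f}) ---")
    print(row["sample_response"])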
In [9]:
# Results Visualization
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(results)
print("\n" + "="*60)
print("LoRA Rank Comparison Results")
print("="*60)
print(df[["rank", "trainable_params", "trainable_pct", "gpu_peak_mb", "final_loss"]].to_string(index=False))
# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Plot 1: Trainable Parameters vs Rank
axes[0, 0].plot(df['rank'], df['trainable_params'] / 1e6, 'b-o', linewidth=2, markersize=8)
axes[0, 0].set_title('Trainable Parameters vs LoRA Rank', fontsize=12)
axes[0, 0].set_xlabel('LoRA Rank (r)')
axes[0, 0].set_ylabel('Parameters (Millions)')
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_xticks(df['rank'])
# Plot 2: Trainable Percentage vs Rank
axes[0, 1].plot(df['rank'], df['trainable_pct'], 'g-o', linewidth=2, markersize=8)
axes[0, 1].set_title('Trainable % vs LoRA Rank', fontsize=12)
axes[0, 1].set_xlabel('LoRA Rank (r)')
axes[0, 1].set_ylabel('% of Total Parameters')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_xticks(df['rank'])
# Plot 3: GPU Memory vs Rank
axes[1, 0].plot(df['rank'], df['gpu_peak_mb'], 'r-o', linewidth=2, markersize=8)
axes[1, 0].set_title('Peak GPU Memory vs LoRA Rank', fontsize=12)
axes[1, 0].set_xlabel('LoRA Rank (r)')
axes[1, 0].set_ylabel('Memory (MB)')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_xticks(df['rank'])
# Plot 4: Final Loss vs Rank
axes[1, 1].plot(df['rank'], df['final_loss'], 'm-o', linewidth=2, markersize=8)
axes[1, 1].set_title('Final Training Loss vs LoRA Rank', fontsize=12)
axes[1, 1].set_xlabel('LoRA Rank (r)')
axes[1, 1].set_ylabel('Loss')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_xticks(df['rank'])
plt.tight_layout()
plt.savefig('outputs_qlora_rank_ministral/rank_comparison.png', dpi=150)
plt.show()
print("\nVisualization saved to outputs_qlora_rank_ministral/rank_comparison.png")
Out[9]:
============================================================
LoRA Rank Comparison Results
============================================================
rank trainable_params trainable_pct gpu_peak_mb final_loss
4 8437760 0.395264 4864 3.912293
8 16875520 0.787415 5760 3.854682
16 33751040 1.562526 6443 3.776695
32 67502080 3.076973 7445 3.688894
64 135004160 5.970244 8424 3.604465
Visualization saved to outputs_qlora_rank_ministral/rank_comparison.png
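Because this notebook ends by shutting down the kernel, it can be worth persisting the comparison before that point. A minimal sketch; the file names are arbitrary choices, not referenced elsewhere:

# Persist the comparison so it survives the kernel shutdown at the end.
import json

df.to_csv("outputs_qlora_rank_ministral/rank_comparison.csv", index=False)
with open("outputs_qlora_rank_ministral/rank_comparison.json", "w") as f:
    json.dump(results, f, indent=2)
print("Saved rank_comparison.csv and rank_comparison.json")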
Analysis and Key Findings¶
Parameter Efficiency¶
- r=4: ~8M trainable parameters (~0.4%) - Minimal but may underfit on complex tasks
- r=8: ~17M trainable parameters (~0.8%) - Good for simple adaptations
- r=16: ~34M trainable parameters (~1.6%) - Sweet spot for most tasks
- r=32: ~68M trainable parameters (~3.1%) - Higher capacity
- r=64: ~135M trainable parameters (~6.0%) - Maximum capacity but diminishing returns (sanity check below)
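The sanity check referenced above: LoRA adds r * (d_in + d_out) parameters per adapted matrix, so the measured counts should, and do, double exactly whenever r doubles.

# Check against the results table: trainable / r is constant across all five runs.
measured = {4: 8_437_760, 8: 16_875_520, 16: 33_751_040, 32: 67_502_080, 64: 135_004_160}
for r, n in measured.items():
    print(f"r={r:>2}: {n:>11,} trainable -> {n // r:,} per unit of rank")
# Every row prints 2,109,440, i.e. the sum of (d_in + d_out) over all adapted matrices.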
Memory Usage¶
- Peak GPU memory grows with rank: roughly 4.9 GB at r=4 up to 8.4 GB at r=64 in these runs
- The increase tracks the adapter size, since LoRA weights, their gradients, and optimizer states all scale with r
- Even r=64 remains memory-efficient compared to full fine-tuning
Training Dynamics¶
- Higher rank = more capacity, but it may need more data to pay off
- In this 3-step run, final loss fell monotonically with rank (3.91 at r=4 down to 3.60 at r=64); with 5 samples and 3 steps, treat this as indicative only
- r=16 often reaches good loss with a reasonable parameter budget
Recommendation¶
For Ministral-3B-Reasoning with QLoRA (a configuration sketch for the recommended setting follows this list):
- General use: r=16 (balanced efficiency and capacity)
- Memory-constrained: r=8 (half the parameters of r=16)
- Maximum performance: r=32 (if enough training data)
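The configuration sketch mentioned above: the r=16 setup written out on its own, using the same arguments as the comparison loop.

# Recommended starting point (r=16), mirroring the comparison loop's call.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,            # keep alpha = r so the update scale stays at 1.0
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)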
Model Notes¶
- Model: Ministral-3B-Reasoning (3B parameters)
- Reasoning Format: [THINK]...[/THINK] tags (native Ministral format)
- Smaller model than Qwen-4B, so scaling behavior may differ
In [10]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Out[10]:
Shutting down kernel to release GPU memory...
{'status': 'ok', 'restart': False}