QLoRA Advanced: Multi-Adapter Training - Ministral-3B-Reasoning
Demonstrates training multiple task-specific LoRA adapters and switching between them at inference.
Tasks/Adapters:
- Technical Q&A - Precise technical explanations
- Creative Writing - Imaginative, expressive responses
- Code Explanation - Code-focused analysis
Key features demonstrated:
- Training independent adapters on different datasets
- Saving lightweight adapter weights (~33M params each)
- Switching adapters at inference time
- Same prompt → different responses per adapter
- [THINK] reasoning capability preserved across all adapters
Benefits of multi-adapter approach:
- Modular specialization without retraining base model
- Hot-swap capabilities at runtime
- Storage efficient (adapters are ~130MB each vs ~6GB base model)
Model: Ministral-3B-Reasoning with [THINK]...[/THINK] reasoning format
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
import gc
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
Out[1]:
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Out[1]:
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
Out[1]:
🦥 Unsloth Zoo will now patch everything to make training faster!
Out[1]:
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
In [2]:
# Helper Functions
def cleanup_memory():
    """Force garbage collection and clear CUDA cache"""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

def count_parameters(model):
    """Count trainable vs total parameters"""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return {"trainable": trainable, "total": total, "pct": 100 * trainable / total}

print("Helper functions defined.")
Out[2]:
Helper functions defined.
In [3]:
# Task-Specific Datasets (5 samples each)
from datasets import Dataset

# Task 1: Technical Q&A - precise, factual responses
technical_data = [
    {
        "instruction": "What is a hash table?",
        "thinking": "The user wants a technical explanation. What are the key aspects? It's a data structure... uses hashing for O(1) lookups... handles collisions. Keep it precise but accessible.",
        "response": "A hash table is a data structure that maps keys to values using a hash function. It provides O(1) average-case time complexity for insertions, deletions, and lookups by computing an index from the key."
    },
    {
        "instruction": "Explain TCP vs UDP.",
        "thinking": "Compare two network protocols. TCP is reliable, connection-oriented. UDP is fast, connectionless. What are the key differences to highlight?",
        "response": "TCP provides reliable, ordered delivery with connection establishment and error checking. UDP is connectionless and faster but unreliable. TCP is used for web traffic; UDP for streaming and gaming."
    },
    {
        "instruction": "What is Big O notation?",
        "thinking": "Fundamental CS concept. It describes algorithmic complexity. How do I explain it simply? Focus on growth rate and worst-case analysis.",
        "response": "Big O notation describes the upper bound of an algorithm's time or space complexity as input size grows. O(n) means linear growth, O(n²) means quadratic, and O(1) means constant time."
    },
    {
        "instruction": "What is recursion?",
        "thinking": "Classic programming concept. A function calling itself with a base case. Should I mention stack frames? Keep it conceptual first.",
        "response": "Recursion is when a function calls itself to solve smaller instances of the same problem. It requires a base case to terminate and recursive cases that reduce toward the base case."
    },
    {
        "instruction": "Explain database indexing.",
        "thinking": "Database performance topic. Indexes speed up queries... B-trees are common. Trade-off with insert performance. Focus on the core concept.",
        "response": "Database indexing creates data structures that speed up query operations by allowing the database to find rows without scanning entire tables. Indexes trade storage space and insert speed for faster reads."
    },
]

# Task 2: Creative Writing - imaginative, expressive responses
creative_data = [
    {
        "instruction": "Write a haiku about coding.",
        "thinking": "Haiku: 5-7-5 syllables. Theme is coding. What imagery captures the experience? Bugs, late nights, the flow state... Let me craft something evocative.",
        "response": "Lines of logic flow\nBugs dance in the midnight glow\nFixed—the dawn says so"
    },
    {
        "instruction": "Describe a sunset poetically.",
        "thinking": "Poetry mode. What makes sunsets beautiful? Colors, transition, emotion. Use metaphor and sensory language.",
        "response": "The sun dips low, painting the sky in watercolors of amber and rose. Day surrenders to twilight's embrace as the horizon glows like embers of a dying fire, beautiful in its farewell."
    },
    {
        "instruction": "Create a metaphor for learning.",
        "thinking": "What is learning like? A journey? Growing? Building? Something that captures the gradual, sometimes difficult process...",
        "response": "Learning is climbing a mountain in fog—each step reveals new terrain, the path unclear until you've walked it. The summit exists, but the journey shapes you more than arrival."
    },
    {
        "instruction": "Write a short opening line for a mystery novel.",
        "thinking": "Mystery opening needs intrigue. Something that raises questions. A hook that makes readers want more. Tension from the first sentence.",
        "response": "The letter arrived three years after her death, written in handwriting that was unmistakably hers."
    },
    {
        "instruction": "Describe rain in an unusual way.",
        "thinking": "Avoid clichés. What's a fresh perspective on rain? Sound, feeling, unusual comparisons... Make it memorable.",
        "response": "The sky plays percussion on the rooftops—a million tiny drummers performing chaos into rhythm, washing the world in liquid applause."
    },
]

# Task 3: Code Explanation - code-focused analysis
code_data = [
    {
        "instruction": "Explain: def fib(n): return n if n<2 else fib(n-1)+fib(n-2)",
        "thinking": "Classic recursive Fibonacci. Base case n<2, recursive case adds previous two. What should I highlight? The elegance but also the O(2^n) complexity.",
        "response": "This is a recursive Fibonacci implementation. Base case: returns n for 0 or 1. Recursive case: sums the two preceding values. Elegant but O(2^n) time—use memoization for efficiency."
    },
    {
        "instruction": "What does [x*2 for x in range(5)] do?",
        "thinking": "Python list comprehension. Creates a list of doubled values. range(5) gives 0-4. Result is [0,2,4,6,8]. Explain the syntax pattern.",
        "response": "This list comprehension creates [0, 2, 4, 6, 8]. It iterates x through range(5) (0-4) and multiplies each by 2. Syntax: [expression for item in iterable]."
    },
    {
        "instruction": "Explain Python decorators.",
        "thinking": "Decorators wrap functions. They're higher-order functions. @syntax is syntactic sugar. Use cases: logging, timing, auth. Start with the concept.",
        "response": "Decorators are functions that modify other functions. The @decorator syntax wraps a function, adding behavior before/after execution. Common uses: logging, authentication, caching, timing."
    },
    {
        "instruction": "What is async/await in Python?",
        "thinking": "Asynchronous programming. Coroutines, event loop, non-blocking I/O. How do I explain this concisely? Focus on the why and basic usage.",
        "response": "async/await enables non-blocking I/O. 'async def' creates coroutines; 'await' pauses execution until a task completes. The event loop manages concurrent operations without threads."
    },
    {
        "instruction": "Explain: lambda x,y: x if x>y else y",
        "thinking": "Lambda function returning the max of two values. Anonymous function, single expression. Equivalent to max() for two values.",
        "response": "This lambda returns the larger of x and y using a ternary expression. It's equivalent to max(x,y). Lambda creates anonymous, single-expression functions."
    },
]

TASK_DATASETS = {
    "technical": technical_data,
    "creative": creative_data,
    "code": code_data,
}

print("Task datasets prepared:")
for name, data in TASK_DATASETS.items():
    print(f" - {name}: {len(data)} samples")
Out[3]:
Task datasets prepared:
 - technical: 5 samples
 - creative: 5 samples
 - code: 5 samples
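Before training, it is worth previewing exactly what the trainer will learn. A minimal sketch (it simply mirrors the [THINK] layout that `format_conversation` builds in the training cell below):

```python
# Preview the assistant turn for the first technical sample, using the same
# [THINK]...[/THINK] layout that format_conversation (next cell) constructs.
sample = technical_data[0]
assistant_content = f"[THINK]\n{sample['thinking']}\n[/THINK]\n\n{sample['response']}"
print(assistant_content)
```

The chat template then wraps this content in the model's own special tokens; only the assistant turn carries the reasoning trace.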
In [ ]:
# Train Multiple Adapters
from trl import SFTTrainer, SFTConfig

MODEL_NAME = "unsloth/Ministral-3-3B-Reasoning-2512"
OUTPUT_BASE = "outputs_qlora_multi_adapter_ministral"

adapter_paths = {}

for task_name, task_data in TASK_DATASETS.items():
    print(f"\n{'='*60}")
    print(f"Training adapter: {task_name}")
    print(f"{'='*60}")

    # Cleanup
    cleanup_memory()

    # Load fresh model
    print("Loading model...")
    model, tokenizer = FastLanguageModel.from_pretrained(
        MODEL_NAME,
        max_seq_length=512,
        load_in_4bit=True,
        dtype=None,
    )

    # Apply LoRA
    print("Applying LoRA...")
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        lora_dropout=0,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=42,
    )
    params = count_parameters(model)
    print(f"Trainable: {params['trainable']:,} ({params['pct']:.2f}%)")

    # Format dataset - using Ministral's [THINK] format
    def format_conversation(sample):
        assistant_content = f"[THINK]\n{sample['thinking']}\n[/THINK]\n\n{sample['response']}"
        messages = [
            {"role": "user", "content": [{"type": "text", "text": sample["instruction"]}]},
            {"role": "assistant", "content": [{"type": "text", "text": assistant_content}]},
        ]
        return {"text": tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)}

    dataset = Dataset.from_list(task_data)
    dataset = dataset.map(format_conversation, remove_columns=["instruction", "thinking", "response"])

    # Training config
    sft_config = SFTConfig(
        output_dir=f"{OUTPUT_BASE}/{task_name}_adapter",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        max_steps=5,  # Few steps for each task
        warmup_steps=1,
        learning_rate=2e-4,
        logging_steps=1,
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        optim="adamw_8bit",
        weight_decay=0.01,
        max_seq_length=512,
        seed=42,
        report_to="none",
    )
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        args=sft_config,
    )

    # Train
    print("Training (5 steps)...")
    trainer_stats = trainer.train()
    final_loss = trainer_stats.metrics.get('train_loss', 0)
    print(f"Final loss: {final_loss:.4f}")

    # Save adapter
    adapter_path = f"{OUTPUT_BASE}/{task_name}_adapter"
    model.save_pretrained(adapter_path)
    tokenizer.save_pretrained(adapter_path)
    adapter_paths[task_name] = adapter_path
    print(f"Adapter saved to: {adapter_path}")

    # Cleanup
    del model, tokenizer, trainer, dataset
    cleanup_memory()

print(f"\n{'='*60}")
print("All adapters trained and saved!")
print(f"{'='*60}")
print("\nAdapter paths:")
for name, path in adapter_paths.items():
    print(f" - {name}: {path}")
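Reloading the base model for every task keeps each run fully isolated, at the cost of three model loads. When VRAM allows, an alternative is to keep one base model resident and register each adapter under its own name via PEFT's named-adapter API. The sketch below is an assumption about how that would look; it is not what this notebook does, and it bypasses Unsloth's patched `get_peft_model` path, so treat it as plain-PEFT pseudocode:

```python
# Hypothetical alternative: several named adapters on one resident base model.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

model = get_peft_model(base_model, lora_cfg, adapter_name="technical")
model.add_adapter("creative", lora_cfg)  # register more adapters by name
model.add_adapter("code", lora_cfg)

model.set_adapter("technical")  # only the active adapter receives gradients
# ...train, then model.set_adapter("creative"), train again, and so on.
model.save_pretrained(OUTPUT_BASE)  # writes one subfolder per named adapter
```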
In [5]:
# Adapter Switching Demonstration
from peft import PeftModel

print("="*60)
print("Adapter Switching Demonstration")
print("="*60)
print("\nSame prompt with different adapters:")
print("Prompt: 'Explain what a variable is.'")
print("="*60)

TEST_PROMPT = "Explain what a variable is."
results = {}

# Load base model once
cleanup_memory()
print("\nLoading base model...")
base_model, tokenizer = FastLanguageModel.from_pretrained(
    MODEL_NAME,
    max_seq_length=512,
    load_in_4bit=True,
    dtype=None,
)

# Use text tokenizer directly for text-only inference (Pixtral uses a processor)
text_tokenizer = tokenizer.tokenizer if hasattr(tokenizer, 'tokenizer') else tokenizer

for task_name, adapter_path in adapter_paths.items():
    print(f"\n--- Loading {task_name} adapter ---")

    # Load adapter on base model
    adapted_model = PeftModel.from_pretrained(base_model, adapter_path)
    FastLanguageModel.for_inference(adapted_model)

    # Generate response - using Ministral multimodal format
    messages = [{"role": "user", "content": [{"type": "text", "text": TEST_PROMPT}]}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = text_tokenizer(prompt, return_tensors="pt").to("cuda")

    with torch.no_grad():
        outputs = adapted_model.generate(
            **inputs,
            max_new_tokens=150,
            temperature=0.6,
            do_sample=True,
            pad_token_id=text_tokenizer.pad_token_id,
        )
    response = text_tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    results[task_name] = response
    print(f"Response preview: {response[:200]}...")

    # Unload adapter
    del adapted_model
    cleanup_memory()

# Cleanup base model
del base_model, tokenizer
cleanup_memory()

print(f"\n{'='*60}")
print("Demonstration complete!")
print(f"{'='*60}")
Out[5]:
============================================================
Adapter Switching Demonstration
============================================================

Same prompt with different adapters:
Prompt: 'Explain what a variable is.'
============================================================

Loading base model...
Out[5]:
==((====))==  Unsloth 2025.12.10: Fast Ministral3 patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Out[5]:
Loading weights: 0%| | 0/458 [00:00<?, ?it/s]
Out[5]:
--- Loading technical adapter ---
Out[5]:
Response preview: Alright, I need to explain what a variable is. Let me start by recalling the basic concept. A variable is a piece of data that can change or be modified. But how should I define it?

First, perhaps I ...

--- Loading creative adapter ---
Out[5]:
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/peft/tuners/tuners_utils.py:282: UserWarning: Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
  warnings.warn(
Out[5]:
Response preview: Okay, I need to explain what a variable is. I recall that in programming, a variable is a storage location in a computer's memory that can hold data. But maybe I should make sure to explain this in a ...

--- Loading code adapter ---
Out[5]:
Response preview: Okay, so the user is asking about what a variable is. I need to explain the concept of a variable in a simple way. Let me think about how to describe it.

A variable is a name for a storage location i...
Out[5]:
============================================================
Demonstration complete!
============================================================
In [6]:
# Compare Responses Side by Side
print("="*60)
print("Response Comparison: Same Prompt, Different Adapters")
print("="*60)
print(f"\nPrompt: '{TEST_PROMPT}'\n")

for task_name, response in results.items():
    print(f"\n[{task_name.upper()} ADAPTER]")
    print("-" * 40)
    # Show full response
    print(response)
    print()
Out[6]:
============================================================
Response Comparison: Same Prompt, Different Adapters
============================================================

Prompt: 'Explain what a variable is.'


[TECHNICAL ADAPTER]
----------------------------------------
Alright, I need to explain what a variable is. Let me start by recalling the basic concept. A variable is a piece of data that can change or be modified. But how should I define it?

First, perhaps I should define it in terms of programming, since the question seems to be about that context.

In programming, a variable is a named storage location in memory that can hold data. It has a name, a data type, and can be assigned a value. For example, in many programming languages, you can have an integer, float, or string variable.

But the question is general, so maybe I should explain it in a more fundamental way.

In mathematics, a variable is a symbol that represents an unknown or

[CREATIVE ADAPTER]
----------------------------------------
Okay, I need to explain what a variable is. I recall that in programming, a variable is a storage location in a computer's memory that can hold data. But maybe I should make sure to explain this in a way that is clear and concise.

Let me think of a simple analogy. Maybe I can say that a variable is like a labeled box in a drawer that can hold a specific type of item, like a pen or a book, depending on what you put in it.

So, how would I phrase this?A variable is a named storage location in a computer's memory that holds data of a specific type. For example, in programming, you might have a variable named "age" that can store an integer representing a

[CODE ADAPTER]
----------------------------------------
Okay, so the user is asking about what a variable is. I need to explain the concept of a variable in a simple way. Let me think about how to describe it.

A variable is a name for a storage location in a computer's memory that holds data. It can be assigned a value and can be changed as needed. For example, in programming, a variable like 'x' can store a number and we can change its value to something else.

Let me draft the response.A variable is a name for a storage location in a computer's memory that holds data. It can be assigned a value and can be changed as needed. For example, in programming, a variable like 'x' can store a number and we
Analysis and Key Findings
Multi-Adapter Benefits
- Modularity: Each adapter specializes in a specific task
- Efficiency: Adapters are ~130MB each vs the ~6GB base model (see the size check after this list)
- Flexibility: Hot-swap adapters at runtime
- Preservation: Base model capabilities remain intact
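The storage claims line up with simple arithmetic, assuming the adapters are saved in fp32 and the base weights in bf16 (typical defaults, assumed rather than measured here):

```python
# Back-of-envelope check of the storage claims above (assumed precisions).
adapter_params = 33_000_000       # trainable LoRA parameters per adapter
base_params = 3_000_000_000       # Ministral-3B parameter count

print(f"adapter: {adapter_params * 4 / 1e6:.0f} MB")  # fp32: ~132 MB
print(f"base:    {base_params * 2 / 1e9:.1f} GB")     # bf16: ~6.0 GB
```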
Response Style Differences
| Adapter | Expected Style |
|---|---|
| Technical | Precise, factual, structured |
| Creative | Expressive, metaphorical, imaginative |
| Code | Code-focused, syntax-aware, practical |
Reasoning Preservation
All adapters should maintain the [THINK]...[/THINK] reasoning pattern from the base Ministral-3B-Reasoning model, but the thinking content may reflect the adapter's specialization.
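Because every adapter emits the same tags, downstream code can split the trace from the visible answer with one pattern. A minimal sketch; it assumes the tags are present in the decoded text (decode with skip_special_tokens=False if the tags are special tokens, as the run-together outputs above suggest):

```python
import re

def split_reasoning(text: str):
    # Separate a [THINK]...[/THINK] trace from the final answer.
    m = re.search(r"\[THINK\](.*?)\[/THINK\]", text, flags=re.DOTALL)
    if m is None:
        return None, text.strip()  # no trace found
    answer = (text[:m.start()] + text[m.end():]).strip()
    return m.group(1).strip(), answer

thinking, answer = split_reasoning(results["technical"])
```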
Production Usage
# Load the base model once (load_model() stands in for the
# FastLanguageModel.from_pretrained call shown earlier)
from peft import PeftModel

base_model = load_model()

# Switch adapters based on the incoming task
if user_task == "technical":
    model = PeftModel.from_pretrained(base_model, "technical_adapter")
elif user_task == "creative":
    model = PeftModel.from_pretrained(base_model, "creative_adapter")
# etc.
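For serving, re-wrapping the base model per request (as above) works but repeats adapter I/O. A leaner pattern, sketched here with PEFT's `load_adapter`/`set_adapter` (adapter names and paths are illustrative), registers everything once and flips by name:

```python
from peft import PeftModel

# Register every adapter on the resident base model once at startup.
model = PeftModel.from_pretrained(base_model, "technical_adapter",
                                  adapter_name="technical")
model.load_adapter("creative_adapter", adapter_name="creative")
model.load_adapter("code_adapter", adapter_name="code")

# Per request: switching is a cheap in-memory flip, not a weight reload.
model.set_adapter("technical")
```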
Key Insight
Multi-adapter training enables creating specialized "modes" for an LLM without the overhead of multiple full models. The same base model can serve different use cases by loading the appropriate adapter.
Model Notes
- Model: Ministral-3B-Reasoning (3B parameters)
- Reasoning Format: [THINK]...[/THINK] tags (native Ministral format)
In [7]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Out[7]:
Shutting down kernel to release GPU memory...
Out[7]:
{'status': 'ok', 'restart': False}