SFT Training Test: Qwen3-4B-Thinking-2507¶
Tests Supervised Fine-Tuning with Unsloth's optimized SFTTrainer on Qwen3-4B-Thinking-2507.
Key features tested:
- FastLanguageModel loading with 4-bit quantization
- LoRA adapter configuration
- SFTTrainer with a synthetic dataset that includes <think> content
- Training the model to produce self-questioning reasoning
- Post-training inference verification
Thinking Style: Self-questioning internal dialogue
- "What is the user asking here?"
- "Let me think about the key concepts..."
- "How should I structure this explanation?"
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
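Before loading a multi-billion-parameter model in the next cell, it can be useful to confirm how much VRAM is actually free. The snippet below is a minimal sketch using only `torch.cuda.mem_get_info`; it is not part of the original test and simply makes out-of-memory failures easier to diagnose.

```python
# Optional sanity check (not in the original notebook): report free vs. total VRAM
# before the 4-bit model load below.
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # (free, total) in bytes
    print(f"GPU memory: {free_bytes / 1e9:.2f} GB free / {total_bytes / 1e9:.2f} GB total")
else:
    print("No CUDA device available - the 4-bit load below requires a GPU.")
```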
In [2]:
# Load Qwen3-4B-Thinking-2507 with 4-bit quantization
MODEL_NAME = "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]}...")
model, tokenizer = FastLanguageModel.from_pretrained(
MODEL_NAME,
max_seq_length=1024, # Increased for thinking content
load_in_4bit=True,
dtype=None, # Auto-detect
)
print(f"Model loaded: {type(model).__name__}")
Loading Qwen3-4B-Thinking-2507-unsloth-bnb-4bit...
==((====))==  Unsloth 2025.12.10: Fast Qwen3 patching. Transformers: 5.0.0.1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading weights: 0%| | 0/398 [00:00<?, ?it/s]
Model loaded: Qwen3ForCausalLM
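As a quick follow-up, you can verify that the linear layers really were loaded as 4-bit bitsandbytes modules and check the resulting GPU footprint. This is an illustrative sketch, assuming bitsandbytes is the backend behind `load_in_4bit=True`; it is not part of the original test.

```python
# Optional verification sketch: count 4-bit quantized linear layers and report
# current GPU memory usage after the load. Assumes the bitsandbytes 4-bit backend.
import torch
from bitsandbytes.nn import Linear4bit

num_4bit = sum(1 for m in model.modules() if isinstance(m, Linear4bit))
print(f"Linear4bit modules found: {num_4bit}")
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```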
In [3]:
# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Unsloth 2025.12.10 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
LoRA applied: 33,030,144 trainable / 2,526,543,360 total (1.31%)
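The 33,030,144 trainable parameters can be reproduced by hand: each LoRA adapter on a linear layer of shape (d_in, d_out) adds r·(d_in + d_out) parameters. The sketch below recomputes the total from Qwen3-4B's published dimensions (hidden size 2560, 36 layers, 32 query / 8 key-value heads with head dim 128, MLP intermediate size 9728); these dimensions are stated here as assumptions rather than read from the loaded config.

```python
# Reproduce the trainable-parameter count from first principles:
# each LoRA pair (A: d_in x r, B: r x d_out) adds r * (d_in + d_out) parameters.
r = 16
hidden, layers = 2560, 36
q_out = 32 * 128   # query projection output dim (heads * head_dim)
kv_out = 8 * 128   # key/value projection output dim
mlp = 9728         # MLP intermediate dim

per_layer = (
    r * (hidden + q_out)          # q_proj
    + 2 * r * (hidden + kv_out)   # k_proj, v_proj
    + r * (q_out + hidden)        # o_proj
    + 2 * r * (hidden + mlp)      # gate_proj, up_proj
    + r * (mlp + hidden)          # down_proj
)
print(per_layer * layers)  # expected: 33,030,144
```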
In [4]:
# Create minimal synthetic instruction dataset with thinking content (5 samples)
# Using self-questioning internal dialogue style for thinking
from datasets import Dataset
synthetic_data = [
{
"instruction": "What is machine learning?",
"thinking": "What is the user asking here? They want to understand machine learning. What are the key concepts I should cover? It's a subset of AI... and it involves learning from data. How should I keep this accessible? Short and clear definition.",
"response": "Machine learning is a subset of artificial intelligence where computers learn patterns from data."
},
{
"instruction": "Explain Python in one sentence.",
"thinking": "One sentence only - what's most important about Python? Its readability and versatility are the defining features. How do I capture both in one sentence?",
"response": "Python is a high-level programming language known for its readability and versatility."
},
{
"instruction": "What is a neural network?",
"thinking": "How do I explain neural networks simply? What's the core concept? They're inspired by biological neurons... they process information in layers. Should I mention deep learning? Maybe keep it basic for now.",
"response": "A neural network is a computational model inspired by biological neurons that processes information through connected layers."
},
{
"instruction": "Define supervised learning.",
"thinking": "What makes supervised learning 'supervised'? It's the labels! The data has known outputs. How do I explain this clearly? Focus on the training process with labeled data.",
"response": "Supervised learning is training a model on labeled data to predict outcomes for new inputs."
},
{
"instruction": "What is gradient descent?",
"thinking": "This is a bit technical. What's the intuition behind gradient descent? It's like walking downhill to find the lowest point. The gradient tells us which direction is 'down'. Keep it conceptual.",
"response": "Gradient descent is an optimization algorithm that minimizes loss by iteratively adjusting parameters in the direction of steepest descent."
},
]
# Format as chat conversations with thinking content
def format_conversation(sample):
# Combine thinking and response with proper tags
assistant_content = f"<think>\n{sample['thinking']}\n</think>\n\n{sample['response']}"
messages = [
{"role": "user", "content": sample["instruction"]},
{"role": "assistant", "content": assistant_content}
]
return {"text": tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)}
dataset = Dataset.from_list(synthetic_data)
dataset = dataset.map(format_conversation, remove_columns=["instruction", "thinking", "response"])
print(f"Dataset created: {len(dataset)} samples")
print(f"\nSample formatted text:")
print(dataset[0]['text'][:500] + "...")
Map: 0%| | 0/5 [00:00<?, ? examples/s]
Dataset created: 5 samples

Sample formatted text:
<|im_start|>user
What is machine learning?<|im_end|>
<|im_start|>assistant
<think>
What is the user asking here? They want to understand machine learning. What are the key concepts I should cover? It's a subset of AI... and it involves learning from data. How should I keep this accessible? Short and clear definition.
</think>

Machine learning is a subset of artificial intelligence where computers learn patterns from data.<|im_end|>
...
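Because the thinking traces make these samples longer than typical instruction data, it is worth confirming they still fit inside the `max_seq_length=1024` used for this test. The check below is added here for illustration and is not part of the original notebook.

```python
# Sanity check: tokenize each formatted sample and confirm it fits in max_seq_length.
lengths = [len(tokenizer(t)["input_ids"]) for t in dataset["text"]]
print(f"Token lengths: {lengths} (max allowed: 1024)")
assert max(lengths) <= 1024, "A sample exceeds max_seq_length and would be truncated"
```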
In [ ]:
# SFT Training (minimal steps for testing)
from trl import SFTTrainer, SFTConfig
sft_config = SFTConfig(
output_dir="outputs_sft_qwen_think_test",
per_device_train_batch_size=1,
gradient_accumulation_steps=1,
max_steps=3, # Minimal steps for testing
warmup_steps=1,
learning_rate=2e-4,
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
weight_decay=0.01,
max_seq_length=1024, # Increased for thinking content
seed=42,
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
args=sft_config,
)
print("Starting SFT training with thinking content (3 steps)...")
trainer_stats = trainer.train()
final_loss = trainer_stats.metrics.get('train_loss')
if final_loss is not None:
    print(f"Training completed. Final loss: {final_loss:.4f}")
else:
    print("Training completed. Final loss: N/A")
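The trainer above computes loss over the entire formatted text, including the user turn. If you only want gradients on the assistant's thinking and response, Unsloth provides a `train_on_responses_only` helper; the marker strings below follow the ChatML-style template shown in the dataset sample and should be treated as an assumption to verify against your tokenizer's chat template.

```python
# Optional variant (sketch): mask everything except the assistant turn so the loss
# only covers the <think> trace and the final answer. Marker strings are assumed
# from the ChatML-style template printed for the dataset sample above.
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
```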
In [6]:
# Post-training inference test
FastLanguageModel.for_inference(model)
messages = [{"role": "user", "content": "What is deep learning?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=1024, # Increased to allow full thinking + response
temperature=0.6,
top_p=0.95,
do_sample=True,
pad_token_id=tokenizer.pad_token_id,
)
# Get generated token IDs only (exclude prompt)
input_length = inputs["input_ids"].shape[1]
generated_ids = outputs[0][input_length:].tolist()
# Token-based parsing using </think> token ID
THINK_END_TOKEN_ID = 151668
if THINK_END_TOKEN_ID in generated_ids:
end_idx = generated_ids.index(THINK_END_TOKEN_ID)
thinking = tokenizer.decode(generated_ids[:end_idx], skip_special_tokens=True).strip()
final_response = tokenizer.decode(generated_ids[end_idx + 1:], skip_special_tokens=True).strip()
think_tag_found = True
else:
thinking = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
final_response = "(Model did not complete thinking - increase max_new_tokens)"
think_tag_found = False
print("=" * 60)
print("SFT Training Pipeline Test (Thinking Mode)")
print("=" * 60)
print(f"</think> token found: {'✅ YES' if think_tag_found else '❌ NO'}")
print(f"Output tokens: {len(generated_ids)}")
print(f"\nTHINKING CONTENT:")
print(thinking[:400] + "..." if len(thinking) > 400 else thinking)
print(f"\nFINAL RESPONSE:")
print(final_response[:300] + "..." if len(final_response) > 300 else final_response)
if think_tag_found and thinking and final_response:
print("\n✅ SFT Training Pipeline Test PASSED")
else:
print("\n⚠️ Test completed but output may need review")
============================================================
SFT Training Pipeline Test (Thinking Mode)
============================================================
</think> token found: ✅ YES
Output tokens: 1024

THINKING CONTENT:
Okay, the user is asking "What is deep learning?" Hmm, this seems like a pretty basic question about AI, but I should be careful not to assume their level of knowledge. Maybe they're a complete beginner? Or could they be a student studying machine learning? First, I need to define deep learning clearly but simply. I should avoid jargon overload. The key is to connect it to something they might k...

FINAL RESPONSE:
**Deep learning** is a **subset of machine learning** that uses **artificial neural networks** with multiple layers (hence "deep") to automatically learn complex patterns from large amounts of data. It’s designed to mimic the human brain’s ability to recognize patterns in unstructured data (like ima...

✅ SFT Training Pipeline Test PASSED
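The hard-coded `THINK_END_TOKEN_ID = 151668` above matches Qwen3's `</think>` token, but a more portable approach is to ask the tokenizer directly. A small sketch, assuming `</think>` exists as a single token in the vocabulary (which the hard-coded value implies):

```python
# Derive the </think> token id from the tokenizer instead of hard-coding it.
think_end_id = tokenizer.convert_tokens_to_ids("</think>")
print(f"</think> token id: {think_end_id}")  # expected to equal 151668 for Qwen3
assert think_end_id is not None and think_end_id != tokenizer.unk_token_id
```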
Test Complete¶
The SFT Training Pipeline test with thinking content has completed successfully. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- FastLanguageModel loading with 4-bit quantization (Qwen3-4B-Thinking-2507)
- LoRA adapter configuration (r=16, all projection modules)
- Synthetic dataset creation with <think> tags and self-questioning style
- SFTTrainer training loop (3 steps)
- Post-training inference with thinking output
Thinking Style Trained¶
- Self-questioning internal dialogue
- "What is the user asking?" pattern
- "How should I structure this?" reasoning
Ready for Production¶
If this test passed, your environment is ready for:
- Full SFT fine-tuning on larger datasets with thinking content
- Chain-of-thought training workflows
- Model saving and deployment with thinking capabilities (see the sketch below)
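As a next step after a passing test, persisting the trained adapters is usually the first production task. The sketch below uses the standard PEFT/transformers save path; the directory name `lora_qwen3_thinking` is arbitrary, and Unsloth also offers `model.save_pretrained_merged(...)` for exporting merged weights (see the Unsloth docs).

```python
# Sketch: persist the LoRA adapters and tokenizer for later reuse or deployment.
# Directory name is illustrative only.
model.save_pretrained("lora_qwen3_thinking")
tokenizer.save_pretrained("lora_qwen3_thinking")
```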
In [7]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Shutting down kernel to release GPU memory...
Out[7]:
{'status': 'ok', 'restart': False}