Fine-tuning an LLM with Hugging Face Trainer¶
In the previous notebook we fine-tuned a language model using a manual PyTorch training loop:
- custom `Dataset` and `DataLoader`
- explicit `optimizer.step()`, `scheduler.step()`, `loss.backward()`
- manual evaluation
This notebook uses the same model and dataset, but relies on Hugging Face datasets and Trainer to handle most of the training boilerplate.
Goal:
- Show how to fine-tune the same model and dataset with less code.
- Connect the high-level `Trainer` API to the manual loop from the previous notebook.
We still:
- Load a pretrained TinyLlama (Llama-architecture) model from the Hugging Face Hub.
- Load the Guanaco / OpenAssistant dataset from the Hugging Face Hub.
- Tokenize the data for causal language modeling.
- Train for a small number of steps.
- Run inference and compare base vs fine-tuned model.
Bazzite-AI Setup Required
Run `D0_00_Bazzite_AI_Setup.ipynb` first to verify GPU access.
import os
from dataclasses import dataclass
import torch
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
Trainer,
)
from datasets import load_dataset
torch.__version__
'2.9.1+cu130'
@dataclass
class Config:
# Model from HuggingFace Hub (same as D3_02)
HF_LLM_MODEL: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Output directory for fine-tuned model
output_dir: str = "ft_model_trainer"
# Data
max_length: int = 256
# Optimization
batch_size: int = 1 # reduced for memory efficiency
num_epochs: int = 1
learning_rate: float = 5e-6
weight_decay: float = 0.01
warmup_ratio: float = 0.1
gradient_accumulation_steps: int = 16 # increased to compensate
# For demo, use small subsets
train_subset_size: int = 500
val_subset_size: int = 100
seed: int = 42
device: str = "cuda" if torch.cuda.is_available() else "cpu"
cfg = Config()
cfg
Config(HF_LLM_MODEL='TinyLlama/TinyLlama-1.1B-Chat-v1.0', output_dir='ft_model_trainer', max_length=256, batch_size=1, num_epochs=1, learning_rate=5e-06, weight_decay=0.01, warmup_ratio=0.1, gradient_accumulation_steps=16, train_subset_size=500, val_subset_size=100, seed=42, device='cuda')
The hyperparameters here mirror those from the manual PyTorch notebook:
- `batch_size`, `num_epochs`, `learning_rate`, `weight_decay`, `warmup_ratio`, and `gradient_accumulation_steps` have the same meaning.
- `max_length` controls the maximum sequence length after tokenization.
- For the demo we only use small subsets of the full dataset (`train_subset_size` and `val_subset_size`) so that fine-tuning finishes quickly.
The difference is that we will pass these values into TrainingArguments instead of using them directly in a manual training loop.
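Before handing these values to `TrainingArguments`, a quick back-of-the-envelope sketch (using only the `cfg` fields defined above; `Trainer` computes this itself, with slightly different rounding) shows the effective batch size and roughly how many optimizer steps one epoch will take:

import math

# Gradient accumulation multiplies the effective batch size
effective_batch_size = cfg.batch_size * cfg.gradient_accumulation_steps   # 1 * 16 = 16
# Approximate number of optimizer updates over the training subset
steps_per_epoch = math.ceil(cfg.train_subset_size / effective_batch_size)  # ceil(500 / 16) = 32
total_steps = steps_per_epoch * cfg.num_epochs
warmup_steps = int(cfg.warmup_ratio * total_steps)                         # roughly 3 warmup steps
print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps: {total_steps} (warmup: {warmup_steps})")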
def set_seed(seed: int):
import random
random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(seed)
set_seed(cfg.seed)
[No output generated]
tokenizer = AutoTokenizer.from_pretrained(cfg.HF_LLM_MODEL)
# Many causal LMs do not define a pad token by default
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
print("Pad token:", tokenizer.pad_token, "ID:", tokenizer.pad_token_id)
model = AutoModelForCausalLM.from_pretrained(
cfg.HF_LLM_MODEL,
dtype=torch.bfloat16, # Use bfloat16 for memory efficiency
)
model.to(cfg.device)
# Enable gradient checkpointing for memory efficiency
model.gradient_checkpointing_enable()
n_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {n_params / 1e6:.1f}M")
print(f"Model dtype: {next(model.parameters()).dtype}")
Pad token: </s> ID: 2
Number of parameters: 1100.0M
Model dtype: torch.bfloat16
Dataset format and inference prompt¶
We use the Guanaco / OpenAssistant dataset from the Hugging Face Hub. Each entry has a single field:
"text": "### Human: <instruction>### Assistant: <ideal answer>"
During fine-tuning:
The model sees the full text string.
It is trained as a causal language model to predict the next token at each position.
At inference time we only have a new user instruction. We must recreate the same format the model saw during training, for example:
### Human: How do I build a PC?### Assistant:
and let the model generate the continuation.
def build_prompt_for_inference(user_instruction: str) -> str:
"""
Build a Guanaco-style prompt for a new instruction at inference time.
The dataset format looks like:
"### Human: ...### Assistant: ..."
"""
return f"### Human: {user_instruction}### Assistant:"
[No output generated]
# Load dataset from HuggingFace Hub (same as D3_02)
dataset = load_dataset("timdettmers/openassistant-guanaco")
dataset
Repo card metadata block was not found. Setting CardData to empty.
DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 9846
})
test: Dataset({
features: ['text'],
num_rows: 518
})
})

print(dataset)
print("Example training entry:")
print(dataset["train"][0])
# For the demo, restrict to small subsets
train_dataset = dataset["train"].select(range(cfg.train_subset_size))
val_dataset = dataset["test"].select(range(cfg.val_subset_size)) # Note: "test" not "validation"
len(train_dataset), len(val_dataset)
DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 9846
})
test: Dataset({
features: ['text'],
num_rows: 518
})
})
Example training entry:
{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their livelihood. This dependence can result in further suppression of wages and a decline in working conditions.\n\nOverall, the concept of monopsony is essential to understanding the dynamics of labor markets and the impact of market power on workers. Further research is needed to understand the extent and impact of monopsonies on the economy and to develop policies to address this issue.\n\nReferences:\nBivens, J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of Economic Perspectives, 27(3), 57-78.### Human: Now explain it to a dog'}
(500, 100)
Tokenization and preprocessing with datasets¶
Instead of writing a custom PyTorch Dataset class and a DataLoader, we now:
- Use `datasets.load_dataset` to load the dataset from the Hugging Face Hub.
- Define a tokenization function that:
  - tokenizes the `"text"` field,
  - truncates or pads to `max_length`,
  - sets `labels` equal to `input_ids` for causal language modeling.
- Apply this function to the whole dataset with `dataset.map(...)`.

The result is a `Dataset` object that already returns tokenized fields, which `Trainer` can use directly.
def tokenize_function(batch):
"""
Tokenize the 'text' field for causal language modeling.
We:
- truncate or pad to cfg.max_length,
- set labels = input_ids (shift is handled by the model internally).
"""
enc = tokenizer(
batch["text"],
truncation=True,
max_length=cfg.max_length,
padding="max_length",
)
# For causal LM supervised fine-tuning, labels are often the same as input_ids
enc["labels"] = enc["input_ids"].copy()
return enc
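The docstring above says that the next-token shift is handled inside the model. As a minimal illustration (a sketch only, not part of the training pipeline; `causal_lm_loss` is an illustrative name), the loss the model computes from `labels = input_ids` is roughly:

import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sketch of the shifted cross-entropy a causal LM computes internally."""
    # Position t predicts token t+1, so drop the last logit and the first label
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

Note that because we pad to `max_length` and copy `input_ids` into `labels`, padded positions also contribute to the loss; masking them with `-100` (the ignore index used by Hugging Face models) is a common refinement that we skip here for simplicity.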
tokenized_train = train_dataset.map(
tokenize_function,
batched=True,
remove_columns=["text"], # drop original string to keep only tokenized fields
)
tokenized_val = val_dataset.map(
tokenize_function,
batched=True,
remove_columns=["text"],
)
tokenized_train[0]
{'input_ids': [1, 835, 12968, 29901, 1815, 366, 2436, 263, 3273, 18707, ...],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...],
 'labels': [1, 835, 12968, 29901, 1815, 366, 2436, 263, 3273, 18707, ...]}

# Ensure the datasets return PyTorch tensors
tokenized_train.set_format(type="torch")
tokenized_val.set_format(type="torch")
tokenized_train
Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 500
})

os.makedirs(cfg.output_dir, exist_ok=True)
training_args = TrainingArguments(
output_dir=cfg.output_dir,
per_device_train_batch_size=cfg.batch_size,
per_device_eval_batch_size=cfg.batch_size,
num_train_epochs=cfg.num_epochs,
learning_rate=cfg.learning_rate,
weight_decay=cfg.weight_decay,
warmup_ratio=cfg.warmup_ratio,
logging_steps=10,
eval_strategy="epoch",
save_strategy="epoch",
gradient_accumulation_steps=cfg.gradient_accumulation_steps,
bf16=torch.cuda.is_available(), # Use bf16 instead of fp16
report_to="none",
load_best_model_at_end=True,
)
training_args
TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_use_gather_object=False,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=16,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=False,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=None,
hub_revision=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=no,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-06,
length_column_name=length,
liger_kernel_config=None,
load_best_model_at_end=True,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=ft_model_trainer/runs/Dec29_02-12-28_25f2198b1a05,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=loss,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1,
optim=OptimizerNames.ADAMW_TORCH_FUSED,
optim_args=None,
optim_target_modules=None,
output_dir=ft_model_trainer,
overwrite_output_dir=False,
parallelism_config=None,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
project=huggingface,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=None,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=SaveStrategy.EPOCH,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
trackio_space_id=trackio,
use_cpu=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.01,
)

Using TrainingArguments and Trainer¶
TrainingArguments defines how training should work:
- `per_device_train_batch_size`, `num_train_epochs`, `learning_rate`, `weight_decay`, `warmup_ratio`, and `gradient_accumulation_steps` correspond directly to the values we used in the manual PyTorch loop.
- `eval_strategy="epoch"` and `save_strategy="epoch"` tell `Trainer` to run evaluation and save checkpoints at the end of each epoch.
- `bf16=True` (set when a GPU is available) enables bfloat16 mixed precision training, similar to using `torch.cuda.amp` in the manual loop.
Trainer will:
- construct DataLoaders internally,
- run the training loop (forward, loss, backward, optimizer step, scheduler step),
- handle evaluation and model saving.
Conceptually, under the hood it performs the same sequence of operations as our manual loop in the previous notebook.
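For comparison, here is a minimal sketch of the manual loop that `Trainer` replaces (simplified: no mixed precision, gradient clipping, logging, or checkpointing; the dataloader, optimizer, and scheduler below are what `Trainer` would otherwise build internally, so this cell is for illustration only and running it would further update `model`):

from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Objects that Trainer normally constructs for us
train_dataloader = DataLoader(tokenized_train, batch_size=cfg.batch_size, shuffle=True)
optimizer = AdamW(model.parameters(), lr=cfg.learning_rate, weight_decay=cfg.weight_decay)
num_update_steps = (len(train_dataloader) // cfg.gradient_accumulation_steps) * cfg.num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(cfg.warmup_ratio * num_update_steps),
    num_training_steps=num_update_steps,
)

# Simplified training loop: forward, loss, backward, optimizer + scheduler step
model.train()
for epoch in range(cfg.num_epochs):
    for step, batch in enumerate(train_dataloader):
        batch = {k: v.to(cfg.device) for k, v in batch.items()}
        loss = model(**batch).loss / cfg.gradient_accumulation_steps
        loss.backward()
        if (step + 1) % cfg.gradient_accumulation_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()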
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_train,
eval_dataset=tokenized_val,
)
[No output generated]
train_result = trainer.train()
train_result
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
TrainOutput(global_step=32, training_loss=2.4102229923009872, metrics={'train_runtime': 30.9312, 'train_samples_per_second': 16.165, 'train_steps_per_second': 1.035, 'total_flos': 794505510912000.0, 'train_loss': 2.4102229923009872, 'epoch': 1.0})

trainer.save_model(cfg.output_dir)
tokenizer.save_pretrained(cfg.output_dir)
print(f"Model saved to {cfg.output_dir}")
Model saved to ft_model_trainer
metrics = trainer.evaluate()
metrics
{'eval_loss': 1.7101176977157593,
'eval_runtime': 1.1446,
'eval_samples_per_second': 87.366,
'eval_steps_per_second': 87.366,
'epoch': 1.0}

After calling `trainer.train()`:
- `Trainer` has iterated over the training dataset for `num_train_epochs` epochs.
- For each step, it has:
  - computed the loss,
  - backpropagated the gradients,
  - updated the optimizer and learning rate scheduler.
- After each epoch, it has:
  - evaluated on the validation dataset,
  - saved a checkpoint,
  - optionally kept the best-performing model in memory.
The `metrics` dictionary from `trainer.evaluate()` contains at least the validation loss (`eval_loss`), which is directly comparable to the validation loss from the manual PyTorch notebook.
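Since `eval_loss` is an average token-level cross-entropy, it can also be reported as perplexity (a small sketch using the `metrics` dictionary above):

import math

# Perplexity is the exponential of the mean cross-entropy loss
eval_perplexity = math.exp(metrics["eval_loss"])
print(f"Validation loss: {metrics['eval_loss']:.4f} -> perplexity: {eval_perplexity:.2f}")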
# Base model (unmodified) from HuggingFace Hub
base_model = AutoModelForCausalLM.from_pretrained(
cfg.HF_LLM_MODEL,
dtype=torch.bfloat16,
).to(cfg.device)
base_model.eval()
# Fine-tuned model (from Trainer output)
ft_model = AutoModelForCausalLM.from_pretrained(
cfg.output_dir,
dtype=torch.bfloat16,
).to(cfg.device)
ft_model.eval()
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 2048)
(layers): ModuleList(
(0-21): 22 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=256, bias=False)
(v_proj): Linear(in_features=2048, out_features=256, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
(up_proj): Linear(in_features=2048, out_features=5632, bias=False)
(down_proj): Linear(in_features=5632, out_features=2048, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=32000, bias=False)
)

def generate_response(model, instruction: str, max_new_tokens: int = 128):
"""
Generate a reply from the model given a human instruction.
We create a Guanaco-style prompt:
"### Human: ...### Assistant:"
and let the model continue.
"""
prompt_text = build_prompt_for_inference(instruction)
inputs = tokenizer(prompt_text, return_tensors="pt").to(cfg.device)
with torch.no_grad():
output_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.1,
pad_token_id=tokenizer.pad_token_id,
)
text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
return text
[No output generated]
# Pick an example from the validation set
example = val_dataset[11]
example_text = example["text"]
# Crude split to get the human instruction and assistant answer, for display
if "### Human:" in example_text and "### Assistant:" in example_text:
human_part = example_text.split("### Human:")[1].split("### Assistant:")[0].strip()
assistant_part = example_text.split("### Assistant:")[1].strip()
else:
human_part = example_text
assistant_part = ""
print("### HUMAN (PROMPT) ###")
print(human_part)
print("\n### GROUND TRUTH ASSISTANT ###")
print(assistant_part)
print("\n### BASE MODEL ###")
print(generate_response(base_model, human_part))
print("\n### FINE-TUNED MODEL ###")
print(generate_response(ft_model, human_part))
### HUMAN (PROMPT) ### How do I build a PC? ### GROUND TRUTH ASSISTANT ### Building a PC can be a rewarding experience, but it requires some technical knowledge and patience. Here are the general steps you can follow: 1. Determine your budget and what you want to use the PC for (gaming, work, etc.). This will help you choose the appropriate components. 2. Choose the components you want to use (CPU, motherboard, RAM, storage, GPU, power supply, case, and peripherals). 3. Make sure all the components are compatible with each other. Check the manufacturer's website and online forums for compatibility information. 4. Gather the necessary tools (screwdriver, thermal paste, etc.). 5. Install the CPU onto the motherboard, following the manufacturer's instructions. 6. Install the RAM onto the motherboard, following the manufacturer's instructions. 7. Install the storage (hard drive or solid-state drive) onto the motherboard or into the case, following the manufacturer's instructions. 8. Install the power supply into the case, following the manufacturer's instructions. 9. Install the GPU onto the motherboard, following the manufacturer's instructions. 10. Connect all the necessary cables to the motherboard, power supply, and peripherals. 11. Turn on the PC and make sure it boots up properly. 12. Install the operating system and necessary drivers. 13. Test the PC to make sure all the components are working properly. These are the basic steps, but there may be some variation depending on the specific components you choose. It's important to follow the manufacturer's instructions and take your time to avoid damaging any components. There are also many online resources, such as YouTube tutorials and PC building forums, that can provide additional guidance and tips.### Human: Thank you. Is it better to build my own PC or to just buy one off the shelf? Give me the pros and cons of each approach. ### BASE MODEL ###
### Human: How do I build a PC?### Assistant: Sure, let's break it down. Building a PC involves putting together different parts and components to create an overall system that meets your computing needs. Here are the steps involved: 1. Choose your computer hardware - You need to choose the motherboard, processor, RAM, graphics card, storage devices, power supply unit (PSU), and other essential components such as fans, heat sinks, and cables. 2. Install the motherboard - The motherboard is the brains of your PC, it controls everything from power supply, CPU, RAM, to video output. It connects all the other ### FINE-TUNED MODEL ###
### Human: How do I build a PC?### Assistant: To build a personal computer, you need to follow these steps: 1. Determine your budget and needs. Consider the type of operating system you want (Windows or Linux), storage capacity, processing power, graphics card, monitor, and other peripherals like keyboards, mice, speakers, and printers. 2. Choose a case and motherboard. A case is where your components go, while a motherboard is what connects them together. Check for compatibility with your chosen processor, RAM, and storage device. 3. Select a processor. There are many types of processors available depending
Summary and comparison to manual PyTorch fine-tuning¶
In this notebook we:
- Loaded the same pretrained Llama model and Guanaco dataset as in the manual PyTorch notebook.
- Used `datasets.load_dataset` to load the dataset from the Hugging Face Hub into a `DatasetDict`.
- Applied tokenization with `dataset.map`, creating `input_ids`, `attention_mask`, and `labels` fields.
- Configured training behaviour via `TrainingArguments`.
- Used `Trainer` to handle:
  - batching and shuffling,
  - gradient accumulation,
  - mixed precision (on GPU),
  - learning rate scheduling,
  - evaluation and checkpointing.
- Saved the fine-tuned model and compared its outputs to the base model.
Conceptually, Trainer performs the same operations as the manual PyTorch training loop from the previous notebook: forward pass, loss computation, backward pass, optimizer step, and scheduler step. The main difference is that these steps are now handled by a higher-level API, letting us focus on model, data, and hyperparameters rather than on training boilerplate.
# Shut down the kernel to release memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)