D3 04 Quantization
Quantization¶
Quantization is a technique used to reduce the computational requirements and memory footprint of deep learning models. It works by lowering the precision of the numbers used to represent model parameters (weights) and activations, which can lead to significant reductions in GPU memory usage and improvements in inference speed.
When using quantization, weights are converted to lower precision (e.g., from 32-bit floating-point to 4-bit integers) when loading the model. During inference, activations may remain in higher precision.
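As a rough back-of-the-envelope check (a sketch, not a measurement), the memory needed just to store the weights scales with the number of bytes per parameter. For a model with roughly 1.1 billion parameters this already predicts the sizes we will measure later in this notebook:
# Back-of-the-envelope weight-memory estimate for a ~1.1B-parameter model
num_params = 1.1e9
bytes_per_param = {"FP32": 4, "BF16": 2, "INT4": 0.5}  # 4 bits = 0.5 bytes
for name, nbytes in bytes_per_param.items():
    print(f"{name}: {num_params * nbytes / 1024**3:.2f} GiB")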
Benefits of Quantization:¶
- Reduced Model Size: Quantization can significantly reduce the model's memory footprint, making it easier to deploy on edge devices.
- Improved Inference Speed: Lower precision arithmetic operations are generally faster, leading to reduced latency.
Quantization Methods¶
There are many different quantization methods. An overview of the methods supported by the Hugging Face Transformers library can be found in the Transformers quantization documentation.
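As a preview of the bitsandbytes-based options used below: quantization is requested by passing a BitsAndBytesConfig to from_pretrained. A minimal sketch of two common configurations (8-bit and 4-bit NF4; other methods such as GPTQ or AWQ come with their own config classes):
import torch
from transformers import BitsAndBytesConfig

# 8-bit weight quantization (LLM.int8())
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit NF4 quantization with BF16 compute, as used later in this notebook
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)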
Hands-On Example: Quantizing a Pre-trained Model¶
In this example, we will quantize a pre-trained chat model to reduce its GPU memory usage and measure the effect on inference speed. We will use the bitsandbytes library.
Bazzite-AI Setup Required
Run D0_00_Bazzite_AI_Setup.ipynb first to verify GPU access.
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from datasets import load_dataset
torch.__version__
'2.9.1+cu130'
# Utility function to get model size:
def get_model_size(model):
    model_size = 0
    for param in model.parameters():
        model_size += param.nelement() * param.element_size()
    for buffer in model.buffers():
        model_size += buffer.nelement() * buffer.element_size()
    return model_size
Explanation: This function determines the size of a model, which is important to know since we are working with limited GPU memory. It accounts not only for the parameters (weights and biases) and their data types, but also for the buffers a model uses, e.g. for the running averages in batch-normalization layers.
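As a cross-check (a sketch; it assumes a Transformers version that provides the method), PreTrainedModel also exposes get_memory_footprint(), which performs essentially the same parameter-plus-buffer accounting:
# Sketch: compare our manual count with Transformers' built-in accounting
def check_model_size(model):
    manual = get_model_size(model)
    builtin = model.get_memory_footprint()  # also sums parameters and buffers by default
    print(f"manual: {manual / 1024**3:.2f} GiB, get_memory_footprint: {builtin / 1024**3:.2f} GiB")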
Choose a model and load tokenizer¶
# Use TinyLlama for consistency with D3_02/D3_03 notebooks
HF_LLM_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
chat_tokenizer = AutoTokenizer.from_pretrained(HF_LLM_MODEL)
# For some models, we need to set a padding token and the padding side:
chat_tokenizer.pad_token_id = chat_tokenizer.eos_token_id
chat_tokenizer.padding_side = 'left'
print(f"Model: {HF_LLM_MODEL}")
print(f"Tokenizer: pad_token_id={chat_tokenizer.pad_token_id}, padding_side={chat_tokenizer.padding_side}")
Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Tokenizer: pad_token_id=2, padding_side=left
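A quick optional illustration of why padding_side='left' matters for a decoder-only model: generation continues from the last token of each row, so shorter prompts in a batch must be padded on the left to keep their real tokens at the end.
# Illustration: left padding keeps the real tokens at the end of each row
batch = chat_tokenizer(
    ["Hello", "A much longer prompt about quantization"],
    padding=True, return_tensors="pt")
print(batch["input_ids"])       # the shorter row starts with pad tokens (id 2)
print(batch["attention_mask"])  # zeros mark the padded positions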
Load model in FP32¶
chat_model_32bit = AutoModelForCausalLM.from_pretrained(
    HF_LLM_MODEL,
    device_map="cuda",
    dtype=torch.float32,
    trust_remote_code=True)
chat_pipe_32bit = pipeline("text-generation", model=chat_model_32bit, tokenizer=chat_tokenizer)
print(f"The FP32 model uses {get_model_size(chat_model_32bit)/1024**3:.2f} GB of GPU memory")
Device set to use cuda
The FP32 model uses 4.10 GB of GPU memory
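This figure can be cross-checked against PyTorch's CUDA allocator, which should report a similar number for the loaded model (plus some allocator overhead); a quick sketch:
# Cross-check: memory currently held by tensors on the GPU
print(f"torch.cuda.memory_allocated(): {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")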
Create a pipeline for the model¶
A pipeline is a simple way to perform inference with a model. For text-generation models, we can pass the question to the model as a list of chat messages:
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Can you solve the equation 2x + 3 = 7?"},
]
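Under the hood, the pipeline converts this message list into a single prompt string using the tokenizer's chat template. If you want to see what is actually fed to the model, you can apply the template yourself (optional sketch):
# Optional: inspect the prompt string built from the chat messages
prompt = chat_tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
print(prompt)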
%time output = chat_pipe_32bit(messages, max_new_tokens=500, return_full_text=False)
print(output[0]["generated_text"])
CPU times: user 918 ms, sys: 83.9 ms, total: 1 s
Wall time: 1.01 s
Yes, I can solve the equation 2x + 3 = 7. Solution: 2x + 3 = 7 (2 * x) + 3 = 7 x = (-3/2) * (7/2) = -1.5 Therefore, the solution is: -1.5 Alternatively, you can use a calculator to solve this equation.
Load model in BF16¶
chat_model_16bit = AutoModelForCausalLM.from_pretrained(
    HF_LLM_MODEL,
    device_map="cuda",
    dtype=torch.bfloat16,
    trust_remote_code=True)
chat_pipe_16bit = pipeline("text-generation", model=chat_model_16bit, tokenizer=chat_tokenizer)
print(f"The BF16 model uses {get_model_size(chat_model_16bit)/1024**3:.2f} GB of GPU memory")
Device set to use cuda
The BF16 model uses 2.05 GB of GPU memory
%time output = chat_pipe_16bit(messages, max_new_tokens=500, return_full_text=False)
print(output[0]["generated_text"])
CPU times: user 1.22 s, sys: 26.1 ms, total: 1.25 s
Wall time: 1.26 s
Yes, I can solve the equation 2x + 3 = 7 using arithmetic operations in Python. Here's an example:
```python
x = float(input("Enter a value for x: "))
# Check if x is a valid number
if not isinstance(x, float):
print("Invalid input. Please enter a valid number.")
exit()
# Calculate the value of y using x as a variable
y = 2 * x + 3
# Display the result
print("The value of y is", y)
```
In this example, we prompt the user to enter a value for x using the `input()` function. We then check if x is a float using the `isinstance()` function. If it is not a float, we print an error message and exit the program.
If x is a float, we calculate the value of y using the `2 * x + 3` expression and display the result using the `print()` function.
Load model with 4-bit quantization¶
chat_model_4bit = AutoModelForCausalLM.from_pretrained(
    HF_LLM_MODEL,
    device_map="cuda",
    dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    ),
    trust_remote_code=True)
chat_pipe_4bit = pipeline("text-generation", model=chat_model_4bit, tokenizer=chat_tokenizer)
print(f"The 4bit model uses {get_model_size(chat_model_4bit)/1024**3:.2f} GB of GPU memory")
Device set to use cuda
The 4bit model uses 0.70 GB of GPU memory
%time output = chat_pipe_4bit(messages, max_new_tokens=500, return_full_text=False)
print(output[0]["generated_text"])
CPU times: user 462 ms, sys: 0 ns, total: 462 ms
Wall time: 462 ms
Yes, here's the equation: 2 * 3 + 7 = 2 * 3 + 7 Let's break down each term: - 2 * 3 = 6 - 6 + 7 = 13 Therefore, the result of the equation is 13.
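Note that 0.70 GB is somewhat more than the naive estimate of 1.1B parameters × 0.5 bytes ≈ 0.51 GiB. bitsandbytes only quantizes the linear layers; modules such as the embeddings, the final lm_head, and the normalization layers typically stay in 16-bit. A sketch to see this by grouping parameter storage by dtype (the packed 4-bit weights show up as uint8):
# Sketch: group parameter storage by dtype to see which parts stay unquantized
from collections import defaultdict

bytes_by_dtype = defaultdict(int)
for param in chat_model_4bit.parameters():
    bytes_by_dtype[str(param.dtype)] += param.nelement() * param.element_size()

for dtype, nbytes in sorted(bytes_by_dtype.items()):
    print(f"{dtype}: {nbytes / 1024**3:.2f} GiB")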
Questions to Consider:¶
- How does the size reduction differ between the three precisions (FP32, BF16, 4-bit quantization)?
- How does the inference speed depend on the data type used for loading the model?
The first question can be answered directly from the sizes printed above, but for the second question we have so far only measured inference speed on a single prompt, where fixed overhead can dominate the actual computation on the GPU. To get a better picture of the speed difference, we now load a dataset and query the model many times.
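For single-prompt timings, a slightly more careful approach is to add a warm-up run and average over several runs, synchronizing the GPU before reading the clock. A sketch (the batch comparison below is the simpler alternative we use here):
import time

def timed_generate(pipe, messages, n_warmup=1, n_runs=3, max_new_tokens=50):
    # Average generation time over several runs, after a warm-up run
    for _ in range(n_warmup):
        pipe(messages, max_new_tokens=max_new_tokens, return_full_text=False)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(messages, max_new_tokens=max_new_tokens, return_full_text=False)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

# Example:
# print(f"4-bit: {timed_generate(chat_pipe_4bit, messages):.2f} s per run")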
# Load the guanaco dataset from HuggingFace Hub
# Using 16 samples (reduced from 64) for reasonable batch inference timing on consumer GPUs
guanaco_train = load_dataset('timdettmers/openassistant-guanaco', split='train[:16]')
guanaco_train
Repo card metadata block was not found. Setting CardData to empty.
Dataset({
    features: ['text'],
    num_rows: 16
})
guanaco_train
Dataset({
    features: ['text'],
    num_rows: 16
})
guanaco_train[0]
{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their livelihood. This dependence can result in further suppression of wages and a decline in working conditions.\n\nOverall, the concept of monopsony is essential to understanding the dynamics of labor markets and the impact of market power on workers. Further research is needed to understand the extent and impact of monopsonies on the economy and to develop policies to address this issue.\n\nReferences:\nBivens, J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of Economic Perspectives, 27(3), 57-78.### Human: Now explain it to a dog'}
guanaco_train[0]['text'].split('###')
['', ' Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.', ' Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their livelihood. This dependence can result in further suppression of wages and a decline in working conditions.\n\nOverall, the concept of monopsony is essential to understanding the dynamics of labor markets and the impact of market power on workers. Further research is needed to understand the extent and impact of monopsonies on the economy and to develop policies to address this issue.\n\nReferences:\nBivens, J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of Economic Perspectives, 27(3), 57-78.', ' Human: Now explain it to a dog']
guanaco_train[0]['text'].split('###')[1].removeprefix(' Human: ')
'Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.'
guanaco_train[0]['text'].split('###')[2].removeprefix(' Assistant: ')
'"Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their livelihood. This dependence can result in further suppression of wages and a decline in working conditions.\n\nOverall, the concept of monopsony is essential to understanding the dynamics of labor markets and the impact of market power on workers. Further research is needed to understand the extent and impact of monopsonies on the economy and to develop policies to address this issue.\n\nReferences:\nBivens, J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of Economic Perspectives, 27(3), 57-78.'
guanaco_train = guanaco_train.map(lambda entry: {
    'question1': entry['text'].split('###')[1].removeprefix(' Human: '),
    'answer1': entry['text'].split('###')[2].removeprefix(' Assistant: ')
})
guanaco_train[2]
{'text': '### Human: Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?### Assistant: Sure! Let\'s say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn\'t be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model to "pull" representations of similar pairs (cat-cat or dog-dog) closer to each other and "push" representations of different pairs (cat-dog) apart. After doing this for a sufficient number of steps, your model will be able to produce unique, reliable representations for cats and dogs, in essence tell them apart.\n\nThis method is not limited to images, you can typically use it with any dataset that has similar and dissimilar data points.### Human: I didn\'t understand how pulling and pushing works. What are we pulling or pushing? And how does it work in terms of code or maths?### Assistant: It has to do with a process called differentiation, which means finding a derivative: a function that tells us whether another function is going up or down at a given point. For example, the derivative of `f(x) = x` is `f\'(x) = 1`, because it\'s always going up at a 1:1 ratio. This can be done for a variety of functions; notably, if you know the derivatives of `f(x)` and `g(x)`, you can also get the derivative of `f(g(x))` using a formula called the chain rule. Neural networks happen to be made of differentiable functions, so we can take the derivative of parts or all of it.\n\nTo use this for "pushing" and "pulling", we\'ll put two images through the neural network. Let\'s say the images are of a cat and a dog, so we want to increase the distance between the two. We pick one neuron weight from the network and make it a variable `x`, then construct a function that calculates the output of the network based on it and all the other parameters; let\'s call it `N(x)`. The distance between the cat and dog outputs would be `f(x) = N(cat) - N(dog)`. (Of course, the real output would have more than 1 dimension, but we\'re simplifying.) We now want to nudge the weight such that it moves the two outputs slightly further apart. For that, we can simply take the derivative! If `f\'(x)` is positive, that means that increasing the weight will move them further apart, so we should do that. If it\'s negative, then it\'ll move them closer, so we\'ll want to slightly decrease the weight instead. Apply this to all the neurons enough times and your network will soon converge to a pretty good cat-dog separator!',
'question1': 'Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?',
'answer1': 'Sure! Let\'s say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn\'t be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model to "pull" representations of similar pairs (cat-cat or dog-dog) closer to each other and "push" representations of different pairs (cat-dog) apart. After doing this for a sufficient number of steps, your model will be able to produce unique, reliable representations for cats and dogs, in essence tell them apart.\n\nThis method is not limited to images, you can typically use it with any dataset that has similar and dissimilar data points.'}
# Reduced batch_size from 64 to 8 for 16GB GPUs (3 models loaded simultaneously)
%time answers_32bit = chat_pipe_32bit(guanaco_train['question1'][:], batch_size=8)
CPU times: user 1min 29s, sys: 358 ms, total: 1min 30s
Wall time: 1min 30s
%time answers_16bit = chat_pipe_16bit(guanaco_train['question1'][:], batch_size=8)
CPU times: user 46.2 s, sys: 316 ms, total: 46.5 s
Wall time: 46.9 s
%time answers_4bit = chat_pipe_4bit(guanaco_train['question1'][:], batch_size=8)
CPU times: user 48.9 s, sys: 342 ms, total: 49.3 s
Wall time: 49.7 s
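Speed is only one side of the trade-off: lower precision can also change the generated text (compare the three answers to the equation prompt above). A quick qualitative check on the batch results (a sketch; the pipeline returns one list of generations per input prompt):
# Compare the first generated answer across the three precisions
for label, answers in [("FP32", answers_32bit), ("BF16", answers_16bit), ("4-bit", answers_4bit)]:
    print(f"--- {label} ---")
    print(answers[0][0]["generated_text"][:300])
    print()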
Conclusion¶
In this notebook, we introduced quantization as a technique to reduce model size and speed up inference. You learned how to quantize a model with bitsandbytes and how to compare memory footprint and inference speed across FP32, BF16, and 4-bit precision.
# Shut down the kernel to release memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
{'status': 'ok', 'restart': False}