Ever since OpenAI introduced the ChatGPT interface in November 2022, the way we interact with, process, and consume information has changed drastically. Until that point, having a long and meaningful conversation with a machine was the stuff of science fiction.
ChatGPT is based on an architecture known as the Generative Pre-trained Transformer (GPT), a type of Large Language Model (LLM). The most common architecture currently used for LLMs is the multi-headed attention Transformer network. This architecture is the backbone of several modern language models such as GPT-4, LLaMA, and Claude 3. The Transformer was initially designed to translate text from one language to another. Earlier translation models would simply replace each word with its equivalent in the target language, but Transformer networks are able to understand the context of a word (or token) and produce a smarter translation.
Popular Language Models
GPT-4: Generative Pre-trained Transformer 4 (GPT-4) is a multimodal model created by OpenAI. There are several versions of this model, with context windows ranging from 8,192 to 32,768 tokens. GPT-4o (Omni) is a multilingual, multimodal version of GPT-4 which achieved state-of-the-art results in voice, multilingual, and vision benchmarks, setting new records in audio speech recognition and translation.
Gemini: Gemini is another multimodal LLM, introduced by Google in 2023 as the successor to the LaMDA and PaLM models. Gemini has a context length of 32,768 tokens. Nano-1 (1.8 billion parameters) and Nano-2 (3.25 billion parameters) are distilled from larger Gemini models and designed for use on edge devices such as smartphones and laptops.
Llama: Large Language Model Meta AI (Llama) is an autoregressive language model by Meta AI (formerly Facebook). The latest release, Llama 3, comes in two model sizes: 8B and 70B parameters. The models were pre-trained on approximately 15 trillion tokens of text gathered from publicly available sources, with the instruct models fine-tuned on publicly available instruction datasets.
Claude: Claude is an LLM released by Anthropic in 2023. The model is trained using an approach called Constitutional AI, which is designed to make it harmless and helpful without relying on extensive human feedback, using a combination of supervised and reinforcement learning. In the supervised learning phase, the model generates responses to prompts, self-critiques these responses based on a set of guiding principles (the constitution), and revises them. In the second phase, responses are generated and an AI evaluates their compliance with the constitution. This dataset of AI feedback is used to train a preference model that scores responses by how well they satisfy the constitution. Claude 3, the latest iteration, has a context window of 200,000 tokens.
Falcon: Falcon is an LLM introduced by the Technology Innovation Institute (TII), available in 40-billion and 7-billion parameter variants. Falcon is an autoregressive decoder-only model, a neural network that can process text in parallel. The model is open sourced under the Apache 2.0 license.
Need for Efficiency in LLMs
Power consumption: current vs forecast
The International Energy Agency (IEA) published its annual report, which includes a forecast of global energy use over the next two years (2025 and 2026). This estimate includes projections of energy consumption by data centers associated with AI and crypto mining. Data-center energy consumption is estimated to range between 620 and 1,050 TWh in 2026, with the base case for demand at just over 800 TWh, up from 460 TWh in 2022. Comparing the average electricity demand of a typical Google search (0.3 Wh) to that of an OpenAI ChatGPT request (2.9 Wh), and considering roughly 9 billion searches per day, this would result in nearly 10 TWh of additional electricity consumption annually. These estimates cover using pre-trained models to generate responses to user queries (inference); the energy consumption for training such models is several orders of magnitude higher.
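As a rough sanity check on that figure, the short calculation below (plain Python, using only the per-request numbers quoted above) reproduces the order of magnitude:

```python
# Back-of-envelope check of the "nearly 10 TWh" figure quoted above.
# Assumes 9 billion queries per day and the per-request estimates cited
# in the IEA report (0.3 Wh for a search vs 2.9 Wh for a ChatGPT request).
google_wh_per_query = 0.3
chatgpt_wh_per_query = 2.9
queries_per_day = 9e9

extra_wh_per_day = (chatgpt_wh_per_query - google_wh_per_query) * queries_per_day
extra_twh_per_year = extra_wh_per_day * 365 / 1e12   # 1 TWh = 1e12 Wh

print(f"Additional consumption: {extra_twh_per_year:.1f} TWh/year")  # ~8.5 TWh
```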
Power needs to train different LLM models
Improving the efficiency of LLMs is essential for reducing their environmental impact, as it lowers electricity consumption and carbon emissions. Efficient models also cut operational costs, making AI more accessible and sustainable. Additionally, efficient LLMs can run on low-power devices like mobiles and wearables without significantly impacting the device's battery life.
Understanding LLM Efficiency Fundamentals
Model or LLM efficiency is a measure of how well a machine learning model performs in terms of resource usage while maintaining high levels of accuracy and consistency. The resources in question can be one or more of computational power, memory, time, and so on. A model is said to be lightweight if it requires very few resources and does not need specialized hardware to run. On the contrary, a heavy model is one which needs a lot of resources and usually a specialized hardware setup.
Efficiency can be described in terms of computational efficiency, memory efficiency, and energy efficiency across the two stages of the LLM life cycle: training and inference. Some of the key metrics of model efficiency are:
Time efficiency: This can be further divided into inference time, which is a measure of how fast a model predicts or generates output, and training time, which is the time needed to train a model initialized with random weights.
Memory efficiency: This can be measured both in terms of the model's storage footprint and the memory (RAM and GPU) required during the training and inference phases. A model with a large memory requirement will naturally be slower than one with a small memory footprint, because data has to travel further to reach the compute units of the system. Memory efficiency can also be termed parameter efficiency, since a model with fewer parameters needs less memory.
Energy efficiency: Energy efficiency is the amount of energy required to train or run the model. Efficient models minimize energy usage, which is particularly important for large-scale deployments or in environments where power is limited. The energy can be either the power drawn directly by the hardware or the energy required to cool the hardware during training and inference. These costs compound: if the hardware draws twice the energy, it dissipates twice the heat and needs significantly more energy to cool. Cooling the hardware is also not environmentally friendly, as it requires large amounts of fresh water that cannot be reused for drinking.
Achieving efficiency is a balancing act. When we improve the efficiency of a model, we almost always see a dip in accuracy, but this is not always a 1:1 trade-off. For example, when we compare BERT with its optimized variant DistilBERT, we see a remarkable improvement in efficiency while the accuracy loss is very small or negligible. In many applications this change in accuracy may not matter at all, but the benefits of the improved efficiency are significant.
BERT vs DistilBERT: Efficiency and performance
Model Compression Techniques and Pruning Methods
Definition and purpose of model compression.
Model compression is a set of techniques used to reduce the size and computational requirements of LLMs without significantly sacrificing performance. This enables language models to run faster, consume less memory, and require less power. Model pruning, one such technique, involves selectively removing parts of the model that contribute little to its performance. This can be achieved by removing less important weights or neurons, which reduces the model's size and complexity while maintaining most of its predictive power.
Weight pruning:
Weight pruning is a set of methods to increase the sparsity (the count of zero-valued elements in a tensor) of a network's weights. Sparsity is a measure of how many elements in a tensor are exact zeros, relative to the tensor size. Typically, weights with the smallest absolute values are considered less important and are pruned away. The most straightforward way to prune is to take a trained model and prune it once, also called one-shot pruning. This approach is surprisingly effective, but also leaves a lot of potential sparsity untapped.
A sample model/tensor weight before and after pruning
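The snippet below is a minimal sketch of one-shot magnitude pruning using PyTorch's built-in pruning utilities; the `nn.Linear` layer is a stand-in for a single layer of a trained model, and the 30% pruning ratio is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy layer standing in for one linear layer of a trained model.
layer = nn.Linear(512, 512)

# One-shot magnitude pruning: zero out the 30% of weights with the
# smallest absolute value (L1 criterion), in a single pass.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent by folding the mask into the weight tensor.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity: {sparsity:.1%}")  # ~30%
```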
Neuron Pruning:
This method involves removing entire neurons or units in a layer, which reduces the layer's dimensionality. Not all neurons in a neural network contribute equally to the final output. Some neurons might be redundant or contribute very little to the network's decision-making process. Neuron pruning targets these less important neurons for removal. One common criterion used in neuron pruning is the average activation of a neuron. Neurons that have low activation values across different inputs may be considered less important and thus candidates for pruning. Another approach involves looking at the gradients during backpropagation. Neurons that receive smaller gradients are likely to have less impact on the network's learning and can be pruned.
Features or neurons of two models (fc1 and fc2) before and after pruning
Neuron pruning is often considered a form of structured pruning because it removes entire units, leading to a more predictable reduction in the network’s complexity and size. Weight pruning is categorized as unstructured pruning, which removes individual weights or connections without necessarily reducing the overall structure of the network.
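As a companion to the weight-pruning sketch above, the following shows structured neuron pruning with PyTorch's `ln_structured` utility; `fc1` is a hypothetical stand-in for one of the fully connected layers in the figure, and the 50% ratio is arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a fully connected layer such as fc1 in the figure above.
fc1 = nn.Linear(256, 128)

# Structured pruning: remove 50% of fc1's output neurons (whole rows of
# the weight matrix), ranked by their L2 norm.
prune.ln_structured(fc1, name="weight", amount=0.5, n=2, dim=0)
prune.remove(fc1, "weight")

# Rows that were pruned are now entirely zero, i.e. the neuron is gone.
zero_rows = (fc1.weight.abs().sum(dim=1) == 0).sum().item()
print(f"Pruned neurons in fc1: {zero_rows} / {fc1.out_features}")
```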
Layer pruning:
In this approach, entire structures such as filters in convolutional layers, channels, or even whole layers are pruned. This type of pruning leads to more predictable improvements in inference time and memory usage because it simplifies the network architecture. However, it also causes the largest deviation in the accuracy of the final model.
Model Quantization
The word ‘quantization’ is borrowed from the field of digital signal processing (DSP), where it describes the process of converting an analog signal (a continuous signal with infinite precision) to a digital signal, i.e. mapping data of infinite precision to finite precision. With respect to LLMs, quantization means mapping the high-precision numbers (floating-point values) in the model to lower-precision ones (integers). An LLM is stored as a collection of tensors, large arrays of numbers representing various aspects of the training tokens, and each number is usually a 32-bit floating-point value, so one cell in a tensor uses 32 bits or 4 bytes of memory. LLaMA-2 13B, with 13 billion parameters, would therefore need 13,000,000,000 × 32 bits = 416,000,000,000 bits, or roughly 48.4 GiB.
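The same arithmetic, extended to the lower precisions discussed next, shows how quickly the footprint shrinks (weights only, ignoring activations and runtime overhead):

```python
# Rough memory footprint of a 13-billion-parameter model at different
# numeric precisions (weights only).
params = 13e9

for bits in (32, 16, 8, 4):
    gib = params * bits / 8 / 1024**3
    print(f"{bits:>2}-bit weights: ~{gib:5.1f} GiB")
# 32-bit: ~48.4 GiB, 16-bit: ~24.2 GiB, 8-bit: ~12.1 GiB, 4-bit: ~6.1 GiB
```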
Post-Training Quantization (PTQ)
In post-training quantization, a pre-trained, full-precision model is quantized to the required precision for the inference stage. For example, a 32-bit floating-point number in the tensor is mapped to a lower-precision value such as a 16-bit, 8-bit, or 4-bit fixed-point integer. This lowers the accuracy of the model's inference slightly, but the efficiency gains are huge. A 32-bit full-precision model which needs around 50 GB of memory will need just a quarter of that if it is quantized to an 8-bit fixed-point integer representation, with little noticeable change in model accuracy.
Quantized sample neural network
PTQ can be tailored to the resources available. For example, a 32-bit full-precision model can be quantized to 16 bits for a cloud-based web application, or to 4 bits for use on a mobile phone. This approach provides the flexibility to adjust the model to the demand and the available resources.
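A minimal sketch of PTQ in practice, using PyTorch's dynamic quantization API on a toy stand-in model (the layer sizes are arbitrary): weights are stored as int8 and activations are quantized on the fly at inference time.

```python
import torch
import torch.nn as nn

# A toy network standing in for a pre-trained full-precision model.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 64))
model.eval()

# Post-training dynamic quantization: Linear weights are stored as int8,
# activations are quantized on the fly at inference time. No retraining
# or calibration data is needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller int8 weights
```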
Quantization-Aware Training (QAT)
Quantization-Aware Training is a more advanced technique which incorporates the effects of quantization during LLM training. By simulating the quantization of weights and activations as the model is being trained, QAT lets the model adapt to the lower precision, often resulting in better accuracy after quantization than PTQ. During QAT, the LLM is trained in floating-point precision, but quantization is simulated in the forward pass: weights and activations are quantized to lower precision to mimic inference-time behavior, while gradients are computed in high precision and the original floating-point weights are retained for the backward pass. It is also important to determine the appropriate quantization range. The model adjusts the scale and zero-point parameters for each layer, which define how floating-point numbers are mapped to integers; these parameters are fine-tuned during training to optimize the quantization.
Both Post-Training Quantization and Quantization-Aware Training have a robust set of tools and libraries, with TensorFlow Lite and PyTorch being the most used for QAT, while platforms like ONNX Runtime, TensorRT, and OpenVINO are widely used for PTQ.
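For illustration, the sketch below follows PyTorch's eager-mode QAT workflow on a hypothetical toy network; the training loop itself is omitted, and the layer sizes and the `fbgemm` backend are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

# Toy float model; QuantStub/DeQuantStub mark where activations enter and
# leave the quantized region in PyTorch's eager-mode workflow.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc2(self.relu(self.fc1(x)))
        return self.dequant(x)

model = TinyNet()
model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
qat_model = tq.prepare_qat(model)          # inserts fake-quantization ops

# ... normal training loop here: forward passes see simulated int8 noise,
# gradients are still computed in floating point ...

qat_model.eval()
int8_model = tq.convert(qat_model)         # materialize real int8 weights
```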
Knowledge Distillation
Knowledge distillation is the process of training a smaller, lighter model to mimic a larger pre-trained model. In knowledge distillation, the capabilities of the larger model are 'distilled', or transferred, to the smaller model. The large model is more general purpose in nature and capable of handling a more varied data set than is required in a typical application. For example, a large model may be able to identify and classify all types of mammals, but our application may only need to identify a certain breed of dog. We can distill only the ability to identify that breed and discard the rest of the knowledge, making the model a lot more efficient.
The teacher-student framework is commonly used to achieve knowledge distillation.
Teacher-student framework
Teacher Model: The teacher model is either a pretrained large model or an ensemble of several models. It serves as the benchmark for inference accuracy, i.e. it can perform the inference required for the application and a lot more.
Student Model: This is the smaller, specialized model which is trained using the knowledge of the larger teacher model. The student model is more suitable for everyday applications because it is more efficient than the teacher model. Once trained with the teacher model's guidance, it can perform a more streamlined set of tasks almost as accurately as the teacher model.
Response-based distillation
In this technique, the smaller student model learns from the larger teacher model by trying to replicate its predictions. The teacher model generates pseudo-labels, or soft labels, on the training data, and the student model is then trained on the same dataset with those pseudo-labels, minimizing the loss between the two models' predictions. This method can be employed when the teacher model has a much larger set of output classes than the student model needs. It is the simplest of the three to implement, as all we need are the pseudo-labels from the larger teacher model; if we are distilling from a popular open source model, pre-computed pseudo-labels can be used.
Response-based distillation
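A minimal sketch of the loss commonly used for response-based distillation (soft teacher labels blended with ordinary cross-entropy); the temperature `T` and mixing weight `alpha` are hypothetical hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-label distillation loss and ordinary cross-entropy."""
    # Soft targets: the teacher's output distribution, softened by temperature T.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```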
Feature-based distillation
In feature-based distillation, the student model is trained to capture the knowledge encoded in the teacher model's intermediate layers. This is achieved by teaching the student model to produce the same feature activations as the teacher model, minimizing the difference between their activations. This is typically done using a loss function that measures the distance between the representations learned by the teacher and student models, such as the mean squared error or the Kullback-Leibler divergence.
Feature-based distillation
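A minimal sketch of such a feature-matching loss, assuming hypothetical hidden sizes for teacher and student; in practice the projection layer would be trained jointly with the student.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical hidden sizes: the student is narrower than the teacher, so a
# small projection maps student features into the teacher's feature space.
teacher_dim, student_dim = 768, 256
projection = nn.Linear(student_dim, teacher_dim)

def feature_distillation_loss(student_hidden, teacher_hidden):
    # Match intermediate activations: MSE between (projected) student
    # features and the teacher's features from the corresponding layer.
    return F.mse_loss(projection(student_hidden), teacher_hidden)

# Example with dummy activations for a batch of 4 examples.
loss = feature_distillation_loss(torch.randn(4, student_dim),
                                 torch.randn(4, teacher_dim))
```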
Relation-based distillation
In relation-based distillation, the student model learns the relationship between input examples and output labels by focusing on transferring the underlying relationships between inputs and outputs. First, the teacher model generates a set of relationship matrices or tensors that capture the dependencies between the input examples and the pseudo-labels. The student model is then trained to reproduce these relationship matrices or tensors by minimizing a loss function that measures the difference between the matrices predicted by the student model and those generated by the teacher model. This helps the student model learn a more robust and generalizable relationship between inputs and outputs than it could learn from scratch, since the teacher model has already learnt this relationship and can pass that knowledge to the student.
Relation-based distillation
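A minimal sketch of one common way to express this, matching pairwise cosine-similarity matrices between teacher and student embeddings; the embedding sizes and batch shape are arbitrary for illustration.

```python
import torch
import torch.nn.functional as F

def pairwise_similarity(embeddings):
    # Relationship matrix: cosine similarity between every pair of
    # examples in the batch.
    normed = F.normalize(embeddings, dim=-1)
    return normed @ normed.T

def relation_distillation_loss(student_emb, teacher_emb):
    # The student is trained to reproduce the teacher's *relations*
    # between examples, not the teacher's raw outputs.
    return F.mse_loss(pairwise_similarity(student_emb),
                      pairwise_similarity(teacher_emb))

# Dummy batch of 8 examples with different embedding widths.
loss = relation_distillation_loss(torch.randn(8, 256), torch.randn(8, 768))
```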
Sparse Models
A sparse-matrix LLM (Large Language Model) refers to a model whose weights are represented using sparse matrices instead of traditional dense matrices. A sparse matrix is one in which most of the elements are zero or irrelevant, while only a small percentage of the elements are non-zero and carry significant values. This reduces the memory footprint and computational overhead, because operations are only performed on the non-zero elements. Using a sparse-matrix representation results in significant reductions in storage, bandwidth, and computational power, making the model more efficient.
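A small PyTorch illustration of the idea, assuming a weight matrix in which 90% of the entries have already been pruned to zero: storing it in sparse (COO) format keeps only the surviving weights, and the matrix multiply touches only those.

```python
import torch

# A weight matrix where 90% of the entries have been pruned to zero.
dense = torch.randn(1024, 1024)
dense[torch.rand_like(dense) < 0.9] = 0.0

# Store only the non-zero elements (COO format) and multiply against a
# dense input; computation touches only the surviving weights.
sparse = dense.to_sparse()
x = torch.randn(1024, 16)
y = torch.sparse.mm(sparse, x)

print(f"Non-zero weights kept: {sparse.values().numel()} of {dense.numel()}")
```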
In the model pruning discussed earlier, we start with a large model initialized with random weights and train it iteratively to create a useful LLM. We then prune this large network to obtain a smaller and more efficient one. The dataset needed to create the initial LLM (like GPT-4) is several hundred petabytes, and with multi-modal data like images and videos that dataset size will grow dramatically. Training such a large network also requires an enormous number of training iterations. This approach is impractical and potentially a major roadblock to further development of LLMs.
To address this challenge, the training steps can be rearranged. As usual, we start with a large model initialized with random weights. However, instead of training this model immediately, we first prune it to create a sparse matrix representation, known as a sparse model. This sparse model is then trained on the same dataset. Initially, it was believed that this approach would fail to capture all the features of the dataset, resulting in lower accuracy and performance. Additionally, it was thought that the sparse model would be unable to maintain the same gradient dynamics as a model trained without pruning. This belief persisted until 2019, when two MIT researchers introduced the 'Lottery Ticket Hypothesis.' According to this hypothesis, a randomly initialized dense network contains at least one sub-network (a pruned network), referred to as the "lottery ticket," which, when trained in isolation, can achieve performance and accuracy comparable to the original dense network. This implies that the lottery ticket network will have the same performance as the dense network and yet be a lot more efficient and lightweight. The catch here is identifying the 'lottery ticket' network (sparse-mask).
Training using a sparse network
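The sketch below is a simplified illustration of iterative magnitude pruning with weight rewinding, the general procedure used to search for such a ticket; `train` is a placeholder for the caller's training loop (which should apply the masks), and the round count and pruning fraction are arbitrary.

```python
import copy
import torch
import torch.nn as nn

def find_lottery_ticket(model, train, rounds=3, prune_frac=0.2):
    # Remember the original random initialization so we can rewind to it.
    initial_state = copy.deepcopy(model.state_dict())
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()
             if p.dim() > 1}                             # weight matrices only

    for _ in range(rounds):
        train(model, masks)                              # train with masks applied
        for name, param in model.named_parameters():
            if name not in masks:
                continue
            # Prune the smallest surviving weights by magnitude.
            alive = param[masks[name].bool()].abs()
            threshold = alive.quantile(prune_frac)
            masks[name] *= (param.abs() > threshold).float()
        # Rewind surviving weights to their original initialization.
        model.load_state_dict(initial_state)

    return masks   # the candidate "winning ticket" sparse mask
```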
The proposed method, which is currently the only known way to find the lottery ticket, is more resource-intensive and inefficient than training a dense model and pruning it. A lot of research is currently under way in this field, with the goal of finding the lottery-ticket network more efficiently. So, what is the point of an approach that is a more inefficient and resource-intensive route to the same goal? The two main benefits are the reusability and transformability of the sparse mask obtained from the above steps.
Reusable sparse mask: Once the sparse mask is computed, it can be reused several times to train the model efficiently on different datasets, as long as the model architecture remains the same. For example, if we have the sparse mask for a previously trained model like BERT, we can use that mask to prune the original model and retrain it on a different dataset, obtaining similar performance and accuracy with significantly fewer resources for both training and inference. The mask can be reused repeatedly as data evolves over time.
Transformable sparse mask: Several studies have found that the sparse masks of similar networks are alike and can easily be transformed from one into another. Similar networks here means networks which share the same underlying architecture or belong to the same family, for example ResNet-18 and ResNet-152. There is a concerted effort to find a generalized way of transforming these masks.
Conclusion
It is clear that there is no single right path when it comes to optimizing LLMs. Several factors, such as the dataset size, the end use, and the resources available, influence which optimization methodology we choose. There is always a trade-off between accuracy and efficiency, and striking the right balance is key. The field of Large Language Models stands at a crossroads of immense potential and significant challenges. By continuing to innovate in model architecture, optimization techniques, and evaluation methods, we can work towards a future where the power of these models is harnessed efficiently and responsibly. The journey ahead is sure to be filled with exciting discoveries and breakthroughs, pushing the boundaries of what's possible in artificial intelligence and human-machine interaction.