โ† Back to projects

LLM from Scratch

March 30, 2026  ·  GitHub

Introduction

Who am I and why am I writing this?

I am a student finishing up my undergraduate Computer Science degree, and I made it my goal this quarter to really understand how language models work.

I often catch myself thinking I know concepts, but can't really explain them. I am going to be completing different projects to understand the workings of a language model, and writing summaries about what I learned from each mini-project. I think this is the best way to reinforce what I learn, and ensure I fully understand each concept. I know all this information is available all over the internet, but I figured I might as well post these summaries since I am writing them. Writing has always been a weakness for me so this is also a way for me to improve that.

The project

For this project, I built a language model from scratch following Sebastian Raschka's book "Build a Large Language Model (From Scratch)", where he walks through the entire process of building a large language model: handling the data, coding the attention mechanisms and GPT architecture, pretraining a model, and fine-tuning models for classification and instruction-following tasks.

Training a large language model is a very expensive task, so the book trains the model on a small dataset so readers can train it on their local machine. Later on, we load in the weights of a more capable model and perform fine-tuning tasks on it. Even though the scale of the models is smaller than the models we use every day, the architecture is still the same as the original GPT-2 and transformer.

Chapter 1 - Understanding Large Language Models

What is a large language model?

Before I explain what I built, I'll cover the basics of language models. A large language model is a deep neural network trained on very large amounts of text data that can respond to and generate text. They are called large language models due to the size of the training data and the size of the network itself. Language models are an improved form of natural language processing, having a greater ability to 'understand' text than earlier rule-based methods. LLMs can perform a variety of tasks such as summarizing text, text translation, sentiment analysis, and question answering.

Language models are trained to predict the next word in a sequence, and iterate over this many times to generate text. The key feature they contain is the transformer architecture, which pays attention to different parts of the input and helps understand the complexities of human language well. I will cover transformers and attention in more detail later.

The two main parts of building a LLM are pretraining and fine-tuning. Pretraining is the process of training the model on the text dataset to gain an understanding of human language. After pretraining, fine tuning is used to train the model on a smaller dataset to teach it specific tasks such as classification or instruction following.

Chapter 2 - Working with Text Data

The first task in creating an LLM from scratch is preparing the input text for the model to train on. I will cover how we tokenized the text, encoded the tokens into vector representations, and created a data loading method to produce the input-output pairs for the LLM to train on.

Language models cannot take in human text directly, so we need to translate the text into embeddings, or vector representations, to pass to the model. To do this we first break the full text up into tokens, or smaller units of text. Tokenization is a deep topic, and I will cover it in more depth in my next project. The key idea is that the long string of text is broken up into smaller words, parts of words, and special characters. Each of these is then assigned an integer ID through a map to identify it, and this mapping of tokens to IDs is referred to as the vocabulary. It is necessary to create a reverse mapping too, as the model will output token IDs that we need to convert back to text. Special tokens were also added: one for unknown words not in our vocabulary, and an end-of-text token to separate pieces of text in the input.

The book walks through building a simple tokenizer from scratch to understand the concept, but imports a popular tokenization scheme called byte pair encoding (BPE) from the tiktoken library to use for the model. The BPE tokenizer was used to train models such as GPT-2 and GPT-3, and it handles unfamiliar words well by breaking them down into subwords or individual characters.
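As a rough sketch of the from-scratch approach (the vocabulary, regex, and class name here are illustrative, not the book's exact code):

```python
import re

# A minimal tokenizer sketch with a vocabulary map, a reverse map for decoding,
# and an <|unk|> fallback for words outside the vocabulary.
class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab                               # token -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}    # ID -> token

    def encode(self, text):
        tokens = re.findall(r"\w+|[^\w\s]", text)
        # Unknown words fall back to the special <|unk|> token.
        return [self.str_to_int.get(t, self.str_to_int["<|unk|>"]) for t in tokens]

    def decode(self, ids):
        return " ".join(self.int_to_str[i] for i in ids)

# Tiny illustrative vocabulary, including the two special tokens.
tokens = ["the", "verdict", "was", "clear", ".", "<|unk|>", "<|endoftext|>"]
vocab = {tok: i for i, tok in enumerate(tokens)}
tok = SimpleTokenizer(vocab)
ids = tok.encode("the verdict was unknown .")  # "unknown" maps to <|unk|>
```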

Now, we need to assemble these tokens into input-target pairs to feed to the model for training. Since language models are trained for next word prediction, the input will be a sequence of token IDs, with the target being the next token ID directly after them. The book walks through creating a data loader to use a sliding window approach to iterate through the full input text creating these pairs for a specified input length as PyTorch tensors. We create an input tensor, and a target tensor which is the input tensor shifted right by one token so its last element is the next token after the input, and will not contain the first element in the input tensor.
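The sliding-window idea can be sketched with plain Python lists (`make_pairs`, `max_length`, and `stride` are illustrative names mirroring the book's data loader parameters):

```python
# Build (input, target) pairs with a sliding window: the target is the
# input chunk shifted right by one token.
def make_pairs(token_ids, max_length, stride):
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i : i + max_length])            # input chunk
        targets.append(token_ids[i + 1 : i + max_length + 1])   # shifted by one
    return inputs, targets

ids = [10, 20, 30, 40, 50, 60]
inputs, targets = make_pairs(ids, max_length=4, stride=1)
```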

The last step is to convert the token IDs into embedding vectors. The embeddings are normally of high dimensionality, but the book walks through examples of them in low dimension for educational purposes. The embeddings are vector representations of each token, initialized with random values that will be learned by the model during training through the backpropagation algorithm.

The self attention mechanism in LLMs doesn't have any context of the position of the tokens, so we add positional information to the embeddings. To keep this post from being too long I won't go in depth on positional embeddings, but I will cover them more in a future project. We use absolute positional embeddings, which are used by OpenAI's GPT models and focus on a token's specific location in a sequence. To do this, we create an additional embedding layer of the same size as the current embeddings that encodes each token's position in the sequence.

The book often walks through smaller scale examples for educational purposes, but the actual model we are building uses embeddings of 256 dimensions, a vocabulary of size 50,257 from the BPE tokenizer, and trains the model on the text "The Verdict", which produced 5,145 tokens after tokenization. Now that the embeddings are created, we move on to implementing the attention mechanisms.
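A minimal PyTorch sketch of this embedding setup, using the chapter's sizes (the sample token IDs are made up for illustration):

```python
import torch

torch.manual_seed(123)
vocab_size, emb_dim, context_len = 50_257, 256, 4  # sizes from the chapter

tok_emb = torch.nn.Embedding(vocab_size, emb_dim)   # token embedding lookup
pos_emb = torch.nn.Embedding(context_len, emb_dim)  # absolute positional embeddings

token_ids = torch.tensor([[40, 367, 2885, 1464]])   # one batch of 4 token IDs
positions = torch.arange(context_len)               # 0, 1, 2, 3

# Final input embeddings: elementwise sum of token and positional vectors.
input_embeddings = tok_emb(token_ids) + pos_emb(positions)
```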

Chapter 3 - Coding Attention Mechanisms

One of the core parts of the LLM architecture is the attention mechanism. Chapter 3 focused purely on attention, teaching how to implement simplified self-attention, self-attention, causal attention, and multi-head attention. Attention started with Bahdanau attention in recurrent neural networks; RNNs were later found to be unnecessary, and the transformer architecture, built around a self-attention mechanism, was introduced in their place.

Simplified self-attention

Self-attention allows each element in an input sequence to consider and weigh the importance of every other element in the sequence. The book first walks through implementing a simplified self-attention mechanism with no trainable weights. It computes attention scores for a given query token (a selected input token) by taking the dot product of the query's embedding with the embedding of every token in the sequence. These attention scores are then normalized using PyTorch's softmax function, which also ensures all weights are positive and sum to one. We then compute the context vector with respect to the given query token as a weighted sum of the input vectors, each multiplied by its attention weight, and repeat this to calculate attention weights and context vectors with respect to each input token. This could be computed using for loops in Python, but it is much more efficient to use matrix multiplication. The book walks through the implementation of a self-attention Python class to compute the context vectors for each token in the input sequence.
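In code, the simplified mechanism boils down to two matrix multiplications and a softmax; this is a toy-sized sketch, not the book's exact class:

```python
import torch

torch.manual_seed(123)
inputs = torch.rand(6, 3)  # 6 tokens with 3-dim embeddings (toy sizes)

# Attention scores: dot product of every token embedding with every other one.
attn_scores = inputs @ inputs.T                    # shape (6, 6)
attn_weights = torch.softmax(attn_scores, dim=-1)  # each row sums to 1

# Context vectors: weighted sums of the input vectors.
context = attn_weights @ inputs                    # shape (6, 3)
```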

Self-attention

The simplified self-attention mechanism was explained first to build an understanding of the concept. The book next moved on to implementing the self-attention mechanism used in the original transformer architecture, which uses trainable weight matrices that are updated during the training phase.

The weight matrices are used to compute query, key, and value vectors for every input element. As before, we need to compute attention scores, but this time as the dot product of the query and key vectors, which were obtained from the inputs and the weight matrices. Then we normalize the attention scores using softmax as before to get the attention weights. Lastly, the context vector is computed, but this time as a weighted sum of the value vectors, each multiplied by its corresponding attention weight.
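A compact sketch of this version, including the 1/sqrt(d_k) scaling used in the original transformer (matrix names and toy sizes are illustrative):

```python
import torch

torch.manual_seed(123)
d_in, d_out = 3, 2
inputs = torch.rand(6, d_in)  # 6 tokens (toy sizes)

# Trainable weight matrices for queries, keys, and values.
W_q = torch.nn.Parameter(torch.rand(d_in, d_out))
W_k = torch.nn.Parameter(torch.rand(d_in, d_out))
W_v = torch.nn.Parameter(torch.rand(d_in, d_out))

queries, keys, values = inputs @ W_q, inputs @ W_k, inputs @ W_v

# Scores from query-key dot products, scaled by sqrt(d_k) before softmax.
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
context = attn_weights @ values  # weighted sum of value vectors
```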

This improvement over the simplified self-attention mechanism is important because we transformed the input data into queries, keys, and values using trainable weight matrices that will be adjusted during training.

Causal attention

Language modeling should only depend on previous words in a sequence, so we implemented causal attention to mask all future tokens in the sequence. This is done after computing the attention scores: we create a mask that sets the scores of all positions ahead of the given token to -inf, and then apply the softmax function, which converts the scores into a probability distribution and turns the -inf values into zeros.

Dropout is also used in this step, which randomly ignores values in the hidden layer to prevent overfitting, where the model would train too specifically on the training data. The fraction of values to drop is specified by a dropout rate parameter. The book instructed how to modify the previous class to add the causal mask and dropout features.
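The masking-then-softmax step, plus dropout, can be sketched like this (the dropout rate and sizes are illustrative):

```python
import torch

torch.manual_seed(123)
seq_len = 4
attn_scores = torch.rand(seq_len, seq_len)  # stand-in attention scores

# Mask positions above the diagonal (future tokens) with -inf before softmax.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked = attn_scores.masked_fill(mask, float("-inf"))
attn_weights = torch.softmax(masked, dim=-1)  # future positions become exactly 0

# Dropout randomly zeroes attention weights during training.
dropout = torch.nn.Dropout(0.5)  # illustrative dropout rate
dropped = dropout(attn_weights)
```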

Multi-head attention

The last step is to apply the previous class over multiple heads that each operate independently. This multi-head attention module was first built by stacking multiple causal attention modules, each with its own query, key, and value weight matrices. This implementation processes each attention module sequentially, but it can be implemented more efficiently by running them in parallel.

This improvement was achieved by computing the outputs for all heads at the same time using matrix multiplication. The query, key, and value tensors are reshaped and transposed, then the results are combined after computing attention. The code from the book used small embedding sizes and number of attention heads for users to follow along with, but for reference the smallest GPT-2 model uses 12 attention heads, and the largest uses 25.
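A compact sketch of the batched approach; the causal mask and dropout are omitted here to keep the reshape/transpose logic in focus, and the class is a simplified stand-in for the book's implementation:

```python
import torch
import torch.nn as nn

# All heads are computed in one batched matmul by reshaping Q, K, V from
# (batch, tokens, d_out) into (batch, heads, tokens, head_dim).
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, num_heads):
        super().__init__()
        assert d_out % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        # Split the projection into heads and move the head axis forward.
        q = self.W_q(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_k(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_v(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        weights = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        # Recombine the heads back into a single (b, t, d_out) tensor.
        return (weights @ v).transpose(1, 2).reshape(b, t, -1)

torch.manual_seed(123)
mha = MultiHeadAttention(d_in=8, d_out=8, num_heads=2)
out = mha(torch.rand(2, 5, 8))  # batch of 2 sequences of 5 tokens
```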

Implementing and understanding the attention mechanism was one of the hardest parts of this project, but gave me a real insight into how LLMs are so good at processing and 'understanding' text.

Chapter 4 - Implementing a GPT Model from Scratch to Generate Text

Chapter 4 focuses on all the parts of the architecture of GPT models besides the attention mechanism. The beginning of the book started at a small scale for understanding, but now works with the size of a small GPT-2 model, around 124 million parameters (trainable weights). In this chapter, the book walks through the implementation of transformer blocks, with layer normalization, shortcut connections, feed forward neural networks with GELU activation, and the first text output from the model.

Layer normalization

When training deep neural networks, there is often a vanishing gradient problem, which makes it difficult for the network to adjust its weights and minimize the loss function. To combat this we added layer normalization, which adjusts the outputs of a layer to have a mean of 0 and a variance of 1. We do this by creating a small class that applies layer normalization, with trainable scale and shift parameters.
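A sketch of such a class, in the spirit of (but not identical to) the book's implementation:

```python
import torch
import torch.nn as nn

# Normalize each token's activations to mean 0 and variance 1, then apply
# trainable scale and shift parameters.
class LayerNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps  # avoids division by zero for tiny variances
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift

torch.manual_seed(123)
ln = LayerNorm(5)
out = ln(torch.rand(2, 5))
```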

Feed forward network with GELU activations

The book then covers the implementation of a simple feed forward neural network that expands the input dimension by a factor of four, applies a GELU activation function, then projects back down to match the input size. The GELU activation function is used because its smoothness helps with optimization during training. This network is simple, but helps the model learn and generalize from the larger representation space it creates. Additionally, the input and output dimensions being the same size simplifies the architecture and makes stacking layers much easier.
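A minimal sketch of this feed forward module:

```python
import torch
import torch.nn as nn

# Expand by 4x, apply GELU, project back down to the input dimension.
class FeedForward(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),  # expand by a factor of four
            nn.GELU(),                        # smooth activation
            nn.Linear(4 * emb_dim, emb_dim),  # project back to the input size
        )

    def forward(self, x):
        return self.layers(x)

torch.manual_seed(123)
ff = FeedForward(8)
out = ff(torch.rand(2, 3, 8))  # output keeps the input's shape
```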

Shortcut connections

The vanishing gradient is a problem in neural networks where the gradient gets progressively smaller during backpropagation, which makes it hard to optimize and train in earlier layers. Shortcut connections add the output of one layer to the output of a later layer to create a shorter path for the gradient to flow, as a solution to the vanishing gradient problem. The book walks through code to explain and visually display the vanishing gradient problem, and teaches how to implement shortcut connections.

Transformer block

The next step is creating a transformer block, one of the key features of the GPT architecture. The transformer block combines many of these previously learned features like multi-head attention, layer normalization, dropout, and feed forward layers with GELU activations. At a high level, the self attention mechanism helps to identify relationships between elements of the input sequence, and the feed forward network modifies the individual data at each position.

The book then shows how to implement a transformer block class, which follows the flow of layer normalization, masked multi-head attention, and dropout, with a shortcut connection around them, followed by another layer normalization, the feed forward network with GELU activation, and dropout with a second shortcut connection. The output of this transformer block has the same dimension as the input and will most likely be fed into another layer of the LLM. Transformer blocks like these are repeated in the GPT model and often contain billions of parameters in large models.

Coding the GPT model and generating text

With the transformer block implemented, we then coded the whole GPT model with tokenization and embeddings, 12 transformer blocks, and a final layer normalization and linear output layer that projects each token's representation into a 50,257-dimensional vector of logits. The output is this size because the tokenizer's vocabulary contains 50,257 tokens, so there is one value for each token in the vocabulary.

We should remember LLMs are trained for next word prediction, so we want the most likely token to appear next from the given output. To get this, we apply softmax to turn the output logits to a probability distribution and take the largest value as the most likely next token. This is known as greedy decoding. The index of this element is the token ID we can decode to get the token in text.

This process only gets the next token, so we need to iterate through this process, getting the predicted next token, appending it to the input and then repeating the process until a specified number of iterations is reached. The text generated from this output is not coherent, as expected since we did not train the model in this chapter and were using random weights.
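The generation loop can be sketched as follows; `dummy_model` is a hypothetical stand-in for the trained GPT model, since any function mapping token IDs to logits works for illustration:

```python
import torch

torch.manual_seed(123)
vocab_size = 10

# Hypothetical stand-in for the GPT model: maps (batch, seq) token IDs to
# (batch, seq, vocab_size) logits via a random embedding and linear layer.
embed = torch.nn.Embedding(vocab_size, 16)
head = torch.nn.Linear(16, vocab_size)
def dummy_model(idx):
    return head(embed(idx))

def generate_greedy(model, idx, max_new_tokens):
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :]                        # last position only
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.argmax(probs, dim=-1, keepdim=True)  # greedy decoding
        idx = torch.cat([idx, next_id], dim=1)               # append and repeat
    return idx

out = generate_greedy(dummy_model, torch.tensor([[1, 2, 3]]), max_new_tokens=4)
```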

Following along this chapter and learning about the architecture of the GPT model, I was surprised that many of the features were relatively simple to implement. The part I found interesting was how each of these components impacted the performance of the model and text generation.

Chapter 5 - Pretraining on Unlabeled Data

With the implementation of the GPT model done, chapter 5 moves onto implementing a training function, evaluation techniques, and loading pretrained weights from a larger model.

Evaluation

Evaluation is a big part of AI and language models. The first part the book touches on is calculating a loss metric for the model's generated outputs. The book walks through, step by step, how to calculate the negative average log probability of the target outputs, which is easier to work with in optimization than the raw scores. This is also known as cross entropy loss, a common measure in machine learning of the difference between two probability distributions, here the targets and predictions. Perplexity is commonly used alongside cross entropy loss, and measures how well the model's predicted probability distribution matches the actual distribution in the dataset. The book then goes through the steps of creating training and validation sets to track the loss during training.
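The loss calculation can be sketched directly; the logits and targets here are random stand-ins for real model output:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
logits = torch.rand(5, 10)  # 5 token positions, vocabulary of 10 (toy sizes)
targets = torch.tensor([1, 4, 0, 7, 2])

# Cross entropy = negative average log probability of the target tokens.
log_probs = F.log_softmax(logits, dim=-1)
manual_loss = -log_probs[torch.arange(5), targets].mean()

loss = F.cross_entropy(logits, targets)  # same quantity via the built-in
perplexity = torch.exp(loss)             # perplexity is exp(cross entropy)
```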

Training an LLM

Next, we implement the code for pretraining the model. This code iterates over training epochs, where for each epoch it iterates over the batches, resets the loss gradients, calculates the loss on the current batch, and backpropagates and updates the model weights using the loss gradients. When this code was run we could see the loss significantly decrease with each step, but the validation loss was much higher than the training loss, showing the model overfitting to the training data.
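A skeleton of this loop, shown on a toy linear model with random data rather than the actual GPT model, just to illustrate the structure:

```python
import torch

torch.manual_seed(123)
# Toy stand-ins: a linear model, random data, and MSE loss.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
x, y = torch.rand(16, 4), torch.rand(16, 2)

losses = []
for epoch in range(20):
    optimizer.zero_grad()        # reset gradients from the previous step
    loss = loss_fn(model(x), y)  # loss on the current batch
    loss.backward()              # backpropagate
    optimizer.step()             # update weights using the gradients
    losses.append(loss.item())
```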

Temperature scaling and top-k sampling

Temperature scaling and top-k sampling are decoding techniques we used to make the generated text less deterministic and repetitive. Temperature scaling adds probabilistic selection to the next token prediction, instead of the greedy decoding we previously had. Temperature scaling simply divides the logits by a number greater than 0: a temperature greater than 1 results in more uniform token probabilities and a more diverse selection of the next token, while a temperature less than 1 produces a more confident, peaked prediction. In addition to temperature scaling, top-k sampling limits the choice of the next token to the k tokens with the highest probabilities, where k is a parameter passed in. This is done by assigning a -inf value to all of the tokens outside the top k. Combined, these techniques can increase the diversity and creativity of the text generated by the model.
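Both techniques together, on a hand-picked logit vector for illustration:

```python
import torch

torch.manual_seed(123)
logits = torch.tensor([2.0, 1.0, 0.1, -1.0, 3.0])
temperature, k = 1.5, 3

# Top-k: keep only the k highest logits; everything else becomes -inf,
# so those tokens get zero probability after softmax.
top_vals, _ = torch.topk(logits, k)
filtered = logits.masked_fill(logits < top_vals[-1], float("-inf"))

# Temperature scaling: divide logits by T; T > 1 flattens the distribution.
probs = torch.softmax(filtered / temperature, dim=-1)
next_id = torch.multinomial(probs, num_samples=1)  # probabilistic selection
```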

Loading weights

The last part of the chapter provides instructions on how to load and save model weights in PyTorch, and how to load pretrained weights from OpenAI. This is important to learn as you can download and fine tune larger models without having to pretrain them on your machine, especially if you do not have the compute or cloud access for it.

Chapter 6 - Fine-tuning for Classification

Now that the model is pretrained, the book finishes up with fine-tuning. It covers two popular types of fine-tuning: classification and instruction following. Chapter 6 focuses on classification fine-tuning, where the model classifies input text as one of two groups: spam or not spam. To do this we prepared a new dataset of text examples labeled either spam or not spam, split it, and created data loaders. We then initialized the model with pretrained weights from the GPT-2 small model from OpenAI.

The key part of this chapter is adding a classification head to our architecture. To do this, we replaced the output layer that mapped the hidden representation to 50,257 dimensions with a smaller layer that maps to two classes, 0 and 1, for our two labels. The rest of the architecture stays exactly the same. We then froze the model, making the earlier layers untrainable, and trained only the final layer normalization and the new output layer. We focus only on the last token in the output, as it has the context of every token before it, and we transform this last token's output into class label predictions to evaluate accuracy.
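The head swap and freezing can be sketched like this; the two-layer Sequential is a hypothetical stand-in for the pretrained model:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
emb_dim, vocab_size, num_classes = 16, 50_257, 2

# Hypothetical stand-in: the last layer plays the role of the 50,257-way
# output head of the pretrained model.
model = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.Linear(emb_dim, vocab_size))

# Freeze all pretrained parameters...
for p in model.parameters():
    p.requires_grad = False

# ...then swap in a fresh, trainable 2-way classification head.
model[-1] = nn.Linear(emb_dim, num_classes)

trainable = [p for p in model.parameters() if p.requires_grad]
```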

Like before, we can use the softmax function to turn the output into a probability distribution (although it is not strictly necessary), and then take the index with the maximum value as our output label. We then compared the outputs with the true labels to evaluate the model's classification accuracy. We use cross entropy loss as a proxy for maximizing accuracy, since classification accuracy is not a differentiable function. To train the model on all the supervised data, we followed the same flow as training the self-supervised model in the last chapter.

Chapter 7 - Fine-tuning to Follow Instructions

The last chapter walks through the implementation of fine tuning the model to follow instructions. This is one of the most popular fine-tuning uses, and is used for developing LLMs as chatbots and personal assistants.

Similar to the last chapter, we first prepare a dataset specific to instruction fine-tuning, with entries that have instructions, inputs, and outputs, which we format in the Alpaca prompt style. The Alpaca prompt style has a system prompt first, followed by titled sections for the instruction, input, and output. For instruction fine-tuning, we also implemented a custom collate function to pad the inputs, since they are not all the same size. Once the data is formatted and split and the data loaders are created, we can move on to training.
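A sketch of Alpaca-style formatting; the template wording follows the common Alpaca format, and the helper name is my own:

```python
# Format one dataset entry into an Alpaca-style prompt. Entries without an
# input simply omit the Input section.
def format_alpaca(entry):
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    if entry.get("input"):
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt

entry = {"instruction": "Rewrite the sentence using a simile.",
         "input": "The car is very fast."}
prompt = format_alpaca(entry)
```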

The book then walks through similar steps of loading a pretrained LLM, this time using the medium-size GPT-2 model weights with 355 million parameters, as a larger model is needed for accurate performance on instruction-following tasks. Finally, we train the model and test it with instructions like 'Rewrite the sentence using a simile' and 'What type of cloud is typically associated with thunderstorms?', some with an input field like the first, and some without, like the second. There are multiple ways to evaluate an instruction-following LLM; we developed a method to automate the response evaluations using another LLM.

Conclusion and Takeaways

Completing this project has taught me a lot about the inner workings and architecture of a large language model, and made me realize that I really did not know as much about them before this book as I thought I did. I now feel I can confidently explain any of the concepts in this book to someone else. I also learned how to use PyTorch, and writing up this summary definitely helped reinforce the concepts. It is something I want to continue doing moving forward to improve my writing and my explanations of complex topics.

I highly recommend this book to anyone wanting a deeper understanding of language models, and I plan to read and follow another book by Sebastian Raschka, "Build a Reasoning Model (From Scratch)", in the future.

My next projects are going to be from this list by Ahmad Osman, to dive even deeper into some of the topics in this book and other parts of more modern LLMs, and I will provide write-ups and/or X posts with my learnings from each.