In the past few years, the AI and ML industry has witnessed a meteoric rise in the development and application of NLP systems, as researchers have been able to implement NLP practices in highly flexible and task-agnostic ways for downstream transfer tasks.
Initially, single-layer representations used word vectors that were then fed to task-specific architectures. Next came RNN architectures that used multi-layer representations and contextual state to form better representations. And most recently, we have pre-trained recurrent and transformer language models that have entirely removed the need for task-specific architectures, since these networks can simply be fine-tuned.
Pre-trained transformer language models have proved to be a major turning point in the NLP industry, as they have driven tremendous progress on challenging tasks like question answering, reading comprehension, textual entailment, and much more.
However, despite their advantages, these models have a major limitation: they require task-specific fine-tuning on a task-specific dataset to achieve the desired performance on a task, and those datasets typically run to hundreds of thousands of labeled examples specific to that task.
It goes without saying that removing the requirement for a task-specific dataset and task-specific fine-tuning would be highly desirable and beneficial for the NLP industry, for numerous reasons.
Issues with Existing Pre-Trained Transformer or Recurrent Language Models
- Limiting the Practicality & Applicability
First and foremost, the requirement of a large labeled dataset for each task limits the applicability and practicality of language models. Language models find applications in a wide variety of tasks, ranging from generating a short story, to correcting grammatical errors, to generating examples of a concept. At times, collecting a large supervised dataset with labeled data is a challenging task, especially when the process needs to be repeated for every individual task.
- Exploiting Spurious Correlations in Training Data
The narrowness of the training distribution, coupled with the expressiveness of the model, creates a fundamental potential to exploit spurious correlations in the training data. This potential causes problems for the pre-training plus fine-tuning paradigm, because transformer language models are designed to absorb a large amount of information during pre-training.
Furthermore, work on prior models has indicated that larger models do not necessarily generalize better out of distribution. It has also been indicated that generalization achieved under such a paradigm can result in poor performance, primarily because the model is highly specific to its training data and cannot perform well in situations beyond the scope of that data.
- Comparison with Human Learning
Finally, when compared to transformer language models, humans do not require a large training dataset when learning a majority of language tasks. Most often, a brief directive in natural language or a small demonstration of the language task is adequate for a human to understand and perform the task with a reasonable level of competence.
Humans' ability to adapt has numerous practical advantages, as it allows them to switch between different skill sets or mix them together to perform better during a dialogue, something that's beyond the capabilities of current NLP systems.
Tackling the Issues with Meta Learning & GPT-3
A possible solution to the above challenges is meta learning, a concept in modern ML that allows a model to develop a broader set of skills and pattern-recognition abilities while training, and then use these learned abilities at inference time to rapidly adapt to, or recognize, the required task.
Meta learning is implemented in language model architectures via a technique called "in-context learning", which uses the text input of a pre-trained language model as a task specification. In this process, the model conditions on a natural language instruction, and possibly a few demonstrations of the task, and is then expected to complete the rest of the task by predicting what comes next.
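To make the idea concrete, here is a minimal sketch in plain Python of how such an in-context prompt can be assembled; the `build_prompt` helper and the `=>` delimiter are illustrative choices, not part of any official API.

```python
# A minimal sketch of in-context learning: the task specification lives
# entirely in the text that conditions the model; no weights are updated.

def build_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, optional demonstrations, and the final query."""
    parts = [instruction]
    for source, target in demonstrations:
        parts.append(f"{source} => {target}")
    parts.append(f"{query} =>")  # the model is expected to predict what follows
    return "\n".join(parts)

prompt = build_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
```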
The only major issue with meta learning is that, although it has shown positive potential, it is still inferior to the fine-tuning approach in natural language architectures, and it needs further improvement to become a practical method for tackling language tasks.
In addition to meta learning, another method that's gaining popularity is increasing the capacity of transformer language models. In the past few years, transformer models have witnessed a substantial increase in capacity, from the RNSS18 model with 100 million parameters, to the DCLT18 model with 300 million parameters, the RWC19 model with 1.5 billion parameters, the SSP19 model with 8 billion parameters, the RSR19 model with 11 billion parameters, and the TUR20 model with 17 billion parameters.
Increasing the capacity of the model, i.e. the number of parameters, has historically resulted in improvements in text synthesis, and there are indications that log loss, which correlates with performance on downstream tasks, also follows a smooth trend of improvement with scale.
That brings us to the GPT-3 model, which has 175 billion parameters and which was, at launch, the language model with the highest capacity ever trained. Let's now talk about the GPT-3 model.
An Introduction to the GPT-3 Model
GPT-3 is an autoregressive language model with 175 billion parameters that was released by OpenAI in 2020. GPT-3 is also classified as a large language model and, just like its predecessor the GPT-2 model, is a decoder-only deep learning transformer model that uses an attention-based architecture to generate textual data.
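For readers who want to see what "attention-based, decoder-only" means in practice, here is a minimal NumPy sketch of the causal (masked) self-attention operation at the heart of such models; shapes and names are illustrative, not taken from OpenAI's implementation.

```python
import numpy as np

# Causal self-attention: each position may only attend to itself and to
# earlier positions, which is what makes autoregressive generation possible.

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                            # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 16
x = rng.normal(size=(seq_len, d_model))
w = [rng.normal(size=(d_model, d_head)) for _ in range(3)]
print(causal_self_attention(x, *w).shape)  # (8, 16)
```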
The GPT-3 work measures the model's in-context learning abilities by evaluating the GPT-3 model on over two dozen NLP datasets and multiple novel tasks. For every individual task, the GPT-3 model is evaluated under three conditions,
- Few-Shot Learning or In-Context Learning: In few-shot learning, the GPT-3 model is given as many demonstrations as can fit well into the model's context window.
- One-Shot Learning: In one-shot learning, the model is given only one demonstration.
- Zero-Shot Learning: In zero-shot learning, there are no demonstrations, and only an instruction in natural language is fed to the model.
Broadly speaking, the GPT-3 model achieves promising performance in zero-shot and one-shot settings, and in the few-shot setting, it outperforms state-of-the-art transformer models most of the time. Furthermore, the GPT-3 model performs well in one-shot and zero-shot settings at natural language tasks designed to test on-the-fly reasoning or rapid adaptation, like using novel words in a sentence, unscrambling words, or performing arithmetic operations. On the other hand, when operated in a few-shot setting, the GPT-3 model can generate synthetic news articles that resemble human writing when assessed by human evaluators.
GPT-3 Model: Approach
The GPT-3 model uses a conventional pre-training approach that comprises model, data, and training, and it resembles the pre-training process of the RWC19 transformer language model. The GPT-3 model scales up the model size, the dataset size, and the diversity of the dataset, and increases the length of the training period.
The model also uses an in-context learning approach that once again resembles the RWC19 model's approach, but tweaks things a bit by systematically exploring different settings for learning patterns within the context of the dataset.
So, let's start by exploring these settings and evaluate how the GPT-3 model performs in different settings.
Fine Tuning
Fine-tuning the model has been the conventional approach in transformer language models, and this approach involves updating the weights of a pre-trained model by training it on a supervised dataset that's specific to the desired task, with hundreds of thousands of labeled examples used during the process.
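As a rough illustration of that workflow, here is a minimal PyTorch-style sketch of a fine-tuning loop; `model`, `tokenizer`, and `dataset` are placeholders for a pre-trained model following Hugging Face conventions, a matching tokenizer, and a labeled (text, label) dataset, none of which come from the GPT-3 paper itself.

```python
import torch
from torch.utils.data import DataLoader

# A minimal sketch of conventional supervised fine-tuning: every weight
# of the pre-trained model is updated on task-specific labeled data.
# `model`, `tokenizer`, and `dataset` are assumed placeholders.

def fine_tune(model, tokenizer, dataset, epochs=3, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    model.train()
    for _ in range(epochs):
        for texts, labels in loader:
            batch = tokenizer(list(texts), padding=True, return_tensors="pt")
            loss = model(**batch, labels=labels).loss  # task-specific supervised loss
            loss.backward()                            # gradients flow to all weights
            optimizer.step()
            optimizer.zero_grad()
    return model
```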
The fine-tuning approach is beneficial because it returns strong performance across numerous benchmarks. On the other hand, the main limitation of the fine-tuning approach is that it requires a new, large dataset for every individual task, has the potential to exploit spurious features of the training dataset, can result in unfair comparisons with human performance, and often generalizes poorly out of distribution.
The current scope of the GPT-3 model does not implement the fine-tuning approach, because the focus is on task-agnostic performance, although fine-tuning could in principle be applied to the GPT-3 model in the future.
Few Shot
Few-shot refers to the setting where the GPT-3 model is given a few demonstrations of the task during inference as conditioning, but the weights of the model are not updated. In the few-shot setting, the dataset typically has an example with a context and a desired completion (for example, a French sentence and its English translation). The few-shot setting gives the model K examples of context and completion, then provides the model with one final context, and expects the model to provide the completion.
The major advantage of the few-shot setting is that it significantly reduces the need for task-specific data and also reduces the potential to learn an overly narrow distribution from a large, narrowly fine-tuned dataset. On the other hand, the major disadvantage of few-shot learning is that its results are not yet up to the mark, and can be significantly worse than those of state-of-the-art fine-tuned models.
One Shot
In the one-shot setting, the model is provided only a single demonstration; the rest is similar to the few-shot setting. The reason the one-shot setting is relevant for transformer language models is that, of the three settings, it best resembles the way tasks are communicated to humans. In most tasks, it's common to give one demonstration of the task; otherwise, it might be difficult to understand the context of the task.
Zero Shot
In the zero-shot setting, there are no demonstrations, and the model is given a natural language instruction describing the task. The zero-shot method offers maximum convenience and robustness, and avoids spurious correlations, but it's also the most challenging of the three settings. It's because, in some cases, it's difficult even for humans to figure out the context of a task without seeing a demonstration first.
Regardless, for some tasks, the zero-shot setting is the one that most closely resembles how humans perform natural language tasks.
The above figure compares the few-shot, one-shot, and zero-shot settings on the natural language task of taking an English sentence and translating it into French; the sketch below reproduces the three prompt formats in code.
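As a minimal sketch, the three settings differ only in how many demonstrations precede the final context; the example strings below follow the English-to-French format shown in the GPT-3 paper.

```python
# Prompt text for each setting: only the number of demonstrations changes,
# and in every case the model must simply continue the text.

zero_shot = (
    "Translate English to French:\n"
    "cheese =>"
)

one_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)

few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe peluche\n"
    "cheese =>"
)
```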
GPT-3: Model Architecture
The GPT-3 model uses the same architecture as the one used in the GPT-2 model, and it includes the pre-normalization, modified initialization, and reversible tokenization techniques as they were used in GPT-2, with the exception of using alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
To study the dependency of the model's performance on the model size, the developers trained 8 different model sizes that range over three orders of magnitude, from 125 million to 175 billion parameters, the largest of them being the GPT-3 model itself. Prior work on large language models has indicated that, with a sufficient amount of training data, the scaling of validation loss should follow an approximately smooth power law as a function of model size. Training models of varying sizes allows the developers to test this hypothesis both for downstream language tasks and for validation loss.
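For reference, the power-law form in question can be written as below; this is a sketch following prior scaling-law work (Kaplan et al.), and the constants are empirical fits rather than values reported in this article.

```latex
% Validation loss L as a function of parameter count N (embeddings excluded):
% N_c and \alpha_N are empirically fitted constants; Kaplan et al. report
% \alpha_N on the order of 0.076.
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}
```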
The above figure compares the size and architecture of the 8 different models used for the development of GPT-3. Here, n(params) defines the total number of trainable parameters, n(layers) defines the total number of layers in the model, d(model) defines the number of units in each bottleneck layer, and d(head) defines the dimension of each attention head. The context window for each model is the same, at 2048 tokens.
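As a quick sanity check of how n(params) relates to n(layers) and d(model), a standard transformer block carries roughly 12·d(model)² parameters (four attention projections plus an MLP with a 4x expansion). The sketch below applies that rule of thumb to two published GPT-3 configurations; it is my own illustration rather than a calculation from the paper.

```python
# Rough parameter-count estimate for a decoder-only transformer,
# embeddings excluded: ~12 * d_model^2 parameters per layer.

def approx_params(n_layers, d_model):
    return 12 * n_layers * d_model ** 2

# Published GPT-3 configurations: (n_layers, d_model)
configs = {"GPT-3 Small": (12, 768), "GPT-3 175B": (96, 12288)}
for name, (n_layers, d_model) in configs.items():
    print(f"{name}: ~{approx_params(n_layers, d_model) / 1e9:.1f}B parameters")
# GPT-3 Small: ~0.1B, GPT-3 175B: ~173.9B -- close to the reported sizes
```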
Furthermore, to minimize the transfer of data between nodes, each model is partitioned across the GPUs along both the depth and the width dimensions. The architectural parameters for each model were chosen on the basis of computational efficiency and load balancing in the layout of models across GPUs.
Training Datasets
Datasets used to train large language models have expanded significantly with recent developments, culminating in the Common Crawl dataset, which consists of over a trillion words. A dataset of this size is adequate to train the GPT-3 model without updating on the same sequence multiple times. However, studies and performance analyses indicate that lightly filtered or unfiltered versions of the Common Crawl dataset are of low quality when compared to more curated datasets.
To tackle the issue of the average quality of the dataset, the developers took 3 steps to boost its quality.
- Developers downloaded and filtered a version of the Common Crawl dataset based on its similarity to a range of high-quality reference corpora.
- Developers performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and to preserve the integrity of the held-out validation set as an effective measure of overfitting.
- Developers also added known high-quality reference corpora to the training data to augment the Common Crawl dataset and to further increase its diversity.
The following figure shows the final mixture of datasets used for training the GPT-3 model. The Common Crawl data consisted of over 45 TB of plaintext before filtering, which was reduced to 570 GB after filtering, roughly equivalent to over 400 billion byte-pair-encoded tokens. It's worth noting that datasets viewed as higher quality are sampled more frequently during training, rather than in proportion to their size. As a result, datasets like Books2 and Common Crawl are sampled less than once during training, whereas the other datasets are sampled multiple times. This lets the model accept a small amount of overfitting in exchange for training on higher-quality data.
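A minimal sketch of this quality-weighted sampling is shown below, using the mixture weights reported in the GPT-3 paper (they sum to slightly over 1 due to rounding in the paper's table); the `sample_source` helper is illustrative, not part of any training codebase.

```python
import random

# Each training batch draws from a dataset with probability given by its
# mixture weight rather than its raw size, so small curated corpora are
# seen multiple times while Common Crawl is seen less than once.

mixture = {
    "Common Crawl (filtered)": 0.60,
    "WebText2": 0.22,
    "Books1": 0.08,
    "Books2": 0.08,
    "Wikipedia": 0.03,
}

def sample_source(rng):
    return rng.choices(list(mixture), weights=mixture.values(), k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```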
A significant concern with large language models that are pre-trained on a large amount of internet data, and that have the capacity to memorize and learn a large amount of content, is the potential contamination of downstream tasks by having their development or test sets seen during the pre-training process. To reduce such potential contamination, the developers searched for overlaps with the test and development sets of the benchmarks studied for GPT-3, and attempted to remove those overlaps.
The above image shows the total compute used during the training of the GPT-3 models. Following Scaling Laws for Neural Language Models, the team trained much larger models on fewer tokens than is typical. As a consequence, both the GPT-3 3B model and RoBERTa-Large, which is nearly 10x smaller, took roughly 50 petaflop/s-days of compute during the pre-training process.
Evaluation
For few-shot learning, the model evaluates each example in the evaluation dataset by drawing K examples randomly from that task's training dataset as conditioning, delimited by 1 or 2 newlines depending on the task. For StoryCloze and LAMBADA, the model draws conditioning examples from the development set and evaluates on the test set, because a supervised training set is unavailable. For Winograd, there exists only one dataset, and so the conditioning samples are drawn directly from it.
K can be any value ranging from 0 to the maximum amount allowed by the model's context window, which is n(ctx) = 2048 for all the models and typically fits about 10 to 100 examples. Larger values of K often, but not always, yield better results, which is why, when both a test set and a separate development set are available, the model experiments with a few values of K on the development set and, based on the results, runs the best value on the test set.
Furthermore, on tasks that require selecting the correct completion from multiple options, the developers provide K examples of context plus correct completion, follow it up with one example of context only, and then compare the options on the basis of the LM likelihood of each completion. For tasks that require binary classification, the models often give the options more semantically meaningful names (for example, "True" or "False" rather than 0 or 1) and then treat the task as multiple choice; sometimes the task is also framed similarly to what is done by the RSR model and architecture.
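A minimal sketch of that likelihood-based multiple-choice scoring is shown below, with `completion_logprob` as a stand-in for whatever API returns the summed log-probability of a completion given a context.

```python
# Score each candidate completion under the language model and pick the
# option whose continuation the model finds most likely.

def pick_answer(context, options, completion_logprob):
    scores = {option: completion_logprob(context, option) for option in options}
    return max(scores, key=scores.get)

# Hypothetical usage for a binary task with semantically meaningful names:
# pick_answer("Is the sky green? Answer:", [" True", " False"], completion_logprob)
```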
For tasks that require free-form completion, the model uses beam search with the same parameters as used in the RSR framework: a beam width of 4 and a length penalty of α = 0.6. The model is then scored using F1 similarity, exact match, or BLEU, depending on the standard for the dataset.
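That decoding setup can be reproduced, for illustration only, with the Hugging Face transformers API; GPT-2 stands in below because GPT-3's weights are not publicly available.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Q: Where is the Eiffel Tower located?\nA:", return_tensors="pt")
output = model.generate(
    **inputs,
    num_beams=4,          # beam width of 4
    length_penalty=0.6,   # length penalty alpha = 0.6
    max_new_tokens=20,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```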
Results
The above figure displays the training curves for the 8 models used in the GPT-3 model architecture, as described in the previous sections. Similar to the results from the KMH language model work, the performance of the GPT-3 models follows a power law when training compute is used efficiently, departing only slightly from the law when the trend is extended by two more orders of magnitude. It might occur to people that the improvements in cross-entropy loss are just a result of modeling spurious details of the training corpus. However, the improvements in cross-entropy loss lead to consistent gains in overall performance across a broad spectrum of NLP tasks.
Before evaluating the 8 different models on a wide range of tasks, the datasets are grouped into 8 categories that represent similar tasks. These categories are
- Evaluation on traditional language modeling tasks, and tasks that resemble language modeling like Cloze tasks, or sentence/paragraph completion tasks.
- Evaluation on “closed-book” question answering tasks.
- Evaluating the model’s ability to translate between languages (especially in the one-shot and few-shot settings).
- Evaluating the model’s performance on Winograd Schema-like tasks.
- Evaluating on datasets that involve commonsense reasoning or question answering.
- Evaluating on reading comprehension tasks.
- Evaluating on the SuperGLUE benchmark suite.
- Exploring NLI.
Language Modeling, Completion, and Cloze Tasks
In this section, the GPT-3 model’s performance is evaluated on traditional language modeling tasks, as well as tasks that require predicting a single word of interest, completing a sentence or paragraph, or completing a piece of text. Let’s discuss them in brief detail.
Language Modeling
The GPT-3 model calculates the zero-shot perplexity on the PTB, or Penn Treebank, dataset. Wikipedia-related tasks are omitted because they are already included in the model’s training data, and the one-billion-word benchmark is also omitted because a significant fraction of that dataset lies within the training data. The PTB dataset sidesteps these issues because it predates the modern internet. The largest model in the GPT-3 family sets a new SOTA on the PTB dataset by a noteworthy margin of 15 points, achieving a perplexity of 20.50.
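As a quick reminder of what that number means (my own sketch, not from the paper), perplexity is the exponential of the average per-token negative log-likelihood, so a perplexity of 20.50 corresponds to roughly 3.02 nats of cross-entropy per token.

```python
import math

def perplexity(token_logprobs):
    """token_logprobs: natural-log probabilities of each token in the test text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(perplexity([-3.0, -3.0, -3.06]))  # ~20.5 for ~3.02 nats per token
print(round(math.log(20.50), 2))        # 3.02 nats per token
```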
LAMBADA
The LAMBADA dataset tests the model’s handling of long-range dependencies in paragraphs or texts: the model is asked to predict the last word of a sentence after reading the paragraph for context. Notably, the continued scaling of language models had been yielding diminishing returns on this benchmark.
The GPT-3 model achieves 76% accuracy on LAMBADA, a gain of over 8% over previous best models. Furthermore, LAMBADA demonstrates the flexibility of few-shot learning, as it addresses a problem that occurs classically with this dataset. The required completion in LAMBADA is always the last word of the sentence, but a language model cannot know that, so it assigns probability not only to the correct ending but also to other continuations of the paragraph.
When the examples fed to the GPT-3 model are reframed as a fill-in-the-blank task in the few-shot setting, the model returns an accuracy of over 86%, an increase of over 18% over previous models. The results also indicate that the performance of the model in the few-shot setting improves with model size: although this strategy reduces the accuracy of the smallest model in the GPT-3 family by 20%, it enhances the accuracy of the primary GPT-3 model with 175 billion parameters by 10%.
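The fill-in-the-blank framing looks like the following (format reproduced from the GPT-3 paper); the demonstration teaches the model that exactly one word is wanted.

```python
# Few-shot LAMBADA prompt: the demonstration shows the blank-and-arrow
# format, and the model must fill the blank in the final example.

lambada_prompt = (
    "Alice was friends with Bob. Alice went to visit her friend ___. -> Bob\n"
    "George bought some baseball equipment, a ball, a glove, and a ___. ->"
)
```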
Closed Book Question Answering
Closed-book question answering attempts to measure the GPT-3 model’s ability to answer questions about broad factual knowledge. Because such questions admit a vast number of possible queries, the task is normally tackled using an information retrieval system that finds relevant text, in combination with a model that learns to generate an answer given the question and the retrieved text.
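In the closed-book setting evaluated here, there is no retrieval step: the model answers from its parameters alone, conditioned on a simple question-answer format. The demonstration below is my own illustration of that format, not an example from the paper.

```python
# A one-shot closed-book QA prompt: no retrieved documents, just a
# demonstration of the Q/A format followed by the real question.

closed_book_prompt = (
    "Q: Who wrote the novel 'Nineteen Eighty-Four'?\n"
    "A: George Orwell\n"
    "Q: In which year did the Apollo 11 mission land on the Moon?\n"
    "A:"
)
```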
The above image compares the results of the GPT-3 model with those of different models on different datasets. On the TriviaQA dataset, the model achieves an accuracy score of 64.3% in the zero-shot setting, and accuracy scores of 68% and 71.2% in the one-shot and few-shot settings respectively.
Notably, the GPT-3 model in the zero-shot setting outperforms the fine-tuned T5-11B model by over 14%.
The above figure shows that the performance of the GPT-3 model grows smoothly with an increase in model size, suggesting that language models continue to learn from the dataset as their capacity increases.
Final Thoughts
It would be safe to say that GPT-3 marked a revolutionary phase in the LLM industry, as GPT-3 helped push the limits of what a language model could do. It was the developments made, and obstacles overcome, by GPT-3 that paved the way for the most advanced and accurate large language model to date, GPT-4.