Apple is working on an AI health coaching service, according to reports


Many people aspire to reach their peak health and wellness, but often struggle to achieve it. As a result, the wellness and health market is booming with fitness trackers, sleep trackers, meditation apps and more.

According to a Bloomberg report, Apple is working on an AI-powered health coaching service.

The service, codenamed Quartz, will use Apple Watch data to create tailored coaching programs that can make suggestions regarding better exercising, sleeping and eating practices, according to the report.


Users would have to pay a monthly fee for the service, which would be available in its own app. A launch date for the service was not disclosed, but the report says it is planned for next year.

It makes sense that Apple is capitalizing on the AI craze by introducing innovative ways to incorporate AI into its products, especially in an area like health and wellness that has lots of user interest too.

Apple also has health-related services launching in the short term.

The tech giant plans to make its iPhone health app accessible to more users by rolling out an iPad version of the app, which will allow users to view their health data on a larger screen. The iPad health app will be bundled with iPadOS 17, according to the report.


The company's health app is also anticipated to get new tools for tracking moods and vision conditions, such as nearsightedness.

The initial version of the mood tracker will allow users to log their moods and emotions, answer questions about their day and compare their results over time. In the future, Apple plans to expand it by adding mood recognition via speech as well as other device interactions, according to the report.

Both the iPad health app and the initial mood tracker are expected to be unveiled in June at Apple's Worldwide Developers Conference.


Microsoft makes its AI-powered Designer tool available in preview

Kyle Wiggers

Today, Microsoft Designer, Microsoft’s AI-powered design tool, launched in public preview with an expanded set of features.

Announced in October, Designer is a Canva-like web app that can generate designs for presentations, posters, digital postcards, invitations, graphics and more to share on social media and other channels. It leverages user-created content and DALL-E 2, OpenAI’s text-to-image AI, to ideate designs, with drop-downs and text boxes for further customization and personalization.

“Since October, the AI models have steadily improved, and we’ve worked to weave these powerful capabilities throughout the Designer canvas in even more delightful ways while keeping you in control,” Bryan Rognier, GM at Microsoft’s 365 Consumer division, wrote in a blog post published today.

Now Designer can generate written captions and hashtags relevant for social media posts, offering several suggestions users can choose from. It can also create animated visuals, complete with backgrounds and text transitions, powered by AI.


New features coming to Microsoft’s AI-powered Designer tool. Image Credits: Microsoft

In the future, Designer will gain additional editing features, Microsoft says, including the ability to place an object in a specific spot in a graphic and automatically fill in the rest of a picture. Forthcoming “erase” and “replace background” options, meanwhile, will let users brush over objects, people or backdrops they didn’t intend to be in a graphic.

Designer will remain free during the preview period, Microsoft says — it’s available via the Designer website and in Microsoft’s Edge browser through the sidebar. Once the Designer app is generally available, it’ll be included in Microsoft 365 Personal and Family subscriptions and have “some” functionality free to use for non-subscribers, though Microsoft didn’t elaborate.

Addressing some of the legal questions that’ve sprung up recently around AI-powered image-generation systems, Microsoft says that users will have “full” usage rights to commercialize the images they create with Designer and Image Creator. It’s unclear whether that might change in the future, though, given the ongoing court battles involving OpenAI and other startups commercializing generative AI tools.

Hugging Face Introduces an Open-Source Competitor to ChatGPT: HuggingChat

In an exciting development, Hugging Face has recently launched HuggingChat, a freely available AI chatbot that is poised to give OpenAI’s ChatGPT a run for its money. With a strong focus on personalization and fostering genuine connections between users and their virtual companions, HuggingChat offers a range of text composition services, including programming code.

HuggingChat is accessible online as a standalone platform provided by Hugging Face. Developers who wish to integrate this chatbot into their existing software can seamlessly do so using Hugging Face’s API. While HuggingChat possesses similar capabilities to ChatGPT, it differentiates itself through its unique features.

When queried, HuggingChat explicitly highlights its commitment to personalization and the establishment of an “authentic bond” with users and their avatar friends. Users can choose from pre-existing character avatars or customize their own. Notably, HuggingChat points out its training in emotions and empathy, implying that it does not adopt a “neutral stance” like ChatGPT. Although a brief experiment suggests that HuggingChat has ambitious goals, it has not yet reached the same level as ChatGPT. In my own experience with HuggingChat, many answers were long-winded and sounded like they came from a madman; in other cases, it seemed a little confused.

However, as open-source software, it has the potential for rapid improvement.

To create HuggingChat, Hugging Face collaborated with the nonprofit organization LAION and utilized their Open Assistant AI model. LAION is known for building the dataset that was used to train Stable Diffusion, the image generator associated with Stability AI. Hugging Face, known for its open-source generative AI models such as BLOOM, has expanded into conversational interfaces with the introduction of HuggingChat.

In a recent development, Stability AI released the first batch of their new open-source StableLM models, designed for writing text and code. Although HuggingChat does not explicitly state its intention to replace ChatGPT, the Open Assistant team openly expresses their desire to go beyond mere replication.

The GitHub page of Open Assistant provides further insight into their ambitious vision: “We aim to create the assistant of the future, capable of not only generating emails and cover letters, but also performing meaningful tasks, utilizing APIs, dynamically researching information, and much more. Additionally, we strive to make the assistant personalized and extendable by anyone. To achieve this, we prioritize an open and accessible approach, ensuring that the assistant is lightweight and efficient enough to operate on consumer hardware.”

The release of HuggingChat marks an important milestone in the development of AI chatbots, showcasing the continuous advancements in the field. With its emphasis on personalization and potential for rapid improvement, HuggingChat has the potential to shape the future of AI-driven conversational interfaces, making them more versatile and accessible to a wider audience.

The Unite.ai team is excited about this release and will continue testing various prompts.


Best Architecture for Your Text Classification Task: Benchmarking Your Options


In our previous article, we covered a variety of approaches to building a text classification model based on what modern NLP currently has to offer.

With old-school TF-IDF approaches, pre-trained embedding models, and transformers of various shapes and sizes to choose from, we wanted to give some practical advice based on our own experience. Which models are best suited for different situations? What are some use cases you can find in your own line of work?

To add extra flavor, we want to show you a real-life example of benchmarks for those different approaches and compare them using a dataset we chose for this quick follow-up article.

Describing the Dataset and Task

To illustrate our ideas, we chose The Twitter Financial News, an English-language dataset containing an annotated corpus of finance-related tweets. It’s commonly used to build finance-related content classification models that sort tweets into a number of topics.

It’s a medium-sized dataset, which is perfect for us to illustrate how different models perform. It is also fairly diverse, and its size allows us to train and evaluate models relatively quickly.

What’s interesting about this domain is that financial language is usually crisp and laconic. There are plenty of proper nouns describing brands, terms, and related entities, and the models need to learn to distinguish them from common nouns with completely different meanings. Intuitively, fine-tuning pre-trained generic-language models in this domain should boost overall performance and accuracy.

The dataset consists of around 21,000 items. Not too small, it’s also not too large, making it perfect for showing off the advantages and disadvantages of each model and approach. Let’s come back to this once we have the results.

And finally, the dataset has 20 classes. It’s no common classification task, where you have to distinguish between a handful of sentiment classes and emotional tones. There’s an imbalance too. With a 60x+ difference between the most and least frequent classes, some approaches can be expected to underperform.

Let’s see how different models will perform in our benchmarking test.

Describing the Approach

Based on our previous article, we chose FastText, BERT, RoBERTa (with second-stage tuning), and GPT-3 to assess performance and efficiency. The dataset was split into training and test sets with 16,500 and 4,500 items, respectively. After the models were trained on the former, their performance and efficiency (inference time) were measured on the latter.

To train a FastText model, we used the fastText library with the corresponding command line tool. We prepared the dataset by inserting labels into texts with the proper prefix, ran the fasttext supervised command to train a classifier, and waited a couple minutes to produce the model on a CPU-only machine. The next command, fasttext predict, gave us predictions for the test set and model performance.
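As a rough illustration of the data preparation step, here is a minimal Python sketch that formats examples with the `__label__` prefix that fastText's supervised mode reads. The helper names, the sample tweets, and the label names are hypothetical and are not taken from our actual pipeline.

```python
def to_fasttext_line(text: str, label: str) -> str:
    """Format one example as '__label__<label> <text>' on a single line."""
    # fastText reads one example per line, so collapse newlines and repeated spaces.
    clean = " ".join(text.split())
    return f"__label__{label} {clean}"


def write_fasttext_file(examples, path):
    """Write (text, label) pairs to a training file for `fasttext supervised`."""
    with open(path, "w", encoding="utf-8") as f:
        for text, label in examples:
            f.write(to_fasttext_line(text, label) + "\n")


# Hypothetical sample rows; real rows would come from the dataset.
samples = [
    ("Fed raises rates by 25 bps", "central_banks"),
    ("$AAPL beats Q2 earnings\nestimates", "earnings"),
]
lines = [to_fasttext_line(t, l) for t, l in samples]
```

A file written this way could then be passed to something like `fasttext supervised -input train.txt -output model`, followed by `fasttext predict model.bin test.txt`; the file names here are placeholders.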

As for transformers, we chose three slightly different models to compare: BERT (more formally, bert-base-uncased), RoBERTa-large, and an adapted version of the latter tuned for sentiment classification on a couple of finance-related datasets (check it out on the HuggingFace website). We used the transformers library for our experiments, though it entails writing some code to actually run training and evaluation procedures. A single machine with an A100 GPU handled training, which took 20–28 minutes until early stopping conditions were met for each model. The trained models were stored in an MLflow registry.

To train a classifier based on the GPT-3 model, we referred to the official documentation on the OpenAI website and used the corresponding command line tool to submit data for training, track its progress, and make predictions for the test set (more formally, completions, a better term for generative models). Since the work itself happened on OpenAI’s servers, we didn’t use any particular hardware. It only took a regular laptop to create a cloud-based model. We trained two GPT-3 variations, Ada and Babbage, to see if they would perform differently. It takes 40–50 minutes to train a classifier in our scenario.

Once training was complete, we evaluated all the models on the test set to build classification metrics. We chose macro average F1 and weighted average F1 to compare them, as that let us estimate both precision and recall in addition to seeing if dataset imbalance influenced the metrics. The models were compared on their inference speed in milliseconds per item with a batch size of one. For the RoBERTa model, we also include an ONNX-optimized version as well as inference using an A100 GPU accelerator. Measuring the average response time from our tuned Babbage model gave us the GPT-3 speed (note that OpenAI applies some rate limiters, so the actual speed might be lower or higher depending on your terms of use).
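To make the difference between the two metrics concrete, here is a small pure-Python sketch of macro-averaged and support-weighted F1. It is an illustration only, not the evaluation code we ran; the toy labels are made up, and classes are taken from the true labels only, which is a simplifying assumption.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label in y_true."""
    scores = {}
    for label in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by how often each class occurs."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy imbalanced example: 8 items of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10  # a classifier that always predicts the majority class
```

On this toy data the weighted score stays fairly high while the macro score drops sharply, because the ignored minority class counts as much as the majority class in the macro average. That gap is what makes the pair useful on an imbalanced dataset.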

Results

How did the training work out? We arranged the results in a couple tables to show you the end product and the effect we observed.


What caught our eye first is that fasttext lagged far behind. With that said, it took minimal resources in terms of computation, time, and training, and it gave us a low bar benchmark.

How about the transformers? As expected, RoBERTa delivered better results than BERT, which is easy to attribute to the size advantage it had. It’s also generally better with domain-specific classification tasks. To be fair, we specifically selected a large RoBERTa architecture for this comparison, and the base RoBERTa model might have performed similarly to BERT despite differences in the underlying corpus and training methodology.

The tangible gap between the F1 metrics for BERT and RoBERTa could also have been caused by the fact that we’re dealing with a fairly large number of classes. The dataset has imbalances that larger models tend to capture better. But that’s just our suspicion and proving it would require more experimentation. You can also see that the domain-pretrained RoBERTa offered a tiny accuracy boost, though it’s insignificant. It’s hard to say if the pre-trained domain-tuned model was actually worthwhile for our experiment.

Next comes GPT-3. We selected the Ada and Babbage models for a fair comparison with BERT and RoBERTa-large since they have excellent parameter sizes that grow gradually (from 165 million parameters in BERT and 355 million in RoBERTa-large to 2.7 billion in Ada and 6.7 billion in Babbage) and can show whether the model size really gives a proportional performance boost. Surprisingly, Ada and Babbage both deliver almost the same metrics, and they actually lose to RoBERTa even without domain-specific pre-training. But there’s a reason for that. Remember that GPT-3 API-accessible models actually give users a generative inference interface, so they try to predict a token that would classify each example in the classification task. RoBERTa and other models from transformers, on the other hand, have the last layers of their architecture configured correctly for classification. Imagine a proper logit or softmax at the end that returns the likelihood of all the classes for any data item you pass to it. While the huge GPT-3 would be sufficient to tackle classification for one of 20 classes by generating the right token class, it’s overkill here. Let’s just not forget that the GPT-3 model is fine-tuned and accessed with just three lines of code, unlike RoBERTa, which takes work to roll out on your architecture.


Let’s now finally compare the models and their inference setups in terms of their request execution speed. Since we’re not just training them for performance and accuracy, we need to take into account how fast they return us their inference for new data. We clocked the online synchronous requests to the models and tried to understand the best niche for each.
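For readers who want to run a similar measurement, the sketch below shows one way to time synchronous, one-item-at-a-time requests in Python. The `dummy_predict` function is a stand-in we made up; in practice it would wrap a call to your own model or API client, and the warm-up count is an arbitrary choice.

```python
import time


def mean_latency_ms(predict, items, warmup=3):
    """Average wall-clock milliseconds per item, calling predict one item at a time."""
    for item in items[:warmup]:
        predict(item)  # warm-up calls are not timed
    start = time.perf_counter()
    for item in items:
        predict(item)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(items)


# Stand-in for a real model call; replace with your own client or model.
def dummy_predict(text):
    return len(text) % 20


latency = mean_latency_ms(dummy_predict, ["some tweet text"] * 100)
```

For a remote API, a measurement like this includes network round trips as well as the model's own compute time.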

The winner here is fasttext. Its accuracy, however, forces us to keep moving down the list.

Between the RoBERTa and GPT-3 setups, we can see that GPT-3 is relatively fast despite being the largest, especially given that its response time includes two-sided network communication to their API endpoint. The actual inference here is small. That’s obviously good, especially since this is a pretty simple solution to set up, fine-tune, and implement model calls for. While it can be expensive, especially if you plan on sending a lot of data frequently, the cost-benefit decision is up to you.

The GPU-hosted version is the winner among the RoBERTa setups. The GPUs add a huge performance boost to inference computations, though hosting your model server on GPU machines might price the project out of your budget. Rolling out a GPU-based model server can also be tricky, especially if you haven’t done it before.

You also need to remember that even though all of these setups return results quickly, you shouldn’t forget to do some planning and break down how you plan to use the models in your project. Real-time inference or asynchronous batch requests? Accessed over the internet or within your local network? Do you have overhead for your business logic operations on top of the model response? All of that can add far more time to each request than the actual model inference calculation itself.

Conclusions and Follow-up Ideas

What have we learned? We tried to demonstrate a real-life example of the balance between the difficulty of running various models, their resulting accuracy metrics, and their response speed when they are ready to be used. Obviously, figuring out what to use when and how given your project is a challenge. But we hope to leave you with some guidance: there’s no silver bullet when it comes to GPT models. We all have to count our money as well, especially when it comes to machine learning.

Here at Toloka, we’re working hard on a platform that will enable users to train, deploy, and use a transformer like RoBERTa with the same three API calls as GPT-3 API.

In our next article, we’ll run a couple more experiments on how to mitigate the effects of imbalanced datasets and upsample or downsample classes for balance. Our suspicion is that the GPT-3 generative approach will perform better than RoBERTa-large. We’ll also discuss how these results might change if we take on a much smaller dataset, and we’ll point out exactly when and where GPT-3+ models will outperform all the others in classification tasks. Stay tuned and check out more of our work over at the Toloka ML team’s blog.
Aleksandr Makarov is a senior product manager in Toloka.ai leading the product development of Toloka.ai ML platform, a former healthtech entrepreneur and co-founder of Droice Labs


What is HuggingChat? Everything to know about this open-source AI chatbot


Quantive Singularity: A Breakthrough in AI-Driven Strategic Intelligence

As a foremost provider of strategy execution software and services, Quantive has announced its latest ground-breaking development, Quantive Singularity. This strategic intelligence platform, powered by artificial intelligence, is designed to empower businesses to make well-informed decisions as they pursue their strategic goals. In an increasingly competitive and uncertain market, leaders need access to timely insights, forecasts, and intelligence to guide their organizations towards success. Quantive Singularity caters to these requirements, enabling leaders to gain a deeper understanding of their company’s performance, manage risks proactively, and make data-driven decisions.

Revolutionizing Decision-Making with AI-Driven Strategic Insights: Quantive’s CEO, Ivan Osmak, highlights that the ability of an organization to adapt quickly and deliver ambitious results is essential for thriving in today’s business landscape. Quantive Singularity provides leaders with crucial information regarding growth opportunities, the root causes of issues, and suggested actions to address challenges, all conveniently available whenever needed.

The platform offers senior leaders real-time updates on their strategic priorities and improved organizational visibility. Specifically, Quantive Singularity provides:

  1. Domain-specific machine learning that forecasts an organization’s performance against its objectives, ensuring businesses can anticipate future risks and challenges.
  2. AI-driven analysis of strategic progress and vital business KPIs, revealing the reasons behind a company’s successes and difficulties, which in turn enables business leaders to make informed decisions and adjust strategies to continue pursuing their goals.
  3. The discovery of necessary actions to expedite progress, eliminate obstacles, and align resources with strategic outcomes, pushing the business forward and helping it achieve strategic objectives in spite of any hurdles.

Innovations and the Road Ahead: The introduction of Quantive Singularity follows several advancements aimed at delivering increased value to its customers. In March, the company acquired AuxinOKR and launched Quantive Consulting, which helps clients implement organizational change management and operationalize transformation across all levels of their business. Furthermore, in February, Quantive integrated new AI capabilities into its flagship Quantive Results platform.

With a mission to help organizations create greater strategic agility and excel at execution, Quantive offers two primary products: Quantive Results, a leading strategy execution platform based on the Objectives and Key Results (OKR) methodology, and Quantive Signals, a business observability platform. These offerings enable over 2,000 customers to close the gap between strategy and execution, allowing them to achieve their best possible outcomes. The company, initially known as Gtmhub, has continually evolved to cater to the ever-changing business landscape.

Five Years of GPT Progress


In this article, I discuss the generative pre-trained transformer (GPT) line of work, and how it has evolved over time. I focus on the SOTA models, and the differences between them. There are a bunch of different articles summarizing these papers, but nothing that I’m aware of that explicitly focuses on the differences between them.

I focus on the GPT line of research as that’s what’s driving the current fever pitch of development. There’s a ton of prior work before large GPTs (e.g., the n-gram models from the 2000s, BERT, etc.) but this post is super long, so I’m gonna save those for future articles.

GPT

Abstract

The first GPT paper is interesting to read in hindsight. It doesn’t appear like anything special and doesn’t follow any of the conventions that have developed. The dataset is described in terms of GB rather than tokens, and the number of parameters in the model isn’t explicitly stated. To a certain extent, I suspect that the paper was a side project at OpenAI and wasn’t viewed as particularly important; there’s only 4 authors, and I don’t remember it particularly standing out at the time.

The architecture is remarkably unchanged compared to GPT-3:

  • Decoder-only transformer, with 12 layers, 768 embedding dimension, 12 attention heads, and a feed-forward dimension of 3072 (4x the embedding dimension).
  • They use Adam, with a warm up, and anneal to 0 using a cosine schedule.
  • Initialize weights to N(0, 0.02), using BPE with a vocab of 40000 merges.
  • Activations are GELUs.
  • Context of 512
  • 117M parameters
  • Learned position embedding, not the sinusoidal ones from “Attention is all you need”.

The number of parameters isn’t explicitly discussed, but appears to be roughly 120M, easily enough to fit on a single V100 or a standard consumer GPU. As a rough estimate: 120M parameters for the model plus 240M for the optimizer gives 360M values; assuming each is a float32, this takes up 4 bytes * 360M = 1440MB, or about 1.4GB.
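As a sanity check on those numbers, here is a back-of-the-envelope Python sketch that counts parameters from the architecture listed above. It assumes biases on every linear layer, two layernorms per block, and an output layer tied to the token embedding, and it uses the rounded 40000-token vocabulary from the list, so treat the result as an approximation rather than the paper's exact figure.

```python
def gpt_param_count(n_layer, d_model, d_ff, vocab, n_ctx):
    """Approximate parameter count for a decoder-only transformer (tied output embedding)."""
    embeddings = vocab * d_model + n_ctx * d_model   # token + learned position embeddings
    attention = 4 * d_model * d_model + 4 * d_model  # Q, K, V, output projections with biases
    mlp = 2 * d_model * d_ff + d_ff + d_model        # two linear layers with biases
    layernorms = 2 * 2 * d_model                     # two layernorms per block (gain + bias)
    return embeddings + n_layer * (attention + mlp + layernorms)


params = gpt_param_count(n_layer=12, d_model=768, d_ff=3072, vocab=40000, n_ctx=512)

# Memory estimate from the text: weights plus two Adam moment buffers, all float32.
rounded_params = 120_000_000
training_values = 3 * rounded_params            # 120M weights + 240M optimizer state
training_megabytes = 4 * training_values / 1e6  # 4 bytes per float32 value
```

Under these assumptions the count comes out at roughly 116M, in line with the ~117M figure, and the memory estimate reproduces the 1440MB above. Gradients and activations are ignored here, so real training memory would be higher.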

They use the BooksCorpus dataset (~20M tokens), training for 100 epochs with a batch size of 64. 20M tokens is a very small dataset by modern standards, as is a batch size of 64.

The most surprising thing compared to modern GPTs is that they train for 100 epochs. Modern GPTs rarely ever see repeated data, and if they do, they typically only see certain datapoints a small number of times (2-4x), and the entire dataset is never repeated 100x.

GPT-2

Abstract

GPT-2 is where the language models start to get big. This is the first time that OpenAI trains a model with >1B parameters. We start to see scale as a primary concern; in GPT, the authors trained a single model, but here, the authors train a range of models, with sizes ranging from GPT to 10x GPT (which is the actual GPT-2 model).

The differences in architecture compared to GPT are as follows:

  • They layernorm the inputs and add an additional layernorm to the output of the final self-attention block
  • Weights are scaled by layer by 1/sqrt(n)
  • Vocabulary of ~50k (up from ~40k)
  • Context of 1024 (up from 512)
  • Batches of 512 (up from 64)
  • Largest model is 1.5B parameters

The dataset is much, much bigger, going from roughly 20M tokens (4GB) of data consisting of publicly available books, to 9B tokens (40GB) of text scraped from the internet (WebText).

It’s unclear if they trained the model for 100 epochs as before; they say they followed the same training procedure, so presumably they did. Again, this is a significant departure from later work.

Nothing here is particularly different from GPT; most of the changes are related to making the model bigger. The only other changes are the layernorm changes and the weight scaling, which don’t seem to make a big difference (although, as always, more ablations would be nice).

GPT-3

Abstract

Here is where the era of truly large language models began, and the current AI excitement took off. In the paper, the authors train 8 models, varying from 125M parameters (“GPT-3 Small”) to 175B parameters (“GPT-3”).


For each of the models, the architectures are identical to GPT-2 with the exception that they use “alternating dense and locally banded sparse attention patterns in the layers of the transformer.” The sparse attention here refers to the attention mechanism introduced in the Sparse Transformer, which lets attention scale proportional to $O(n \sqrt{n})$ (where $n$ is the context length). The standard dot-product attention mechanism scales proportional to $O(n^2)$, so this is a substantial gain. I would have loved a proper ablation to see what difference sparse vs dense attention makes, but alas.
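To get a feel for the size of that gain, here is a tiny Python sketch comparing the two growth rates. It assumes the Sparse Transformer's n·sqrt(n) cost versus n² for dense attention, and it ignores constant factors, so it only illustrates the asymptotics. The function names are mine.

```python
import math


def dense_attention_cost(n):
    """Pairwise score count for full attention: every position attends to every position."""
    return n * n


def sparse_attention_cost(n):
    """Score count under the assumed n * sqrt(n) sparse pattern."""
    return n * math.sqrt(n)


n = 2048  # GPT-3's context length
ratio = dense_attention_cost(n) / sparse_attention_cost(n)
```

The ratio is simply sqrt(n), so about 45x at a context of 2048. Since GPT-3 alternates dense and sparse layers, the saving over the whole network would be smaller than this per-layer figure.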

I’m very curious why they used sparse attention. Reproductions and later papers exclusively use dense attention. As this paper came before FlashAttention and some of the other algorithmic innovations that make dense attention faster, maybe this was a computational bottleneck? It’s really unclear.

They don’t provide any detail about the computational architecture, i.e. how they distributed the model. The authors claim it’s because it doesn’t really matter, but I think it was restricted for competitive reasons, as it makes the paper much more difficult to reproduce. Megatron, which I’ll discuss later, was highly influential because they went into detail about how they made model parallelism work for their GPT.

What I find really interesting about the GPT-3 paper is that I don’t think this gets published in a top journal (Nature/Science), maybe not even NeurIPS. This isn’t a critique of GPT-3; it’s a critique of the modern conference circuit, and if anything, a celebration of the culture that OpenAI has. Most of the conference publishing circuit is driven by novelty, even if it’s not what we need. The GPT-3 paper, however, was a largely engineering-driven paper; they made the model bigger and it worked much better! That’s not novel from a research perspective, but is transformative from an application perspective.

This is particularly problematic because we know that adding complexity to our models increases performance (see: R^2 vs adjusted R^2 for simple linear models). Because of the need for novelty, there are many research projects that don’t get pursued because they’re “only” engineering projects, or they “only” do hyper-parameter tuning and wouldn’t be able to get published, even if they had impressive performance improvements. That OpenAI went against the grain here is a credit to them.

This is a strength of OpenAI (and Stability.ai, Midjourney, basically everywhere that’s not FAIR/Google Brain/Deepmind/etc). You could alternatively frame it as a weakness of the more academic labs that have promotion/performance review policies driven by publications.

Jurassic-1

PDF

I wasn’t sure whether or not to include Jurassic-1. It’s a model from the Israeli tech company AI21 Labs. I haven’t heard a lot about them, but the paper’s cited by a bunch of the papers later on in the article; they trained a 178B parameter model that outperformed GPT-3 in a few categories, and was faster for inference. It’s impressive that they’re competing with DeepMind, OpenAI, Nvidia, etc. despite only having raised <$10M at the time. They made a zero-shot and few-shot test suite publicly available.

Like many other papers, they don’t go into detail about the engineering behind training a large model (178B parameters) over 800 GPUs.

The paper is remarkably sparse on details, which I suspect was done for competitive reasons, just like GPT-4.

Facebook is the only company to go into detail about their experiences training a 175B parameter model, just like Nvidia is the only company to go into detail about the computational architecture required to train an LLM over many GPUs (see: the Megatron paper, next). In both cases, the companies are commoditizing their complements and strengthening their main lines of business by making it easier to train large models.

Jurassic uses a different architecture from GPT-3, but again, doesn’t go into much detail:

  • 76 layers (vs 96 layers for GPT-3)
  • They use the SentencePiece tokenizer, with a large vocabulary of 256K (vs GPT-3 which used BPE w/ ~50k tokens).

Neither of these changes are material, in my opinion. I think what we’re seeing is that there’s a relatively large degree of freedom in model architectures which produce similar results. This is borne out by their evaluation, which has results similar to GPT-3 (better in some categories, worse in others), although Jurassic-1 is faster for inference due to being shallower.

We’re starting to see a consistent pattern emerge:

  • Papers introduce a bunch of changes, their own dataset, and have a new SOTA
  • but they don’t do a proper ablation, so it’s tough to understand what was important and what drove the improvements

GPT-2, GPT-3, Jurassic-1, etc. all did this.

Megatron-Turing NLG

Megatron was a highly influential paper that introduced efficient model-parallel architectures. If you’re interviewing for an LLM job today, you’re going to be expected to be familiar with it. Megatron introduced tensor parallelism, a variant of model parallelism that splits the models to allow for intra-layer model parallelism, achieving 76% of the efficiency of a single-GPU baseline (although the baseline is only 30% of peak FLOPS).

Prior to Megatron, the published SOTA for model parallelism was to use model pipelining, e.g. GPipe. However, this was difficult to do and not well supported by code. There were attempts to support tensor parallelism, e.g. Mesh-Tensorflow, which introduced a language for specifying a general class of distributed computations in TensorFlow, but nothing had really dominated. Interestingly, the first author had just left DeepMind 1 year before this was published, so this was possibly his first project at Nvidia.

Megatron has the realization that, if you have a neural network like this:

$$Y = f(XW)$$

and you split $W = [W_1, W_2]$, i.e. along the columns, then $Y = [f(XW_1), f(XW_2)]$ (because $f$ is applied elementwise), so you don’t need to do any synchronization to calculate $Y$. Consequently, the only points where you need synchronization (all-reduces) in the transformer are:

  1. In the forward pass, to concatenate the model activations after the MLP block before adding dropout
  2. In the backwards pass, at the start of the self-attention block.
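The column-split identity is easy to check numerically. The following is a toy pure-Python sketch, not Megatron's implementation: the matrices, the ReLU stand-in for the nonlinearity, and all helper names are assumptions made for illustration.

```python
# Toy check of column-wise tensor parallelism:
# f(X W) equals concat(f(X W1), f(X W2)) when f is applied elementwise.
# All names and values below are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Elementwise nonlinearity (a stand-in for GeLU)."""
    return [[max(0.0, v) for v in row] for row in m]

def split_columns(m, k):
    """Split a matrix into its first k columns and the remaining columns."""
    return [row[:k] for row in m], [row[k:] for row in m]

def concat_columns(left, right):
    """Concatenate two matrices with the same number of rows, column-wise."""
    return [l + r for l, r in zip(left, right)]

X = [[1.0, -2.0], [0.5, 3.0]]                        # activations, 2 x 2
W = [[1.0, -1.0, 0.5, 2.0], [0.0, 1.0, -1.5, 1.0]]   # weights, 2 x 4

# Single-device reference.
Y_full = relu(matmul(X, W))

# "Two devices": each holds half of W's columns and computes independently.
W1, W2 = split_columns(W, 2)
Y1 = relu(matmul(X, W1))
Y2 = relu(matmul(X, W2))
Y_parallel = concat_columns(Y1, Y2)

print(Y_parallel == Y_full)
```

Neither half needs anything from the other until the outputs are combined, which is why communication is only required at the block boundaries.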

Now, I strongly suspect this is what GPT-3 and Jurassic-1 both did, but neither went into detail about the specific parallelism models they used, other than to say (from GPT-3):

To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network.

Presumably, this style of parallelism is what is meant by “model parallelism within each matrix multiply,” as I find it hard to imagine what else they could mean.

Gopher


Gopher was an LLM trained by DeepMind. Interestingly, the lead author joined OpenAI shortly after it was published, along with a few of the coauthors. The architecture was the same as GPT-2, except:

  • They use RMSNorm (instead of layernorm)
  • Use relative positional encoding scheme from Transformer-XL (while GPT-* used a learned positional embedding)
  • They use SentencePiece (instead of BPE). This seems to be an Alphabet thing; many of the Alphabet papers use SentencePiece, while most of the non-Alphabet world uses BPE.
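To make the first difference concrete, here is a minimal sketch of the two normalisations on a single vector. It omits the learned gain and bias, and the function names and `eps` value are my own assumptions rather than anything taken from the Gopher paper.

```python
import math

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned gain/bias: centre on the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm without learned gain: divide by the root mean square, no centring."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(x)
rms_out = rms_norm(x)
print(ln_out)
print(rms_out)
```

RMSNorm skips the mean subtraction, so it only rescales the vector: ratios between entries are preserved, and there is one fewer reduction to compute.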

The paper was very interesting from a computational perspective, as they went into detail about how they trained their model and made it work:

  • They used optimizer state partitioning (ZeRO)
  • Megatron-style model parallelism
  • And rematerialization/gradient checkpointing to save memory.

These are all now the standard techniques used to train large models. To the best of my knowledge, Gopher was the first paper to put all of these together and release details about doing so publicly.

It’s interesting: often, big labs don’t include details for competitive reasons. Here, because DeepMind was (arguably) behind, they went into extensive detail. I think we’ll see this increase with LLM research from everyone that’s not OpenAI/Anthropic, as the others don’t live/die by the commercial success of their API, and have strong incentives to make it easier for others to train large models (and thereby commoditize their complements).

For the paper, DeepMind built a dataset called MassiveText.

Interestingly, this is much smaller than the dataset OpenAI used for GPT-3. GPT-3 had roughly 45TB of text, while MassiveText “only” had about 10.5TB.

They used this dataset to train a large model on 300B tokens. The dataset consists of 2.343 trillion tokens, so this is only 12.8% of it, a much smaller subset. This is interesting to compare to the earlier GPTs, which, if you recall, used 100 epochs (so they saw each token in the dataset 100 times, while Gopher saw only about 13% of its tokens, once)!

The Gopher appendices have some great work; someone finally did ablations! They looked at:

  • Adafactor vs Adam, and found that Adafactor was much less stable
  • Lower-precision training, trying runs with float16, bfloat16, float32, RandRound, and using bfloat16 parameters with float32 in the optimiser state (rounding randomly). They found that keeping a float32 copy of the parameters only for the optimiser updates mitigated the performance loss while still saving a substantial amount of memory.
  • Scaling context length; they show how performance increases as the context length increases. Improvements see diminishing returns, but performance consistently improves. Performance looks roughly proportionate to $\sqrt{n}$ (where $n$ is the context length).
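For intuition on the "rounding randomly" idea in the lower-precision ablation, here is a small sketch of stochastic rounding from float32 to bfloat16. It is a generic bit-level trick for finite, moderately sized inputs, not the scheme from the Gopher paper, and the helper names are hypothetical.

```python
import random
import struct

def float32_bits(x):
    """Return the 32-bit pattern of x after rounding it to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits):
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def truncate_to_bfloat16(x):
    """Keep only the top 16 bits of the float32 encoding (rounds toward zero)."""
    return bits_to_float(float32_bits(x) & 0xFFFF0000)

def stochastic_round_to_bfloat16(x, rng):
    """Round away from zero with probability proportional to the discarded bits."""
    noisy = float32_bits(x) + rng.randrange(1 << 16)
    return bits_to_float(noisy & 0xFFFF0000)

rng = random.Random(0)
x = 1.001  # lies between the bfloat16 neighbours 1.0 and 1.0078125
samples = [stochastic_round_to_bfloat16(x, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)

print(truncate_to_bfloat16(x))  # deterministic truncation
print(sorted(set(samples)))     # the distinct values stochastic rounding produced
print(mean)                     # average over many stochastic roundings
```

Deterministic truncation would silently drop a small update like the 0.001 here every time, whereas random rounding keeps it in expectation. That is the motivation for either rounding randomly or keeping a float32 copy for the optimiser updates.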

It’s really nice to see detailed empirical work like this— it’s a welcome change from the other papers that failed to do this.

Chinchilla


Chinchilla is an incredibly influential paper that established scaling laws. It’s one of my favorite papers from the last few years, as it actually does science in a way that physicists would agree with. One answer to “is something science” is to say, if you were to meet a historical scientist in your field, could you teach them something? And if you brought Chinchilla to, say, Radford et al. in 2017, it would advance their work by several years.

Chinchilla trained over 400 GPT-style transformers, ranging in size from 70M to 16B parameters, and fit the following equation (N is the number of parameters in the LM, and D is the number of tokens in the dataset):

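$$\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Choosing $(A, B, E, \alpha, \beta)$ to minimize

$$\sum_{\text{runs } i} \mathrm{Huber}_{\delta}\left(\log \hat{L}(N_i, D_i) - \log L_i\right)$$

where $L_i$ is the observed loss of run $i$ and the Huber loss is

$$\mathrm{Huber}_{\delta}(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le \delta \\ \delta\left(|x| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$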

Here, we can think of E as the “irreducible loss” from the dataset, i.e. the loss if we trained an infinitely large model on an infinite stream of tokens. The authors find that the optimal model is (from nostalgebraist’s post on the implications of Chinchilla):

$$L(N, D) = 1.69 + \frac{406.4}{N^{0.34}} + \frac{410.7}{D^{0.28}}$$

The implication here is that the model size & data size matter roughly equally, which is interesting, given how much attention & effort goes to scaling up the model, and how little attention is given to the dataset.
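As a rough sketch of how the parametric form behaves, the snippet below evaluates it term by term. The constants are the fitted values as I recall them from the Chinchilla paper, the default `delta` is likewise my recollection, and the 280B-parameter / 300B-token configuration is Gopher's published size; treat all of these, along with the function names, as assumptions rather than a reproduction of the authors' code.

```python
import math

# Fitted constants as reported in the Chinchilla paper (assumed, not re-derived here).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss_terms(n_params, n_tokens):
    """Return the (irreducible, model-size, data-size) terms of L(N, D)."""
    return E, A / n_params ** ALPHA, B / n_tokens ** BETA

def predicted_loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return sum(loss_terms(n_params, n_tokens))

def huber(x, delta=1e-3):
    """Huber loss: quadratic near zero, linear in the tails."""
    if abs(x) <= delta:
        return 0.5 * x * x
    return delta * (abs(x) - 0.5 * delta)

def fit_objective(runs):
    """Sum of Huber losses on log-space residuals; runs = [(N, D, observed_loss), ...]."""
    return sum(
        huber(math.log(predicted_loss(n, d)) - math.log(observed))
        for n, d, observed in runs
    )

# A Gopher-sized run: 280B parameters, 300B tokens.
irreducible, model_term, data_term = loss_terms(280e9, 300e9)
print(irreducible, model_term, data_term)
```

At that scale the data term is several times larger than the model-size term, which is one way to see why more tokens, rather than more parameters, was the cheaper route to lower loss.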

The authors then used this equation to determine the optimal model size for the Gopher compute budget, and trained it on more tokens— 1.4T tokens, 4.6x the number of tokens Gopher was trained on. This model, being 4x smaller, has a radically smaller memory footprint and is much faster/cheaper to sample from.
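The "same compute budget" claim can be sanity-checked with the common rule of thumb that training costs about 6 FLOPs per parameter per token. The sketch below assumes that approximation and the published model sizes (280B parameters for Gopher, 70B for Chinchilla), neither of which is stated in the text above.

```python
def train_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: roughly 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

gopher_flops = train_flops(280e9, 300e9)        # 280B parameters, 300B tokens
chinchilla_flops = train_flops(70e9, 1.4e12)    # 70B parameters, 1.4T tokens

token_ratio = 1.4e12 / 300e9
compute_ratio = chinchilla_flops / gopher_flops

print(f"{gopher_flops:.2e} vs {chinchilla_flops:.2e} FLOPs")
print(token_ratio, compute_ratio)
```

Under these assumptions the two runs land within about 20% of each other in compute, even though one model is a quarter of the size of the other.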

The Chinchilla paper has been highly influential. Almost every team that I’ve been talking to that is training an LLM right now talks about how they’re training a Chinchilla optimal model, which is remarkable given that basically everything in the LLM space changes every week.

The standard practice before Chinchilla was to train your model for 300B tokens, which is what GPT-3, Gopher, and Jurassic-1 all did. Chinchilla reveals how wasteful that was; basically, all of these papers made themselves more expensive to infer by training models that were too large.

Changes in Chinchilla (otherwise the same as Gopher):

  • AdamW instead of Adam (there’s an interesting footnote regarding the choice of optimizer: “a model trained with AdamW only passes the training performance of a model trained with Adam around 80% of the way through the cosine cycle, though the ending performance is notably better”)
  • Uses a modified SentencePiece tokenizer that is slightly different from Gopher (doesn’t apply NFKC normalisation)
  • They compute the forward + backward pass in bfloat16, but store a float32 copy of the weights in the optimizer state. They find that this performs basically the same as using float32 everywhere.
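For readers who haven't compared the two optimizers, here is a single-parameter sketch of the difference between Adam with L2 regularisation and AdamW's decoupled weight decay. It is a textbook-style illustration with hyperparameters I picked for the example, not Chinchilla's training code.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, decoupled_decay=0.0):
    """One scalar Adam update.

    l2 > 0 folds weight decay into the gradient (Adam with L2 regularisation).
    decoupled_decay > 0 shrinks the weight directly instead (AdamW).
    """
    g = grad + l2 * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * decoupled_decay * w
    return w, m, v

# A large weight with zero task gradient, so only the decay acts on it.
w_adam_l2, _, _ = adam_step(100.0, 0.0, 0.0, 0.0, t=1, l2=0.1)
w_adamw, _, _ = adam_step(100.0, 0.0, 0.0, 0.0, t=1, decoupled_decay=0.1)
print(w_adam_l2, w_adamw)
```

With L2 folded into the gradient, Adam's normalisation rescales the decay so the step no longer depends on how large the weight is. The decoupled version shrinks the weight in proportion to its size.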

All of the changes are ablated extensively in the appendix. None of these are particularly novel.

PaLM

Speaking of training models that were too large: we have PaLM! PaLM was really, really big. As far as I’m aware, it’s the largest dense language model trained to date, at 540B parameters, requiring 6144 TPUs to train on (this is 2 entire TPU v4 pods, each consisting of 3072 chips). This is incredibly expensive! Probably only Google has the resources + infrastructure to do this.

… unfortunately, they were training PaLM at the same time Chinchilla was being written. Very suboptimal.

Changes from GPT-3:

  • Multi-query attention. Shares K/V embeddings for each head, but has separate Q embeddings. Makes it much faster during inference time.
  • Uses parallel transformer blocks, which improves training time by 15%. As it was trained using 6144 TPU v4 chips for 1200 hours, the total training cost (at public prices) is up to $3.22 per chip-hour, for a total of up to roughly $22M. So this change saved up to roughly $3M.
  • SwiGLU activations, rather than the GELU activation used by GPT-3
  • RoPE embeddings, rather than the learned embeddings
  • Shares the input-output embeddings
  • No bias vectors
  • SentencePiece with 256k tokens
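One way to see why multi-query attention helps at inference time is to look at the size of the key/value cache kept during decoding. The sketch below uses a hypothetical model shape chosen only for illustration (it is not PaLM's configuration), and the function name is my own.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Bytes needed to cache keys and values for autoregressive decoding."""
    keys_and_values = 2
    return keys_and_values * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, head_dim=128, seq_len=2048, batch=8)
n_heads = 32

multi_head = kv_cache_bytes(n_kv_heads=n_heads, **shape)  # one K/V per head
multi_query = kv_cache_bytes(n_kv_heads=1, **shape)       # one K/V shared by all heads

print(multi_head / 2**30, "GiB vs", multi_query / 2**30, "GiB")
```

Sharing K/V across heads shrinks the cache by a factor equal to the number of heads, so each decoding step has far less memory to read.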

So, a ton of changes! Again, a bunch of these are common, e.g. using the learned embeddings that GPT-3 had is very passé, and almost no one does it now.

LLaMa


LLaMa combined a bunch of the best features from PaLM and Chinchilla:

  • Pre-normalize the input of each transformer sub-layer
  • RMSNorm, instead of LayerNorm, as done in Gopher
  • SwiGLU activation function from PaLM (but a dimension of $\frac{2}{3} 4d$ instead of $4d$, as in PaLM)
  • Uses rotary positional embeddings (RoPE) instead of the absolute positional embeddings, as done in PaLM
  • Uses AdamW, as done in Chinchilla
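The two-thirds factor in the SwiGLU bullet makes sense once you count parameters: SwiGLU uses three weight matrices where a standard feed-forward block uses two. Here is a minimal sketch with a hypothetical model width and helper names of my own choosing; it illustrates the idea and is not LLaMa's code.

```python
import math

def silu(x):
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: (SiLU(x W_gate) * (x W_up)) W_down."""
    gate = [silu(v) for v in matvec(x, w_gate)]
    up = matvec(x, w_up)
    return matvec([g * u for g, u in zip(gate, up)], w_down)

def standard_ffn_params(d_model, hidden):
    """Two matrices: d_model -> hidden and hidden -> d_model."""
    return 2 * d_model * hidden

def swiglu_ffn_params(d_model, hidden):
    """Three matrices: gate and up projections, plus the down projection."""
    return 3 * d_model * hidden

d = 3072  # hypothetical model width
standard = standard_ffn_params(d, 4 * d)
swiglu = swiglu_ffn_params(d, 2 * 4 * d // 3)
print(standard, swiglu)

# Tiny forward pass with identity gate/up projections and a summing down projection.
out = swiglu_ffn([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [1.0]])
print(out)
```

Shrinking the hidden dimension to two thirds of 4d keeps the parameter count of the block the same as a standard feed-forward layer with hidden size 4d.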

I think that LLaMa is the recipe to follow for the current SOTA in training large models.

Computational changes:

  • Uses efficient attention (Rabe & Staats, FlashAttention)
  • Gradient checkpointing
  • Interestingly, they appear to be using float32s everywhere (or at least, don’t say otherwise)

These are all similar to Gopher. The one obvious optimization they missed is to use lower precision, as Chinchilla did; I’m curious why they didn’t.

My one complaint is that I wish they would have trained the model for longer. The learning curve is very far from convergence! This paper is, in my mind, the shining example showing how well smaller models can do when trained well.

GPT-4

This is where I’d include information about GPT-4, if there was any. Unfortunately, the GPT-4 technical report contains almost no information:

GPT-4 is a Transformer-style model [33] pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF) [34]. Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

As a result, I’m not going to talk about it, as there’s not much to say. Hopefully OpenAI changes their mind and releases some information about their model. [2]

Conclusion

This is it, as of March ‘23. I’m sure something new will come along and invalidate all of this.

What have I missed? Add comments on Substack or email me and I’ll update it.

Articles I’m reading:

  • Why didn’t DeepMind build GPT-3?
  1. The paper itself doesn’t report the number of tokens, but OpenWebText, the open source reproduction, gets nine billion, using OpenAI’s tokenizer. ↩
  2. To be clear, I highly doubt this will ever happen. ↩

Finbarr Timbers is an ML researcher who writes about the bleeding edge of AI research on his blog, Artificial Fintelligence. He used to be a researcher at DeepMind. Finbarr’s research focuses on large models, language and otherwise. He is also an angel investor writing small, early cheques in deep tech, primarily in AI. In a past life, Finbarr studied econometrics and mathematical economics at the LSE.

Original. Reposted with permission.


Microsoft Designer brings AI-powered graphic design to the masses


This image was created with the prompt: 'Create an Instagram post from ZDNET announcing the Microsoft Designer launch'.

Microsoft is launching its new Designer app, which provides an impressive example of how far generative artificial intelligence (AI) has come.

Microsoft Designer is a graphic design tool that uses AI to create new content from your prompts, powered by the latest version of OpenAI's Dall-E.


The tech giant is already waist-deep in generative AI waters with services that include Bing Chat and Bing Image Creator, and hefty investments across its suite of products.

The AI boom has reached an unprecedented level, with some experts certain it will severely disrupt the job market and others already using it daily to improve productivity at work and at home. Meanwhile, tech companies are working hard to create the next ChatGPT or AI image creator, and other AI tools.


Into this increasingly crowded market comes Designer, which is kind of like Canva on steroids. Designer lets users simply write a description of the output they want and then it uses generative AI to respond with a created graphic design.

This AI-led service represents a big break from traditional content apps, where users normally visit a site or program and choose from presets, layouts, stickers, and fonts to design social media posts, party invitations, and more.

Designer could have a significant impact on the graphic designer community. However, rather than signal the end of the profession, the hope is that a tool like Designer will empower people and increase their productivity, such as those who need a creativity boost or those who are artistically challenged.

Microsoft's tool was only available through a waitlist until today. Rumors of a Canva competitor were followed by an official announcement and preview access during Microsoft's Surface event last fall.


The Designer app will be featured prominently on the Microsoft Edge browser sidebar for quick access.

Users who accessed Microsoft Designer during the preview stage will also see even more AI-powered features added over time. Some of those features include the ability to Fill, Expand background, Erase, and Replace backgrounds.

Users just need to go to Designer.Microsoft.com to access the new AI tool for free, although Microsoft 365 subscribers will have access to more premium features.


Y Combinator-backed Luca aims to optimize retail prices at enterprise scale

By Kyle Wiggers

Luca, a startup building price planning and prediction tools for retailers, today announced that it closed a $2.5 million seed round led by Menlo Ventures with participation from Y Combinator (Luca was in Y Combinator’s winter 2023 class), Soma Capital and angel investors.

Luca was co-founded by Tanvi Surti and Yonah Mann, who worked together on Uber’s dynamic pricing team. Mann was focused on Uber Eats pricing, while Surti led the UberPool pricing group.

“During my time on the Pool team in 2019, the business had bad unit economics,” Mann told TechCrunch in an email interview. “Uber was on the verge of an IPO, and I was responsible for getting the unit economics working by reconfiguring Uber Pool’s pricing algorithm. In ten months, we succeeded in plugging a massive hole in Uber’s profit and loss using pricing tech. That got us thinking.”

Mann describes Luca as a “pricing co-pilot for enterprise retailers.” In plain English, it’s a platform that taps AI to identify revenue and margin headroom, make recommendations for price adjustments and measure the outcomes.

“Pricing strategy is one of the most powerful levers that retailers have to create margin and revenue growth, yet most retail pricing teams are often shooting in the dark,” Mann said. “Retail pricing teams have to incorporate large volumes of data from multiple channels to build a strategy — sales history, market trends, competitor price changes and inventory availability. They have to do this across tens of thousands of SKUs and multiple stakeholders.”

The need — and the massive addressable market — sowed the seeds for a number of pricing optimization and planning startups. Pricefx, one of the more successful vendors, has raised tens of millions for its algorithmic pricing software designed for software-as-a-service businesses. Fetcherr, which focuses on pricing adjustment for the airline industry, recently landed a $12.5 million equity investment.


Luca’s platform attempts to optimize retail prices using historical data and other signals. Image Credits: Luca

Luca’s differentiation, ostensibly, lies in its pricing engine, which takes in retailers’ historical sales and inventory data as well as competitor signals to forecast the sales performance of products at different price points. Once its pricing recommendations are approved, Luca keeps tabs on sales volume, looking out for undesirable trends.

“Unlike some other players in this space, we are not a dynamic pricing company. We just don’t think that’s the right user experience for retail — yet,” Mann said. “Our solution is complementary to the human decision maker and our goal is to provide humans with decision making superpowers by turning a sea of data into clear recommendations, with high levels of explainability.”

It’s early days for Luca in terms of customer acquisition — the startup’s only worked with eight brands so far. But Mann claims that two of those brands are Fortune 500 retailers.

“Post pandemic, most retailers are feeling growth and margin pressures due to higher customer acquisition costs, reduced consumer spending and rising interest rates,” Mann said. “Most retail software-as-a-service tooling out there rarely directly impacts business metrics, whereas retail executives we’ve interviewed are actively looking for revenue optimization opportunities … That’s where we come in – our solution creates direct and measurable business value.”

The seed funding will be put toward expanding Luca’s engineering and data science teams, he added.

Introducing the Testing Library for Natural Language Processing


Responsible AI: Goals versus Reality

While there is a lot of talk about the need to train AI models that are safe, robust, and equitable — few tools have been made available to data scientists to meet these goals. As a result, the front line of Natural Language Processing (NLP) models in production systems reflects a sorry state of affairs.

Current NLP systems fail often and miserably. [Ribeiro 2020] showed how sentiment analysis services of the top three cloud providers fail 9-16% of the time when replacing neutral words, 7-20% of the time when changing neutral named entities, 36-42% of the time on temporal tests, and almost 100% of the time on some negation tests. [Song & Raghunathan 2020] showed data leakage of 50-70% of personal information into popular word & sentence embeddings. [Parrish et al. 2021] showed how biases around race, gender, physical appearance, disability, and religion are ingrained in state-of-the-art question answering models – sometimes changing the likely answer more than 80% of the time. [van Aken et al. 2022] showed how adding any mention of ethnicity to a patient note reduces their predicted risk of mortality – with the most accurate model producing the largest error.

In short, these systems just don’t work. We would never accept a calculator that adds only some numbers correctly, or a microwave that randomly alters its strength based on the kind of food you put in or the time of day. A well-engineered production system should work reliably on common inputs. It should also be safe & robust when handling uncommon ones. Software engineering includes three fundamental principles to help us get there.

Applying Software Engineering Fundamentals

First, test your software. The only surprising thing about why NLP models fail today is the banality of the answer: because no one tested them. The papers cited above were novel because they were among the first. If you want to deliver software systems that work, you need to define what that means, and test that it does, before deploying it to production. You should also do that whenever you change the software, since NLP models regress too [Xie et al. 2021].

Second, don’t reuse academic models as production-ready ones. One wonderful aspect of scientific progress in NLP is that most academics make their models publicly available and easily reusable. This makes research faster and enables benchmarks like SuperGLUE, LM-Harness, and BIG-bench. However, tools that are designed to reproduce research results are not a good fit for use in production. Reproducibility requires that models stay the same – instead of keeping them current or more robust over time. A common example is BioBERT, perhaps the most widely used biomedical embeddings model, which was published in early 2019 and hence considers COVID-19 an out-of-vocabulary word.

Third, test beyond accuracy. Since the business requirements for your NLP system include robustness, reliability, fairness, toxicity, efficiency, lack of bias, lack of data leakage, and safety, your test suites need to reflect that. Holistic Evaluation of Language Models [Liang et al. 2022] is a comprehensive review of definitions and metrics for these terms in different contexts and is well worth a read. But you will need to write your own tests: for example, what does inclusiveness actually mean for your application?

Good tests need to be specific, isolated, and easy to maintain. They also need to be versioned & executable, so that you can make them part of an automated build or MLOps workflow. The nlptest library is a simple framework that makes this simpler.

Introducing the nlptest Library

The nlptest library is designed around five principles.

Open Source. This is a community project under the Apache 2.0 license. It’s free to use forever with no caveats, including for commercial use. There’s an active development team behind it, and you’re welcome to contribute or fork the code if you’d like to.

Lightweight. The library runs on your laptop – no need for a cluster, a high-memory server, or a GPU. It requires only pip install nlptest to install and can run offline (i.e., in a VPN or a high-compliance enterprise environment). Then, generating and running tests can be done in as little as three lines of code:

    from nlptest import Harness

    h = Harness(task="ner", model="ner.dl", hub="johnsnowlabs")
    h.generate().run().report()

This code imports the library, creates a new test harness for the named entity recognition (NER) task for the specified model from John Snow Labs’ NLP models hub, automatically generates test cases (based on the default configuration), runs those tests, and prints out a report.

The tests themselves are stored in a pandas data frame – making them easy to edit, filter, import, or export. The entire test harness can be saved and loaded, so to run a regression test of a previously configured test suite, just call h.load("filename").run().

Cross Library. There is out-of-the-box support for transformers, Spark NLP, and spacy. It is easy to extend the framework to support additional libraries. There is no reason for us as an AI community to build the test generation & execution engines more than once. Both pre-trained and custom NLP pipelines from any of these libraries can be tested:

    import spacy
    from nlptest import Harness

    # a string parameter to Harness asks to download a pre-trained pipeline or model
    h1 = Harness(task="ner", model="dslim/bert-base-NER", hub="huggingface")
    h2 = Harness(task="ner", model="ner.dl", hub="johnsnowlabs")
    h3 = Harness(task="ner", model="en_core_web_md", hub="spacy")

    # alternatively, configure and pass an initialized pipeline object
    pipe = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser"])
    h4 = Harness(task="ner", model=pipe, hub="spacy")

Extensible. Since there are hundreds of potential types of tests and metrics to support, additional NLP tasks of interest, and custom needs for many projects, much thought has been put into making it easy to implement and reuse new types of tests.

For example, one of the built-in test types for bias for US English replaces first & last names with names that are common for White, Black, Asian, or Hispanic people. But what if your application is intended for India or Brazil? What about testing for bias based on age or disability? What if you come up with a different metric for when a test should pass?

The nlptest library is a framework which enables you to easily write and then mix & match test types. The TestFactory class defines a standard API for different tests to be configured, generated, and executed. We’ve worked hard to make it as easy as possible for you to contribute or customize the library to your needs.

Test Models and Data. When a model is not ready for production, the issues are often in the dataset used to train or evaluate it – not in the modeling architecture. One common issue is mislabeled training examples, shown to be pervasive in widely used datasets [Northcutt et al. 2021]. Another issue is representation bias: a common challenge to finding how well a model performs across ethnic lines is that there aren’t enough test labels to even calculate a usable metric. It is then apt to have the library fail a test and tell you that you need to change the training & test sets to represent other groups, fix likely mistakes, or train for edge cases.

Therefore, a test scenario is defined by a task, a model, and a dataset, i.e.:

    h = Harness(task="text-classification",
                model="distilbert_base_sequence_classifier_toxicity",
                data="german hatespeech refugees.csv",
                hub="johnsnowlabs")

Beyond enabling the library to provide a comprehensive testing strategy for both models & data, this setup also enables you to use generated tests to augment your training and test datasets, which can greatly shorten the time needed to fix models and make them production ready.

The next sections describe the three tasks that the nlptest library helps you automate: Generating tests, running tests, and augmenting data.


1. Automatically Generate Tests

One giant difference between nlptest and the testing libraries of yore is that tests can now be automatically generated – to an extent. Each TestFactory can define multiple test types and for each one implements a test case generator and test case runner.

Generated tests are returned as a table with ‘test case’ and ‘expected result’ columns that depend on that specific test. These two columns are intended to be human readable, to enable a business analyst to manually review, edit, add or remove test cases when needed. For example, here are some of the test cases generated for an NER task by the RobustnessTestFactory for the text “I live in Berlin.”:

Test type          | Test case                   | Expected result
remove_punctuation | I live in Berlin            | Berlin: Location
lowercase          | i live in berlin.           | berlin: Location
add_typos          | I liive in Berlin.          | Berlin: Location
add_context        | I live in Berlin. #citylife | Berlin: Location
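To give a feel for how rows like these could be produced, here is a toy generator in plain Python. It is not the nlptest implementation or its API: the function names, the perturbation table, and the row format are all hypothetical, and the random add_typos perturbation is left out to keep the sketch deterministic.

```python
import string

def remove_punctuation(text):
    """Strip ASCII punctuation characters from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))

def add_context(text, suffix="#citylife"):
    """Append an irrelevant hashtag to the end of the text."""
    return f"{text} {suffix}"

def identity(text):
    """Leave the text unchanged."""
    return text

# (test type, perturbation for the sentence, perturbation for the expected entity span)
PERTURBATIONS = [
    ("remove_punctuation", remove_punctuation, remove_punctuation),
    ("lowercase", str.lower, str.lower),
    ("add_context", add_context, identity),
]

def generate_robustness_cases(text, entities):
    """Build one test row per perturbation; entities is a list of (span, label) pairs."""
    rows = []
    for test_type, perturb_text, perturb_span in PERTURBATIONS:
        expected = "; ".join(f"{perturb_span(span)}: {label}" for span, label in entities)
        rows.append({
            "test_type": test_type,
            "test_case": perturb_text(text),
            "expected_result": expected,
        })
    return rows

rows = generate_robustness_cases("I live in Berlin.", [("Berlin", "Location")])
for row in rows:
    print(row)
```

Because each row is plain data, it can be reviewed or edited by hand before anything is run against a model.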

Here are test cases generated for a text classification task by the BiasTestFactory using US ethnicity-based name replacement when starting from the text “John Smith is responsible”:

Test type                       | Test case                      | Expected result
replace_to_asian_name           | Wang Li is responsible         | positive_sentiment
replace_to_black_name           | Darnell Johnson is responsible | negative_sentiment
replace_to_native_american_name | Dakota Begay is responsible    | neutral_sentiment
replace_to_hispanic_name        | Juan Moreno is responsible     | negative_sentiment

Here are test cases generated by the FairnessTestFactory and RepresentationTestFactory classes. Representation could for example require that the test dataset contains at least 30 patients of male, female, and unspecified gender each. Fairness tests could require that the F1 score of the tested model is at least 0.85 when tested on slices of data with people of each of these gender categories:

Test type                 | Test case | Expected result
min_gender_representation | Male      | 30
min_gender_representation | Female    | 30
min_gender_representation | Unknown   | 30
min_gender_f1_score       | Male      | 0.85
min_gender_f1_score       | Female    | 0.85
min_gender_f1_score       | Unknown   | 0.85
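The two kinds of checks in this table can be sketched in a few lines of plain Python. This is an illustration of the idea only, with hypothetical function names and made-up data; it does not use or mirror the nlptest API.

```python
from collections import Counter

def min_representation(groups, expected_groups, minimum):
    """For each expected group, check the dataset holds at least `minimum` examples."""
    counts = Counter(groups)
    return {group: counts[group] >= minimum for group in expected_groups}

def f1_score(y_true, y_pred, positive=1):
    """Binary F1 score for the given positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def min_group_f1(records, minimum):
    """records holds (group, y_true, y_pred) triples; check F1 per group against a floor."""
    by_group = {}
    for group, truth, pred in records:
        truths, preds = by_group.setdefault(group, ([], []))
        truths.append(truth)
        preds.append(pred)
    return {group: f1_score(t, p) >= minimum for group, (t, p) in by_group.items()}

# Made-up data: Female is under-represented and has a lower F1 score.
groups = ["Male"] * 30 + ["Female"] * 12 + ["Unknown"] * 30
representation = min_representation(groups, ["Male", "Female", "Unknown"], minimum=30)

records = [
    ("Male", 1, 1), ("Male", 1, 1), ("Male", 0, 0), ("Male", 1, 1),
    ("Female", 1, 1), ("Female", 1, 0), ("Female", 0, 1), ("Female", 1, 1),
]
fairness = min_group_f1(records, minimum=0.85)

print(representation)
print(fairness)
```

Each threshold is an explicit, editable number, which is the point of framing fairness and representation as tests rather than as metrics.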

Important things to note about test cases:

  • The meaning of “test case” and “expected result” depends on the test type, but should be human-readable in each case. This is so that after you call h.generate() you can manually review the list of generated test cases, and decide on which ones to keep or edit.
  • Since the table of tests is a pandas data frame, you can also edit it right within your notebook (with Qgrid) or export it as a CSV and have a business analyst edit it in Excel.
  • While automation does 80% of the work, you usually will need to manually check the tests. For example, if you are testing a fake news detector, then a replace_to_lower_income_country test editing “Paris is the Capital of France” to “Paris is the Capital of Sudan” will understandably yield a mismatch between the expected prediction and the actual prediction.
  • You will also have to validate that your tests capture the business requirements of your solution. For example, the FairnessTestFactory example above does not test non-binary or other gender identities, and does not require that accuracy is near-equal across genders. It does, however, make those decisions explicit, human readable, and easy to change.
  • Some test types will generate just one test case, while others can generate hundreds. This is configurable – each TestFactory defines a set of parameters.
  • TestFactory classes are usually specific to a task, language, locale, and domain. That is by design since it allows for writing simpler & more modular test factories.

2. Running Tests

After you’ve generated test cases and edited them to your heart’s content, here’s how you use them:

  1. Call h.run() to run all the tests. For each test case in the harness’s table, the relevant TestFactory will be called to run the test and return a pass/fail flag along with an explanatory message.
  2. Call h.report() after calling h.run(). This will group the pass ratio by test type, print a table summarizing the results, and return a flag stating whether the model passed the test suite.
  3. Call h.save() to save the test harness, including the tests table, as a set of files. This enables you to later load and run the exact same test suite, for example when performing a regression test.

Here is an example of a report generated for a Named Entity Recognition (NER) model, applying tests from five test factories:

Category       | Test type                 | Fail count | Pass count | Pass rate | Minimum pass rate | Pass?
robustness     | remove_punctuation        | 45         | 252        | 85%       | 75%               | TRUE
bias           | replace_to_asian_name     | 110        | 169        | 65%       | 80%               | FALSE
representation | min_gender_representation | 0          | 3          | 100%      | 100%              | TRUE
fairness       | min_gender_f1_score       | 1          | 2          | 67%       | 100%              | FALSE
accuracy       | min_macro_f1_score        | 0          | 1          | 100%      | 100%              | TRUE
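The aggregation behind a report like this is simple to sketch: count passes and failures per test type, compare the pass rate with a configured minimum, and combine the results into one overall flag. The code below is a hypothetical illustration, not the library's implementation. It reuses the fail and pass counts from the robustness and fairness rows above.

```python
def summarize(results, min_pass_rates):
    """results holds (test_type, passed) pairs; returns (report rows, suite passed?)."""
    counts = {}
    for test_type, passed in results:
        fail, ok = counts.get(test_type, (0, 0))
        counts[test_type] = (fail + (not passed), ok + passed)

    report = []
    for test_type, (fail, ok) in counts.items():
        pass_rate = ok / (fail + ok)
        minimum = min_pass_rates[test_type]
        report.append({
            "test_type": test_type,
            "fail_count": fail,
            "pass_count": ok,
            "pass_rate": pass_rate,
            "minimum_pass_rate": minimum,
            "pass": pass_rate >= minimum,
        })
    return report, all(row["pass"] for row in report)

results = (
    [("remove_punctuation", False)] * 45 + [("remove_punctuation", True)] * 252
    + [("min_gender_f1_score", False)] * 1 + [("min_gender_f1_score", True)] * 2
)
minimums = {"remove_punctuation": 0.75, "min_gender_f1_score": 1.0}

report, suite_passed = summarize(results, minimums)
for row in report:
    print(row)
print(suite_passed)
```

A single failing test type is enough to fail the suite, which matches the binary framing of the report.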

While some of what nlptest does is calculate metrics – what is the model’s F1 score? Bias score? Robustness score? – everything is framed as a test with a binary result: pass or fail. As good testing should, this requires you to be explicit about what your application does and doesn’t do. It then enables you to deploy models faster and with confidence. It also enables you to share the list of tests with a regulator – who can read it, or run it themselves to reproduce your results.

3. Data Augmentation

When you find that your model falls short on robustness or bias tests, one common way to improve it is to add new training data that specifically targets these gaps. For example, if your original dataset mostly includes clean text (like Wikipedia text, with no typos, slang, or grammatical errors), or lacks representation of Muslim or Hindi names, then adding such examples to the training dataset should help the model learn to handle them better.

Fortunately, we already have a method to automatically generate such examples in some cases – the same one we use to generate tests. Here is the workflow for data augmentation:

  1. After you’ve generated and run the tests, call h.augment() to automatically generate augmented training data based on the results from your tests. Note that this has to be a freshly generated dataset – the test suite cannot be used to retrain the model, because then the next version of the model could not be tested again against it. Testing a model on data it was trained on is an example of data leakage, which would result in artificially inflated test scores.
  2. The freshly generated augmented dataset is available as a pandas dataframe, which you can review, edit if needed, and then use to retrain or fine-tune your original model.
  3. You can then re-evaluate the newly trained model on the same test suite it failed on before, by creating a new test harness and calling h.load() followed by h.run() and h.report().
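The idea behind this workflow can be sketched in a few lines. This is an assumption-laden illustration, not how h.augment() is implemented: `PERTURBATIONS`, `augment`, and its parameters are hypothetical names. The sketch applies the perturbations behind the failed test types to training sentences and skips anything that already appears in the test suite, so the augmented data stays separate from the tests.

```python
import random
import string

# Hypothetical perturbations, keyed by the test type they correspond to.
PERTURBATIONS = {
    "remove_punctuation": lambda s: s.translate(
        str.maketrans("", "", string.punctuation)),
    "uppercase": str.upper,
}

def augment(train_sentences, failed_test_types, test_inputs,
            fraction=0.5, seed=0):
    """Generate fresh augmented rows for the test types a model failed.

    Rows that are unchanged by the perturbation, or that collide with an
    input already in the test suite, are dropped to avoid data leakage.
    """
    rng = random.Random(seed)
    held_out = set(test_inputs)
    rows = []
    for test_type in failed_test_types:
        perturb = PERTURBATIONS[test_type]
        k = max(1, int(len(train_sentences) * fraction))
        for sentence in rng.sample(train_sentences, k):
            augmented = perturb(sentence)
            if augmented != sentence and augmented not in held_out:
                rows.append({"test_type": test_type,
                             "original": sentence,
                             "augmented": augmented})
    return rows

train = ["Anna moved to Berlin.", "Call me at 5, please!"]
suite_inputs = ["Call me at 5 please"]  # already used as a test case

augmented_rows = augment(train, ["remove_punctuation"], suite_inputs,
                         fraction=1.0)
for row in augmented_rows:
    print(row)
```

A list of dictionaries like this can be passed to pandas.DataFrame(augmented_rows) for review and editing before retraining.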

This iterative process empowers NLP data scientists to continuously enhance their models while adhering to the rules dictated by their own moral codes, corporate policies, and regulatory bodies.

Getting Started

The nlptest library is live and freely available to you right now. Start with pip install nlptest or visit nlptest.org to read the docs and getting-started examples.

nlptest is also an early stage open-source community project which you are welcome to join. John Snow Labs has a full development team allocated to the project and is committed to improving the library for years, as we do with other open-source libraries. Expect frequent releases with new test types, tasks, languages, and platforms to be added regularly. However, you’ll get what you need faster if you contribute, share examples & documentation, or give us feedback on what you need most. Visit nlptest on GitHub to join the conversation.

We look forward to working together to make safe, reliable, and responsible NLP an everyday reality.
