IBM Reveals its Entire 6.48 TB LLM Training Dataset

In May, IBM open-sourced its Granite 13B LLM, ideal for enterprise use cases.

Now, Armand Ruiz, the VP of product – AI platform at IBM, has revealed the full 6.48 TB dataset used to train Granite 13B.

This dataset, after undergoing rigorous pre-processing, was reduced to 2.07 TB, reflecting a 68% reduction. Ruiz emphasised that this step was essential to ensure a high-quality, unbiased, ethical, and legal dataset tailored for enterprise use cases.
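The 68% figure follows directly from the two sizes quoted above. A quick check, where the function name is only illustrative:

```python
# Sizes quoted in the article, in terabytes.
RAW_TB = 6.48
PROCESSED_TB = 2.07


def reduction_percent(before: float, after: float) -> float:
    """Return the percentage of data removed going from `before` to `after`."""
    return (before - after) / before * 100


reduction = reduction_percent(RAW_TB, PROCESSED_TB)
print(f"Removed {RAW_TB - PROCESSED_TB:.2f} TB, a {reduction:.1f}% reduction")
# Removed 4.41 TB, a 68.1% reduction
```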

The dataset was meticulously curated from a variety of sources, including:

arXiv: Over 2.4 million scientific paper pre-prints.
Common Crawl: Open repository of web crawl data.
DeepMind Mathematics: Mathematical Q&A pairs.
Free Law: Public-domain legal opinions from US courts.
GitHub Clean: Code data from CodeParrot.
Hacker News: Computer science and entrepreneurship news from 2007-2018.
OpenWeb Text: Open-source version of OpenAI’s Web Text corpus.
Project Gutenberg (PG-19): Free e-books with a focus on older works.
PubMed Central: Biomedical and life sciences papers.
SEC Filings: 10-K/Q filings from the US SEC (1934-2022).
Stack Exchange: User-contributed content on the Stack Exchange network.
USPTO: US patents granted from 1975 to May 2023.
Webhose: Unstructured web content converted into machine-readable data.
Wikimedia: Eight English Wikimedia projects.

The pre-processing pipeline comprised the following steps: text extraction; deduplication; language identification; sentence splitting; hate, abuse, and profanity annotation; document quality annotation; URL block-listing annotation; filtering; and tokenization.

These steps, involving annotation and filtering based on defined thresholds, ensured that the final dataset was of the highest quality for model training.
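To make the annotate-then-filter idea concrete, here is a minimal sketch of how such a pipeline might be structured. It is not IBM's implementation: the article does not describe the actual annotators or thresholds, so the scoring heuristics, the threshold values, the word list, and the blocked domain below are all hypothetical stand-ins.

```python
import hashlib
from dataclasses import dataclass, field
from urllib.parse import urlparse

# All values below are hypothetical; the article does not give the real ones.
HAP_THRESHOLD = 0.1          # max share of flagged words allowed
QUALITY_THRESHOLD = 0.5      # min share of alphabetic tokens required
BLOCKED_DOMAINS = {"spam.example"}
FLAGGED_WORDS = {"badword"}


@dataclass
class Document:
    text: str
    url: str = ""
    annotations: dict = field(default_factory=dict)


def deduplicate(docs):
    """Drop exact duplicates after lowercasing and whitespace normalization."""
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(doc.text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def annotate(doc):
    """Attach scores to the document; nothing is removed at this stage."""
    words = [w.strip(".,!?") for w in doc.text.lower().split()]
    total = len(words) or 1
    doc.annotations["hap_score"] = sum(w in FLAGGED_WORDS for w in words) / total
    doc.annotations["quality_score"] = sum(w.isalpha() for w in words) / total
    doc.annotations["blocked_url"] = urlparse(doc.url).hostname in BLOCKED_DOMAINS
    return doc


def keep(doc):
    """Filtering step: compare the annotations against the thresholds."""
    a = doc.annotations
    return (
        a["hap_score"] <= HAP_THRESHOLD
        and a["quality_score"] >= QUALITY_THRESHOLD
        and not a["blocked_url"]
    )


def run_pipeline(docs):
    return [doc for doc in map(annotate, deduplicate(docs)) if keep(doc)]


docs = [
    Document("IBM trains language models.", "https://news.example/a"),
    Document("ibm  trains language models.", "https://news.example/b"),  # duplicate
    Document("badword badword here", "https://news.example/c"),          # flagged words
    Document("Clean text from a blocked site.", "https://spam.example/d"),
    Document("1234 5678 #### 9999", "https://news.example/e"),           # low quality
]
kept = run_pipeline(docs)
print([doc.url for doc in kept])  # ['https://news.example/a']
```

Keeping annotation separate from filtering means the thresholds can be changed later without re-scoring every document.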

IBM has released four variations of the Granite code model, ranging in size from 3 to 34 billion parameters. The models have been tested on a range of benchmarks and have outperformed other comparable models like Code Llama and Llama 3 in many tasks.

The post IBM Reveals its Entire 6.48 TB LLM Training Dataset appeared first on Analytics India Magazine.
