Microsoft Solves the Problem of LLM Data Scarcity

Small models have shown promise over the last few months, and we are now finally getting to see what they are truly capable of – thanks to Microsoft, Google, Anthropic, and even Alibaba from China.

Small models have also shown that parameter size isn’t everything inside a language model; what makes them great is the underlying techniques, especially the use of synthetic data.

And Microsoft has had plenty to show off with its Phi-4 AI model.

The philosophy behind Phi-4 is something that has been making news almost every single day over the last few months. The most recent comment comes from former OpenAI chief scientist Ilya Sutskever, who said that we might be seeing the end of pre-training, and that data, which he likened to fossil fuel, may soon be exhausted.

In his recent talk at NeurIPS 2024, Sutskever laid out his view on overcoming these challenges. “I’m sure that eventually something will happen. People feel like agents are the future. More concretely, but also a little bit vaguely, synthetic data.”

The ‘Phi’ght For the Top, and a Victory of Sorts

Microsoft’s latest small language model, Phi-4, packs in 14B parameters, yet it outperforms Llama 3.3 70B and OpenAI’s GPT-4o on several benchmarks.

Phi-4’s secret sauce is synthetic data: lots of it, and of high quality.


AIM spoke to Harkirat Behl, a researcher at Microsoft, who was instrumental in creating the Phi family of models. Notably, Microsoft’s Phi-4 stole the show even amidst an array of announcements over the last week.

Microsoft’s detailed technical paper describes numerous techniques, with the main emphasis on ensuring the highest quality of datasets. The team created 50 broad types of synthetic datasets, each relying on a different set of skills and a different nature of interaction. The synthetic data for Phi-4 is designed primarily to prioritise reasoning and problem-solving.
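To make the idea concrete, here is a minimal, illustrative sketch of seed-based synthetic data generation. It is not Microsoft’s actual Phi-4 pipeline; the teacher model ID, prompt template and seed problems below are assumptions chosen purely for illustration.

```python
# Illustrative only: a toy seed-based generator for synthetic reasoning data.
# The teacher model ID, prompt template and seed problems are assumptions,
# not details from the Phi-4 technical report.
from transformers import pipeline

# Any capable instruction-tuned "teacher" model could stand in here (assumption).
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

SEED_PROBLEMS = [
    "A train travels 120 km in 1.5 hours. What is its average speed?",
    "Prove that the sum of two even integers is even.",
]

PROMPT = (
    "Rewrite the following problem in your own words, then solve it step by step, "
    "explaining the reasoning before stating the final answer.\n\nProblem: {seed}"
)

synthetic_rows = []
for seed in SEED_PROBLEMS:
    # Each seed is expanded into a reasoning-focused training example.
    generated = teacher(PROMPT.format(seed=seed), max_new_tokens=512)[0]["generated_text"]
    synthetic_rows.append({"seed": seed, "target": generated})

print(f"Generated {len(synthetic_rows)} synthetic reasoning examples")
```

In practice, a pipeline like this would also need filtering and validation steps so that only high-quality outputs make it into the training mix.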

“Big models are trained on all kinds of data and store information which may not be relevant,” Behl said. He added that with sufficient effort in curating high-quality data, it is possible to match the performance of these models, and perhaps even surpass them. Moreover, Microsoft hasn’t experimented with inference optimisation for Phi-4; the focus has been largely on synthetic data.

In Phi-4, synthetic data was used in both the pre-training and mid-training phases. Microsoft said that synthetic data serves as a more effective mechanism for the model’s learning because the datasets are structured, diverse and nuanced.

Small models like the Phi-4 can significantly impact countries like India, where most people wouldn’t be able to shell out $20 a month for frontier models. The Phi-4 will be available on Hugging Face for free, thanks to Behl and his determination to give back to his homeland.

In a conversation with AIM, Behl said that Phi-4 supports 10 Indian languages. “I personally made sure, and worked hard, to get Phi-4 to interpret the ten most common Indian languages.”

Behl also said that once the model is released, developers will be able to optimise it further and quantise it to run on-device for local use on PCs and laptops.
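As a rough illustration of what that on-device path could look like, here is a minimal sketch that loads the model with 4-bit quantisation using the Hugging Face transformers library and bitsandbytes. The model ID “microsoft/phi-4” is an assumption and should be checked against the actual repository once the weights are published.

```python
# Minimal sketch: loading Phi-4 with 4-bit quantisation for local use.
# The Hugging Face model ID "microsoft/phi-4" is an assumption, not a
# confirmed repository name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/accuracy
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",
    quantization_config=quant_config,
    device_map="auto",  # place layers on available GPU/CPU automatically
)

prompt = "Explain why high-quality training data matters for small models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Quantising to 4-bit roughly quarters the memory footprint of the 14B weights compared with 16-bit, which is what makes consumer GPUs and high-end laptops plausible targets.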

But Are Datasets Truly Hitting the Wall?

Synthetic data aside, are we truly running out of data for pre-training? “It was created somehow, and now we use it, and we’ve achieved peak data, and there will be no more,” Sutskever said during the discussion.

Frontier models will have to explore new techniques, such as inference-time compute or new training methods, to stay on top of the game. Parameter count will no longer be the moat. “We have to deal with the data that we have now,” he added.

Behl echoed the sentiment and said that what is being done with Phi-4 is the antithesis of the original scaling hypothesis. “Blindly scaling, like how people have been doing with trillion parameter models, just isn’t needed, right?” he said. This thinking also shaped Microsoft’s approach of focusing on dataset quality rather than quantity.

This is also the reality of current frontier models. Between GPT-1 and GPT-3, parameter counts grew roughly 1,000 times, and then another 10 times between GPT-3 and GPT-4, from 175 billion to a reported 1.8 trillion parameters. Thanks to new research from Epoch AI, we have an idea of how the trend is shifting: frontier models’ parameter counts are no longer increasing with every new release.

“Let alone reaching the 10 trillion parameter mark, current frontier models such as the original GPT-4o and Claude 3.5 Sonnet are probably an order of magnitude smaller than GPT-4,” said Ege Erdil, a researcher at Epoch AI.


According to Epoch AI, GPT-4o is estimated to have around 200 billion parameters, and Claude 3.5 Sonnet around 400 billion.

No Value in Untapped Data Either

There is another side to the debate. Many believe we have only exhausted the data that is publicly available, and that much more is locked inside the vaults of companies and enterprises. There may also be a lot of data paywalled inside academia, along with data that isn’t available through digital media at all.

Moreover, there might also be a lot of untapped data that isn’t human text at all. Dhruv Batra, a former senior director at Meta’s Fundamental AI Research (FAIR), said on X, “We have more videos than we know what to do with. We just haven’t solved pre-training in vision.”

Marcelo P. Lima (@MarceloPLima) asked on X on December 14, 2024: “Dumb question: everyone keeps saying we’ve run out of training data and Ilya just gave a talk asserting as much. But: every day 3.7m new videos are uploaded to YouTube; ~4m new scientific papers/yr; ~4m new books/yr. How are we running out of training data?” pic.twitter.com/4mKmCFS6px

The debate has another layer: is there any value in this locked, untapped and unreleased data that AI can’t yet generate by itself? After all, the foundational data is already out in the world. For instance, last year, Bloomberg revealed a large finance-focused language model called BloombergGPT. It was pre-trained on curated data from Bloomberg, and yet it lagged behind Meta’s Llama 2.

Camilo Thorne, principal data scientist at Elsevier, also pointed out that datasets with nothing remotely similar already available are quite rare.

“Many business or private or classified datasets have some (often massive) open counterpart out there. Doesn’t look like a winning bet,” he said.

It all circles back to synthetic data and its power to create new datasets that allow small models like Phi-4 to take on, and even outperform, the best models out there.
