The Breakthrough AI Scaling Desperately Needed

When Transformers were introduced, they reshaped the entire AI ecosystem. But there was a problem: once a model was trained, growing it or changing its architecture meant retraining the entire model from scratch.

This was a critical issue. To address it, researchers from Google, Max Planck Institute, and Peking University introduced a new approach called TokenFormer.

The innovation lies in treating model parameters as tokens themselves, allowing for a dynamic interaction between input tokens and model parameters through an attention mechanism rather than fixed linear projections.

The traditional Transformer architecture faces a significant challenge when scaling—it requires complete retraining from scratch when architectural modifications are made, leading to enormous computational costs. TokenFormer addresses this by introducing a token-parameter attention (Pattention) layer that enables incremental scaling without full retraining.
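
At a high level, the Pattention layer works like cross-attention in which the keys and values are learnable parameter tokens rather than projections of the input. Below is a minimal PyTorch sketch of that idea, using a plain scaled softmax for familiarity; the paper actually replaces the standard softmax with a modified, GeLU-based normalisation, and all names and dimensions here are illustrative assumptions rather than the authors’ implementation.

```python
# Minimal sketch of a token-parameter attention (Pattention) layer.
# Illustrative only: the paper uses a modified normalisation in place of softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pattention(nn.Module):
    def __init__(self, dim_in: int, dim_out: int, num_param_tokens: int):
        super().__init__()
        # Learnable key/value parameter tokens replace a fixed weight matrix.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim_in) * 0.02)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, dim_out) * 0.02)

    def forward(self, x):
        # x: (batch, seq_len, dim_in); input tokens attend over parameter tokens.
        scores = x @ self.key_params.t() / self.key_params.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=-1)   # (batch, seq_len, num_param_tokens)
        return weights @ self.value_params    # (batch, seq_len, dim_out)

layer = Pattention(dim_in=768, dim_out=768, num_param_tokens=1024)
out = layer(torch.randn(2, 16, 768))          # (2, 16, 768)
```

Scaling the model then amounts to appending rows to `key_params` and `value_params`, rather than changing any tensor shape that the rest of the network depends on.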

This approach has demonstrated impressive results, successfully scaling from 124M to 1.4B parameters while maintaining performance comparable to Transformers trained from scratch.

(Figure: Training cost reduced significantly using the TokenFormer architecture)

Explaining the significance of this research, a Reddit user said that it allows for incremental learning. In other words, changing the model size and adding more parameters does not mean you need to train the entire model from scratch.

“Specifically, our model requires only one-tenth of the training costs associated with Transformer baselines. To mitigate the effects of varying training data, we also included the performance curve of a Transformer trained from scratch using an equivalent computational budget of 30B tokens.

“Under the same computational constraints, our progressively scaled model achieves a lower perplexity of 11.77 compared to the Transformer’s 13.34, thereby highlighting the superior efficiency and scalability of our approach,” he added, pointing to the drastically reduced training costs that TokenFormer promises.

Why Does Scaling Efficiency Matter?

One of TokenFormer’s most compelling features is its ability to preserve existing knowledge while scaling, offering a new approach to continuous learning. This aligns with industry efforts to rethink scaling efficiency. When new parameters are initialised to zero, the model can maintain its current output distribution while incorporating additional capacity.

This characteristic makes it particularly valuable for continuous learning scenarios, where models need to adapt to new data without losing previously acquired knowledge.
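
To see why zero-initialisation preserves behaviour, here is a small sketch of growing a token-parameter layer by appending new key/value rows. An element-wise GeLU is used as a stand-in for the paper’s modified softmax (with an ordinary softmax the new rows would still dilute the existing attention weights); the function names and sizes are illustrative assumptions, not the authors’ code.

```python
# Sketch: incremental scaling by appending zero-initialised parameter tokens.
import torch
import torch.nn.functional as F

def pattention(x, key_params, value_params):
    scores = x @ key_params.t()     # (batch, seq_len, num_param_tokens)
    weights = F.gelu(scores)        # element-wise, so zero scores give zero weight
    return weights @ value_params   # (batch, seq_len, dim_out)

torch.manual_seed(0)
dim, n_old, n_new = 64, 128, 32
x = torch.randn(2, 8, dim)
k_old, v_old = torch.randn(n_old, dim), torch.randn(n_old, dim)

# Grow the layer: append zero-initialised key/value parameter tokens.
k_big = torch.cat([k_old, torch.zeros(n_new, dim)], dim=0)
v_big = torch.cat([v_old, torch.zeros(n_new, dim)], dim=0)

# The new rows contribute nothing yet, so outputs are unchanged; they act as
# spare capacity that can be trained on new data without disturbing old behaviour.
print(torch.allclose(pattention(x, k_old, v_old),
                     pattention(x, k_big, v_big)))   # True
```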

Run Your Own Experiments at Lower Costs

Meanwhile, the architecture has shown remarkable efficiency in practical applications. In benchmark tests, TokenFormer achieved performance comparable to standard Transformers while requiring only one-tenth of the computational budget.

This efficiency extends to both language and vision tasks, with the model demonstrating competitive performance across various benchmarks, including zero-shot evaluations and image classification tasks.

TokenFormer’s design also offers advantages for long-context modelling, a crucial capability for modern language models. In traditional Transformers, scaling up the model widens the channel dimension, so the cost of every token-token interaction grows with model size. TokenFormer instead keeps that dimension fixed and adds parameter tokens, so token-token costs stay constant as the model scales.

This makes it particularly suitable for processing longer sequences, an increasingly important capability in contemporary AI applications.
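
A rough back-of-the-envelope comparison makes the point. The FLOP formulas and numbers below are illustrative assumptions, not figures from the paper.

```python
# Rough per-layer cost comparison (illustrative numbers only).
def token_token_flops(seq_len: int, dim: int) -> int:
    # Self-attention among input tokens scales as ~2 * T^2 * d.
    return 2 * seq_len * seq_len * dim

def token_param_flops(seq_len: int, dim: int, num_param_tokens: int) -> int:
    # Attention between T input tokens and n parameter tokens scales as ~2 * T * n * d.
    return 2 * seq_len * num_param_tokens * dim

T = 8192
# Transformer-style scaling widens the channel dimension, so token-token cost grows.
for d in (768, 1536, 3072):
    print("transformer  d =", d, " token-token FLOPs ~", token_token_flops(T, d))

# TokenFormer-style scaling keeps d fixed and adds parameter tokens instead,
# so the token-token term stays flat and only the token-parameter term grows.
for n in (1024, 4096, 16384):
    print("tokenformer  n =", n, " token-token FLOPs ~", token_token_flops(T, 768),
          " token-parameter FLOPs ~", token_param_flops(T, 768, n))
```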

A Reddit user praised this research, saying, “In a way, what they’ve developed is a system to store knowledge and incrementally add new knowledge without damaging old knowledge; it’s potentially a big deal.”

Meanwhile, multiple conversations have been taking place around technical breakthroughs, like TokenFormer, that could solve the scaling problem.
At Microsoft Ignite 2024, CEO Satya Nadella highlighted the shift in focus, stating, “The thing to remember is that these are not physical laws but empirical observations, much like Moore’s Law.”
He introduced “tokens per watt plus dollar” as a new metric for AI efficiency, emphasising value maximisation. NVIDIA’s Jensen Huang echoed these concerns, calling inference “super hard” due to the need for high accuracy, low latency, and high throughput.
“Our hopes and dreams are that, someday, the world will do a ton of inference,” he added, signalling the growing importance of scaling innovations like TokenFormer in the AI landscape.

Too Good to be True?

Multiple users have called the idea too good to be true and noted some issues in the research paper. A user said on Hacker News that it is hard to trust the numbers shown in the research. “When training a Transformer to compare against it, they replicate the original GPT-2 proposed in 2019. In doing so, they ignore years of architectural improvements, such as rotary positional embeddings, SwiGLU, and RMSNorm, which culminated in Transformer++,” he added.

On the other hand, another user from the same thread praised the approach, saying it looks like a huge deal. “I feel this could enable a new level of modularity and compatibility between publicly available weight sets, assuming they use similar channel dimensions. Maybe it also provides a nice formalism for thinking about fine-tuning, where you could adopt certain heuristics for adding/removing key-value pairs from the Pattention layers,” he added.

The user further mentioned that according to this paper, the model can grow or scale dynamically by simply adding new rows (key-value pairs) to certain matrices (like K and V in attention layers). The rows at the beginning might hold the most critical or foundational information, while later rows add more specific or less essential details.

While the approach looks promising on paper, we’ll have to wait for developers to implement it in actual models.
