While there’s no shortage of efforts to build powerful LLMs that solve all sorts of challenges, Google DeepMind’s approach to reducing the cost, compute, and resources required to run an LLM is a ray of hope for environmental and sustainability concerns.
Google DeepMind, in collaboration with KAIST AI, has proposed a method called Relaxed Recursive Transformers, or RRTs.
It allows LLMs to be converted to behave like small language models (SLMs), yet outperform many of the standard SLMs in the ecosystem today.
Less is More
At the core of RRTs is a technique called layer tying. Instead of passing the input through a large stack of distinct layers, the model passes it through a small set of shared layers recursively, producing a comparable effect. This cuts memory requirements and significantly reduces computational resources.
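To make the idea concrete, here is a minimal NumPy sketch of layer tying; the sizes, the single weight matrix, and the `tied_forward` function are illustrative assumptions, not the paper’s actual architecture.

```python
import numpy as np

# Illustrative sketch of layer tying: one shared weight matrix is applied
# recursively instead of storing a distinct matrix per layer.
# (Toy sizes; not the paper's configuration.)
rng = np.random.default_rng(0)
d = 4                                            # toy hidden size
W_shared = rng.normal(scale=0.1, size=(d, d))    # the single tied matrix

def tied_forward(x, num_loops=3):
    """Pass x through the SAME weights num_loops times, standing in for
    num_loops distinct transformer layers."""
    h = x
    for _ in range(num_loops):
        h = np.tanh(h @ W_shared)                # shared weights each pass
    return h

out = tied_forward(rng.normal(size=(d,)))
# Storage: one d*d matrix instead of three distinct ones at depth 3.
```

However deep the recursion runs, the tied model stores only one block of weights, which is where the memory saving comes from.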
The technique also incorporates LoRA, or low-rank adaptation. Low-rank matrices adjust the shared weights, introducing a slight amount of variation so that each repeated pass behaves somewhat differently when processing the input.
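A rough sketch of how per-depth low-rank deltas could relax tied weights follows; the sizes, zero-initialised `B` matrices, and `relaxed_forward` function are assumptions for illustration, not the paper’s exact setup.

```python
import numpy as np

# Sketch: one low-rank (A, B) pair per recursive pass adjusts the shared
# weights. B starts at zero, so the model initially behaves exactly like
# plain layer tying. (Toy sizes; illustrative only.)
rng = np.random.default_rng(1)
d, r, depth = 8, 2, 3                      # hidden size, low rank, loops
W_shared = rng.normal(scale=0.1, size=(d, d))
loras = [(rng.normal(scale=0.1, size=(d, r)), np.zeros((r, d)))
         for _ in range(depth)]

def relaxed_forward(x):
    h = x
    for A, B in loras:
        W_eff = W_shared + A @ B           # depth-specific low-rank delta
        h = np.tanh(h @ W_eff)
    return h

h = relaxed_forward(np.ones(d))
# Each pair adds only 2*d*r = 32 parameters, versus d*d = 64 for a full
# untied matrix at that depth.
```

The low-rank pairs are what make the recursion “relaxed”: each pass is cheap to specialise without untying the shared weights.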
The model is also uptrained, meaning the recursive layers are iteratively fine-tuned on additional training data so the shared weights can recover performance on the task.
And that’s not all. RRTs use a continuous batch-wise processing technique in which multiple inputs are processed simultaneously. Inputs in a batch can be at different points inside the layer-looping structure: one part of the batch can be in its first loop while another is in its second or third.
If a satisfactory output is generated before an input completes the set number of loops, it can exit the model early, potentially saving computational resources.
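The saving can be illustrated with some toy accounting; the per-sample exit loops below are made-up numbers, not figures from the paper.

```python
# Toy accounting of early exit with continuous depth-wise batching.
# Each entry is the loop at which that sample's output is already
# satisfactory (illustrative numbers, not from the paper).
exit_loops = [1, 3, 2, 1]
max_loops = 3

# Naive batching: every sample occupies its slot for all max_loops passes.
naive_cost = len(exit_loops) * max_loops
# Continuous batching with early exit: a sample frees its slot as soon as
# it exits, so total work is just the sum of per-sample loop counts.
early_cost = sum(exit_loops)

print(naive_cost, early_cost)  # 12 7
```

Here, early exit does the same job in 7 loop-passes instead of 12, which is the kind of throughput gain continuous batching is meant to unlock, though the synchronisation issue Bae describes below is exactly what makes this non-trivial in practice.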
In an interaction with AIM, Sangmin Bae, one of the paper’s authors, said, “We identified a critical challenge inherent in early exiting: the synchronisation issue. This issue arises when a token exits at an intermediate layer, requiring it to wait for other unfinished samples within the same batch to complete processing.”
“Consequently, batched inference suffers from reduced efficiency, hindering its widespread adoption in practical applications,” said Bae.
Numbers Don’t Lie
The authors compared a large language model with a recursive layer to a small language model of a similar parameter size. For an uptrained recursive Gemma 1B model converted from a pre-trained Gemma 2B, a 13.5 percentage point absolute accuracy improvement (22% error reduction) was observed on few-shot tasks compared to a non-recursive Gemma 1B model (pre-trained from scratch).
The recursive Gemma model, uptrained on 60 billion tokens, achieved performance parity with the full-size Gemma model trained on 3 trillion tokens.
On a larger scale, these numbers may very well contribute to impactful energy savings. “We also anticipate a significant reduction in the parameter footprint, leading to substantial energy savings commensurate with the increase in inference throughput,” added Bae.
Like most research, RRTs have their fair share of challenges. Bae said, “Further research is needed to determine the uptraining cost associated with scaling to larger models.”
Currently, the research presents only hypothetical speedup estimates based on an oracle-exiting algorithm, without an actual implementation of an early-exiting algorithm.
“Future work will focus on achieving practical speedup and inference optimisation with real-world early-exiting algorithms,” added Bae. Once these challenges are resolved, it may not take long before RRTs are up and running in real-world applications.
Bae is confident the approach will scale. “We believe this approach is scalable to significantly larger models. With effective engineering of continuous depth-wise batching, we foresee substantial inference speed improvements in real-world deployments,” he said.
Beyond Meta’s Quantisation and Layer Skip
Google DeepMind isn’t the only one exploring ways to scale down LLMs without compromising on performance. A few days ago, Meta made quite a mark by introducing quantised LLMs that can run on devices with less memory.
Both quantisation and RRTs increase the efficiency of LLMs and use LoRA to compensate for performance losses, but they’re not quite the same thing.
Quantisation focuses on reducing the precision of weights to shrink the space and memory the model occupies. RRTs, in contrast, focus on increasing throughput: the speed at which inputs are processed and outputs are generated.
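For a sense of what weight quantisation does, here is a toy int8 absmax sketch; the scheme and sizes are illustrative assumptions, not Meta’s actual recipe.

```python
import numpy as np

# Toy absmax int8 weight quantisation (illustrative sketch only).
rng = np.random.default_rng(2)
w = rng.normal(size=8).astype(np.float32)    # original fp32 weights
scale = float(np.abs(w).max()) / 127.0       # absmax scale factor
q = np.round(w / scale).astype(np.int8)      # int8 storage: 4x smaller
w_hat = q.astype(np.float32) * scale         # dequantise at inference
max_err = float(np.abs(w - w_hat).max())     # bounded by ~scale/2
```

The model gets smaller because each weight drops from 32 bits to 8, at the cost of a small, bounded rounding error, a different lever from RRTs’ reuse of the same weights across recursive passes.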
One of Meta’s other works is Layer Skip, which involves skipping layers inside an LLM during both training and inference. It employs an ‘early exit loss’ to ensure performance isn’t affected.
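The skipping idea can be sketched in a few lines; the skip probability, the stand-in “layer”, and the `layerskip_forward` function are hypothetical illustrations, not Meta’s implementation.

```python
import random

# Toy sketch of layer skipping: during training, layers are randomly
# dropped; at inference, the model can exit after fewer layers.
# (A "layer" here just adds 1; illustrative only.)
random.seed(0)
num_layers = 6

def layerskip_forward(x, p_skip=0.3, training=True):
    depth_used = 0
    for _ in range(num_layers):
        if training and random.random() < p_skip:
            continue                 # this layer is skipped this pass
        x = x + 1.0                  # stand-in for a transformer layer
        depth_used += 1
    return x, depth_used

x, depth = layerskip_forward(0.0, training=False)  # full depth: 6 layers
```

Training with random skips is what lets intermediate layers produce usable outputs, which the early exit loss then reinforces.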
Unlike Layer Skip and quantisation, RRTs involve sharing parameters and reusing them recursively.
Bae says, “Unlike traditional LayerSkip, which generates N draft tokens and then verifies them in an upper layer with N inputs, RRTs allow real-time verification during draft token generation due to shared parameters between the draft model and upper layers (for verification).”
He also mentioned that post-optimisation, Layer Skip and quantisation can be applied along with RRTs.
“We anticipate significant synergy with speculative decoding methods. Specifically, continuous batching in RRTs enables real-time verification of draft tokens, promising substantial speed improvements,” he said.
Efficient Language Models
We’ve seen a rise in small language models over the recent few months, and they can greatly help in several applications that do not demand high output accuracy.
With further research and development focusing on improving the performance of SLMs and optimising LLMs, will we reach a point where standard large parameter models seem redundant for most applications?
Meta’s quantised models, Microsoft’s Phi, Hugging Face’s SmolLM and OpenAI’s GPT-4o mini indicate strong efforts to build efficient, small-sized models. The Indian AI ecosystem was quick to turn towards SLMs as well. Recently, Infosys and Sarvam AI collaborated to develop small language models for banking and IT applications.
We’re also certain to see rising interest in techniques and frameworks that optimise LLMs.
The post Google DeepMind Just Made Small Models Irrelevant with RRTs appeared first on Analytics India Magazine.