RNNs are Back to Compete with Transformers

Jürgen Schmidhuber was probably right when he said that recurrent neural networks (RNNs) are all we need. While Transformers dominate many areas of natural language processing, most notably LLMs, they still struggle when dealing with long sequences.

To address this, researchers have introduced several alternative architectures, including Mamba. However, researchers from Borealis AI decided to revisit RNNs to figure out if they can solve some of the current problems with LLMs.

The research group, led by Yoshua Bengio, one of the godfathers of deep learning, believes that earlier RNNs were slow to train because they had to go through the backpropagation through time (BPTT) method, something that Schmidhuber has frequently claimed credit for introducing.

“Were RNNs All We Needed?” asked the researchers, in a bid to revive traditional RNNs, including LSTMs (1997) and GRUs (2014).

They concluded that by removing the hidden-state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer need BPTT and can be trained efficiently in parallel.
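Concretely, taking the GRU as an example, the change looks roughly like this (a paraphrase of the paper's idea rather than its exact notation, with biases omitted; W_z and W_h denote learned input projections):

GRU:     z_t = sigmoid(W_z x_t + U_z h_{t-1})    (update gate sees the previous hidden state)
minGRU:  z_t = sigmoid(W_z x_t)                  (update gate sees only the current input)
         h_tilde_t = W_h x_t                     (candidate state sees only the current input)
         h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t

Because z_t and h_tilde_t no longer depend on h_{t-1}, they can be computed for every timestep at once, and all that remains is a linear recurrence in h_t, which can be resolved with a parallel scan.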

What Makes RNNs Special This Time?

Researchers have now introduced minimal versions of LSTMs and GRUs, called minLSTM and minGRU, which are stripped-down versions of the original models. Unlike traditional RNNs, these minimal versions can be trained in parallel using the parallel scan algorithm, significantly speeding up training.

These two models also use significantly fewer parameters than their traditional counterparts, and minGRU and minLSTM are 175x and 235x faster per training step than traditional GRUs and LSTMs for a sequence length of 512.
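Building on the equations above, a minimal sketch of how such a layer can be trained without a step-by-step loop might look like the following. This is illustrative PyTorch code, not the authors' released implementation; the names linear_scan and MinGRU are ours, and the paper uses a log-space scan for better numerical stability than the cumulative-product form shown here.

import torch
import torch.nn as nn

def linear_scan(a, b, h0):
    # Solves h_t = a_t * h_{t-1} + b_t for all timesteps at once.
    # Dividing by the running product A_t = a_1 * ... * a_t turns the
    # recurrence into a cumulative sum, so no Python loop over time is needed.
    # (The paper's log-space parallel scan avoids the underflow this form can hit.)
    A = torch.cumprod(a, dim=1)                              # (batch, time, hidden)
    return A * (h0.unsqueeze(1) + torch.cumsum(b / A, dim=1))

class MinGRU(nn.Module):
    # Sketch of a minGRU layer: the update gate z_t and candidate state
    # h_tilde_t depend only on the current input x_t, never on h_{t-1}.
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)
        self.to_h = nn.Linear(d_in, d_hidden)

    def forward(self, x, h0):
        z = torch.sigmoid(self.to_z(x))    # gate: a function of x_t only
        h_tilde = self.to_h(x)             # candidate: a function of x_t only
        # Remaining recurrence is linear: h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t
        return linear_scan(1.0 - z, z * h_tilde, h0)

x = torch.randn(2, 512, 64)                # (batch, seq_len, features)
h0 = torch.zeros(2, 128)
out = MinGRU(64, 128)(x, h0)               # (2, 512, 128), computed without looping over time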

A developer on Hacker News said that he enjoys the simplicity of the minGRU architecture. He added that the proposed hidden states and mix factors for each layer depend only on the current token, which lets you compute all of them in parallel if you know the whole sequence ahead of time (as during training) and then combine them in linear time using a parallel scan.

“The fact that this is competitive with Transformers and state-space models in their small-scale experiments is gratifying to the ‘best PRs are the ones that delete code’ side of me. That said, we won’t know for sure if this is a capital-B Breakthrough until someone tries scaling it up to parameters and data counts comparable to SOTA models,” he added.

But with Transformers, you can also fetch any previous information at any moment, which is quite useful. RNNs, meanwhile, are constantly updating and overwriting their memory, which means they need to predict what will be useful later in order to store it.

This is a massive advantage for Transformers in interactive use cases like ChatGPT, where you give the model context and ask questions over multiple turns; which part of the context matters for a given question only becomes known later in the token sequence.

To be clear, attention-based models have an advantage here, but there are also hybrid models that successfully combine both approaches, like Jamba.

RNNs Will Scale

This is not entirely new. In fact, researchers made a similar case in 2019 in the research paper titled ‘Single-Headed Attention RNN: Stop Thinking With Your Head’. It demonstrated near state-of-the-art results using LSTMs, a variant of recurrent neural networks designed to retain information from earlier in a sequence, suggesting that the “Transformer hype” may be overblown.

The Tale of Scaling and Transformers

David Andrés, a data scientist at Fever, mentioned that the “self-loop” is one of the most distinctive features of RNNs, where each unit receives an input and then passes its output (known as the hidden state) on to the next step in line.

“This self-loop allows the network to remember previous information and use it to inform later steps, creating an internal “memory” that helps in processing sequences. This is crucial for tasks like time-series forecasting or natural language understanding, where the order and context of data points matter,” he added, further suggesting how RNNs can be helpful in predictive tasks and real-time language translation.
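For contrast with the minimal variants above, a vanilla (Elman-style) RNN step that captures the “self-loop” he describes might look like this. The code is illustrative and not from the paper; note how the hidden state feeds back into the computation, forcing the steps to run in order.

import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    # One step of a classic RNN: the new hidden state depends on both the
    # current input and the previous hidden state, so steps cannot be parallelised.
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.in_proj = nn.Linear(d_in, d_hidden)
        self.state_proj = nn.Linear(d_hidden, d_hidden)

    def forward(self, x_t, h_prev):
        return torch.tanh(self.in_proj(x_t) + self.state_proj(h_prev))

cell = VanillaRNNCell(64, 128)
h = torch.zeros(2, 128)
for x_t in torch.randn(2, 10, 64).unbind(dim=1):   # sequential loop over time
    h = cell(x_t, h)                                # the "self-loop": h feeds back in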

Along similar lines, a few months ago (before this research paper was published), a Reddit user mentioned that he would bet on the return of RNNs once we find better methods to train them and a way to make the model learn adaptive computation time (ACT).

He also mentioned that RNNs will be crucial in situations where all of the input steps don’t fit into memory. Transformers require the entire sequence to be stored in memory, but that won’t be realistic when models become highly multimodal and start taking in a lot of information, such as images, video, and sound, at once.

“RNNs simply don’t have this limitation by design; they only need the data corresponding to the current timestep. That’s why I think they will end up scaling better in the context of multimodality, even if Transformers stay a few percent better in terms of accuracy,” he added.
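As a rough, back-of-the-envelope illustration of his point (the model sizes below are hypothetical and not taken from the paper): a Transformer's key-value cache grows with sequence length, while a recurrent state stays fixed.

def kv_cache_elements(n_layers, n_heads, d_head, seq_len):
    # Keys and values are cached for every layer, head, and past token.
    return 2 * n_layers * n_heads * d_head * seq_len

def rnn_state_elements(n_layers, d_hidden):
    # One fixed-size hidden state per layer, regardless of sequence length.
    return n_layers * d_hidden

# Hypothetical sizes, just to show the scaling behaviour:
print(kv_cache_elements(n_layers=32, n_heads=32, d_head=128, seq_len=100_000))  # ~26 billion values
print(rnn_state_elements(n_layers=32, d_hidden=4096))                           # ~131 thousand values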

That user’s prediction appears to have been correct: the simplified versions of LSTMs and GRUs (minLSTM and minGRU) introduced in this paper offer significant improvements in scalability compared to traditional RNNs, allowing them to compete more effectively with Transformers.

This hybrid approach of combining the strengths of RNNs with lessons learned from the Transformer era could point the way forward for sequence modelling tasks. The ability to train these models efficiently on large datasets while maintaining the conceptual simplicity of RNNs is particularly appealing.
