Adding Noise Cancellation to LLMs


LLMs are great at summarisation but far less capable when the answer is buried in a pile of documents. This is because traditional Transformer models often struggle to allocate attention effectively: they give too much weight to irrelevant parts of the input, which degrades the quality of the model’s outputs. This problem is known as “attention noise”.

Sure, scaling up traditional Transformers does improve performance, but it typically requires significantly more compute and training data, which is costly and inefficient.

To solve this problem, Microsoft researchers recently published a paper proposing a fresh approach to calculating attention, one that focuses more sharply on relevant information and reduces the influence of irrelevant data.

Novak I Zukowski, a research scientist, explained that the DIFF Transformer (Differential Transformer) introduces a differential attention module that replaces conventional softmax attention. It uses a differential denoising approach to amplify relevant signals and suppress noise.

Attention score analysis reveals that the architecture significantly improves focus on pertinent information while reducing attention paid to irrelevant context, leading to better efficiency, improved context utilisation, and more order-invariant reasoning performance.

“The feed-forward network module remains similar to standard Transformers, with the primary innovation centred on the attention mechanism. Despite a slight decrease in raw throughput, DIFF Transformer demonstrates superior performance across various tasks, particularly in long-context scenarios, and shows promise for better quantisation due to reduced activation outliers,” he added.

DIFF transformer

Bye Bye, Noise!

The approach is simple. The model learns two separate projections for attention: one to actually attend, and a second that serves as a reference for noise cancellation. When attention is calculated, the difference between the two maps is taken to keep the signal and lose the noise.
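To make that concrete, here is a minimal single-head sketch in PyTorch. The function and tensor names are hypothetical, and several details of the published design (multi-head handling, per-head normalisation, and the re-parameterised scaling factor λ) are omitted; treat it as an illustration of the subtraction, not the paper’s implementation.

```python
import torch
import torch.nn.functional as F

def diff_attention(x, w_q, w_k, w_v, lam):
    """Single-head sketch of differential attention.

    x:        (batch, seq_len, d_model) input activations
    w_q, w_k: (d_model, 2 * d_head) projections, each split into two halves
    w_v:      (d_model, d_head) value projection
    lam:      learned scalar weighting the noise-cancelling map
    """
    d_head = w_q.shape[1] // 2
    q1, q2 = (x @ w_q).chunk(2, dim=-1)   # two query projections
    k1, k2 = (x @ w_k).chunk(2, dim=-1)   # two key projections
    v = x @ w_v

    scale = d_head ** -0.5
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) * scale, dim=-1)  # map that attends
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) * scale, dim=-1)  # reference map

    # Subtract the reference map so that attention shared by both maps
    # (the "noise") cancels out, leaving the differential signal.
    return (a1 - lam * a2) @ v


# Example usage with random tensors (shapes are illustrative only).
x = torch.randn(1, 16, 64)
w_q = torch.randn(64, 32)
w_k = torch.randn(64, 32)
w_v = torch.randn(64, 16)
out = diff_attention(x, w_q, w_k, w_v, lam=torch.tensor(0.8))
print(out.shape)  # torch.Size([1, 16, 16])
```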

A Reddit user said this is possible because both the weights and the scaling for taking the difference are trained in parallel with the rest of the model. The specialisation of the functional blocks occurs much as it does for neurons within a layer of a regular neural net.

“The two sets of weights learn different things. The second/negative set of weights is constrained by the softmax function to be unable to direct attention towards specific tokens. Doing so would require producing a negative value, and softmax output values are in the [0,1] range. So, the only thing the second set of values can productively learn to do is to suppress noise,” he added.
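A toy numerical example (the values are invented purely for illustration) shows why the subtraction acts like noise cancellation: weight that both maps place on distractor tokens cancels out, while the token favoured only by the primary map keeps a large positive weight.

```python
import torch

# Hypothetical attention weights over four tokens for one query position.
a1 = torch.tensor([0.50, 0.20, 0.15, 0.15])  # primary map: peaks on the relevant token
a2 = torch.tensor([0.10, 0.25, 0.30, 0.35])  # reference map: spread over distractors

lam = 0.8
print(a1 - lam * a2)  # roughly [0.42, 0.00, -0.09, -0.13]
```

Tokens that end up with near-zero or negative weight contribute little or nothing to the output, which is the suppression of irrelevant context the paper describes.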

The DIFF Transformer introduces differential attention to improve accuracy and reduce hallucinations by filtering out noise. It matches standard Transformer performance with roughly 35-40% fewer parameters and handles long contexts of up to 64K tokens effectively, enhancing in-context learning.

Attention to answer is improved using DIFF Transformer

Additionally, it supports low-bit quantisation without performance loss, offering a promising upgrade for language models with better long-context comprehension. This could lead to more efficient and accurate LLMs in the future.

Rohan Paul, an AI engineer, mentioned in his recent post on X that apart from reducing noise, the DIFF Transformer also shows remarkable improvement in performance.

For example, it shows a 30% accuracy improvement in key information retrieval with 64K context, a 10-20% accuracy gain in many-shot in-context learning across datasets, and a 7-11% reduction in hallucination for summarisation and question answering. It also maintains performance with 6-bit quantisation, where the standard Transformer degrades significantly.

What Does it Mean for a Regular User?

Apart from the Differential Transformer research paper, there are other efforts to reduce noise. For example, Denoising LM, a research paper from Apple, aims to use LLMs to clean up and correct errors in speech recognition outputs, significantly improving accuracy even in challenging, noisy environments.

This means we can expect more LLM-based technology that depends on outputs being as noise-free as possible, and that is where approaches like the DIFF Transformer come into play.
