AI is booming, and so is its energy consumption. According to reports, ChatGPT likely uses more than half a million kilowatt-hours of electricity to respond to some 200 million requests a day. In other words, ChatGPT consumes roughly as much energy each day as 17,000 US households.
A research paper titled ‘Addition is All You Need for Energy-Efficient Language Models’ notes that multiplying floating-point numbers consumes significantly more energy than integer operations. The paper states that multiplying two 32-bit floating-point numbers (fp32) costs four times as much energy as adding two fp32 numbers and 37 times as much as adding two 32-bit integers.
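These ratios are consistent with widely cited per-operation energy estimates for 45 nm silicon (roughly 3.7 pJ for an fp32 multiply, 0.9 pJ for an fp32 add and 0.1 pJ for an int32 add). The quick check below uses those ballpark figures, which are an assumption on our part rather than numbers quoted above:

```python
# Ballpark per-operation energy estimates (commonly cited 45 nm figures);
# treat them as illustrative, not as measurements from the L-Mul paper.
FP32_MUL_PJ = 3.7   # one fp32 multiplication, in picojoules
FP32_ADD_PJ = 0.9   # one fp32 addition
INT32_ADD_PJ = 0.1  # one int32 addition

print(f"fp32 mul vs fp32 add:  {FP32_MUL_PJ / FP32_ADD_PJ:.1f}x")   # ~4x
print(f"fp32 mul vs int32 add: {FP32_MUL_PJ / INT32_ADD_PJ:.1f}x")  # ~37x
```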
The researchers have proposed a new technique called linear-complexity multiplication (L-Mul), which addresses the problem of energy-intensive floating-point multiplications in large neural networks.
Before L-Mul, neural networks typically performed computations using standard floating-point multiplication, which is computationally expensive and energy-intensive, especially for LLMs, which typically have billions of parameters.
These operations consumed significant computational resources and energy, particularly in attention mechanisms and matrix multiplications.
The best part of this approach is that it is not dependent on any specific architecture. The researchers tested it on real-world models such as Llama 3.1 8B, Mistral-7B, and Gemma2-2B to back up these numbers.
After testing these models, the researchers concluded that the proposed method can replace different modules in Transformer layers in both fine-tuning and training-free settings.
Why is it Bigger Than LLMs?
As this approach is not limited to any particular model, the implementation of L-Mul need not be restricted to LLMs; it can also be extended to hardware to achieve energy efficiency across a much broader spectrum of workloads.
L-Mul is a new method that approximates floating-point multiplication using only simple integer additions. It is cheaper because its cost grows linearly with the number of mantissa bits (linear complexity), unlike conventional floating-point multiplication, where the mantissa product scales quadratically with the bit width.
L-Mul uses straightforward bit operations and additions, avoiding the expensive multiplication of the numbers’ mantissas and the associated rounding steps.
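To make the idea concrete, here is a minimal Python sketch of the approximation as we understand it from the paper: the mantissa product x_m * y_m in the exact expansion (1 + x_m)(1 + y_m) * 2^(x_e + y_e) is replaced by a small constant offset 2^(-l), so that on the bit level only additions remain. The sketch works on ordinary Python floats purely for readability; the real savings come from carrying out the equivalent step as integer additions on the bit representations of low-precision formats, which is not shown here, and the offset schedule should be treated as illustrative.

```python
import math

def l_mul(x: float, y: float, mantissa_bits: int = 8) -> float:
    """Approximate x * y by replacing the mantissa product x_m * y_m with a
    constant offset 2**(-l). On the bit level this would need additions only;
    this sketch uses float arithmetic just to show the values involved."""
    if x == 0.0 or y == 0.0:
        return 0.0

    sign = -1.0 if (x < 0) != (y < 0) else 1.0
    x, y = abs(x), abs(y)

    # Decompose each value as (1 + frac) * 2**exp with frac in [0, 1).
    fx, ex = math.frexp(x)          # frexp gives fx in [0.5, 1), x = fx * 2**ex
    fy, ey = math.frexp(y)
    xm, xe = 2 * fx - 1, ex - 1
    ym, ye = 2 * fy - 1, ey - 1

    # Offset exponent l depends on the mantissa width (illustrative schedule).
    if mantissa_bits <= 3:
        l = mantissa_bits
    elif mantissa_bits == 4:
        l = 3
    else:
        l = 4

    # Exact product:   (1 + xm + ym + xm * ym) * 2**(xe + ye)
    # L-Mul estimate:  (1 + xm + ym + 2**(-l)) * 2**(xe + ye)  -- no multiply
    mantissa = 1.0 + xm + ym + 2.0 ** (-l)
    return sign * math.ldexp(mantissa, xe + ye)

if __name__ == "__main__":
    # The approximation error is largest when both mantissa fractions are large.
    for a, b in [(1.5, 2.25), (-3.7, 0.8), (0.031, 41.0)]:
        print(f"{a} * {b}: exact = {a * b:.4f}, l_mul = {l_mul(a, b):.4f}")
```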
This approach not only reduces the computational cost but can also cut energy consumption by up to 95% for element-wise floating-point tensor multiplications and by 80% for dot products, while maintaining precision comparable to, and in many cases better than, 8-bit floating-point operations.
The cost of floating-point arithmetic is also why Google developed bfloat16, a truncated floating-point format for machine learning, and why NVIDIA created TensorFloat-32 specifically for AI applications on its GPUs.
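For illustration, bfloat16 is essentially an fp32 value with the lower 16 bits dropped (keeping the sign, the 8 exponent bits and 7 mantissa bits), which is what makes conversion so cheap. A small NumPy sketch of that truncation, not taken from the article (real implementations typically round rather than truncate):

```python
import numpy as np

def fp32_to_bf16_bits(x: float) -> np.uint16:
    """Truncate an fp32 value to bfloat16 by keeping only its top 16 bits."""
    bits = np.float32(x).view(np.uint32)
    return np.uint16(bits >> np.uint32(16))

def bf16_bits_to_fp32(b: np.uint16) -> np.float32:
    """Zero-fill the low 16 bits to expand bfloat16 back to fp32."""
    return (np.uint32(b) << np.uint32(16)).view(np.float32)

x = 3.14159
approx = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
print(x, float(approx))  # 3.14159 vs ~3.140625: about 3 significant decimal digits survive
```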
The energy implications are not limited to LLMs but go well beyond them. A Reddit user suggested that this research paper would probably lead all CPU manufacturers to relegate normal 8-bit float multiplication routines to a legacy/compatibility mode.
Instead, any FP8 multiplication could be performed natively using the L-Mul algorithm, potentially implemented in future hardware such as the 6090-series GPUs, CPUs beyond the 9000 series, or Apple’s M5 chips.
“It might force companies like Intel, AMD, NVIDIA, and Apple to quickly and substantially widen the memory buses across their entire hardware lines. If they don’t adapt, they risk being outpaced by alternatives. For instance, inexpensive RISC-V derivatives with extensive high-bandwidth memory (HBM) or even standard FPGAs with sufficient SRAM could potentially outperform NVIDIA’s top-end GB200 cards.
This disruption could occur before these established companies have time to develop and release competitive products, potentially reshaping the market dynamics within just a few months,” he added. Further, he suggested that this research can fundamentally change how hardware is built for neural networks.
Too Good to be True?
While this approach sounds promising, users have raised some concerns. A Reddit user pointed out that integer addition might take more than a single clock cycle on modern GPUs, especially if implemented through bit-level manipulations and approximations. Converting back and forth between floating-point and integer representations might also introduce additional overhead.
In terms of speed, the proposed approach might not deliver a speed-up on current GPU architectures, because GPUs are optimised for native floating-point operations. The approximation might involve multiple steps or more complex handling of integer operations, which could offset any potential gains.
The paper hints that specialised hardware designed to implement the L-Mul algorithm could lead to both speed and energy efficiency gains. However, on current GPU architectures that are designed for traditional floating-point operations, the method is more likely to achieve energy efficiency improvements rather than speed-ups.
“If energy efficiency is the primary concern, then this method could be very valuable in reducing costs. For performance improvements (speed), the gains might be minimal without hardware specifically optimised for the new method,” he added, further suggesting that specialised hardware is essential for a speed-up.
L-Mul performs on par with current standards while saving a large amount of power. So, even if it does not deliver better speeds, L-Mul should still be considered a valuable technique for reducing the energy consumption of neural networks.