Microsoft Launches Inference Framework to Run 100B 1-Bit LLMs on Local Devices

Microsoft has launched BitNet.cpp, an inference framework for 1-bit large language models, enabling fast and efficient inference for models like BitNet b1.58.

Earlier this year, Microsoft published an extensive paper on 1-bit LLMs.

The framework offers a suite of optimised kernels that currently support lossless inference on CPU, with plans for NPU and GPU support in the future.

The crux of this innovation lies in how each parameter in the model, commonly known as a weight, is represented using only 1.58 bits. Unlike traditional LLMs, which typically store weights as 16-bit floating-point values (FP16) or in lower-precision formats such as NVIDIA's FP4, BitNet b1.58 restricts each weight to one of three values: -1, 0, or 1. A three-valued weight carries log2(3) ≈ 1.58 bits of information, which is where the 1.58-bit figure comes from.

This substantial reduction in bit usage is the cornerstone of the proposed model: despite the ternary weights, it matches full-precision LLMs of the same size and training data in end-task performance.
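To make the ternary representation concrete, here is a minimal NumPy sketch in the spirit of the absmean quantisation described in the BitNet b1.58 paper. It is an illustration only, not Microsoft's optimised kernel code; the function name and per-matrix scaling are simplifications.

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Illustrative ternary quantisation in the spirit of BitNet b1.58.

    Each weight is scaled by the mean absolute value of the matrix and
    rounded to the nearest value in {-1, 0, 1}. A simplified sketch,
    not the framework's actual kernels.
    """
    gamma = np.mean(np.abs(w))                       # per-matrix scale
    w_ternary = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return w_ternary.astype(np.int8), gamma          # ternary weights plus scale

# Example: a small weight matrix collapses to values from {-1, 0, 1}
w = np.random.randn(4, 4).astype(np.float32)
w_q, scale = absmean_ternary_quantize(w)
print(w_q)      # only -1, 0 or 1
print(scale)    # dequantise roughly as w ≈ w_q * scale
```

Because each weight needs less than two bits instead of sixteen, memory traffic drops sharply, which is what enables the CPU speedups and energy savings reported below.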

The initial release is optimised for ARM and x86 CPUs, showcasing significant performance improvements. On ARM CPUs, speedups range from 1.37x to 5.07x, particularly benefiting larger models.

Energy consumption is also reduced, with decreases of 55.4% to 70.0%. On x86 CPUs, speedups vary from 2.37x to 6.17x, alongside energy reductions of 71.9% to 82.2%.

Notably, BitNet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving processing speeds comparable to human reading, at 5-7 tokens per second.

BitNet.cpp supports a variety of 1-bit models available on Hugging Face and aims to inspire the development of additional 1-bit LLMs in large-scale settings. The tested models are primarily dummy setups used to illustrate the framework’s capabilities.

A demo showcasing BitNet.cpp running a BitNet b1.58 3B model on Apple M2 is available for review. The project timeline indicates the 1.0 release occurred on October 17, 2024, alongside prior advancements in 1-bit transformers and LLM scaling.

The installation process for BitNet.cpp requires Python 3.9 or later, CMake 3.22 or later, and Clang 18 or later. For Windows users, Visual Studio 2022 is necessary, with specific options selected during installation. Debian/Ubuntu users can utilise an automatic installation script for convenience. The repository can be cloned from GitHub, and dependencies installed via conda.
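For readers who want to try it, the snippet below sketches that setup flow driven from Python. The repository URL matches the public microsoft/BitNet repo, but the environment name and exact command sequence are assumptions; defer to the project README.

```python
import subprocess

# Hypothetical setup helper mirroring the README-style flow described above;
# the conda environment name and command order are assumptions.
setup_commands = [
    # Clone BitNet.cpp together with its llama.cpp-based submodules
    ["git", "clone", "--recursive", "https://github.com/microsoft/BitNet.git"],
    # Create a conda environment with a supported Python version (3.9+)
    ["conda", "create", "-n", "bitnet-cpp", "python=3.9", "-y"],
]

for cmd in setup_commands:
    subprocess.run(cmd, check=True)  # stop at the first failing step

# Remaining dependencies are then installed inside the environment,
# e.g. via the repository's requirements file.
```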

Usage instructions detail how to run inference with the quantised model and how to benchmark it. Scripts are provided so users can benchmark their own models across a range of settings.
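As a rough illustration of that workflow, the snippet below shells out to the repository's inference script. The script name follows the public repo, but the flag names and model path are assumptions for illustration; check the README for the exact invocation.

```python
import subprocess

# Assumed model path and flags, for illustration only.
model_path = "models/bitnet_b1_58-3B/ggml-model-i2_s.gguf"  # hypothetical quantised model file

subprocess.run(
    [
        "python", "run_inference.py",
        "-m", model_path,                  # quantised 1.58-bit model
        "-p", "What is 1-bit inference?",  # prompt
        "-n", "64",                        # number of tokens to generate
    ],
    check=True,
)
```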

This project builds on the llama.cpp framework and acknowledges contributions from the open-source community, particularly the T-MAC team for their input on low-bit LLM inference methods. More updates and details about future enhancements will be shared soon.
