Microsoft Launches Inference Framework to Run 100B 1-Bit LLMs on Local Devices

Microsoft has launched BitNet.cpp, an inference framework for 1-bit large language models, enabling fast and efficient inference for models like BitNet b1.58.

Earlier this year, Microsoft published an extensive paper on 1-bit LLMs.

The framework offers a suite of optimised kernels that currently support lossless inference on CPU, with plans for NPU and GPU support in the future.

The crux of this innovation lies in how each parameter in the model, commonly known as a weight, is represented using only 1.58 bits. Unlike traditional LLMs, which typically store weights as 16-bit floating-point values (FP16) or in lower-precision formats such as NVIDIA's FP4, BitNet b1.58 restricts each weight to one of three values: -1, 0, or 1. A three-valued weight carries log2(3) ≈ 1.58 bits of information, which is where the 1.58-bit figure comes from.

This substantial reduction in bit usage is the cornerstone of the proposed model: despite the ternary weights, it matches full-precision LLMs of the same size and training data in end-task performance.
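To make the ternary representation concrete, here is a minimal NumPy sketch in the spirit of the absmean quantisation described in the BitNet b1.58 paper. It is an illustration only, not Microsoft's optimised kernel code; the function name and per-matrix scaling are simplifications.

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Illustrative ternary quantisation in the spirit of BitNet b1.58.

    Each weight is scaled by the mean absolute value of the matrix and
    rounded to the nearest value in {-1, 0, 1}. A simplified sketch,
    not the framework's actual kernels.
    """
    gamma = np.mean(np.abs(w))                       # per-matrix scale
    w_ternary = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return w_ternary.astype(np.int8), gamma          # ternary weights plus scale

# Example: a small weight matrix collapses to values from {-1, 0, 1}
w = np.random.randn(4, 4).astype(np.float32)
w_q, scale = absmean_ternary_quantize(w)
print(w_q)      # only -1, 0 or 1
print(scale)    # dequantise roughly as w ≈ w_q * scale
```

Because each weight needs less than two bits instead of sixteen, memory traffic drops sharply, which is what enables the CPU speedups and energy savings reported below.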

The initial release is optimised for ARM and x86 CPUs, showcasing significant performance improvements. On ARM CPUs, speedups range from 1.37x to 5.07x, particularly benefiting larger models.

Energy consumption is also reduced, with decreases of 55.4% to 70.0%. On x86 CPUs, speedups vary from 2.37x to 6.17x, alongside energy reductions of 71.9% to 82.2%.

Notably, BitNet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving processing speeds comparable to human reading, at 5-7 tokens per second.

BitNet.cpp supports a variety of 1-bit models available on Hugging Face and aims to inspire the development of additional 1-bit LLMs in large-scale settings. The tested models are primarily dummy setups used to illustrate the framework’s capabilities.

A demo showcasing BitNet.cpp running a BitNet b1.58 3B model on Apple M2 is available for review. The project timeline indicates the 1.0 release occurred on October 17, 2024, alongside prior advancements in 1-bit transformers and LLM scaling.

The installation process for BitNet.cpp requires Python 3.9 or later, CMake 3.22 or later, and Clang 18 or later. For Windows users, Visual Studio 2022 is necessary, with specific options selected during installation. Debian/Ubuntu users can utilise an automatic installation script for convenience. The repository can be cloned from GitHub, and dependencies installed via conda.
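For readers who want to try it, the snippet below sketches that setup flow driven from Python. The repository URL matches the public microsoft/BitNet repo, but the environment name and exact command sequence are assumptions; defer to the project README.

```python
import subprocess

# Hypothetical setup helper mirroring the README-style flow described above;
# the conda environment name and command order are assumptions.
setup_commands = [
    # Clone BitNet.cpp together with its llama.cpp-based submodules
    ["git", "clone", "--recursive", "https://github.com/microsoft/BitNet.git"],
    # Create a conda environment with a supported Python version (3.9+)
    ["conda", "create", "-n", "bitnet-cpp", "python=3.9", "-y"],
]

for cmd in setup_commands:
    subprocess.run(cmd, check=True)  # stop at the first failing step

# Remaining dependencies are then installed inside the environment,
# e.g. via the repository's requirements file.
```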

Usage instructions detail how to run inference with the quantised model and how to benchmark it. Scripts are provided so users can benchmark their own models across a range of settings.
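As a rough illustration of that workflow, the snippet below shells out to the repository's inference script. The script name follows the public repo, but the flag names and model path are assumptions for illustration; check the README for the exact invocation.

```python
import subprocess

# Assumed model path and flags, for illustration only.
model_path = "models/bitnet_b1_58-3B/ggml-model-i2_s.gguf"  # hypothetical quantised model file

subprocess.run(
    [
        "python", "run_inference.py",
        "-m", model_path,                  # quantised 1.58-bit model
        "-p", "What is 1-bit inference?",  # prompt
        "-n", "64",                        # number of tokens to generate
    ],
    check=True,
)
```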

This project builds on the llama.cpp framework and acknowledges contributions from the open-source community, particularly the T-MAC team for their input on low-bit LLM inference methods. More updates and details about future enhancements will be shared soon.
