Large language models (LLMs) demand substantial computational resources, which has largely confined them to powerful servers. A new generation of compact models, however, makes it possible to run capable LLMs directly on a smartphone, with no internet connection required.
Here are six open-source LLMs that can be optimised to run on smartphones.
- Gemma 2B: Google’s compact, high-performance LLM for mobile language tasks.
- Phi-2: Microsoft’s tiny model outperforming giants up to 25 times its size.
- Falcon-RW-1B: Efficient 1B-parameter model for resource-constrained mobile devices.
- StableLM-3B: Stability AI’s balanced model for diverse language tasks on phones.
- TinyLlama: Compact Llama variant delivering impressive results on cell phones.
- LLaMA-2-7B: Meta’s powerful 7B model for advanced tasks on high-end smartphones.
1. Gemma 2B
Google’s Gemma 2B is a compact language model that delivers impressive performance despite its small size. It utilises a multi-query attention mechanism, which helps reduce memory bandwidth requirements during inference.
This is particularly advantageous for on-device scenarios where memory bandwidth is often limited. With just 2 billion parameters, Gemma 2B achieves strong results on academic benchmarks for language understanding, reasoning, and safety.
It outperformed similarly sized open models on 11 out of 18 text-based tasks.
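To see why multi-query attention helps on-device, consider the key-value cache that must be held in memory during generation: with a single shared key-value head, the cache shrinks by roughly the number of query heads. The sketch below compares the two footprints; the layer count, head count and head dimension are illustrative values for a ~2B-parameter model, not Gemma 2B's exact configuration.

```python
# Rough KV-cache footprint: multi-head vs. multi-query attention.
# Config values are illustrative for a ~2B-parameter model, not Gemma 2B's exact spec.

def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim, bytes_per_value=2):
    # 2x accounts for keys and values; fp16 uses 2 bytes per value.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

layers, seq_len, query_heads, head_dim = 18, 4096, 8, 256

mha = kv_cache_bytes(layers, seq_len, num_kv_heads=query_heads, head_dim=head_dim)
mqa = kv_cache_bytes(layers, seq_len, num_kv_heads=1, head_dim=head_dim)  # one shared K/V head

print(f"Multi-head attention KV cache:  {mha / 1e6:.0f} MB")   # ~604 MB
print(f"Multi-query attention KV cache: {mqa / 1e6:.0f} MB")   # ~75 MB
```

On a phone, that difference translates directly into less memory traffic per generated token.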
2. Phi-2
With 2.7 billion parameters, Phi-2 has been shown to outperform models up to 25 times larger on certain benchmarks. It excels in tasks involving common sense reasoning, language understanding, and logical reasoning.
Phi-2 can be quantised to lower bit-widths such as 4-bit or 3-bit precision, shrinking the model to roughly 1.17-1.48 GB so it runs efficiently on mobile devices with limited memory and compute.
One of the key strengths of Phi-2 is its ability to perform common sense reasoning. The model has been trained on a large corpus of web data, allowing it to understand and reason about everyday concepts and relationships.
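Those size figures follow directly from the parameter count: bits per weight divided by eight gives bytes per parameter. A quick back-of-the-envelope estimate, ignoring the small overhead added by quantisation metadata:

```python
# Estimated size of Phi-2's weights at different precisions.
params = 2.7e9  # Phi-2 has roughly 2.7 billion parameters

for bits in (16, 8, 4, 3):
    size_gb = params * bits / 8 / 1e9
    print(f"{bits}-bit weights: ~{size_gb:.2f} GB")

# 4-bit -> ~1.35 GB and 3-bit -> ~1.01 GB, in line with the ~1.17-1.48 GB
# quoted above once quantisation metadata and embedding overheads are added.
```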
3. Falcon-RW-1B
Falcon-RW-1B is part of the Falcon family of language models, known for their efficiency and performance. The 'RW' stands for RefinedWeb, a training dataset curated for quality over quantity.
Falcon-RW-1B’s architecture is adapted from GPT-3 but incorporates techniques like ALiBi (Attention with Linear Biases) and FlashAttention to enhance computational efficiency. These optimisations make Falcon-RW-1B well-suited for on-device inference on resource-constrained devices like smartphones.
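As a rough illustration of how ALiBi works, the minimal PyTorch sketch below builds the per-head linear distance penalties that are added to attention scores in place of learned positional embeddings; it is a conceptual example, not Falcon's actual implementation.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalties added to attention scores (ALiBi)."""
    # Geometric slopes as in the ALiBi paper: head h gets slope 2^(-8(h+1)/num_heads).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = pos.view(-1, 1) - pos.view(1, -1)      # distance[i, j] = i - j
    # bias[h, i, j] = -slope_h * (i - j); the causal mask handles future positions.
    return -slopes.view(-1, 1, 1) * distance.view(1, seq_len, seq_len)

bias = alibi_bias(num_heads=4, seq_len=8)  # added to raw attention scores before softmax
```

Because the bias depends only on relative distance, the model can extrapolate to longer sequences than it saw in training, which is useful when context lengths vary on-device.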
The Falcon-RW-1B-Chat model builds conversational capabilities on top of Falcon-RW-1B-Instruct-OpenOrca, improving user engagement and widening its use cases in resource-constrained environments such as smartphones.
4. StableLM-3B
StableLM-3B, developed by Stability AI, is a 3-billion-parameter model that strikes a balance between performance and efficiency. Notably, despite being trained on fewer tokens, it outperforms some 7-billion-parameter models on certain benchmarks.
StableLM-3B can be quantised to lower bit-widths such as 4-bit precision, cutting the model size to roughly 1.5-2 GB so it runs efficiently on smartphones. Users have reported that StableLM-3B outperforms Stability AI's own 7B StableLM-Base-Alpha-v2.
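In practice, quantised models like this are usually distributed as GGUF files and run through llama.cpp, whose mobile ports power most on-phone LLM apps. The sketch below shows the same workflow on a desktop via the llama-cpp-python bindings; the file name is a placeholder for whichever quantised StableLM-3B build you download.

```python
from llama_cpp import Llama  # Python bindings for llama.cpp

# Hypothetical path to a 4-bit quantised StableLM-3B GGUF file.
llm = Llama(model_path="stablelm-3b-q4_k_m.gguf", n_ctx=2048, n_threads=4)

result = llm(
    "Summarise why small language models suit smartphones.",
    max_tokens=96,
    temperature=0.7,
)
print(result["choices"][0]["text"])
```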
5. TinyLlama
TinyLlama leverages optimisations like FlashAttention and RoPE positional embeddings to enhance computational efficiency while maintaining strong performance. It is compatible with the Llama architecture and can be integrated into existing Llama-based mobile apps with minimal changes.
TinyLlama can be quantised to lower bit-widths like 4-bit or 5-bit precision, significantly reducing the model size to around 550-637 MB. One user reported that TinyLlama generated 6-7 tokens per second on an Asus ROG phone.
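Throughput claims like that are straightforward to check yourself. A minimal timing sketch, again using the llama-cpp-python bindings and a placeholder GGUF file name:

```python
import time
from llama_cpp import Llama

# Hypothetical path to a 4-bit quantised TinyLlama GGUF file.
llm = Llama(model_path="tinyllama-1.1b-chat-q4_k_m.gguf", n_ctx=1024)

start = time.time()
out = llm("Explain what a token is in one sentence.", max_tokens=128)
elapsed = time.time() - start

completion_tokens = out["usage"]["completion_tokens"]
print(f"{completion_tokens / elapsed:.1f} tokens/sec")
```

Numbers measured this way depend heavily on the phone's CPU or GPU backend, thread count and context length, so treat the 6-7 tokens per second figure as indicative rather than universal.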
6. LLaMA-2-7B
The LLaMA-2-7B model has been quantised to 4-bit weights and 16-bit activations, making it suitable for on-device deployment on smartphones. This quantisation reduces the model size to around 3.6 GB, small enough to load and run on mobile devices with sufficient RAM.
Running LLaMA-2-7B on mobile requires a device with at least 6 GB of RAM. During inference, peak memory usage ranges from 316 MB to 4,785 MB on the Samsung Galaxy S23 Ultra. This suggests that while the model can run on devices with 6 GB+ RAM, more memory allows better performance and reduces the risk of out-of-memory errors.
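The 6 GB figure lines up with a simple memory budget: 4-bit weights for 7 billion parameters take about 3.5 GB, and the key-value cache and activations come on top of that, alongside the operating system and other apps. A rough estimate, using Llama-2-7B's published dimensions and assuming the full 4,096-token context:

```python
# Rough RAM budget for running 4-bit LLaMA-2-7B on a phone.
params = 7e9
weight_gb = params * 4 / 8 / 1e9                                 # 4-bit weights -> ~3.5 GB

# KV cache at fp16 for a 4096-token context (Llama-2-7B dims: 32 layers, 32 heads, head_dim 128).
layers, kv_heads, head_dim, ctx = 32, 32, 128, 4096
kv_cache_gb = 2 * layers * kv_heads * head_dim * ctx * 2 / 1e9   # keys + values, 2 bytes each

print(f"Weights:  ~{weight_gb:.1f} GB")
print(f"KV cache: ~{kv_cache_gb:.1f} GB at {ctx}-token context")
print(f"Total:    ~{weight_gb + kv_cache_gb:.1f} GB before activations and OS overhead")
```

Shorter contexts shrink the KV cache proportionally, which is why the measured peak usage on the Galaxy S23 Ultra can stay well under the full-context estimate.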
While it requires devices with sufficient RAM and may not match the speed of cloud-based models, it offers an attractive option for developers looking to create intelligent language-based features that run directly on smartphones.