NVIDIA Launches Hymba, Its New Hybrid Architecture for Small LLMs

NVIDIA has unveiled Hymba-1.5B-Base, a small-scale language model that integrates transformer attention mechanisms with state space models (SSMs). The hybrid architecture aims to combine the recall precision of attention with the computational efficiency of SSMs.

Pavlo Molchanov, scientist and research manager at NVIDIA, took to X to announce the development.

Sharing our team’s latest work on Hymba – an efficient small language model with hybrid architecture.
Tech report: https://t.co/koThe3a6Ja
Discover the tradeoff between Mamba and Attention, how they can be combined, how attention sink and forced-to-attend phenomena can be…

— Pavlo Molchanov (@PavloMolchanov) November 22, 2024

The model uses a hybrid-head structure in which attention heads, responsible for precise recall, run in parallel with SSM heads, responsible for efficient context summarisation.

It adds learnable meta-tokens at the start of input sequences to store key information and reduce unnecessary attention demands. To improve memory and computation efficiency, it includes cross-layer key-value sharing and partial sliding window attention.
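To make the idea concrete, here is a minimal, illustrative PyTorch sketch of a hybrid-head block: an attention branch and a toy SSM branch process the same sequence in parallel, with learnable meta-tokens prepended to the input. All names, dimensions, and the simplified linear recurrence below are assumptions for illustration, not NVIDIA's implementation.

```python
# Illustrative sketch only -- NOT NVIDIA's Hymba code. A "hybrid-head" block:
# attention and SSM branches run in parallel on the same tokens and are fused.
import torch
import torch.nn as nn

class ToySSMHead(nn.Module):
    """A toy linear recurrence standing in for a Mamba-style SSM head."""
    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.decay = nn.Parameter(torch.full((dim,), 0.9))  # learnable per-channel decay

    def forward(self, x):                        # x: (batch, seq, dim)
        u = self.in_proj(x)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):               # sequential scan, O(1) state per step
            h = self.decay * h + u[:, t]         # running summary of the context
            outs.append(h)
        return torch.stack(outs, dim=1)

class HybridHeadBlock(nn.Module):
    def __init__(self, dim, n_heads, n_meta=4):
        super().__init__()
        self.meta = nn.Parameter(torch.randn(1, n_meta, dim) * 0.02)  # learnable meta-tokens
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssm = ToySSMHead(dim)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        b = x.size(0)
        x = torch.cat([self.meta.expand(b, -1, -1), x], dim=1)  # prepend meta-tokens
        a, _ = self.attn(x, x, x)                # precise-recall branch
        s = self.ssm(x)                          # context-summarisation branch
        return self.norm_a(a) + self.norm_s(s)   # fuse the two branches

block = HybridHeadBlock(dim=64, n_heads=4)
y = block(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 20, 64]) -- 4 meta-tokens + 16 input tokens
```

The fusion here is a plain normalised sum, chosen for simplicity; how the actual model mixes attention and SSM head outputs is detailed in the tech report.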

The research paper “Hymba: A Hybrid-head Architecture for Small Language Models” explores the model’s design, performance, and applications in detail.

Hymba Outperforms Llama-3.2

In a controlled study comparing various architectures under identical settings, Hymba-1.5B-Base demonstrated significant advantages.

It outperformed all publicly available models under 2 billion parameters and surpassed Llama-3.2-3B with a 1.32% higher average accuracy, an 11.67-fold reduction in cache size, and a 3.49-fold increase in throughput.

Philipp Schmid, technical lead at Hugging Face, commented on the development: “Hymba outperforms other small LLMs like Meta 3.2 or SmolLM v2 being trained on only 1.5T Tokens.”

Molchanov commented, “I’m not sure if we should be proud of 1.5T training. The reason is we want to move quickly; in the next two weeks, somebody will be even better.”

NVIDIA has also provided a setup script to facilitate environment configuration, supporting CUDA versions 12.1 and 12.4.
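For readers who want to try the model, a hedged loading sketch follows. The Hugging Face repo id nvidia/Hymba-1.5B-Base and the trust_remote_code requirement reflect the public model card at the time of writing; the dtype and device choices are assumptions, not requirements.

```python
# Hedged sketch of loading the model via Hugging Face transformers; verify the
# repo id and requirements against the current model card before use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "nvidia/Hymba-1.5B-Base"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,   # assumption: bf16 on a CUDA 12.x GPU
    trust_remote_code=True,       # the custom hybrid architecture ships its own code
).cuda()
```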

But, Caution!

NVIDIA acknowledges that the model was trained on internet data, which includes toxic language, unsafe content, and societal biases. As a result, the model may reflect these biases, generate toxic responses to toxic prompts, or produce inaccurate or irrelevant text even with neutral prompts.

Users should set the batch size to one during generation, as the current setup doesn’t fully support padding meta tokens with sliding window attention. However, any batch size works for training and pre-filling.
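An illustrative generation call honouring that constraint, continuing from the loading sketch above, passes exactly one prompt per forward pass; the prompt and decoding parameters are placeholder assumptions.

```python
# One prompt at a time: padded meta-tokens are not yet supported with
# sliding window attention, so generation uses a batch of exactly 1.
prompt = "The hybrid-head architecture combines"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```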

The company highlights the importance of shared responsibility in creating trustworthy AI and has set ethical guidelines for its development. Users are advised to use the model responsibly while remaining mindful of its limitations.
