Tencent Introduces VoCo-LLaMA for Compressing Visual Information with LLMs

Tencent has introduced VoCo-LLaMA, a new approach that can efficiently compress hundreds of vision tokens into just a single token while minimising loss of visual information. The method introduces special “Vision Compression” (VoCo) tokens between visual and text tokens in large language models, allowing the model itself to compress and distil the vision tokens.


VoCo-LLaMA achieves a compression ratio of 576x while retaining 83.7% of performance on common visual understanding benchmarks such as GQA, MMBench, and VQAv2. The compressed tokens also enable major efficiency gains: up to a 99.8% reduction in cache storage, 94.8% fewer FLOPs, and a 69.6% reduction in inference time.
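As a rough sanity check on these figures, the short sketch below relates the compression ratio to the cache saving. It assumes that 576 vision tokens are replaced by a single VoCo token and that cache storage for the visual input scales linearly with the number of cached tokens; the function name `compression_stats` is hypothetical and used only for illustration.

```python
def compression_stats(num_vision_tokens: int, num_voco_tokens: int = 1):
    """Return (compression ratio, fractional reduction in cached visual tokens).

    Assumption: cache size for the visual input is proportional to the
    number of tokens kept, so only the token counts matter here.
    """
    ratio = num_vision_tokens / num_voco_tokens
    cache_reduction = 1 - num_voco_tokens / num_vision_tokens
    return ratio, cache_reduction


# Assumed setting: 576 vision tokens compressed into one VoCo token.
ratio, cache_reduction = compression_stats(576, 1)
print(f"compression ratio: {ratio:.0f}x")
print(f"reduction in cached visual tokens: {cache_reduction:.1%}")
```

Under these assumptions the reduction works out to 1 - 1/576, which rounds to 99.8% and is consistent with the reported cache-storage figure. The FLOPs and inference-time numbers depend on the full model and hardware and cannot be derived this simply.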

By leveraging attention distillation, VoCo-LLaMA distils how large language models understand uncompressed vision tokens into their processing of the compact VoCo tokens. This facilitates effective compression without specialised cross-modal modules.

The approach can be easily implemented by modifying the attention mask during standard visual instruction tuning, without additional training phases. On video benchmarks like MSVD-QA and MSRVTT-QA, VoCo-LLaMA outperforms previous compression methods by capturing temporal correlations among compressed video frame tokens.
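The attention-mask idea can be illustrated with a minimal sketch. It assumes a sequence laid out as vision tokens, then VoCo tokens, then text tokens, with ordinary causal attention plus one extra rule: text tokens may not attend to the raw vision tokens, so visual information can only reach them through the VoCo tokens. The function name `build_voco_attention_mask` and the token counts are hypothetical, and this is not the authors' implementation.

```python
def build_voco_attention_mask(num_vision: int, num_voco: int, num_text: int):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout: [vision tokens | VoCo tokens | text tokens].
    """
    total = num_vision + num_voco + num_text
    text_start = num_vision + num_voco
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            allowed = k <= q  # standard causal attention
            # Extra rule (assumption of this sketch): text queries are
            # blocked from looking directly at raw vision tokens.
            if q >= text_start and k < num_vision:
                allowed = False
            row.append(allowed)
        mask.append(row)
    return mask


# Toy example: 4 vision tokens, 1 VoCo token, 3 text tokens.
mask = build_voco_attention_mask(4, 1, 3)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In the printed grid, the VoCo row attends to all vision tokens, while each text row has its vision columns masked out and sees only the VoCo token and earlier text. In a real model, a mask like this would be converted to the format expected by the attention implementation in use.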

While promising, VoCo-LLaMA has limitations: it diminishes the model’s ability to understand uncompressed tokens and struggles with diverse fine-grained compression levels. Even so, it offers a path to overcoming the context window bottleneck in vision-language models, enabling more scalable multi-modal applications.

Last year, Apple published a paper titled “LLM in a Flash: Efficient Large Language Model Inference with Limited Memory,” which outlines how to run large language models on devices with limited DRAM. It appears that Apple is optimising large language models for edge use cases.

The post Tencent Introduces VoCo-LLaMA for Compressing Visual Information with LLMs appeared first on AIM.
