LlamaGen Beats Diffusion Models for Scalable Image Generation


The University of Hong Kong and ByteDance have unveiled LlamaGen, a new family of autoregressive models that outperform popular diffusion models like LDM and DiT for high-resolution image generation.

The key breakthrough is that LlamaGen applies the same “next-token prediction” paradigm used in large language models to the visual domain without relying on inductive biases tailored for vision.
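As a rough illustration of what next-token prediction over image tokens looks like, the sketch below decodes a short token sequence one step at a time and reshapes it into a grid in raster order. The function `toy_next_token_logits` is a hypothetical stand-in for a trained transformer, and the sizes are toy values chosen for readability, not LlamaGen's actual configuration.

```python
import random

def toy_next_token_logits(prefix, vocab_size):
    """Hypothetical stand-in for a trained model: one score per codebook entry."""
    rng = random.Random(len(prefix))  # deterministic toy scores, not a real network
    return [rng.random() for _ in range(vocab_size)]

def generate_image_tokens(class_token, num_tokens, vocab_size):
    """Greedy autoregressive decoding: each token depends on all tokens before it."""
    sequence = [class_token]  # the conditioning token starts the sequence
    for _ in range(num_tokens):
        logits = toy_next_token_logits(sequence, vocab_size)
        next_token = max(range(vocab_size), key=lambda i: logits[i])
        sequence.append(next_token)
    return sequence[1:]  # drop the conditioning token, keep the image tokens

tokens = generate_image_tokens(class_token=0, num_tokens=16, vocab_size=8)
# Image tokens are laid out in raster order, so 16 tokens form a 4x4 grid.
grid = [tokens[row * 4:(row + 1) * 4] for row in range(4)]
```

In a real system the token grid would then be passed to the tokeniser's decoder to produce pixels; that step is omitted here.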

The LlamaGen models range from 111 million to 3.1 billion parameters and achieve an FID score of 2.18 (lower is better) on the challenging ImageNet 256×256 benchmark, surpassing state-of-the-art diffusion models. For class-conditional image generation, LlamaGen-3B achieves 2.32 FID with classifier-free guidance at a scale of 1.75.
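For readers unfamiliar with the guidance scale, classifier-free guidance is commonly written as a blend of conditional and unconditional predictions. The sketch below uses that common formulation on toy logits; the article does not state the exact formula the authors use, so treat the function `apply_cfg` and its inputs as illustrative assumptions.

```python
def apply_cfg(cond_logits, uncond_logits, scale):
    """Common classifier-free guidance form: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

# Toy logits over three codebook entries (made-up numbers).
cond = [2.0, 0.5, -1.0]    # prediction given the class label
uncond = [1.0, 1.0, 0.0]   # prediction with the label dropped

guided = apply_cfg(cond, uncond, scale=1.75)
```

Under this formulation a scale of 1.0 reproduces the conditional prediction, and larger values push the output further in the direction the condition favours.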

Read the full paper here.

Training Method

Notably, the researchers developed an image tokeniser with a downsampling ratio of 16 that achieves 0.94 reconstruction FID and 97% codebook usage on ImageNet. This discrete representation matches the quality of continuous VAE representations used in diffusion models.
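The two tokeniser figures can be made concrete with a little arithmetic. The sketch below assumes the downsampling ratio applies to each spatial dimension and that codebook usage means the fraction of codebook entries selected at least once; both are the usual readings but are not spelled out in the article, and the token ids are made up.

```python
def tokens_per_image(height, width, downsample_ratio):
    """Number of discrete tokens when each side shrinks by the downsampling ratio."""
    return (height // downsample_ratio) * (width // downsample_ratio)

def codebook_usage(token_ids, codebook_size):
    """Fraction of codebook entries that appear at least once in the token ids."""
    return len(set(token_ids)) / codebook_size

# A 256x256 image with ratio 16 becomes a 16x16 grid of tokens.
n_tokens = tokens_per_image(256, 256, 16)

# Toy example: 5 distinct ids out of a codebook of 8 entries.
usage = codebook_usage([0, 1, 1, 3, 7, 7, 2], codebook_size=8)
```

With these assumptions, a 256×256 image is represented by 256 tokens, which is the sequence length the autoregressive model has to predict.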

For text-conditional generation, a 775M-parameter LlamaGen model was first trained on 50M image-text pairs from LAION-COCO, then fine-tuned on 10M high-quality images. It demonstrates competitive visual quality and text alignment on challenging prompts from datasets like PartiPrompts.

A key advantage of LlamaGen is its ability to leverage optimisation techniques developed for large language models. The researchers showed a 326-414% speedup using the vLLM serving framework compared to baseline settings.

While LlamaGen is still behind the latest diffusion models on some metrics, the researchers believe it paves the way for unified autoregressive models spanning language and vision. With more training data and compute, they aim to scale LlamaGen above 7B parameters for further gains.

Up Next

OpenAI’s Sora was released earlier this year, and with Google recently releasing Veo, text-to-video AI models are now gaining prominence.

LlamaGen's results suggest that image generation can become faster and more accurate. The same techniques could also be applied to open-source video generation models, potentially putting them on par with models like Sora and Veo.

The post LlamaGen Beats Diffusion Models for Scalable Image Generation appeared first on AIM.
