Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

In a recent paper, researchers introduced LlamaGen, a new family of autoregressive models that outperforms popular diffusion models like LDM and DiT for high-resolution image generation. The key breakthrough is that LlamaGen applies the same “next-token prediction” paradigm used in large language models to the visual domain, without relying on inductive biases tailored for vision.
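To make the “next-token prediction” idea concrete, here is a minimal sketch of an autoregressive sampling loop over discrete image tokens. It is not LlamaGen's code: `toy_next_token_logits` is a hypothetical stand-in for a trained transformer, and the vocabulary size and sequence length are illustrative assumptions.

```python
import math
import random


def toy_next_token_logits(prefix, vocab_size):
    """Hypothetical stand-in for a trained model: one logit per token id."""
    # Toy rule (assumption): favour the token id that follows the last one.
    last = prefix[-1] if prefix else 0
    return [2.0 if t == (last + 1) % vocab_size else 0.0 for t in range(vocab_size)]


def sample_from_logits(logits, rng):
    """Sample a token id from softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(num_tokens, vocab_size, seed=0):
    """Generate tokens one at a time, each conditioned on all previous ones."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(num_tokens):
        logits = toy_next_token_logits(tokens, vocab_size)
        tokens.append(sample_from_logits(logits, rng))
    return tokens


if __name__ == "__main__":
    # 256 tokens would correspond to a 16x16 grid of image tokens.
    print(generate(num_tokens=256, vocab_size=8)[:10])
```

In a real system the generated token ids would be passed to the image tokeniser's decoder to reconstruct pixels; that step is omitted here.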

The LlamaGen models range from 111M to 3.1B parameters and achieve an FID of 2.18 (lower is better) on the challenging ImageNet 256×256 benchmark, surpassing popular diffusion models such as LDM and DiT. For class-conditional image generation, LlamaGen-3B reaches 2.32 FID with classifier-free guidance at a scale of 1.75.
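For readers unfamiliar with classifier-free guidance, the sketch below shows the commonly used formulation applied to next-token logits: the guided logits move away from the unconditional prediction and towards the conditional one by a factor equal to the guidance scale. This is an illustration under that assumption, not the authors' implementation; the function name `apply_cfg` and the example logits are hypothetical.

```python
def apply_cfg(cond_logits, uncond_logits, scale):
    """Combine conditional and unconditional logits with a guidance scale.

    Assumed formulation: guided = uncond + scale * (cond - uncond).
    scale = 1.0 recovers the conditional logits; larger values push
    the prediction further towards the condition.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


if __name__ == "__main__":
    cond = [2.0, 0.0]    # hypothetical logits given the class label
    uncond = [1.0, 1.0]  # hypothetical logits with the label dropped
    print(apply_cfg(cond, uncond, scale=1.75))
```

The trade-off is that each sampling step needs two forward passes, one with the condition and one without.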

Notably, the researchers developed an image tokeniser with a downsampling ratio of 16 that achieves 0.94 reconstruction FID and 97% codebook usage on ImageNet. This discrete representation matches the quality of continuous VAE representations used in diffusion models.
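The arithmetic behind these tokeniser figures can be sketched as follows. A downsampling ratio of 16 turns a 256×256 image into a 16×16 grid, so 256 tokens per image, and codebook usage is the fraction of codebook entries that appear at least once among the encoded tokens. The helper names and the tiny codebook in the example are illustrative assumptions, not values from the paper.

```python
def token_grid(image_size, downsample_ratio):
    """Return (grid side, total tokens) for a square image."""
    if image_size % downsample_ratio != 0:
        raise ValueError("image size must be divisible by the downsampling ratio")
    side = image_size // downsample_ratio
    return side, side * side


def codebook_usage(token_ids, codebook_size):
    """Fraction of codebook entries that appear at least once in token_ids."""
    used = {t for t in token_ids if 0 <= t < codebook_size}
    return len(used) / codebook_size


if __name__ == "__main__":
    print(token_grid(256, 16))  # grid side and token count for ImageNet 256x256
    # Hypothetical 4-entry codebook where entry 2 is never selected.
    print(codebook_usage([0, 1, 1, 3], codebook_size=4))
```

A usage figure close to 100% indicates that few codebook entries are left unused.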

For text-conditional generation, a 775M parameter LlamaGen model was first trained on 50M image-text pairs from LAION-COCO, then fine-tuned on 10M high-quality images. It demonstrates competitive visual quality and text alignment on challenging prompts from datasets like PartiPrompts.

A key advantage of LlamaGen is its ability to leverage optimisation techniques developed for large language models. The researchers reported a 326% to 414% inference speedup using the vLLM serving framework compared to baseline settings.

While still behind the latest diffusion models on some metrics, the researchers believe LlamaGen paves the way for unified autoregressive models spanning language and vision. With more training data and compute, they aim to scale LlamaGen above 7B parameters for further gains.

The post Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation appeared first on AIM.
