The University of Hong Kong and ByteDance have unveiled LlamaGen, a new family of autoregressive models that outperform popular diffusion models like LDM and DiT for high-resolution image generation.
The key breakthrough is that LlamaGen applies the same “next-token prediction” paradigm used in large language models to the visual domain without relying on inductive biases tailored for vision.
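To make the "next-token prediction" idea concrete, here is a minimal sketch of autoregressive sampling over discrete image tokens. It is not LlamaGen's code: `toy_next_logits` is a hypothetical stand-in for a trained Transformer, and the vocabulary size of 8 and sequence length of 16 are illustrative assumptions chosen to keep the example small.

```python
import math
import random
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate_tokens(
    next_logits: Callable[[List[int]], Sequence[float]],
    num_tokens: int,
    rng: random.Random,
) -> List[int]:
    """Sample tokens one at a time, each conditioned on all earlier ones."""
    tokens: List[int] = []
    for _ in range(num_tokens):
        probs = softmax(next_logits(tokens))
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(token)
    return tokens


def toy_next_logits(prefix: List[int]) -> List[float]:
    """Hypothetical model: slightly favours repeating the previous token."""
    vocab_size = 8  # assumed toy codebook size, not LlamaGen's
    logits = [0.0] * vocab_size
    if prefix:
        logits[prefix[-1]] = 2.0
    return logits


# A 4x4 grid of image tokens, generated in raster order.
tokens = generate_tokens(toy_next_logits, num_tokens=16, rng=random.Random(0))
```

In a real system the sampled token sequence would then be passed to the tokeniser's decoder to reconstruct pixels; that step is omitted here.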
The LlamaGen models range from 111 million to 3.1 billion parameters and achieve an FID of 2.18 (lower is better) on the challenging ImageNet 256×256 benchmark, surpassing state-of-the-art diffusion models. For class-conditional image generation, LlamaGen-3B reaches an FID of 2.32 with a classifier-free guidance scale of 1.75.
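For readers unfamiliar with classifier-free guidance, the sketch below shows one common way a guidance scale is applied to next-token logits: the model is run with and without the class condition, and the two outputs are blended. The function name `cfg_logits` and the example numbers are assumptions for illustration, and the exact formulation used in LlamaGen may differ in detail.

```python
from typing import List, Sequence


def cfg_logits(
    cond: Sequence[float],
    uncond: Sequence[float],
    scale: float,
) -> List[float]:
    """Blend conditional and unconditional logits.

    scale = 1.0 returns the conditional logits unchanged;
    scale > 1.0 pushes further in the direction of the condition.
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Illustrative logits over a three-token vocabulary (made-up values).
guided = cfg_logits(cond=[2.0, 0.0, -1.0], uncond=[1.0, 0.0, 0.0], scale=1.75)
```

A higher scale generally trades sample diversity for closer adherence to the condition, which is why reported FID scores are usually quoted together with the guidance scale.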
Training Method
Notably, the researchers developed an image tokeniser with a downsampling ratio of 16 that achieves 0.94 reconstruction FID and 97% codebook usage on ImageNet. This discrete representation matches the quality of continuous VAE representations used in diffusion models.
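As a rough illustration of what these tokeniser figures mean, the sketch below works out how many tokens a downsampling ratio of 16 produces for a 256×256 image and shows one simple way to measure codebook usage. The helper names and the toy codebook of four entries are assumptions, and the researchers may compute usage differently.

```python
from typing import Iterable, Tuple


def token_grid(image_size: int, downsample_ratio: int) -> Tuple[int, int]:
    """Return (tokens per side, total tokens) for a square image."""
    if image_size % downsample_ratio != 0:
        raise ValueError("image size must be divisible by the ratio")
    side = image_size // downsample_ratio
    return side, side * side


def codebook_usage(token_ids: Iterable[int], codebook_size: int) -> float:
    """Fraction of codebook entries that appear at least once."""
    return len(set(token_ids)) / codebook_size


side, total = token_grid(image_size=256, downsample_ratio=16)

# Toy example: three of four codebook entries are used.
usage = codebook_usage([0, 1, 1, 3], codebook_size=4)
```

With a ratio of 16, each 256×256 image becomes a 16×16 grid, so the autoregressive model only has to predict 256 tokens per image.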
For text-conditional generation, a 775M-parameter LlamaGen model was first trained on 50M image-text pairs from LAION-COCO, then fine-tuned on 10M high-quality images. It demonstrates competitive visual quality and text alignment on challenging prompts from datasets like PartiPrompts.
A key advantage of LlamaGen is its ability to leverage optimisation techniques developed for large language models. The researchers showed a 326-414% speedup using the vLLM serving framework compared to baseline settings.
While LlamaGen is still behind the latest diffusion models on some metrics, the researchers believe it paves the way for unified autoregressive models spanning language and vision. With more training data and compute, they aim to scale LlamaGen beyond 7B parameters for further gains.
Up Next
OpenAI unveiled Sora earlier this year and Google recently announced Veo, so text-to-video AI models are now gaining prominence.
If autoregressive techniques like these make image generation faster and more accurate, they could also be applied to open-source video generation models, potentially bringing them on par with models like Sora and Veo.
The post LlamaGen Beats Diffusion Models for Scalable Image Generation appeared first on AIM.