Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

Bibliographic Details
Title: Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
Authors: Dai, Xiaoliang; Hou, Ji; Ma, Chih-Yao; Tsai, Sam; Wang, Jialiang; Wang, Rui; Zhang, Peizhao; Vandenhende, Simon; Wang, Xiaofang; Dubey, Abhimanyu; Yu, Matthew; Kadian, Abhishek; Radenovic, Filip; Mahajan, Dhruv; Li, Kunpeng; Zhao, Yue; Petrovic, Vladan; Singh, Mitesh Kumar; Motwani, Simran; Wen, Yi; Song, Yiwen; Sumbaly, Roshan; Ramanathan, Vignesh; He, Zijian; Vajda, Peter; Parikh, Devi
Publication Year: 2023
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition
More Details: Training text-to-image models with web-scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images, creating the need for aesthetic alignment after pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve generation quality. We pre-train a latent diffusion model on 1.1 billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of 82.9% compared with its pre-trained-only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred 68.4% and 71.3% of the time on visual appeal on the standard PartiPrompts benchmark and on our Open User Input benchmark, which is based on real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.
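The central idea of quality-tuning is selecting a tiny, extremely high-quality fine-tuning set from a large candidate pool. As a minimal illustrative sketch only: the paper relies on automatic filtering followed by human curation, whereas the `aesthetic_score` field, the thresholds, and the function name below are hypothetical stand-ins, not the authors' actual pipeline.

```python
# Hypothetical sketch of the data-selection step behind quality-tuning:
# filter a large pool down to a few thousand top candidates.
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    caption: str
    aesthetic_score: float  # assumed 0-1 score from some aesthetic model (illustrative)
    resolution: int         # shorter image side, in pixels


def select_quality_tuning_set(pool, k=2000, min_score=0.9, min_res=1024):
    """Keep high-resolution, high-scoring images, then take the top k by score."""
    passing = [c for c in pool
               if c.aesthetic_score >= min_score and c.resolution >= min_res]
    passing.sort(key=lambda c: c.aesthetic_score, reverse=True)
    return passing[:k]
```

In the paper the surviving images would then be used for standard supervised fine-tuning of the pre-trained latent diffusion model; the point is that a set this small can still shift generation quality substantially.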
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2309.15807
Accession Number: edsarx.2309.15807
Database: arXiv