Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling

Wenze Liu, Le Zhuo, Yi Xin, Sheng Xia, Peng Gao, Xiangyu Yue
MMLab, CUHK & Shanghai AI Lab & Nanjing University



SAR provides a smooth pathway from AR to MAR, where the transition states enjoy the merits of both.

[Paper]      [BibTeX]

Abstract

We introduce a new paradigm for AutoRegressive (AR) image generation, termed Set AutoRegressive Modeling (SAR). SAR generalizes conventional AR to a next-set setting, i.e., splitting the sequence into arbitrary sets containing multiple tokens, rather than outputting each token in a fixed raster order. To accommodate SAR, we develop a straightforward architecture termed Fully Masked Transformer. We reveal that existing AR variants correspond to specific design choices of sequence order and output intervals within the SAR framework, with AR and Masked AR (MAR) as two extreme instances. Notably, SAR facilitates a seamless transition from AR to MAR, where the intermediate states allow training a causal model that enjoys the advantages of both AR and MAR, such as few-step inference, KV cache acceleration, and image editing. On the ImageNet benchmark, we carefully explore the properties of SAR by analyzing the impact of sequence order and output intervals on performance, as well as the generalization ability regarding inference order and steps. We further validate the potential of SAR by training a 900M text-to-image model capable of synthesizing photo-realistic images at any resolution. We hope our work may inspire more exploration and application of AR-based modeling across diverse modalities.

Introduction

Originating from language processing, AR and MAR (BERT-like) are two representative methods for generating images in discrete latent space. Recent work has shown that the generative capabilities of AR models can rival or even surpass those of diffusion models. However, AR models have long inference times, and their fixed raster order makes them intrinsically ill-suited to image editing tasks. MAR reduces the number of inference steps at the cost of requiring global (bidirectional) computation, and its random order facilitates editing tasks. SAR unifies AR and MAR by generalizing the sequence order and the output intervals. Further, SAR offers a path that gradually travels from AR to MAR; in the transition states, one can train models that possess the good properties of both AR and MAR, including few-step inference, KV cache acceleration, and image editing.
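To make the two extremes concrete, here is a small Python sketch (our own illustration, not code from the paper) of how the number of output steps determines the set sizes: one token per step recovers AR, a single step recovers the MAR-style parallel extreme, and anything in between is a SAR transition state. The cosine schedule is just one example choice of output intervals.

import math

def set_sizes(num_tokens: int, num_steps: int) -> list[int]:
    # Split num_tokens tokens into num_steps output sets.
    # num_steps == num_tokens -> one token per step (the AR extreme)
    # num_steps == 1          -> all tokens in one step (the MAR-style extreme)
    # anything in between     -> a SAR transition state
    sizes, done = [], 0
    for step in range(1, num_steps + 1):
        # cosine schedule: small sets early, larger sets later (example choice)
        target = round(num_tokens * (1 - math.cos(math.pi / 2 * step / num_steps)))
        sizes.append(max(1, target - done))
        done += sizes[-1]
    sizes[-1] += num_tokens - sum(sizes)  # absorb any rounding drift in the last set
    return sizes

print(set_sizes(256, 256)[:5])  # [1, 1, 1, 1, 1] -- AR-like, one token per step
print(set_sizes(256, 8))        # a transition state with 8 output steps
print(set_sizes(256, 1))        # [256]           -- MAR-style single parallel step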



Approach

To train a model under the SAR framework, one first specifies the two hyper-parameters: the sequence order and the output intervals. Based on the order setting, we rearrange the sequence into its causal version and use this rearranged causal sequence as the target. Next, based on the output intervals, we drop the last set of the rearranged sequence and prepend a class token. The resulting sequence is fed into the proposed Fully Masked Transformer, and the model is trained with the common cross-entropy loss.
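The following PyTorch sketch outlines one training step in this recipe. The model call, its set_sizes argument, and the assumption that it returns one logit vector per target position are hypothetical interface choices made for illustration; only the data flow (rearrange into the causal order, drop the last set, prepend the condition token, apply cross-entropy on the causal targets) follows the description above.

import torch
import torch.nn.functional as F

def sar_training_step(model, tokens, cond_token, order, set_sizes):
    # tokens:     (B, L) discrete image tokens from the tokenizer
    # cond_token: (B, 1) condition token (e.g. the class token)
    # order:      (L,)   permutation defining the causal sequence order
    # set_sizes:  output intervals; they must sum to L
    causal = tokens[:, order]                              # rearranged causal target
    inputs = torch.cat([cond_token, causal[:, :-set_sizes[-1]]], dim=1)
    logits = model(inputs, set_sizes=set_sizes)            # assumed shape (B, L, vocab)
    return F.cross_entropy(logits.flatten(0, 1), causal.flatten())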





Few-step Generation

We train a T2I model, Lumina-SAR, in the transition states of SAR. Its first feature is few-step inference. Lumina-SAR begins to produce meaningful images at around 4 to 8 steps. With 64 steps, it delivers high-quality outputs in under 3 s on a single A100 GPU. For comparison, the full 4096 steps (AR) take more than 60 times longer than 64 steps.
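As a rough illustration of why set-by-set decoding is cheap, the sketch below samples one set per step while reusing a KV cache, so each forward pass only processes the newly generated set. The model(prev, num_next=..., kv_cache=...) interface is an assumption made for illustration and does not correspond to the released code.

import torch

@torch.no_grad()
def sar_sample(model, cond_token, order, set_sizes, temperature=1.0):
    # cond_token: (B, 1) condition token; set_sizes sum to the sequence length L
    B, L = cond_token.shape[0], sum(set_sizes)
    generated = torch.full((B, L), -1, dtype=torch.long, device=cond_token.device)
    prev, cache, done = cond_token, None, 0
    for size in set_sizes:
        # feed only the newly generated set; earlier keys/values come from the cache
        logits, cache = model(prev, num_next=size, kv_cache=cache)  # (B, size, vocab)
        probs = (logits / temperature).softmax(dim=-1)
        next_set = torch.multinomial(probs.flatten(0, 1), 1).view(B, size)
        generated[:, done:done + size] = next_set
        prev, done = next_set, done + size
    # map the tokens back from the causal order to raster order
    raster = torch.empty_like(generated)
    raster[:, order] = generated
    return raster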

Zero-shot Painting

Another advantage of SAR is the flexibility in inference order, which facilitates image editing tasks such as image inpainting and outpainting. We perform zero-shot painting with Lumina-SAR, where the mask can be any shape.
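A minimal sketch of how the flexible order can be exploited for zero-shot painting, under our own assumptions: place the known (unmasked) token positions first in the causal order so they serve as teacher-forced context, then sample only the trailing sets that cover the masked region, whatever its shape.

import torch

def painting_order(mask: torch.Tensor) -> torch.Tensor:
    # mask: (L,) boolean, True at token positions to be (re)generated
    known = (~mask).nonzero(as_tuple=True)[0]
    unknown = mask.nonzero(as_tuple=True)[0]
    # known tokens come first in the causal order (acting as context);
    # only the trailing sets, covering the masked positions, need to be sampled
    return torch.cat([known, unknown])

With this order, painting reduces to ordinary SAR sampling with the known tokens supplied as the prefix.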



More Visualizations

Class-to-image on ImageNet



Text-to-image



BibTeX

@article{liu2024customize,
  title={Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling},
  author={Liu, Wenze and Zhuo, Le and Xin, Yi and Xia, Sheng and Gao, Peng and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2410.10511},
  year={2024}
}

Acknowledgements: We thank DreamBooth for the page templates.