Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders (Scale-RAE)
Official Implementation
Paper | Project Page | Models | Data
This repository provides GPU inference and TPU training implementations for our paper: Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders.
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
Shengbang Tong*, Boyang Zheng*, Ziteng Wang*, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie
New York University
*Core contributor
```bash
git clone https://github.com/ZitengWangNYU/Scale-RAE.git
cd Scale-RAE
conda create -n scale_rae python=3.10 -y
conda activate scale_rae
pip install -e .
```

```bash
cd inference
python cli.py t2i --prompt "Can you generate a photo of a cat on a windowsill?"
```

Models and decoders are downloaded automatically from HuggingFace.
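To run several prompts through the same CLI in one go, here is a minimal sketch that simply loops over the documented `t2i --prompt` interface via `subprocess`; the script name and prompt list are illustrative, and it assumes you run it from the `inference/` directory next to `cli.py`:

```python
# batch_t2i.py -- illustrative sketch: loop the documented `t2i` CLI over several prompts.
# Assumes it is executed from the inference/ directory alongside cli.py.
import subprocess

prompts = [
    "Can you generate a photo of a cat on a windowsill?",
    "Can you generate a photo of a lighthouse at sunset?",
]

for prompt in prompts:
    # After the first call, model and decoder weights are served from the local cache.
    subprocess.run(["python", "cli.py", "t2i", "--prompt", prompt], check=True)
```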
| Guide | Description |
|---|---|
| Inference Guide | Generate images with pre-trained models |
| Training Guide | Train your own Scale-RAE models |
| TPU Setup Guide | Set up TPUs for large-scale training |
All models are available in our HuggingFace collection:
| Model | LLM | DiT | Decoder | HuggingFace Repo |
|---|---|---|---|---|
| Scale-RAE | Qwen2.5-1.5B | 2.4B | SigLIP-2 | nyu-visionx/Scale-RAE-Qwen1.5B_DiT2.4B ⭐ |
| Scale-RAE | Qwen2.5-7B | 9.8B | SigLIP-2 | nyu-visionx/Scale-RAE-Qwen7B_DiT9.8B |
| Scale-RAE-WebSSL | Qwen2.5-1.5B | 2.4B | WebSSL | nyu-visionx/Scale-RAE-Qwen1.5B_DiT2.4B-WebSSL |
⭐ = Recommended default model
Decoders:
- nyu-visionx/siglip2_decoder (SigLIP-2-SO400M, default)
- nyu-visionx/webssl300m_decoder (WebSSL-DINO300M)
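The inference CLI fetches these weights automatically, but you can also prefetch them into the local HuggingFace cache (for example, on a machine that should not download at run time). Below is a minimal sketch using `huggingface_hub.snapshot_download`; the repo IDs come from the tables above, and everything else is illustrative rather than part of the Scale-RAE API:

```python
# prefetch_weights.py -- illustrative sketch: pre-download a Scale-RAE checkpoint and decoder
# into the local HuggingFace cache so later inference runs skip the download step.
from huggingface_hub import snapshot_download

MODEL_REPO = "nyu-visionx/Scale-RAE-Qwen1.5B_DiT2.4B"  # recommended default model
DECODER_REPO = "nyu-visionx/siglip2_decoder"           # SigLIP-2-SO400M decoder (default)

for repo_id in (MODEL_REPO, DECODER_REPO):
    local_dir = snapshot_download(repo_id=repo_id)
    print(f"{repo_id} cached at {local_dir}")
```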
Scale-RAE follows a two-stage training approach:
- Stage 1: Large-scale pretraining with a pretrained LLM and randomly initialized DiTs
- Stage 2: Finetuning on high-quality instruction datasets
```bash
# Stage 1: Pretraining with SigLIP-2
bash scripts/examples/stage1_rae_siglip_1.5b_dit2.4b.sh

# Stage 2: Instruction finetuning
bash scripts/examples/stage2_rae_siglip_1.5b_dit2.4b.sh
```

See the Training Guide for data preparation, hyperparameters, and example scripts. See the TPU Setup Guide for TPU configuration.
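Before launching these scripts on a TPU VM, it can be worth confirming that the XLA runtime is actually reachable. Here is a minimal sanity check, assuming `torch_xla` has been installed (e.g. via `install_spmd.sh`); this check script is not part of the repo:

```python
# check_xla.py -- minimal sanity check: confirm a TPU/XLA device is visible before training.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # raises if no XLA device is available
x = torch.ones(2, 2, device=device)
print("XLA device:", device, "| sum:", (x + x).sum().item())  # forces execution on the device
```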
Scale-RAE/
├── inference/ # Inference CLI and scaling experiments
├── scale_rae/ # Core model implementation
│ ├── model/ # Model architectures (LLM, DiT, encoders)
│ └── train/ # Training scripts (SPMD/FSDP)
├── scripts/examples/ # Example training scripts
├── docs/
│ ├── Inference.md # Inference guide (CLI, scaling)
│ ├── Train.md # Training guide (data, hyperparams)
│ └── TPUs_Torch_XLA.md # TPU setup guide
├── setup_gcs_mount.sh # GCS mount for WebDataset
├── install_spmd.sh # TPU/TorchXLA installation
└── clear.py # TPU memory clearing utility
If you find this work useful, please cite:
```bibtex
@article{scale-rae-2026,
  title={Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders},
  author={Shengbang Tong and Boyang Zheng and Ziteng Wang and Bingda Tang and Nanye Ma and Ellis Brown and Jihan Yang and Rob Fergus and Yann LeCun and Saining Xie},
  journal={arXiv preprint arXiv:2601.16208},
  year={2026}
}
```

This project is released under the MIT License.
This work builds upon:
- RAE - Diffusion Transformers with Representation Autoencoders
- Cambrian-1 - Multimodal LLM framework
- WebSSL - Self-supervised vision models
- SigLIP-2 - Self-supervised & language-supervised vision models