Diffusion Transformer Explained
Exploring the architecture that brought transformers into image generation
By Mario Namtao Shianti Larcher · Feb 2024 · Towards Data Science

[Image generated with DALL·E.]

After shaking up NLP and moving into computer vision with the Vision Transformer (ViT) and its successors, transformers are now entering the field of image generation. They are gradually becoming an alternative to the U-Net, the convolutional architecture upon which all the early diffusion models were built. This article looks into the Diffusion Transformer (DiT), introduced by William Peebles and Saining Xie in their paper Scalable Diffusion Models with Transformers.
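To make the shift from U-Net to transformer concrete, here is a heavily simplified, illustrative PyTorch sketch of a DiT-style block: the noised image latents are processed as a sequence of patch tokens, and the timestep conditioning enters through adaptive LayerNorm (adaLN). This is an approximation for intuition, not the authors' code; the paper's actual mechanism is adaLN-Zero, which additionally learns zero-initialized per-branch gates, and every dimension and name below is an arbitrary choice for readability.

```python
import torch
import torch.nn as nn

class DiTBlockSketch(nn.Module):
    """Simplified DiT-style transformer block (illustrative, not the paper's code)."""

    def __init__(self, dim: int = 384, n_heads: int = 6):
        super().__init__()
        # LayerNorms without learnable affine parameters: the scale/shift
        # come from the conditioning vector instead (adaptive LayerNorm).
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Maps the timestep (and optionally class) embedding to per-block
        # shift and scale parameters for both normalization layers.
        self.adaLN = nn.Linear(dim, 4 * dim)

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_patches, dim); cond: (batch, dim)
        shift1, scale1, shift2, scale2 = self.adaLN(cond).chunk(4, dim=-1)
        h = self.norm1(tokens) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(tokens) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return tokens + self.mlp(h)

# A noised latent split into 256 patch tokens, plus a timestep embedding.
x = torch.randn(2, 256, 384)
t_emb = torch.randn(2, 384)
print(DiTBlockSketch()(x, t_emb).shape)  # torch.Size([2, 256, 384])
```

Notice that, unlike a U-Net, nothing here is convolutional or resolution-specific: the block just transforms a sequence of tokens, which is what makes the architecture easy to scale.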

DiT has influenced the development of other transformer-based diffusion models like PIXART-α, Sora (OpenAI's astonishing text-to-video model), and, as I write this article, Stable Diffusion 3. Let's start exploring this emerging class of architectures that are contributing to the evolution of diffusion models.

Given that this is an advanced topic, I'll have to assume a certain familiarity with recurring concepts in AI and, in particular, in image generation. If you're already familiar with this field, this section will help refresh these concepts, providing you with further references for a deeper understanding.

If you want an extensive overview of this world before reading this article, I recommend reading my previous article below, where I cover many diffusion models and related techniques, some of which we'll revisit here.

At an intuitive level, diffusion models function by first taking images, introducing noise (usually Gaussian), and then training a neural network to reverse this noise-adding process. Once trained, the network can start from pure noise and progressively denoise it into a brand-new image.
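In code, this intuition is only a few lines. Below is a minimal sketch assuming the standard DDPM-style formulation: a linear noise schedule, a forward step that mixes the image with Gaussian noise, and a training loss where the network learns to predict that noise. The `model` here is a placeholder for any denoising backbone, a U-Net or a DiT alike, and all hyperparameters are illustrative.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def add_noise(x0: torch.Tensor, t: torch.Tensor):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, eps

def training_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Sample a random timestep, noise the image, and train the network
    to predict the noise that was added."""
    t = torch.randint(0, T, (x0.shape[0],))
    xt, eps = add_noise(x0, t)
    return torch.nn.functional.mse_loss(model(xt, t), eps)
```

Generation then runs this logic backwards: starting from pure Gaussian noise, the trained network's noise predictions are used to denoise the sample step by step until an image emerges.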

