Introduction
Diffusion models, the class of AI models that includes Stable Diffusion and DALL-E 2 among others, rely on the simple yet powerful concept of diffusion. But what, exactly, is diffusion? And why is it so important that it became the core of most of the new, fancy generative models? These are the two questions I will try to answer in this post.
The Theory of Diffusion
Definition
The word diffusion is defined as a “process […] by which there is a net flow of matter from a region of high concentration to a region of low concentration” ( Citation: Encyclopedia Britannica, Encyclopedia Britannica (n.d.). Retrieved from https://www.britannica.com/science/diffusion ) or, more generally, “the action of spreading in many directions” ( Citation: Cambridge-Dictionary, Cambridge-Dictionary (n.d.). Diffusion. Retrieved from https://dictionary.cambridge.org/dictionary/english/diffusion ) . The concept of diffusion is essential in thermodynamics and has therefore been studied extensively, so it should be familiar to anyone with an adequate background in chemistry or physics.
Relation to Distributions
Generally speaking, any value could be portrayed as a distribution, or a Probability Density Function (PDF). In the case of a random variable $x$ that could take a range of values, the PDF or probability distribution for the values of $x$ could look something like this:
Based on the graph, you can see that $x$ most probably takes a value in the range between 1 and 3, which is described by the central tendency of the distribution ( Citation: Australian Bureau of Statistics, Australian Bureau of Statistics (n.d.). Retrieved from https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts/measures-central-tendency ) . We know that the probability of a value range in a PDF is the integral over that value range, i.e. the area under the curve. For a range around the central tendency, this area gets larger by (1) widening the range or (2) decreasing the variance.
Narrowing the range means being more specific in estimating the value of the variable, and the probability of a narrow range can only be high when the information about the variable is available and precise. A high variance counteracts that, because variance, as a measure of spread, reflects how confident we can be that the variable falls within a certain value range. The higher the spread, the less confident we are in the central tendency of the distribution, and therefore the harder it is to pick a value range representative of our variable ( Citation: Australian Bureau of Statistics, Australian Bureau of Statistics (n.d.). Retrieved from https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts/measures-spread ) .
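The relationship between range, variance and probability can be checked numerically. The following is a minimal sketch that assumes a normal distribution centred at 2 as a stand-in for the graph; the function name `normal_range_probability` and the concrete numbers are illustrative choices, not taken from the figure.

```python
import math


def normal_range_probability(low, high, mean, std):
    """Probability that a normal variable falls in [low, high].

    This is the integral of the normal PDF over the range, expressed
    through the error function instead of numerical integration.
    """
    def cdf(value):
        return 0.5 * (1.0 + math.erf((value - mean) / (std * math.sqrt(2.0))))

    return cdf(high) - cdf(low)


# Assumed example: mean 2, so that the range [1, 3] surrounds the centre.
base = normal_range_probability(1.0, 3.0, mean=2.0, std=1.0)
wider_range = normal_range_probability(0.0, 4.0, mean=2.0, std=1.0)
lower_variance = normal_range_probability(1.0, 3.0, mean=2.0, std=0.5)

print(f"range [1, 3], std 1.0: {base:.3f}")
print(f"range [0, 4], std 1.0: {wider_range:.3f}")
print(f"range [1, 3], std 0.5: {lower_variance:.3f}")
```

Both widening the range and halving the standard deviation raise the probability above the baseline, which is the trade-off between specificity and confidence described above.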
Now, diffusion works in the opposite direction to confidence, as shown in the next figure. If we let diffusion run its course, it would theoretically keep increasing the variance until we reach a uniform distribution of values, or a distribution with a mean around the point of equilibrium and a relatively high variance. This means that, ultimately, diffusion erases any unique information in the system and reaches a state of equilibrium whose origin cannot be recovered.
Different distributions of a diffused variable $x_t$, where $q$ is a function that adds noise relative to the number $t$. This simulates the diffusion process, in which $t = T$ is the equilibrium and equivalent to a normal distribution. Note that we start with data and end up with noise just by running diffusion. Taken from ( Citation: Vahdat, 2023 Vahdat, A.(2023). Retrieved from https://cvpr2023-tutorial-diffusion-models.github.io/ ) .
More examples of adding noise to a joint PDF (running diffusion).
The second plot is the same as the first; the only difference is the type of plot chosen for the middle panel. You can think of the first as modeling particles of gas, while the second models heat in a metal.
You can think of the two variables here as the 2D position of gas particles diffusing in a room. However, this model is much more general, and depicts losing information about a system through the process of diffusion.
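The gas-in-a-room picture can be simulated in a few lines. The sketch below is only an illustration under assumptions of my own: the room is the unit square, particles bounce off the walls, and each step adds a small Gaussian displacement. The names `reflect`, `diffuse` and `quadrant_fractions`, as well as the step size and particle count, are hypothetical.

```python
import random


def reflect(position):
    """Fold a coordinate back into [0, 1], as if bouncing off the walls."""
    position %= 2.0
    return 2.0 - position if position > 1.0 else position


def diffuse(particles, steps, step_std, rng):
    """Random walk: every step adds Gaussian noise to each coordinate."""
    for _ in range(steps):
        particles = [
            (reflect(x + rng.gauss(0.0, step_std)),
             reflect(y + rng.gauss(0.0, step_std)))
            for x, y in particles
        ]
    return particles


def quadrant_fractions(particles):
    """Share of particles found in each quadrant of the unit square."""
    counts = [0, 0, 0, 0]
    for x, y in particles:
        counts[(1 if x >= 0.5 else 0) + (2 if y >= 0.5 else 0)] += 1
    return [count / len(particles) for count in counts]


rng = random.Random(0)
# All particles start clustered in one corner: this is the "unique information".
start = [(rng.uniform(0.1, 0.3), rng.uniform(0.1, 0.3)) for _ in range(1000)]
end = diffuse(start, steps=500, step_std=0.05, rng=rng)

print("quadrant shares at t=0:", quadrant_fractions(start))
print("quadrant shares at t=T:", quadrant_fractions(end))
```

At the start every particle sits in a single quadrant; after enough steps the shares settle near one quarter each, and nothing in the final positions reveals which corner the gas came from.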
Relation to Images
There are many valid ways to think of images as signals or functions. This is basically what the field of signal processing deals with. Take the discrete cosine transform (DCT), which expresses data as a linear combination of cosine functions of different frequencies. One of the properties of the DCT of an image is what is called “energy compaction”: the concentration of most of the signal’s information in a few low-frequency components ( Citation: Ahmed, Natarajan & al., 1974 Ahmed, N., Natarajan, T. & Rao, K. (1974). Discrete Cosine Transform. IEEE Transactions on Computers, C-23(1). 90–93. https://doi.org/10.1109/T-C.1974.223784 ) . This is apparent in the fact that information redundancy in images typically increases with increasing resolution, and that increasing resolution yields diminishing returns when it comes to visual quality ( Citation: Chen, 2023 Chen, T. (2023). On the Importance of Noise Scheduling for Diffusion Models. ) .
Top left are the image’s pixel values, top right are the DCT coefficients sorted by frequency ascending. Notice the energy compaction in the top left of the DCT. ( Citation: Reducible, 2022 Reducible(2022). Retrieved from https://www.youtube.com/watch?v=0me3guauqOU&t=1548s )
This property doesn’t hold for noise, however. In the DCT of a noisy signal (e.g. an image of noise), there’s little if any energy compaction, and the same effect can be found in other signal transforms as well. In other words, applying the process of diffusion to images means gradually losing energy compaction, going from a recognizable image with spots of high concentration (an image of a dog, unique information) to the equilibrium of random images (noise, no information), with no way of telling what the original image was.
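Energy compaction, and its absence in noise, can be observed on a one-dimensional toy signal. The sketch below makes several assumptions: a smooth bump stands in for one row of a natural image, the DCT is the orthonormal DCT-II written out directly from its definition (slow but dependency-free), and "low frequency" is taken to mean the lowest eighth of the coefficients. The helper names are hypothetical.

```python
import math
import random


def dct(signal):
    """Orthonormal DCT-II, computed directly from its definition."""
    n = len(signal)
    coefficients = []
    for k in range(n):
        total = sum(
            value * math.cos(math.pi / n * (i + 0.5) * k)
            for i, value in enumerate(signal)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coefficients.append(scale * total)
    return coefficients


def low_frequency_energy_share(signal, keep_fraction=0.125):
    """Share of the signal's energy held by its lowest-frequency coefficients."""
    mean = sum(signal) / len(signal)
    centered = [value - mean for value in signal]  # ignore overall brightness
    energies = [c * c for c in dct(centered)]
    keep = max(1, int(len(signal) * keep_fraction))
    return sum(energies[:keep]) / sum(energies)


size = 64
rng = random.Random(0)
# Assumption: a smooth bump as a stand-in for one row of a natural image.
smooth_row = [
    math.exp(-((i - size / 2) ** 2) / (2 * (size / 6) ** 2)) for i in range(size)
]
noise_row = [rng.gauss(0.0, 1.0) for _ in range(size)]

print(f"smooth row: {low_frequency_energy_share(smooth_row):.3f}")
print(f"noise row:  {low_frequency_energy_share(noise_row):.3f}")
```

The smooth row keeps almost all of its energy in the lowest eighth of the coefficients, while the noise row spreads its energy roughly evenly across all frequencies.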
Conclusion and an example
Considering all that we have discussed so far, it should be clear that the analogue of diffusion in an image is adding random numbers to its pixel values, which in the end creates an image of complete noise. There is, of course, no telling what the original image behind the noise was.
Example of the diffusion process applied to an image. ( Citation: Vahdat, Gao & al., 2022 Vahdat, A., Gao, R. & Kreis, K.(2022). Retrieved from https://cvpr2022-tutorial-diffusion-models.github.io/ )
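The same idea can be tried on a toy image. The following sketch is not the exact procedure behind the figure: it assumes a step that slightly shrinks the pixel values before adding Gaussian noise, a variance-preserving formulation common in the diffusion-model literature, so that the end state resembles standard normal noise. The noise level `beta`, the 16x16 test image and the function names are illustrative assumptions.

```python
import math
import random


def noising_step(image, beta, rng):
    """One diffusion step: shrink the pixels slightly and add Gaussian noise."""
    keep = math.sqrt(1.0 - beta)
    noise_scale = math.sqrt(beta)
    return [
        [keep * pixel + noise_scale * rng.gauss(0.0, 1.0) for pixel in row]
        for row in image
    ]


def correlation(image_a, image_b):
    """Pearson correlation between the pixel values of two images."""
    a = [pixel for row in image_a for pixel in row]
    b = [pixel for row in image_b for pixel in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)
# Toy 16x16 "image": a bright square on a dark background, pixels in [-1, 1].
original = [
    [1.0 if 4 <= r < 12 and 4 <= c < 12 else -1.0 for c in range(16)]
    for r in range(16)
]

noisy = original
similarity = {0: correlation(original, noisy)}
for t in range(1, 201):
    noisy = noising_step(noisy, beta=0.05, rng=rng)
    if t in (10, 50, 200):
        similarity[t] = correlation(original, noisy)

for t, value in similarity.items():
    print(f"t={t:3d}: correlation with original = {value:+.3f}")
```

The correlation with the original image starts at 1 and drifts toward 0 as steps accumulate, which is the loss of information the figure illustrates. Because each individual step changes the image only a little, this is also the kind of controlled process whose steps a model could plausibly learn to undo.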
The key insight here is that, given that diffusion is the process of destroying information, by reversing diffusion we would be creating a process that generates information. The process of diffusion is generally irreversible, i.e., reversing it is intractable in many scenarios. Yet if we create a controlled diffusion process of our own in which we keep the diffusion steps small enough, we should be able to train a model to reverse this process and, as a byproduct, produce new data samples ( Citation: Sohl-Dickstein, Weiss & al., 2015 Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Retrieved from https://arxiv.org/abs/1503.03585 ) . The details of how that could be achieved are a huge topic of their own, though, and are better left for another post.
I hope you found this post helpful and see you next time.
Bibliography
- Australian Bureau of Statistics (n.d.). Retrieved from https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts/measures-central-tendency
- Australian Bureau of Statistics (n.d.). Retrieved from https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts/measures-spread
- Encyclopedia Britannica (n.d.). Retrieved from https://www.britannica.com/science/diffusion
- Cambridge-Dictionary (n.d.). Diffusion. Retrieved from https://dictionary.cambridge.org/dictionary/english/diffusion
- Chen, T. (2023). On the Importance of Noise Scheduling for Diffusion Models.
- Vahdat, A., Gao, R. & Kreis, K. (2022). Retrieved from https://cvpr2022-tutorial-diffusion-models.github.io/
- Vahdat, A. (2023). Retrieved from https://cvpr2023-tutorial-diffusion-models.github.io/
- Ahmed, N., Natarajan, T. & Rao, K. (1974). Discrete Cosine Transform. IEEE Transactions on Computers, C-23(1). 90–93. https://doi.org/10.1109/T-C.1974.223784
- Reducible (2022). Retrieved from https://www.youtube.com/watch?v=0me3guauqOU&t=1548s
- Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Retrieved from https://arxiv.org/abs/1503.03585