Variational Autoencoders (VAE)

Ashutosh Makone
Jul 10, 2021 · 5 min read

Autoencoders with variational inference

There are generative models and there are discriminative models. Discriminative models discriminate between different kinds of data instances, while generative models generate new instances of data. There are many types of generative models, and the Variational Autoencoder is certainly one of the most popular. Variational Autoencoders were introduced by Diederik P. Kingma and Max Welling in their research paper titled “Auto-Encoding Variational Bayes”, which can be found here.

How is it different from an Autoencoder?

So what is the difference between Autoencoders and Variational Autoencoders? An Autoencoder is mostly used for dimensionality reduction. As shown in Figure 1, the input (x) is a high-dimensional vector while the encoded representation (z) is a low-dimensional vector. In order to faithfully reconstruct the input at the output (x’), the encoder network has to learn the features of the input data. For example, if the input data consists of forest images, the encoder will learn features like the color and shape of the trees, the color and shape of their trunks, the color of the sky in the background, and so on. These features are preserved in the encoded representation, which has fewer dimensions, and the decoder network then reconstructs the original image from this representation. Autoencoders have many applications, but the basic philosophy remains the same. The narrow middle portion of the diagram is also called the bottleneck.
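As a rough illustration (not from the original paper), here is a minimal fully connected autoencoder sketched in PyTorch. The layer sizes and dimensions are assumptions chosen only to show the encoder–bottleneck–decoder structure.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully connected autoencoder: x -> z (bottleneck) -> x'."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder compresses the input into a low-dimensional code z.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder reconstructs x' from the code z.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid(),  # assumes inputs scaled to [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)     # bottleneck representation
        return self.decoder(z)  # reconstruction x'

# Training would minimize a reconstruction loss,
# e.g. nn.MSELoss()(model(x), x).
```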

Fig 1: Architecture of Autoencoders

Architecture of Variational Autoencoders

Variational Autoencoders are different in that, instead of learning a fixed encoding of the input features, they learn the probability distribution of those features. Thus, as shown in Figure 2, the distribution of each feature is learned in terms of its mean and standard deviation. This way we achieve a continuous and smooth representation, called the latent space representation.

Fig 2: Architecture of Variational Autoencoders

Each distribution in this latent space is then randomly sampled and the sample is passed to the decoder network. The decoder is now able to generate data (images of forests, in the example above) which is similar to, but not the same as, our training data. The network can thus be used to generate new, similar data, and this generation is evidently stochastic in nature. This is the intuition behind Variational Autoencoders as a generative model.
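To make the generative use concrete, here is a hedged sketch: assuming we already have a trained decoder (the one below is an untrained placeholder with arbitrary sizes), new data is produced simply by drawing z from the latent distribution and decoding it.

```python
import torch

# Placeholder decoder: maps a latent vector z back to data space
# (in practice this would be the decoder half of a trained VAE).
latent_dim = 32
decoder = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 784),
    torch.nn.Sigmoid(),
)

with torch.no_grad():
    z = torch.randn(16, latent_dim)  # sample 16 points from N(0, I)
    samples = decoder(z)             # decode into 16 new, similar data points
print(samples.shape)                 # torch.Size([16, 784])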

The encoder and the decoder of a Variational Autoencoder are also called the recognition model and the generative model, respectively.

Formulation

As shown in the figure, the input data is x with probability distribution p(x), and we want to model this distribution. The latent space encoding vector is z. So, according to Bayes’ theorem,

p(z|x) = p(x|z) p(z) / p(x)

The set of relationships between input data and latent space can be defined as

p(z) : Prior

p(x|z): Likelihood

p(z|x): Posterior

The denominator on the RHS can be computed by marginalizing over z:

p(x) = ∫ p(x|z) p(z) dz

But this integral is very difficult to calculate in a high-dimensional space, which makes the posterior p(z|x) intractable as well. One way to solve this is variational inference. If we can choose another distribution, say q(z|x), which we can compute and which approximates p(z|x), then our problem is solved. So the objective is to make q(z|x) match the distribution p(z|x), which can be done by minimizing the KL divergence between them. KL divergence measures how different two distributions are; a lower value of KL divergence indicates higher similarity between the distributions. A detailed description of how this is solved can be found here. Ultimately it all boils down to maximizing the following objective, known as the evidence lower bound (ELBO):

ELBO(x) = E_{z ~ q(z|x)} [ log p(x|z) ] − KL( q(z|x) || p(z) )

The first term is the likelihood of observing x given z, i.e. a reconstruction term, and the second term is the KL divergence between q(z|x) and the prior p(z). So basically we maximize the first term while minimizing the second term, i.e. we maximize the ELBO. This process of approximating p(z|x) with q(z|x) is called variational inference.
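For completeness, the reason maximizing the ELBO makes q close to the true posterior follows from a standard decomposition of the log-evidence, written here in the same notation:

log p(x) = ELBO(x) + KL( q(z|x) || p(z|x) )

Since log p(x) does not depend on q, and a KL divergence is never negative, raising the ELBO necessarily lowers the KL divergence between q(z|x) and the true posterior p(z|x).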

Reparameterization Trick

Another problem we encounter concerns back-propagation. During back-propagation, every element in the network must be differentiable, but the random sampling stage is stochastic and hence not differentiable. To overcome this problem, a new element called epsilon (ε) is introduced into the network. Thus the latent variable z, which is originally sampled as

z ~ N(μ, σ²)

is modified as

z = μ + σ ⊙ ε

given that

ε ~ N(0, I)

where ⊙ denotes the element-wise product between σ and ε. This ε takes care of the stochastic part.

Reparameterization trick.

Thus the clever trick here is that the mean and standard deviation are the only elements we train by calculating gradients and doing back-propagation; epsilon has a fixed distribution and does not need to be trained. This is called the reparameterization trick.
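A minimal sketch of this in PyTorch (the layer sizes and names are my own placeholders, not from the paper): the encoder outputs μ and log σ², the sample z is written as a deterministic function of them plus noise, so gradients flow through μ and σ while ε stays outside the trained parameters.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Encoder that outputs the mean and log-variance of q(z|x)."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # predicts mu
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # predicts log(sigma^2)

    def forward(self, x):
        h = self.hidden(x)
        return self.fc_mu(h), self.fc_logvar(h)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps, with eps ~ N(0, I).

    eps carries all the randomness, so gradients can flow
    through mu and sigma during back-propagation.
    """
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)  # fixed N(0, I) noise, not trained
    return mu + sigma * eps

# Usage sketch:
encoder = VAEEncoder()
x = torch.rand(8, 784)            # a dummy batch
mu, logvar = encoder(x)
z = reparameterize(mu, logvar)    # differentiable w.r.t. mu and logvar
```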

I hope this article summarizes the concept behind Variational Autoencoders well! Feedback and suggestions are most welcome!
