In our last article we discussed the intuition and theory behind Generative Adversarial Networks (GANs).
In this article, we'll expand on this intuition and review the key components of GANs in more detail. This includes topics such as activation functions, batch normalization, convolutions, and so on. These components are all used to create a deep convolutional GAN, or DCGAN, for the purpose of image generation.
This article is based on notes from the first course in this Generative Adversarial Networks (GANs) Specialization and is organized as follows:
- Introduction to Activations
- Common Activation Functions
- Introduction to Batch Normalization
- Review of Convolutions
- Pooling and Upsampling
- Transposed Convolutions
- The DCGAN Paper
Introduction to Activations
Activations are functions that take any real number as input and output a number within a certain range using a non-linear differentiable function.
Activations are typically used between the layers of a deep neural network, for example in networks built for classification.
For example, suppose we have a neural network with two hidden layers and multiple inputs $X_0$, $X_1$, ... $X_n$. Let's say the neural network uses these features to determine whether an image is a cat, meaning it outputs a probability between zero and one.
All the nodes in between the input and output comprise the entire neural network architecture.
A node takes in information from the previous layer and computes two things:
$$z^{[l]}_i = \sum_{j} W^{[l]}_{ij} a^{[l-1]}_j$$
The first thing computed is $z^{[l]}_i$, where $i$ indicates which node it is and $l$ indicates the layer.
$z^{[l]}_i$ is a weighted sum of the outputs from the previous layer, $a^{[l-1]}_j$.
The second is the node's own output $a^{[l]}_i$, which comes from an activation function that we can call $g^{[l]}$ and that takes $z^{[l]}_i$ as its input:
$$a^{[l]}_i = g^{[l]}(z^{[l]}_i)$$
This activation function needs to be a differentiable, non-linear function for the following reasons:
- It needs to be differentiable for backpropagation in training and updating the neural network
- It needs to be non-linear in order to compute complex features
In summary, we need a non-linear differentiable activation function to take advantage of deep learning and construct a complex neural network.
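To make this concrete, here's a minimal sketch in PyTorch of the per-node computation described above: a linear layer produces the weighted sums $z^{[1]}$, and a ReLU activation (covered in the next section) produces $a^{[1]}$. The layer sizes and input values here are made up purely for illustration.

```python
import torch
import torch.nn as nn

x = torch.tensor([[0.5, -1.2, 3.0]])              # a^[0]: three input features
layer = nn.Linear(in_features=3, out_features=2)  # weights W^[1] (plus a bias)
g = nn.ReLU()                                     # non-linear activation g^[1]

z = layer(x)  # z^[1]_i = sum_j W^[1]_ij * a^[0]_j + b^[1]_i
a = g(z)      # a^[1]_i = g^[1](z^[1]_i)
print(z, a)
```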
Common Activation Functions
Let's now look at several commonly used activation functions, including:
- ReLU
- Leaky ReLU
- Sigmoid
- Tanh
ReLU
ReLU, or Rectified Linear Unit, is one of the most commonly used activation functions and works as follows:
$$g^{[l]}(z^{[l]}) = \max(0, z^{[l]})$$
ReLU takes the max of $z^{[l]}$ and 0; in other words, it sets all negative values to zero.
One problem with ReLU is known as the dying ReLU problem, which occurs when a unit only outputs 0 for any input and its weights stop learning.
Leaky ReLU
The dying ReLU problem is why the leaky ReLU variation exists.
Leaky ReLU maintains the same form as ReLU when $z$ is positive. When $z$ is negative, it adds a little "leak", a small non-zero slope, so the function is still non-linear thanks to the bend at $z = 0$.
$$g^{[l]}(z^{[l]}) = \max(az^{[l]}, z^{[l]})$$
The slope $a$ is treated as a hyperparameter and is typically set to 0.1, meaning the leak is quite small relative to the positive slope.
Sigmoid
The sigmoid activation function has a smooth S shape and outputs values between 0 and 1.
$$g^{[l]}(z^{[l]}) = \frac{1}{1 + e^{-z^{[l]}}}$$
The sigmoid activation function isn't used very often in hidden layers since the derivative of the function approaches 0 at the tails of the function. This produces what's known as the vanishing gradient problem or saturated outputs at the tails of the function.
Tanh
The tanh activation function has a similar shape to sigmoid, except that it outputs values between -1 and 1.
$$g^{[l]}(z^{[l]}) = \tanh(z^{[l]})$$
When $z$ is positive it outputs a value between 0 and 1, and when $z$ is negative it outputs a value between -1 and 0.
A key difference is that tanh keeps the sign of the input, which can be useful in some applications.
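As a quick illustration, here's a minimal sketch in PyTorch that applies each of the four activation functions to the same handful of values, so you can see the clipping, the small leak, and the output ranges directly. The input values are arbitrary.

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(F.relu(z))                            # negatives clipped to 0
print(F.leaky_relu(z, negative_slope=0.1))  # small slope a = 0.1 when z < 0
print(torch.sigmoid(z))                     # outputs in (0, 1)
print(torch.tanh(z))                        # outputs in (-1, 1), keeps the sign
```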
Introduction to Batch Normalization
GANs can often take a long time to train, especially for larger applications.
The models can also be quite fragile in the learning process as they're not as simple as building a classifier.
For this reason, batch normalization is a technique we can use to stabilize and improve the training process.
For example, the distribution of the input variables may change from the training set to the test set, which is known as covariate shift. This is where normalization comes in.
If the input variables, let's say $X_1$ and $X_2$, are normalized, this means the distribution of the new input variables $X_1'$ and $X_2'$ will be much more similar.
This normalization will typically make the cost function look much smoother and balanced across the two dimensions. As a result, training will also be easier and likely faster.
No matter how much the distribution of the input variables changes, the normalized variables will always end up in roughly the same place: around a mean of 0 and a standard deviation of 1.
Using normalization, the effect of this covariate shift will be significantly reduced.
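As a concrete example, here's a tiny sketch of normalizing a feature vector to roughly mean 0 and standard deviation 1; the values are arbitrary and the small epsilon is only there to avoid division by zero.

```python
import torch

# Normalize a feature: subtract the mean, divide by the standard deviation.
x = torch.tensor([2.0, 4.0, 6.0, 8.0])
x_norm = (x - x.mean()) / (x.std(unbiased=False) + 1e-5)
print(x_norm.mean(), x_norm.std(unbiased=False))  # roughly 0 and 1
```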
Neural networks are also susceptible to what's referred to as internal covariate shift, which means there's a covariate shift in the internal hidden layers of the network.
Batch normalization, as the name suggests, uses batch statistics to normalize these hidden values and works to reduce the effects of internal covariate shift.
Batch normalization may sound difficult to implement, but keep in mind that frameworks like TensorFlow and PyTorch handle the procedure for you.
With these frameworks, all you have to do is add a batch normalization layer: the running statistics are computed and saved during training and then used automatically when the model is in test mode.
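For example, in PyTorch this amounts to dropping a batch normalization layer between a convolution and its activation, roughly as in the sketch below (the layer sizes and input are arbitrary):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm2d(num_features=16),  # normalizes each channel using batch statistics
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)  # a batch of 8 fake RGB images
out = block(x)                 # training mode: batch statistics are used and tracked

block.eval()                   # test mode: the saved running statistics are used instead
out_test = block(x)
```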
Review of Convolutions
Convolutions are a key part of many GAN architectures as they're central to image processing.
Convolutions allow you to detect key features in different areas of an image using filters. For example, these filters scan the image and can tell you which parts of the image contain particular features, such as eyes.
Each filter is simply a matrix of real values that are learned during training.
At a high-level, the convolution operation works as follows:
- Let's say we have a 5x5 grayscale image where each square is a pixel with a value between 0 and 255
- Let's also say we have a 3x3 filter with 1 in the first column, 0 in the second, and -1 in the third
- The convolution operation works by applying this filter across the grayscale image
- You then multiply each element of the filter by the pixel value it sits on top of
- You then take the sum of all of those products and put it in the corresponding entry of the resulting matrix
This is the simplest version of a convolution operation, although there are tweaks we can add that are helpful in image processing.
In summary, convolutions recognize patterns in images by scanning each section of the image and detecting features. A convolution is simply a series of sums of the element-wise products across the image.
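To make this concrete, here's a minimal sketch in PyTorch of the 5x5 grayscale image and 3x3 filter example above. The pixel values are made up, and the filter has 1s, 0s, and -1s in its columns, which makes it a simple vertical-edge detector.

```python
import torch
import torch.nn.functional as F

image = torch.tensor([
    [10., 10., 10., 200., 200.],
    [10., 10., 10., 200., 200.],
    [10., 10., 10., 200., 200.],
    [10., 10., 10., 200., 200.],
    [10., 10., 10., 200., 200.],
]).reshape(1, 1, 5, 5)  # (batch, channels, height, width)

kernel = torch.tensor([
    [1., 0., -1.],
    [1., 0., -1.],
    [1., 0., -1.],
]).reshape(1, 1, 3, 3)

# Each output value is the sum of element-wise products between the filter
# and the 3x3 patch of the image it currently sits on.
out = F.conv2d(image, kernel)  # stride 1, no padding -> 3x3 output
print(out.squeeze())
```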
Padding & Stride
If we take the grayscale image example discussed above, we applied the 3x3 convolutional filter by computing the sum of element-wise products with the 3x3 section in the top-left corner.
Then we move one pixel to the right and compute it again, and move it one more time to the right by one pixel.
We then move one pixel down and start from the left again, until we have applied the filter to the entire image.
Moving one pixel to the right and one pixel down is called a stride of 1.
We also don't need to use a stride of 1 for moving right and down; we could use a stride of 2 to the right and 4 down, for example.
As you increase the stride size, the computation is much faster, although it is a tradeoff as the filter will have less coverage of the image.
In this example, a stride of 2 would result in a smaller matrix of size 2x2.
Another tweak we can apply to the convolution operation is padding.
If we're applying a filter with a stride of 1, for example, the corner pixels of the image are touched by the filter only once, whereas pixels near the center are visited several times.
This can be a problem if we have features in the edge of an image that are important.
To address this, we can apply a frame around the image, which is referred to as padding.
By applying padding before computing the convolution, the filter also scans along the frame, which results in the pixels of the image being visited a much more equal number of times. This is because all of the pixels of the original image are now surrounded by the frame and can each sit at the center of the filter.
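Here's a small sketch in PyTorch showing how stride and padding change the output size for the same 5x5 image and 3x3 filter shapes used above (the values themselves are random and don't matter):

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 1, 5, 5)
kernel = torch.randn(1, 1, 3, 3)

print(F.conv2d(image, kernel, stride=1).shape)             # -> (1, 1, 3, 3)
print(F.conv2d(image, kernel, stride=2).shape)             # -> (1, 1, 2, 2)
print(F.conv2d(image, kernel, stride=1, padding=1).shape)  # -> (1, 1, 5, 5)
```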
Pooling and Upsampling
Two common layers in convolutional neural networks (CNNs) are pooling and upsampling.
Pooling is used to reduce the size of the input, whereas upsampling is used to increase the size of the input.
Pooling reduces the dimensions of the input by taking the mean or finding the maximum value of different areas in the image.
For example, applying pooling to an image will often result in a blurry image that still has roughly the same color distribution.
It is then much less expensive to perform computation on the pooled layer than the original image.
One of the most popular types of pooling layer is max pooling, which is defined as:
Max pooling is a sample-based discretization process. The objective is to down-sample an input representation (image, hidden-layer output matrix, etc.), reducing its dimensionality and allowing for assumptions to be made about features contained in the sub-regions binned.
It's important to note that pooling doesn't have any learned parameters; instead, it's just a simple rule applied across the image.
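For example, here's a small worked example of 2x2 max pooling in PyTorch: each 2x2 region of the input is replaced by its maximum value, and nothing is learned. The input values are made up.

```python
import torch
import torch.nn as nn

x = torch.tensor([
    [1., 3., 2., 0.],
    [5., 4., 1., 1.],
    [0., 2., 9., 6.],
    [1., 1., 3., 7.],
]).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keep the max of each 2x2 region
print(pool(x).squeeze())
# tensor([[5., 2.],
#         [2., 9.]])
```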
Upsampling has the opposite effect of pooling and results in an image with a higher resolution.
Upsampling requires inferring values for the additional pixels.
There are a few different ways to do this, including nearest neighbours, which copies the values of the pixels from the input multiple times to fill in the output.
There are many other ways to do upsampling, including linear and bi-linear interpolation.
All of these pooling and upsampling operations are taken care of by machine learning frameworks like TensorFlow and PyTorch.
As with pooling, upsampling doesn't involve learned parameters; it's just a fixed rule applied to the image.
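Here's a matching sketch of nearest-neighbour upsampling in PyTorch: each input pixel is simply copied to fill the larger output, again with no learned parameters.

```python
import torch
import torch.nn as nn

x = torch.tensor([
    [1., 2.],
    [3., 4.],
]).reshape(1, 1, 2, 2)

up = nn.Upsample(scale_factor=2, mode='nearest')  # copy each pixel into a 2x2 block
print(up(x).squeeze())
# tensor([[1., 1., 2., 2.],
#         [1., 1., 2., 2.],
#         [3., 3., 4., 4.],
#         [3., 3., 4., 4.]])
```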
Next, we'll look at a method similar to upsampling that does have learnable parameters, which is called transposed convolutions.
Transposed Convolutions
Transposed convolutions are an upsampling technique that uses a learnable filter to increase the size of the input.
Let's say we have a 2x2 input that we want to upsample to a 3x3 output.
We can accomplish this using a transposed convolution with a 2x2 learned filter and a stride of 1.
One of the common problems with transposed convolutions is that the center pixel of the 3x3 output will be visited four times, whereas the corners are only visited once. This can result in the image having a checkerboard-like appearance.
To avoid this issue, a common technique is to use upsampling and then use a convolutional layer.
You can learn more about deconvolution and the checkerboard effect here.
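To tie this together, here's a minimal sketch in PyTorch of the 2x2 to 3x3 example above, a transposed convolution with a 2x2 learned filter and a stride of 1, alongside the common alternative of upsampling followed by a regular convolution. The shapes are what matter here; the input values are random.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 2)

# Transposed convolution: a 2x2 learned filter with stride 1 turns 2x2 into 3x3.
tconv = nn.ConvTranspose2d(in_channels=1, out_channels=1, kernel_size=2, stride=1)
print(tconv(x).shape)  # -> (1, 1, 3, 3)

# Common alternative to reduce checkerboard artifacts: upsample, then convolve.
up_then_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),                         # 2x2 -> 4x4
    nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1),  # keeps 4x4
)
print(up_then_conv(x).shape)  # -> (1, 1, 4, 4)
```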
The DCGAN Paper
While we won't go over the code for a DCGAN in this article, you can find an implementation in the first course here. You can also find the full DCGAN paper below:
Summary: Components of DCGANs
In this article, we discussed the key components of building a DCGAN for the purpose of image generation. This includes techniques such as activation functions, batch normalization, convolutions, pooling and upsampling, and transposed convolutions.
In the following articles, we'll expand on these components and look at other architectures such as Wasserstein GANs and conditional GANs.