[CS231n] 8. Convolutional Neural Networks: Architectures, Pooling Layers
layers, spatial arrangement, layer patterns, layer sizing patterns, AlexNet/ZFnet/VGGNet case studies, computational considerations
Convolutional Neural Networks (CNNs / ConvNets)
- CNNs are very similar to ordinary Neural Networks.
→ They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still has a score function and a loss function (SVM/Softmax) on the last (fully-connected) layer.
- ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture.
→ These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
Architecture Overview
- Regular Neural Nets don't scale well to full images.
→ In CIFAR-10, images are only of size 32x32x3, so a single fully-connected neuron in a first hidden layer would have 32x32x3 = 3,072 weights.
→ As the image size and the number of neurons grow, the parameters add up quickly. This full connectivity is wasteful, and the huge number of parameters would quickly lead to overfitting.
- The layers of a ConvNet instead have neurons arranged in 3 dimensions: width, height, depth.
Left: A regular 3-layer Neural Network.
Right: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels)
Layers used to build ConvNets
A simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function.
- Three main types of layers are used to build ConvNet architectures
- Convolutional Layer
- Pooling Layer
- Fully-Connected Layer (exactly as seen in regular NNs)
- Example : CIFAR-10 classification [INPUT - CONV - RELU - POOL - FC]
- INPUT[32x32x3] will hold the raw pixel values of the image
- CONV layer will compute the output of neurons that are connected to local regions in the input. This may result in a volume such as [32x32x12] if we decided to use 12 filters
- ReLU layer will apply an elementwise activation function, such as the max(0,x) thresholding at zero. This leaves the size of the volume unchanged [32x32x12]
- POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16x16x12]
- FC layer will compute the class scores, resulting in volume of size [1x1x10], where each of the 10 numbers correspond to a class score. Each neuron in this layer will be connected to all the numbers in the previous volume.
In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores.
- The ReLU/POOL layers implement a fixed function (they have no trainable parameters). The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image.
The activations of an example ConvNet architecture. The initial volume stores the raw image pixels (left) and the last volume stores the class scores (right). Each volume of activations along the processing path is shown as a column. Since it's difficult to visualize 3D volumes, we lay out each volume's slices in rows. The last layer volume holds the scores for each class, but here we only visualize the sorted top 5 scores, and print the labels of each one.
Convolutional Layer
The Conv layer is the core building block of a Convolutional Network that does most of the computational heavy lifting.
Overview and intuition without brain stuff
- The CONV layer's parameters consist of a set of learnable filters
- Every filter is small spatially (along width and height), but extends through the full depth of the input volume
- Example : filter 5x5x3 size (depth 3, the color channels)
As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position.
→ Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation.
Local Connectivity
- We will connect each neuron to only a local region of the input volume.
- The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently this is the filter size).
- Example: input volume has size [32x32x3]
If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5x5x3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.
Spatial arrangement
- Three hyperparameters control the size of the output volume: the depth, stride and zero-padding.
1. Depth of the output volume is a hyperparameter
→ It corresponds to the number of filters.
→ Different neurons along the depth dimension may activate in presence of various oriented edges, or blobs of color.
→ We will refer to a set of neurons that are all looking at the same region of the input as a depth column.
2. Stride : when the stride is 1, we move the filters one pixel at a time. When the stride is 2, the filters jump 2 pixels at a time. This produces smaller output volumes spatially.
3. Zero-padding : sometimes it will be convenient to pad the input volume with zeros around the border.
The nice feature of zero padding is that it will allow us to control the spatial size of the output volumes
- We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border.
- (W−F+2P)/S+1.
For example for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output.
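As a quick sanity check of this formula, here is a minimal helper (the function name is a hypothetical illustration, not part of the notes) that evaluates it for the 7x7 example:

def conv_output_size(W, F, S, P):
    # output spatial size = (W - F + 2P)/S + 1
    out = (W - F + 2 * P) / S + 1
    assert out == int(out), "hyperparameters do not tile the input evenly"
    return int(out)

print(conv_output_size(W=7, F=3, S=1, P=0))   # 5
print(conv_output_size(W=7, F=3, S=2, P=0))   # 3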
Parameter Sharing
Parameter sharing scheme is used in Convolutional Layers to control the number of parameters.
- It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption
→ That if one feature is useful to compute at some spatial position (x, y), then it should also be useful to compute at a different position (x2, y2).
- Denoting a single 2-dimensional slice of depth as a depth slice (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias.
→ With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique sets of weights (one per depth slice), for a total of 96 x 11 x 11 x 3 = 34,848 unique weights, or 34,944 parameters (+96 biases). All 55x55 neurons in each depth slice now use the same parameters.
- In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients are added up across each depth slice and only update a single set of weights per slice.
Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11x11x3], and each one is shared by the 55x55 neurons in one depth slice. Notice that the parameter sharing assumption is relatively reasonable: If detecting a horizontal edge is important at some location in the image, it should intuitively be useful at some other location as well due to the translationally-invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the 55x55 distinct locations in the Conv layer output volume.
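As a quick check of the counts above, here is plain-Python arithmetic using only the numbers already stated (96 filters of size 11x11x3 and a [55x55x96] output volume):

num_filters, F, depth = 96, 11, 3
shared_weights = num_filters * F * F * depth       # 96 * 11 * 11 * 3 = 34,848 unique weights
shared_params = shared_weights + num_filters       # + 96 biases = 34,944 parameters
neurons = 55 * 55 * 96                             # 290,400 neurons in the output volume
unshared_params = neurons * (F * F * depth + 1)    # 105,705,600 parameters without any sharing
print(shared_params, unshared_params)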
- Sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have some specific centered structure.
→ For example, when the inputs are faces that have been centered in the image.
→ In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a Locally-Connected Layer.
Numpy example
Suppose that the input volume is a numpy array X.
- A depth column (or a fibre) at position (x, y) would be the activations X[x, y, :]
- A depth slice, or an activation map at depth d, would be the activations X[:, :, d]
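A tiny numpy sketch of these indexing conventions (the 32x32x3 shape is just an illustrative assumption):

import numpy as np

X = np.random.randn(32, 32, 3)     # an input volume
fibre = X[10, 10, :]               # depth column (fibre) at position (10, 10) -> shape (3,)
activation_map = X[:, :, 0]        # depth slice at depth 0 -> shape (32, 32)
print(fibre.shape, activation_map.shape)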
Conv Layer Example
Suppose that the input volume X has shape X.shape: (11, 11, 4). Suppose further that
- we use no zero-padding (P = 0)
- the filter size is F = 5
- the stride is S = 2
→ The output volume would have spatial size (11 - 5)/2 + 1 = 4
→ Giving a volume with width and height of 4
The activation map in the output volume (call it V) would then look as follows:
), would then look as followsV[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0
→ In numpy, the operation * above denotes elementwise multiplication between the arrays.
→ W0 is the weight vector of that neuron and b0 is the bias.
→ Here, W0 is assumed to be of shape W0.shape: (5, 5, 4), since the filter size is 5 and the depth of the input volume is 4.
To construct a second activation map in the output volume, we would have:
V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1
V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1
V[2,0,1] = np.sum(X[4:9,:5,:] * W1) + b1
V[3,0,1] = np.sum(X[6:11,:5,:] * W1) + b1
V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1    (example of going along y)
V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1  (or along both)
Recall that these activation maps are then often passed elementwise through an activation function such as ReLU.
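Putting the example together, here is a minimal naive-loop sketch of the forward pass above (same assumed shapes: X of (11, 11, 4), a single filter W0 of (5, 5, 4) with bias b0, stride 2, no padding; random values stand in for real data):

import numpy as np

X = np.random.randn(11, 11, 4)        # input volume
W0 = np.random.randn(5, 5, 4)         # one filter, spanning the full input depth
b0 = 0.1                              # its bias
F, S = 5, 2
out_size = (11 - F) // S + 1          # (11 - 5)/2 + 1 = 4

V = np.zeros((out_size, out_size))    # one activation map of the output volume
for i in range(out_size):
    for j in range(out_size):
        patch = X[i*S:i*S+F, j*S:j*S+F, :]     # local [5x5x4] region
        V[i, j] = np.sum(patch * W0) + b0      # elementwise multiply, sum, add bias
V = np.maximum(V, 0)                  # elementwise ReLU applied afterwards
print(V.shape)                        # (4, 4)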
Implementation as Matrix Multiplication
The convolution operation essentially performs dot products between the filters and local regions of the input.
- The local regions in the input image are stretched out into columns in an operation commonly called im2col.
Example
- The input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4. We would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11x11x3 = 363.
- A stride of 4 gives (227 - 11)/4 + 1 = 55 locations along both width and height, leading to an output matrix X_col of im2col of size [363 x 3025], where every column is a stretched-out receptive field and there are 55x55 = 3025 of them in total.
- The weights of the CONV layer are similarly stretched out into rows. For example, if there are 96 filters of size [11x11x3], this gives a matrix W_row of size [96 x 363].
- The result of the convolution is now equivalent to performing one large matrix multiply np.dot(W_row, X_col), which evaluates the dot product between every filter and every receptive field location. The output of this operation would be [96 x 3025], giving the dot product of each filter at each location.
- The result must finally be reshaped back to its proper output dimension [55x55x96].
This approach has the downside that it can use a lot of memory, since some values in the input volume are replicated multiple times in X_col. However, the benefit is that there are many very efficient implementations of matrix multiplication that we can take advantage of. The same im2col idea can be reused to perform the pooling operation.
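A minimal im2col-style sketch of this idea (the helper name im2col and the naive loops are illustrative assumptions, not an optimized library routine):

import numpy as np

def im2col(X, F, S):
    # Stretch every FxFxD receptive field of an (H, W, D) volume into a column.
    H, W, D = X.shape
    out_h = (H - F) // S + 1
    out_w = (W - F) // S + 1
    cols = np.zeros((F * F * D, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = X[i*S:i*S+F, j*S:j*S+F, :]
            cols[:, i * out_w + j] = patch.ravel()
    return cols

X = np.random.randn(227, 227, 3)
filters = np.random.randn(96, 11, 11, 3)            # 96 filters of size 11x11x3
X_col = im2col(X, F=11, S=4)                        # [363 x 3025]
W_row = filters.reshape(96, -1)                     # [96 x 363]
out = np.dot(W_row, X_col)                          # [96 x 3025]
out = out.reshape(96, 55, 55).transpose(1, 2, 0)    # back to [55x55x96]
print(X_col.shape, out.shape)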
Backpropagation
The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters).
Dilated convolutions
A recent development (e.g. see paper by Fisher Yu and Vladlen Koltun) is to introduce one more hyperparameter to the CONV layer called the dilation.
- It’s possible to have filters that have spaces between each cell, called dilation.
- Example
In one dimension, a filter w of size 3 would compute over input x the following: w[0]*x[0] + w[1]*x[1] + w[2]*x[2]. This is dilation of 0.
For dilation 1 the filter would instead compute w[0]*x[0] + w[1]*x[2] + w[2]*x[4] (see the short sketch below).
- This can be very useful in some settings to use in conjunction with 0-dilated filters because it allows you to merge spatial information across the inputs much more aggressively with fewer layers.
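A one-dimensional numpy sketch of the dilation example above (toy values, chosen only to make the taps easy to follow):

import numpy as np

w = np.array([1.0, 2.0, 3.0])      # filter of size 3
x = np.arange(10, dtype=float)     # 1D input: 0, 1, ..., 9

y_dil0 = w[0]*x[0] + w[1]*x[1] + w[2]*x[2]   # contiguous taps (dilation 0) -> 8.0
y_dil1 = w[0]*x[0] + w[1]*x[2] + w[2]*x[4]   # one-cell gaps (dilation 1)   -> 16.0
print(y_dil0, y_dil1)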
Pooling Layer
It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture.
- Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting.
- The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations.
- Accepts a volume of size W1 x H1 x D1 and produces a volume of size W2 x H2 x D2
- Requires two hyperparameters : spatial extent F, stride S
→ W2 = (W1 - F)/S + 1
→ H2 = (H1 - F)/S + 1
→ D2 = D1
- Only two variations are commonly seen in practice : a pooling layer with F=3, S=2 (also called overlapping pooling), and more commonly F=2, S=2. Pooling sizes with larger receptive fields are too destructive.
In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. Notice that the volume depth is preserved.
The most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (little 2x2 square).
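A minimal numpy sketch of the 2x2, stride-2 max pooling in this example (it assumes the spatial dimensions are divisible by 2):

import numpy as np

def max_pool_2x2(X):
    # Downsample an (H, W, D) volume by 2 along width and height, keeping the depth.
    H, W, D = X.shape
    return X.reshape(H // 2, 2, W // 2, 2, D).max(axis=(1, 3))

X = np.random.randn(224, 224, 64)
print(max_pool_2x2(X).shape)       # (112, 112, 64)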
Backpropagation
Backward pass for a max(x, y) operation has a simple interpretation as only routing the gradient to the input that had the highest value in the forward pass.
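A sketch of that routing for a single 2x2 window (hypothetical values; real implementations typically cache the argmax from the forward pass):

import numpy as np

window = np.array([[1.0, 3.0],
                   [2.0, 0.5]])              # one 2x2 pooling window from the forward pass
grad_out = 7.0                               # upstream gradient for the pooled output

grad_in = np.zeros_like(window)
idx = np.unravel_index(np.argmax(window), window.shape)
grad_in[idx] = grad_out                      # only the winning (max) input receives the gradient
print(grad_in)                               # [[0. 7.] [0. 0.]]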
Getting rid of pooling
Many people dislike the pooling operation and think that we can get away without it.
- To reduce the size of the representation, they suggest using a larger stride in the CONV layers once in a while.
- Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs).
Normalization Layer
Many types of normalization layers have been proposed for use in ConvNet architectures. However, these layers have since fallen out of favor because in practice their contribution has been shown to be minimal.
Fully-connected layer
Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset.
Converting FC layers to CONV layers
The neurons in FC and CONV layers still compute dot products, so their functional form is identical. Therefore, it turns out that it's possible to convert between FC and CONV layers.
- For any CONV layer there is an FC layer that implements the same forward function.
→ The weight matrix would be a large matrix that is mostly zero except at certain blocks (due to local connectivity), where the weights in many of the blocks are equal (due to parameter sharing).
- Conversely, any FC layer can be converted to a CONV layer.
→ For example, an FC layer with K=4096 that is looking at some input volume of size 7×7×512 can be equivalently expressed as a CONV layer with F=7,P=0,S=1,K=4096.
In other words, we are setting the filter size to be exactly the size of the input volume, and hence the output will simply be 1×1×4096 since only a single depth column “fits” across the input volume, giving identical result as the initial FC layer.
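A numpy sketch of this equivalence with small stand-in sizes (8 input channels and K=16 instead of 512 and 4096, purely to keep the example cheap):

import numpy as np

H = W = 7
D, K = 8, 16                                  # toy stand-ins for depth 512 and K = 4096
X = np.random.randn(H, W, D)                  # input volume
W_fc = np.random.randn(K, H * W * D)          # FC weight matrix
b = np.random.randn(K)

fc_out = W_fc.dot(X.ravel()) + b              # ordinary fully-connected forward pass

W_conv = W_fc.reshape(K, H, W, D)             # same weights viewed as K filters of size HxWxD
conv_out = np.array([np.sum(X * W_conv[k]) + b[k] for k in range(K)])   # F=7, P=0, S=1 -> 1x1xK

print(np.allclose(fc_out, conv_out))          # True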