[CS231n] 4. Backpropagation
Keywords : chain rule interpretation, real-valued circuits, patterns in gradient flow
1) Introduction
Intuitive understanding of backpropagation
- A way of computing gradients of expressions through recursive application of chain rule.
- Understanding this process is critical to effectively develop, design, and debug neural networks.
- Think of the training data as given and fixed, and of the weights as variables we have control over.
→ We usually only compute the gradient for the parameters (e.g. W, b) so that we can use it to perform a parameter update.
- The gradient on x_i can still be useful sometimes, for example for purposes of visualization and interpreting what the Neural Network might be doing.
2) Simple expressions & interpretation of the gradient
- Multiplication of two numbers, f(x, y) = xy: the partial derivative with respect to either input is the other input, i.e. ∂f/∂x = y and ∂f/∂y = x.
- The derivative indicates the rate of change of a function with respect to a variable in an infinitesimally small region near a particular point: df(x)/dx = lim_{h→0} (f(x + h) − f(x)) / h
- When h is very small, the function is well-approximated by a straight line, and the derivative is its slope.
→ The derivative on each variable tells you the sensitivity of the whole expression to its value.
- Derivatives for the addition operation f(x, y) = x + y are both 1 regardless of the input values: ∂f/∂x = ∂f/∂y = 1.
- max operation
The (sub)gradient is 1 on the input that was larger and 0 on the other input.
- Example
- If the inputs are x=4, y=2, then the max is 4, and the function is not sensitive to the setting of y.
- If we were to increase y by a tiny amount h, the function would keep outputting 4, and therefore the gradient on y is zero: there is no effect.
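These claims can be checked numerically with a centered finite difference. A minimal sketch (not from the lecture notes); the step size h and the helper name numerical_grad are illustrative choices:

def numerical_grad(f, x, y, h=1e-5):
    """Centered finite-difference estimate of (df/dx, df/dy) at (x, y)."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

print(numerical_grad(lambda x, y: x * y, 4.0, -3.0))    # ~(-3.0, 4.0): df/dx = y, df/dy = x
print(numerical_grad(lambda x, y: max(x, y), 4.0, 2.0)) # ~(1.0, 0.0): gradient routed to the larger input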
3) Compound expressions with chain rule
complicated expressions that involve multiple composed functions, such as f(x,y,z)=(x+y)z
- chain rule
- Correct way to “chain” gradient expressions together is through multiplication.
- In practice this is simply a multiplication of the two numbers that hold the two gradients.
- Example
# set some inputs
x = -2
y = 5
z = -4

# perform the forward pass
q = x + y # q becomes 3
f = q * z # f becomes -12

# perform the backward pass (backpropagation) in reverse order:
# first backprop through f = q * z
dfdz = q # df/dz = q, so gradient on z becomes 3
dfdq = z # df/dq = z, so gradient on q becomes -4
# now backprop through q = x + y
dfdx = 1.0 * dfdq # dq/dx = 1. And the multiplication here is the chain rule!
dfdy = 1.0 * dfdq # dq/dy = 1
- At the end we are left with the gradients [dfdx, dfdy, dfdz], which tell us the sensitivity of f to the variables x, y, z.
- This computation can also be nicely visualized with a circuit diagram:
The real-valued "circuit" shows the visual representation of the computation. The forward pass computes values from inputs to output (shown in green). The backward pass then performs backpropagation which starts at the end and recursively applies the chain rule to compute the gradients (shown in red) all the way to the inputs of the circuit. The gradients can be thought of as flowing backwards through the circuit.
4) Intuitive understanding of backpropagation
- Every gate in a circuit diagram gets some inputs and can right away compute two things:
- Its output value
- The local gradient of its output with respect to its inputs. The gates can do this completely independently, without being aware of any details of the full circuit.
→ During backpropagation the gate learns the gradient of the final output with respect to its own output value; the chain rule says that the gate should take that gradient and multiply it into every local gradient it normally computes for its inputs.
📌 Backpropagation can be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher.
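A minimal sketch (not from the notes) of this local view: a multiply "gate" that knows only its own forward function and local gradients, and chains any upstream gradient into its inputs. The class name MultiplyGate is illustrative.

class MultiplyGate:
    """The gate only knows its own forward function and local gradients."""
    def forward(self, x, y):
        # compute and remember the output value
        self.x, self.y = x, y
        return x * y

    def backward(self, dout):
        # chain rule: local gradients (y and x) times the upstream gradient
        dx = self.y * dout
        dy = self.x * dout
        return dx, dy

gate = MultiplyGate()
q = gate.forward(3.0, -4.0)   # forward pass: 3 * -4 = -12
dx, dy = gate.backward(1.0)   # backward pass with upstream gradient 1.0
print(q, dx, dy)              # -12.0 -4.0 3.0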
5) Modularity : Sigmoid example
- Describes a 2-dimensional neuron (with inputs x and weights w) that uses the sigmoid activation function.
- Example circuit for a 2D neuron with a sigmoid activation function.
The inputs are [x_0,x_1] and the (learnable) weights of the neuron are [w_0,w_1,w_2]. As we will see later, the neuron computes a dot product with the input, and then its activation is softly squashed by the sigmoid function into the range from 0 to 1.
- sigmoid function σ(x) = 1 / (1 + e^(−x)), whose derivative takes the simple form dσ(x)/dx = (1 − σ(x)) σ(x)
The computational graph representation may not be unique; choose one where the local gradients at each node can be easily expressed.
In any real practical application it would be very useful to group these operations into a single gate, since the sigmoid's local gradient is so simple.
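A minimal sketch (assumed, not in the notes) of what grouping the primitive operations into a single sigmoid gate could look like; the class name SigmoidGate is illustrative.

import math

class SigmoidGate:
    """Fused sigmoid gate: one node in the graph instead of several primitive ops."""
    def forward(self, x):
        self.s = 1.0 / (1.0 + math.exp(-x))
        return self.s

    def backward(self, dout):
        # local gradient (1 - s) * s, chained with the upstream gradient
        return (1.0 - self.s) * self.s * dout

gate = SigmoidGate()
print(gate.forward(1.0))   # ~0.731
print(gate.backward(1.0))  # ~0.197 = (1 - 0.731) * 0.731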
- Example
import math

w = [2,-3,-3] # assume some random weights and data
x = [-1, -2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function

# backward pass through the neuron (backpropagation)
ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient derivation
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
# we're done! we have the gradients on the inputs to the circuit
Implementation protip: staged backpropagation.
- It is always helpful to break down the forward pass into stages that are easily backpropped through.
- For example, dot holds the output of the dot product between w and x.
- During the backward pass we then successively compute (in reverse order) the corresponding variables (ddot, and ultimately dw, dx) that hold the gradients of those variables.
6) Backprop in practice: Staged computation
- Example
Consider the expression f(x, y) = (x + σ(y)) / (σ(x) + (x + y)²). Here is how we would structure the forward pass of such an expression:
import math

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator   #(1)
num = x + sigy # numerator                               #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y                                              #(4)
xpysqr = xpy**2                                          #(5)
den = sigx + xpysqr # denominator                        #(6)
invden = 1.0 / den                                       #(7)
f = num * invden # done!                                 #(8)
It contains multiple intermediate variables, each of which is only a simple expression for which we already know the local gradient.
Every single piece of backprop involves computing the local gradient of that expression and chaining it with the gradient flowing into that expression, via a multiplication.
# backprop f = num * invden
dnum = invden # gradient on numerator                             #(8)
dinvden = num                                                     #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden                                #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                                #(6)
dxpysqr = (1) * dden                                              #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                                        #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                                   #(4)
dy = (1) * dxpy                                                   #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! See notes below  #(3)
# backprop num = x + sigy
dx += (1) * dnum                                                  #(2)
dsigy = (1) * dnum                                                #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                                 #(1)
# done! phew
Cache forward pass variables
- To compute the backward pass it is very helpful to have some of the variables that were used in the forward pass.
Gradients add up at forks
- The forward expression involves the variables x, y multiple times, so when we perform backpropagation we must be careful to use += instead of = to accumulate the gradient on these variables (otherwise we would overwrite it).
- This follows the multivariable chain rule: if a variable branches out to different parts of the circuit, then the gradients that flow back to it add, as the sketch below shows.
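A tiny sketch (not from the notes) of why gradients add at a fork: in f(x, y) = x*y + x the variable x is used twice, so df/dx = y + 1 and the two backward contributions must be accumulated with +=:

x, y = 3.0, -4.0

# forward pass: x branches into two uses
a = x * y        # first use of x
f = a + x        # second use of x

# backward pass
da = 1.0         # df/da
dx = 0.0
dx += y * da     # contribution through a = x * y
dx += 1.0        # contribution through f = a + x
dy = x * da

print(dx, dy)    # -3.0 3.0, i.e. df/dx = y + 1, df/dy = x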
Patterns in backward flow
- add gate : gradient distributor (passes the upstream gradient unchanged to all of its inputs)
- max gate : gradient router (routes the full upstream gradient to the largest input, zero to the others)
- mul gate : gradient switcher (each input receives the upstream gradient times the other input's value); see the sketch below
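A minimal sketch (assumed) showing the three patterns on the same upstream gradient dout = 2.0:

dout = 2.0       # upstream gradient arriving at each gate

# add gate: z = x + y  ->  both inputs receive dout unchanged
dx_add, dy_add = 1.0 * dout, 1.0 * dout                               # (2.0, 2.0)

# max gate: z = max(x, y) with x = 4, y = 2  ->  dout routed to the larger input
x, y = 4.0, 2.0
dx_max, dy_max = (dout if x >= y else 0.0), (dout if y > x else 0.0)  # (2.0, 0.0)

# mul gate: z = x * y  ->  gradients are the switched inputs times dout
dx_mul, dy_mul = y * dout, x * dout                                   # (4.0, 8.0)

print(dx_add, dy_add, dx_max, dy_max, dx_mul, dy_mul)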
7) Gradients for vectorized operations
The above sections were concerned with single variables, but all concepts extend in a straight-forward manner to matrix and vector operations.
One must pay closer attention to dimensions and transpose operations.
- Matrix-Matrix multiply gradient
- generalizes all matrix-vector and vector-vector multiply operations:
import numpy as np

# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X)

# now suppose we had the gradient on D from above in the circuit
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T) # .T gives the transpose of the matrix
dX = W.T.dot(dD)
Tip: use dimension analysis. X is of size [10 x 3] and dD of size [5 x 3], so if we want dW (which must have the same shape as W, [5 x 10]), the only way of achieving this is with dD.dot(X.T), as shown above.
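A short sketch (assumed, not from the notes) that checks these expressions numerically: take the scalar loss L = sum(D * G) for a fixed matrix G, so that the gradient flowing into D is exactly G, and compare the analytic gradient dW = G.dot(X.T) against a centered finite difference on one entry of W. The name G and the seed are illustrative.

import numpy as np

np.random.seed(0)
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
G = np.random.randn(5, 3)           # fixed "upstream" gradient, so dL/dD = G

loss = lambda W, X: np.sum(W.dot(X) * G)

# analytic gradients, as derived above
dW = G.dot(X.T)                     # shape [5 x 10], same as W
dX = W.T.dot(G)                     # shape [10 x 3], same as X

# numerical gradient on one entry of W via centered differences
h = 1e-5
Wp, Wm = W.copy(), W.copy()
Wp[2, 3] += h
Wm[2, 3] -= h
num_dW_23 = (loss(Wp, X) - loss(Wm, X)) / (2 * h)

print(dW[2, 3], num_dW_23)          # the two values should match closely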
Summary
Developed intuition for
- What the gradients mean
- How they flow backwards in the circuit
- How they communicate which parts of the circuit should increase or decrease (and how strongly) so as to make the final output value higher
Discussed the importance of staged computation for practical implementations of backpropagation