[CS231n] 4. Backpropagation
Keywords : chain rule interpretation, real-valued circuits, patterns in gradient flow
1) Introduction
Intuitive understanding of backpropagation
- A way of computing gradients of expressions through recursive application of chain rule.
- Understanding this process is critical to effectively develop, design, and debug neural networks.
- Think of the training data as given and fixed, and of the weights as variables we have control over.
→ We usually only compute the gradient for the parameters (e.g. W, b) so that we can use it to perform a parameter update.
- The gradient on x_i can still be useful sometimes, for example for purposes of visualization and interpreting what the Neural Network might be doing.
2) Simple expressions & interpretation of the gradient
- Multiplication of two numbers, f(x, y) = xy: the partial derivative with respect to either input is the other input, i.e. ∂f/∂x = y and ∂f/∂y = x.
- The derivative indicates the rate of change of a function with respect to a variable in an infinitesimally small region near a particular point: df(x)/dx = lim_{h→0} (f(x + h) − f(x)) / h
- When h is very small, the function is well-approximated by a straight line, and the derivative is its slope.
→ The derivative on each variable tells you the sensitivity of the whole expression to its value.
- Derivatives for the addition operation f(x, y) = x + y are both 1 regardless of the input values: ∂f/∂x = ∂f/∂y = 1.
- max operation
The (sub)gradient is 1 on the input that was larger and 0 on the other input.
- Example
- If the inputs are x=4, y=2, then the max is 4, and the function is not sensitive to the setting of y.
- If we were to increase y by a tiny amount h, the function would keep outputting 4, and therefore the gradient on y is zero: there is no effect.
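These claims can be checked numerically with a centered finite difference. A minimal sketch (not from the lecture notes); the step size h and the helper name numerical_grad are illustrative choices:

def numerical_grad(f, x, y, h=1e-5):
    """Centered finite-difference estimate of (df/dx, df/dy) at (x, y)."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

print(numerical_grad(lambda x, y: x * y, 4.0, -3.0))    # ~(-3.0, 4.0): df/dx = y, df/dy = x
print(numerical_grad(lambda x, y: max(x, y), 4.0, 2.0)) # ~(1.0, 0.0): gradient routed to the larger input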
3) Compound expressions with chain rule
complicated expressions that involve multiple composed functions, such as f(x,y,z)=(x+y)z
- chain rule
- Correct way to “chain” gradient expressions together is through multiplication.
- In practice this is simply a multiplication of the two numbers that hold the two gradients.
- Example
# set some inputs
x = -2
y = 5
z = -4

# perform the forward pass
q = x + y # q becomes 3
f = q * z # f becomes -12

# perform the backward pass (backpropagation) in reverse order:
# first backprop through f = q * z
dfdz = q # df/dz = q, so gradient on z becomes 3
dfdq = z # df/dq = z, so gradient on q becomes -4
# now backprop through q = x + y
dfdx = 1.0 * dfdq # dq/dx = 1. And the multiplication here is the chain rule!
dfdy = 1.0 * dfdq # dq/dy = 1
- At the end we are left with the gradients [dfdx, dfdy, dfdz], which tell us the sensitivity of f to the variables x, y, z.
- This computation can also be nicely visualized with a circuit diagram:
The real-valued "circuit" shows the visual representation of the computation. The forward pass computes values from inputs to output (shown in green). The backward pass then performs backpropagation which starts at the end and recursively applies the chain rule to compute the gradients (shown in red) all the way to the inputs of the circuit. The gradients can be thought of as flowing backwards through the circuit.
4) Intuitive understanding of backpropagation
- Every gate in a circuit diagram gets some inputs and can right away compute two things:
- Its output value
- The local gradient of its output with respect to its inputs. The gates can do this completely independently, without being aware of any details of the full circuit.
→ During backpropagation the gate learns the gradient of the final output with respect to its own output value; the chain rule says that the gate should take that gradient and multiply it into every local gradient it normally computes for its inputs.
📌 Backpropagation can be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher.
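A minimal sketch (not from the notes) of this local view: a multiply "gate" that knows only its own forward function and local gradients, and chains any upstream gradient into its inputs. The class name MultiplyGate is illustrative.

class MultiplyGate:
    """The gate only knows its own forward function and local gradients."""
    def forward(self, x, y):
        # compute and remember the output value
        self.x, self.y = x, y
        return x * y

    def backward(self, dout):
        # chain rule: local gradients (y and x) times the upstream gradient
        dx = self.y * dout
        dy = self.x * dout
        return dx, dy

gate = MultiplyGate()
q = gate.forward(3.0, -4.0)   # forward pass: 3 * -4 = -12
dx, dy = gate.backward(1.0)   # backward pass with upstream gradient 1.0
print(q, dx, dy)              # -12.0 -4.0 3.0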
5) Modularity : Sigmoid example
- Describes a 2-dimensional neuron (with inputs x and weights w) that uses the sigmoid activation function.
- Example circuit for a 2D neuron with a sigmoid activation function.
The inputs are [x_0,x_1] and the (learnable) weights of the neuron are [w_0,w_1,w_2]. As we will see later, the neuron computes a dot product with the input, and then its activation is softly squashed by the sigmoid function into the range from 0 to 1.
- sigmoid function σ(x) = 1 / (1 + e^(−x)), whose derivative takes the simple form dσ(x)/dx = (1 − σ(x)) σ(x)
The computational graph representation may not be unique; choose one where the local gradients at each node can be easily expressed.
In any real practical application it would be very useful to group these operations into a single gate, since the sigmoid's local gradient is so simple.
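A minimal sketch (assumed, not in the notes) of what grouping the primitive operations into a single sigmoid gate could look like; the class name SigmoidGate is illustrative.

import math

class SigmoidGate:
    """Fused sigmoid gate: one node in the graph instead of several primitive ops."""
    def forward(self, x):
        self.s = 1.0 / (1.0 + math.exp(-x))
        return self.s

    def backward(self, dout):
        # local gradient (1 - s) * s, chained with the upstream gradient
        return (1.0 - self.s) * self.s * dout

gate = SigmoidGate()
print(gate.forward(1.0))   # ~0.731
print(gate.backward(1.0))  # ~0.197 = (1 - 0.731) * 0.731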
- Example
import math

w = [2,-3,-3] # assume some random weights and data
x = [-1, -2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function

# backward pass through the neuron (backpropagation)
ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient derivation
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
# we're done! we have the gradients on the inputs to the circuit
Implementation protip: staged backpropagation.
- It is always helpful to break down the forward pass into stages that are easily backpropped through.
- For example, dot holds the output of the dot product between w and x.
- During the backward pass we then successively compute (in reverse order) the corresponding variables (ddot, and ultimately dw, dx) that hold the gradients of those variables.
6) Backprop in practice: Staged computation
- Example
Consider the expression f(x, y) = (x + σ(y)) / (σ(x) + (x + y)²). Here is how we would structure the forward pass of such an expression:
import math

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator   #(1)
num = x + sigy # numerator                               #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y                                              #(4)
xpysqr = xpy**2                                          #(5)
den = sigx + xpysqr # denominator                        #(6)
invden = 1.0 / den                                       #(7)
f = num * invden # done!                                 #(8)
It contains multiple intermediate variables, each of which is only a simple expression for which we already know the local gradient.
Every single piece of backprop involves computing the local gradient of that expression and chaining it with the gradient flowing into that expression, via a multiplication.
# backprop f = num * invden
dnum = invden # gradient on numerator                             #(8)
dinvden = num                                                     #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden                                #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                                #(6)
dxpysqr = (1) * dden                                              #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                                        #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                                   #(4)
dy = (1) * dxpy                                                   #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! See notes below  #(3)
# backprop num = x + sigy
dx += (1) * dnum                                                  #(2)
dsigy = (1) * dnum                                                #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                                 #(1)
# done! phew
Cache forward pass variables
- To compute the backward pass it is very helpful to have some of the variables that were used in the forward pass.
Gradients add up at forks
- The forward expression involves the variables x, y multiple times, so when we perform backpropagation we must be careful to use += instead of = to accumulate the gradient on these variables (otherwise we would overwrite it).
- This follows the multivariable chain rule: if a variable branches out to different parts of the circuit, then the gradients that flow back to it add, as the sketch below shows.
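A tiny sketch (not from the notes) of why gradients add at a fork: in f(x, y) = x*y + x the variable x is used twice, so df/dx = y + 1 and the two backward contributions must be accumulated with +=:

x, y = 3.0, -4.0

# forward pass: x branches into two uses
a = x * y        # first use of x
f = a + x        # second use of x

# backward pass
da = 1.0         # df/da
dx = 0.0
dx += y * da     # contribution through a = x * y
dx += 1.0        # contribution through f = a + x
dy = x * da

print(dx, dy)    # -3.0 3.0, i.e. df/dx = y + 1, df/dy = x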
Patterns in backward flow
- add gate : gradient distributor (passes the upstream gradient unchanged to all of its inputs)
- max gate : gradient router (routes the full upstream gradient to the largest input, zero to the others)
- mul gate : gradient switcher (each input receives the upstream gradient times the other input's value); see the sketch below
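A minimal sketch (assumed) showing the three patterns on the same upstream gradient dout = 2.0:

dout = 2.0       # upstream gradient arriving at each gate

# add gate: z = x + y  ->  both inputs receive dout unchanged
dx_add, dy_add = 1.0 * dout, 1.0 * dout                               # (2.0, 2.0)

# max gate: z = max(x, y) with x = 4, y = 2  ->  dout routed to the larger input
x, y = 4.0, 2.0
dx_max, dy_max = (dout if x >= y else 0.0), (dout if y > x else 0.0)  # (2.0, 0.0)

# mul gate: z = x * y  ->  gradients are the switched inputs times dout
dx_mul, dy_mul = y * dout, x * dout                                   # (4.0, 8.0)

print(dx_add, dy_add, dx_max, dy_max, dx_mul, dy_mul)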
7) Gradients for vectorized operations
The above sections were concerned with single variables, but all concepts extend in a straight-forward manner to matrix and vector operations.
One must pay closer attention to dimensions and transpose operations.
- Matrix-Matrix multiply gradient
- generalizes all matrix-vector and vector-vector multiply operations:
import numpy as np

# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X)

# now suppose we had the gradient on D from above in the circuit
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T) # .T gives the transpose of the matrix
dX = W.T.dot(dD)
Tip: use dimension analysis. X is of size [10 x 3] and dD of size [5 x 3], so if we want dW (which must have the same shape as W, [5 x 10]), the only way of achieving this is with dD.dot(X.T), as shown above.
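A short sketch (assumed, not from the notes) that checks these expressions numerically: take the scalar loss L = sum(D * G) for a fixed matrix G, so that the gradient flowing into D is exactly G, and compare the analytic gradient dW = G.dot(X.T) against a centered finite difference on one entry of W. The name G and the seed are illustrative.

import numpy as np

np.random.seed(0)
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
G = np.random.randn(5, 3)           # fixed "upstream" gradient, so dL/dD = G

loss = lambda W, X: np.sum(W.dot(X) * G)

# analytic gradients, as derived above
dW = G.dot(X.T)                     # shape [5 x 10], same as W
dX = W.T.dot(G)                     # shape [10 x 3], same as X

# numerical gradient on one entry of W via centered differences
h = 1e-5
Wp, Wm = W.copy(), W.copy()
Wp[2, 3] += h
Wm[2, 3] -= h
num_dW_23 = (loss(Wp, X) - loss(Wm, X)) / (2 * h)

print(dW[2, 3], num_dW_23)          # the two values should match closely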
Summary
Developed intuition for
- What the gradients mean
- How they flow backwards in the circuit
- How they communicate which parts of the circuit should increase or decrease (and how strongly) so as to make the final output value higher
Discussed the importance of staged computation for practical implementations of backpropagation