[CS231n] 4. Backpropagation
Machine Learning/CS231n · 2022. 4. 27. 16:13

    Keywords : chain rule interpretation, real-valued circuits, patterns in gradient flow

     

    1) Introduction

    Intuitive understanding of backpropagation

    • A way of computing gradients of expressions through recursive application of chain rule.
    • Understanding this process and its subtleties is critical for effectively developing, designing, and debugging neural networks.

    • Think of the training data as given and fixed, and of the weights as variables we have control over.
      → We usually only compute the gradient for the parameters (e.g. W, b) so that we can use it to perform a parameter update.
    • The gradient on x_i can still be useful sometimes, for example for purposes of visualization and interpreting what the Neural Network might be doing.

     


     

    2) Simple expressions & interpretation of the gradient

    • Multiplication of two numbers, f(x, y) = xy, with the partial derivative for either input:

    f(x, y) = xy  →  ∂f/∂x = y,  ∂f/∂y = x

     

    • The derivative indicates the rate of change of a function with respect to a variable in an infinitesimally small region around a particular point:

    df(x)/dx = lim_{h→0} (f(x + h) - f(x)) / h

     

    • When h is very small, the function is well approximated by a straight line, and the derivative is its slope (the sketch below checks this numerically).
      → The derivative on each variable tells the sensitivity of the whole expression to its value.
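
    The slope interpretation can be checked directly with a finite difference. This is a small added sketch (not part of the original notes); the input values and the step size h are arbitrary choices:

    def f(x, y):
        return x * y

    x, y = 4.0, -3.0
    h = 1e-5  # small step size (arbitrary choice for this sketch)

    # finite-difference approximations of the partial derivatives
    numerical_dfdx = (f(x + h, y) - f(x, y)) / h  # ~ -3.0, matches df/dx = y
    numerical_dfdy = (f(x, y + h) - f(x, y)) / h  # ~  4.0, matches df/dy = x
    print(numerical_dfdx, numerical_dfdy)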

     

    • Derivatives for the addition operation f(x, y) = x + y: ∂f/∂x = 1 and ∂f/∂y = 1, regardless of the values of x and y.

     

    • The max operation f(x, y) = max(x, y)

    The (sub)gradient is 1 on the input that was larger and 0 on the other input.

     

    • Example
      - If the inputs are x=4, y=2, then the max is 4, and the function is not sensitive to the setting of y.
      - If we were to increase y by a tiny amount h, the function would keep outputting 4, and therefore the gradient on y is zero: changing y has no effect.

     


     

    3) Compound expressions with chain rule

    Consider complicated expressions that involve multiple composed functions, such as f(x, y, z) = (x + y)z.

    • chain rule
       - Correct way to “chain” gradient expressions together is through multiplication.
       - In practice this is simply a multiplication of the two numbers that hold the two gradients.

    ∂f/∂x = (∂f/∂q) · (∂q/∂x),  where q = x + y

    • Example
    # set some inputs
    x = -2 
    y = 5 
    z = -4
    
    # perform the forward pass
    q = x + y # q becomes 3
    f = q * z # f becomes -12
    
    # perform the backward pass (backpropagation) in reverse order:
    # first backprop through f = q * z
    dfdz = q # df/dz = q, so gradient on z becomes 3
    dfdq = z # df/dq = z, so gradient on q becomes -4
    # now backprop through q = x + y
    dfdx = 1.0 * dfdq # dq/dx = 1. And the multiplication here is the chain rule!
    dfdy = 1.0 * dfdq # dq/dy = 1

    The variables [dfdx, dfdy, dfdz] tell us the sensitivity of f to the variables x, y, z.

    • This computation can also be nicely visualized with a circuit diagram:

    (Figure: real-valued circuit diagram for f(x, y, z) = (x + y)z)

    The real-valued "circuit" shows the visual representation of the computation. The forward pass computes values from inputs to output (shown in green). The backward pass then performs backpropagation which starts at the end and recursively applies the chain rule to compute the gradients (shown in red) all the way to the inputs of the circuit. The gradients can be thought of as flowing backwards through the circuit.

     


     

    4) Intuitive understanding of backpropagation

    • Every gate in a circuit diagram gets some inputs and can right away compute two things:
      1. Its output value
      2. The local gradient of its output with respect to its inputs. The gates can do this completely independently, without being aware of any details of the full circuit.
        → During backpropagation the gate eventually learns the gradient of its output value on the final output of the circuit; the chain rule says the gate should take that gradient and multiply it into every local gradient it normally computes for its inputs.
    📌  Backpropagation can be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher.
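
    To make the "independent gate" picture concrete, here is a minimal sketch (not from the original notes; the class name and the forward/backward interface are illustrative) of a multiply gate that computes its output and applies the chain rule while knowing nothing about the rest of the circuit:

    class MultiplyGate:
        def forward(self, x, y):
            # cache the inputs; they are needed to form the local gradients
            self.x, self.y = x, y
            return x * y

        def backward(self, dout):
            # chain rule: local gradient times the gradient flowing in from above
            dx = self.y * dout  # d(xy)/dx = y
            dy = self.x * dout  # d(xy)/dy = x
            return dx, dy

    # usage: the gate only ever sees its own inputs and the incoming gradient
    gate = MultiplyGate()
    out = gate.forward(-2.0, 3.0)  # out = -6.0
    dx, dy = gate.backward(1.0)    # dx = 3.0, dy = -2.0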

     


     

    5) Modularity : Sigmoid example

    • The expression f(w, x) = 1 / (1 + e^(-(w_0 x_0 + w_1 x_1 + w_2))) describes a 2-dimensional neuron (with inputs x and weights w) that uses the sigmoid activation function.

     

    • Example circuit for a 2D neuron with a sigmoid activation function.

    The inputs are [x_0,x_1] and the (learnable) weights of the neuron are [w_0,w_1,w_2]. As we will see later, the neuron computes a dot product with the input and then its activation is softly squashed by the sigmoid function to be in range from 0 to 1.

     

    • sigmoid function σ(x) = 1 / (1 + e^(-x))

      - The computational graph representation of an expression is not unique; choose one in which the local gradients at each node can be easily expressed.
      - In any real practical application it is very useful to group these operations into a single sigmoid gate.
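
    The grouping is convenient because the sigmoid's local gradient simplifies to an expression in its own output. Differentiating σ(x) = 1 / (1 + e^(-x)) directly:

    dσ(x)/dx = e^(-x) / (1 + e^(-x))^2
             = ((1 + e^(-x) - 1) / (1 + e^(-x))) · (1 / (1 + e^(-x)))
             = (1 - σ(x)) σ(x)

    This is why the backward pass in the code below only needs the forward-pass output f: ddot = (1 - f) * f.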

     

    • Example
    import math

    w = [2,-3,-3] # assume some random weights and data
    x = [-1, -2]
    
    # forward pass
    dot = w[0]*x[0] + w[1]*x[1] + w[2]
    f = 1.0 / (1 + math.exp(-dot)) # sigmoid function
    
    # backward pass through the neuron (backpropagation)
    ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient derivation
    dx = [w[0] * ddot, w[1] * ddot] # backprop into x
    dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
    # we're done! we have the gradients on the inputs to the circuit

    Implementation protip: staged backpropagation.

    • It is always helpful to break down the forward pass into stages that are easily backpropped through.
    • dot holds the output of the dot product between w and x.
    • During the backward pass, successively compute (in reverse order) the corresponding variables (ddot, and ultimately dw, dx) that hold the gradients of those variables.

     


     

    6) Backprop in practice: Staged computation

    • Example: f(x, y) = (x + σ(y)) / (σ(x) + (x + y)^2)

    Here is how we would structure the forward pass of this expression:

    import math

    x = 3 # example values
    y = -4
    
    # forward pass
    sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator   #(1)
    num = x + sigy # numerator                               #(2)
    sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
    xpy = x + y                                              #(4)
    xpysqr = xpy**2                                          #(5)
    den = sigx + xpysqr # denominator                        #(6)
    invden = 1.0 / den                                       #(7)
    f = num * invden # done!                                 #(8)

    It contains multiple intermediate variables, each of which is a simple expression for which we already know the local gradient.

    Every piece of the backward pass then involves computing the local gradient of that expression and chaining it with the gradient flowing into that expression, via multiplication.

    # backprop f = num * invden
    dnum = invden # gradient on numerator                             #(8)
    dinvden = num                                                     #(8)
    # backprop invden = 1.0 / den 
    dden = (-1.0 / (den**2)) * dinvden                                #(7)
    # backprop den = sigx + xpysqr
    dsigx = (1) * dden                                                #(6)
    dxpysqr = (1) * dden                                              #(6)
    # backprop xpysqr = xpy**2
    dxpy = (2 * xpy) * dxpysqr                                        #(5)
    # backprop xpy = x + y
    dx = (1) * dxpy                                                   #(4)
    dy = (1) * dxpy                                                   #(4)
    # backprop sigx = 1.0 / (1 + math.exp(-x))
    dx += ((1 - sigx) * sigx) * dsigx # Notice += !! See notes below  #(3)
    # backprop num = x + sigy
    dx += (1) * dnum                                                  #(2)
    dsigy = (1) * dnum                                                #(2)
    # backprop sigy = 1.0 / (1 + math.exp(-y))
    dy += ((1 - sigy) * sigy) * dsigy                                 #(1)
    # done! phew
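
    As a sanity check (an added sketch, not part of the original notes), the hand-derived dx and dy can be compared against centered finite differences. The helper f_full and the step size h are names/values introduced here for illustration:

    # recompute f from scratch for the numerical check
    def f_full(x, y):
        sigy = 1.0 / (1 + math.exp(-y))
        sigx = 1.0 / (1 + math.exp(-x))
        return (x + sigy) / (sigx + (x + y)**2)

    h = 1e-5
    num_dx = (f_full(3 + h, -4) - f_full(3 - h, -4)) / (2 * h)
    num_dy = (f_full(3, -4 + h) - f_full(3, -4 - h)) / (2 * h)
    print(num_dx, dx)  # the two values should closely match
    print(num_dy, dy)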

    Cache forward pass variables

    • To compute the backward pass it is very helpful to have some of the variables that were used in the forward pass.

    Gradients add up at forks

    • The forward expression involves the variables x and y multiple times, so when we perform backpropagation we must be careful to use += instead of =, to accumulate the gradient on these variables rather than overwrite it.
    • Multivariable chain rule: if a variable branches out to different parts of the circuit, the gradients that flow back to it add up (see the small sketch below).
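
    A minimal added illustration of this rule (values chosen arbitrarily): if x feeds two branches, the gradients from both branches are summed.

    # f(x) = x * x, i.e. x is used twice (a "fork")
    x = 3.0
    a = x        # branch 1
    b = x        # branch 2
    f = a * b    # 9.0

    # backward pass: each branch contributes a gradient, and they add up
    dfda = b     # 3.0
    dfdb = a     # 3.0
    dfdx = 0.0
    dfdx += dfda # contribution from branch 1
    dfdx += dfdb # contribution from branch 2
    # dfdx == 6.0, which matches d(x^2)/dx = 2x at x = 3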

    Patterns in backward flow

    1. add gate : gradient distributor (passes the incoming gradient unchanged to all of its inputs)
    2. max gate : gradient router (routes the full incoming gradient to the input that was largest during the forward pass)
    3. mul gate : gradient switcher (the local gradient on each input is the value of the other input, so the incoming gradient is scaled by the swapped inputs)

    The short sketch below shows all three patterns with concrete numbers.
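
    This is an added illustration (values arbitrary); dout is the gradient assumed to flow in from above:

    dout = 2.0  # gradient flowing in from above

    # add gate: distributes the incoming gradient equally to all inputs
    x, y = 3.0, -1.0                 # f = x + y
    dx, dy = 1.0 * dout, 1.0 * dout  # both inputs receive 2.0

    # max gate: routes the full gradient to the larger input only
    x, y = 4.0, 2.0                  # f = max(x, y) = 4.0
    dx = dout if x >= y else 0.0     # 2.0
    dy = dout if y > x else 0.0      # 0.0

    # mul gate: local gradients are the swapped input values
    x, y = 3.0, -4.0                 # f = x * y
    dx, dy = y * dout, x * dout      # dx = -8.0, dy = 6.0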

     


     

    7) Gradients for vectorized operations

    The above sections were concerned with single variables, but all concepts extend in a straightforward manner to matrix and vector operations.

    One must pay closer attention to dimensions and transpose operations.

    • Matrix-matrix multiply gradient (it generalizes all matrix-vector and vector-vector multiply operations):
    import numpy as np

    # forward pass
    W = np.random.randn(5, 10)
    X = np.random.randn(10, 3)
    D = W.dot(X)
    
    # now suppose we had the gradient on D from above in the circuit
    dD = np.random.randn(*D.shape) # same shape as D
    dW = dD.dot(X.T) #.T gives the transpose of the matrix
    dX = W.T.dot(dD)

    Tip: use dimension analysis. X is of size [10 x 3] and dD is of size [5 x 3], so if we want dW (which must have the same shape as W, i.e. [5 x 10]), the only way to achieve this is with dD.dot(X.T), as shown above. The sketch below makes the same check explicit.
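
    A quick added check (a sketch, not from the original notes): the gradient on any array must have the same shape as that array, which can be asserted directly after the code above.

    # the gradient w.r.t. any array always has the same shape as that array
    assert dW.shape == W.shape  # (5, 10)
    assert dX.shape == X.shape  # (10, 3)

    # dimension analysis that forces the right expressions:
    #   dD is (5, 3) and X.T is (3, 10)  ->  dD.dot(X.T) is (5, 10) == W.shape
    #   W.T is (10, 5) and dD is (5, 3)  ->  W.T.dot(dD)  is (10, 3) == X.shape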

     


     

    Summary

    Developed intuition for

    • What the gradients mean
    • How they flow backwards in the circuit
    • How they communicate through the circuit which parts should increase or decrease (and how strongly) to make the final output higher

    Discussed the importance of staged computation for practical implementations of backpropagation
