How Much “Calculus” is Enough for Deep Learning?

Ramil Chaimongkolbutr
Jul 5, 2021 · 6 min read

This article attempts to explain matrix calculus (a little bit of linear algebra and a little bit of multivariate calculus) to a non-technical audience.

Picture from TechTalks.

You might have heard it many times already, and it is true: you do not have to be an expert in math to excel at training deep neural networks, but a bit of calculus does not hurt! It will help you get a firm grip on what happens under the hood of the algorithm. Terence Parr and Jeremy Howard of the University of San Francisco explain all the matrix calculus needed to understand the training of deep neural networks, assuming no math knowledge beyond Calculus 1. This article summarizes and highlights key points from their paper, “The Matrix Calculus You Need For Deep Learning,” in words that non-technical readers can understand.

Some basic rules of scalar derivatives
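For reference, here are a few of the scalar rules that the table summarizes, written in LaTeX notation:

\frac{d}{dx}c = 0, \qquad \frac{d}{dx}cx = c, \qquad \frac{d}{dx}x^n = nx^{n-1}

\frac{d}{dx}(f + g) = \frac{df}{dx} + \frac{dg}{dx}, \qquad \frac{d}{dx}(fg) = f\frac{dg}{dx} + g\frac{df}{dx}, \qquad \frac{d}{dx}f(g(x)) = \frac{df}{dg}\frac{dg}{dx}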

Introduction

What is Deep Learning?

Deep Learning (DL) is a family of algorithms that tries to imitate how our brain processes information. DL attempts to find patterns in data so that it can make decisions.

Why matrix calculus?

This is because the majority of your data will come in the form of multidimensional matrices. A lot of the work you will do in DL involves understanding matrix operations, and the matrix representation simplifies the math greatly. Understanding matrix calculus is crucial if you want to be a serious DL practitioner or build models from scratch.

Vector Calculus and Partial Derivatives

Neural network layers are usually not simple functions of one variable such as f(x); they are typically functions of multiple parameters such as f(x, y). Therefore, their derivatives have to be partial derivatives, meaning we differentiate with respect to one variable while treating the others as constants, and we collect these partial derivatives into a vector. We call this vector the gradient of f(x, y).

For example, the gradient of f(x, y) = 3x²y is:

The gradient of f(x, y) as a horizontal vector
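In symbols, since ∂f/∂x = 6xy and ∂f/∂y = 3x², the gradient is

\nabla f(x, y) = \begin{bmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} 6xy & 3x^2 \end{bmatrix}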

Matrix Calculus

When we have more than one function, their gradient vectors combine into a matrix.

For example, let’s bring in another function, g(x, y) = 2x + y². The gradient of g(x, y) is:

The gradient of g(x, y) as a horizontal vector
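In symbols:

\nabla g(x, y) = \begin{bmatrix} \frac{\partial g}{\partial x} & \frac{\partial g}{\partial y} \end{bmatrix} = \begin{bmatrix} 2 & 2y \end{bmatrix}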

We stack the gradients on top of each other, forming a matrix called the Jacobian matrix. (Stacking gradients as rows is called the numerator layout; its transpose is called the denominator layout. The paper uses the numerator layout.)

Jacobian Matrix of 2 gradients f and g
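Written out, stacking ∇f on top of ∇g gives

J = \begin{bmatrix} \nabla f(x, y) \\ \nabla g(x, y) \end{bmatrix} = \begin{bmatrix} 6xy & 3x^2 \\ 2 & 2y \end{bmatrix}

If you want to check results like this yourself, here is a minimal sketch using SymPy (my own illustration, not code from the paper):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = 3 * x**2 * y      # f(x, y) = 3x^2 y
g = 2 * x + y**2      # g(x, y) = 2x + y^2

# Stack f and g into a vector of functions and take the Jacobian w.r.t. (x, y)
J = sp.Matrix([f, g]).jacobian([x, y])
print(J)              # Matrix([[6*x*y, 3*x**2], [2, 2*y]])
```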

Now we define the Jacobian matrix more generally. Instead of expressing a function in terms of x and y, we substitute x1 for x and x2 for y. We then get y = f(x), where x is a vector of n variables and y is a vector of m functions. The general form of the Jacobian matrix is:

Generalization of the Jacobian
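With m functions f1, …, fm of n variables x1, …, xn, the numerator-layout Jacobian is the m × n matrix whose i-th row is the gradient of fi:

\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix} \nabla f_1(\mathbf{x}) \\ \vdots \\ \nabla f_m(\mathbf{x}) \end{bmatrix} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}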

We can generalize further using the “element-wise binary operation” notation y = f(w) ○ g(x), where the ○ symbol represents any element-wise operator (such as +). Examples that often crop up in deep learning are max(w, x) and w > x (which returns a vector of ones and zeros).

Jacobian with respect to w and x
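The useful result here is that, in the common case where f(w) = w and g(x) = x, the i-th output yi = wi ○ xi depends only on the i-th inputs, so both Jacobians are diagonal:

\frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \operatorname{diag}\left(\frac{\partial (w_1 \circ x_1)}{\partial w_1}, \ldots, \frac{\partial (w_n \circ x_n)}{\partial w_n}\right)

For example, for y = w + x both ∂y/∂w and ∂y/∂x are the identity matrix, and for the element-wise product y = w ⊗ x we get ∂y/∂w = diag(x) and ∂y/∂x = diag(w).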

Vector Sum Reduction

Loss functions in DL often involve summing up the elements of a vector as a key operation. The gradient (a 1 × n Jacobian) of the vector sum is:

Vector summation
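For y = sum(x) = x1 + … + xn, every partial derivative ∂y/∂xi equals 1, so

\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \end{bmatrix} = \vec{1}^{\,T}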

The Chain Rules

Basic matrix calculus rules are sometimes not enough to compute partial derivatives of more complicated functions. For example, we cannot take the derivative of nested expressions like sum(w + x) directly without reducing them to their scalar equivalents.

Chain Rules
Picture from Researchgate.net

Why is the chain rule so important to DL? It is because the way we optimize a neural network relies on gradient descent to minimize a loss such as the sum of squared residuals (SSR). To find optimal weights and biases, gradient descent computes the derivative of the SSR with respect to each weight and bias, updates each weight and bias, and repeats until it reaches the minimum of the SSR or the maximum number of iterations. The chain rule is what lets us find the derivative of the SSR with respect to each weight and bias, since the SSR is a function of them.

Picture from Andrew Ng’s deep learning course
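To make that loop concrete, here is a minimal sketch of gradient descent on a single weight and bias with a squared-error loss. This is my own illustration, not code from the paper or the course; the toy data, learning rate, and iteration count are arbitrary choices for demonstration.

```python
import numpy as np

# Toy data: inputs x and targets t for a 1-D linear model y = w*x + b
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.0, 5.0, 7.0])   # generated from w = 2, b = 1

w, b = 0.0, 0.0          # initial guesses
lr = 0.01                # learning rate (step size)

for step in range(5000):
    y = w * x + b                    # model prediction
    ssr = np.sum((y - t) ** 2)       # sum of squared residuals
    # Chain rule: d(SSR)/dw = sum(2*(y - t) * dy/dw), with dy/dw = x and dy/db = 1
    dw = np.sum(2 * (y - t) * x)
    db = np.sum(2 * (y - t))
    w -= lr * dw                     # step against the gradient
    b -= lr * db

print(w, b, ssr)   # w and b should end up close to 2 and 1
```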

Single-variable total-derivative chain rule

The total derivative assumes all variables are potentially codependent whereas the partial derivative assumes all variables but x are constants.

Single-variable total-derivative chain rule
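In symbols, for f(x, u1, …, un), where each intermediate variable ui may itself depend on x, the rule is

\frac{\partial f(x, u_1, \ldots, u_n)}{\partial x} = \frac{\partial f}{\partial x} + \sum_{i=1}^{n} \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}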

Vector chain rule

The vector chain rule for vectors of functions and a single parameter mirrors the single-variable chain rule.

Let y = f(g(x)), where x is a scalar and f and g are vectors of functions. The derivative of the vector y with respect to the scalar x is a vertical vector whose elements are computed using the single-variable total-derivative chain rule.

The goal is to convert this vector of scalar operations into a single vector operation: the matrix on the right-hand side can also be written as the product of two Jacobians.

That means the Jacobian of the composition is the product of two other Jacobians. To make this formula work for multiple parameters, i.e., a vector x, we simply replace the scalar x with the vector x in the equation. The effect is that ∂g/∂x and the resulting Jacobian ∂f/∂x are now matrices instead of vertical vectors. Our complete vector chain rule is:

Vector chain rule
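In symbols:

\frac{\partial}{\partial \mathbf{x}}\, \mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\,\frac{\partial \mathbf{g}}{\partial \mathbf{x}}

i.e., the Jacobian of a composition is the matrix product of the Jacobian of the outer function and the Jacobian of the inner function.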

The complete Jacobian components can be expressed as:

Jacobian components
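Element by element, entry (i, j) of that product is

\left[\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right]_{ij} = \sum_{k} \frac{\partial f_i}{\partial g_k}\,\frac{\partial g_k}{\partial x_j}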

We now have all of the pieces needed to compute the derivative of a typical neuron activation for a single neural network computation unit with respect to the model parameters, w and b:
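In the paper, that computation unit is the familiar affine transformation followed by a ReLU:

activation(\mathbf{x}) = \max\big(0,\; \mathbf{w} \cdot \mathbf{x} + b\big)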

Conclusion

I hope you now have some idea of what matrix calculus is. How it can be used to calculate the derivative of a typical neuron activation will be discussed in the next article. For more information or a more detailed discussion, please see the original paper referenced below.

Reference

Terence Parr and Jeremy Howard, “The Matrix Calculus You Need For Deep Learning,” arXiv preprint arXiv:1802.01528, revised 2 Jul 2018.
