Introduction to Matrix Calculus

1. Gradient & Jacobian

$def.$ Gradient

Instead of having partials just floating around and not organized in any way, we organize them into a horizontal vector. We call this vector the gradient of $f(x, y)$ and write it as:

$$\nabla f(x, y)=\left[\frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y}\right] $$
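For instance, if $f(x, y) = 3x^2y$, then $\nabla f(x,y) = \left[6xy,\ 3x^2\right]$. As a quick sanity check, here is a minimal sketch, assuming NumPy (the function choice and the finite-difference helper are just illustrations, not anything from a standard API), comparing the analytic gradient with a central-difference approximation:

```python
import numpy as np

def f(x, y):
    return 3 * x**2 * y                      # sample scalar function f(x, y) = 3x^2 y

def grad_f(x, y):
    return np.array([6 * x * y, 3 * x**2])   # analytic gradient [df/dx, df/dy]

def numerical_grad(f, x, y, h=1e-6):
    # central differences, one per parameter we can "wiggle"
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

x, y = 2.0, 3.0
print(grad_f(x, y))                          # [36. 12.]
print(numerical_grad(f, x, y))               # approximately the same
```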

$def.$ Jacobian

Gradient vectors organize all of the partial derivatives for a specific scalar function. If we have two functions $f(x,y)$ and $g(x,y)$, we can also organize their gradients into a matrix by stacking the gradients. When we do so, we get the Jacobian matrix (or just the Jacobian) where the gradients are rows:

$$J=\left[\begin{array}{l}\nabla f(x, y) \\ \nabla g(x, y)\end{array}\right]=\left[\begin{array}{ll}\frac{\partial f(x, y)}{\partial x} & \frac{\partial f(x, y)}{\partial y} \\ \frac{\partial g(x, y)}{\partial x} & \frac{\partial g(x, y)}{\partial y}\end{array}\right] $$

Note that there are multiple ways to represent the Jacobian. We are using the so-called numerator layout, but many papers and software packages use the denominator layout, which is simply the transpose:

$$J=\left[\begin{array}{l}\nabla f(x, y) \\ \nabla g(x, y)\end{array}\right]^{\mathsf T} $$
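To see the two layouts side by side, here is a short sketch, assuming NumPy (the particular $f$ and $g$ are arbitrary examples), that stacks finite-difference gradients as rows for the numerator layout and transposes the result for the denominator layout:

```python
import numpy as np

# two sample scalar functions of (x, y)
f = lambda x, y: 3 * x**2 * y
g = lambda x, y: x + 2 * y**3

def gradient(h, x, y, eps=1e-6):
    # central-difference gradient [dh/dx, dh/dy] of a scalar function h
    return np.array([(h(x + eps, y) - h(x - eps, y)) / (2 * eps),
                     (h(x, y + eps) - h(x, y - eps)) / (2 * eps)])

x, y = 2.0, 3.0
J_numerator = np.vstack([gradient(f, x, y), gradient(g, x, y)])  # gradients as rows
J_denominator = J_numerator.T                                    # gradients as columns
print(J_numerator)      # [[36. 12.]
                        #  [ 1. 54.]]
print(J_denominator)
```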

2. Generalization of the Jacobian

Assume that $\mathbf{x}=\left[x_{1}, x_{2},\ldots,x_{n}\right]^{\mathsf T}$ and $\mathbf{y}=\left[y_{1}, y_{2},\ldots,y_{m}\right]^{\mathsf T}$. With multiple scalar-valued functions, we can combine them all into a vector just like we did with the parameters. Let $\mathbf y = \mathbf f(\mathbf x)$ be a vector of $m$ scalar-valued functions that each take a vector $\mathbf x$ of length $n = |\mathbf x|$, where $|\mathbf x|$ is the cardinality (count) of elements in $\mathbf x$. Each $f_i$ function within $\mathbf f$ returns a scalar:

$$y_i=f_i(\mathbf x)\quad(i=1,2,\ldots,m) $$

It’s very often the case that $m = n$ because we will have a scalar function result for each element of the x vector. For example, consider the identity function $\mathbf y = f(\mathbf x) = \mathbf x$:

$$y_i=f_i(\mathbf x)=x_i $$

Generally speaking, though, the Jacobian matrix is the collection of all $m × n$ possible partial derivatives ($m$ rows and $n$ columns), which is the stack of $m$ gradients with respect to $\mathbf x$:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{c}\nabla f_{1}(\mathbf{x}) \\ \nabla f_{2}(\mathbf{x}) \\ \vdots \\ \nabla f_{m}(\mathbf{x})\end{array}\right]=\left[\begin{array}{c}\frac{\partial}{\partial \mathbf{x}} f_{1}(\mathbf{x}) \\ \frac{\partial}{\partial \mathbf{x}} f_{2}(\mathbf{x}) \\ \vdots \\ \frac{\partial}{\partial \mathbf{x}} f_{m}(\mathbf{x})\end{array}\right]=\left[\begin{array}{cccc}\frac{\partial}{\partial x_{1}} f_{1}(\mathbf{x}) & \frac{\partial}{\partial x_{2}} f_{1}(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_{n}} f_{1}(\mathbf{x}) \\ \frac{\partial}{\partial x_{1}} f_{2}(\mathbf{x}) & \frac{\partial}{\partial x_{2}} f_{2}(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_{n}} f_{2}(\mathbf{x}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_{1}} f_{m}(\mathbf{x}) & \frac{\partial}{\partial x_{2}} f_{m}(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_{n}} f_{m}(\mathbf{x})\end{array}\right] $$

Each $\frac{\partial}{\partial \mathbf x} f_i(\mathbf x)$ is a horizontal $n$-vector because the partial derivative is with respect to a vector, $\mathbf x$, whose length is $n = |\mathbf x|$. The width of the Jacobian is $n$ if we’re taking the partial derivative with respect to $\mathbf x$ because there are $n$ parameters we can wiggle, each potentially changing the function’s value. Therefore, the Jacobian is always $m$ rows for $m$ equations.
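As a sanity check on this layout, here is a minimal sketch, assuming NumPy (`numerical_jacobian` is our own illustrative helper, not a library function), that builds the $m \times n$ Jacobian by finite differences, one column per wiggled input, and applies it to the identity function from above:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    # build the m x n Jacobian one column at a time by wiggling each input x_j
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

x = np.array([1.0, 2.0, 3.0])
print(numerical_jacobian(lambda v: v, x))       # identity function -> ~3x3 identity matrix
print(numerical_jacobian(lambda v: v**2, x))    # y_i = x_i^2 -> ~diag([2, 4, 6])
```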

This is a generally useful trick: Reduce vector expressions down to a set of scalar expressions and then take all of the partials, combining the results appropriately into vectors and matrices at the end.

3. Derivatives of vector element-wise binary operators

We can generalize the element-wise binary operations with notation $\mathbf y = f(\mathbf w) \bigcirc g(\mathbf x)$ where $n = |\mathbf y| = |\mathbf w| = |\mathbf x|$. The symbol $\bigcirc$ represents any element-wise operator (such as +) and not the $\circ$ function composition operator. Here’s what equation $\mathbf y = f(\mathbf w) \bigcirc g(\mathbf x)$ looks like when we zoom in to examine the scalar equations:

$$\mathbf y=\left[\begin{array}{c}y_{1} \\ y_{2} \\ \vdots \\ y_{n}\end{array}\right]=\left[\begin{array}{c}f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x}) \\ f_{2}(\mathbf{w}) \bigcirc g_{2}(\mathbf{x}) \\ \vdots \\ f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\end{array}\right] $$

Using the ideas from the last section, we can see that the general case for the Jacobian with respect to $\mathbf w$ is the square matrix:

$$J_{\mathbf{w}}=\frac{\partial \mathbf{y}}{\partial \mathbf{w}}=\left[\begin{array}{cccc}\frac{\partial}{\partial w_{1}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) & \frac{\partial}{\partial w_{2}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial w_{n}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) \\ \frac{\partial}{\partial w_{1}}\left(f_{2}(\mathbf{w}) \bigcirc g_{2}(\mathbf{x})\right) & \frac{\partial}{\partial w_{2}}\left(f_{2}(\mathbf{w}) \bigcirc g_{2}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial w_{n}}\left(f_{2}(\mathbf{w}) \bigcirc g_{2}(\mathbf{x})\right) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial}{\partial w_{1}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right) & \frac{\partial}{\partial w_{2}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial w_{n}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right)\end{array}\right] $$

and the Jacobian with respect to $\mathbf x$ is:

$$J_{\mathbf{x}}=\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{cccc}\frac{\partial}{\partial x_{1}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) & \frac{\partial}{\partial x_{2}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial x_{n}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) \\ \frac{\partial}{\partial x_{1}}\left(f_{2}(\mathbf{w}) \bigcirc g_{2}(\mathbf{x})\right) & \frac{\partial}{\partial x_{2}}\left(f_{2}(\mathbf{w}) \bigcirc g_{2}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial x_{n}}\left(f_{2}(\mathbf{w}) \bigcirc g_{2}(\mathbf{x})\right) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_{1}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right) & \frac{\partial}{\partial x_{2}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial x_{n}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right)\end{array}\right] $$
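Here is a small sketch, assuming NumPy (element-wise addition is chosen purely as an example operator, and the finite-difference helper is our own), confirming that both Jacobians are square $n \times n$ matrices; for addition they reduce to the identity:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    # finite-difference Jacobian of f at x (one column per input element)
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])

# element-wise addition y = w + x: both Jacobians are square (here, the identity)
print(numerical_jacobian(lambda w_: w_ + x, w))   # J_w ~ I
print(numerical_jacobian(lambda x_: w + x_, x))   # J_x ~ I
```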

$def.$ Diagonal Jacobian

In a diagonal Jacobian, all elements off the diagonal are zero: $\frac{\partial}{\partial w_{j}}\left(f_{i}(\mathbf{w}) \bigcirc g_{i}(\mathbf{x})\right)=0$ for $j \neq i$. Under what conditions are those off-diagonal elements zero? Precisely when $f_i$ and $g_i$ are constants with respect to $w_j$, that is, when $\frac{\partial}{\partial w_{j}}f_{i}(\mathbf{w}) =\frac{\partial }{\partial x_j}g_{i}(\mathbf{x}) = 0$. Regardless of the operator, if those partial derivatives go to zero, the operation goes to zero, $0\bigcirc 0 = 0$ no matter what, and the partial derivative of a constant is zero.

Those partials go to zero when $f_i$ and $g_i$ are not functions of $w_j$ . We know that element-wise operations imply that $f_i$ is purely a function of $w_i$ and $g_i$ is purely a function of $x_i$ . Consequently, $f_i(\mathbf w) \bigcirc g_i(\mathbf x)$ reduces to $f_i(w_i)\bigcirc g_i(x_i)$ and the goal becomes

$$\frac{\partial}{\partial w_{j}}f_{i}(w_i) =\frac{\partial }{\partial x_j}g_{i}(x_i) = 0\quad(i\not =j) $$

$f_i(w_i)$ and $g_i(x_i)$ look like constants to the partial differentiation operator with respect to $w_j$ when $j \not = i$ so the partials are zero off the diagonal.

Notation $f_i(w_i)$ is technically an abuse of our notation because $f_i$ and $g_i$ are functions of vectors not individual elements. We should really write something like $\hat f_i(w_i) = f_i(\mathbf w)$, but that would muddy the equations further, and programmers are comfortable overloading functions, so we’ll proceed with the notation anyway.

$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}}=\left[\begin{array}{cccc}\frac{\partial}{\partial w_{1}}\left(f_{1}\left(w_{1}\right) \bigcirc g_{1}\left(x_{1}\right)\right) & & & \huge 0 \\ & \frac{\partial}{\partial w_{2}}\left(f_{2}\left(w_{2}\right) \bigcirc g_{2}\left(x_{2}\right)\right) & & \\ & & \ddots & \\ \huge 0 & & & \frac{\partial}{\partial w_{n}}\left(f_{n}\left(w_{n}\right) \bigcirc g_{n}\left(x_{n}\right)\right)\end{array}\right] $$

(The large 0s are a shorthand indicating that all of the off-diagonal elements are 0.)

More succinctly, we can write:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}}=\operatorname{diag}\left(\frac{\partial}{\partial w_{1}}\left(f_{1}\left(w_{1}\right) \bigcirc g_{1}\left(x_{1}\right)\right), \frac{\partial}{\partial w_{2}}\left(f_{2}\left(w_{2}\right) \bigcirc g_{2}\left(x_{2}\right)\right), \ldots, \frac{\partial}{\partial w_{n}}\left(f_{n}\left(w_{n}\right) \bigcirc g_{n}\left(x_{n}\right)\right)\right) $$

and

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\operatorname{diag}\left(\frac{\partial}{\partial x_{1}}\left(f_{1}\left(w_{1}\right) \bigcirc g_{1}\left(x_{1}\right)\right), \frac{\partial}{\partial x_{2}}\left(f_{2}\left(w_{2}\right) \bigcirc g_{2}\left(x_{2}\right)\right), \ldots, \frac{\partial}{\partial x_{n}}\left(f_{n}\left(w_{n}\right) \bigcirc g_{n}\left(x_{n}\right)\right)\right) $$
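For a concrete case, take element-wise multiplication, $\mathbf y = \mathbf w \otimes \mathbf x$, where the diagonals work out to $\frac{\partial \mathbf y}{\partial \mathbf w} = \operatorname{diag}(\mathbf x)$ and $\frac{\partial \mathbf y}{\partial \mathbf x} = \operatorname{diag}(\mathbf w)$. A small sketch, assuming NumPy (the vectors are arbitrary), writes those down and spot-checks one entry by finite differences:

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])

# y = w (element-wise times) x:  dy/dw = diag(x)  and  dy/dx = diag(w)
J_w = np.diag(x)
J_x = np.diag(w)

# spot-check one diagonal entry of J_w by finite differences
eps = 1e-6
dw = np.array([0.0, eps, 0.0])
print(((w + dw) * x - (w - dw) * x)[1] / (2 * eps))   # ~ x[1] = 5.0
print(J_w)
print(J_x)
```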

4. Derivatives involving scalar expansion

When we multiply or add scalars to vectors, we’re implicitly expanding the scalar to a vector and then performing an element-wise binary operation. For example, adding scalar $z$ to vector $\mathbf x$, $\mathbf y = \mathbf x + z$, is really $\mathbf y = \mathbf f(\mathbf x) + \mathbf g(z)$ where $\mathbf f(\mathbf x) = \mathbf x$ and $\mathbf g(z) = \mathbf 1z$. (The notation $\mathbf 1$ represents a vector of ones of appropriate length.) Here $z$ is any scalar that doesn’t depend on $\mathbf x$, which is useful because then $\frac{\partial z}{\partial x_i}=0$ for any $x_i$, and that will simplify our partial derivative computations. (It’s okay to think of variable $z$ as a constant for our discussion here.) Similarly, multiplying by a scalar, $\mathbf y = \mathbf xz$, is really $\mathbf y = \mathbf f(\mathbf x) \otimes \mathbf g(z) = \mathbf x \otimes \mathbf 1z$ where $\otimes$ is the element-wise multiplication (Hadamard product) of the two vectors.

The partial derivatives of vector-scalar addition and multiplication with respect to vector $\mathbf x$ use our element-wise rule:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\operatorname{diag}\left(\ldots \frac{\partial}{\partial x_{i}}\left(f_{i}\left(x_{i}\right) \bigcirc g_{i}(z)\right) \ldots\right) $$
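As a quick check of the scalar-expansion case, here is a sketch, assuming NumPy (values arbitrary), for $\mathbf y = \mathbf x z$, whose Jacobian with respect to $\mathbf x$ is $\operatorname{diag}(z, \ldots, z) = zI$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = 5.0

# y = x * z is element-wise multiplication by the expanded vector 1*z,
# so dy/dx = diag(z, z, ..., z) = z * I
J_x = z * np.eye(x.size)

# spot-check the first diagonal entry by finite differences
eps = 1e-6
dx = np.array([eps, 0.0, 0.0])
print(((x + dx) * z - (x - dx) * z)[0] / (2 * eps))   # ~ z = 5.0
print(J_x)
```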

5. Vector Chain Rule

Assuming you’ve got a good handle on the total-derivative chain rule, we’re ready to tackle the chain rule for vectors of functions and vector variables. We can start by computing the derivative of a vector function with respect to a scalar. Let $\mathbf y = \mathbf f(\mathbf z)$ and $\mathbf z=\mathbf g(x)$, so that $\mathbf y=(\mathbf f\circ\mathbf g)(x)$:

$$\frac{\partial \mathbf f}{\partial x}:=\frac{\partial \mathbf f}{\partial \mathbf g}\frac{\partial \mathbf g}{\partial x}=\left[ \begin{array}{cccc} \frac{\partial f_{1}}{\partial g_{1}} & \frac{\partial f_{1}}{\partial g_{2}} & \ldots & \frac{\partial f_{1}}{\partial g_{k}} \\ \frac{\partial f_{2}}{\partial g_{1}} & \frac{\partial f_{2}}{\partial g_{2}} & \ldots & \frac{\partial f_{2}}{\partial g_{k}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_{m}}{\partial g_{1}} & \frac{\partial f_{m}}{\partial g_{2}} & \ldots & \frac{\partial f_{m}}{\partial g_{k}} \end{array} \right] \begin{bmatrix} \frac{\partial g_{1}}{\partial x} \\ \frac{\partial g_{2}}{\partial x}\\ \vdots\\ \frac{\partial g_{k}}{\partial x} \end{bmatrix} $$

This vector chain rule for vectors of functions and a single parameter appears to be correct and, indeed, mirrors the single-variable chain rule.

The beauty of the vector formula over the single-variable chain rule is that it automatically takes into consideration the total derivative while maintaining the same notational simplicity. The Jacobian contains all possible combinations of $f_i$ with respect to $g_j$ and $g_i$ with respect to $x_j$ . For completeness, here are the two Jacobian components in their full glory:

$$\frac{\partial}{\partial \mathbf{x}} \mathbf{f}\circ\mathbf{g}(\mathbf{x})= \left[\begin{array}{cccc}\frac{\partial f_{1}}{\partial g_{1}} & \frac{\partial f_{1}}{\partial g_{2}} & \ldots & \frac{\partial f_{1}}{\partial g_{k}} \\ \frac{\partial f_{2}}{\partial g_{1}} & \frac{\partial f_{2}}{\partial g_{2}} & \ldots & \frac{\partial f_{2}}{\partial g_{k}} \\ \vdots & \vdots &\ddots & \vdots \\ \frac{\partial f_{m}}{\partial g_{1}} & \frac{\partial f_{m}}{\partial g_{2}} & \ldots & \frac{\partial f_{m}}{\partial g_{k}}\end{array}\right] \left[\begin{array}{cccc} \frac{\partial g_{1}}{\partial x_{1}} & \frac{\partial g_{1}}{\partial x_{2}} & \ldots & \frac{\partial g_{1}}{\partial x_{n}} \\ \frac{\partial g_{2}}{\partial x_{1}} & \frac{\partial g_{2}}{\partial x_{2}} & \ldots & \frac{\partial g_{2}}{\partial x_{n}} \\ \vdots & \vdots &\ddots & \vdots \\ \frac{\partial g_{k}}{\partial x_{1}} & \frac{\partial g_{k}}{\partial x_{2}} & \cdots & \frac{\partial g_{k}}{\partial x_{n}}\end{array}\right] $$
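Here is a short sketch, assuming NumPy (the particular $\mathbf f$ and $\mathbf g$ are arbitrary polynomial examples and `numerical_jacobian` is our own helper), that multiplies the two analytic Jacobians and checks the product against a finite-difference Jacobian of the composition:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    # finite-difference Jacobian (one column per wiggled input)
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

g = lambda x: np.array([x[0] + x[1], x[0] * x[1]])   # g: R^2 -> R^2
f = lambda z: np.array([z[0]**2, 3 * z[1]])          # f: R^2 -> R^2

x = np.array([2.0, 3.0])
z = g(x)

J_f = np.array([[2 * z[0], 0.0],     # analytic Jacobian of f, evaluated at g(x)
                [0.0,      3.0]])
J_g = np.array([[1.0, 1.0],          # analytic Jacobian of g, evaluated at x
                [x[1], x[0]]])

print(J_f @ J_g)                                  # chain rule: product of Jacobians
print(numerical_jacobian(lambda v: f(g(v)), x))   # matches, up to finite-difference error
```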

6. The gradient of neuron activation

We now have all of the pieces needed to compute the derivative of a typical neuron activation for a single neural network computation unit with respect to the model parameters, $\mathbf w$ and $b$:

$$\operatorname{activation}(\mathbf{x})=\max (0, \mathbf{w}{^\mathsf T} \mathbf{x}+b) $$

Let’s worry about max later and focus on computing $\frac{\partial }{\partial \mathbf w}(\mathbf{w}{^\mathsf T} \mathbf{x}+b)$ and $\frac{\partial }{\partial b}(\mathbf{w}{^\mathsf T} \mathbf{x}+b)$. (Recall that neural networks learn through optimization of their weights and biases.) We haven’t discussed the derivative of the dot product yet, $y = \mathbf f^{\mathsf T}(\mathbf w)\cdot \mathbf g(\mathbf x)$, but we can use the chain rule to avoid having to memorize yet another rule. (Note the notation $y$ not $\mathbf y$, as the result is a scalar, not a vector.)

In fact, the dot product $\mathbf{w}{^\mathsf T}\mathbf{x}$ is just the summation of the element-wise multiplication of the elements:

$$\mathbf{w}{^\mathsf T}\mathbf{x}=\sum_{i=1}^{n}w_ix_i $$

We know how to compute the partial derivatives of $\displaystyle \sum_{i=1}^{n}x_i$ and $\mathbf w\otimes\mathbf x$ but haven’t looked at partial derivatives for $\displaystyle \sum_{i=1}^{n}(\mathbf{w}\otimes\mathbf{x})_i$. We need the chain rule for that, so we introduce an intermediate vector variable $\mathbf u$ just as we did using the single-variable chain rule:

$$\begin{array}{l} \displaystyle \mathbf{u}=\mathbf{w} \otimes \mathbf{x} \implies \frac{\partial \mathbf u}{\partial \mathbf w}=\operatorname{diag}(\mathbf x)\\ y=\displaystyle \sum_{i=1}^n u_i\implies\frac{\partial y}{\partial \mathbf u}=\mathbf 1^{\mathsf T} \end{array} $$

The vector chain rule says to multiply the partials:

$$\frac{\partial y}{\partial \mathbf{w}}=\frac{\partial y}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{w}}=\mathbf{1}^{\mathsf T} \operatorname{diag}(\mathbf{x})=\mathbf{x}^{\mathsf T} $$
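A quick numerical check of this result, assuming NumPy (the vectors are arbitrary): the gradient of $y = \mathbf{w}^{\mathsf T}\mathbf{x}$ with respect to $\mathbf w$ should come out equal to $\mathbf x$, read as the row vector $\mathbf x^{\mathsf T}$:

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])

# dy/dw for y = w^T x should be x^T; check each component by central differences
eps = 1e-6
grad = np.zeros_like(w)
for i in range(w.size):
    dw = np.zeros_like(w)
    dw[i] = eps
    grad[i] = (np.dot(w + dw, x) - np.dot(w - dw, x)) / (2 * eps)

print(grad)   # ~ [4. 5. 6.] = x
```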

Let’s tackle the partials of the neuron activation, $\max (0, \mathbf{w}{^\mathsf T} \mathbf{x}+b)$. The use of the $\max(0, z)$ function call on scalar $z$ just says to treat all negative $z$ values as 0. The derivative of the max function is a piecewise function. When $z ≤ 0$, the derivative is $0$ because $z$ is a constant. When $z > 0$, the derivative of the max function is just the derivative of $z$, which is $1$:

$$\frac{\partial}{\partial z} \max (0, z)=\left\{\begin{array}{ll}0 & z \leq 0 \\ \frac{\text d z}{\text d z}=1 & z>0\end{array}\right. $$

Let $z(\mathbf w,\mathbf x,b)=\mathbf{w}{^\mathsf T} \mathbf{x}+b$; then

$$\frac{\partial}{\partial\mathbf w}\operatorname{activation}(\mathbf x)=\left\{ \begin{array}{ll}0 \cdot\frac{\partial z}{\partial \mathbf w} =\mathbf 0^{\mathsf T}& \mathbf{w}{^\mathsf T} \mathbf{x}+b \leq 0 \\ 1 \cdot\frac{\partial z}{\partial \mathbf w} =\mathbf x^{\mathsf T} & \mathbf{w}{^\mathsf T} \mathbf{x}+b>0 \end{array}\right. $$
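Putting the pieces together, here is a minimal sketch of the piecewise result, assuming NumPy (the weight, input, and bias values are made up for illustration):

```python
import numpy as np

def activation(w, x, b):
    return max(0.0, np.dot(w, x) + b)

def activation_grad_w(w, x, b):
    # d activation / dw = x^T when w.x + b > 0, else the zero row vector
    return x if np.dot(w, x) + b > 0 else np.zeros_like(x)

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])

print(activation(w, x, b=1.0))           # w.x + b = 5.5
print(activation_grad_w(w, x, b=1.0))    # 5.5 > 0   ->  [1. 2. 3.]
print(activation_grad_w(w, x, b=-10.0))  # -5.5 <= 0 ->  [0. 0. 0.]
```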

Let’s use these partial derivatives now to handle the entire loss function.