1. Gradient & Jacobian
$def.$ Gradient
Instead of having partials just floating around and not organized in any way, we organize them into a horizontal vector. We call this vector the gradient of $f(x, y)$ and write it as:
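$$\nabla f(x, y)=\left[\frac{\partial f(x, y)}{\partial x},\ \frac{\partial f(x, y)}{\partial y}\right]$$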
$def.$ Jacobian
Gradient vectors organize all of the partial derivatives for a specific scalar function. If we have two functions $f(x,y)$ and $g(x,y)$, we can also organize their gradients into a matrix by stacking the gradients. When we do so, we get the Jacobian matrix (or just the Jacobian) where the gradients are rows:
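$$J=\left[\begin{array}{l}\nabla f(x, y) \\ \nabla g(x, y)\end{array}\right]=\left[\begin{array}{ll}\frac{\partial f(x, y)}{\partial x} & \frac{\partial f(x, y)}{\partial y} \\ \frac{\partial g(x, y)}{\partial x} & \frac{\partial g(x, y)}{\partial y}\end{array}\right]$$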
Note that there are multiple ways to represent the Jacobian. We are using the so-called numerator layout, but many papers and software libraries use the denominator layout, which is just the transpose of the numerator layout:
$$J=\left[\begin{array}{l}\nabla f(x, y) \\ \nabla g(x, y)\end{array}\right]^{\mathsf T} $$
2. Generalization of the Jacobian
Assume that $\mathbf{x}=\left[x_{1}, x_{2},\ldots,x_{n}\right]^{\mathsf T}$ and $\mathbf{y}=\left[y_{1}, y_{2},\ldots,y_{m}\right]^{\mathsf T}$. With multiple scalar-valued functions, we can combine them all into a vector just like we did with the parameters. Let $\mathbf y = \mathbf f(\mathbf x)$ be a vector of $m$ scalar-valued functions that each take a vector $\mathbf x$ of length $n = |\mathbf x|$, where $|\mathbf x|$ is the cardinality (count) of elements in $\mathbf x$. Each $f_i$ function within $\mathbf f$ returns a scalar:
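$$y_{1}=f_{1}(\mathbf{x}), \quad y_{2}=f_{2}(\mathbf{x}), \quad \ldots, \quad y_{m}=f_{m}(\mathbf{x})$$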
It’s very often the case that $m = n$ because we will have a scalar function result for each element of the $\mathbf x$ vector. For example, consider the identity function $\mathbf y = \mathbf f(\mathbf x) = \mathbf x$:
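For the identity function, each $f_i$ simply picks out one element, $y_{i}=f_{i}(\mathbf{x})=x_{i}$, so $\frac{\partial f_{i}}{\partial x_{j}}$ is $1$ when $j=i$ and $0$ otherwise, and the Jacobian works out to the identity matrix:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{ccc}\frac{\partial x_{1}}{\partial x_{1}} & \cdots & \frac{\partial x_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial x_{n}}{\partial x_{1}} & \cdots & \frac{\partial x_{n}}{\partial x_{n}}\end{array}\right]=I$$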
Generally speaking, though, the Jacobian matrix is the collection of all $m \times n$ possible partial derivatives ($m$ rows and $n$ columns), which is the stack of $m$ gradients with respect to $\mathbf x$:
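$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{c}\nabla f_{1}(\mathbf{x}) \\ \nabla f_{2}(\mathbf{x}) \\ \vdots \\ \nabla f_{m}(\mathbf{x})\end{array}\right]=\left[\begin{array}{ccc}\frac{\partial f_{1}(\mathbf{x})}{\partial x_{1}} & \cdots & \frac{\partial f_{1}(\mathbf{x})}{\partial x_{n}} \\ \vdots & & \vdots \\ \frac{\partial f_{m}(\mathbf{x})}{\partial x_{1}} & \cdots & \frac{\partial f_{m}(\mathbf{x})}{\partial x_{n}}\end{array}\right]$$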
Each $\frac{\partial}{\partial \mathbf x} f_i(\mathbf x)$ is a horizontal $n$-vector because the partial derivative is with respect to a vector, $\mathbf x$, whose length is $n = |\mathbf x|$. The width of the Jacobian is $n$ if we’re taking the partial derivative with respect to $\mathbf x$ because there are $n$ parameters we can wiggle, each potentially changing the function’s value. Therefore, the Jacobian is always $m$ rows for $m$ equations.
This is a generally useful trick: Reduce vector expressions down to a set of scalar expressions and then take all of the partials, combining the results appropriately into vectors and matrices at the end.
3. Derivatives of vector element-wise binary operators
We can generalize the element-wise binary operations with notation $\mathbf y = \mathbf f(\mathbf w) \bigcirc \mathbf g(\mathbf x)$ where $n = |\mathbf y| = |\mathbf w| = |\mathbf x|$. The symbol $\bigcirc$ represents any element-wise operator (such as $+$), not the $\circ$ function-composition operator. Here’s what the equation $\mathbf y = \mathbf f(\mathbf w) \bigcirc \mathbf g(\mathbf x)$ looks like when we zoom in to examine the scalar equations:
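$$\left[\begin{array}{c}y_{1} \\ y_{2} \\ \vdots \\ y_{n}\end{array}\right]=\left[\begin{array}{c}f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x}) \\ f_{2}(\mathbf{w}) \bigcirc g_{2}(\mathbf{x}) \\ \vdots \\ f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\end{array}\right]$$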
Using the ideas from the last section, we can see that the general case for the Jacobian with respect to $\mathbf w$ is the square matrix:
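$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}}=\left[\begin{array}{ccc}\frac{\partial}{\partial w_{1}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial w_{n}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) \\ \vdots & & \vdots \\ \frac{\partial}{\partial w_{1}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial w_{n}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right)\end{array}\right]$$

(Row $i$ is just the gradient of $y_{i}=f_{i}(\mathbf{w}) \bigcirc g_{i}(\mathbf{x})$ with respect to $\mathbf{w}$.)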
and the Jacobian with respect to $\mathbf x$ is:
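$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{ccc}\frac{\partial}{\partial x_{1}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial x_{n}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) \\ \vdots & & \vdots \\ \frac{\partial}{\partial x_{1}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial x_{n}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right)\end{array}\right]$$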
$def.$ Diagonal Jacobian
In a diagonal Jacobian, all elements off the diagonal are zero: $\frac{\partial}{\partial w_{j}}\left(f_{i}(\mathbf{w}) \bigcirc g_{i}(\mathbf{x})\right)=0$ for $j \neq i$. Under what conditions are those off-diagonal elements zero? Precisely when $f_i$ and $g_i$ are constants with respect to $w_j$, that is, when $\frac{\partial}{\partial w_{j}}f_{i}(\mathbf{w}) =\frac{\partial }{\partial w_j}g_{i}(\mathbf{x}) = 0$. Regardless of the operator, if those partial derivatives go to zero, the operation goes to zero ($0\bigcirc 0 = 0$ no matter what), and the partial derivative of a constant is zero.
Those partials go to zero when $f_i$ and $g_i$ are not functions of $w_j$. We know that element-wise operations imply that $f_i$ is purely a function of $w_i$ and $g_i$ is purely a function of $x_i$. Consequently, $f_i(\mathbf w) \bigcirc g_i(\mathbf x)$ reduces to $f_i(w_i)\bigcirc g_i(x_i)$ and the goal becomes computing $\frac{\partial}{\partial w_j}\left(f_i(w_i)\bigcirc g_i(x_i)\right)$.
$f_i(w_i)$ and $g_i(x_i)$ look like constants to the partial differentiation operator with respect to $w_j$ when $j \neq i$, so the partials are zero off the diagonal.
Notation $f_i(w_i)$ is technically an abuse of our notation because $f_i$ and $g_i$ are functions of vectors not individual elements. We should really write something like $\hat f_i(w_i) = f_i(\mathbf w)$, but that would muddy the equations further, and programmers are comfortable overloading functions, so we’ll proceed with the notation anyway.
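Zero off-diagonal entries mean the Jacobian with respect to $\mathbf{w}$ collapses to a diagonal matrix:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}}=\left[\begin{array}{ccc}\frac{\partial}{\partial w_{1}}\left(f_{1}(w_{1}) \bigcirc g_{1}(x_{1})\right) & & \text{\Large 0} \\ & \ddots & \\ \text{\Large 0} & & \frac{\partial}{\partial w_{n}}\left(f_{n}(w_{n}) \bigcirc g_{n}(x_{n})\right)\end{array}\right]$$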
(The large “0”s are shorthand indicating that all of the off-diagonal elements are 0.)
More succinctly, we can write the two Jacobians as:
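$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}}=\operatorname{diag}\left(\frac{\partial}{\partial w_{1}}\left(f_{1}(w_{1}) \bigcirc g_{1}(x_{1})\right),\ \ldots,\ \frac{\partial}{\partial w_{n}}\left(f_{n}(w_{n}) \bigcirc g_{n}(x_{n})\right)\right)$$

and

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\operatorname{diag}\left(\frac{\partial}{\partial x_{1}}\left(f_{1}(w_{1}) \bigcirc g_{1}(x_{1})\right),\ \ldots,\ \frac{\partial}{\partial x_{n}}\left(f_{n}(w_{n}) \bigcirc g_{n}(x_{n})\right)\right)$$

(Here $\operatorname{diag}(\ldots)$ constructs a diagonal matrix from the listed elements.)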
4. Derivatives involving scalar expansion
When we multiply or add scalars to vectors, we’re implicitly expanding the scalar to a vector and then performing an element-wise binary operation. For example, adding scalar $z$ to vector $\mathbf x$, $\mathbf y = \mathbf x + z$, is really $\mathbf y = \mathbf f(\mathbf x) + \mathbf g(z)$ where $\mathbf f(\mathbf x) = \mathbf x$ and $\mathbf g(z) = \mathbf 1z$. (The notation $\mathbf 1$ represents a vector of ones of appropriate length.) Here $z$ is any scalar that doesn’t depend on $\mathbf x$, which is useful because then $\frac{\partial z}{\partial x_i}=0$ for any $x_i$, and that will simplify our partial derivative computations. (It’s okay to think of variable $z$ as a constant for our discussion here.) Similarly, multiplying by a scalar, $\mathbf y = \mathbf xz$, is really $\mathbf y = \mathbf f(\mathbf x) \otimes \mathbf g(z) = \mathbf x \otimes \mathbf 1z$ where $\otimes$ is the element-wise multiplication (Hadamard product) of the two vectors.
The partial derivatives of vector-scalar addition and multiplication with respect to vector $\mathbf x$ use our element-wise rule:
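Applying the element-wise rule with $\mathbf{f}(\mathbf{x})=\mathbf{x}$ and $\mathbf{g}(z)=\mathbf{1}z$, every diagonal entry is $\frac{\partial}{\partial x_{i}}\left(x_{i}+z\right)=1$ for addition and $\frac{\partial}{\partial x_{i}}\left(x_{i} z\right)=z$ for multiplication, so

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}+z)=\operatorname{diag}(\mathbf{1})=I \qquad\text{and}\qquad \frac{\partial}{\partial \mathbf{x}}(\mathbf{x} z)=\operatorname{diag}(\mathbf{1} z)=I z$$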
5. Vector Chain Rule
Assuming you’ve got a good handle on the total-derivative chain rule, we’re ready to tackle the chain rule for vectors of functions and vector variables. We can start by computing the derivative of a sample vector function with respect to a scalar. Say $\mathbf y = \mathbf f(\mathbf z)$ and $\mathbf z=\mathbf g(x)$, so that $\mathbf y=(\mathbf f\circ\mathbf g)(x)$:
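Each component follows the scalar total-derivative chain rule, $\frac{\partial y_{i}}{\partial x}=\sum_{j} \frac{\partial f_{i}}{\partial z_{j}} \frac{\partial z_{j}}{\partial x}$, and stacking those $m$ rows factors the result into a product of Jacobians:

$$\frac{\partial \mathbf{y}}{\partial x}=\frac{\partial \mathbf{f}(\mathbf{z})}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial x}=\frac{\partial \mathbf{f}}{\partial \mathbf{g}} \frac{\partial \mathbf{g}}{\partial x}$$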
This vector chain rule for vectors of functions and a single parameter appears to be correct and, indeed, mirrors the single-variable chain rule.
The beauty of the vector formula over the single-variable chain rule is that it automatically takes into consideration the total derivative while maintaining the same notational simplicity. The Jacobian contains all possible combinations of $f_i$ with respect to $g_j$ and $g_i$ with respect to $x_j$ . For completeness, here are the two Jacobian components in their full glory:
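Writing $k=|\mathbf{g}|$ for the number of intermediate variables (a label introduced here for clarity), the two components are

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\frac{\partial \mathbf{f}}{\partial \mathbf{g}} \frac{\partial \mathbf{g}}{\partial \mathbf{x}}=\left[\begin{array}{ccc}\frac{\partial f_{1}}{\partial g_{1}} & \cdots & \frac{\partial f_{1}}{\partial g_{k}} \\ \vdots & & \vdots \\ \frac{\partial f_{m}}{\partial g_{1}} & \cdots & \frac{\partial f_{m}}{\partial g_{k}}\end{array}\right]\left[\begin{array}{ccc}\frac{\partial g_{1}}{\partial x_{1}} & \cdots & \frac{\partial g_{1}}{\partial x_{n}} \\ \vdots & & \vdots \\ \frac{\partial g_{k}}{\partial x_{1}} & \cdots & \frac{\partial g_{k}}{\partial x_{n}}\end{array}\right]$$

an $m \times k$ matrix times a $k \times n$ matrix, yielding the $m \times n$ Jacobian $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$.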
6. The gradient of neuron activation
We now have all of the pieces needed to compute the derivative of a typical neuron activation for a single neural network computation unit with respect to the model parameters, $\mathbf w$ and $b$:
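A single unit computes an affine function of the inputs and then clips negative values to zero; call the scalar result $activation(\mathbf{x})$:

$$activation(\mathbf{x})=\max \left(0,\ \mathbf{w}^{\mathsf T} \mathbf{x}+b\right)$$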
Let’s worry about the max later and focus on computing $\frac{\partial }{\partial \mathbf w}(\mathbf{w}^{\mathsf T} \mathbf{x}+b)$ and $\frac{\partial }{\partial b}(\mathbf{w}^{\mathsf T} \mathbf{x}+b)$. (Recall that neural networks learn through optimization of their weights and biases.) We haven’t discussed the derivative of the dot product yet, $y = \mathbf f^{\mathsf T}(\mathbf w)\cdot \mathbf g(\mathbf x)$, but we can use the chain rule to avoid having to memorize yet another rule. (Note the notation $y$, not $\mathbf y$, as the result is a scalar, not a vector.)
The dot product $\mathbf{w}^{\mathsf T}\mathbf{x}$ is just the summation of the element-wise multiplication of the elements:
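$$\mathbf{w}^{\mathsf T} \mathbf{x}=\sum_{i=1}^{n}\left(w_{i} x_{i}\right)=\operatorname{sum}(\mathbf{w} \otimes \mathbf{x})$$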
We know how to compute the partial derivatives of $\displaystyle \sum_{i=1}^{n}x_i$ and $\mathbf w\otimes\mathbf x$, but we haven’t looked at the partial derivatives of $\displaystyle \sum_{i=1}^{n}(\mathbf w\otimes\mathbf x)_i$. We need the chain rule for that, so we introduce an intermediate vector variable $\mathbf u$ just as we did using the single-variable chain rule:
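Concretely, let

$$\mathbf{u}=\mathbf{w} \otimes \mathbf{x}, \qquad y=\operatorname{sum}(\mathbf{u})=\sum_{i=1}^{n} u_{i}$$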
The vector chain rule says to multiply the partials:
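Here $\frac{\partial \mathbf{u}}{\partial \mathbf{w}}$ is the diagonal Jacobian of an element-wise product, $\operatorname{diag}(\mathbf{x})$, and $\frac{\partial y}{\partial \mathbf{u}}$ is the gradient of a sum, a row vector of ones, so (as a sketch)

$$\frac{\partial y}{\partial \mathbf{w}}=\frac{\partial y}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{w}}=[1, \ldots, 1]\operatorname{diag}(\mathbf{x})=\mathbf{x}^{\mathsf T} \qquad\text{and}\qquad \frac{\partial y}{\partial \mathbf{x}}=\frac{\partial y}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{x}}=[1, \ldots, 1]\operatorname{diag}(\mathbf{w})=\mathbf{w}^{\mathsf T}$$

Since the bias $b$ enters only additively, it follows that $\frac{\partial}{\partial \mathbf{w}}\left(\mathbf{w}^{\mathsf T} \mathbf{x}+b\right)=\mathbf{x}^{\mathsf T}$ and $\frac{\partial}{\partial b}\left(\mathbf{w}^{\mathsf T} \mathbf{x}+b\right)=1$.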
Let’s tackle the partials of the neuron activation, $\max (0, \mathbf{w}^{\mathsf T} \mathbf{x}+b)$. The use of the $\max(0, z)$ function call on scalar $z$ just says to treat all negative $z$ values as 0. The derivative of the max function is a piecewise function. When $z \le 0$, the derivative is $0$ because $z$ is a constant. When $z > 0$, the derivative of the max function is just the derivative of $z$, which is $1$:
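$$\frac{\partial}{\partial z} \max (0, z)=\begin{cases}0 & z \le 0 \\ 1 & z>0\end{cases}$$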
Let $z(\mathbf w,\mathbf x,b)=\mathbf{w}^{\mathsf T} \mathbf{x}+b$; then
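Combining the piecewise max derivative with $\frac{\partial z}{\partial \mathbf{w}}=\mathbf{x}^{\mathsf T}$ and $\frac{\partial z}{\partial b}=1$ from above, a sketch of the two gradients is

$$\frac{\partial activation}{\partial \mathbf{w}}=\begin{cases}\mathbf{0}^{\mathsf T} & \mathbf{w}^{\mathsf T} \mathbf{x}+b \le 0 \\ \mathbf{x}^{\mathsf T} & \mathbf{w}^{\mathsf T} \mathbf{x}+b>0\end{cases} \qquad \frac{\partial activation}{\partial b}=\begin{cases}0 & \mathbf{w}^{\mathsf T} \mathbf{x}+b \le 0 \\ 1 & \mathbf{w}^{\mathsf T} \mathbf{x}+b>0\end{cases}$$

As a quick numerical sanity check of the $\mathbf{w}$ case, here is a minimal NumPy sketch (the helper name `activation` and the finite-difference comparison are ours, purely illustrative, not part of the text):

```python
import numpy as np

# Random test point; the check is valid away from the kink at w.x + b == 0.
rng = np.random.default_rng(0)
w, x, b = rng.normal(size=3), rng.normal(size=3), 0.5

def activation(w, x, b):
    # A single neuron: max(0, w.x + b)
    return max(0.0, w @ x + b)

# Analytical gradient w.r.t. w: x when w.x + b > 0, otherwise the zero vector.
z = w @ x + b
grad_w = x if z > 0 else np.zeros_like(x)

# Central finite differences, one coordinate of w at a time.
eps = 1e-6
fd = np.array([(activation(w + eps * e, x, b) - activation(w - eps * e, x, b)) / (2 * eps)
               for e in np.eye(3)])

print(np.allclose(grad_w, fd, atol=1e-4))  # expect: True
```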
Let’s use these partial derivatives now to handle the entire loss function.