1. Gradient & Jacobian
$def.$ Gradient
Instead of having partials just floating around and not organized in any way, we organize them into a horizontal vector. We call this vector the gradient of $f(x, y)$ and write it as:
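$$\nabla f(x, y)=\left[\frac{\partial f(x, y)}{\partial x},\ \frac{\partial f(x, y)}{\partial y}\right]$$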
$def.$ Jacobian
Gradient vectors organize all of the partial derivatives for a specific scalar function. If we have two functions $f(x,y)$ and $g(x,y)$, we can also organize their gradients into a matrix by stacking the gradients. When we do so, we get the Jacobian matrix (or just the Jacobian) where the gradients are rows:
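$$J=\left[\begin{array}{l}\nabla f(x, y) \\ \nabla g(x, y)\end{array}\right]=\left[\begin{array}{ll}\frac{\partial f(x, y)}{\partial x} & \frac{\partial f(x, y)}{\partial y} \\ \frac{\partial g(x, y)}{\partial x} & \frac{\partial g(x, y)}{\partial y}\end{array}\right]$$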
Note that there are multiple ways to represent the Jacobian. We are using the so-called numerator layout, but many papers and software libraries use the denominator layout, which is just the transpose of the numerator layout:
$$J=\left[\begin{array}{l}\nabla f(x, y) \\ \nabla g(x, y)\end{array}\right]^{\mathsf T} $$
2. Generalization of the Jacobian
Assume that $\mathbf{x}=\left[x_{1}, x_{2},\ldots,x_{n}\right]^{\mathsf T}$ and $\mathbf{y}=\left[y_{1}, y_{2},\ldots,y_{m}\right]^{\mathsf T}$. With multiple scalar-valued functions, we can combine them all into a vector just like we did with the parameters. Let $\mathbf y = \mathbf f(\mathbf x)$ be a vector of $m$ scalar-valued functions that each take a vector $\mathbf x$ of length $n = |\mathbf x|$, where $|\mathbf x|$ is the cardinality (count) of elements in $\mathbf x$. Each $f_i$ function within $\mathbf f$ returns a scalar:
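$$y_{1}=f_{1}(\mathbf{x}), \quad y_{2}=f_{2}(\mathbf{x}), \quad \ldots, \quad y_{m}=f_{m}(\mathbf{x})$$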
It’s very often the case that $m = n$ because we will have a scalar function result for each element of the $\mathbf x$ vector. For example, consider the identity function $\mathbf y = \mathbf f(\mathbf x) = \mathbf x$:
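For the identity function, each $f_i$ simply picks out one element, $y_{i}=f_{i}(\mathbf{x})=x_{i}$, so $\frac{\partial f_{i}}{\partial x_{j}}$ is $1$ when $j=i$ and $0$ otherwise, and the Jacobian works out to the identity matrix:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{ccc}\frac{\partial x_{1}}{\partial x_{1}} & \cdots & \frac{\partial x_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial x_{n}}{\partial x_{1}} & \cdots & \frac{\partial x_{n}}{\partial x_{n}}\end{array}\right]=I$$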
Generally speaking, though, the Jacobian matrix is the collection of all $m \times n$ possible partial derivatives ($m$ rows and $n$ columns), which is the stack of $m$ gradients with respect to $\mathbf x$:
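$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{c}\nabla f_{1}(\mathbf{x}) \\ \nabla f_{2}(\mathbf{x}) \\ \vdots \\ \nabla f_{m}(\mathbf{x})\end{array}\right]=\left[\begin{array}{ccc}\frac{\partial f_{1}(\mathbf{x})}{\partial x_{1}} & \cdots & \frac{\partial f_{1}(\mathbf{x})}{\partial x_{n}} \\ \vdots & & \vdots \\ \frac{\partial f_{m}(\mathbf{x})}{\partial x_{1}} & \cdots & \frac{\partial f_{m}(\mathbf{x})}{\partial x_{n}}\end{array}\right]$$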
Each $\frac{\partial}{\partial \mathbf x} f_i(\mathbf x)$ is a horizontal $n$-vector because the partial derivative is with respect to a vector, $\mathbf x$, whose length is $n = |\mathbf x|$. The width of the Jacobian is $n$ if we’re taking the partial derivative with respect to $\mathbf x$ because there are $n$ parameters we can wiggle, each potentially changing the function’s value. Therefore, the Jacobian is always $m$ rows for $m$ equations.
This is a generally useful trick: Reduce vector expressions down to a set of scalar expressions and then take all of the partials, combining the results appropriately into vectors and matrices at the end.
3. Derivatives of vector element-wise binary operators
We can generalize the element-wise binary operations with notation $\mathbf y = \mathbf f(\mathbf w) \bigcirc \mathbf g(\mathbf x)$ where $n = |\mathbf y| = |\mathbf w| = |\mathbf x|$. The symbol $\bigcirc$ represents any element-wise operator (such as $+$), not the $\circ$ function-composition operator. Here’s what the equation $\mathbf y = \mathbf f(\mathbf w) \bigcirc \mathbf g(\mathbf x)$ looks like when we zoom in to examine the scalar equations:
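$$\left[\begin{array}{c}y_{1} \\ y_{2} \\ \vdots \\ y_{n}\end{array}\right]=\left[\begin{array}{c}f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x}) \\ f_{2}(\mathbf{w}) \bigcirc g_{2}(\mathbf{x}) \\ \vdots \\ f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\end{array}\right]$$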
Using the ideas from the last section, we can see that the general case for the Jacobian with respect to $\mathbf w$ is the square matrix:
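$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}}=\left[\begin{array}{ccc}\frac{\partial}{\partial w_{1}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial w_{n}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) \\ \vdots & & \vdots \\ \frac{\partial}{\partial w_{1}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial w_{n}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right)\end{array}\right]$$

(Row $i$ is just the gradient of $y_{i}=f_{i}(\mathbf{w}) \bigcirc g_{i}(\mathbf{x})$ with respect to $\mathbf{w}$.)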
and the Jacobian with respect to $\mathbf x$ is:
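$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{ccc}\frac{\partial}{\partial x_{1}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial x_{n}}\left(f_{1}(\mathbf{w}) \bigcirc g_{1}(\mathbf{x})\right) \\ \vdots & & \vdots \\ \frac{\partial}{\partial x_{1}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right) & \cdots & \frac{\partial}{\partial x_{n}}\left(f_{n}(\mathbf{w}) \bigcirc g_{n}(\mathbf{x})\right)\end{array}\right]$$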
$def.$ Diagonal Jacobian
In a diagonal Jacobian, all elements off the diagonal are zero: $\frac{\partial}{\partial w_{j}}\left(f_{i}(\mathbf{w}) \bigcirc g_{i}(\mathbf{x})\right)=0$ for $j \neq i$. Under what conditions are those off-diagonal elements zero? Precisely when $f_i$ and $g_i$ are constants with respect to $w_j$, that is, when $\frac{\partial}{\partial w_{j}}f_{i}(\mathbf{w}) =\frac{\partial }{\partial w_j}g_{i}(\mathbf{x}) = 0$. Regardless of the operator, if those partial derivatives go to zero, the operation goes to zero ($0\bigcirc 0 = 0$ no matter what), and the partial derivative of a constant is zero.
Those partials go to zero when $f_i$ and $g_i$ are not functions of $w_j$. We know that element-wise operations imply that $f_i$ is purely a function of $w_i$ and $g_i$ is purely a function of $x_i$. Consequently, $f_i(\mathbf w) \bigcirc g_i(\mathbf x)$ reduces to $f_i(w_i)\bigcirc g_i(x_i)$ and the goal becomes computing $\frac{\partial}{\partial w_j}\left(f_i(w_i)\bigcirc g_i(x_i)\right)$.
$f_i(w_i)$ and $g_i(x_i)$ look like constants to the partial differentiation operator with respect to $w_j$ when $j \neq i$, so the partials are zero off the diagonal.
Notation $f_i(w_i)$ is technically an abuse of our notation because $f_i$ and $g_i$ are functions of vectors not individual elements. We should really write something like $\hat f_i(w_i) = f_i(\mathbf w)$, but that would muddy the equations further, and programmers are comfortable overloading functions, so we’ll proceed with the notation anyway.
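Zero off-diagonal entries mean the Jacobian with respect to $\mathbf{w}$ collapses to a diagonal matrix:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}}=\left[\begin{array}{ccc}\frac{\partial}{\partial w_{1}}\left(f_{1}(w_{1}) \bigcirc g_{1}(x_{1})\right) & & \text{\Large 0} \\ & \ddots & \\ \text{\Large 0} & & \frac{\partial}{\partial w_{n}}\left(f_{n}(w_{n}) \bigcirc g_{n}(x_{n})\right)\end{array}\right]$$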
(The large “0”s are shorthand indicating that all of the off-diagonal elements are 0.)
More succinctly, we can write the two Jacobians as:
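$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}}=\operatorname{diag}\left(\frac{\partial}{\partial w_{1}}\left(f_{1}(w_{1}) \bigcirc g_{1}(x_{1})\right),\ \ldots,\ \frac{\partial}{\partial w_{n}}\left(f_{n}(w_{n}) \bigcirc g_{n}(x_{n})\right)\right)$$

and

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\operatorname{diag}\left(\frac{\partial}{\partial x_{1}}\left(f_{1}(w_{1}) \bigcirc g_{1}(x_{1})\right),\ \ldots,\ \frac{\partial}{\partial x_{n}}\left(f_{n}(w_{n}) \bigcirc g_{n}(x_{n})\right)\right)$$

(Here $\operatorname{diag}(\ldots)$ constructs a diagonal matrix from the listed elements.)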
4. Derivatives involving scalar expansion
When we multiply or add scalars to vectors, we’re implicitly expanding the scalar to a vector and then performing an element-wise binary operation. For example, adding scalar $z$ to vector $\mathbf x$, $\mathbf y = \mathbf x + z$, is really $\mathbf y = \mathbf f(\mathbf x) + \mathbf g(z)$ where $\mathbf f(\mathbf x) = \mathbf x$ and $\mathbf g(z) = \mathbf 1z$. (The notation $\mathbf 1$ represents a vector of ones of appropriate length.) Here $z$ is any scalar that doesn’t depend on $\mathbf x$, which is useful because then $\frac{\partial z}{\partial x_i}=0$ for any $x_i$, and that will simplify our partial derivative computations. (It’s okay to think of variable $z$ as a constant for our discussion here.) Similarly, multiplying by a scalar, $\mathbf y = \mathbf xz$, is really $\mathbf y = \mathbf f(\mathbf x) \otimes \mathbf g(z) = \mathbf x \otimes \mathbf 1z$ where $\otimes$ is the element-wise multiplication (Hadamard product) of the two vectors.
The partial derivatives of vector-scalar addition and multiplication with respect to vector $\mathbf x$ use our element-wise rule:
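Applying the element-wise rule with $\mathbf{f}(\mathbf{x})=\mathbf{x}$ and $\mathbf{g}(z)=\mathbf{1}z$, every diagonal entry is $\frac{\partial}{\partial x_{i}}\left(x_{i}+z\right)=1$ for addition and $\frac{\partial}{\partial x_{i}}\left(x_{i} z\right)=z$ for multiplication, so

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}+z)=\operatorname{diag}(\mathbf{1})=I \qquad\text{and}\qquad \frac{\partial}{\partial \mathbf{x}}(\mathbf{x} z)=\operatorname{diag}(\mathbf{1} z)=I z$$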
5. Vector Chain Rule
Assuming you’ve got a good handle on the total-derivative chain rule, we’re ready to tackle the chain rule for vectors of functions and vector variables. We can start by computing the derivative of a sample vector function with respect to a scalar. Say $\mathbf y = \mathbf f(\mathbf z)$ and $\mathbf z=\mathbf g(x)$, so that $\mathbf y=(\mathbf f\circ\mathbf g)(x)$:
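Each component follows the scalar total-derivative chain rule, $\frac{\partial y_{i}}{\partial x}=\sum_{j} \frac{\partial f_{i}}{\partial z_{j}} \frac{\partial z_{j}}{\partial x}$, and stacking those $m$ rows factors the result into a product of Jacobians:

$$\frac{\partial \mathbf{y}}{\partial x}=\frac{\partial \mathbf{f}(\mathbf{z})}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial x}=\frac{\partial \mathbf{f}}{\partial \mathbf{g}} \frac{\partial \mathbf{g}}{\partial x}$$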
This vector chain rule for vectors of functions and a single parameter appears to be correct and, indeed, mirrors the single-variable chain rule.
The beauty of the vector formula over the single-variable chain rule is that it automatically takes into consideration the total derivative while maintaining the same notational simplicity. The Jacobian contains all possible combinations of $f_i$ with respect to $g_j$ and $g_i$ with respect to $x_j$ . For completeness, here are the two Jacobian components in their full glory:
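Writing $k=|\mathbf{g}|$ for the number of intermediate variables (a label introduced here for clarity), the two components are

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\frac{\partial \mathbf{f}}{\partial \mathbf{g}} \frac{\partial \mathbf{g}}{\partial \mathbf{x}}=\left[\begin{array}{ccc}\frac{\partial f_{1}}{\partial g_{1}} & \cdots & \frac{\partial f_{1}}{\partial g_{k}} \\ \vdots & & \vdots \\ \frac{\partial f_{m}}{\partial g_{1}} & \cdots & \frac{\partial f_{m}}{\partial g_{k}}\end{array}\right]\left[\begin{array}{ccc}\frac{\partial g_{1}}{\partial x_{1}} & \cdots & \frac{\partial g_{1}}{\partial x_{n}} \\ \vdots & & \vdots \\ \frac{\partial g_{k}}{\partial x_{1}} & \cdots & \frac{\partial g_{k}}{\partial x_{n}}\end{array}\right]$$

an $m \times k$ matrix times a $k \times n$ matrix, yielding the $m \times n$ Jacobian $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$.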
6. The gradient of neuron activation
We now have all of the pieces needed to compute the derivative of a typical neuron activation for a single neural network computation unit with respect to the model parameters, $\mathbf w$ and $b$:
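A single unit computes an affine function of the inputs and then clips negative values to zero; call the scalar result $activation(\mathbf{x})$:

$$activation(\mathbf{x})=\max \left(0,\ \mathbf{w}^{\mathsf T} \mathbf{x}+b\right)$$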
Let’s worry about the max later and focus on computing $\frac{\partial }{\partial \mathbf w}(\mathbf{w}^{\mathsf T} \mathbf{x}+b)$ and $\frac{\partial }{\partial b}(\mathbf{w}^{\mathsf T} \mathbf{x}+b)$. (Recall that neural networks learn through optimization of their weights and biases.) We haven’t discussed the derivative of the dot product yet, $y = \mathbf f^{\mathsf T}(\mathbf w)\cdot \mathbf g(\mathbf x)$, but we can use the chain rule to avoid having to memorize yet another rule. (Note the notation $y$, not $\mathbf y$, as the result is a scalar, not a vector.)
The dot product $\mathbf{w}^{\mathsf T}\mathbf{x}$ is just the summation of the element-wise multiplication of the elements:
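$$\mathbf{w}^{\mathsf T} \mathbf{x}=\sum_{i=1}^{n}\left(w_{i} x_{i}\right)=\operatorname{sum}(\mathbf{w} \otimes \mathbf{x})$$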
We know how to compute the partial derivatives of $\displaystyle \sum_{i=1}^{n}x_i$ and $\mathbf w\otimes\mathbf x$, but we haven’t looked at the partial derivatives of $\displaystyle \sum_{i=1}^{n}(\mathbf w\otimes\mathbf x)_i$. We need the chain rule for that, so we introduce an intermediate vector variable $\mathbf u$ just as we did using the single-variable chain rule:
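Concretely, let

$$\mathbf{u}=\mathbf{w} \otimes \mathbf{x}, \qquad y=\operatorname{sum}(\mathbf{u})=\sum_{i=1}^{n} u_{i}$$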
The vector chain rule says to multiply the partials:
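Here $\frac{\partial \mathbf{u}}{\partial \mathbf{w}}$ is the diagonal Jacobian of an element-wise product, $\operatorname{diag}(\mathbf{x})$, and $\frac{\partial y}{\partial \mathbf{u}}$ is the gradient of a sum, a row vector of ones, so (as a sketch)

$$\frac{\partial y}{\partial \mathbf{w}}=\frac{\partial y}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{w}}=[1, \ldots, 1]\operatorname{diag}(\mathbf{x})=\mathbf{x}^{\mathsf T} \qquad\text{and}\qquad \frac{\partial y}{\partial \mathbf{x}}=\frac{\partial y}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{x}}=[1, \ldots, 1]\operatorname{diag}(\mathbf{w})=\mathbf{w}^{\mathsf T}$$

Since the bias $b$ enters only additively, it follows that $\frac{\partial}{\partial \mathbf{w}}\left(\mathbf{w}^{\mathsf T} \mathbf{x}+b\right)=\mathbf{x}^{\mathsf T}$ and $\frac{\partial}{\partial b}\left(\mathbf{w}^{\mathsf T} \mathbf{x}+b\right)=1$.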
Let’s tackle the partials of the neuron activation, $\max (0, \mathbf{w}^{\mathsf T} \mathbf{x}+b)$. The use of the $\max(0, z)$ function call on scalar $z$ just says to treat all negative $z$ values as 0. The derivative of the max function is a piecewise function. When $z \le 0$, the derivative is $0$ because $z$ is a constant. When $z > 0$, the derivative of the max function is just the derivative of $z$, which is $1$:
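$$\frac{\partial}{\partial z} \max (0, z)=\begin{cases}0 & z \le 0 \\ 1 & z>0\end{cases}$$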
Let $z(\mathbf w,\mathbf x,b)=\mathbf{w}^{\mathsf T} \mathbf{x}+b$; then
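Combining the piecewise max derivative with $\frac{\partial z}{\partial \mathbf{w}}=\mathbf{x}^{\mathsf T}$ and $\frac{\partial z}{\partial b}=1$ from above, a sketch of the two gradients is

$$\frac{\partial activation}{\partial \mathbf{w}}=\begin{cases}\mathbf{0}^{\mathsf T} & \mathbf{w}^{\mathsf T} \mathbf{x}+b \le 0 \\ \mathbf{x}^{\mathsf T} & \mathbf{w}^{\mathsf T} \mathbf{x}+b>0\end{cases} \qquad \frac{\partial activation}{\partial b}=\begin{cases}0 & \mathbf{w}^{\mathsf T} \mathbf{x}+b \le 0 \\ 1 & \mathbf{w}^{\mathsf T} \mathbf{x}+b>0\end{cases}$$

As a quick numerical sanity check of the $\mathbf{w}$ case, here is a minimal NumPy sketch (the helper name `activation` and the finite-difference comparison are ours, purely illustrative, not part of the text):

```python
import numpy as np

# Random test point; the check is valid away from the kink at w.x + b == 0.
rng = np.random.default_rng(0)
w, x, b = rng.normal(size=3), rng.normal(size=3), 0.5

def activation(w, x, b):
    # A single neuron: max(0, w.x + b)
    return max(0.0, w @ x + b)

# Analytical gradient w.r.t. w: x when w.x + b > 0, otherwise the zero vector.
z = w @ x + b
grad_w = x if z > 0 else np.zeros_like(x)

# Central finite differences, one coordinate of w at a time.
eps = 1e-6
fd = np.array([(activation(w + eps * e, x, b) - activation(w - eps * e, x, b)) / (2 * eps)
               for e in np.eye(3)])

print(np.allclose(grad_w, fd, atol=1e-4))  # expect: True
```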
Let’s use these partial derivatives now to handle the entire loss function.