Scalar, Vector, and Matrix Derivatives Demystified#
If you’re working in machine learning, control theory, or any other STEM field that involves functions of multiple variables, you will almost certainly end up working with multi-variable derivatives at some point. Gradients and Jacobians can be fairly intuitive to work with once you wrap your head around them, but taking matrix-valued derivatives of matrices is a little more intimidating. What does that even look like? How do you implement that in code?
This post is a quick guide to working with multi-variable derivatives. It introduces a fundamental strategy for computing them, and then a useful heuristic for when you don’t quite know what the resulting shape should be. I should mention that the methods below are inspired by how some automatic differentiation systems such as Zygote.jl compute multi-variable derivatives.
Caveat
This article uses a lot of handwaving. There’s likely a more mathematically precise way of describing all of this, but I hope this is at least a practical starting point for understanding and implementing multi-variable derivatives in your work.
For a more mathematical treatment, the Wikipedia article on matrix calculus is a good starting point.
The Fundamental Strategy of Multivariable Derivatives#
Say you have a function \(y = f(x)\). The input \(x\) and output \(y\) might be scalars. They might be vectors. They might be multi-dimensional arrays. Don’t worry about it for now.
Let’s denote the derivative of \(y\) with respect to \(x\) at the point \(z\) as

\[
\frac{\partial y(z)}{\partial x}.
\]
The fundamental strategy for computing multivariable derivatives is the following:
Take the partial derivative of every element of \(y\) with respect to every element of \(x\).
That’s really all there is to it.
The only thing left to worry about is what shape to put all those partial derivatives in. For example, should the output be in the shape of a matrix? A row vector? A column vector? This can admittedly get a little confusing. When in doubt, however, the following three-step recipe is a useful heuristic:
1. Reshape \(y\) into a \((M \times 1)\)-dimensional vector, where \(M\) is the total number of elements of \(y\).

2. Reshape \(x\) into a \((N \times 1)\)-dimensional vector, where \(N\) is the total number of elements of \(x\).

3. Apply the fundamental strategy for computing multivariable derivatives.
Once you’ve applied the three steps above, you’ve accomplished ~90% of the work for computing \(\frac{\partial y(z)}{\partial x}\).
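To make this concrete, here is a minimal numerical sketch of the fundamental strategy combined with the reshaping heuristic, written in Python with NumPy. The function name `numerical_derivative` and the finite-difference approximation are my own illustrative choices, not something prescribed by the strategy itself:

```python
import numpy as np

def numerical_derivative(f, z, eps=1e-6):
    """Approximate the derivative of y = f(x) at x = z.

    Reshaping heuristic: treat y as an (M x 1) vector and x as an (N x 1)
    vector, then fill an (M x N) matrix with the partial derivative of
    every element of y with respect to every element of x, estimated
    here with central finite differences.
    """
    z = np.asarray(z, dtype=float)
    y0 = np.asarray(f(z), dtype=float)
    M, N = y0.size, z.size
    D = np.zeros((M, N))
    for j in range(N):
        dz = np.zeros_like(z)
        dz.flat[j] = eps                       # perturb one element of x at a time
        y_plus = np.asarray(f(z + dz), dtype=float)
        y_minus = np.asarray(f(z - dz), dtype=float)
        D[:, j] = (y_plus - y_minus).ravel() / (2 * eps)
    return D

# Example: y = f(x) maps a (2 x 2) matrix to a length-2 vector,
# so the result is a (2 x 4) matrix of partial derivatives.
f = lambda X: X @ np.array([1.0, 2.0])
print(numerical_derivative(f, np.eye(2)))
```

Automatic differentiation systems compute the same partial derivatives exactly rather than approximately, but the shape bookkeeping is identical.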
What’s with the \((z)\)?
You’ll notice that I write \(\frac{\partial y(z)}{\partial x}\) instead of simply \(\frac{\partial y}{\partial x}\). This is just making it explicit that the derivative of \(y\) w.r.t. \(x\) depends in general on the specific location \(z\) you evaluate the derivative at.
Examples#
So what does this look like in practice? Let’s go through a few scenarios. Because I’m lazy, I’m going to use the abbreviation FSMD to refer to that fundamental strategy mentioned above. Just for kicks, let’s also abbreviate the 3-step reshaping heuristic as 3SRH because why not.
Scalar-Scalar Derivatives#
Suppose \(x\) and \(y\) are both scalars. To apply FSMD, we take the derivative of the only entry of \(y\) with respect to the only element of \(x\). That was easy.
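Even this trivial case can be checked in code. Here is a one-line sanity check using JAX, purely as an illustrative choice of autodiff library (the post doesn’t prescribe one): for \(y = x^2\), the derivative at \(x = 3\) is \(6\).

```python
import jax

f = lambda x: x ** 2       # scalar input, scalar output
print(jax.grad(f)(3.0))    # 6.0 -- a single partial derivative, itself a scalar
```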
Scalar-Vector Derivatives (a.k.a. Gradient)#
Suppose \(y\) is a scalar and \(x = \begin{bmatrix}x_1 & x_2 & \ldots & x_N \end{bmatrix}^\intercal\). To apply FSMD, we take the partial derivative of the only entry of \(y\) with respect to each entry \(x_i\) of \(x\), and slap all of these partial derivatives together in a vector:

\[
\frac{\partial y(z)}{\partial x} = \begin{bmatrix} \frac{\partial y(z)}{\partial x_1} & \frac{\partial y(z)}{\partial x_2} & \cdots & \frac{\partial y(z)}{\partial x_N} \end{bmatrix}^\intercal
\]
Done.
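As an illustrative check (again using JAX, purely as a convenient sketch), the gradient of \(y = x^\intercal x\) is \(2x\), with one partial derivative per entry of \(x\):

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.dot(x, x)        # scalar y, vector x
x = jnp.array([1.0, 2.0, 3.0])
print(jax.grad(f)(x))              # [2. 4. 6.] -- one partial per entry of x
```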
Scalar-Matrix Derivatives#
This one might seem a little scary for the uninitiated. Taking the derivative of something with respect to a matrix? Are you crazy??
No, it’s not that bad. Keep calm, and carry on with the FSMD.
Suppose that \(x\) is a \((N_1 \times N_2)\) matrix. Let’s denote the \((i,j)\)th entry of \(x\) as \(x_{i,j}\). Take the partial derivative of the only entry of \(y\) with respect to each and every \(x_{i,j}\) variable. And voila–it’s over.
There does happen to be a clean convention for how to shape the final result into a matrix:

\[
\frac{\partial y(z)}{\partial x} = \begin{bmatrix} \frac{\partial y(z)}{\partial x_{1,1}} & \frac{\partial y(z)}{\partial x_{1,2}} & \cdots & \frac{\partial y(z)}{\partial x_{1,N_2}} \\ \frac{\partial y(z)}{\partial x_{2,1}} & \frac{\partial y(z)}{\partial x_{2,2}} & \cdots & \frac{\partial y(z)}{\partial x_{2,N_2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y(z)}{\partial x_{N_1,1}} & \frac{\partial y(z)}{\partial x_{N_1,2}} & \cdots & \frac{\partial y(z)}{\partial x_{N_1,N_2}} \end{bmatrix} \tag{1}
\]
But notice that we very well could have used the 3SRH to compute this as well. Here’s how:
1. Reshape \(y\) into a \((1 \times 1)\)-dimensional vector. Since \(y\) is a scalar, we’re already done.

2. Reshape \(x\) into a \((N_1 N_2 \times 1)\)-dimensional vector, stacking its entries row by row. We’ll call this reshaped vector \(\hat{x}\):

\[
\hat{x} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N_2} & x_{2,1} & \cdots & x_{N_1,N_2} \end{bmatrix}^\intercal
\]

3. Compute the partial derivative of the only component of \(y\) with respect to every component of \(\hat{x}\) as follows:

\[
\frac{\partial y(z)}{\partial \hat{x}} = \begin{bmatrix} \frac{\partial y(z)}{\partial x_{1,1}} & \frac{\partial y(z)}{\partial x_{1,2}} & \cdots & \frac{\partial y(z)}{\partial x_{1,N_2}} & \frac{\partial y(z)}{\partial x_{2,1}} & \cdots & \frac{\partial y(z)}{\partial x_{N_1,N_2}} \end{bmatrix}^\intercal \tag{2}
\]
Notice that we can reshape the vector in equation (2) into the matrix in equation (1).
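Here is a small illustrative sketch of the scalar-matrix case (JAX again, as an arbitrary choice of tooling). For \(y = \sum_{i,j} x_{i,j}^2\), the partial derivative with respect to \(x_{i,j}\) is \(2 x_{i,j}\), and the result can be viewed either as the matrix in equation (1) or as the flattened vector in equation (2):

```python
import jax
import jax.numpy as jnp

f = lambda X: jnp.sum(X ** 2)      # scalar y, matrix x
X = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])

dydX = jax.grad(f)(X)              # entry (i, j) holds dy/dx_{i,j}
print(dydX)                        # [[2. 4.] [6. 8.]] -- the matrix convention
print(dydX.reshape(-1, 1))         # the same partials, flattened as in the heuristic
```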
Vector-Scalar Derivatives#
As you might imagine, this case, where \(y = \begin{bmatrix}y_1 & y_2 & \ldots & y_M \end{bmatrix}^\intercal\) is a vector and \(x\) is a scalar, is very similar to the scalar-vector case above. Applying FSMD, we take the partial derivative of every component \(y_i\) of \(y\) with respect to every (i.e. the only) component of \(x\):

\[
\frac{\partial y(z)}{\partial x} = \begin{bmatrix} \frac{\partial y_1(z)}{\partial x} & \frac{\partial y_2(z)}{\partial x} & \cdots & \frac{\partial y_M(z)}{\partial x} \end{bmatrix}^\intercal
\]
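For example, with \(y(t) = \begin{bmatrix} t & t^2 & t^3 \end{bmatrix}^\intercal\), the derivative at \(t = 2\) is \(\begin{bmatrix} 1 & 4 & 12 \end{bmatrix}^\intercal\). A quick illustrative check (JAX again):

```python
import jax
import jax.numpy as jnp

f = lambda t: jnp.stack([t, t ** 2, t ** 3])   # vector y, scalar x
print(jax.jacobian(f)(2.0))                    # [ 1.  4. 12.] -- one partial per entry of y
```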
Vector-Vector Derivatives (a.k.a. Jacobian)#
When \(y = \begin{bmatrix}y_1 & \ldots & y_M \end{bmatrix}^\intercal\) is a vector of length \(M\) and \(x = \begin{bmatrix} x_1 & \ldots & x_N \end{bmatrix}^\intercal\) is a vector of length \(N\), the result \(\frac{\partial y(z)}{\partial x}\) is commonly called the Jacobian matrix of \(y\).
Name aside, the FSMD applies exactly the same as in previous cases: take the partial derivative of every entry \(y_i\) of \(y\) with respect to every entry \(x_j\) of \(x\):

\[
\frac{\partial y(z)}{\partial x} = \begin{bmatrix} \frac{\partial y_1(z)}{\partial x_1} & \frac{\partial y_1(z)}{\partial x_2} & \cdots & \frac{\partial y_1(z)}{\partial x_N} \\ \frac{\partial y_2(z)}{\partial x_1} & \frac{\partial y_2(z)}{\partial x_2} & \cdots & \frac{\partial y_2(z)}{\partial x_N} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_M(z)}{\partial x_1} & \frac{\partial y_M(z)}{\partial x_2} & \cdots & \frac{\partial y_M(z)}{\partial x_N} \end{bmatrix}
\]
Note that in this case, the structure of the Jacobian matches what we would have obtained using the 3SRH since \(y\) is already a \((M \times 1)\) vector and \(x\) is already a \((N \times 1)\) vector.
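As a concrete check, the Jacobian of the linear map \(y = Ax\) is simply \(A\). A short illustrative sketch (JAX once more):

```python
import jax
import jax.numpy as jnp

A = jnp.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])
f = lambda x: A @ x                  # M = 2 outputs, N = 3 inputs

J = jax.jacobian(f)(jnp.ones(3))
print(J.shape)                       # (2, 3) -- an (M x N) matrix
print(J)                             # recovers A, since dy_i/dx_j = A[i, j]
```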
Vector-Matrix Derivatives#
Suppose \(y = \begin{bmatrix} y_1 & y_2 & \ldots & y_M \end{bmatrix}^\intercal\) is a \((M \times 1)\) vector, and \(x\) is a \((N_1 \times N_2)\) matrix:

\[
x = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N_2} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,N_2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N_1,1} & x_{N_1,2} & \cdots & x_{N_1,N_2} \end{bmatrix}
\]
This is a case where many people might start scratching their heads. What does it look like to take the derivative of a vector with respect to a matrix? What kind of shape should the result have?
Remember, don’t panic–simply use the FSMD. This is where the 3SRH will come in handy.
First, we reshape \(y\) if needed. Since \(y\) is already in the shape \((M \times 1)\), we’re done.
Second, we reshape \(x\) into a \((N_1N_2 \times 1)\) vector, stacking its entries row by row. We’ll denote this vector as \(\hat{x}\):

\[
\hat{x} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N_2} & x_{2,1} & \cdots & x_{N_1,N_2} \end{bmatrix}^\intercal
\]
Finally, just like before, we take the partial derivative of every entry \(y_i\) of \(y\) with respect to every entry \(\hat{x}_j\) of \(\hat{x}\):

\[
\frac{\partial y(z)}{\partial \hat{x}} = \begin{bmatrix} \frac{\partial y_1(z)}{\partial \hat{x}_1} & \frac{\partial y_1(z)}{\partial \hat{x}_2} & \cdots & \frac{\partial y_1(z)}{\partial \hat{x}_{N_1 N_2}} \\ \frac{\partial y_2(z)}{\partial \hat{x}_1} & \frac{\partial y_2(z)}{\partial \hat{x}_2} & \cdots & \frac{\partial y_2(z)}{\partial \hat{x}_{N_1 N_2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_M(z)}{\partial \hat{x}_1} & \frac{\partial y_M(z)}{\partial \hat{x}_2} & \cdots & \frac{\partial y_M(z)}{\partial \hat{x}_{N_1 N_2}} \end{bmatrix}
\]
If you really want to reshape this result into a multi-dimensional array, you can do so in a principled manner, but we won’t go into that here.
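To see these shapes play out, here is an illustrative sketch for \(y = Xv\), where \(X\) is the matrix input and \(v\) is a fixed vector (JAX happens to return the multi-dimensional-array form, which flattens into the matrix described above):

```python
import jax
import jax.numpy as jnp

v = jnp.array([1.0, 2.0])
f = lambda X: X @ v                  # vector y (length M = 3), matrix x (3 x 2)
X = jnp.ones((3, 2))

J = jax.jacobian(f)(X)
print(J.shape)                       # (3, 3, 2) -- a (3 x 2) slice of partials per entry of y
print(J.reshape(3, -1).shape)        # (3, 6) -- the (M x N1*N2) matrix from the heuristic
```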
Matrix-Scalar Derivatives, Matrix-Vector Derivatives, Matrix-Matrix Derivatives…#
By now, I hope the pattern is clear! Use the fundamental strategy of multi-variable derivatives. When necessary, apply the 3-step reshaping heuristic if you can’t immediately figure out what shape the final result should have.
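As one last illustration (again a JAX sketch of my own choosing), here is a matrix-valued function of a matrix. The full derivative is a 4-dimensional array of partials, and the reshaping heuristic flattens it into an ordinary matrix:

```python
import jax
import jax.numpy as jnp

A = jnp.arange(6.0).reshape(2, 3)
f = lambda X: A @ X                  # y is (2 x 4), x is (3 x 4)
X = jnp.ones((3, 4))

J = jax.jacobian(f)(X)
print(J.shape)                       # (2, 4, 3, 4) -- every entry of y w.r.t. every entry of x
print(J.reshape(8, -1).shape)        # (8, 12) -- the (M x N) matrix after reshaping both sides
```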
Go forth and conquer.