Well, understanding is one thing and getting your task done is another; they’re related, but you don’t necessarily need the former in full to do the latter. Still, I’ll concentrate on the former anyway:

I think we teach multivariable calculus in a somewhat needlessly confusing way, largely out of historical reasons more than anything else. Let me try to give you some better intuition so that you can understand how derivatives, the gradient, and the Jacobian are all fundamentally the same concept (just called different things depending on the number of dimensions).

In the following, I’ll write “x |-> x[sup]2[/sup]” to mean the function which sends an input x to the output x[sup]2[/sup], “e |-> 8e” to mean the function which multiplies its input by 8, and so on.

A function’s derivative at a point is the local linear approximation to the function at that point. What do I mean by this?

Well, for starters, let’s introduce the idea of a linear function. We call a function G linear if G(a + b + c + …) = G(a) + G(b) + G(c) + … . For example, e |-> 8e and e |-> -e are linear, while e |-> e[sup]3[/sup] and e |-> sin(e) are not. Functions which distribute across addition like this are of course very convenient; hence our interest in them.
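If you want to see this distributivity concretely, here’s a quick numerical sketch of my own (not anything from the discussion above) that tests G(a + b) = G(a) + G(b) on a few sample pairs:

```python
def is_additive(G, samples):
    """Check G(a + b) == G(a) + G(b) on sample pairs, within float tolerance."""
    return all(abs(G(a + b) - (G(a) + G(b))) < 1e-9 for a, b in samples)

samples = [(1.0, 2.0), (-3.0, 0.5), (4.0, -4.0)]

print(is_additive(lambda e: 8 * e, samples))   # e |-> 8e distributes: True
print(is_additive(lambda e: e ** 3, samples))  # e |-> e^3 does not: False
```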

Now, if F is some function, and p is some point, then the derivative of F at p is itself a function. I know, you don’t normally think of it that way. But you should! The derivative of x |-> x[sup]2[/sup] at 4 isn’t really 8; it’s the function e |-> 8e (i.e., “multiply by 8”).

Finally, the purpose of the derivative of a function at a point is to describe, to the best linear approximation, how small changes in the function’s input around that point cause changes in its output. That is, if G is the derivative of F at p, then F(p + e) should be approximately equal to F(p) + G(e) for small e. For example, when we say that e |-> 8e is the derivative of x |-> x[sup]2[/sup] at the point 4, what we mean is that (4 + e)[sup]2[/sup] is approximately equal to 4[sup]2[/sup] + 8e, for small e (this being the best approximation we can give using a linear function).
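You can watch this approximation at work numerically. Here’s a small sketch (mine, just for illustration) showing that the error in F(p) + G(e) shrinks like e[sup]2[/sup], since (4 + e)[sup]2[/sup] = 16 + 8e + e[sup]2[/sup]:

```python
F = lambda x: x ** 2   # the function
G = lambda e: 8 * e    # its derivative at p = 4, as a linear function

p = 4.0
for e in (0.1, 0.01, 0.001):
    exact = F(p + e)
    approx = F(p) + G(e)
    # the gap is e^2 (up to floating-point rounding), vanishing fast as e shrinks
    print(e, exact - approx)
```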

Alright; what of all this? Well, the point is, nothing I said above assumes F takes in one-dimensional input or produces one-dimensional output. That’s just one common case. But we can look at functions from any number of dimensions to any number of dimensions as well.

Now, if F is a function from n-dimensional space to m-dimensional space, so are its derivatives at any point. And, as you probably are aware, a linear function from n-dimensional space to m-dimensional space can be represented by an m-by-n matrix of numbers.

In particular, if F is a function from 1-dimensional space to 1-dimensional space, then its derivative at any point can be represented by a 1-by-1 matrix of numbers; i.e., by a single number. This, as we saw above, is what’s going on when we say the derivative of x |-> x[sup]2[/sup] at 4 is 8; the 8 is shorthand for the function e |-> 8e which it represents.

If F is a function from n-dimensional space to 1-dimensional space, then its derivative at any point can be represented by a 1-by-n matrix of numbers; i.e., by an n-dimensional vector. This is what’s going on when we say the “gradient” of <x, y> |-> x[sup]2[/sup] + y[sup]2[/sup] at <4, 3> is <8, 6>; the <8, 6> is shorthand for the function <e, f> |-> 8e + 6f which it represents (a linear function from 2-dimensional space to 1-dimensional space), and “gradient” is just a fancy word for “derivative of a function from n-dimensional space to 1-dimensional space”. (It happens to be the case that the vector representing the derivative in this case always points in the direction of greatest rate of increase and has magnitude equal to the rate of increase in that direction, but don’t think of that as the fundamental definition of the gradient; the gradient is just the derivative by another name.)
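Same check as before, one dimension up. This is my own little sketch of the example just given: F(4 + e, 3 + f) should be close to F(4, 3) + 8e + 6f for small e and f, and indeed the error is exactly e[sup]2[/sup] + f[sup]2[/sup] here:

```python
F = lambda x, y: x ** 2 + y ** 2   # function from 2D to 1D
G = lambda e, f: 8 * e + 6 * f     # its derivative at (4, 3): the gradient <8, 6> as a linear map

e, f = 0.01, -0.02
print(F(4 + e, 3 + f))        # exact value
print(F(4, 3) + G(e, f))      # linear approximation: 25 + 8e + 6f, close to exact
```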

Finally, if F is a function from n-dimensional space to m-dimensional space, then its derivative at any point can be represented by a full m-by-n matrix of numbers, and in this case the fancy word we use for it is “Jacobian”.
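To make the general case concrete, here’s a sketch with a function of my own choosing (not from the text): F(x, y) = (x[sup]2[/sup]y, x + sin y), a map from 2-dimensional to 2-dimensional space, whose Jacobian at (x, y) is the 2-by-2 matrix [[2xy, x[sup]2[/sup]], [1, cos y]]. Applying that matrix to a small displacement approximates how F’s output moves:

```python
import math

def F(x, y):
    # a map from R^2 to R^2 (my own example)
    return (x ** 2 * y, x + math.sin(y))

def jacobian_at(x, y):
    # the derivative of F at (x, y), as a 2-by-2 matrix
    return [[2 * x * y, x ** 2],
            [1.0, math.cos(y)]]

def apply(M, v):
    """Apply a 2-by-2 matrix to a 2-vector: the linear map the matrix represents."""
    return (M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1])

p = (1.0, 2.0)
e = (0.001, -0.002)

base = F(*p)
step = apply(jacobian_at(*p), e)

exact = F(p[0] + e[0], p[1] + e[1])
approx = (base[0] + step[0], base[1] + step[1])
print(exact)
print(approx)  # close to exact for small e, as in the 1D case
```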

(And “partial derivatives”? That’s just when you speak of particular entries in this matrix, rather than the entire matrix)
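And since each matrix entry is just “how does output i respond to a nudge in input j”, you can estimate any one of them by nudging one input at a time. A quick finite-difference sketch, again with a throwaway function of my own:

```python
def F(x, y):
    # hypothetical map from R^2 to R^2; Jacobian is [[y, x], [1, 1]]
    return (x * y, x + y)

h = 1e-6
p = (2.0, 3.0)

# the (0, 0) entry of the Jacobian: nudge x, watch the first output
d0_dx = (F(p[0] + h, p[1])[0] - F(*p)[0]) / h
print(round(d0_dx, 3))  # should come out near y = 3
```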