In machine learning (which I am trying to learn about, rather unsuccessfully), the ‘cost function’ is an important tool. In the examples I’ve seen, the cost function looks to be nothing more than the variance between the output of the neural network in question and its desired or ideal output.
Can neural networks have cost functions that are based on measures other than variance? I assume the answer is yes and that their cost functions can even be non-parametrically based. Otherwise their capacity to learn would be fatally flawed, no?
In addition to your always appreciated insights, I’d be really interested in any links or resources on this topic that you’d recommend (at a basic level ;)).
Thanks,
(Rather than embedding links for the key terms in my post above, I’ll just say that anyone interested should really consider checking out the videos on the subject by 3Blue1Brown, e.g. What is a neural network? and How neural networks learn. As with everything on that channel, they are phenomenal.)
I am still unclear, though, since in the link the loss function ultimately depends on the variance, i.e. Σ xᵢ².
Could a loss function be defined to depend on something other than the sum of the squares of the distances from the best answer? Why not the 4th power, or the 34th, or a polynomial of some sort, or any other metric?
I think I am missing a key point - must the distances (as usually defined) be incorporated?
Yes, it absolutely can, but loss functions are often based on variance because it behaves nicely (smoothly) when you differentiate your equations. And we already know lots of properties of variances that we don’t necessarily know about other moments of the data (without a bunch of calculation).
Another common one is mean absolute error: sum(|x - mean|) (those are absolute value bars).
Oh, and to ‘must the distances be incorporated’ - yes, because that goes to the heart of what the loss function is doing.
Basically, when training, you take all your inputs, make a prediction, then look at the difference between your prediction and reality. The difference between them is your ‘loss’ - for that data point, and for however you define ‘distance’ (different loss functions are different ways of measuring ‘distance from correctness’). The total loss is the sum of all the individual losses.
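To make that concrete, here is a minimal sketch of the bookkeeping (plain Python, made-up numbers, not from any particular library) comparing the squared-error and absolute-error versions of ‘distance from correctness’:

# Hypothetical targets and the network's predictions for them (made-up numbers).
targets     = [3.0, -1.0, 2.5, 0.0]
predictions = [2.8, -0.5, 2.0, 0.3]

# Per-point 'distance from correctness', one number per data point.
squared_errors  = [(p - t) ** 2 for p, t in zip(predictions, targets)]
absolute_errors = [abs(p - t)   for p, t in zip(predictions, targets)]

# Total loss is just the sum (or mean) of the individual losses.
mse = sum(squared_errors) / len(targets)    # mean squared error (the variance-like one)
mae = sum(absolute_errors) / len(targets)   # mean absolute error
print(mse, mae)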
Yep. Sometimes the variance is multiplied by a sigmoidal or inverse tangent function, too, which works because both of those are well-behaved functions that are easy to differentiate.
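For what it’s worth, the ‘easy to differentiate’ part is that both of those functions have tidy closed-form derivatives. A tiny sketch (plain Python; the function names are mine, nothing here is from a library):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # sigma'(x) = sigma(x) * (1 - sigma(x))

def arctan_deriv(x):
    return 1.0 / (1.0 + x * x)      # d/dx arctan(x) = 1 / (1 + x^2)

print(sigmoid_deriv(0.5), arctan_deriv(0.5))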
That makes perfect sense. But I guess it comes at the potential cost of missing a ‘better’ measure, albeit one whose parameters are tough to model or work with.
ISTM this is like a question I had a long time ago about how, when finding the best fit for a set of data points (say, 2D), you would assume in advance there was a linear or a log-linear relationship. I always wondered why it couldn’t be some other, better-fitting curve, a polynomial, or a transcendental function. The answer, I think, is that you never can know when you have only finitely many data points.
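As a quick illustration of that last point (numpy, made-up points): a degree-(n-1) polynomial passes exactly through any n points, so a perfect fit by itself can’t tell you which family of curves is the ‘right’ one.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 4.8])     # roughly linear, with a little noise

line  = np.polyfit(x, y, 1)                 # two-parameter straight-line fit
exact = np.polyfit(x, y, len(x) - 1)        # degree-4 polynomial through all 5 points

print(np.polyval(line, x) - y)              # small but nonzero residuals
print(np.polyval(exact, x) - y)             # essentially zero, yet probably a worse model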
ETA:
Very helpful! The inverse tan was actually one of the factors I had thought about (for the wrong reason, mind you).
I should add that sometimes the loss function is selected based on the kind of errors expected in the signal. If you want a function that rejects one wild point and goes right through all the other points, a max(absolute value) (the L-inf norm; see L^infty-Norm at Wolfram MathWorld) is a good choice, but I think (don’t recall for sure) it has computational issues (though you can approximate it with an L-100 norm, which would be differentiable).
I got the above wrong. L2 is great for systems with Gaussian noise (which is, of course, quite common). L1 works well for the “single wild point” case. L-inf works well for the uniform noise case.
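A quick sketch of how the three norms weigh the same set of residuals (plain Python, made-up numbers), which is really all the above boils down to:

residuals = [0.1, -0.2, 0.15, -0.1, 5.0]    # one wild point at the end

l1   = sum(abs(r) for r in residuals)       # the outlier adds only |5.0|
l2   = sum(r * r  for r in residuals)       # the outlier contributes 25.0 and dominates
linf = max(abs(r) for r in residuals)       # determined entirely by the worst point

print(l1, l2, linf)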
Outstanding link. My question is answered. Even the wiki link (in your link) on loss functions (that I obviously hadn’t seen) tells me what I need to know.
Next up, stochastic gradient descent as a form of simulated annealing. I wish.
A somewhat relevant tangent:
Deep networks (many layers) were traditionally difficult, if not impossible, to train using these methods because different layers end up learning at very different rates.
Some key insights from Bengio, Hinton, and LeCun resulted in learning algorithms that worked for deep networks, which created the recent explosion in effective neural network usage.