In machine learning (which I am trying to learn about, rather unsuccessfully), the ‘cost function’ is an important tool. In the examples I’ve seen, the cost function looks to be nothing more than the variance between the output of the neural network in question and its desired or ideal output.
Can neural networks have cost functions that are based on measures other than variance? I assume the answer is yes and that their cost functions can even be non-parametrically based. Otherwise their capacity to learn would be fatally flawed, no?
In addition to your always appreciated insights, I’d be really interested in any links or resources on this topic that you’d recommend (at a basic level ;)).
Thanks,
(Rather than embedding links for the key terms in my post above, I’ll just say that anyone interested should really consider checking out the videos on the subject by 3Blue1Brown, e.g. What is a neural network? and How neural networks learn. As with everything on that channel, they are phenomenal.)
I am still unclear, though, since in the link the loss function ultimately depends on the variance, i.e. Σ xᵢ².
Could a loss function be defined to depend on something other than the sum of the squares of the distances from the best answer? Why not the 4th power, or the 34th, or a polynomial of some sort, or any other metric?
I think I am missing a key point - must the distances (as usually defined) be incorporated?
Yes, it absolutely can, but loss functions are often based on variance because it behaves nicely (smoothly) when you differentiate your equations. And we already know lots of properties of variances that we don’t necessarily know about other moments of the data (without a bunch of calculation).
Another common one is mean absolute error: sum(|x - mean|) (those are absolute value bars).
Oh, and to ‘must the distances be incorporated’ - yes, because that goes to the heart of what the loss function is doing.
Basically, when training, you take all your inputs, make a prediction, then look at the difference between your prediction and reality. The difference between them is your ‘loss’ - for that data point, and for however you define ‘distance’ (different loss functions are different ways of measuring ‘distance from correctness’). The total loss is the sum of all the individual losses.
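To make that concrete, here is a minimal sketch of the bookkeeping (plain Python, made-up numbers, not from any particular library) comparing the squared-error and absolute-error versions of ‘distance from correctness’:

# Hypothetical targets and the network's predictions for them (made-up numbers).
targets     = [3.0, -1.0, 2.5, 0.0]
predictions = [2.8, -0.5, 2.0, 0.3]

# Per-point 'distance from correctness', one number per data point.
squared_errors  = [(p - t) ** 2 for p, t in zip(predictions, targets)]
absolute_errors = [abs(p - t)   for p, t in zip(predictions, targets)]

# Total loss is just the sum (or mean) of the individual losses.
mse = sum(squared_errors) / len(targets)    # mean squared error (the variance-like one)
mae = sum(absolute_errors) / len(targets)   # mean absolute error
print(mse, mae)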
Yep. Sometimes the variance is multiplied by a sigmoidal or inverse tangent function, too, which works because both of those are well-behaved functions that are easy to differentiate.
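For what it’s worth, the ‘easy to differentiate’ part is that both of those functions have tidy closed-form derivatives. A tiny sketch (plain Python; the function names are mine, nothing here is from a library):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # sigma'(x) = sigma(x) * (1 - sigma(x))

def arctan_deriv(x):
    return 1.0 / (1.0 + x * x)      # d/dx arctan(x) = 1 / (1 + x^2)

print(sigmoid_deriv(0.5), arctan_deriv(0.5))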
That makes perfect sense. But I guess it comes at the potential cost of missing a ‘better’ measure, albeit one whose parameters are tough to model or work with.
ISTM this is like a question I had a long time ago about how, when finding the best fit for a set of data points (say, 2D), you would assume in advance there was a linear or a log-linear relationship. I always wondered why it couldn’t be some other, better-fitting curve, a polynomial, or a transcendental function. The answer, I think, is that you never can know when you have only finitely many data points.
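As a quick illustration of that last point (numpy, made-up points): a degree-(n-1) polynomial passes exactly through any n points, so a perfect fit by itself can’t tell you which family of curves is the ‘right’ one.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 4.8])     # roughly linear, with a little noise

line  = np.polyfit(x, y, 1)                 # two-parameter straight-line fit
exact = np.polyfit(x, y, len(x) - 1)        # degree-4 polynomial through all 5 points

print(np.polyval(line, x) - y)              # small but nonzero residuals
print(np.polyval(exact, x) - y)             # essentially zero, yet probably a worse model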
ETA:
Very helpful! The inverse tan was actually one of the factors I had thought about (for the wrong reason, mind you).
I should add that sometimes the loss function is selected based on the kind of errors expected in the signal. If you want a function that rejects one wild point and goes right through all the other points, a max(absolute value) (the L-inf norm; see L^infty-Norm at Wolfram MathWorld) is a good choice, but I think (don’t recall for sure) it has computational issues (though you can approximate it with an L-100 norm, which would be differentiable).
I got the above wrong. L2 is great for systems with Gaussian noise (which is, of course, quite common). L1 works well for the “single wild point” case. L-inf works well for the uniform noise case.
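A quick sketch of how the three norms weigh the same set of residuals (plain Python, made-up numbers), which is really all the above boils down to:

residuals = [0.1, -0.2, 0.15, -0.1, 5.0]    # one wild point at the end

l1   = sum(abs(r) for r in residuals)       # the outlier adds only |5.0|
l2   = sum(r * r  for r in residuals)       # the outlier contributes 25.0 and dominates
linf = max(abs(r) for r in residuals)       # determined entirely by the worst point

print(l1, l2, linf)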
Outstanding link. My question is answered. Even the wiki link (in your link) on loss functions (that I obviously hadn’t seen) tells me what I need to know.
Next up, stochastic gradient descent as a form of simulated annealing. I wish.
A somewhat relevant tangent:
Deep networks (many layers) were traditionally difficult, if not impossible, to train using these methods because different layers end up learning at very different rates.
Some key insights from Bengio, Hinton, and LeCun resulted in learning algorithms that worked for deep networks, which created the recent explosion in effective neural network usage.