It is not any more differentiable than the Heaviside function, though. What are the technical conditions on the function, and how do I judge which one is optimal?
Sorry, yeah–differentiability itself isn’t exactly the criterion. I guess the better answer is that it needs a non-zero derivative over significant parts of its range.
As said, it needs to be non-linear, since most interesting functions are non-linear, and you can’t build a non-linear function from linear ones.
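To make that concrete, here’s a toy sketch (my own example in Python/numpy, not from anything posted above): two stacked linear layers with no activation between them collapse into a single linear layer, so the extra depth buys you nothing.

```python
import numpy as np

# Two linear layers with no activation in between are equivalent to one
# linear layer, so a "deep" linear network can't represent anything a
# single linear map couldn't. (Toy illustration; sizes are arbitrary.)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2          # "deep" network, no non-linearity
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)   # the equivalent single layer
print(np.allclose(two_layer, collapsed))      # True
```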
The non-zero derivative is so backpropagation works. The question is just: given the difference between the current output and the output we want, how do we tweak the weights to reduce that error? And the answer is to use the derivative: multiply it by the error, and that tells you how much (and in which direction) to nudge each weight.
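If it helps, here’s roughly what that looks like for a single sigmoid neuron with a squared-error loss (a minimal sketch; the variable names and learning rate are just my assumptions, not anything from the thread):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 0.8])   # inputs to the neuron
w = np.array([0.1, 0.4, -0.3])   # weights we want to tweak
target = 1.0                      # the output we want
lr = 0.1                          # learning rate (step size)

out = sigmoid(w @ x)
error = target - out              # how far off we are
slope = out * (1.0 - out)         # derivative of the sigmoid at this point
w += lr * error * slope * x       # derivative times error drives the weight update
```

If the derivative were zero everywhere (like the flat parts of the Heaviside step), that last line would never change anything, which is the whole point about needing a non-zero derivative.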
You don’t want the function to have huge amounts of compression in some regions but not others. Otherwise you’re wasting bits, because large portions of the input range get squeezed into small portions of the output range. You need some compression due to the non-linearity, but you don’t want to go crazy.
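A quick way to see that “compression” with something like the sigmoid (again just a toy check, not from any paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Past about |x| = 5 the sigmoid squeezes the whole input range into a
# sliver of output, and its derivative there is essentially zero, so
# learning stalls for inputs that land in those regions.
for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(f"x={x:5.1f}  sigmoid={s:.6f}  derivative={s * (1 - s):.6f}")
```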
No one really knows how to judge whether one is optimal. It’s still an emerging technology. ReLU has been shown, in practice, to work a little better than sigmoid and some others. About the only really solid criterion is the hardware requirement, which is trivial for ReLU (it can be implemented in a few dozen gates), whereas sigmoid needs an exponential and a divide, both expensive in hardware. Of course, there’s still plenty of room for experimentation with other functions.
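Here’s the software analogue of that point (it doesn’t capture the gate-count argument, it’s just a rough illustration of why ReLU is the cheap one):

```python
import timeit
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # one comparison per element

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # an exponential and a divide per element

x = np.random.randn(1_000_000)
print("relu:   ", timeit.timeit(lambda: relu(x), number=100))
print("sigmoid:", timeit.timeit(lambda: sigmoid(x), number=100))
```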
Right, linear activation functions can only do linear separations of examples. Boring.
As far as I can tell, it isn’t known what makes an activation function optimal. Lots of comparisons are done on test data sets, and the two things generally compared are: 1) final prediction accuracy, 2) training speed. So ReLU is faster and gives better results in many cases, but the reasons for that probably aren’t as well understood as they could be.
I’ve seen a few papers that have neural networks try to learn the activation function for a second neural network and gotten some novel possibilities.
Sorry about all the elementary questions, but what functions did they get? Did the functions vary depending on the architecture of the second neural network, or not so much?
So, I’ve been trying to find the paper I was thinking of and failing. As I recall, there were two "families" of functions that worked best. One had the basic form ln(a + e^x), which is basically the "softplus", a smoothed ReLU. I don’t recall the form of the other family.
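For anyone curious how close that family is to ReLU, here’s a quick numerical check with a = 1, which gives the standard softplus (just my own sanity check, not from the paper I couldn’t find):

```python
import numpy as np

def softplus(x, a=1.0):
    return np.log(a + np.exp(x))   # ln(a + e^x); a = 1 is the usual softplus

def relu(x):
    return np.maximum(0.0, x)

# Softplus tracks ReLU for large |x| but is smooth (and has a non-zero
# derivative) around zero, where ReLU has a kink.
for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"x={x:5.1f}  relu={relu(x):7.4f}  softplus={softplus(x):7.4f}")
```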
Sounds like the perfect problem for AI to solve!
I hope you can find it, I’d like to read it.
It’s here:
https://arxiv.org/pdf/1712.01815.pdf
If I’m understanding this right, they trained it for four hours, without any inputs other than the rules of chess. So no opening book, no grandmaster games. After that training, they had it play a 100-game match against the world computer chess champion, Stockfish. AlphaZero won 28 and drew 72, and didn’t lose a single game.
Some of the games are here. Crazy stuff in some of them.
That is astonishing. I’ll try to find time to play through some of those games. Thank you for posting this.
Wow. What was the white-black breakdown on those wins? And when AlphaZero plays against itself, what’s the white-black record?
When AlphaZero was white, it won 25 and drew 25. As black, it won 3 and drew 47. There aren’t any white/black statistics from the training, unless I missed them. The paper does have some neat graphs, showing how the algorithm played openings at different rates as it learned.
As I understand it, they’re going to release some further information, including some games from earlier in the learning process.
I don’t mean training games; I mean games between the fully-trained program and itself (well, I’m sure it continues learning, but it’s probably close enough to the asymptote by now that a hundred more games won’t move the needle much). I ask because a strong white-black skew could be expected to be one sign of a high degree of chess mastery.
This is just crazy stuff. Look at White’s development after move 12. Check out 26. Qh1! It’s just nuts the things these computers are able to get away with.
Pretty cool stuff. I figured at some point AI and chess would get to this (just feed it the rules, and it figures out best play for itself), but it’s pretty cool to be at that point already. That’s just amazing. I’m curious to see how the computer-learned games develop chess theory, and whether the computer finds any weird strategic curveballs that go against conventional theory. (IIRC, with Go there were a number of moves that went against conventional Go wisdom and opened up areas of theoretical interest, but I’m not that familiar with Go.) So far, from what I understand in the paper, it seems to have settled on a number of established openings on its own, so there’s nothing too surprising there that I see.
Can’t wait to see what more comes of this.
Whoa. That certainly does not look like your usual opening development with three central pawns missing and pieces exposed like that.
OK, I’m not a chess expert like you guys, but the move that seems most surprising to me is one of Stockfish’s (which should be more or less conventional, I think): After 32. c4, why doesn’t Black just take the pawn? Surely snatching up a free pawn (and getting two of your own passed in the process) should be a better opportunity than just making some feeble menacing facial expressions at White’s bishop (which is worth less than the rook threatening it, and which is defended)?
You can actually play it out if you want on that website. (Just move the black pawn to capture, and follow the lines below.) I do see that if you make the move where black captures c4, white all of a sudden jumps to a huge advantage (of a piece), and all sorts of bad things happen with a response of 33. f4.
Unfortunately, after a few moves, it doesn’t seem to let you analyze with the full depth (it limits it to 8 levels), so I can’t quite see what’s going on other than a lot of pressure and coordination of pieces against black’s king side. Maybe an actual expert like glee can chime in? But, whatever the case, the analysis seems to suggest that taking the pawn is not good news for black in the long run.
Oh, obviously, or the (second-)best player in the world wouldn’t have done it. I just can’t see what the not-good-news is.
And on the scoring analysis, it also thinks that the move immediately prior to that by White, which enabled the capture, was a horrible mistake.
I’ve messed around with that position for a while, and it looks like the issue is that white wants to make use of the c4 square with his queen. White is down material and attacking, so he wants to open up the opponent’s king, with something like 32.f4. Black is trying to avoid opening lines, so he can try 32…g4. So you could get something like this:
32.f4 g4 33.Qxg4 Bxc3 34.f5 Rxf5 35.Bh6+ Kf7
If the black b-pawn wasn’t covering c4, white would have Qc4+. So white tries to insert c4 bxc4 prior to going f4. Obviously there are many other lines to look at as well, but I think that’s the idea.
I’ve read a bit more about this, and some people are pointing out that the hardware used for the Stockfish match may not really have been fair. AlphaZero was running on high-powered custom hardware, while Stockfish wasn’t, and Stockfish was given a relatively small hash table and no access to tablebases. So maybe the result isn’t quite so significant in terms of the strength of the AlphaZero engine itself. If you put Stockfish on similarly powerful hardware, it would probably beat the testing version of Stockfish handily as well.