“We are not scanning all those books to be read by people. We are scanning them to be read by an AI.” link
Everyone sees a different elephant in the room with Google Print, so here’s mine. On Aug 1, it was revealed that Google won every category in NIST’s machine translation competition. In the evaluation you are given, for example, a document in Arabic and must translate it into English. The reference translation is done by a human, and you are essentially scored on how closely your sentences match it. The general idea is that, “The closer a machine translation is to a professional human translation, the better it is.” (cite). A reasonable interpretation of Google’s .5131 score on Arabic-to-English is that 51.31% of its output was as good as a professional human translation. That’s just phenomenal.
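The metric behind these evaluations, BLEU, doesn’t actually count whole-sentence matches; it credits overlapping n-grams between the machine output and the human reference. Here is a minimal single-reference sketch of the idea (real NIST scoring uses multiple references and corpus-level statistics, so this is an illustration, not the official scorer):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference (counts clipped)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n gram precisions, times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty discourages gaming the score with very short outputs.
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_avg)
```

A perfect match scores 1.0, and a translation sharing no n-grams with the reference scores 0.0, so a corpus-level score like .5131 sits in between: lots of overlap with the human translation, but far from verbatim agreement.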
At a talk at my university last month, Google presented a graph showing a linear correlation between the size of their translation corpus (the web, in addition to a boatload of UN documents) and the accuracy of their translations. It was clear that they just needed more data. Problem solver that I am, I wondered, “Why don’t you just use all those e-mails in Gmail?” I kept my mouth shut, of course. If they’re doing it, their recruiters wouldn’t know about it.
So the jackpot is to be the sole owner of the Royal Library of Alexandria. You accomplish this by making deals with the biggest libraries in the world (sans the biggest, since they have their own digitization project), giving each of them a digital copy of their own books, and thereby making yourself the only one with a copy of all the books. Then you feed it to your data-hungry AI.
Now, there are plenty of smart people discussing the implications of Google Print for copyright law, publishers, authors, etc. In fact, too many. Publishers and authors alike have missed the entire point of Google’s book-scanning project. It is not about the publishers or the authors. It is to “organize the world’s information and make it universally accessible.” Google’s book-scanning project is thus two-fold. If they can get away with it, they are going to show the large majority of entire books on the Internet, in line with their mission. But that is secondary. Suppose they can’t get away with sharing all that data. Then there is something even more valuable than the LOA: the ability to translate any document to and from any language. The veritable Babel fish.
There is really only one thing at risk in this ensuing battle, and Google will not be the one who suffers for it. Google has, as far as I can tell, the world’s most sophisticated book-scanning assembly line and OCR technology. If they want to purchase some 55 million books and scan them themselves in the basement, they can do so without ever telling a soul, though that is the hard way. If the publishers and authors in their greed (whom I am not lambasting wholesale, just those taking Google to court) decide to scare them off, then they will be the ones who suffer, as their white knight, who was going to rescue all those rotting titles from the depths of library hell and make starving authors a penny or two on their forgotten works, gallops away to quietly pursue its real research interest. (Perhaps not a traditional introduction to this widely debatable topic, but I wanted to beat the horse from a different angle.)