Google Print: "We are scanning them to be read by an AI."

We are not scanning all those books to be read by people. We are scanning them to be read by an AI. (link)

Everyone sees a different elephant in the room with Google Print, so here’s mine. On Aug 1, it was revealed that Google won every category in NIST’s machine translation competition. In the evaluation you have, for example, a document in Arabic and are to translate it to English. The prototype translation is done by a human, and you are essentially ranked based on the 1-1 sentence matches. The general idea is that, “The closer a machine translation is to a professional human translation, the better it is.” (cite). A reasonable interpretation of Google’s .5131 score of Arabic-to-English is that 51.31% of sentences were equally good as compared to a human translation. That’s just phenomenal.
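For anyone who wants to see what a score like .5131 might be measuring: as far as I understand it, the NIST evaluation uses a BLEU-style metric, which counts overlapping word sequences between the machine output and the human reference rather than whole-sentence matches. Below is a minimal sketch of that idea, assuming a single reference, whitespace tokenization, and no smoothing. The function names are my own, and the real evaluation scores a whole test set against several references, so treat this as an illustration and not the official scorer.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_like(candidate, reference, max_n=4):
    """Simplified single-reference, BLEU-style score in [0, 1] (hypothetical helper)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by how often it occurs in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: punish candidates shorter than the reference.
    if len(cand) >= len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(bleu_like("the cat sat on the mat", reference))
    print(bleu_like("the cat sat on the mat today", reference))
```

Read this way, a high score means a large share of the short word sequences in the output also appear in the human translation, which is a looser claim than saying that share of sentences matched a human's.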

At a talk at my university last month, Google presented a graph showing a linear correlation between the size of the web, which is their translation corpus (in addition to a boatload of UN documents), and the accuracy of their translation. It was clear that they just needed more data. Problem solver that I am, I wondered, “Why don’t you just use all those e-mails in Gmail?” I kept my mouth shut, of course. If they’re doing it, their recruiters wouldn’t know about it.

So the jackpot is to be the sole owner of the Royal Library of Alexandria. You accomplish this by making deals with the biggest libraries in the world (sans the biggest, since they have their own digitization project) and giving each of them a digital copy of their own books, which makes you the only one with a copy of all the books. Then you feed it to your data-hungry AI.

Now, there are plenty of smart people discussing the implications of Google Print on copyright law, publishers, authors, etc. In fact, too many. Publishers and authors alike have missed the entire point of Google’s book scanning process. It is not about the Publishers or the Authors. It is to “organize the world’s information and make it universally accessible.” Google’s book scanning project is thus two-fold. If they can get away with it, they are going to show the large majority of entire books on the Internet, in line with their mission. But that is secondary. Suppose they can’t get away with it and can’t share all that data. Then there is something even more valuable than the LOA: the ability to translate any document to and from any language. The veritable Babel fish.

There is really only one thing at risk in this ensuing battle, and Google will not be the one who suffers for it. Google has, as far as I can tell, the world’s most sophisticated bookscanning assembly line and OCR technology. If they want to purchase some 55 million books and scan them themselves in the basement, they can do so without ever telling a soul, albeit that is the hard way. If the Publishers and other Authors in their greed (whom I am not lambasting wholesale, just those taking them to court) decide to scare them off, then they will be the ones who suffer, as their white knight who was going to rescue all those rotting titles from the depths of library hell, making starving authors a penny or two on their forgotten works, gallops away to quietly pursue their real research interest. (And perhaps this is not a traditional introduction to this widely debatable topic, but I wanted to beat the horse from a different angle.)

Absolutely fascinating. It’s like something from an Asimov novella.

Greed? Being paid for your work is greed? :rolleyes:

How, exactly, is Google going to make me any more pennies on my work? I am perfectly capable of creating an eBook out of each of my out-of-print titles, and I can distribute them as I choose (because, after all, they are mine). So our “White Knight” digitizes them and makes them available on the Web. Now people can find any particular snippets of information they want from my work without paying me, instead of buying a copy of my book–maybe I do get a few pennies instead of the $5.00 I’d charge for the eBook. Some White Knight.

I don’t know what you do for a living, alterego, but I’d be curious as to how you’d feel if Google decided they wanted a piece of your paycheck.

This one is better suited to Great Debates so I’ve moved it from GQ.

samclem GQ moderator

wonders how that happened

You have a valid point about the control of your work and not wanting Google to just be able to put it up on the web.

But Google does have something to offer you, and it will lead to more money made on your work (or, at least, on some works). Your distribution of ebooks on your own terms and profit from it is predicated on people actually knowing such works exist. The central repository Google’s creating probably won’t allow people to download major amounts of copyrighted works. But it will let them enter a term in a search engine and find a few surrounding pages. After which they can go through normal routes (library, book purchase, electronic purchase) to get the work available.

My understanding of the project was that it would make only paragraphs close to the search terms available, at least for books under copyright. I kind of wondered how useful this would be - the paragraph, out of context, is not what anyone would want.

Doing this as raw data for a translation project explains it. But wouldn’t one library be plenty?

We’ll see how the lawsuits go, but I think the opt-out policy is wrong. But getting out of print, lost books available in massive quantities is extremely valuable, and maybe now there will be a large company arguing against the lengthening of copyright terms we’ve been seeing driven by the likes of Disney.

I think the point is that while you may be perfectly capable of creating your book, and then turning it into an eBook, nobody is reading it, so it’s making you nothing. What’s more, there are a lot of great books published more than a year ago that wouldn’t get digitized and hence rot in a basement.

From what I understand, Google wants to digitize your book to make it searchable, such that somebody might actually be able to find it, read it, and subsequently pay you your hard-earned penny.

I, for one, welcome our new Google Overlords and hope that someday they’ll see me as worthy enough to be digitized and then searched.

Weird. I knew about this, but didn’t give it much thought. Then I did a search on my name, and found that I’m quoted in a book I’ve never even heard of!

I typed in the Japanese, “Bakayarou-me. Gambaranakya to itta no ni.” Not really complicated, this would translate as, “You idiot. Even though you said [we, etc.] had to try/keep trying.” Etc.

I wanted to see how it would handle this vague and ambiguous (but not difficult or obscure) phrase. What I got was this:

Holy fuck that’s bad.

OK, let’s try a really easy sentence: “Anata no inu ha ii inu desu.” Your dog is a nice dog.

Wrong! The misuse of the article is a crucial mistake, changing the entire sentence.

I don’t think pro translators have anything to worry about for a while.

How is Google planning on cross checking libraries and libraries of books with its translations manually?

It’s entirely possible that the prize-winning translator mentioned in the OP isn’t the same one as they make available online.

If all your readers want is the few snippets they could get from Google Print, then you should be more worried about public libraries, where patrons can read any book for free, and encyclopedias, which might contain any of the facts or memorable quotes from your book.

That seems pretty obvious, actually. Even simple German translations, which are relatively easy, are just done word-for-word, like Babelfish has for years. Also, their new system should by definition avoid things like “What called is you”, word combinations that would never be seen in an English sentence.
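To make the “What called is you” point concrete, here is a minimal sketch of why a system trained on lots of English text should steer away from word orders it has never seen. It assumes a plain bigram count over a tiny toy corpus that I made up for illustration. The function names are hypothetical, and I am not claiming this is how Google’s system works, only that this is the general flavor of the statistical approach.

```python
from collections import Counter
from itertools import permutations

# Toy English corpus, invented purely for illustration.
CORPUS = [
    "what is your name",
    "what are you called",
    "what is your dog called",
    "you are called a translator",
]


def bigram_counts(sentences):
    """Count adjacent word pairs, with sentence start/end markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def seen_bigrams(sentence, counts):
    """Number of adjacent word pairs in the sentence that occur in the corpus."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(1 for pair in zip(tokens, tokens[1:]) if counts[pair] > 0)


def best_ordering(words, counts):
    """Pick the word order whose adjacent pairs were seen most often."""
    candidates = (" ".join(order) for order in permutations(words))
    return max(candidates, key=lambda s: seen_bigrams(s, counts))


if __name__ == "__main__":
    counts = bigram_counts(CORPUS)
    print(seen_bigrams("what called is you", counts))
    print(seen_bigrams("what are you called", counts))
    print(best_ordering(["called", "you", "what", "are"], counts))
```

With more text, more legitimate word pairs get counted, which is presumably why the quality keeps climbing as the corpus grows.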

I think the point most people are missing is that the more info Google has referenceable, the more advanced the tools that can be built. It’s about access to information, and I seriously don’t see it disenfranchising authors or publishing companies at all, at least not for the next few decades. By then it might do such a thing in combination with a bunch of other social factors, but that would be a dramatic shift in the way we conduct commerce. I believe that shift is happening, but I think Google will be a part of the solution to the problem.

Also Google is going to be the most advanced AI on the planet, that is obvious, it will be learning from all of your great works. To me, being able to teach Google as it burgeons into this Intelligence is a far greater honor than any pennies on the dollar my copyright gets me.

Erek

You can send your book to Google, and then receive ad revenue for ads shown alongside it. In addition, they will display 4-5 prominent links to websites where people can purchase your book immediately. And the only way they will ever get to this page is if they were actively seeking out the contextual themes in your book.

But suppose you don’t send your book to Google. Suppose instead that your book is in a few libraries and used bookstores around the country, or maybe just buried deep in the stacks of the LOC. And someone comes over to Google who is interested in a topic related to your book. They press enter and are shown…well…whatever they are shown it’s not your book.

No, they want the whole book. Now that they know the topic is in the book and there are those shiny links to purchase it, they are more likely to buy it than had they never heard of the book in the first place.

Last I heard, the UN documents had the translator sounding a little too much like legalese. They could also be protecting their technology by not making it available for their competitors to scour. Hard to say, but I’m pretty sure it’s not the same translator you’re seeing.

This whole thing smacks of DVD country codes: We’ve got ALL the information you want right here, but you can’t have it! Because it’s cop:dubious:y:dubious:right :dubious:ed.

Information wants to be free.

This makes me think that that Googlezon video thing might happen, haha. But online translators still suck because sentence structures in other languages are sometimes quite different from English. Unless they can make it so it arranges the words to sound more like standard English, it is lame.

I would have to disagree.

I would say that a good portion of our intelligence is based on our ability to model our environment in our brains and to make accurate predictions about cause and effect. With respect to something physical this could mean realizing that we can’t walk through a wall.

This also means we have a mapping from words and sentences into our internal model. For example, if I say “can you walk through that wall,” you will understand what that means not because you memorized every word in our language but because you have experience that tells you how the meaning in those words maps into your internal model of your environment. And due to prior experience you will know whether you can or can’t walk through the wall.

Google will have none of this real world experience with which to create a model of an environment like we do, which will severely limit its ability to understand much of the data it absorbs.

What Google will/does have is a ton of data, and a good system of mapping input strings (search requests) with the most popular/relevant results.

That is SO not what Google is doing. If you’d go read their FAQ instead of just making kneejerk reactions you would see that.