Google Print: "We are scanning them to be read by an AI."

I did actually read it, Dan. I’m curious which parts of my statements you disagree with.

Let’s say I’ve written a technical book about designing flammuses. Today, if you want to design flammuses, you purchase my book. In the Google Print model, you search for some particular keywords related to the part of the design you’re having trouble with. Up comes the appropriate page, containing diagrams, schematics, tables, or whatever. Each time you have a problem or question, Google Print will dutifully show you the appropriate page of the book. You have no need to buy it. I don’t get paid for my work. “But wait,” you say. “You get income from Google Ads.” Read on…

My technical books accomplish two things. First, they provide me some income when people buy them…a lot more than the few pennies that Google Print will pay. Second, they provide me with consulting income because people see from the books that I’m an expert in the field, and hire me to help with their projects. The Google Print ads provide a convenient way for Google to direct readers to one of my competitors. Right there, beneath a page of my book, will be an ad for some other expert in the field. So I lose book sales and I lose potential consulting work.

Explain again how this helps me.

You missed my whole point, Alterego. My books aren’t hidden away. I have a Web site of my own with pertinent information and excerpts, and it’s already indexed by Google. The difference is that Google today doesn’t get to decide what parts of my books are available and where to direct customers. I get to decide that.

A lovely fluffy little quote, isn’t it? But think about it for a moment. Do you really think people are going to spend a year of their life putting their knowledge and expertise into words if they don’t get paid for it? Are publishers going to be eager to pay copyeditors, proofreaders, printers, illustrators, and authors to produce a book when their revenue model is a few pennies from a Google ad? If copyright goes away, people will stop writing the books!

Sure, there will be technical articles. There will be some Web sites where people with questionable credentials post unedited and non-verified information in formats that may or may not work with your Web browser. But copyrights provide the incentive for skilled individuals to put the work and effort into producing books.

Where on Earth did that come from? You have no idea what I’ve written, how it’s placed online, what it’s about, or how many people have purchased it. You don’t know how many books I’ve written, how many Web sites I have, how much money I make from them, or how well-indexed they are by Google today. How can you make a statement like that based on such staggering ignorance? Oh, wait. This is GD. Um… cite?

This is simply untrue. Unless the book is submitted to Google through the Google Publishers’ program or is in the public domain, all that will come is a brief paragraph stating something along the lines of:

The term flammus appears on the following pages of the work Flammus Design and You by InvisibleWombat: 3, 22, 897. The first appearance is shown below:

…ever again. Flammus design first came to the public’s attention…

To purchase a copy of the book, use one of the following links:

That’s it. No whole pages, no diagrams, nothing. Just enough, hopefully, for someone to decide if the book is worth finding. If that’s violating copyright, so is every search engine ever invented.

You do? Why not just find a free reference written by someone who has purchased your book and made the facts therein available to the public?

That reminds me of a crazy idea I had the other day… what if there were a parallel universe where people skilled in some field, like computer programming, sometimes spent hundreds or thousands of hours writing stuff just to give it away for free? What if they even gave out the source code so other people could improve on their designs?

Wait, never mind, that’s ridiculous. I mean, a world where people would actually do something like that for free would be nothing like our own. It’d probably have some kind of global computer network powered by that same free software. In that crazy upside-down world, I bet the most popular rapper would be a white guy, the best golfer would be a black guy, Apple would use Intel CPUs, Microsoft would use PowerPC chips, and the Germans would be against a controversial war!

Sounds great. Of course, if these incredibly talented people (who are both skilled in their field and capable of writing clear, understandable prose) spend all of their time writing stuff to give away, they’d need a way to eat and pay the rent, too. Do you have any fresh new ideas for that, or is it just your basic utopian socialist paradise?

Have a seat, because what I’m about to tell you just might blow your mind.

Ready?

The world I was talking about… is Earth!

Yes, people already do spend hundreds or thousands of hours of their own time writing stuff to give away for free. You’re using some of it right now. (Of course, most of their writing is done in languages like C, C++, and Java, not English, but the point remains.) If you want to know how they put food on the table and a roof over their head, you don’t need to ask me to speculate, you can just put the question to them directly.

That is not how Google Print works.

But isn’t ignorance wonderful?

You just don’t get it, do you?

Of course people write stuff for free today. Even cynics like me. I’ve written piles of technical stuff for my own Web sites and others. I contribute regularly to Wikipedia. I write free book reviews for the local newspaper. But I get to choose which of my work is given away free and which I get paid for. I make a living with my writing, and the better my paid stuff sells, the more time I can afford to spend on free stuff.

If you start stealing people’s copyrighted software, books, articles, and music, then they are going to have to find a different way to make a living. You think people are going to spend tens of thousands of dollars going to college to learn to program if they can’t make a living at it? You think people are going to spend a thousand hours writing a book while they’re making below-par wages at Wal*Mart or McDonald’s? Only being well-paid for what you do creates the kind of leisure time needed to write for free.

And you still do, as has been pointed out to you. If this is just a response to the idea of a totally open-source model of creative property, fine. But if you think this has anything to do with Google Print, you’d better explain what you know about it that no one else does.

This has indeed gotten off topic. Discussion of the merits of copyright is best suited to another thread.

It was a specific response to Mr2001’s implication that writers shouldn’t be able to control their own work products. I know the thread has drifted, but I just couldn’t let that comment go unchallenged.

Fair enough, then. I think open-source is a great idea, but I have no clue how it works.

I made no such implication here. I was simply responding to your claim, which I think was quite clear (although you seem to have backed away from it), that no one with skill would put any effort into writing if they couldn’t get paid for it.

No, this is wrong. You are graded using a function of how many of your machine translation’s n-grams (sequences of n words in a row) match n-grams in a set of human translations. The function can’t be interpreted as a percentage, and it certainly is not based on 1-to-1 sentence matches, since sentences won’t match up 1-to-1 in all cases. In fact, a big part of machine translation is figuring out how to align sentences correctly. This score is quite difficult to interpret, and it has been the subject of some criticism, as systems have achieved high scores while producing translations that are virtually unreadable. However, it is the best automatic scoring method currently available. The alternative is to have humans read and judge each machine translation separately, which is simply too expensive for NIST.
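For anyone curious what "matching n-grams" looks like in practice, here is a rough Python sketch of the counting step only. It is a simplification, not the scoring code NIST uses: the full metric also combines several n-gram lengths and penalizes translations that are too short. The function names are my own, and the tokenization (splitting on spaces) is an assumption made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that also occur in a reference.

    Each candidate n-gram is credited at most as many times as it
    appears in any single reference, so repeating a common word
    over and over does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Largest count of each n-gram seen in any one reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())

# Hypothetical example sentences, whitespace-tokenized.
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
good = "the cat sat on the mat".split()
junk = "the the the the the the the".split()

bigram_score = clipped_precision(good, refs[:1], 2)  # 3 of 5 bigrams match
junk_score = clipped_precision(junk, refs, 1)        # only 2 of 7 "the" are credited
```

Note that nothing here compares whole sentences one-to-one, and a fluent-looking score can come from output that shares lots of short word sequences with the references without reading well.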

What Google has shown is that with fairly simple scalable algorithms you can score high using this automatic metric (called BLEU, for those who are interested) IF you have a ton of data. Google has a ton of data: the web. I’ve seen a couple of presentations like this (one by Peter Norvig), and IIRC, it took a doubling of the amount of data to increase the BLEU score by a few hundredths. More data beyond “a ton” simply won’t get you that much further, and anyway it’s not clear that a higher BLEU score necessarily means a better translation. Other researchers have shown that better algorithms can also score high, but with a lot less training data. The trick is to figure out how to get the better algorithms to scale up to the massive amount of data out there, on the web and elsewhere.
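To put the "doubling" point in numbers, here is a back-of-the-envelope sketch. It assumes every doubling of the training data buys the same fixed gain, which is my reading of those presentations and not a measured law. The 2-hundredths figure is a made-up stand-in for "a few hundredths", and the function name is mine.

```python
def data_multiplier(target_gain, gain_per_doubling):
    """How many times more data is needed for a given score gain,
    assuming each doubling of data adds a constant gain."""
    doublings = target_gain / gain_per_doubling
    return 2 ** doublings

# Hypothetical: gains measured in hundredths of a BLEU point.
# If one doubling buys 2 hundredths, then 10 hundredths takes 5 doublings.
needed = data_multiplier(10, 2)  # 32 times as much data
```

Under that assumption, each further step up costs as much data as everything collected so far, which is why piling on data runs out of steam.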

I see from the NIST tables that Google, while impressive, is not vastly better than ISI, which I believe also uses simple algorithms with lots of data. ISI is restricted by not having easy access to the whole web as Google does.

Anyway, just wanted to correct some misconceptions.