New Amazon.com feature text searches every word in every book they sell! Wow!

Go to amazon.com for specs on the new feature

It’s not quite every book they sell, but WOW! It is very cool. I wonder if dopers will be using it a lot to research and footnote their debates.

Interesting. I don’t know whether they’ve got it quite right yet, but it’s powerful stuff.

[Ego-search test] It found references to records I’m on, but failed to find a reference to me in a book they have that I wrote a chapter of. It did find an out-of-print biography of an ancestor and numerous references to him in appropriate places.

Well worth keeping an eye on.

How could this be? It would require a huge team of people transcribing every book in their inventory, from Ansel Adams photobooks to every version of the Bible.

If I search “Dad, your poll numbers are down.” would I get hits for all the Calvin and Hobbes anthologies?

Wow, I’m impressed. I put in the search words Norman McLaren Begone Dull Care and came up with 22 entries! Very cool.

There seems to be a ceiling of 120 on the search results. Plus, you can download the page in question if you have an account (a little screen comes up that says Authorizing Copyright, so maybe not all pages are eligible for this feature).

Well, I put in that quote along with search terms: Calvin Hobbes

And the first 10 items were all by Bill Watterson (without the C&H, none of the first 10 were; probably too many generic words).

Well, most of the books probably didn’t need to be transcribed. Nearly all books anymore are written, submitted to the publisher, and/or typeset electronically, so the electronic versions already exist. They just had to deal with the publishers to get ahold of those electronic versions.

As for other books (older books, or things like comic strip anthologies where the text isn’t typeset), if they list them at all, it’s probably just unproofread OCR. You’ll get a significant number of errors that way (somewhere in the vicinity of 1 in 100?), bad enough that you probably wouldn’t enjoy reading the book through that way, but still good enough that you’re likely to hit your search results.
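To put a rough number on "still good enough": here is a back-of-the-envelope sketch in Python. It assumes the 1-in-100 guess above is a per-character rate and that misreads are independent of each other, neither of which is confirmed anywhere; the function name is made up for illustration.

```python
# Assumption: each character is misread independently with probability 0.01
# (the "1 in 100" guess above, read as a per-character rate).
CHAR_ERROR_RATE = 0.01


def intact_probability(text: str, error_rate: float = CHAR_ERROR_RATE) -> float:
    """Chance that every non-space character of `text` was recognized correctly."""
    n_chars = sum(1 for ch in text if not ch.isspace())
    return (1.0 - error_rate) ** n_chars


if __name__ == "__main__":
    for query in ["Watterson", "Dad, your poll numbers are down"]:
        print(f"{query!r}: {intact_probability(query):.3f}")
```

Under those assumptions a single short word almost always survives, while a whole sentence has a noticeably lower chance of coming through untouched. A word that appears many times in a book only needs to be read correctly once to produce a hit.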

Archive Guy, that seems like a mightily poorly controlled experiment, to me. It seems like putting any search terms together with Calvin Hobbes would give you a good representation of Bill Watterson books. And the generic words in “Dad, your poll numbers are down” should be OK, since they’re in an exact phrase, and that phrase isn’t particularly generic.

“Dad, your poll numbers are down” (with quotes as shown) produces zero results.

Dad, your poll numbers are down (without the quotes) produces 120 results but none of them are Calvin and Hobbes. (At least, not on the first few pages.) Without quotes it seems to find results where all the words are more or less close to each other.
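The difference between the two searches can be sketched like this. It is only a guess at the behavior described above, not Amazon's actual algorithm; the function names and the 20-word window are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def phrase_match(query, page):
    """Quoted search: the query words must appear consecutively, in order."""
    q, p = tokenize(query), tokenize(page)
    if not q:
        return False
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))


def proximity_match(query, page, window=20):
    """Unquoted search (assumed): all query words fall within `window` words."""
    q, p = set(tokenize(query)), tokenize(page)
    if not q:
        return False
    for start in range(max(1, len(p) - window + 1)):
        if q <= set(p[start:start + window]):
            return True
    return False


if __name__ == "__main__":
    page = "Your numbers are down, Dad. The poll closed early."
    query = "Dad, your poll numbers are down"
    print(phrase_match(query, page))     # same words, wrong order
    print(proximity_match(query, page))  # all words close together
```

A page that uses all the same words in a different order fails the quoted search but passes the loose one, which would explain 120 unrelated hits without quotes and zero with them.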

It’s a cool idea but they sorely need a boolean search engine to do useful searches on that much data.
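For what I mean by boolean search, here is a toy sketch: an index from each word to the pages containing it, then AND / OR / NOT done as set operations. The page IDs and text are invented, and nothing here reflects how Amazon actually stores its data.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, all_ids, must=(), should=(), must_not=()):
    """AND over `must`, OR over `should`, then remove anything in `must_not`."""
    result = set(all_ids)
    for word in must:
        result &= index.get(word.lower(), set())
    if should:
        result &= set().union(*(index.get(w.lower(), set()) for w in should))
    for word in must_not:
        result -= index.get(word.lower(), set())
    return result


if __name__ == "__main__":
    # Hypothetical pages, purely for illustration.
    pages = {
        "comic-p1": "Calvin checks the poll numbers for Dad",
        "politics-p7": "The senator saw his poll numbers drop",
        "cooking-p3": "Numbers of eggs for the cake",
    }
    index = build_index(pages)
    print(search(index, pages, must=["poll", "numbers"], must_not=["senator"]))
```

With something like that, "poll AND numbers NOT senator" would cut the generic hits way down instead of hitting the 120-result ceiling.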

It’s hard for me to admit that the first phrase that came into my head to search for was “nail in bum” and yet I am forced to admit it because it’s so cool that when I searched for it, The Straight Dope: A Compendium of Human Knowledge came in at number 7.

If you read the “Wired” article, you find that they’re not just getting the text from the publishers, they’re actually scanning the books and using OCR. Sometimes they ship the books to low-wage companies, and sometimes they chop the bindings and use automated scanners.

Hmm… I can’t imagine the damage if someone manages to crack into the server and steal everything. Massive IP abuse. On the plus side, does this mean that Amazon could potentially open an ebook store with nearly every book they sell? That could be pretty cool.

The beauty of the system is that they don’t have to have perfect copies. Just good enough OCR, and full page scans to show as the results. So you have pretty good accuracy for such things as searches, without the insane work of hand-checking that real e-book making takes (see Project Gutenberg).

We are getting close to the ideal data world – where everything ever written, past or present – is online and searchable.

I can easily see how publishers could provide computer data of each new book as it is published (but think of the security risk!) so Amazon can index it, but just how would entering older books work efficiently? I just can’t picture someone turning each page in a bound copy and putting it on the scanner; this would be too slow to be cost-effective. Or are there automated machines that can be fed a bound publication and they will turn and scan each page unattended? Or slice the binding off and feed single pages?

More info:

Publisher FAQ, which says they need a physical copy and can’t use electronic ones yet.

Wired article, that Cardinal referenced.

Something similar to this was started a few years ago at (I think) ancestry.com. They have a genealogy library that has been growing at a rate of about three books a day and which includes books dating way back to the earliest writings in this country. Some of it is photocopied, but some is also transcribed. A remarkable undertaking.

I imagine that a quote from a cartoon might not be in there, because the words of the cartoon probably have not been transcribed to computer, only words of text-only books.