I’ve been interested in having a book scanner. I want a device that can scan images from books (hardcovers or paperbacks) and then convert the image of the words on the page into a text file that I can edit.
I’ve been thinking about this for a while but it was recently brought to mind by my seeing an IndieGogo campaign for what’s being marketed as an economy portable book scanner. It is a good price but it’s still enough money that I don’t want to just trust them and send them my money without knowing a little more about the product.
I looked around online for reviews but it’s a new product so there don’t appear to be any reviews out yet. And most of the review sites I found seem questionable; they look like sites set up by the companies selling scanners where you’ll find all kinds of gushingly positive reviews.
But I can trust you guys, right? So tell me about your experiences with book scanners. What should I be looking for? What should I be avoiding? Are cheap models useless and not worth the money I save? Are expensive models overpriced and full of features I won’t use? Do all book scanners suck and I should wait until the technology gets better? Are there reliable review sites where I can get impartial reviews and information?
In many cases, any bog-standard multi-function printer/scanner will work. You’ll probably have to buy OCR (optical character recognition) software, however. That’s what makes the difference between getting an image and editable text.
Your mileage may vary considerably if you’re trying to scan an old or very big book. For special cases, you would probably have better luck using something like this.
I did it once with a regular flat copier/scanner, but I had to flip each page manually, turn the book over, and lay it flat each time before pressing the button. If I had to do it again, I would definitely use an HD-camera-based job like in the provided link; even if you are still flipping the pages yourself, the fact you do not have to move the book means it will take a fraction of the time. The software de-warps the image anyway, so that is not an issue.
Adobe Acrobat has some basic OCR; there is also Tesseract and similar. IME you need to proofread the output anyway (spell-check at the very minimum).
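As a rough illustration of the "spell-check at the very minimum" step, here is a minimal Python sketch that flags words in OCR output that are missing from a word list. The tiny `KNOWN_WORDS` set, the `flag_suspect_words` name, and the sample text are all made up for illustration; a real pass would load a full dictionary file and still need a human to read the result.

```python
import re

# Hypothetical, deliberately tiny word list; a real check would load a full dictionary.
KNOWN_WORDS = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}


def flag_suspect_words(text, known_words=KNOWN_WORDS):
    """Return (line_number, word) pairs for tokens not found in known_words."""
    suspects = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Keep digits inside tokens so OCR slips like "qu1ck" stay in one piece.
        for word in re.findall(r"[A-Za-z0-9']+", line):
            if word.isdigit():
                continue  # plain numbers (page numbers, years) are not spelling errors
            if word.lower().strip("'") not in known_words:
                suspects.append((line_no, word))
    return suspects


# Made-up sample of OCR output with two typical misreads.
ocr_output = "The qu1ck brown fox\njumps ovcr the lazy dog"
print(flag_suspect_words(ocr_output))
```

This only catches misreads that produce non-words; an error that yields a valid word (say, "modern" read as "modem") slips through, which is why proofreading is still needed.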
It seems like the prices range a lot for the consumer-level book scanners, and that probably reflects speed of scanning, as well as the quality of image. The better the image quality, the better the OCR will work. (And I imagine they all come with some form of OCR software bundled.) The speed of the scan doesn’t mean much if you have to spend a lot of time fixing OCR errors. In the same way, if the bundled OCR software doesn’t really integrate well, then you might end up spending a lot of time dealing with the document files after scanning to make them usable.
It seems to me that this is one of those types of products where it’s really important to be able to try it out under real-life circumstances to know which model to get within one’s budget range. I’m as curious as the OP on this. Right now, when I want to scan a book, I just get a (paperback) copy that is disposable, cut out the pages with a box-cutter, and put them into a ScanSnap stack scanner, which immediately performs OCR and converts to PDF with just about no errors. (With practice, I can do this relatively quickly.) However, I’d like to be able to scan borrowed books without damaging them. If there’s a book scanner that actually works well and quickly, for a reasonable price, I’d like to know, too.
Czur apparently is a major player in the book scanner business. The one I have ads for is also a Czur product.
But these videos are the kind of thing I was talking about when I mentioned “gushingly positive reviews”. They appear to be ads made by Czur that are being presented as impartial reviews. I’ve read reviews of these same products on Amazon where people describe a lot of problems that these videos never mention. I’m not saying these negative comments are necessarily true. But the fact that these videos never even mention these possible problems - even to dismiss them - makes these videos feel more like sales pitches than reviews.
There are several kinds, including Snoopy’s doghouse shaped ones and tricked-out overhead projector-looking ones (examples for illustration purposes, not specific recommendations.) The cheaper solution is just an almost normal flatbed scanner with the scanning head running as close to one edge as possible–on standard flatbeds, there will be a margin of inches around each side of the glass plate of the scan area–bookscanners have a margin of maybe around 1/8th of an inch on the side opposite the cover hinge. Most books don’t have their [del]minds[/del] text that far in the gutter, so that type of scanner should be good enough for you. I used to use a Plustek Opticbook 3600 and it worked well (until eventually the scan head gears got misaligned somehow–probably an easy enough fix but by then I had scratched the glass all to heck scanning fossils anyway.)
Beyond the hardware, you also need good OCR software. I used ABBYY Finereader Pro, which was the best of the packages I tried (and not the one bundled with the scanner.)
Even with the best scanner and best software, you will spend a fair amount of time proofing and editing your pages. The cheaper the scanner and software, the more time you are going to spend. I have no idea what design the IndieGogo one is, but if it is just a camera stuck on an arm over an open book, you are not going to get the quality of results that you do from the designs I mentioned above–if you are going that path, just use your cell phone.
Sorry to be that guy, but unless the books in question are in the public domain (published before 1923), you would be violating copyright law. Do not believe anyone who tells you that if you purchased the book you have the right to convert it to digital format. You do not.
That article and 17 U.S.C. §108, which he cites a lot, are solely about reproduction from libraries and archives, not personal users scanning their own books. It is not relevant here.
That’s not the same thing as saying that personal scanning is legal. My guess is that it officially is not, given §107:
Copies being defined as any technology in the universe.
My guess is also that virtually every single image you see on the internet, all several billion of them from after 1923 (1924 as of January 1; you’re behind the times, commasense), is also in violation of copyright.
Copyright law just is not designed to deal with personal book scanners or Pinterest. As an author, I have strong feelings about upholding copyrights. But I also recognize the real world. Making copies of already purchased materials for one’s own use is no more an ethical violation than copying an image from Google Images to use on one’s website.
Now, I’m assuming that Little Nemo is scanning for his own use and not to resell edited copies of the material. If that’s not the case, then let the hammer fall.
I had a portable wand scanner that cost around $30. I wouldn’t want to scan a whole book with it, although I guess it would be possible.
It came with OCR software that really sucked. For OCR I used PDFelement OCR software. It worked pretty well after I figured out a couple of weird things about it and how to deal with them. Then after three years it stopped working, even though it was supposed to be a straight-up purchase and not one of those pay-by-the-month deals. And now PDFelement has pay-by-the-month deals only, as near as I can figure out.
That scanner–which I think was actually called Magic Wand–did not hold up a lot longer than the PDFelement software but it worked fine for a couple of years. For heavy use, I would recommend something else.
Now, I also scanned a whole book at Kinko’s (paperback book, and the scan destroyed it) and it was really fast. But it was also probably a very expensive scanner. Just a note to the copyright police here, I assume Kinko’s would know if they were violating copyright and they did not seem concerned at all. They just wanted to make sure I realized they were cutting the spine off that book and it was not repairable. They cut off the spine with their precision cutting tool and turned me loose on the machine, and charged me by the minute. I think it cost me like $16 and took about ten minutes. I don’t remember the per-minute price. It’s a lot more expensive if you let them do it as they charge by the page.
Pro scanner: 10 minutes
Handheld scanner: Several hours and many redos of pages because I’m not a machine and for OCR purposes consistency is important.
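For what it’s worth, the per-minute price that wasn’t remembered can be roughly backed out of the two figures quoted above (about $16 for about ten minutes). A quick sketch, treating both numbers as approximate recollections rather than a published rate:

```python
# Figures as recalled in the post above; both are approximate.
total_cost_dollars = 16.00
minutes_used = 10

# Implied self-service rate, assuming billing was purely per minute.
implied_rate = total_cost_dollars / minutes_used
print(f"Implied rate: about ${implied_rate:.2f} per minute")
```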
That 1923 date gets thrown around a lot in discussions of copyright. There’s just one problem–it simply is not true.
A lot of people use that date because you can reasonably assume that any book published before then is in the public domain. But a lot of post-1923 books are ALSO in the public domain. The problem is that it’s often difficult to tell which ones. The rules changed several times over the years, and for quite some time, authors/publishers had to take specific, concrete steps to renew copyright. Many of them did not, at least for some books.
As you describe the process, THEY weren’t violating copyright. If the book was under copyright, the person running the machine to make the copies (that would be you) would be the person at risk of being sued or prosecuted. As mere owners of the machine, they have basically no liability.
My experience at Kinko’s has been that if they are running the machine and they even suspect a possible copyright issue, they won’t make the copies. Period, end of discussion, no appeal.
For the record, I do not plan on using a book scanner in any illegal fashion. I want to scan books that I have bought for my own personal use. I don’t plan on scanning books that belong to other people and I don’t plan on scanning books for use by other people. I am putting a lot of books into storage and I was thinking it would be nice if I could scan them before packing them away because then I would still have access to the content even when the physical books weren’t handy.
Well, yeah, those videos show how the scanner is supposed to work. They’re by people who do reviews of all kinds of things (not just from that manufacturer), which they monetize on YouTube. It’s not so much that they are secretly working for the manufacturer as that they are newly using the product, so they aren’t likely to have come across problems like one does after using the thing for a while. So, of course, before buying anything on Amazon, check the reviews.
By no means was I recommending those scanners–just showing examples for that price range. As I said above, it’s one of those products you really need to use for a while to know if you like it, so I would get one that has a 30-day return policy, or something like that.
I really wouldn’t put my trust in a scanner that doesn’t involve the page being flat. I haven’t actually used the other kind, but I’m highly skeptical that their software can adequately correct for curvature and different lighting near the spine.
A buncha years ago, I got it into my head to scan a hardback novel so I could read it on my then-new and hotsytotsy Nook. I needed a PDF of the book.
I took said hardback to a local Kinko’s and asked them if they had a bulk paper-cutter. ( The type that can go through 200 pages of card stock in a precise line without breaking a sweat. ) I told them I wanted the binding severed so I had a pile of pages. They refused, saying the cutter couldn’t make it through the cardboard on the outside.
I smiled. And tore the front and back hard covers right off. Handed them the rest of the bound book. The guy laughed and put it into the cutting machine. It took less than 3 seconds. I got back my pile of pages of Stephen King’s writing.
Came home. Scanned them one at a time by hand ( because a high speed scanner would have involved the copyright debate detailed elsewhere in this thread ), and voila. A PDF of a novel.
Now, I happen to be the son of a writer and a vigorous defender of IP rights and Copyrights in general. So scanning the novel did not mean I had a digital copy I could share around. It meant I had MY book on my device. And nowhere else.
Just to include other options: there are online services that will cut your book and scan. I have no experience with any of them, but I have considered 1DollarScan, whose charges start at $1 per 100 pages.
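To get a ballpark for a whole shelf at that starting rate, here is a small Python sketch. It assumes (my assumption, not something the service’s terms confirm) that each started block of 100 pages is billed in full and that no add-on options are selected; `estimate_scan_cost` is just an illustrative name.

```python
import math


def estimate_scan_cost(pages, price_per_set=1.00, pages_per_set=100):
    """Estimate base cost, assuming every started set of pages is billed in full."""
    sets = math.ceil(pages / pages_per_set)
    return sets * price_per_set


# Hypothetical page counts for three books.
books = [350, 220, 100]
total = sum(estimate_scan_cost(p) for p in books)
print(f"Estimated base cost: ${total:.2f}")
```

Since the quoted price is only where charges start, treat the result as a floor rather than a quote.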
I asked a fellow to chop some spines at Kinkos once and he said they would only use their guillotine on new paper, saying that they didn’t want to damage the blade with hidden staples and such.
So I came home and ran the books through my table saw using a “sled” to keep them in position. Wow, that worked like a charm.
Another experiment using a band saw was just as good, perhaps safer, but didn’t leave as straight a line.
ocr … good luck with that … 95% of your workload will be in proofreading … that is, if your intent is to make the scanned pages searchable/editable. the finished pdf may appear as an ‘exact’ reproduction … but that’s only the sheep’s clothing. [i.e. the scanned image-text is in front of the invisible (masked) alpha-text.] there is alpha-text within the pdf document (metadata) … whereby, once you have rendered the ocr and then try performing a “find word” … it will search the pdf (metadata) … possibly also highlighting the ‘text’ as well. got your loupe handy?
editing the text … it’s a whole other ballgame. replacing an element as simple as a comma with a semi-colon could produce hideous results. and, then, there is the font … if you’re using arial, and the original font was tnr (times new roman) … that’s another issue. of equal concern is formatting … such as kerning 'n leading 'n spacing 'n ligatures 'n diacritics.
i have no idea what your intent is, little-nemo … however, there will be other netizens peeking into sdmb globally. as member commasense has so graciously pointed out … copyrights need to be respected 'n maintained. additionally … same goes for fonts … each font has its own copyright to be adhered to. one final consideration: if tomorrow comes … and the directive gets changed … the media now needs to be published … how do you know what has, and what has not, been edited?
personally … i’ve been using a $12,000 iqsmart scanner (flatbed) … which utilizes oxygen software. veritably, this usually entails unbinding the media (removing spine). after pre-scan … this is followed up with another two passes … one for text and one for images. afterward … use indesign to combine/recreate the pages … and acrobat to preview the *.pdf files before publishing.