What’s the fastest way to scan searchable .PDFs?

I’m trying to convert all of my personal paper files to .pdfs, and I’d like to have them text-searchable.

I’ve tried all the different settings possible with my scanner and software (a run-of-the-mill HP Officejet, about a year or two old, with the bundled software), and I’ve found that the quickest I can scan a typical document (one sheet of a telephone bill, for example) is about 2 minutes and 30 seconds. Of course, with multi-page documents and the automatic document feeder, the average per sheet goes down, but most of what I have is single sheet. Even with the “create each page as a separate file” function it takes a long time.

Is there any way to do this faster? It will take forever at this rate, especially as it requires input to the software interface. The questions I have are:

  1. Does it all boil down to the scanner itself, or can different software speed things up?
  2. If software can speed things up, which one is recommended?
  3. If the hardware really matters, which scanner is recommended?
  4. As this is effectively an OCR scan, is there a tenable (less time-consuming) way to scan as images only first, and then convert the images to searchable PDFs?

You’re essentially limited by the hardware. Lowering some of the quality settings (reducing the resolution and switching to black and white) will have a decent impact. If the scanner has an automatic document feeder, that makes life easier… you can just kick off the scan and go grab a cheeseburger while it scans 20 (or however many) docs.
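To get a rough feel for why the quality settings matter, here is a minimal sketch that compares the uncompressed size of one page at two settings. The page size (US Letter) and the two settings are assumptions picked for illustration, and raw data volume is only one of several things that affect scan time.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of one scanned page, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed settings: US Letter page, 300 dpi 24-bit color vs. 200 dpi 1-bit black and white.
color_300 = raw_page_bytes(8.5, 11, 300, 24)
bw_200 = raw_page_bytes(8.5, 11, 200, 1)

print(f"300 dpi color: {color_300:,} bytes")
print(f"200 dpi B&W:   {bw_200:,} bytes")
print(f"ratio:         {color_300 / bw_200:.0f}x")
```

Under these assumptions the black-and-white page carries about 54 times less raw data for the scanner and software to move and process.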

In the world of home scanning, there is no equal to the Fujitsu ScanSnap series.

These sheet-fed scanners scan both sides of a sheet into PDF format in about 4 seconds. You can load fifty pages and turn it loose.

The OCR part is a post processing step, but you can scan a hundred docs and then do other stuff while the OCR catches up.

I don’t know what the current lineup is, but when I bought mine a few years back, it cost a little above $400. I have even sawed the spines off some books and fed them to the scanner before tossing them.

I agree. My ScanSnap S510 has been well worth the price; it is one of the most useful and reliable computer peripherals I’ve bought. The price is reasonable, especially when you consider that it comes with a full version of Adobe Acrobat. I’ve also owned several different multifunction devices (with automatic document feeders), and none of them come close to the performance and convenience of the ScanSnap.

The best part is how easy it is to operate. All you do is insert the document in the scanner, push the big button on the scanner, and that’s it. It gets scanned (both sides), converted and OCR’d into a PDF file and stored on the hard drive.

That said, 4 seconds is a bit optimistic. The actual scan may be 4 seconds per page, but once the entire document (which can be multiple sheets) is scanned, you have to wait for it all to be saved before you can scan the next document. Still, a single sheet of paper (printed both sides) shouldn’t take more than 20 seconds to scan/convert/save. If you scan a 30-page document, you’ll need to wait for a few minutes before you can scan the next document.
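To put those per-document times in perspective, here is a back-of-the-envelope sketch. The 150 seconds and 20 seconds come from the figures quoted in this thread; the pile of 500 single-sheet documents is a made-up number for illustration.

```python
def total_hours(documents: int, seconds_per_document: float) -> float:
    """Total wall-clock time, in hours, to scan a pile of documents."""
    return documents * seconds_per_document / 3600


DOCS = 500  # hypothetical pile size, not from the thread

officejet = total_hours(DOCS, 150)  # 2 min 30 s per sheet, as reported in the question
scansnap = total_hours(DOCS, 20)    # upper bound per single sheet quoted above

print(f"Officejet: {officejet:.1f} h")
print(f"ScanSnap:  {scansnap:.1f} h")
```

For that hypothetical pile, the difference is roughly 21 hours versus a little under 3 hours.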

It depends on what your platform is and how your workflow is configured.

At one time the OCR step couldn’t be queued separately, forcing you to wait between scans, so I ended up writing an AppleScript to work around this.

In short, the script would watch a particular folder for new files and feed them to ABBYY FineReader (the OCR software used for the ScanSnaps) one at a time. That way I could scan at full speed and the OCR process would happen in parallel.
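For anyone who wants to try the same idea on another platform, here is a minimal Python sketch of a folder watcher. It is not the original AppleScript. The function names and the polling approach are my own assumptions, and it uses the `ocrmypdf` command-line tool as a stand-in for FineReader, which assumes that tool is installed.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, List, Set


def run_ocr(src: Path, dst: Path) -> None:
    # ASSUMPTION: ocrmypdf is installed and on PATH; it stands in for
    # ABBYY FineReader here. Swap in whatever OCR tool you actually use.
    subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)


def process_new_files(
    inbox: Path,
    outbox: Path,
    seen: Set[str],
    ocr: Callable[[Path, Path], None] = run_ocr,
) -> List[str]:
    """OCR every PDF in inbox not handled yet, one at a time.

    Returns the names of the files processed in this pass.
    """
    done = []
    for src in sorted(inbox.glob("*.pdf")):
        if src.name in seen:
            continue
        ocr(src, outbox / src.name)  # blocks until this file is finished
        seen.add(src.name)
        done.append(src.name)
    return done


def watch(inbox: Path, outbox: Path, poll_seconds: float = 5.0) -> None:
    """Poll the inbox folder forever, OCR-ing new scans as they appear."""
    outbox.mkdir(parents=True, exist_ok=True)
    seen: Set[str] = set()
    while True:
        process_new_files(inbox, outbox, seen)
        time.sleep(poll_seconds)
```

You would point the scanner software at the inbox folder and call something like `watch(Path("scans/inbox"), Path("scans/searchable"))` (both paths hypothetical) in a separate terminal. One caveat: this sketch does not check whether the scanner has finished writing a file, so a real version should wait until the file size stops changing before handing it to the OCR step.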

Recent releases of ScanSnap Manager have fixed things so that you can scan stuff as fast as you can shove pages into the scanner, while the OCR process simply munches on the files at its own pace in parallel. The OCR is still slow, but the fact that you can scan 30 thick documents and then go watch TV is a big win.

If you don’t need searchable PDFs, you can turn the OCR bit off and then the docs are ready immediately after the 4s/page scan.