I investigated OCR software solutions about 3 years ago for my company. They wanted me to get all of our documents (warehouse picking papers, invoices, etc.) scanned in and indexed in a searchable format. We concluded that the price to do it properly was too high, in excess of 50K. I was wondering if anyone out there is successfully using OCR for a similar purpose today, and if it was worth your investment. It has come up again (this time on a smaller scale) and I’m wondering if now is the right time to get into it.
I’ve used the Finereader OCR engine from Abbyy as the basis for an automatic document conversion tool operating on TIFF images of scanned documents. They provide a nice front end for converting single files and limited batch capability, but more importantly you can call their OCR engine from your own code, so you can write a wrapper to process documents in any way you need. The Abbyy system has a lot of parameters which allow you to adjust it for tabular data, foreign language character sets, etc.
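For anyone wondering what such a wrapper looks like, here is a minimal sketch in Python. It is not the Abbyy API: `batch_ocr` and the `ocr_engine` callable are hypothetical names, and `ocr_engine` stands in for whatever call your OCR engine actually exposes (anything that takes an image path and returns text). The folder layout is an assumption too.

```python
from pathlib import Path
from typing import Callable, Dict


def batch_ocr(input_dir: str, output_dir: str,
              ocr_engine: Callable[[Path], str]) -> Dict[str, str]:
    """Run ocr_engine over every TIFF in input_dir, writing one .txt per image.

    ocr_engine is a placeholder (an assumption, not a real vendor call):
    plug in a function that wraps your actual OCR engine.
    Returns a per-file status so failed pages can be rescanned later.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    results: Dict[str, str] = {}
    for image in sorted(Path(input_dir).iterdir()):
        # Only scanned TIFF images are processed; everything else is skipped.
        if image.suffix.lower() not in {".tif", ".tiff"}:
            continue
        try:
            text = ocr_engine(image)
        except Exception as exc:
            # One bad scan should not stop the whole batch.
            results[image.name] = f"FAILED: {exc}"
            continue
        (out / (image.stem + ".txt")).write_text(text, encoding="utf-8")
        results[image.name] = "ok"
    return results
```

Keeping the engine behind a single function means the batch logic (file selection, error handling, output naming) stays the same if you switch OCR products later.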
My company does boutique OCR archiving for small firms and wealthy individuals–basically, they give us a shoebox full of letters, clippings, magazine articles, photos, etc., and we scan them, clean them up, and turn them into PDFs.
I guess it depends on what you’re doing, but I recommend Adobe Acrobat/Acrobat Capture. The final product of each OCR’d document is a PDF file, containing an image of the scanned document (you still need to scan it in w/ some other software), whose text you can highlight, copy, and paste like any “live” computer font (like these words, for instance). And, once you have all of your PDFs created (or enough of them to work with), you just tell the software to index them and then you can do speedy boolean searches of all the text, as well as any keywords you’ve embedded (for custom categorization, for photos w/o text, etc.). Pretty neat, great output product, and minimal effort/skill required. The OCR is pretty faithful, I’ve found–much better than the sort of gobbledygook you got from such programs 4 or 5 years ago. But you’d still want to proofread it before you copied and pasted it into your annual report.
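To give a rough idea of what "index them, then do boolean searches" means under the hood, here is a toy sketch in Python. It is not how Acrobat does it; the function names and sample file names are made up for illustration, and it assumes you already have the OCR'd text of each document as a plain string.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search_all(index, *terms):
    """Boolean AND: documents that contain every term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_any(index, *terms):
    """Boolean OR: documents that contain at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


# Hypothetical sample documents (OCR output keyed by file name).
docs = {
    "invoice_001.pdf": "Invoice for 40 pallets shipped to Acme",
    "pick_17.pdf": "Warehouse picking list: pallets, aisle 4",
    "letter.pdf": "Dear Acme, thank you",
}
index = build_index(docs)
matches = search_all(index, "acme", "pallets")
```

The index is built once, so each search only touches the word lists for the terms you typed, which is why searching stays fast as the pile of PDFs grows.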
Again, I don’t know what kind of setup you have (multiple locations? do you want docs scanned immediately? or can you have them rounded up once a week/month/quarter/year for simultaneous scanning? who’s doing the scanning? who’s doing the info. retrieval?), but it seems to me that you could get a dedicated PC, high-speed (or medium-high-speed) scanner, and software for a few grand, then just worry about paying someone to do the dirty work–maybe even a part-timer.
Note that, to my knowledge, this kind of flexibility is NOT available from Acrobat, so it may not be up your alley if you need that. Think Web-style search rather than powerful database. But PDF is an open format, so there are probably third parties out there who are making such software to work with PDFs right now. (And, indeed, there are cheaper OCR-PDF programs than Capture … it’s just that Adobe Acrobat/Capture is like the name-brand drug, and the others are the generics.)
I use Abbyy FineReader and I must say I absolutely love it. It’s accurate and you can train it for added accuracy. Def ready for primetime IMO.
We use Genesis/OnBase. We scan and ‘image’ about 30,000 letters, or upwards of 50,000 pages per day, which are indexed and then are worked in India by our offshoring partner.
We have been in the groove for several months, and most problems seem to be volume related (FTP issues, ISP issues, etc.).
The process seems to work, and there are various ways to sort work and find it later, but we have big-time issues because the volume is so high and we pump the files over to India (from the U.S.).
I just gotta chime in and agree with Toadspittle; from my experience Adobe Acrobat is the way to go. Now my experience is fairly limited, but I did look into converting paper to data in my office just a couple weeks ago. It seems to me that OCR gets the basics right (i.e. it can convert written word to digital text), but bungles the formatting on a regular basis. Unfortunately, depending on your needs this can be a real killer. The thing that’s great about Adobe is that it makes a standard .pdf ‘picture’ of the original document and then adds the OCR text as a separate layer underneath. Therefore the document is reproduced perfectly just like a standard .pdf file, but you can highlight text, etc. It really seems to be the best of both worlds. I suppose it all depends on your needs, but since I saw this question and just went through this predicament a couple weeks ago, I had to chime in and say that Adobe was the only satisfactory solution we could find and we’re pretty happy with it.
Oh, and if you’re looking to outsource the work, Bongmaster, feel free to drop me an e-mail …
Wow, thanks so much everyone for the very helpful replies! I’ll definitely check out the software you guys are using, sounds about right for me. As for outsourcing, I’ve got your name on my short list and will definitely call if needed. We just have to leave my username out of anything outside the board here.
Thanks again everyone, your help is very much appreciated!