Paper-to-computer file conversion advice wanted

Where I work, we have loads of paper files. My boss recently said to me that it would be a Good Thing to get rid of those paper files—perhaps I could figure out some way to computerize them?

Knowing zip-friggin-ola about computer information retrieval, I appeal to you for help.

I imagine something like a million pdf files, created by scanning our paper documents, perhaps arranged in folders according to ATA number, with some kind of search engine that could look at the contents of every file, looking for words or phrases like “engine fire” or “1566678-45”.

Has anyone here faced this challenge–reducing gobs of paper files to a searchable computer file system? And if so, how did you do it? What tools were useful?

Obviously, the company I work for must have people who know something about this, but I’d like to start talking to them with some kind of knowledge in my back pocket.

Depending on what the text looks like you may be able to scan them and then use an Optical Character Recognition program to get them into an easily searchable text format.
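
For what it’s worth, once OCR has given you plain text, the search side can start out very simple. Here is a minimal Python sketch, assuming each scanned document has a matching .txt file of OCR output somewhere under one root folder. That layout and the name search_tree are my own assumptions, not taken from any particular product. It walks the folders and reports every line containing a phrase such as “engine fire” or “1566678-45”.

```python
from pathlib import Path


def search_tree(root, phrase):
    """Yield (path, line_number, line) for each line that contains phrase.

    Matching is case-insensitive. Every *.txt file under root is read in full
    on each call, so this is a sketch for small collections, not an index.
    """
    needle = phrase.lower()
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps the scan going if OCR output has odd bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                yield path, number, line.strip()
```

Something like list(search_tree("scans", "engine fire")) would give you the file, line number and text of each hit, where "scans" is a hypothetical folder name. For anything near a million files you would want a real index instead of rereading every file per query, but this shows the basic idea.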

My experience, a few years ago now, is that it’s possible, but only on a large scale, and especially with things you aren’t that fussy about. I worked with an insurance company that wanted to scan, store, and then toss all of their backlog paper files that were kept for legal reasons but rarely used. That was done, but not smoothly. Document feeders do not take kindly to any page that has been handled. They skew, jam, dog-ear, slip out of frame, etc. But mostly they were fine for seldom-used backup.
I worked with banks recording checks, and they had very elaborate ways of dealing with it. Still not smooth, but they had standardized the problem streams so ultimately everything got recorded and read. On that one, the dollar amounts were picked up with OCR and confirmed by checkers looking at an enlarged screen image. They would sit there all day pushing OK, overriding with a number, or diverting it into a Hand Check pile. Worst job ever.
And I worked with one company that was doing important scanning of corporate annual reports for use in a stock broker’s database. The OCR came up with “99.8%” accuracy, but it was seldom clear to us how to locate the 0.2% of errors that would cause problems later. So there was virtually word-by-word, number-by-number eyeball comparison of document and OCR image.

So, I’d say it’s going to cost a whole lot and not be worth it.
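
To put that 99.8 figure in perspective, here is some back-of-the-envelope arithmetic in Python. It assumes the figure means 99.8% of characters are read correctly and that a page holds about 2,000 characters. Both are my assumptions for illustration, not numbers from the project described above.

```python
def expected_ocr_errors(accuracy_percent, chars_per_page, pages=1):
    """Estimate how many characters OCR gets wrong.

    accuracy_percent: share of characters read correctly (assumed per-character).
    chars_per_page:   assumed average number of characters on a page.
    """
    error_rate = (100.0 - accuracy_percent) / 100.0
    return error_rate * chars_per_page * pages


# Roughly 4 wrong characters on one 2,000-character page.
per_page = expected_ocr_errors(99.8, 2000)

# Roughly 400 wrong characters in a hypothetical 100-page report.
per_report = expected_ocr_errors(99.8, 2000, pages=100)
```

A handful of bad characters per page sounds small, but if one of them lands in a dollar figure or a part number, nothing tells you which one, which is why the eyeball comparison ends up being necessary.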

When my agency took on the IT responsibility and budget of the DC Public School System last year, one of the first challenges was electronic document management (EDM). DCPS had building after building full of boxes of paper, sitting on the floor in huge stacks. They weren’t even in filing cabinets.

We created an EDM staff and purchased about two dozen enterprise scanning machines, IBM’s FileNet software and a couple of servers to run the whole thing. We used summer interns to get it off the ground. They loaded up the scanners with stacks of documents and turned everything into an image file and stored all of that on the enterprise SAN.

The next task was pulling all of these document images back across the network into an office (loaded with more interns), and one-by-one somebody looked at each document image and tagged it with relevant key words. Names, SS#, city, state, zip, school year, home-room teacher name, free lunch, transcript, etc.

Your office may or may not have $3.2 million (over 5 years) to spend on something like that. But for us, having all of these documents lying around with no way to find anything or answer FOIA requests was a huge liability and put our AAA bond rating at risk.

You need to do an assessment analyzing the cost of maintaining a paper records system and use that to justify the expense of going paperless. I believe that we projected the cost of warehouse space to house file cabinets and personnel to manage all of that paper at $13 million. So spending $3 million to go paperless seemed to be a no-brainer.

My friend works as an IT guy for a lawyer. He was laid off from that job for a really long time until this case came up that required the scanning of hundreds of thousands of documents into PDFs, then OCR-ing the PDFs to text and I think ultimately making the text searchable via database.

So far he’s put a ton of time into this process, and they have put a good chunk of money into it. Not only does the lawyer have to pay my friend to do the job, he also had to buy a powerful dedicated machine (or two?) to handle the scanning. You’ll need a good scanner with a document feeder plus a program that converts scans to PDF (they had those already). Then you’ll need a machine dedicated to doing the OCR-ing, and a software package to do it which is NOT cheap. You also can’t trust OCR to be perfect so you’ll need to be able to check the output and make corrections. Then you need some sort of database server into which you can dump all of this text. MySQL is free and should work, but support from the MySQL people is not free.

So it can be done but you will need dedicated machines, software and a dedicated person to do the job.
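
A rough idea of what the “dump the text into a database” step could look like, sketched in Python. I am using the built-in sqlite3 module as a stand-in for MySQL purely so the example runs without a server; that substitution, the table layout and the function names are all my own assumptions, not what my friend’s setup actually uses.

```python
import sqlite3


def build_db(pages):
    """Create an in-memory database from (document, page_number, text) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (doc TEXT, page INTEGER, body TEXT)")
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", pages)
    conn.commit()
    return conn


def find(conn, phrase):
    """Return (document, page_number) for every page whose text contains phrase.

    Uses a plain LIKE scan, which SQLite treats as case-insensitive for ASCII.
    Note that % and _ inside phrase act as wildcards here.
    """
    rows = conn.execute(
        "SELECT doc, page FROM pages WHERE body LIKE ? ORDER BY doc, page",
        (f"%{phrase}%",),
    )
    return rows.fetchall()
```

You would feed build_db one row per OCR’d page and then call find(conn, "some phrase") to get back the documents and page numbers to pull up. A LIKE scan reads every row, so for hundreds of thousands of documents a proper full-text index would be the next step.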

You should speak to a document management company like Xerox.

Adobe Acrobat already allows this. If you are really talking about a literal million PDF files, then I agree with Quartz–get a pro in there.

If it’s not exactly “millions,” then you could do it yourself (or, rather, hire a low-wage employee to do it). We did this for a consultant’s personal library (1,000-ish pages of clippings, articles, etc.) several years ago, so the technology is not new.

The DIY way:

  1. Buy a sheet-fed scanner (so you can load in 50 pages at a time and have the scanner auto-scan them)

  2. Buy Adobe Acrobat. It has OCR technology built in, and does a beautiful thing where it turns the OCR’d text into a hidden layer that sits on top of the scanned page image. So you see the original document in all its gory glory, but you can search text, select text, etc. Pretty awesome. Acrobat also will index everything for you so you can search all the documents.

This is what I do for a living - be prepared for it to take a long time and be fairly costly. The company I work for started with 50-something file cabinets (four drawers each) and now, two years after the project started, has twenty-four. We have two people working on it and I replaced a high schooler - it’s not really skilled work, but very tedious and exacting. We use a program called LaserFiche and have a dedicated computer and high-speed scanner (a Canon DR-7580).

The results are great - searchable OCR’d text, high quality images of the pages, and the electronic filing system is very intuitive if you know Windows.

Thank you all for your inputs; now I have some idea what’s involved.

There must be people here at MegaGiantCorp who have done it, so I’m running the trapline here, too.

I did this at two hotels I worked at.

What I did was get a copier that scans the document and, instead of making a paper copy, saves it to a file. If the copier is networked into your computer system, the file stays there, or you can have a module on the copier burn it to a CD.

It’s very easy. I used Canon copiers. In my experience, Xerox is HORRIBLE. It’s not so much their products, though they aren’t up to Canon, but the service. I would call Canon and within 60 minutes, usually less than 30 minutes, they’d show. Xerox would say “we’ll be there next week,” then miss the appointment.

As for putting these in a database, I tried to use optical recognition programs, and I found they are really hit and miss. This isn’t so much of a problem in and of itself, but it becomes one with the attitude of people. They start to think, “Well, I searched the database and it’s not there, so we don’t have it.” In reality it’s there, but the optical recognition program didn’t read it, or read it wrong.
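
One way to soften that problem is to search for near matches as well as exact ones. Here is a minimal Python sketch using the standard difflib module; the function name fuzzy_hits and the 0.8 similarity cutoff are my own choices for illustration, and it only handles single words, not whole phrases.

```python
import difflib


def fuzzy_hits(query, text, cutoff=0.8):
    """Return the words in text that closely resemble query.

    cutoff is difflib's similarity ratio (0 to 1). Lower values catch more
    OCR misreads but also more false matches.
    """
    words = set(text.lower().split())
    matches = difflib.get_close_matches(query.lower(), words, n=10, cutoff=cutoff)
    return sorted(matches)
```

With this, a search for “engine” would still turn up a page where the OCR read it as “eng1ne”. It will not rescue badly mangled text, so people still need to be told that “not found” does not always mean “not there”.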

In today’s economy, it probably would be better hiring a minimum wage temp to physically read each document scanned, type in key words off it and then shred the original.

Finally, check state and federal laws. If you may need to use these documents, some localities have special rules. For instance, files must be made in such a way that they can’t be altered, they can’t be stored on re-writable media, etc.

So before you do that, check how you need to store this type of document so you can use them legally if the need arises.