I’ve volunteered to help a publisher produce a digital archive of their newspaper. The newspaper has been printed monthly since 1904 on A4 paper, with about 20 pages per issue. Issues until about 1965 are black-and-white, then spot colour until around 2003. My task will be to scan the printed copies (up to about 1995; thereafter I have access to the original electronic files) and produce OCR’d PDFs for distribution on CD/DVD/Internet.
I thought I’d ask for some tips or recommendations on the following aspects:
[ol]
[li]What scanning resolution (dpi) is typically used nowadays for archiving documents? I have two high-speed professional RICOH scanners which can do up to 600 dpi.[/li]
[li]The RICOH devices have a “Text OCR” setting with dropout colour, which I presume is best for postprocessing the image with OCR software. (The scanner does not do OCR itself.) The resulting image is a 1-bit TIFF. There are also settings for grayscale and colour JPEGs.
Any suggestions on what scan settings I should use for the black-and-white pages, and for the spot-colour pages?
I presume that for the spot-colour pages, I should scan once with the “Text OCR” setting, for the purpose of OCR, and then once again with the full-colour JPEG setting for presentation purposes. That is, the JPEG images will be stitched together to form a PDF, with the OCR text captured from the TIFF image “underneath”.
For the black-and-white pages, would it make any sense to take a similar approach? That is, should I make a grayscale scan of the page, or will the 1-bit TIFF look good enough in a PDF?[/li]
[li]Any recommendations for OCR software? I am working on a GNU/Linux machine and have gocr and ocrad installed, but don’t have much experience with them. I would prefer to use free/open-source software, but can obtain an MS-Windows machine and commercial OCR software if necessary. As mentioned above, I will need the software to be able to make PDFs with text “underneath” a TIFF or JPEG image. This way the user will see the original scanned page in his PDF viewer, but will also be able to select the text with the mouse or search for it with the Find tool.
Because of the huge volume of newspapers I have to process, my primary criterion for the OCR software is that it should be as close to “batch mode” as possible – I want it to run with minimum user interaction.[/li]
[/ol]
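To give a sense of the volume involved, here is a rough back-of-the-envelope sketch. It assumes 12 issues a year, that every issue from 1904 through 1995 needs scanning, a flat 20 pages per issue, and A4 pages of 210 x 297 mm. The function names are only for illustration, and the sizes are uncompressed, so real TIFF or JPEG files would be considerably smaller.

```python
# Back-of-the-envelope estimate of the scanning workload.
# Assumptions (not confirmed figures): 12 issues a year, every issue
# from 1904 through 1995 is scanned, 20 pages per issue, A4 pages.

MM_PER_INCH = 25.4
A4_MM = (210, 297)  # width, height in millimetres


def page_count(first_year=1904, last_year=1995,
               issues_per_year=12, pages_per_issue=20):
    """Total number of pages to scan, counting both end years."""
    years = last_year - first_year + 1
    return years * issues_per_year * pages_per_issue


def a4_pixels(dpi):
    """Pixel dimensions (width, height) of an A4 page at the given dpi."""
    return tuple(round(side / MM_PER_INCH * dpi) for side in A4_MM)


def raw_bytes_per_page(dpi, bits_per_pixel):
    """Uncompressed image size in bytes for one A4 page."""
    width, height = a4_pixels(dpi)
    return width * height * bits_per_pixel // 8


if __name__ == "__main__":
    print(f"pages to scan: {page_count()}")
    for dpi in (300, 600):
        w, h = a4_pixels(dpi)
        for label, bpp in (("1-bit", 1), ("8-bit grey", 8), ("24-bit colour", 24)):
            mb = raw_bytes_per_page(dpi, bpp) / 1e6
            print(f"{dpi} dpi, {label}: {w}x{h} px, "
                  f"about {mb:.1f} MB/page uncompressed")
```

Under these assumptions that comes to roughly 22,000 pages, which is why I am so keen on a workflow that needs as little manual intervention as possible.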