Scanning 100 years of newspapers. Advice?

I’ve volunteered to help a publisher produce a digital archive of their newspaper. The newspaper has been printed monthly since 1904 on A4 paper, with about 20 pages per issue. Issues until about 1965 are black-and-white, then spot colour until around 2003. My task will be to scan the printed copies (up to about 1995; thereafter I have access to the original electronic files) and produce OCR’d PDFs for distribution on CD/DVD/Internet.

I thought I’d ask for some tips or recommendations on the following aspects:
[ol]
[li]What sort of scanning DPI is typically used nowadays to archive documents? I have two high-speed professional RICOH scanners which can do up to 600 dpi.[/li]
[li]The RICOH devices have a “Text OCR” setting with dropout colour, which I presume is best for postprocessing the image with OCR software. (The scanner does not do OCR itself.) The resulting image is a 1-bit TIFF. There are also settings for grayscale and colour JPEGs.
Any suggestions on what scan settings I should use for the black and white pages, and for the spot-colour pages?

I presume that for the spot colour pages, I should scan once with the “Text OCR” setting, for the purpose of OCR, and then once again with the full-colour JPEG setting for presentation purposes. That is, the JPEG images will be stitched together to form a PDF, with the OCR text captured from the TIFF image “underneath”.

For the black and white pages, would it make any sense to take a similar approach? That is, should I make a grayscale scan of the page, or will the 1-bit TIFF look good enough in a PDF?[/li]

[li]Any recommendations for OCR software? I am working on a GNU/Linux machine and have gocr and ocrad installed, but don’t have much experience with them. I would prefer to use free/open-source software, but can obtain an MS-Windows machine and commercial OCR software if necessary. As mentioned above, I will need the software to be able to make PDFs with text “underneath” a TIFF or JPEG image. This way the user will see the original scanned page in his PDF viewer, but will also be able to select the text with the mouse or search for it with the Find tool.
Because of the huge volume of newspapers I have to process, my primary criterion for the OCR software is that it should be as close to “batch mode” as possible – I want it to run with minimum user interaction.[/li]
[/ol]

For accurate OCR you will need at least 300 dpi, and because this is a newspaper you will want to go higher than that. I’m not certain there is anything to be gained by scanning twice. Try scanning each page once in full color at 600 dpi and then running those images through the text mode of your OCR software.
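To get a feel for what 600 dpi means in practice, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: the page count is an estimate derived from the figures in the question (monthly issues, about 20 pages each, 1904 to about 1995), the sizes are uncompressed, and real TIFF or JPEG files will be considerably smaller after compression.

```python
A4_MM = (210, 297)   # A4 paper size in millimetres
MM_PER_INCH = 25.4


def scan_size(dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for one A4 page."""
    width = round(A4_MM[0] / MM_PER_INCH * dpi)
    height = round(A4_MM[1] / MM_PER_INCH * dpi)
    return width, height, width * height * bits_per_pixel // 8


# Rough estimate only: 12 issues a year, about 20 pages each, 1904-1995.
ESTIMATED_PAGES = (1995 - 1904 + 1) * 12 * 20

modes = [
    ("300 dpi, 1-bit", 300, 1),
    ("600 dpi, 1-bit", 600, 1),
    ("600 dpi, 8-bit gray", 600, 8),
    ("600 dpi, 24-bit colour", 600, 24),
]

for label, dpi, bpp in modes:
    w, h, size = scan_size(dpi, bpp)
    total_gb = size * ESTIMATED_PAGES / 1e9
    print(f"{label:>23}: {w} x {h} px, "
          f"{size / 1e6:.1f} MB per page, about {total_gb:.0f} GB in total")
```

The point is only that a single full-colour pass at 600 dpi is heavy on disk space before compression, so it is worth budgeting for storage before you start.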

I’ve used Abbyy for scanning newspapers before, and if you train it up it does an excellent job of recognizing text. Newspapers are a special and more difficult case, though, because sophisticated typography that is inconsistent even from page to page is common. You have your work cut out for you :wink:
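On the batch-mode requirement from the question, a free/open-source route could look roughly like the sketch below. Note the assumptions: it uses Tesseract, not Abbyy, so it is not the setup described above; the directory layout and file names are hypothetical; and it only builds the command lines, leaving it to you to run them. Tesseract's `pdf` output config writes a PDF showing the page image with the recognized text in an invisible layer underneath, which is the kind of searchable PDF the question asks for.

```python
from pathlib import Path


def tesseract_command(image, out_dir, lang="eng"):
    """Build the tesseract command line for one scanned page.

    Tesseract is called as: tesseract IMAGE OUTPUTBASE [options] [configfile].
    With the "pdf" config it writes OUTPUTBASE.pdf.
    """
    image = Path(image)
    output_base = Path(out_dir) / image.stem   # tesseract appends ".pdf"
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def batch_commands(scan_dir, out_dir, pattern="*.tif", lang="eng"):
    """Build one command per scan in scan_dir, sorted by file name."""
    scans = sorted(Path(scan_dir).glob(pattern))
    return [tesseract_command(scan, out_dir, lang) for scan in scans]
```

Each command in the returned list can be handed to `subprocess.run(cmd, check=True)` in a loop, so a whole directory of scans is processed without any interaction. Expect accuracy on old newsprint to vary; checking a sample of pages by hand before committing to the full run is a good idea.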