Scanning 100 years of newspapers. Advice?

psychonaut · December 24, 2005, 6:37pm

I’ve volunteered to help a publisher produce a digital archive of their newspaper. The newspaper has been printed monthly since 1904 on A4 paper, with about 20 pages per issue. Issues until about 1965 are black-and-white, then spot colour until around 2003. My task will be to scan the printed copies (up to about 1995; thereafter I have access to the original electronic files) and produce OCR’d PDFs for distribution on CD/DVD/Internet.

I thought I’d ask for some tips or recommendations on the following aspects:
[ol]
[li]What sort of scanning DPI is typically used nowadays to archive documents? I have two high-speed professional RICOH scanners which can do up to 600 dpi.[/li]
[li]The RICOH devices have a “Text OCR” setting with dropout colour, which I presume is best for postprocessing the image with OCR software. (The scanner does not do OCR itself.) The resulting image is a 1-bit TIFF. There are also settings for grayscale and colour JPEGs. Any suggestions on what scan settings I should use for the black-and-white pages, and for the spot-colour pages?

I presume that for the spot-colour pages, I should scan once with the “Text OCR” setting, for the purpose of OCR, and then once again with the full-colour JPEG setting for presentation purposes. That is, the JPEG images will be stitched together to form a PDF, with the OCR text captured from the TIFF image “underneath”.

For the black-and-white pages, would it make any sense to take a similar approach? That is, should I make a grayscale scan of the page, or will the 1-bit TIFF look good enough in a PDF?[/li]
[li]Any recommendations for OCR software? I am working on a GNU/Linux machine and have gocr and ocrad installed, but don’t have much experience with them. I would prefer to use free/open-source software, but can obtain an MS-Windows machine and commercial OCR software if necessary. As mentioned above, I will need the software to be able to make PDFs with text “underneath” a TIFF or JPEG image. This way the user will see the original scanned page in his PDF viewer, but will also be able to select the text with the mouse or search for it with the Find tool.

Because of the huge volume of newspapers I have to process, my primary criterion for the OCR software is that it should be as close to “batch mode” as possible – I want it to run with minimum user interaction.[/li]
[/ol]
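To make the “batch mode” requirement concrete, here is a minimal Python sketch of the kind of driver script it implies. Everything in it is an assumption for illustration: the directory layout, the `.tif` naming, the function names, and above all `example_command`, whose `my-ocr-tool` is a made-up placeholder rather than the command line of gocr, ocrad or any other real program.

```python
import subprocess
from pathlib import Path


def plan_jobs(scan_dir, out_dir, pattern="*.tif"):
    """Pair every scanned page with the text file its OCR output should go to."""
    scan_dir, out_dir = Path(scan_dir), Path(out_dir)
    jobs = []
    for image in sorted(scan_dir.rglob(pattern)):
        target = out_dir / image.relative_to(scan_dir).with_suffix(".txt")
        if not target.exists():  # skip finished pages so a rerun resumes
            jobs.append((image, target))
    return jobs


def run_jobs(jobs, command):
    """Run the OCR command for each job and return the images that failed.

    `command` is a function (image, target) -> argument list.
    """
    failed = []
    for image, target in jobs:
        target.parent.mkdir(parents=True, exist_ok=True)
        try:
            result = subprocess.run(command(image, target), capture_output=True)
            ok = result.returncode == 0
        except OSError:  # e.g. the OCR program is not installed
            ok = False
        if not ok:
            failed.append(image)
    return failed


def example_command(image, target):
    # Hypothetical placeholder: substitute the real OCR invocation here.
    return ["my-ocr-tool", str(image), str(target)]
```

The point of the design is that the only interaction is starting the script and reading the list of failed pages afterwards. Pages whose output already exists are skipped, so an interrupted run can simply be restarted.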

alterego · December 25, 2005, 7:06pm

For accurate OCR you will need at least 300 dpi, but because this is a newspaper, with small type, you will need a higher resolution. I’m not certain there is anything to be gained by scanning twice. Try scanning once in full color at 600 dpi and then run those images through the text mode of the OCR software.
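To get a feel for what 300 versus 600 dpi means in storage terms, here is a rough back-of-the-envelope sketch in Python. It assumes every page is exactly A4 (210 mm × 297 mm), counts uncompressed bytes only (real TIFF and JPEG files will be much smaller), and assumes an unbroken run of monthly 20-page issues from 1904 through 1995, so treat the totals as an upper bound rather than a prediction.

```python
# Rough storage estimate for uncompressed A4 scans (a sketch, not a measurement).
A4_INCHES = (210 / 25.4, 297 / 25.4)  # A4 is 210 mm x 297 mm


def pixel_dimensions(dpi):
    """Pixel width and height of an A4 page scanned at the given DPI."""
    return tuple(round(side * dpi) for side in A4_INCHES)


def uncompressed_bytes(dpi, bits_per_pixel):
    """Raw image size in bytes, before any TIFF/JPEG compression."""
    width, height = pixel_dimensions(dpi)
    return width * height * bits_per_pixel // 8


# Assumption: one 20-page issue every month from 1904 through 1995.
PAGES = (1995 - 1904 + 1) * 12 * 20

if __name__ == "__main__":
    for dpi in (300, 600):
        for label, bpp in (("1-bit", 1), ("grayscale", 8), ("colour", 24)):
            per_page = uncompressed_bytes(dpi, bpp)
            total_gb = per_page * PAGES / 1e9
            print(f"{dpi} dpi {label}: {per_page / 1e6:.1f} MB/page, "
                  f"{total_gb:.0f} GB total")
```

Doubling the resolution quadruples the pixel count, so a 600 dpi colour master is about four times the size of a 300 dpi one before compression. That is worth weighing against the better OCR input.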

I’ve used ABBYY for scanning newspapers before, and if you train it, it does an excellent job of recognizing text. Of course, newspapers are a special and more difficult case, since their typography is often sophisticated and inconsistent, even from page to page. You have your work cut out for you.
