How does Time magazine transfer 90 years of archives to HTML?

Quintas · September 2, 2011, 5:06am

I’m using TIME as an example. All their articles are archived and searchable from their website. (reading about WW2 that way is fascinating BTW. ex: search ‘France’ from Jan. 1940-April 1940 and read the articles speculating on how the coming battle would unfold. Or the articles about Russia in the summer of '42 and the consensus seemed to be that the Soviets were gonna lose).

Back to the question: Do they have people actually typing these articles in? If they were just scanned, wouldn’t it be in a .pdf format? Or are there tools now to scan a page and convert it to HTML?

Una_Persson · September 2, 2011, 5:11am

Along those lines, I myself have wondered how the New York Times was able to so quickly transfer all of its content back to the mid-1800’s, including advertisements and want ads, to a searchable database.

Darth_Sensitive · September 2, 2011, 5:12am

My unfounded assumption is OCR of old proofs or similar; the tech there is pretty good. They would then be formatted and proofread by interns or something like that.

Reply · September 2, 2011, 5:36am

There are machines that can take books, documents, etc. and scan them at incredibly high rates. Then computers can read the text (mediocrely) and reformat it into HTML.

There are commercial machines that scan like 1000 pages per hour with vacuum-based page flipping, and home-made DIY ones that can do the same thing at much slower rates.

Edward_The_Head · September 2, 2011, 12:22pm

A lot of newspapers and magazines have been microfilmed. As with the books and such they can scan them at a high rate and OCR them. It may not work great, but it does help quite a bit, especially if you don’t know the date you’re looking for, or even when you do and want to find the name or word rather quickly on the page.

I would say that things like magazines were already on microfilm, so they went that route instead of trying to take an original magazine and copy it that way.

Keeve · September 2, 2011, 12:45pm

So quickly? Do you have a start and end time for this project?

I would not be surprised if they’ve been working on it for ten years. I bought my scanner, which came with OCR software, in 2002 for $120. I’m confident that the NYTimes had easy access to much higher-quality stuff even then, and certainly in more recent years.

anson2995 · September 2, 2011, 3:44pm

There were several companies involved in digitizing newspaper archives in the late 1990s, and they all worked pretty much the same. They digitized pages from microfilm, ran those through an automated OCR program, and ended up with two massive databases. One had the text of each individual article. The other was an index. In earliest incarnations, you ran a query on the database and then looked at an image of the article (PDF or TIFF). Later, once they’d had a chance to proof the output, you could create a text or HTML version of each article.

The company that I worked with did 110 years’ worth of one newspaper in four months. In the decade since that happened, Google has developed technology that cut that time down dramatically.
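
For anyone curious what the "two databases" arrangement described above looks like in practice, here is a minimal sketch in Python. It is not the actual system any of those companies used; the article IDs, sample text, and function names are all made up for illustration. One dictionary holds the OCR text of each article, and a second (the index) maps each word to the articles that contain it.

```python
import re
from collections import defaultdict

# Words are runs of lowercase letters or digits; a real system would
# also need to cope with hyphenation and OCR noise.
WORD_PATTERN = re.compile(r"[a-z0-9]+")


def build_index(articles):
    """Map each word to the set of article IDs whose text contains it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in WORD_PATTERN.findall(text.lower()):
            index[word].add(article_id)
    return index


def search(index, query):
    """Return the IDs of articles containing every word in the query."""
    terms = WORD_PATTERN.findall(query.lower())
    if not terms:
        return set()
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return matches


# "Database" one: article text keyed by a hypothetical date/page ID.
# The text here is invented sample data, not real archive content.
articles = {
    "1940-03-04/p12": "France waits behind the Maginot Line",
    "1942-07-20/p18": "Russia retreats as the summer offensive begins",
    "1940-04-15/p09": "France and Britain weigh the coming battle",
}

# "Database" two: the index built from the OCR text.
index = build_index(articles)

print(sorted(search(index, "France battle")))
```

The query only touches the index, so the result can then be used to pull up either the stored text or the scanned page image for each matching ID.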

dracoi · September 2, 2011, 4:12pm

I don’t know of any scanning or OCR tool that can only produce PDF output. That’s usually the default; it’s certainly what I prefer for my own records. However, you can change the settings.

Even if you got PDF output from the scanning/OCR process, Acrobat can export PDF files to a wide array of formats, including HTML.

For professional-looking output, the most time-consuming process would be proofing and fine-tuning the layout.

Una_Persson · September 2, 2011, 4:20pm

No, I don’t, but since good OCR software didn’t seem to be around until the early 2000’s (and even today, our OCR software at work has a pathetically poor rate of accuracy with old documents…like old newspapers, for example), I would have assumed they’ve been doing it for no more than 5 or 6 years. And the complete NYT has been online to subscribers and universities for several years.

Note I wasn’t asserting anything; I was asking a supporting question to the OP. Don’t pick on my question because I’m not making a statement of fact; I’m looking for an answer as well.

Mr_Downtown · September 3, 2011, 5:06am

I think the majority of the big early projects, such as court cases on Lexis or the major newspapers on Nexis and ProQuest, were actually done by offshore keyboarding. You hire two Indian women to type the same article, and double-check any place they typed different things. The entire run of Time seems like a lot of words, but it pales in comparison to a law library—and there are a lot of Indian women you can hire.

OCR technology has advanced a lot in the last decade, though, and I now see some old magazine archives (American Heritage comes to mind) that seem to have been done that way.
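
The double-keying check described above (two people type the same article, and only the places where they differ get a second look) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name and the sample sentences are invented, and it uses the standard library's difflib to line up the two versions word by word, which is not necessarily how any keyboarding vendor did it.

```python
import difflib


def find_disagreements(typed_a, typed_b):
    """Return (word position in A, words from A, words from B) for every
    place where the two typists' versions differ."""
    words_a = typed_a.split()
    words_b = typed_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    disagreements = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append(
                (a_start, words_a[a_start:a_end], words_b[b_start:b_end])
            )
    return disagreements


# Invented sample: the second typist made one slip.
first_typist = "The Soviets were expected to lose the summer campaign"
second_typist = "The Soviets were expected to loose the summer campaign"

for position, from_a, from_b in find_disagreements(first_typist, second_typist):
    print(f"word {position}: {from_a} vs {from_b}")
```

The idea is that two typists rarely make the same mistake in the same spot, so a reviewer only has to look at the flagged words rather than proofread the whole article.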
