How does Time magazine transfer 90 years of archives to HTML?

I’m using TIME as an example. All their articles are archived and searchable from their website. (Reading about WW2 that way is fascinating, BTW. For example, search ‘France’ from Jan. 1940 to April 1940 and read the articles speculating on how the coming battle would unfold, or read the articles about Russia in the summer of '42, when the consensus seemed to be that the Soviets were gonna lose.)

Back to the question: Do they have people actually typing these articles in? If they were just scanned, wouldn’t it be in a .pdf format? Or are there tools now to scan a page and convert it to HTML?

Along those lines, I myself have wondered how the New York Times was able to so quickly transfer all of its content back to the mid-1800s, including advertisements and want ads, to a searchable database.

My unfounded assumption is OCR of old proofs or similar; the tech there is pretty good. The output would then be formatted and proofread by interns or something like that.

There are machines that can take books, documents, etc. and scan them at incredibly high rates. Then computers can read the text (mediocrely) and reformat it into HTML.
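For what it's worth, the "reformat it into HTML" step is the easy part once you have the text. Here is a minimal sketch in Python, assuming the OCR stage hands you plain text with blank lines between paragraphs. The function name `ocr_text_to_html` and the markup it emits are my own illustration, not anything a particular archive is known to use.

```python
import html


def ocr_text_to_html(title, text):
    """Wrap plain OCR output in minimal HTML (hypothetical helper).

    Assumes paragraphs are separated by blank lines and that line breaks
    inside a paragraph are just column wrapping from the printed page.
    """
    # Split on blank lines, then collapse the wrapped lines of each paragraph.
    paragraphs = [" ".join(p.split()) for p in text.split("\n\n") if p.strip()]
    # Escape &, < and > so stray OCR characters cannot break the markup.
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return f"<article>\n<h1>{html.escape(title)}</h1>\n{body}\n</article>"


if __name__ == "__main__":
    sample = "First line\nwrapped across a column.\n\nSecond paragraph."
    print(ocr_text_to_html("Made-up headline", sample))
```

The hard part is everything this skips: working out where one article ends and the next begins on a multi-column page, and fixing the words the OCR got wrong.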

There are commercial machines that scan like 1000 pages per hour with vacuum-based page flipping, and home-made DIY ones that can do the same thing at much slower rates.

A lot of newspapers and magazines have been microfilmed. As with the books and such they can scan them at a high rate and OCR them. It may not work great, but it does help quite a bit, especially if you don’t know the date you’re looking for, or even when you do and want to find the name or word rather quickly on the page.

I would guess that things like magazines were already on microfilm, so they went that route instead of trying to scan the original magazines.

So quickly? Do you have a start and end time for this project?

I would not be surprised if they’ve been working on it for ten years. I bought my scanner, which came with OCR software, in 2002 for $120. I’m confident that the NYTimes had easy access to much higher-quality stuff even then, and certainly in more recent years.

There were several companies involved in digitizing newspaper archives in the late 1990s, and they all worked pretty much the same way. They digitized pages from microfilm, ran those through an automated OCR program, and ended up with two massive databases. One had the text of each individual article. The other was an index. In the earliest incarnations, you ran a query on the database and then looked at an image of the article (PDF or TIFF). Later, once they’d had a chance to proof the output, you could create a text or HTML version of each article.
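To make the "two databases" idea concrete, here is a rough sketch in Python of an article store plus a word index built from it. This is only an illustration of the general technique (an inverted index), not the actual design any of those companies used. The function names, the article ids, and the sample text are all made up.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


def build_index(articles):
    """Map each lowercase word to the set of article ids containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in WORD.findall(text.lower()):
            index[word].add(article_id)
    return index


def search(index, query):
    """Return the ids of articles that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    # Start from the first word's articles, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    # "Database one": OCR text keyed by a hypothetical article id.
    articles = {
        "a1": "France waits for the coming battle",
        "a2": "Soviet armies fall back in the summer heat",
    }
    # "Database two": the index used to answer queries.
    index = build_index(articles)
    print(search(index, "coming battle"))
```

A query only touches the index; the matching ids are then used to fetch either the page image or, once it has been proofed, the article text.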

The company that I worked with did 110 years' worth of one newspaper in four months. In the decade since that happened, Google has developed technology that cut that time down dramatically.

I don’t know of any scanning or OCR tool that can only produce PDF output. That’s usually the default; it’s certainly what I prefer for my own records. However, you can change the settings.

Even if you got PDF output from the scanning/OCR process, Acrobat can export PDF files to a wide array of formats, including HTML.

For professional-looking output, the most time-consuming process would be proofing and fine-tuning the layout.

No, I don’t, but since good OCR software didn’t seem to be around until the early 2000s (and even today, our OCR software at work has a pathetically poor rate of accuracy with old documents…like old newspapers, for example), I would have assumed they’ve been doing it for no more than 5 or 6 years. And the complete NYT has been online to subscribers and universities for several years.

Note I wasn’t asserting anything; I was asking a supporting question to the OP. Don’t pick on my question, because I’m not making a statement of fact. I’m looking for an answer as well.

I think the majority of the big early projects, such as court cases on Lexis or the major newspapers on Nexis and ProQuest, were actually done by offshore keyboarding. You hire two Indian women to type the same article, and double-check any place they typed different things. The entire run of Time seems like a lot of words, but it pales in comparison to a law library—and there are a lot of Indian women you can hire.
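The double-check step is easy to automate. Here is a minimal sketch in Python using the standard library's `difflib` to line up two independently typed copies and list only the places where they disagree. The function name `find_disagreements` and the sample sentences are hypothetical; I don't know what tooling the keyboarding shops actually used.

```python
import difflib


def find_disagreements(typed_a, typed_b):
    """Return (version_a, version_b) pairs where two typed copies differ.

    Compares word by word, so one mistyped word produces one small pair
    for a reviewer to check against the original page.
    """
    words_a, words_b = typed_a.split(), typed_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    disagreements = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            # Empty string on one side means that typist skipped the words.
            disagreements.append(
                (" ".join(words_a[a1:a2]), " ".join(words_b[b1:b2]))
            )
    return disagreements


if __name__ == "__main__":
    first = "The coming battle may decide the war"
    second = "The coming battle may decide the wax"
    for a, b in find_disagreements(first, second):
        print(f"typist 1: {a!r}  typist 2: {b!r}")
```

The reason this works is that two typists rarely make the same mistake in the same spot, so anything both copies agree on is almost certainly right and only the mismatches need a human look.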

OCR technology has advanced a lot in the last decade, though, and I now see some old magazine archives (American Heritage comes to mind) that seem to have been done that way.