I read 3 or 4 different ebooks per month on my Android phone using the Kindle app. I have to say that for the most part everything is formatted properly. However, I tend to notice some mistakes that seem to appear over and over in different books (kinda ironic, me talking about mistakes). Generally things like hyphenated words that shouldn’t be or page numbers in the middle of sentences. I’m wondering how publishers go about digitizing books. My theory is that the books are scanned page by page and a computer formats the ebook. That would explain the mistakes I’ve seen as I don’t think a human would continue to make the same ones in so many books. So, how’s it done?
There’s an apparatus that works like you think it does. A camera is above the book, with a bright light, and someone just opens the book and turns each page, pushing a button each time. An algorithm undistorts the printed page and then recognizes the text.
These days, the author just gives the publisher the digital file he used when he wrote the book. Word document or whatever. The publisher can just convert that document to something convenient for editing, edit the book, then convert it right to an ebook directly.
The formatting errors you are describing can be caused by a lot of things. The page numbers might be part of the text and they did a lazy conversion. Ditto the hyphenation. They might have just done a bad job converting from the digital file they were going to use for printing to the ebook format.
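To make the "lazy conversion" idea concrete, here is a rough sketch of the cleanup pass that such a conversion skips. It assumes the text was extracted line by line from a print layout, that a page number sits alone on its own line, and that a line-end hyphen followed by a lowercase letter is a layout break. The function name and both rules are illustrative guesses, not how any particular publisher does it.

```python
import re

def clean_converted_text(lines):
    """Drop page-number-only lines, then rejoin words split by a line-end hyphen."""
    # Assumption: a page number appears alone on a line as 1-4 digits.
    kept = [ln for ln in lines if not re.fullmatch(r"\s*\d{1,4}\s*", ln)]
    text = "\n".join(kept)
    # Assumption: "-" + newline + lowercase letter is a print-layout break,
    # not a genuine hyphenated compound.
    return re.sub(r"(\w)-\n([a-z])", r"\1\2", text)

sample = ["The publisher con-", "42", "verted the file."]
print(clean_converted_text(sample))
```

The hyphen rule is the risky part: a real compound such as "well-known" that happens to break at the hyphen would be glued together, which is exactly the sort of thing a human editor has to catch.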
There are 3 stages (or more, depending on how you break it down.) Scan the pages, run it through an OCR program, and edit it. Editing is by far the longest step. Scanning and OCRing the book could take less than an hour. Editing it back into publishable, error-free shape could take 10 or 20 hours (or more.) If you were seeing fully automated, human-free digitization, it would be much worse.
I once digitized a couple of issues of an old SF pulp magazine (the only two issues of that title) with the intent of preserving the original formatting as closely as possible. Those two issues took so much time that I abandoned the idea of doing it for any more of my old pulps.
Any book published in the past 15 years was sent to the publisher in an electronic version. The author would have an electronic copy if the publisher didn’t.
By the mid-1980s, all books were computer typeset. Those files may have gotten lost over the years, but if they exist in a readable form, then that can be the start.
Anything else can be scanned in from a printout and reformatted. That is more time-consuming, since the scanning can introduce errors and the format will need to be fixed.
I’ve noticed that some e-books have serious problems with ligatures – assuming they were typeset in software that supports them and during some point in the process, dumped into software which does not – words with “f” are a common place for errors. Sometimes the errors look more like poor OCR/scanning issues, and sometimes you can tell that a word was hyphenated on a line break but now isn’t on a line break anymore…as a graphic designer, I’m always kind of interested to see where the errors are and think about how they went wrong.
Or a third option: The author turns it into an ebook themselves.
According to the NYTimes “Last year, a third of the 100 best-selling Kindle books were self-published titles on average each week, an Amazon representative said.”
I once converted a non-fiction PDF to an EPUB (for personal use on my reader) where each of the ligatures (ff, ll, etc) was embedded as a single character that Calibre didn’t recognize, leaving that empty box character. I had to go through the whole book by hand and figure out which pair of letters was missing from each word with the square character. Serious pain in the ass, and I should have abandoned the task except for being too stubborn to let the book beat me.
It’s examples like this that illustrate the incompatibilities between programs that still remain. Outside the “standard” ASCII character set A…Z, a…z, 0-9 and a limited list of punctuation characters, not every formatting program interprets every symbol the same way.
So in this case, nothing to do with either the OCR or the source material, but rather with the actual digital format.
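For the ligature case described above, a small sketch of the fix, assuming the ligatures were stored as the standard Unicode presentation forms (U+FB00 to U+FB04). If the PDF used private or font-specific glyph codes instead, which the empty boxes might suggest, this table would not match and the hand repair would still be needed. The names below are made up for illustration.

```python
# Map each single-character Latin ligature to the plain letters it stands for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

def expand_ligatures(text):
    """Replace ligature characters with their plain-letter equivalents."""
    return text.translate(str.maketrans(LIGATURES))

print(expand_ligatures("e\ufb03cient \ufb01le"))
```

An explicit table is used here rather than a blanket Unicode normalization so that nothing else in the text gets rewritten along the way.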
It’s not unlikely that problems might occur with the various readers’ implementation of the file format standards as well. As in, different readers may render the same e-book slightly differently based on how their software is written.
As observed above, you can go straight from the author’s file to e-book, or you can use scanned versions with software to convert. You also have to use scans to reproduce illustrations.
I have an edition of Jules Verne’s Civil War novel North and South on my Nook, and it’s converted so badly that it’s hilarious. (It cost me nothing, and you pretty much get what you pay for).
It was clearly scanned from a 19th century original, but the original was so damaged, with possibly torn or partly missing pages, or maybe with water damage, that it was a real challenge for the digitizing software to keep up. In many places it can’t recognize the letters, and is forced to come up with a “best guess” as to which letter is supposed to be there, and it often guesses wrong. In other cases it can’t find a matching letter at all, and ends up using whatever mark in its repertoire is closest, often something like a slash ( / ) or dashes or underscores ( - _ | ) or whatever (~< > ^ ). This makes reading a real adventure. Sometimes you can figure out what it’s supposed to be saying. Other times it’s total gibberish, looking like Captain Kidd’s cipher from The Gold Bug.
Of course, the lazy way out would be just to scan it as an image PDF and not attempt any OCR. Not nearly as useful, but at least you “preserve the original formatting.”
Every paper I ever published is on my web site. Some were obviously scanned (by the publisher or by J-Stor) and look it. Some were posted from my original computer files (and might differ slightly from the published version) and a few I retyped myself since the scanned version is such poor quality (published from a typescript) or, in one case, because I could not find an electronic version. In all cases the posted version is pdf, not epub. It is hard to see how I could make an epub version of a mathematics paper.
I also did some OCR work on some of my papers to get e-versions of them.
My thesis was lots of fun. I only had a daisy wheel printer so I faked a lot of symbols. E.g., “There exists” was a “]” overtyped with a “-”. So I got something like [del]][/del]. Arrows were just “->” and on and on. The OCR software kept trying to convert these into letters/digits and not consistently. So find and replace was limited.
Just getting rid of the extra hard returns while keeping the required ones was time consuming.
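The easy half of the hard-return problem can be sketched like this, assuming a blank line marks a real paragraph break and every single newline is just where the printed line ended. OCR output often does not honor that assumption, which is why the job ends up being manual. The function name is illustrative only.

```python
import re

def unwrap_hard_returns(text):
    """Join wrapped lines inside each paragraph; keep blank-line paragraph breaks."""
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = [
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    ]
    return "\n\n".join(p for p in joined if p)

raw = "First line\nof a paragraph.\n\nSecond\nparagraph."
print(unwrap_hard_returns(raw))
```

Required returns that are not marked by a blank line (verse, displayed formulas, lists) would be flattened by this, so they still need checking by eye.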
So I give some slack to people who do this sort of thing.
But to see obvious errors, e.g., where it’s not even a word or missing spaces, in recently written ebooks is disheartening. They didn’t even run it thru a spellchecker? Do a quick scan thru?
Lazy’s not just for individuals, either. National Geographic famously did this (at fairly low resolution, even) when they created their “every issue on CD” collection back in the early 2000’s. It was…not well received.
I don’t know about it being not well received–it was withdrawn from the market because of a lawsuit over republication rights brought on by photographers. The courts eventually found in NG’s favor and the collection was re-released (in higher resolution on DVD) after that. I have both versions.
Even before the suit, the reviews were brutal, especially for the large price (couple hundred dollars, I think). Pages were misaligned, some were unreadable, photos were in low resolution, no search… I assume the second generation product fixed that.
I had some illegal (bootlegged) ebooks from back before ereaders were a thing. Most were in Word or pdf format and actually, they did suck.
I remember taking copies of the first four Harry Potter books with me on a plane, all in Word format, and all on my laptop. It was painful.
It’s clear that now, proper digital copies are prepared in advance and sent to the ebook distributor, who makes sure they are correctly set up before putting them out there.
Mahaloth: Proud owner of all legitimate physical and ebook Harry Potter books for some time now.
To convert files to publish through Amazon as an ebook, proprietary software is used. In the words of Bart Simpson, “I didn’t think it was physically possible, but this both sucks and blows.” Worst example, in one of my ebooks you will see different errors depending upon which font you select on your reader. No, these did not show up on Preview.
Kindle is notorious for having multitudes of bugs. It wouldn’t be surprising to see the same errors creep into multiple publications.
There are numerous formats for ebooks. They all do somewhat different things to the primary file, even if it enters in a common format like .docx. Every site online that creates ebooks has guidelines about what you need to do to format a book. All are horrendous.
Nobody would accept print books with a tenth as many errors as a converted ebook. I don’t know why it’s not more of a scandal.
Among other problems, like the ones already mentioned, is the fact that ebook readers (or, more accurately, ebook protocols in general) suffer from the same looseness of implementation as websites and browsers, where some website+browser combination may not work particularly well but most others will. This is not surprising since both HTML and ebook formats (at least, EPUB and Kindle formats like KF8, and probably all the others) are really just containerized HTML and CSS. Ebooks are, in effect, websites and ereaders are specialized browsers, and sloppy encodings and implementations abound.
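For EPUB specifically, the "it's really a website in a box" point is easy to check, because an EPUB file is a ZIP archive. The sketch below builds a toy archive in memory (deliberately not a spec-complete EPUB; the file names are invented) and lists the markup and stylesheet files inside it. Pointing list_markup_members at a real, DRM-free .epub path should work the same way.

```python
import io
import zipfile

def list_markup_members(epub_file):
    """Return the names of the HTML/XHTML/CSS files inside an EPUB (a ZIP archive)."""
    with zipfile.ZipFile(epub_file) as zf:
        return sorted(
            name for name in zf.namelist()
            if name.lower().endswith((".html", ".xhtml", ".css"))
        )

def make_toy_epub():
    """Build a tiny stand-in archive in memory, just for this demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("OEBPS/chapter1.xhtml", "<html><body><p>Hello</p></body></html>")
        zf.writestr("OEBPS/style.css", "p { margin: 0; }")
    buf.seek(0)
    return buf

print(list_markup_members(make_toy_epub()))
```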
I never saw a page that was misaligned, I never saw a page that was unreadable, and a couple of hundred bucks for well over 1,000 issues of a magazine doesn’t sound unreasonable to me. As for low resolution, at the resolution that they were, the set took up 30 CDs. Would people have preferred it to be a couple of hundred discs? No, the set (old or new) doesn’t fit some miraculous ideal of perfectly reproducing the resolution of the print magazine for free, but it also takes up a heck of a lot less shelf space than a complete run of the magazine (and costs less, too–many issues may be easy and cheap to find, but good luck finding all of them without making it a major, long-term effort.)
I’m glad that somebody apparently agreed with the set not being that good, though, because I got my CD-ROM set for $7.00 at a Goodwill Store. (Although–all things considered–I wish that I had picked up this painting from that exact store instead.)