How are ebooks made?

What I suspect is an easily answered question: how are ebooks made? Is an existing book scanned? Is it retyped?

I doubt that they are scanned because I have bought ebooks with layout errors. Equally, it doesn’t say much for the typist. So how does it work?

It can work either way, or both ways. I imagine books with fancy typography and illustrations are scanned and OCR’d to preserve the original layout while still having the text available electronically for text-to-speech and copy-and-pasting. In others only the text is preserved, either by OCRing (for older works) or by exporting the original markup into a format suitable for the ebook.

Well, when a book and a computer love each other very much…:smiley:

Another method might simply be to skip the printing step, assuming that many writers nowadays do their writing on a computer, meaning that the book may never exist as a concrete thing until it is printed. For an eBook, instead of printing, they’d just encode it (or whatever) into the proper format and publish it online.

I know a person who for a while had a job reading scanned books and correcting them for errors in the OCR textifying process. She said it could be an especially tedious job if the book in question wasn’t terribly well written to begin with. :stuck_out_tongue:

As noted, it depends on whether you’re talking about taking a printed book and turning it into an ebook, or taking a book that was originally created electronically (e.g. with a word processor) and put into an ebook format.

You may find it informative to read through the Scanning FAQ at Project Gutenberg, which is aimed at people who want to scan (old, out-of-copyright) books and make them available in electronic form.

Most ebook formats are layout independent… the layout is determined by your reader.

Current eBooks are created from the electronic version of the manuscript. Older books can be scanned or re-typed. When I worked for a major publisher and we were reprinting an older book, we occasionally sent a copy of the book out to be re-typed. I doubt anyone does that anymore. Scanning and OCR technology makes that unnecessary.

Recaptcha is often used to augment the OCR.

Some ebooks use proprietary formats, like the Kindle’s. Amazon has a converter that takes your electronic file and [del]butchers it[/del] converts it to its format. There are several other proprietary formats as well. Kindle won’t convert from PDFs because they don’t scale the way a font does, although programs are available to allow you to read a pdf on a Kindle.

Other readers can use .pdf files or .doc files or .txt files or any one or more of a dozen others.

There are sites and firms that will convert books for you in these various formats for reading on the various devices.

Books with straight text are fairly easy to convert. Books with many fonts and formats are harder. Images are a problem unless embedded in a .pdf file. They may or may not transfer and may or may not appear where they are supposed to. Footnotes also have problems. Sometimes footnotes are clickable, sometimes they aren’t. Kindle converts footnotes to endnotes, which can really screw things up.

IOW, there are dozens of readers, dozens of formats, dozens of converters, dozens of original forms, and millions of ways that it can be done and can be screwed up. What works on one reader may not work at all on another. It’s a field in its infancy and will be cited in the future as an object lesson in why standards are so important.

I have come across layout problems such as a gap between a l etter and the rest of the word, inexplicable gaps between words, or incorrect justification.*

* Which I can’t get this text to display.

Layout independent is a different thing from the source text that is being laid out. If the conversion software puts in an incorrect space it will show up that way on any reader. But what shows on the page as a whole is determined by the reader, because it will scale fonts. That creates the pagination you see, but not what lies in between the top and bottom of a page.
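To make that concrete, here is a rough sketch in Python. It is only a toy model, not how any actual reader works: `paginate`, `chars_per_line` and `lines_per_page` are names made up for the illustration, and font size is crudely modelled as "fewer characters fit on a line". The same source text is paginated twice; the page count changes, but a stray space baked into the source survives both times.

```python
import textwrap


def paginate(text, chars_per_line, lines_per_page):
    """Toy reflow: wrap the text to a line width, then cut the lines into pages."""
    lines = textwrap.wrap(text, width=chars_per_line)
    return [
        lines[i:i + lines_per_page]
        for i in range(0, len(lines), lines_per_page)
    ]


def flatten(pages):
    """Join every line of every page back into one string."""
    return " ".join(line for page in pages for line in page)


# Source text with a conversion error (a stray space inside "letter").
source = " ".join(["The conversion put a gap inside one l etter group."] * 20)

small_font_pages = paginate(source, chars_per_line=40, lines_per_page=10)
large_font_pages = paginate(source, chars_per_line=20, lines_per_page=5)

print(len(small_font_pages), "pages at the small font")
print(len(large_font_pages), "pages at the large font")
```

The reader decides where lines and pages break, but it can only lay out the characters it was given, so the bad space has to be fixed in the source file.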

If you look at even a simple manuscript in Word with the Show button clicked, you’ll see how many hundreds of formatting marks are necessary. (WordPerfect has a better Show feature that reveals far more detail.) Every single thing you do or add to the most straightforward text - a change in size or paragraphing, hanging or indented text, line spacing, a symbol or an accent, a dash or a dropped capital, numbers, formulas, justification, headers and footers, chapters, and the thousands of other things that Word can do - has to be interpreted by the converter software and redone in a way that limits the thousands of variables to the half dozen that will appear on screen. Not only do no two converters interpret in the same way, I’ll bet that you’ll get different results putting a complicated text through a single converter several times. That’s because one tiny change will ripple through the rest of the document and affect every character that comes after it.

Every ebook is a software program that needs to be debugged by a human. And humans are lousy editors.

Not sure what you mean here. The latest version of the Kindle will read PDFs (no additional programs necessary), although the small screen size can make them hard to read. And Amazon will convert PDFs to Kindle’s format (email the document to your Kindle, with the word “convert” in the Subject line); but if the PDF file contains something other than straight, unformatted text, the results will leave something to be desired.

There is a difference between using Kindle as an ebook reader and publishing an ebook. You’re right that you no longer have to go through a third party for a .pdf. The underlying problem still remains.


Most published ebooks are reflowable and so have to be in those formats. You can use the Kindle and most other readers and tablets and computers and PDAs and phones and everything electronic to read material from almost any format. But that’s not usually what is meant by an ebook. An ebook takes advantage of the form.

The Kindle mostly uses Mobipocket, a long-established e-book standard that is based on HTML 3.2 with a few custom tags added. Amazon also uses Topaz for some books, reportedly books that had no existing electronic copy and had to be scanned in. Topaz is interesting because it stores images of each distinct character, and then recreates the book by linking the images together in the proper order. And there’s also some often-not-very-well-OCRed text for searching and whatnot.
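The "store each distinct character image once, then link them in order" idea can be sketched like this. To be clear, this is not the Topaz file format, which I haven't seen documented; it is just the general technique, with plain characters standing in for glyph images and `encode`/`decode` being names invented for the example.

```python
def encode(glyphs):
    """Store each distinct glyph once; represent the page as indexes into that table."""
    table = []        # one entry per distinct glyph, in order of first appearance
    index_of = {}     # glyph -> position in the table
    sequence = []     # the page, as a list of table indexes
    for glyph in glyphs:
        if glyph not in index_of:
            index_of[glyph] = len(table)
            table.append(glyph)
        sequence.append(index_of[glyph])
    return table, sequence


def decode(table, sequence):
    """Rebuild the page by looking up each index in the glyph table."""
    return [table[i] for i in sequence]


# Characters stand in for scanned glyph images here.
table, sequence = encode(list("banana"))
print(table)
print(sequence)
print("".join(decode(table, sequence)))
```

With real scans the hard part would be deciding that two slightly different blobs of pixels are "the same" character, which this sketch skips entirely.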

ePub is what the Nook and Adobe Digital Editions use for books; it is basically a zip file with a bunch of HTML files inside. Most genuine e-book formats are really based around HTML; Mobipocket is just worse than the rest because it’s so old it was based on HTML 3.2, while most newer formats (like lit and ePub and the like) are based around HTML 4.0+, and include style sheet support and other handy features that Mobipocket lacks.
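You can see the "zip file with HTML inside" part for yourself with nothing but Python's standard `zipfile` module. The sketch below builds a tiny ePub-like archive in memory and lists the HTML members. Assumptions: the file names (`OEBPS/chapter1.xhtml` and so on) are just typical-looking examples, and the archive is deliberately incomplete (a real ePub also needs a package file describing the contents), so it would not open in a reader.

```python
import io
import zipfile


def build_epub_like_zip():
    """Build a tiny, incomplete ePub-style archive in memory (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # An ePub starts with a 'mimetype' entry naming the format.
        zf.writestr("mimetype", "application/epub+zip")
        # Placeholder for the container file that points at the package document.
        zf.writestr("META-INF/container.xml", "<container/>")
        # The actual book content is ordinary (X)HTML.
        zf.writestr(
            "OEBPS/chapter1.xhtml",
            "<html><body><p>Chapter one.</p></body></html>",
        )
    return buf.getvalue()


def list_html_members(archive_bytes):
    """Return the names of the HTML/XHTML files inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return [
            name for name in zf.namelist()
            if name.endswith((".html", ".xhtml", ".htm"))
        ]


epub_bytes = build_epub_like_zip()
print(list_html_members(epub_bytes))
```

For a DRM-free .epub on disk, passing the file path to `zipfile.ZipFile(...)` and calling `namelist()` should show the same kind of listing.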

The Kindle can read PDF files.

Well, technicalities abound.

Is “used exclusively” different from “proprietary”? Yeah, I guess so. From the user’s perspective, not much of a useful distinction though.

Which I’ve said twice.

When I was writing manuals, the output was PDF files, which we could distribute as they were, or send to the printers to get printed onto paper. I believe the latest versions of the software we used can export to various ebook formats as well.

Well, it means it’s not really any more proprietary than any other e-book format, since it means that AZW files are just standard Mobipocket files, which can A) be read on other devices that have Mobipocket Reader available even if the platform doesn’t have a Kindle Reader application, and B) be DRM-stripped and converted to HTML, ePub, PDF, DOC, or whatever other format you like. It also means that you can buy Mobipocket files from other sources (or check them out from the library if your library offers Mobipocket files, as many do) and read them on the Kindle or with the Kindle application.

Unless I missed something (totally possible), you said “programs are available to allow you to read a pdf on a Kindle” and “… you no longer have to go through a third party for a .pdf. The underlying problem still remains,” which … I guess I may not understand what you mean by, but both sound incorrect if I am indeed understanding you correctly. You don’t need to do any conversion or install any special programs for a Kindle to be able to read a PDF file, before or after transferring it to the device. The Kindle supports PDF natively.

You wouldn’t want to use a PDF file as a source to convert to another ebook format, because PDF files make terrible source files, which is what your quote about KDP refers to.

The distinction I also made is between publishing ebooks, which is the OP’s question, and being able to read material in electronic form.

As a writer who has self-published print books and then needed to publish them as ebooks, having pdf files be a terrible source is a nightmare. Self-publishing, whether through PoD or a short-run printer, and whether offset or copier, runs pretty much entirely on pdf files. (I’m sure there are exceptions but I’ve never dealt with any.) You get the best product. And you get to make all your own decisions about the book. Having to lose every piece of formatting that made a book especially readable and good-looking and needing to figure out how to redo all that work to make it readable in a reflowable manner is a slog through hell.

As a reader you may not need to know any of these details and there’s no good reason you should care. As a publisher it is a Giant Huge Enormous Issue and the current bane of my existence.

Part of the reason for my question in my OP is that I was wondering if it is the difficulty and/or the expense of making ebooks that affects a publisher’s decision to issue a title in ebook format. I’m still surprised how few (relatively speaking) ebooks are available. I can’t see anyone refusing another revenue source.

The problem is in your premise. It’s not the publisher’s decision.

It’s a matter of contract law. Until very recently, contracts did not have a clause that assigned rights and royalties to electronic books. Some publishers did try to issue ebooks anyway but authors went to court and stopped them because they had no right to; the authors retained that right. (Under copyright, authors control all rights to their work. Publishers essentially license certain rights for certain time periods by promising to pay money, either as an advance or in royalties, to the author.)

Authors and publishers have been having enormous battles over what it means to license ebooks (do they ever go out of print? how long are permissions good for? what formats are included? what happens when a new format comes out next year?) and what the proper payment is for these rights. Contracts are already many pages long to cover the multitude of contingencies and subrights that occur in the broad world of publishing. The holdup for most books is not just the technical issues - although up until astoundingly recently publishers insisted on print manuscripts and refused electronic submissions - but agreeing on mutually attractive contract provisions.

Here’s an example:

Publisher’s cap on library downloads begs question — when do e-books wear out?

Thank you **Exapno** and Sunspace. I really had no idea that these issues existed. (Some) ignorance fought.

ETA: So would I be right in thinking that most publishers and authors want their books to be in ebook formats, but they just can’t agree on the terms?