I called OUP-USA today, and they e-mailed me that they have identified the scanning error and will be fixing it. We’ll see.
I was making a couple of assumptions:
- This was about typos and bad OCR, not grammar or “proper” usage. I don’t think anyone would think “shitted” is an acceptable substitute for “shifted”. That’s not non-standard usage or an actual misconception about language that anyone has; it’s simply an obvious error. I can think of several situations where apparent errors are not really errors, but they are few and far between.
- I was also assuming that this would be tied to a device - that is, some form of Kindle. Thus, it would be easy to ignore submissions from any device detected to be spamming. While a zombie botnet of Kindles is theoretically possible, it’s very remote.
Note that I never said the editing would be done directly by crowdsourcing. The crowd merely finds the errors. That’s the hard part of fixing these kinds of errors, since in many cases it takes human-level intelligence actually reading the book to spot them.
I do admit that after thinking about it, it would be simpler just to hire some damn editors.
I have read just a couple of e-books, and they were full of typographical messes. Stuff that a spell checker would trivially catch. Run togetherwords (!) and such. And these are brand-new books, not scans of old texts.
I have heard that e-book publishers might be deliberately putting typos in books as a method of watermarking them. No two people would get a book with the same errors. So when the e-book gets posted on a file-sharing site, they know who bought the original.
Can anyone confirm whether this is happening?
(And it is absolutely disgusting that for modern print books they don’t do even the most basic typo scans.)
I noticed a BUNCH of these when I went through the Project Gutenberg Sherlock Holmes novels last year. The Deep Space Nine novels started off HORRENDOUSLY for the first book or two, but the last 10 I’ve read have been error free. It looks like this is a problem that’s going to go away, albeit very, VERY slowly…
Now that you mention it, I picked up a free copy of The Adventures of Sherlock Holmes when I first got my Kindle, and the most consistent problem I saw was that, somewhere in the chain of fonts and/or file formats (from the original OCR scanning to the conversions to various digital formats), the British “pound sterling” symbol got messed up. So every time an amount of money was mentioned there was a string of gibberish.
I fear that we’ll be stuck with this problem for a few more years, but that it will go away inside a decade (probably even sooner). Not because publishers will start doing anything to fix it, but because OCR software will get better, higher-resolution scans will get cheap, and maybe even some sort of A.I. editor will be developed.
I’m reading a book right now, and every capital N is rendered as ||.
I have been a freelance full-time proofreader for close to 30 years and can unequivocally state that over the last 2 years, copyediting has become atrocious. Mainly because editors now “electronically edit” on computer screens, “mechanical” screwups such as sentences missing periods at the end, duplicated words, and missing quotation marks in dialogue have become endemic.
I work for about a half dozen major publishing houses on a regular basis, and only one of these still seems to have a competent stable of editors. But as to a previous poster’s diagnosis that the problem is copyediting and proofreading work being increasingly farmed out to freelancers: this has been the case for decades now, and virtually no in-house proofreading has been done since the mid-1980s, or earlier.
Last week I did a “reflow” for a major publisher–which means an already published hardcover book is being redone as a paperback and/or foreign edition–and found approximately 350 typos (whereas, on average, such jobs a few years ago used to contain maybe a grand total of 2 dozen).
Not to mention stuff like (all of the following in this same book) “she went threw the door,” “Fineas and Ferb,” and a timeline at the start of every chapter that nobody bothered to check for consistency or sense, with the days of the week partly scrambled.
And mind you, all these errors are in the hardcover edition and ebook.
I think this kind of junk has been happening since the first ebook was published. :rolleyes: I bought my first PDA (a Handspring Visor, which tells you how long ago that was) about 2 weeks after I found out you could read books on them. I’ve dealt with so many scan errors that they hardly faze me anymore.
Apparently sometimes just converting the file from one format to another introduces errors too. And as others have pointed out, there apparently are NO proofreaders for ebooks. None.
I agree with Smeghead:
It should have been better by now.
Yes.
I have noticed it.
It is bad.
Can I get a job as a freelance proofreader/copyeditor please?
This should be easy to crowdsource. Kindles let you highlight passages and share them via Twitter; why not highlight errors, and share them on Twitter with a common hashtag like #kindletypo? Then either Amazon or the publisher could easily identify the errors, correct them, and push down corrections to the purchasers.
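Purely to illustrate the collation side of that idea (this isn’t any real Amazon or Twitter API), here’s a minimal Python sketch. It assumes a made-up tweet format of “#kindletypo &lt;book&gt; | &lt;passage&gt;” and simply counts how many readers flagged the same passage, on the theory that something reported more than once is probably a genuine error:

```python
# Toy aggregation of crowd-reported typos.  The "#kindletypo <book> | <passage>"
# tweet format and the sample data below are assumptions for illustration only;
# nothing here talks to a real Twitter or Amazon service.
from collections import Counter, defaultdict

sample_tweets = [
    "#kindletypo The Great Gatsby | he shitted in his seat",
    "#kindletypo The Great Gatsby | he shitted in his seat",
    "#kindletypo The Adventures of Sherlock Holmes | gibberish where a pound sign should be",
]

reports = defaultdict(Counter)
for tweet in sample_tweets:
    if not tweet.startswith("#kindletypo"):
        continue
    body = tweet[len("#kindletypo"):].strip()
    book, _, passage = body.partition(" | ")
    reports[book][passage] += 1

# Anything flagged by more than one reader goes to a human for a quick check.
for book, passages in reports.items():
    for passage, count in passages.items():
        if count > 1:
            print(f"{book}: {passage!r} flagged {count} times")
```

The point is just that once reports are tied to a book (and a device), de-duplicating and triaging them is trivial; the hard part is getting anyone to act on the resulting list.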
You assume that either Amazon or the publisher is interested in doing so.
I am not convinced this is true.
Some time ago I bought an ebook where literally half the book was missing. I complained to the publisher, and they actually emailed me back, admitted the fault, and promised that they would be supplying an updated version to the ebook site. More than six months later … no sign.
My SO is just in the process of educating herself in how to create ebooks. At no stage is there any scanning, as (in her case) they are all done from the same software files that the original books were printed from (or, more precisely, the InDesign files that the PDFs that go to the printer came from). It boggles my mind to see the above; do these software houses not have access to their original files?
Note that some of the above-mentioned OCR problems are also found in PDF -> text translations. So having a PDF file isn’t a foolproof solution (I can’t speak about InDesign files).
That’s only true if the PDF file was created from a bitmap, which should be your last choice. PDF isn’t causing the OCR problem, but it can’t fix it.
One of the best features of Adobe Acrobat is the ability to retain the text, unaltered, from text sources. The only problems you are likely to run into there are nonstandard characters (non-alpha, non-numeric), which can sometimes be misinterpreted if the font changes.
A sloppy editor will be sloppy regardless of the medium. Please don’t blame the tools. If anything, I feel that I have become more accurate since the advent of onscreen editing. I can let automation take care of the fiddly stuff and reserve my brain cells for the stuff that needs a human touch.
For example, wildcard search and replace in Word is a powerful tool – for the editor who knows what she’s doing. Ditto templates, macros, and plug-ins. In the wrong hands, they can be disastrous.
As for freelancers being poorly paid, I do all right. Plus I feel MORE secure than if I had a Real Job because I can’t be fired or let go. I can lose clients, by either their choice or mine, but I can also get new ones. When my husband got downsized in 2001, I became the new breadwinner. He finally has a fairly decent-paying job, but I still make more than he does, and will for the foreseeable future.
Freelancing isn’t for everyone, but for some of us it’s the cat’s nightwear.
Actually, there are a lot of issues recovering text from PDF files even when generated directly from word processors (that, really, ought to know how to do the job properly). Ligatures, Unicode characters from non-English languages, etc., can all be quite difficult to recover. Legitimate punctuation as well - do you have any idea how many ways there are to make a hyphen-like character in Unicode?
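To put a rough number on that, here’s a quick one-off using Python’s standard unicodedata module (restricting to the Basic Multilingual Plane is just my arbitrary cutoff); it lists every code point whose official name contains HYPHEN or DASH:

```python
# Enumerate code points in the Basic Multilingual Plane whose Unicode name
# contains "HYPHEN" or "DASH" -- each is a candidate for the "hyphen" a
# PDF-to-text conversion might hand you instead of a plain ASCII "-".
import unicodedata

hyphen_like = []
for cp in range(0x10000):                      # Basic Multilingual Plane only
    name = unicodedata.name(chr(cp), "")
    if "HYPHEN" in name or "DASH" in name:
        hyphen_like.append((f"U+{cp:04X}", name))

for code, name in hyphen_like:
    print(code, name)
print(len(hyphen_like), "hyphen- or dash-like characters found")
```

You get a couple dozen hits (HYPHEN-MINUS, SOFT HYPHEN, NON-BREAKING HYPHEN, EN DASH, EM DASH, FIGURE DASH, and so on), any of which can survive the trip out of a PDF looking more or less like a plain hyphen.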
Part of the difficulty is that many fonts are protected by copyright, with the terms of use being that you cannot include in a PDF document any characters that are not actually used in the document. So programs that generate PDF are supposed to “subset” the font. This completely scrambles the ordering and numerical assignments to the characters, e.g., the character associated with the ASCII code for ‘A’ might be a semicolon. The programs are supposed to provide a map back to the conventional ASCII characters, but not all of them do. And some of those characters don’t HAVE an ASCII counterpart.
For fun and chuckles, try copying and pasting from a variety of PDF documents into a simple editor like NotePad. It won’t take much effort before you find documents that, in places, yield absolute gibberish. [Extra credit fun: view the metadata for the document and see what nonsense, if anything, they bothered putting in for the title and author info.]
Musicat is right that PDF has features that allow you to retain the text. But those features get used, in my experience, only slightly more often than they get abused.
Unless the original book was typeset in an electronic format that was preserved and that allows accurate conversion to HTML (because both the Kindle/Mobi and EPUB formats convey the main text in HTML), there’s no cheap AND accurate way to republish something as an e-book.
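If you want to see that for yourself: an EPUB is just a ZIP archive of XHTML content documents plus some metadata, so Python’s standard zipfile module is enough to peek inside one. A minimal sketch (“book.epub” is a placeholder filename):

```python
# List the (X)HTML content documents inside an EPUB.  An EPUB is a ZIP
# archive, so no special e-book library is needed; "book.epub" is a
# placeholder path for whatever file you actually have.
import zipfile

with zipfile.ZipFile("book.epub") as epub:
    for name in epub.namelist():
        if name.endswith((".xhtml", ".html", ".htm")):
            print(name)
```

Each of those files is the HTML the reader actually renders, which is why a clean source-to-HTML conversion matters far more than the container format does.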
Funny. I bought Gatsby from Barnes and Noble for my Nook several months ago, and found maybe two errors. Why so much difference?
jjimm’s in the UK, right? I believe The Great Gatsby is in the public domain there, but not here in the USA, so he may have gotten a different edition.
But the PDFs are what the printers print from directly. No translation required.