Quote mark coding - why quote marks are changed into different characters

You see this a lot on the internet, in E-mail and news stories, etc. Double quote marks aren’t always displayed right when text is reproduced on a different application from the one it was created on. Sometimes the quote marks are changed into garbage characters.

Obviously the coding used for the quote mark characters is not uniform across applications, and is not universally accepted ASCII. I’ve seen some online newspapers use a double `` mark instead of the quote mark to avoid this screwup.

Microsoft Word has an option called “Smart Quotes” which substitutes curly quote marks for straight ones automatically. I suspect this is the cause of many of the character snafus. The coding used is not recognized by other applications. You would think that Unicode would solve this problem, but obviously not everyone is Unicode-compliant yet. My question is: How did quote marks happen to be left out of the ASCII set? Why do so many programs get the coding wrong? Is there any reliable solution while waiting for full Unicode implementation?

But quote marks were not left out of the ASCII set - they’re ASCII 34. However, in HTML, quote marks have a special meaning (within an HTML tag anyway), so they’re supposed to be indicated by the entity &quot;. I agree that those ugly curly quotes are likely the cause of the screwups. I particularly hate reading text with the double-backtick used as ``quote marks." Every time I come across that, my brain stumbles on it and I have to slow down while it parses what was supposed to be written.

There isn’t much room in standard ASCII for lots of symbols. In fact there are only 95 slots assigned for printable characters. A pretty detailed history of ASCII, almost character by character, can be found here.

So instead of separate left and right quote marks, we got a generic “left double quote / right double quote / inches / kind-of-umlaut” symbol that serves all these purposes equally well, and equally poorly. The code for this generic symbol is 34.

The next character code standard that was widely adopted, at least in the West, was ISO Latin-1. Its reign was short-lived, but it is an 8-bit standard as opposed to ASCII’s 7 bits, and therefore had an additional 128 codes to represent characters. Nevertheless, proper double quotes (or single quotes for that matter) were not among the characters represented in the extended range.

However, Microsoft then defined their own superset of ISO Latin-1 (a standard often called Windows Latin 1, or code page 1252), taking advantage of the fact that Latin-1 had reserved codes 128 through 159 for a second set of control characters, mirroring the ASCII control characters in the range 0-31. Microsoft chose to use this range for additional printable characters instead, among them curly quotes of all types. The curly double quotes are at codes 147 and 148.

Meanwhile, or several years earlier actually, the Mac had already issued its own 8-bit character standard, now commonly called MacRoman. It defines curly double quotes at codes 210 and 211.

Nowadays we have Unicode available on most machines, a standard that defines almost every character that mankind ever blemished a piece of paper with. Or, at least it pretty well nails down everything used in the world’s alphabetic languages. The curly quotes are in there somewhere, though I’m too lazy to look up the numbers just now. It shouldn’t surprise you that the curly quotes are encoded by yet another pair of numbers, different from all those above.
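For anyone who would rather check the numbers than take my word for it, here is a minimal Python sketch that looks up the codes for the curly double quotes. It assumes Python's codec names "cp1252" (Windows Latin-1) and "mac_roman" (MacRoman); the helper name quote_codes is just something I made up for illustration. In Unicode the pair sits at U+201C and U+201D.

```python
# Look up the numeric code for each curly double quote in several encodings.
# "cp1252" and "mac_roman" are Python's codec names for Windows Latin-1
# and MacRoman respectively.
LEFT, RIGHT = "\u201c", "\u201d"   # Unicode code points U+201C and U+201D


def quote_codes(encoding):
    """Return the single-byte codes for the left and right curly double quote."""
    return LEFT.encode(encoding)[0], RIGHT.encode(encoding)[0]


for name in ("cp1252", "mac_roman"):
    print(name, quote_codes(name))

# Unicode code points, and the plain ASCII straight quote for comparison.
print("unicode", ord(LEFT), ord(RIGHT))
print("ascii straight quote", ord('"'))
```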

Now combine all this history with the fact that computer text often does not specify a character encoding — because the software producing it or transmitting it assumes the whole world goes by Windows Latin-1, for example — and the fact that the displaying software might make similar parochial assumptions, or might not recognize the encoding specification even if it’s present, and you can understand where the chaos comes from. Both the producer and consumer of the text need to agree on the character encoding used, otherwise you’ll see what look like bizarre typographic errors. This is a very general problem of course. It doesn’t just affect the display of curly double quotes.
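To make the disagreement concrete, here is a small Python sketch of one producer and two consumers. It is only an illustration: the sample text is made up, and the codec names ("cp1252", "mac_roman", "latin-1") are Python's names for the encodings discussed above.

```python
# Curly-quoted text as the author typed it.
text = "\u201cHello\u201d"

# The producer saves it as Windows Latin-1 (Python codec name: cp1252)
# without saying so anywhere.
raw = text.encode("cp1252")

# Two consumers each guess a different encoding for the very same bytes.
as_mac = raw.decode("mac_roman")    # quotes turn into accented letters
as_latin1 = raw.decode("latin-1")   # quotes turn into invisible control codes

print(repr(as_mac))
print(repr(as_latin1))

# Only a consumer that agrees with the producer gets the original back.
assert raw.decode("cp1252") == text
```

Nothing is "wrong" with the bytes in either case. The garbage appears purely because the reader interprets them with a different table than the writer used.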

Unicode hasn’t conquered the world quite yet. There are still several popular but mutually incompatible character encodings in use, many legacy files and documents, and many software titles that weren’t written with different standards in mind. And I couldn’t tell you how long it will take for all this to shake out.

Forgot to mention…

If your machine is a breed of Unix, you might find the Recode utility handy for translating text between various character encodings.
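If you don't have such a tool handy, the same kind of translation can be sketched in a few lines of Python. This is not Recode's interface, just an assumed stand-in: the function name transcode and the sample bytes are made up for illustration.

```python
def transcode(data, source, target):
    """Reinterpret bytes written in one encoding as bytes in another.

    Decoding with the source table gives the intended characters;
    encoding with the target table writes them out again.
    """
    return data.decode(source).encode(target)


# The bytes a Windows program would write for a curly-quoted word.
windows_bytes = b"\x93dumb\x94"

utf8_bytes = transcode(windows_bytes, "cp1252", "utf-8")
mac_bytes = transcode(windows_bytes, "cp1252", "mac_roman")

print(utf8_bytes)
print(mac_bytes)
```

This only works if you know (or correctly guess) the source encoding. Also, a character with no equivalent in the target encoding makes the encode step raise a UnicodeEncodeError rather than silently producing garbage.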

But doesn’t the generic full-quote character predate ASCII? I mean, typewriters used it in the olden days.

According to the page I cited above, the generic double-quote character at code 34 has been around since ASCII–1963. The same character was in the earlier FIELDATA standard(s) as well, though with a different code value.

I’d certainly have no trouble believing that the ASCII designers copied the idea for this character from typewriters, though I haven’t found a source specifically saying that. This same frugality with character codes (and typewriter keys) is also what probably explains the schizophrenic “apostrophe / single quote” character at position 39.

If you’re asking if they specifically copied the typewriter keyboard, I can’t offer any proof, but is it truly necessary to prove that a new device intentionally borrowed an element that was standard in its direct antecedent technology? If you want proof that manual typewriters used a doublequote symbol, then I offer this high resolution shot of a classic 1920 Underwood #5. It’s far from the oldest example I found, but it’s the largest, clearest picture, and the Underwood name should be familiar even today. The double quote over the number “2” is unmistakeable to the most screen-weary eyes.

Better proof can be found in almost any typewritten document at NARA (National Archives and Records Administration). I’ve seen documents using the double quote going back to the 1800s. The typewriter (as its name suggests) borrowed its standards from the far older typesetting technology that goes back to Gutenberg. Sadly, the Gutenberg Bible, being in Latin, doesn’t use English punctuation conventions, else the famous “Fiat lux” (“Let there be light”) of Genesis 1:3 in the Gutenberg would have been a pretty fair contender for the “first mechanical double quote in Western Civilization”.

Okay, maybe it’s not sad that the verse didn’t use quotes. What’s sad is that I actually checked. I should have known better on so many levels.

On re-reading, I realized that my link text didn’t make clear that the “Genesis” link goes to a reasonable resolution image of the relevant page in the Gutenberg Bible. That image might be of interest to Dopers, independently of this discussion.

Bytegeist, you covered quite well the answers I was looking for, thanks. At work this issue has recently come up and I was asked to explain it to everyone. Some of the text that we prepare goes other places to be opened by antiquated applications that goof up characters which are not part of basic ASCII. Everyone uses Microsoft Word to edit their text before copying and pasting into the database. My suggestion was to turn off the “Smart Quotes” in Options. I E-mailed my fellow workers: Usually smart is preferable, but in this one case, dumb is better than “smart.”

So if we use Word to edit our plaintext, but use “dumb” quotes, will they be coded as ASCII 34 and therefore cause no problems across legacy applications?

Yes, that should work. However, I’m an infrequent (and always unwilling) user of Microsoft Word, so I’m often caught off guard by its strange and wily ways. You might want to test the process once to check.

For example, create a simple Word document that uses double quotes, but with Smart Quotes disabled like you say. Save the document as a plain text file. Now start up a DOS window and go to where the file is. Use the TYPE command to display the file’s content in the window. The quotes should show up as plain vanilla, no nonsense, “dumb” ASCII 34 quotes.
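If you'd rather not trust your eyes, a check like this can be automated. Below is a minimal Python sketch, assuming the text has already been read into a string; the names DUMB, dumb_down and non_ascii are made up for illustration. It flags anything outside 7-bit ASCII and, as a fallback, swaps any stray curly quotes for their straight equivalents.

```python
# Map the curly quotes to their plain ASCII stand-ins.
DUMB = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> ASCII 34
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> ASCII 39
})


def dumb_down(text):
    """Replace curly quotes with straight ASCII quotes."""
    return text.translate(DUMB)


def non_ascii(text):
    """List (position, character) pairs for anything outside 7-bit ASCII."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


sample = "Dumb is better than \u201csmart.\u201d"
print(non_ascii(sample))              # positions of the offending characters
print(dumb_down(sample))              # same text with straight quotes
print(non_ascii(dumb_down(sample)))   # empty list means pure ASCII
```

Note that this only handles the four quote characters. Smart Quotes usually travel with other Word substitutions (long dashes, ellipses and so on), which non_ascii would still report but dumb_down would leave alone.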

Hope this helps.