Why do some places on the internet replace apostrophes with odd characters?

I have seen this on and off, but I’m not sure what is going on.

Someone will attempt to post a word such as “railroad’s” and it comes out “railroadâ€™s”, consistently replacing the apostrophe with “â€™”.

What causes this? Why are some writings affected and not others? And why these characters?

Is this on the writer’s end, or the internet site hosting the text, or my end? I’ve seen it on message boards as well as “regular” websites.

Oooh. Good question!

Yep. I haven’t a clue.

I believe that this is caused when you use smart quotes, i.e., the curly or slanted quote marks and apostrophes that Word automatically uses, and the font that’s rendering the page you see does not support them.

Yes, the problem is when the application you’re creating the words in replaces the neutral ASCII quotes or apostrophes with directional quotes or apostrophes. And then when someone opens the text in an application or browser that doesn’t support the directional quote, it displays a mess instead.

This is why directional apostrophes are pure evil. Stop creating them, for crying out loud.

It’s not about the font, it’s about encoding.

If you’re using something shitty like MS Word to write web pages, the software is fond of auto-replacing the ' character with curly quotes that are encoded as a different value. The ASCII apostrophe has a value of 39. Curly apostrophes do not appear in the 7-bit ASCII character set, nor in the 8-bit ISO-8859-1 character set, which is a superset of ASCII. They do appear in the 8-bit Windows-1252 character set, which perversely mirrors ISO-8859-1 except that it replaces a bunch of control codes with things like the curly quotes and the Euro sign.
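For what it’s worth, here is a minimal Python sketch of the character-set claims above. It relies on Python’s built-in codecs, where `cp1252` is Python’s name for Windows-1252; the helper `can_encode` is just a made-up name for illustration.

```python
ASCII_APOSTROPHE = "'"        # U+0027, the plain typewriter apostrophe
CURLY_APOSTROPHE = "\u2019"   # U+2019, right single quotation mark


def can_encode(ch: str, encoding: str) -> bool:
    """Return True if the character exists in the given character set."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(ord(ASCII_APOSTROPHE))  # 39

# The curly apostrophe is missing from ASCII and ISO-8859-1,
# but present in Windows-1252.
for enc in ("ascii", "iso-8859-1", "cp1252"):
    print(enc, can_encode(CURLY_APOSTROPHE, enc))

# In Windows-1252 the curly apostrophe and the Euro sign occupy byte
# values that ISO-8859-1 reserves for control codes.
print(CURLY_APOSTROPHE.encode("cp1252"))  # b'\x92'
print("\u20ac".encode("cp1252"))          # b'\x80'
```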

So, why the string â€™ specifically?

If you take the right curly single quotation-mark character, which in Unicode exists at the codepoint U+2019, and you encode it in UTF-8, you get the bytes 226, 128, and 153. If those bytes are then mistakenly read as Windows-1252, they correspond to the Windows-1252 characters â€™.

There are a few ways to encode an apostrophe in Unicode. The traditional one is code point U+0027, directly equivalent to the ASCII apostrophe.

What you’re seeing is U+2019, Right single quotation mark. If you look at that page, you can see that its encoding in UTF-8 is 0xE2 0x80 0x99.

It so happens that, in Windows-1252:
byte 0xE2 = â
byte 0x80 = €
byte 0x99 = ™
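If anyone wants to see it happen, this short Python sketch reproduces the garbling: it encodes U+2019 as UTF-8 and then decodes those same bytes as Windows-1252 (`cp1252` in Python). The variable names are only for illustration.

```python
curly = "\u2019"  # right single quotation mark

# Encode as UTF-8: three bytes.
utf8_bytes = curly.encode("utf-8")
print(list(utf8_bytes))               # [226, 128, 153]
print([hex(b) for b in utf8_bytes])   # ['0xe2', '0x80', '0x99']

# Now read those same three bytes as if they were Windows-1252:
# each byte becomes its own character.
garbled = utf8_bytes.decode("cp1252")
print(garbled)                        # â€™

# The same thing applied to a whole word.
word = "railroad\u2019s"
garbled_word = word.encode("utf-8").decode("cp1252")
print(garbled_word)                   # railroadâ€™s
```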

As for where the error came from, it’s probably the author of the Web page who didn’t encode things correctly. Possibly they forgot to check the encoding when exporting the text from their text editor to HTML.

No. Typography shouldn’t bow to temporary technical limitations. Character encoding standards are here, they’re widely-implemented, and they’re not going away.

You’re naïve if you imagine that a small number of 20ᵗʰ Century reactionaries will be able to keep computerized text in the typewriter age. ¡Viva el estándar Unicode!

Thanks for the quick explanation.

So it would seem that when you see this it is because the original text was composed in MS Word and moved over without checking the encoding?

This particular mis-encoding happens when the curly-quote is encoded as UTF-8, but served by something that thinks it’s in Windows-1252.
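Since this particular mix-up is just UTF-8 bytes read as Windows-1252, it can often be reversed by going the other way. Here is a rough Python sketch; `repair_mojibake` is a hypothetical helper name, and it assumes the text was garbled exactly once and only in this way. Anything else is returned untouched.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mis-decoded as Windows-1252.

    Assumption: the text was garbled exactly once in this specific way.
    If the round trip fails, the input is returned unchanged.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


garbled = "railroad\u00e2\u20ac\u2122s"   # how the garbled word shows up
print(repair_mojibake(garbled))            # railroad’s

# Text that was never garbled comes back as-is.
print(repair_mojibake("railroad\u2019s"))  # railroad’s
print(repair_mojibake("railroad's"))       # railroad's
```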

I concur if you are sending me stuff to put on the website and I ask for a text file and you decide to “help me out” by sending me a Word doc.

To answer the OP, this is probably what happened and the web manager didn’t catch and change it.

Is this the kind of thing that can be fixed by declaring an encoding in your HTML like this article describes?

It depends. The problem might be caused by the web server sending the wrong charset in its Content-Type header. Declaring the encoding in the HTML might help if the server sends no charset at all, but a charset in the HTTP header generally takes precedence over one declared in the HTML, so the better solution is to fix the server config. But it can also be caused by something earlier in the chain server-side, like a misconfigured database, which is sending out stuff in the wrong encoding, or badly-written server-side code, which is failing to properly decode stuff from the filesystem or database.
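To illustrate the header case, here is a small Python sketch of a client decoding the same response body according to whatever charset the Content-Type header declares. The function `charset_from_content_type` and its `default` parameter are made up for this example, and real browsers follow more involved rules than this.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Pull the charset parameter out of a Content-Type header value.

    The fallback `default` is an assumption made for this sketch only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


# The body really is UTF-8 in both cases; only the declared charset differs.
body = "railroad\u2019s".encode("utf-8")

right = body.decode(charset_from_content_type("text/html; charset=utf-8"))
wrong = body.decode(charset_from_content_type("text/html; charset=windows-1252"))

print(right)  # railroad’s
print(wrong)  # railroadâ€™s
```

The bytes on the wire are identical in both cases. Only the label changed, which is why fixing the server config cures it.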

Sometimes encoding problems can be notoriously difficult to track down.

My question in all this is why have we been seeing this on the web for like 15 years? Can’t the browsers or someone make it so that it works right?
Is this an unfixable problem?

Hear! Hear! Bravo.

Guessing at character encodings is far more problematic than just rendering the crap that you’re given. Every shitty browser bug in history has been caused by trying to use flawed heuristics to guess what the author “really means.”
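A quick Python sketch of why guessing is hard: the very same bytes decode without any error under several different encodings, so validity alone does not tell you which one the author meant. The list of encodings here is an arbitrary pick for illustration.

```python
# UTF-8 bytes for a word containing a curly apostrophe.
data = "railroad\u2019s".encode("utf-8")

candidates = ("utf-8", "cp1252", "iso-8859-1", "cp437")
decoded = {}
for enc in candidates:
    try:
        decoded[enc] = data.decode(enc)
    except UnicodeDecodeError:
        # Would mean the bytes are invalid in that encoding.
        pass

# Every candidate "succeeds", and each one yields different text.
for enc, text in decoded.items():
    print(enc, repr(text))

print(len(decoded), "encodings accepted the same bytes")
print(len(set(decoded.values())), "different readings")
```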

So, this is why when I use a shift 7 I get this & in the posted post but shows as a normal and symbol in the post reply window?

Or is it because I don’t use IE or Chrome or ______ ???

It looks normal when I type in while composing the post but displays incorrectly in the actual post.

Huh??? :confused:

That is a normal “and” symbol.

Gus: What do you mean by “It looks normal when I type in while composing the post but displays incorrectly in the actual post.”?

As far as I can see the symbol in your post is an ampersand. There are many shapes of ampersand, just like there are many shapes of the letter “A”. Each font draws the letters and symbols as slightly different shapes.

The font used inside the text input box and the font used to display posts are different on the SDMB. Which is a pretty common thing to see in web forms.

The fact the ampersand is a different shape has nothing to do with the issue that the OP is asking about.

& is in ASCII. Every Internet-connected Web-browsing computer on Earth agrees what byte to interpret as the & character. This is down to either a font difference or incipient insanity on your part, and right now, your description of what’s confusing you isn’t enough to rule out either.

I can explain fonts. Nobody can fully explain insanity.
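For the record, the point about & being in ASCII is easy to check. This Python sketch encodes the ampersand in a handful of common encodings (an arbitrary selection) and shows that they all produce the same single byte, so any difference on screen comes down to the font.

```python
amp = "&"
encodings = ("ascii", "iso-8859-1", "cp1252", "utf-8")

# Encode the ampersand in each encoding and compare the bytes.
encoded = {enc: amp.encode(enc) for enc in encodings}
for enc, raw in encoded.items():
    print(enc, list(raw))  # each line shows [38]

all_same = len(set(encoded.values())) == 1
print(all_same)  # True
```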