It’s unfixable only in the sense there are a lot of amateurs being paid to operate supposedly-professional websites. And there are a lot of homebrew content creation and publishing processes used even in sophisticated companies.
You’ll keep seeing this at any given website until the content creators and the website operators of that site A) are made aware of the issue, B) give a hoot and C) are given the resources (time and software) to make correcting this stuff simple enough (or automatic enough) that it actually gets done every time.
You’ll keep seeing it across the 'net at large until *every* content creator and *every* website operator makes the changes above.
This is not technically hard. It’s only administratively hard for sloppy workers in sloppy processes using long-obsolete tools. And no, Word is not the problem. Neither are so-called smart quotes. This is totally a result of 2005 attitudes still in place in 2015.
2005? Try 1985. We’d solved this problem by the time the Web was really hitting the big time. UTF-8, which is the only encoding anyone should be using on the Internet, existed in 1992, and by 2008 Google was reporting that it was the most-used encoding on the Web. If you use UTF-8 consistently and correctly, none of this happens. None of this is a problem. Your concerns about character encodings are gone, because the only one that matters in this context is UTF-8.
Other people in other contexts have to deal with other encodings. Those encodings aren’t relevant to Web sites.
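For what it’s worth, “using UTF-8 consistently and correctly” mostly comes down to declaring it at every layer and actually encoding the bytes that way. A minimal sketch, assuming nothing beyond Python’s standard library (and certainly not any particular site’s stack):

```python
# Minimal sketch (standard library only): serve a page where the HTTP header,
# the <meta> tag, and the actual bytes on the wire all agree on UTF-8.
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>naïve rôle, ½ price, £5 or ¥700</title></head>
<body><p>Curly quotes “just work” when every layer agrees on the encoding.</p></body>
</html>"""

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")                      # the bytes really are UTF-8
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Utf8Handler).serve_forever()
```

Get those layers out of sync and you’re right back to garbage characters.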
The ampersand symbol in the font used to compose messages for the SDMB looks different than it does in the font that displays posts. One looks like a classic ampersand, the other more like the ligature for “Et”.
We should be glad the ampersand & character works right in posts at all, and likewise a certain few other characters like less-than < and greater-than > (the semi-colon ; gets dragged in too, as the terminator of the escapes).
These are all super-duper special fragile snowflake characters in HTML, and they require special encoding to make them appear at all. When you enter the single character & into a post, the back-end software must translate that to:
&amp;
(that is, ampersand followed by the letters amp followed by semicolon) in the HTML file it sends to your browser.
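In Python terms (purely as an illustration; I have no idea what the board’s back end actually runs), that translation is what html.escape does:

```python
import html

print(html.escape("AT&T <rocks>"))   # AT&amp;T &lt;rocks&gt;
print(html.escape('say "hi"'))       # say &quot;hi&quot;  (quotes too, by default)
```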
Web programming gets messy because there are several separate groups of translations like that which must be made, and they have to be done in the right order too. If your site does any work with databases too (as a message board obviously does), there’s yet another set of character translations that needs to be done to get text into or out of a database without trashing the text (or worse, trashing the entire database).
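Here’s a sketch of that ordering point, using Python and sqlite3 purely for illustration (the table and column names are made up): let the database driver do its own escaping via a parameterized query, and do the HTML escaping exactly once, on the way out to the page.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (body TEXT)")

raw = 'Tom & Jerry said "<hello>"'

# Into the database: no HTML escaping yet, and no hand-built SQL strings.
# The placeholder keeps the quotes in the text from trashing the SQL.
conn.execute("INSERT INTO posts (body) VALUES (?)", (raw,))

# Out of the database and into the page: escape once, at output time.
(stored,) = conn.execute("SELECT body FROM posts").fetchone()
print("<p>{}</p>".format(html.escape(stored)))
# <p>Tom &amp; Jerry said &quot;&lt;hello&gt;&quot;</p>
```

Do those steps in the wrong order (or do one of them twice) and you get the visible &amp;amp; litter that shows up on sloppy sites.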
The problem was actually solved in the early 1970’s or earlier, when all text was plain-old plain-ASCII 7-bit text with a grand total of 94-or-so characters (capital letters, lower-case letters (English only, of course), ten digits, and an odd assortment of punctuation marks).
The encoding problem got unsolved when people started using 8-bit ASCII and adding non-standard extra characters, and got even more unsolved when people started using all kinds of extended character sets, and when (gasp!) people outside the English-speaking United States discovered computers too.
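You can see the “unsolved” state in about two lines: encode text one way, then read it back under a different assumption. A quick sketch in Python:

```python
text = "naïve rôle at ½ off"

data = text.encode("utf-8")      # written out correctly as UTF-8 bytes...

print(data.decode("latin-1"))    # ...read back as ISO-8859-1: naÃ¯ve rÃ´le at Â½ off
```

That Ã-and-Â confetti is exactly the kind of garbage this thread is about.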
Going back even further in history, the problem was solved the day that Adam named all the creatures, but got unsolved at Gen. 11 (Tower of Babel story). It’s all been downhill from there.
This only solved the problem if you only cared about English without good typography, where fractions looked like 1/2 instead of ½, and correctly spelling certain words such as “rôle” and “naïve” wasn’t a concern for you. And I won’t even mention the inanity that was hand-lettered mathematical formulae in otherwise-typeset works because you’re likely old enough to remember stuff like that.
English typography and orthography had already been simplified due to the typewriter and computers inherited that. However, computers are inherently more flexible than typewriters, and so don’t need to be constrained to what will fit on a platen.
… and when people inside the English-speaking United States decided that not being able to have visually-appealing quotation marks or even currency symbols other than the dollar sign was kinda stupid, given that they were now using machines which could typeset reams of documentation in fractions of a second. You want the manuals for Seventh Edition Unix in your hand, replete with bold and underline and justification? No problem. You want to print out the international shipping cost in £ or ¥ without hand-writing it or resorting to some ugly work-around? Go fly a kite.
And to this day, that story is one of the first things to be translated into any respectable constructed language. (Tolkien’s constructed languages predate this, I’m pretty sure.)
Whoa for a sec. I use UTF-8 in my webpages and was able to use Chinese ideograms for the Mandarin teacher. Fancy formatting can be had, but 90% or more of “web programmers” can’t handle anything not done in Dreamweaver or, worse yet, WordPress.
I agree we had a technological solution in 1992: UTF-8. Before UTF-8 we had progress towards a technological solution. But as of today, the actual problem isn’t solved yet; the OP is still seeing evidence of that.
In fact I originally wrote 1995 then changed it to 2005. 2005 is about when web companies got serious about actually using, not just talking about, UTF-8. You’re right that by 2008 it was the largest single encoding. IOW, by then UTF-8 usage had achieved plurality, but not yet majority. It certainly hasn’t achieved universality yet.
tldr: IMO, the problem we all identified and I was discussing is actual usage in the wild. And that problem will be solved when the last ISO-8859-1 or plain ASCII or whatever website goes off the air never to return. Which IMO, will be some time after we disable all support for IPv4. IOW, never in internet years.
UTF-8 does allow this, yes. It’s one of the big reasons to use UTF-8: Every living language you’ll have to work with can be handled using that one encoding. All of them. And the number of them is growing, but UTF-8 remains the same, so your code will still work and your pages will still render correctly years or decades from now, as long as you’re using UTF-8 correctly.
Back in the Ancient Times, before Unicode, each language could potentially need its own character encoding, and therefore it was impossible to have some mix of languages in the same document. For example, Russian and German couldn’t coexist: Russian needs the Cyrillic alphabet, whereas German needs characters such as ß and ö, and back in the Old Days there wasn’t any character encoding scheme which had both. Bizarre hacks, such as transliteration and using image files full of text, were used, but none of them were acceptable.
(There was at least one standard which theoretically allowed documents to shift between a restricted mix of character sets mid-stream in a standardized fashion (and a few non-standard ways to accomplish something similar), but to the best of my knowledge it wasn’t relevant or implemented even when it was still theoretically useful.)
Now, with Unicode to define the character set and UTF-8 to encode it acceptably, documents can have Chinese and Japanese and Russian and Greek and English all at the same time, efficiently and without resort to bizarre hacks.
There are some odd edge cases, and new characters get added to Unicode (but never removed!), but as far as you’re concerned, the problem is solved: Use UTF-8.
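A tiny sketch of the “all at the same time” claim, round-tripping a mixed-script string through UTF-8 (Python, standard library only):

```python
# One string, five scripts, one encoding.
mixed = "Hello Здравствуйте Γειά σου こんにちは 你好"

data = mixed.encode("utf-8")          # every character fits in the one encoding
assert data.decode("utf-8") == mixed  # and the round trip is lossless

print(len(mixed), "characters,", len(data), "bytes")
```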
I still love the fact that the default collation for MySQL is latin1_swedish_ci. A real wake-up call for English-speakers who think the world is asciibetical.
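If anyone wants to opt out of that default, here’s a hedged sketch (it assumes the third-party PyMySQL driver and made-up connection details; utf8mb4 is MySQL’s name for real four-byte UTF-8, since its plain utf8 is a three-byte subset):

```python
import pymysql  # third-party driver, used here only for illustration

conn = pymysql.connect(host="localhost", user="board", password="secret",
                       database="sdmb", charset="utf8mb4")
with conn.cursor() as cur:
    # Ask for UTF-8 storage and a Unicode-aware collation instead of
    # inheriting latin1 / latin1_swedish_ci from an old server default.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS posts (
            body TEXT
        ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
    """)
conn.commit()
```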