Why do some places on the internet replace apostrophes with odd characters?

Derleth · May 28, 2015, 12:51am

Collation is inherently more complicated in Unicode than it is in ASCII. Unicode is closer to how complex it is in reality, once you move beyond only caring about a subset of English-language text.

Another thing which becomes more complicated is letter case: For example, what is the upper-case version of ß? Ask any German speaker, and they’ll tell you it’s SS. Two letters. However, the letter pair ss does occur in German text, so there’s no way to always know what the lower-case version of SS is. If you’re trying to do it with a simple algorithm, you lose. The real world doesn’t work like that.
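To make the ß example concrete, here is a minimal sketch using Python's built-in string methods, which apply Unicode's default (locale-independent) case mappings. The word "Straße" is just an illustrative choice, not something from the post above.

```python
# Unicode default case mappings via Python's built-in str methods.
upper_eszett = "ß".upper()             # one letter becomes two: "SS"
lower_ss = "SS".lower()                # "ss": no way to tell it was ever "ß"
round_trip = "Straße".upper().lower()  # "strasse", not "Straße"

# The rare capital form lowercases to ß, but ß does not uppercase to it.
capital_eszett_lower = "\u1E9E".lower()  # "ß"

# For caseless comparison, casefold() is the tool meant for the job.
same_word = "Straße".casefold() == "STRASSE".casefold()

print(upper_eszett, lower_ss, round_trip, capital_eszett_lower, same_word)
```

The round trip loses information, which is exactly why a simple upper/lower algorithm can't be inverted.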

(Interestingly, there’s a very rare character which is a single-letter capital ß: ẞ. It does occur in German text. It was not invented for Unicode. It is vanishingly rare, and essentially never occurs in modern German text.)

The upshot is, nontrivial text processing needs to be done using specialized libraries. This has, really, always been the case. It’s just that now multilingual text is technically feasible, which makes it impossible to ignore some things the ASCII-only world let us gloss over.

friedo · May 28, 2015, 12:55am

The old roundtrip casefolding myth. It has led to much heartbreak. And did you know Unicode also has titlecase?
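For anyone who hasn't run into titlecase: a rough sketch in Python, using the digraph character U+01C6 as the example (my choice of character, picked because it has three distinct case forms).

```python
import unicodedata

dz_lower = "\u01C6"          # ǆ  LATIN SMALL LETTER DZ WITH CARON
dz_upper = dz_lower.upper()  # Ǆ  U+01C4, both halves capitalized
dz_title = dz_lower.title()  # ǅ  U+01C5, only the first half capitalized

# General category "Lt" is "Letter, titlecase".
title_category = unicodedata.category(dz_title)

print(dz_lower, dz_upper, dz_title, title_category)
```

So "upper" and "lower" aren't even the only two cases you have to think about.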

Unicode is hard. Really hard. Yet people still think they can handle text with cheap hacks from the punchcard days.

Derleth · May 28, 2015, 1:25am

The only round-trip Unicode cares about is Unicode to other character encoding to Unicode: The text must be the same out the back-end as it was on the front-end. This has led to Unicode preserving oddities and arguable mistakes from decades ago.

Yep. And the only way to justify such nonsense is the idiotic fact real languages have it.

In a way, I blame ASCII for having been designed so well: So many hacks work in ASCII: Want to go from a numeral to a number? Subtract ‘0’. Want to sort alphabetically? Sort numerically by character code. Want to flip case? Flip a bit. Easy-peasy, and you get enough of English to convince people that characters not in ASCII are weird or special somehow.
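Here's a small sketch of those three hacks in Python, first working on ASCII and then falling over on non-ASCII input. The sample strings are made up for illustration.

```python
import unicodedata

# The classic hacks, which do work on ASCII.
digit_value = ord("7") - ord("0")                # 7
flipped = chr(ord("a") ^ 0x20)                   # "A": case is a single bit
ascii_sorted = sorted(["pear", "apple", "fig"])  # code-point order is alphabetical

# The same hacks outside ASCII.
broken_flip = chr(ord("ß") ^ 0x20)               # "ÿ", which is not a capital ß
broken_sort = sorted(["zebra", "Äpfel", "apple"])  # "Äpfel" sorts after "zebra"
arabic_three = unicodedata.digit("\u0663")       # 3: needs a lookup, not subtraction

print(digit_value, flipped, ascii_sorted)
print(broken_flip, broken_sort, arabic_three)
```

Sorting properly needs a collation library; plain code-point order is only right by accident.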

Now that world is gone, and good riddance. The habits die hard, though.

Just_Asking_Questions · May 28, 2015, 10:42pm

Found another one in the wild today.

From a Yahoo! news article:

The next sentence had properly displayed double quote marks. But the (apparently) single quotes and apostrophe failed to convert.
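The garbled text itself isn't shown here, so this is only a guess at the mechanism: one common way a curly apostrophe turns into odd characters is when UTF-8 bytes get decoded as Windows-1252. A minimal Python sketch of that assumption:

```python
# Assumption: the page was saved as UTF-8 but read as Windows-1252 (cp1252).
apostrophe = "\u2019"                      # ’ RIGHT SINGLE QUOTATION MARK
utf8_bytes = apostrophe.encode("utf-8")    # b'\xe2\x80\x99'
garbled = utf8_bytes.decode("cp1252")      # three characters: â € ™

# If nothing else mangled the text, reversing the mistake recovers it.
repaired = garbled.encode("cp1252").decode("utf-8")

print(utf8_bytes, garbled, repaired)
```

Other mismatches (or a font missing the glyph) would produce different junk, so this is one possibility rather than a diagnosis.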

LSLGuy · May 29, 2015, 1:31am

I always enjoyed the Turkish (dotted vs. undotted) * (upper vs. lowercase) I. Case folding is, as you’ve said, very far from trivial.
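A quick sketch of the Turkish I problem in Python. Note that Python's built-in str methods use Unicode's default mappings and know nothing about locale, so they give the non-Turkish answers; getting the Turkish ones would need a locale-aware library.

```python
# The four letters: I (U+0049), i (U+0069), İ (U+0130), ı (U+0131).
plain_upper_lowered = "I".lower()          # "i"; Turkish expects dotless "ı"
plain_lower_uppered = "i".upper()          # "I"; Turkish expects dotted "İ"
dotless_uppered = "\u0131".upper()         # "I"
dotted_capital_lowered = "\u0130".lower()  # "i" + U+0307 COMBINING DOT ABOVE

# Lowercasing one character produced two code points.
lowered_length = len(dotted_capital_lowered)

print(plain_upper_lowered, plain_lower_uppered, dotless_uppered, lowered_length)
```

So the right answer for "lowercase of I" depends on what language the text is in, which no per-character table can tell you.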
