What the �

Why don’t computers automatically “translate” text so that punctuation and other characters in web pages and e-mails aren’t replaced by weird symbols, e.g. “—”?

Although I lack technical expertise, I am dimly aware that this arises from the incompatibility between (among?) numerous “character sets” and network protocols. Paradoxically, even something called “unicode” comes in several flavors that do not mix well.

This annoyance affects not only personal e-mails, but business e-mails, web pages, etc. I’m sure that most end-users don’t know what to make of the geeky error messages asking whether a reply should be sent “as is” or “Unicode”, etc.

So again: why isn’t there a programming/software fix for this by now? Even simply deleting incompatible characters and/or leaving a blank space would be an improvement. (While they’re at it, they also need to fix HTML so that certain punctuation, e.g. the numeral “8” next to a right parenthesis “)”, doesn’t automatically morph into an emoticon.)

Note to moderators: This OP sounds like a rant, but I really hope someone can shed some factual light on it, so please leave this thread here in GQ!

Computers don’t do anything; software does. There are lots of programs out there that would need this functionality, and with display issues there are many characters that wouldn’t look right anyway.

That’s the advantage of standards: there are so many to choose from.

There are many solutions, all with advantages and disadvantages. You can throw away anything you don’t understand, but that makes you destructive to documents that pass through you, and that’s typically a bad thing. You can decline to display anything you don’t recognize, but that can cause all sorts of confusion by itself. You can attempt to translate, but without knowing what the source was or what the destination is, it’s pretty hard to guess.
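To make those trade-offs concrete, here’s a minimal Python sketch (my own illustration, not how any particular mail client or browser actually does it), decoding the same mis-labelled bytes three ways:

```python
# UTF-8 bytes for "café", received by software that wrongly assumes plain ASCII.
data = "café".encode("utf-8")                  # b'caf\xc3\xa9'

# 1. Throw away what you don't understand (destructive).
print(data.decode("ascii", errors="ignore"))   # 'caf'

# 2. Show a placeholder for what you don't recognize (confusing, but honest).
print(data.decode("ascii", errors="replace"))  # 'caf??' rendered as two replacement chars

# 3. Refuse to guess and fail loudly.
try:
    data.decode("ascii")                       # errors="strict" is the default
except UnicodeDecodeError as exc:
    print("won't guess:", exc)

# "Translating" correctly requires knowing the real source encoding:
print(data.decode("utf-8"))                    # 'café', but only because we knew it was UTF-8
```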

There are dozens of character sets and encodings because the problem is hard. No one designed human languages to line up neatly with one another, so they don’t fit into one simple format. Throw in the various Chinese and Japanese character representations, Arabic, and (shudder) Thai, and any solution you choose as a one-size-fits-all ends up as a mess.

Unicode is a system that assigns a code to every symbol used by humans for communication: the Latin alphabet but also the Greek, Cyrillic, and Arabic ones, the ideographs of Chinese, the syllabaries of Japanese, the yen, Euro, pound sterling, and other currency symbols, the accented, tilde-ified, and umlauted letters of various European languages, and punctuation. MS Word seems to default to using Unicode. The accented letters, smart quotes, the single symbol for the ellipsis (…), and similar Unicode symbols are often not displayed well by browsers that assume ASCII or ANSI coding. The † is probably your computer’s best guess at how to render Unicode symbol #8230, which is actually the ellipsis. And so on.
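For what it’s worth, that mangling is easy to reproduce. Here’s a small Python sketch (my illustration, assuming the common case of UTF-8 bytes being read as Windows-1252, i.e. “ANSI”):

```python
# Each character below is one Unicode code point, but three bytes in UTF-8.
for ch in ["…", "’", "€"]:                 # ellipsis, curly apostrophe, euro sign
    raw = ch.encode("utf-8")
    wrong = raw.decode("cp1252")           # reader assumes Windows-1252 ("ANSI")
    print(ch, raw, "->", wrong)            # e.g.  … b'\xe2\x80\xa6' -> â€¦
```

Run the already-mangled text through another mismatched hop and the junk multiplies, which is where the longer garbage strings come from.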

Then we have oddities like Chrome’s interpretation of the Milwaukee Doubletree Hotel page: it thinks the page is in Japanese and asks if I want to translate it. If I do, nothing changes, so it’s a strange sort of Japanese in the American Heartland.

Au contraire, Unicode symbols (in the UTF-8 encoding) are displayed surprisingly well by browsers that assume ASCII or ANSI, which is why you can see most of the text just fine, with only a few spots where it goes weird. Try interpreting a JPEG stream as a PNG and you wouldn’t get ten bytes in before failing.

UTF-8 is designed to be the same as ASCII for the low characters (0-127). Maybe it would have been better to crash out entirely, to drive home the point that you’re doing something nonsensical.
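That backwards compatibility is easy to check (illustrative Python, nothing beyond the standard library):

```python
text = "Plain old ASCII text survives any UTF-8/ASCII mix-up: (8) stays (8)."
assert text.encode("ascii") == text.encode("utf-8")   # byte-for-byte identical

# Only characters above 127 get multi-byte sequences, so a wrong guess
# scrambles a few spots rather than the whole document.
```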

There are well-established ways of using the correct character set in web pages; this has been perfected for decades. The sites where you’re seeing this problem were coded by people who don’t know what they’re doing.
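The standard mechanism is just labelling the bytes correctly. Here’s a hypothetical, minimal Python sketch (not any real site’s code) of a server that declares UTF-8 both in the HTTP header and in the markup:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = "<!doctype html><meta charset='utf-8'><p>Ellipsis: … Apostrophe: ’</p>"

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        # The header and the <meta> tag both tell the browser how to decode the bytes.
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(("127.0.0.1", 8080), Utf8Handler).serve_forever()
```

Leave those labels off, or get them wrong, and the browser is left to guess, which is exactly when the junk shows up.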

In short: it’s not a computer problem, it’s a people problem.

It’s not quite that simple. There are hundreds of legacy programs and systems that would be difficult and expensive to upgrade, most of them for little gain. Things are probably better in countries with more complex native character sets. Even if you upgrade your personal machine, there may be weak links along the way that make perfect transmission problematic.

:confused: My version of Chrome doesn’t do this. (Neither does Firefox.)

Hmm. Looks normal to me. さよなら?

Since this part of the rant doesn’t seem to have been addressed – this has nothing to do with HTML at all; it’s just a program that believes users would rather have a simple way to write an emoticon than a simple way to write an equation.

Emoticons have never been part of any HTML standard (although if the browser wars in the 90s had gone on any longer, IE would probably have supported them at some point :slight_smile: .)

And this is the root of the problem. UTF-8 is the clear solution for encoding text, and it’s well supported in all modern software. The problem is all the software that isn’t modern, all the web pages using some crazy character encoding because they were written before UTF-8, and all the programming languages that are still stuck on 8-bit characters and Latin-1 or, just to confuse things, on UTF-16 (lookin’ at you, Java).
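A rough illustration of the mismatch between those worlds (Python as a stand-in; the Java jab refers to its 16-bit `char` type and UTF-16 strings):

```python
s = "naïve café, price €5"

print(len(s))                        # 20 characters
print(len(s.encode("utf-8")))        # 24 bytes: the ASCII part is unchanged, ï/é/€ take more
print(len(s.encode("utf-16-le")))    # 40 bytes: 2 per character here (4 outside the BMP)

try:
    s.encode("latin-1")              # the old 8-bit repertoire has no € at all
except UnicodeEncodeError as exc:
    print("Latin-1 can't represent this text:", exc)
```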

A system designed badly enough that it’s even possible for the majority of people to use it incorrectly isn’t a people problem; it’s a computer problem.

However, I’m glad that this was perfected for “decades” (HTML was created in 1990, so they must have been quick about it) – that means that the billion or so dollars a year my company spends on it can now be reclaimed so that we can blow it all on hats.

One man’s “normal” is another’s Japanese.

Firefox doesn’t have translation links to Google, as far as I know, so that’s not surprising. Chrome is highly customizable, so I’m not surprised there, either. It’s likely that something in my system is misleading the browser engine.

But in case you don’t believe me, here is a screen shot, with only the extreme top and bottom (showing tabs) parts removed. Nothing else has been edited.

I haven’t been browsing Japanese sites or doing anything else that might trigger such a function in Chrome, but this Milwaukee hotel site (and only this site) brings up the “translate?” message every time. Weird.

Except, so far, systems are designed by people, not computers, so at the beginning, it really is a people problem.

Yes, they’ve been working on the problem for over two decades.

And your company spends a “billion or so dollars a year” dealing with character-set compatibility issues?? Even if you work for Google or Apple, I doubt they spend 2% of their revenue on this issue alone.

Perfectly intelligible Romulan.

And the problem is . . .?

Here’s a good primer on Unicode character codes:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.

Wà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅit till you deà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅl with stà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅcked dià̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅcritics.

AaronX, that does something very weird (Firefox 20.0.1, Windows 7).

About 15-20 years ago my father-in-law took a stab at comprehensive internationalization for computing systems (he was a lecturer in computer science at a university in NZ) - something more than just Unicode. To really get a grip on the scale of the problem he started studying Mandarin, to the point where he could read and speak the language. I don’t think the interface library developed very far, but he enjoyed the journey (which eventually included a trip to China).