It’s unfixable only in the sense there are a lot of amateurs being paid to operate supposedly-professional websites. And there are a lot of homebrew content creation and publishing processes used even in sophisticated companies.
You’ll keep seeing this at any given website until the content creators and the website operators of that site A) are made aware of the issue, B) give a hoot and C) are given the resources (time and software) to make correcting this stuff simple enough (or automatic enough) that it actually gets done every time.
You’ll keep seeing it across the 'net at large until *every* content creator and *every* website operator makes the changes above.
This is not technically hard. It’s only administratively hard for sloppy workers in sloppy processes using long-obsolete tools. And no, Word is not the problem. Neither are so-called smart quotes. This is totally a result of 2005 attitudes still in place in 2015.
2005? Try 1985. We’d solved this problem by the time the Web was really hitting the big time. UTF-8, which is the only encoding anyone should be using on the Internet, existed in 1992, and by 2008 Google was reporting that it was the most-used encoding on the Web. If you use UTF-8 consistently and correctly, none of this happens. None of this is a problem. Your concerns about character encodings are gone, because the only one that matters in this context is UTF-8.
Other people in other contexts have to deal with other encodings. Those encodings aren’t relevant to Web sites.
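For what it’s worth, “using UTF-8 consistently and correctly” mostly comes down to declaring it at every layer and actually encoding the bytes that way. A minimal sketch, assuming nothing beyond Python’s standard library (and certainly not any particular site’s stack):

```python
# Minimal sketch (standard library only): serve a page where the HTTP header,
# the <meta> tag, and the actual bytes on the wire all agree on UTF-8.
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>naïve rôle, ½ price, £5 or ¥700</title></head>
<body><p>Curly quotes “just work” when every layer agrees on the encoding.</p></body>
</html>"""

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")                      # the bytes really are UTF-8
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Utf8Handler).serve_forever()
```

Get those layers out of sync and you’re right back to garbage characters.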
The ampersand symbol in the font used to compose messages for the SDMB looks different than it does in the font that displays posts. One looks like a classic ampersand, the other more like the ligature for “Et”.
We should be glad the ampersand & character works right in posts at all, and likewise a certain few other characters like less-than < and greater-than > (the semi-colon ; gets dragged in too, as the terminator of the escapes).
These are all super-duper special fragile snowflake characters in HTML, and they require special encoding to make them appear at all. When you enter the single character & into a post, the back-end software must translate that to:
&amp;
(that is, ampersand followed by the letters amp followed by semicolon) in the HTML file it sends to your browser.
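In Python terms (purely as an illustration; I have no idea what the board’s back end actually runs), that translation is what html.escape does:

```python
import html

print(html.escape("AT&T <rocks>"))   # AT&amp;T &lt;rocks&gt;
print(html.escape('say "hi"'))       # say &quot;hi&quot;  (quotes too, by default)
```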
Web programming gets messy because there are several separate groups of translations like that which must be made, and they have to be done in the right order too. If your site does any work with databases too (as a message board obviously does), there’s yet another set of character translations that needs to be done to get text into or out of a database without trashing the text (or worse, trashing the entire database).
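Here’s a sketch of that ordering point, using Python and sqlite3 purely for illustration (the table and column names are made up): let the database driver do its own escaping via a parameterized query, and do the HTML escaping exactly once, on the way out to the page.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (body TEXT)")

raw = 'Tom & Jerry said "<hello>"'

# Into the database: no HTML escaping yet, and no hand-built SQL strings.
# The placeholder keeps the quotes in the text from trashing the SQL.
conn.execute("INSERT INTO posts (body) VALUES (?)", (raw,))

# Out of the database and into the page: escape once, at output time.
(stored,) = conn.execute("SELECT body FROM posts").fetchone()
print("<p>{}</p>".format(html.escape(stored)))
# <p>Tom &amp; Jerry said &quot;&lt;hello&gt;&quot;</p>
```

Do those steps in the wrong order (or do one of them twice) and you get the visible &amp;amp; litter that shows up on sloppy sites.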
The problem was actually solved in the early 1970’s or earlier, when all text was plain-old plain-ASCII 7-bit text with a grand total of 94-or-so characters (capital letters, lower-case letters (English only, of course), ten digits, and an odd assortment of punctuation marks).
The encoding problem got unsolved when people started using 8-bit ASCII and adding non-standard extra characters, and got even more unsolved when people started using all kinds of extended character sets, and when (gasp!) people outside the English-speaking United States discovered computers too.
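You can see the “unsolved” state in about two lines: encode text one way, then read it back under a different assumption. A quick sketch in Python:

```python
text = "naïve rôle at ½ off"

data = text.encode("utf-8")      # written out correctly as UTF-8 bytes...

print(data.decode("latin-1"))    # ...read back as ISO-8859-1: naÃ¯ve rÃ´le at Â½ off
```

That Ã-and-Â confetti is exactly the kind of garbage this thread is about.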
Going back even further in history, the problem was solved the day that Adam named all the creatures, but got unsolved at Gen. 11 (Tower of Babel story). It’s all been downhill from there.
This only solved the problem if you only cared about English without good typography, where fractions looked like 1/2 instead of ½, and correctly spelling certain words such as “rôle” and “naïve” wasn’t a concern for you. And I won’t even mention the inanity that was hand-lettered mathematical formulae in otherwise-typeset works because you’re likely old enough to remember stuff like that.
English typography and orthography had already been simplified due to the typewriter and computers inherited that. However, computers are inherently more flexible than typewriters, and so don’t need to be constrained to what will fit on a platen.
… and when people inside the English-speaking United States decided that not being able to have visually-appealing quotation marks or even currency symbols other than the dollar sign was kinda stupid, given that they were now using machines which could typeset reams of documentation in fractions of a second. You want the manuals for Seventh Edition Unix in your hand, replete with bold and underline and justification? No problem. You want to print out the international shipping cost in £ or ¥ without hand-writing it or resorting to some ugly work-around? Go fly a kite.
And to this day, that story is one of the first things to be translated into any respectable constructed language. (Tolkien’s constructed languages predate this, I’m pretty sure.)
Whoa for a sec. I use UTF-8 in my webpages and was able to use Chinese ideograms for the Mandarin teacher. Fancy formatting can be had, but 90% or more of “web programmers” can’t handle anything not done in Dreamweaver or, worse yet, WordPress.
I agree we had a technological solution in 1992: UTF-8. Before UTF-8 we had progress towards a technological solution. But as of today, the actual problem isn’t solved yet; the OP is still seeing evidence of that.
In fact I originally wrote 1995 then changed it to 2005. 2005 is about when web companies got serious about actually using, not just talking about, UTF-8. You’re right that by 2008 it was the largest single encoding. IOW, by then UTF-8 usage had achieved plurality, but not yet majority. It certainly hasn’t achieved universality yet.
tldr: IMO, the problem we all identified and I was discussing is actual usage in the wild. And that problem will be solved when the last ISO-8859-1 or plain ASCII or whatever website goes off the air never to return. Which IMO, will be some time after we disable all support for IPv4. IOW, never in internet years.
UTF-8 does allow this, yes. It’s one of the big reasons to use UTF-8: Every living language you’ll have to work with can be handled using that one encoding. All of them. And the number of them is growing, but UTF-8 remains the same, so your code will still work and your pages will still render correctly years or decades from now, as long as you’re using UTF-8 correctly.
Back in the Ancient Times, before Unicode, each language could potentially need its own character encoding, and therefore it was impossible to have some mix of languages in the same document. For example, Russian and German couldn’t coexist: Russian needs the Cyrillic alphabet, whereas German needs characters such as ß and ö, and back in the Old Days there wasn’t any character encoding scheme which had both. Bizarre hacks, such as transliteration and using image files full of text, were used, but none of them were acceptable.
(There was at least one standard which theoretically allowed documents to shift between a restricted mix of character sets mid-stream in a standardized fashion (and a few non-standard ways to accomplish something similar), but to the best of my knowledge it wasn’t relevant or implemented even when it was still theoretically useful.)
Now, with Unicode to define the character set and UTF-8 to encode it acceptably, documents can have Chinese and Japanese and Russian and Greek and English all at the same time, efficiently and without resort to bizarre hacks.
There are some odd edge cases, and new characters get added to Unicode (but never removed!), but as far as you’re concerned, the problem is solved: Use UTF-8.
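A tiny sketch of the “all at the same time” claim, round-tripping a mixed-script string through UTF-8 (Python, standard library only):

```python
# One string, five scripts, one encoding.
mixed = "Hello Здравствуйте Γειά σου こんにちは 你好"

data = mixed.encode("utf-8")          # every character fits in the one encoding
assert data.decode("utf-8") == mixed  # and the round trip is lossless

print(len(mixed), "characters,", len(data), "bytes")
```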
I still love the fact that the default collation for MySQL is latin1_swedish_ci. A real wake-up call for English-speakers who think the world is asciibetical.
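If anyone wants to opt out of that default, here’s a hedged sketch (it assumes the third-party PyMySQL driver and made-up connection details; utf8mb4 is MySQL’s name for real four-byte UTF-8, since its plain utf8 is a three-byte subset):

```python
import pymysql  # third-party driver, used here only for illustration

conn = pymysql.connect(host="localhost", user="board", password="secret",
                       database="sdmb", charset="utf8mb4")
with conn.cursor() as cur:
    # Ask for UTF-8 storage and a Unicode-aware collation instead of
    # inheriting latin1 / latin1_swedish_ci from an old server default.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS posts (
            body TEXT
        ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
    """)
conn.commit()
```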