I looked at the source code of this very page. Right near the top, it says
<meta charset="utf-8">
Now
If you view the source of the problematic web site, some joker may well have put in a different charset while leaving the page as is.
The GDPR says stuff like “Data controllers should be encouraged to develop interoperable formats that enable data portability”, which I suppose can be interpreted as encouraging the use of standardized character sets.
Meh. All this stuff is commonly referred to as extended ASCII. Which I’m also not going to spell out every single time.
I don’t think a program exists which renders just the lower 7 bits of ASCII and nothing more. They all pick some code page to render the 128-255 characters. So plain ASCII doesn’t really exist as such.
And as long as we’re getting technical, ASCII is just a character set, not a storage specification. There are other ways of storing the 7 bits of each ASCII character than 8-bit bytes, but they’re so rare they’re barely worth mentioning.
Yes, but that UTF-8 may have taken a circuitous route to get to your screen. Various languages/environments, including Windows, may have a different internal encoding. For instance, C/C++ code in Windows uses UTF-16LE for its wchar_t type. But say your database returns (single-byte) char data. The naive programmer knows they need to convert one to the other, finds some routine off Stack Overflow or whatever, and finds that it works. Success! Some later piece of the program converts back from UTF-16LE to UTF-8.
The problem of course is that if that char data was UTF-8, and not ASCII+codepage 1252 or whatever, it will come out like the above.
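The round trip described above can be reproduced in a few lines. This is a minimal Python sketch, not the code of any particular site. It assumes the wrongly guessed code page is 1252, and the variable names are made up for illustration.

```python
# A curly apostrophe (U+2019), stored correctly as UTF-8 bytes.
original = "\u2019"
utf8_bytes = original.encode("utf-8")      # b'\xe2\x80\x99'

# The naive step: treat each byte as a codepage 1252 character.
misread = utf8_bytes.decode("cp1252")      # three characters instead of one

# A later step converts back to UTF-8 for output; the damage is now baked in.
output_bytes = misread.encode("utf-8")

print(misread)                 # the familiar three-character junk
print(output_bytes.hex(" "))   # the bytes that actually reach the browser
```

Both conversions "work" in the sense that neither raises an error, which is probably why this kind of bug survives testing on plain-ASCII data.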
I’ve seen this bug on slashdot.org within the past year, at least. And lots of other places, though I don’t remember specifically where. Occasionally, there is a kind of third conversion, where the ™ comes out as (TM) or some such, because there’s some step that recognizes ™ isn’t plain ASCII, and decides to translate it to something printable.
I had no idea my idle curiosity about a random observation of the ’ symbol would turn out to be so complicated. I know almost nothing about coding, but I want answers to ‘anything’ I don’t understand.
I’ll add my guess after reading many discussions on my question. So here’s my educated guess from reading 7 different articles, including the comment sections.
And my pick is
When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That’s the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.
I did C&P from the site. It was a rather long article hence my aggravation plus I am a speed reader. The only time I read at a relaxed pace is when I need to solve a problem.
The strangest thing I didn’t mention was its lesson on different methods of hypnosis and theory. The skeptic in me is thinking this is not right, good try though!
A slightly weird one, but the same basic idea. The description contains this quote:
“To avoid repeating myself I figure it might be worthwhile briefly explaining why here⦔
Ok, some junk at the end. That a with circumflex looks suspicious, though. If we follow the tweet, we see it says this:
to avoid repeating myself I figure it might be worthwhile briefly explaining why here…
So it’s an ellipsis. In UTF-8, that ellipsis has a code E2 80 A6. In decimal, that’s 226 128 166. But it looks like there are only two junk characters above. Looking at the article source, it comes through as:
“To avoid repeating myself I figure it might be worthwhile briefly explaining why here&#226;&#166;”
It’s not actually using Unicode at all; it’s using HTML numeric references to specify the characters. As it happens, the two shown match their equivalents in codepage 1252. That is, &#226; = â and &#166; = ¦.
The slight mystery is what happened to the 0x80 (128) byte. If they’d been using codepage 1252, it should have come through as a €, but it didn’t. However, I note that &#128; comes through as “pad”. To be honest, I don’t know precisely what that means, but it’s considered a control character.
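The byte values quoted above are easy to check. Here is a small Python sketch that prints each UTF-8 byte of the ellipsis in hex and decimal, along with the character codepage 1252 assigns to that byte. The loop and variable names are just for illustration.

```python
ellipsis = "\u2026"                # the single-character ellipsis
raw = ellipsis.encode("utf-8")     # three bytes: E2 80 A6

for b in raw:
    # errors="replace" guards against the few byte values cp1252 leaves undefined
    as_1252 = bytes([b]).decode("cp1252", errors="replace")
    print(f"0x{b:02X} = {b:3d} -> {as_1252!r}")
```

The middle byte, 0x80, is the one that goes missing in the Slashdot output. In codepage 1252 it is the euro sign, while in ISO 8859-1 and in Unicode the same value falls in the C1 control range.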
So all in all, I believe the defect in the Slashdot code is that while it takes UTF-8 input, when displaying the page it processes the data a byte at a time, converting values 128-255 directly to the HTML codes (via &#nnn; syntax), and stripping out anything identified as a control character. Completely sloppy programming, though not totally surprising coming from an ancient codebase like that.
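To make that guess concrete, here is a hypothetical reconstruction in Python. It is not Slashdot's actual code (which I haven't seen); the function name and the exact rule for what counts as a control character (0x80 to 0x9F) are assumptions chosen to match the observed output.

```python
def bytewise_to_html(data: bytes) -> str:
    """Hypothetical reconstruction of the suspected bug, not real Slashdot code."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))       # plain ASCII passes through unchanged
        elif b < 0xA0:
            continue                 # 0x80-0x9F: treated as control characters, dropped
        else:
            out.append(f"&#{b};")    # 0xA0-0xFF: emitted as a numeric reference
    return "".join(out)

summary = "why here\u2026"           # ends with a real ellipsis character
print(bytewise_to_html(summary.encode("utf-8")))   # why here&#226;&#166;
```

Under those assumptions the three-byte ellipsis turns into exactly two numeric references, which matches what shows up in the page source.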
Addendum:
The article description, used for the OneBox above, has this text:
To avoid repeating myself I figure it might be worthwhile briefly explaining why here...
That’s right, three periods in a row. So something recognized the ellipsis code and converted it to three separate periods. Very strange how it differs from the main summary code.
I set the board to use the largest font. On my screen, the summary cuts off sooner than that. So it may just be the board software adding ellipses to where it cut off the summary, rather than actually recognizing what the original is supposed to mean.
Plausible under other circumstances, but if you look at the source of the Slashdot page, it is actually three periods. Search for “og:description”, which is the Open Graph tag that OneBox uses.
One is the same as the OP, a curly apostrophe that translates to ’. The next is the word PrÃ¼m. If we again assume codepage 1252, the junk chars translate to C3 BC, which is the UTF-8 sequence for ü. So our city is Prüm, Germany, which makes sense in context. Finally, there is a stray Â. I think this might have been cut off from something else, since it isn’t an obvious corruption of anything by itself.
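The manual decoding above can be automated. This is a minimal Python sketch with a made-up helper name, assuming exactly one round of UTF-8 misread as codepage 1252. It won't handle every case, since a few byte values have no codepage 1252 character at all.

```python
def undo_mojibake(text: str) -> str:
    """Reverse one round of UTF-8-read-as-cp1252 damage (hypothetical helper)."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text   # not this kind of damage, so leave it alone

print(undo_mojibake("Pr\u00c3\u00bcm"))        # Prüm
print(undo_mojibake("\u00e2\u20ac\u2122"))     # the curly apostrophe
print(undo_mojibake("\u00c2"))                 # a lone Â is returned unchanged
```

The lone Â maps to byte C2, which only starts a UTF-8 sequence, so it fails to decode and comes back as is. That is consistent with it being a leftover from something that got cut off.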
Anyway, these are again bugs with the web site, though the nature of the bug is slightly different here than on Slashdot. I’m guessing we’ll still be seeing these errors 20 years from now.
Check out the mouseover text. It’s intentional, of course (that’s the joke). There are corrupted left and right curly quotes, a curly apostrophe, and an em-dash.
A circular one, often red; the top looks like either a chess queen or the front of a Klingon ship. It also has “wings” that follow the circle’s curve and point forward. I see this all the time on various (private) vehicles. Some weird Trek thing, or quasi-religious iconography of some sort?
Please what is the answer? @John_DiFool I spent an hour trying to figure it out. I do not know java or anything but I really have learned a lot, mainly the history of coding going back to FOAF all the way to Hash. That is cool!! I have found a new hobby!
I wish there was a dedicated pinned topic teaching code. Of course it would be whenever y’all felt like it. Is this a possibility? I think the first lesson should be mastering Discourses! I believe it would be a benefit for the community.
Unicode encodes many characters, including the banjo 🪕, the ninja 🥷, and various cuneiform ligatures 𒅬, but, thankfully, no Klingon, which shows that a modicum of sanity still reigns.