Would Someone Please Explain What This Symbol Means

DPRK · July 18, 2021, 5:03pm

I looked at the source code of this very page. Right near the top, it says

<meta charset="utf-8">

Now

If you view the source of the problematic web site, some joker may well have put in a different charset while leaving the page as is.

The GDPR says stuff like “Data controllers should be encouraged to develop interoperable formats that enable data portability”, which I suppose can be interpreted as encouraging the use of standardized character sets.

Dr.Strangelove · July 18, 2021, 7:59pm

Meh. All this stuff is commonly referred to as extended ASCII. Which I’m also not going to spell out every single time.

I don’t think a program exists which renders just the lower 7 bits of ASCII and nothing more. They all pick some code page to render the 128-255 characters. So plain ASCII doesn’t really exist as such.
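A minimal Python sketch of that point: the same byte above 127 decodes to a different character depending on which code page the program assumes, and strict ASCII rejects it outright. The helper name `decode_with` and the sample byte 0xE9 are illustrative choices, not anything from this thread.

```python
from typing import Optional


def decode_with(raw: bytes, codec: str) -> Optional[str]:
    """Decode raw bytes with the given codec; return None if the codec rejects them."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


sample = bytes([0xE9])  # a single byte in the 128-255 range

# Strict ASCII has no meaning for this byte; each code page assigns its own.
for codec in ("ascii", "cp1252", "latin-1", "cp437", "mac_roman"):
    print(f"{codec:>9}: {decode_with(sample, codec)!r}")
```

The byte itself never changes; only the table used to interpret it does.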

And as long as we’re getting technical, ASCII is just a character set, not a storage specification. There are other ways of storing the 7 bits of each ASCII character than 8-bit bytes, but they’re so rare they’re barely worth mentioning.

Yes, but that UTF-8 may have taken a circuitous route to get to your screen. Various languages/environments, including Windows, may have a different internal encoding. For instance, C/C++ code in Windows uses UTF-16LE for its wchar_t type. But say your database returns (single-byte) char data. The naive programmer knows they need to convert one to the other, finds some routine off Stack Overflow or whatever, and finds that it works. Success! Some later piece of the program converts back from UTF-16LE to UTF-8.

The problem of course is that if that char data was UTF-8, and not ASCII+codepage 1252 or whatever, it will come out like the above.
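The round trip described above can be sketched in Python. This is only a model of the bug, assuming the naive step treats each incoming byte as a codepage 1252 character; the function name `naive_round_trip` is made up for illustration.

```python
def naive_round_trip(text: str) -> str:
    """Model the bug: UTF-8 bytes are widened as if they were cp1252 text."""
    utf8_bytes = text.encode("utf-8")           # what the database really returns
    widened = utf8_bytes.decode("cp1252")       # wrong assumption: one byte = one character
    utf16_bytes = widened.encode("utf-16-le")   # the wchar_t-style internal form
    # The later, correct conversion back to UTF-8 faithfully preserves the damage.
    wire_bytes = utf16_bytes.decode("utf-16-le").encode("utf-8")
    return wire_bytes.decode("utf-8")           # what a UTF-8 browser displays


print(naive_round_trip("don’t"))   # donâ€™t
print(naive_round_trip("plain"))   # plain (pure ASCII survives untouched)
```

Pure ASCII text passes through unharmed, which is why this kind of routine can look like it works during testing.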

I’ve seen this bug on slashdot.org within the past year, at least. And lots of other places, though I don’t remember specifically where. Occasionally, there is a kind of third conversion, where the ™ comes out as (TM) or some such, because there’s some step that recognizes ™ isn’t plain ASCII, and decides to translate it to something printable.

Chronos · July 18, 2021, 8:30pm

I’m using Firefox on a Mac, and I still see “circumflex a” “euro” “trademark” in the OP.

DPRK · July 18, 2021, 8:59pm

OK, I do too, but, since the OP starts out as

<!DOCTYPE html>
<html lang="en" class="desktop-view not-mobile-device text-size-normal">
  <head>
    <meta charset="utf-8">
    <title>Would Someone Please Explain What This Symbol Means - In My Humble Opinion (IMHO) - Straight Dope Message Board</title>
    <meta name="description" content=":arrow_right:      â€™       :arrow_left:

I figure the OP was able to cut-and-paste what was visually on his or her screen despite the encoding mismatch.

DPRK · July 18, 2021, 10:21pm

…or else the bad conversion happened even earlier along the pipeline.

shh1313 · July 18, 2021, 11:11pm

I had no idea my idle curiosity about a random observation of the â€™ symbol was so complicated. I know almost nothing about coding, but I want answers to ‘anything’ I don’t understand.

I’ll add my guess after reading many discussions of my question. It’s an educated guess based on 7 different articles, including their comment sections.

And my pick is

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That’s the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.

shh1313 · July 19, 2021, 12:07am

I did C&P from the site. It was a rather long article hence my aggravation plus I am a speed reader. The only time I read at a relaxed pace is when I need to solve a problem.

The strangest thing, which I didn’t mention, was that it was a lesson on different methods of hypnosis and theory. The skeptic in me is thinking this is not right, good try though!

shh1313 · July 19, 2021, 4:35am

Duplicate post deleted.

Dr.Strangelove · July 19, 2021, 9:04am

An example from today:

A slightly weird one, but the same basic idea. The description contains this quote:

“To avoid repeating myself I figure it might be worthwhile briefly explaining why hereâ¦”

Ok, some junk at the end. That a with circumflex looks suspicious, though. If we follow the tweet, we see it says this:

to avoid repeating myself I figure it might be worthwhile briefly explaining why here…

So it’s an ellipsis. In UTF-8, that ellipsis has a code E2 80 A6. In decimal, that’s 226 128 166. But it looks like there are only two junk characters above. Looking at the article source, it comes through as:

“To avoid repeating myself I figure it might be worthwhile briefly explaining why here&#226;&#166;”

It’s not actually using Unicode at all; it’s using an HTML numeric reference to specify the characters. As it happens, the two shown match their equivalents in codepage 1252. That is, &#226; = â and &#166; = ¦.

The slight mystery is what happened to the 0x80 (128) byte. If they’d been using codepage 1252, it should have come through as a €, but it didn’t. However, I note that &#128; comes through as “pad”. To be honest, I don’t know precisely what that means, but it’s considered a control character.

So all in all, I believe the defect in the Slashdot code is that while it takes UTF-8 input, when displaying the page it processes the data a byte at a time, converting values 128-255 directly to the HTML codes (via &#nnn; syntax), and stripping out anything identified as a control character. Completely sloppy programming, though not totally surprising coming from an ancient codebase like that.
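Here is a rough Python model of that guess. It is not Slashdot’s actual code, just a sketch of the behaviour hypothesized above: walk the UTF-8 input one byte at a time, turn values 128-255 into `&#nnn;` references, and drop anything Unicode classifies as a control character. The function name `bytewise_escape` is invented for the example.

```python
import html
import unicodedata


def bytewise_escape(text: str) -> str:
    """Escape UTF-8 input one byte at a time, as the hypothesized bug would."""
    pieces = []
    for byte in text.encode("utf-8"):
        if byte < 128:
            pieces.append(chr(byte))             # plain ASCII passes through
        elif unicodedata.category(chr(byte)) == "Cc":
            continue                             # 128-159 are control characters: stripped
        else:
            pieces.append(f"&#{byte};")          # 160-255 become numeric references
    return "".join(pieces)


print(list("…".encode("utf-8")))    # [226, 128, 166]
escaped = bytewise_escape("why here…")
print(escaped)                      # why here&#226;&#166;
print(html.unescape(escaped))       # why hereâ¦
```

Under these assumptions the three-byte ellipsis loses its middle byte (128) and the remaining two show up as â and ¦, matching the two junk characters in the summary.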

Addendum:
The article description, used for the OneBox above, has this text:

To avoid repeating myself I figure it might be worthwhile briefly explaining why here...

That’s right, three periods in a row. So something recognized the ellipsis code and converted it to three separate periods. Very strange how it differs from the main summary code.

BigT · July 19, 2021, 9:37am

I set the board to use the largest font. On my screen, the summary cuts off sooner than that. So it may just be the board software adding ellipses to where it cut off the summary, rather than actually recognizing what the original is supposed to mean.

Dr.Strangelove · July 19, 2021, 9:41am

Plausible under other circumstances, but if you look at the source of the Slashdot page, it is actually three periods. Search for “og:description”, which is the Open Graph tag that OneBox uses.

Dr.Strangelove · July 20, 2021, 5:32am

Yet another example with multiple errors:

One is the same as the OP, a curly apostrophe that translates to â€™. The next is the word PrÃ¼m. If we again assume codepage 1252, the junk chars translate to C3 BC, which is the UTF-8 sequence for ü. So our city is Prüm, Germany, which makes sense in context. Finally, there is a stray Â. I think this might have been cut off from something else, since it isn’t an obvious corruption of something else.
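The decoding done by hand above can be sketched in Python, assuming (as the post does) that the junk came from UTF-8 bytes being read as codepage 1252. The helper name `undo_mojibake` is hypothetical, and it simply gives up and returns the input when the bytes do not form valid UTF-8.

```python
def undo_mojibake(text: str) -> str:
    """Reverse a UTF-8-read-as-cp1252 mix-up; return the input unchanged if that fails."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


for sample in ("â€™", "PrÃ¼m", "Â"):
    print(repr(sample), "->", repr(undo_mojibake(sample)))
```

The first two samples come back as ’ and Prüm. The lone Â maps to the single byte C2, which is only the first half of a two-byte UTF-8 sequence, so it cannot be decoded and is returned as is.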

Anyway, these are again bugs with the web site, though the nature of the bug is slightly different here than on Slashdot. I’m guessing we’ll still be seeing these errors 20 years from now.

Dr.Strangelove · July 21, 2021, 9:13am

Just one more example and then I’m done:

Check out the mouseover text. It’s intentional, of course (that’s the joke). There are corrupted left and right curly quotes, a curly apostrophe, and an em-dash.

John_DiFool · July 21, 2021, 1:35pm

Oh, I got one.

A circular one, often red, the top looks like either a chess queen, or the front of a Klingon ship. Also has “wings” too that follow the circle’s curve and point forward. See this all the time on various (private) vehicles. Some weird Trek thing, or quasi-religious iconography of some sort?

shh1313 · July 21, 2021, 3:14pm

That’s driving me nuts!! I new plz

shh1313 · July 21, 2021, 11:36pm

Please what is the answer? @John_DiFool I spent an hour trying to figure it out. I do not know java or anything but I really have learned a lot, mainly the history of coding going back to FOAF all the way to Hash. That is cool!! I have found a new hobby!

I wish there was a dedicated pinned topic teaching code. Of course it would be whenever y’all felt like it. Is this a possibility? I think the first lesson should be mastering Discourses! I believe it would be a benefit for the community.

Fingers crossed!

@BigT @codinghorror @Chronos @Dr.Strangelove @What_Exit @engineer_comp_geek @John_DiFool

Dr.Strangelove · July 21, 2021, 11:48pm

This one?

Star Wars, not Trek. Logo for the Rebel Alliance.

shh1313 · July 22, 2021, 12:10am

Yeah! I missed that. How do you code that?

The HASH is AWESOME I found a site where you could experiment with it …

Let me find it.

shh1313 · July 22, 2021, 12:25am

@Dr.Strangelove Go do it! They won’t let me link to the experiment. Hunt for it. It’s there.

Simulations are powerful tools for thought: safe virtual environments for learning and experimentation.


DPRK · July 22, 2021, 12:50am

Unicode encodes many characters, including the banjo 🪕, the ninja 🥷, and various cuneiform ligatures 𒅬, but, thankfully, no Klingon, which shows that a modicum of sanity still reigns.
