Get off my internet! (Old Fogey question about little boxes with letters in 'em)

So…what are they? I noticed one in a post here on the Dope, but danged if I can find it again. Also in an email forward my mom sent me today, and I remember seeing a whole bunch of 'em on some website full of time-waster games. What are these things, where did they come from, and when will they be going the way of the lolcat?

Example:
��

It’s a bug in the character encoding in Firefox 3.

Mozilla seems to be claiming it’s the user’s fault, because it’s hard to re-create. But since nobody using FF2 has this problem, while thousands are suffering it in FF3, it has to be a bug.

Go into the View menu and try jiggering the various character encoding options until it does something closer to accurate.

Woah! Unicode (UTF-16 Little Endian) looks like an acid trip, man! :cool:

Nothing seems to make them look like anything different. Can you tell me what my example boxes *should* look like? I cut and pasted them out of that forwarded email I got today.

Questionable logic, there. See Martian Headsets – Joel on Software.

If it makes you feel any better, I see:

Santo Rugger, darling, if I have a problem with the way my computer shows things, it doesn’t help if you show me things on my computer! It’s a little like my cable company trying to show me the benefits of their HD channels on my 10 year old television set! :smiley:

But I assume you mean it shows two question marks, yes? On my 'puter, it’s two little rectangles, each one with a small FF over a small FD, like a little four-letter eye chart.

Hey, you started it! :stuck_out_tongue:

I was just hoping somebody knew what the problem was on my end, because regardless of the character encoding I use, all I see are two question marks.

GuanoLad: Really specious logic, there. You could apply that same logic to all bug fixes that change what the user sees.

WhyNot: You’re somehow entering characters that don’t have actual glyphs associated with them. (A glyph is the letter you see on the screen or on the page. A font is a collection of glyphs, and a typeface is a collection of glyphs of a specific size.) Firefox is trying to be ‘helpful’ by telling you what Unicode codepoint (collection of bytes, kind of*) is producing that character in the file. That way, you might be able to find and fix it more easily.

Interestingly, the actual codepoints you have in your OP are two replacement characters. They’re supposed to look like boxes or (maybe) question marks, because they indicate characters you cannot represent.
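
If you’re curious, here’s a quick way to see what those two characters actually are. A minimal Python sketch (the stand-in string below assumes they really are replacement characters; paste the originals in instead to check):

```python
import unicodedata

# Stand-in for the two characters pasted in the OP.
s = "\ufffd\ufffd"

for ch in s:
    # Prints "U+FFFD REPLACEMENT CHARACTER" twice; U+FFFD is the
    # codepoint Firefox draws inside each little box.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unknown>"))
```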

*(Unicode has a complex relationship with the sequence of bytes that actually end up in a file. Look up ‘UTF-8’ and ‘UCS-2’ if you’re interested.)
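
A tiny Python illustration of that relationship; the snowman codepoint is just an arbitrary example I picked:

```python
# One codepoint, different byte sequences depending on the encoding.
ch = "\u2603"                  # SNOWMAN
print(ch.encode("utf-8"))      # b'\xe2\x98\x83'  (three bytes)
print(ch.encode("utf-16-le"))  # b'\x03&'         (two bytes)
```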

Is it?

If it used to work fine in FF2, no longer works after upgrading to FF3, and yet works just fine in IE7, then doesn’t the blame lie with FF3? Why shouldn’t it?

Hmmm…I’m using Safari on a Mac and I’m seeing 2 black rhombuses (rhombi?) with a white question mark in each of them…no eye chart, no “FF” or “FD.”

I had the same problem in a discussion on this board about kanji characters. Turns out I needed to install the Japanese character set. That could be the issue here, whether it’s a foreign character set, a mathematical symbol set, etc.

You have to carefully define “works” and “no longer works”. Saying “it looks ok to me” isn’t really specific enough. I’m not trying to be mean; this is an argument I regularly have with other programmers (although in their case I don’t forgive them as easily). Sure, those little boxes don’t look right, and it might be the case that in FF2 you don’t get them (although I would wonder what you actually get instead, and whether it’s really any better…), but the real question is: is what FF2 is doing really correct? Does FF2 just happen to look ok for a few example cases that are convenient for you, but fail for a large set of cases you haven’t accounted for?

Remember, there’s a whole world out there, and we’ve been notoriously anglo-centric in the past when it comes to making assumptions about how to display things. A lot of the pain we’re having with character encodings today is the long term result of our past failure to deal with character sets outside of our own in the first place.

Then who do you blame? How do you determine what the fix is?

It’s got to be the changes in FF3 that are causing this. Therefore it’s FF3 that should be changed to make it work for everybody. It may not strictly be a “bug”, semantically speaking, but it is something in FF3 that’s messing things up.

The point is that there is no way to make it work for everybody. It’s impossible; we have collectively painted ourselves into a corner that we can’t get out of without breaking something for someone.

I’m on a Mac, on both FF3 and Safari I see two black squares oriented on their corners (or rhombuses, as you say) with a question mark in each.

I’d like to expand on my claim that there’s frequently no easy way to solve these sorts of problems. However, since we don’t know much beyond the vague details of the OP, I’m going to construct a fictional example of the problem posed by encodings, so I can better illustrate why these problems don’t have a straightforward “just make it work” type of fix. The details here are simplified and not meant to reflect any specific problem someone is having, although the scenario is meant to be analogous (and the technical details of how encodings work should be accurate).

Let’s say I write a simple web page. In fact, it’s not even a proper web page at all, there’s no HTML or anything, it’s just this:

```
hello everyone!
```

Except I’m not satisfied with that, I want a copyright notice in there, so I look and sure enough, there’s a handy character I can use for the © symbol. So I put that in:

```
hello everyone! © copyright by me
```

(If you don’t see the © that’s ok! It’s kind of the point I’m trying to make…)

Now at this point, I’ve made a small mistake. I used a character that looked ok to me, but here’s how that copyright symbol looks to the computer (in binary):

```
10101001
```

See that very first “1”, all the way on the left? That’s a problem… what it means is that there’s not enough information in this data for another computer to know how to display it. It’s an “extended” character, and that means that this data is encoded in some way and has to be decoded by whatever is going to display it. I didn’t notice this when I wrote it, because my computer is naturally decoding it the same way that it encoded it… so it looks the same. In this case, my computer is set to use the “latin-1” encoding, so as long as everyone else’s computer assumes everything is in latin-1 they will see the same thing.
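
Here’s that paragraph restated as a Python sketch (the language choice is mine; nothing in the scenario depends on it):

```python
# The copyright sign is the single byte 0xA9 in latin-1...
raw = "©".encode("latin-1")
print(f"{raw[0]:08b}")        # 10101001 -- that leading 1 is set

# ...and it round-trips fine as long as the same encoding is used
# to decode it as was used to encode it.
print(raw.decode("latin-1"))  # ©
```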

Now throw the browser into the mix. Let’s say the popular open source browser FireWombat 1.0 started out by assuming everything should be latin-1 if there was no indication otherwise. This worked well at first because FireWombat was developed in the US and was mainly used by people in the US. And my site always looked just like I wrote it, even if you loaded it up somewhere like Japan, because even though they don’t really use latin-1, FireWombat is set to use it anyways, so they still get my copyright symbol as I intended, and all is well.

But FireWombat gets popular, and people in other parts of the world start writing web pages. They of course use annoying* non-latin-1 encodings with other characters (the ones in their actual language). They are annoyed because when they write their version of the “hello everyone!” page in their own language, it comes out as garbage, since FireWombat always assumes everything is in latin-1. They have to set special things on their webserver that they don’t understand, or put mysterious code in their pages that they won’t remember to include, just to get FireWombat to display their pages right.

So FireWombat 2.0 changes. Now it uses the user’s computer’s default encoding to display everything. People in Japan now see their own pages the way they intended them, although here in the US we just see garbage whenever we look at one of their pages. That’s ok though, because who understands Japanese anyways, right?* Unfortunately, when they view my page, that one copyright character is garbage. I get complaints from my users in other countries, who wonder why the site broke (these people take their copyright notices very seriously).
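
To make that concrete, here’s a sketch using Shift JIS as a stand-in for “the default encoding on a computer in Japan” (the exact garbage character you’d see is my assumption, not part of the story above):

```python
page = "hello everyone! © copyright by me".encode("latin-1")

# The ASCII bytes survive intact, but the lone 0xA9 byte is read
# as a half-width katakana character instead of a copyright sign.
print(page.decode("shift_jis"))  # hello everyone! ｩ copyright by me
```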

Even worse, the developers at FireWombat continue to get complaints. Users want their browser to default to a “new” (I should say newer) encoding called Unicode, which is meant to be universal to all languages. This way all pages will look right to all people no matter where they are. Except when FireWombat 3.0 comes out and starts defaulting to Unicode for everything, it breaks my old site. Since that copyright symbol (with the first “1” set) is not a valid byte sequence in Unicode’s UTF-8 encoding, FireWombat replaces it with the funny little square with a question mark in it, because it doesn’t know what else to do with it, and figures you should know that something is wrong.
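
That replacement character is exactly the one in the OP. A small Python sketch of the behavior:

```python
page = "hello everyone! © copyright by me".encode("latin-1")

# A lone 0xA9 byte is not a valid UTF-8 sequence, so a strict
# decode fails outright...
try:
    page.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)

# ...while a forgiving decode substitutes U+FFFD, the little
# box/question-mark replacement character.
print(page.decode("utf-8", errors="replace"))
```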

I am outraged, and I complain. FireWombat developers insist I can fix this by setting something on my web server to tell the browser I want the page displayed using latin-1, but I’m not very technical, and besides, I didn’t change anything; they did. It works just fine under FireWombat 1.0, so they should just fix it. But FireWombat can’t go back to 1.0 behavior, because now they’ll break all of the sites that rely on the new behavior. We are at an impasse, and someone is going to lose.

Again, this is mostly a mocked-up situation and is not meant to reflect anything exactly (the real situations and arguments are much more mind-numbingly technical). The short version is that there is lots of existing content, lots of new content, and constantly changing browsers from different companies that have to decide between appeasing new users and preserving the functionality of millions of existing pages. Something is going to break somewhere.

*I’m not really this obtuse; I’m mocking the mentality that got us into this mess in the first place.

Couldn’t FireWombat 3.0 recognize that there are characters it can’t understand using Unicode, then fall back on the character handling it used for FireWombat 2.0, and if that doesn’t work, 1.0?

There’s no way to tell when to fall back from 2.0 behavior to 1.0 behavior, because there’s no way to know it’s broken (for example, there is no reliable way to tell the difference between latin-1 and iso-8859-5 Cyrillic). It doesn’t look right to you as a person, but to a computer it’s fine.
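
Here’s what that looks like in practice (Python again; the byte value is an arbitrary example):

```python
# Every byte value is a valid character in both encodings, so no
# program can prove which one the author meant.
data = bytes([0xE0])
print(data.decode("latin-1"))    # 'à'  (Latin small letter a with grave)
print(data.decode("iso8859-5"))  # 'р'  (Cyrillic small letter er)
```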

Regarding detecting broken Unicode characters and switching encoding: this is referred to as “sniffing” and is used by some applications. Unfortunately it’s not perfect; there are some (rare) situations where a computer won’t be able to tell the difference between a couple of latin-1 characters and a single UTF-8 character. As a result, the encoding “sniffer” could end up picking the wrong one. (There is a minor but rather funny bug with notepad.exe where exactly this happens.)
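
A concrete case of that ambiguity (my own example, not the notepad.exe one):

```python
# The two latin-1 characters 'Â' (0xC2) and '©' (0xA9) happen to
# form the valid UTF-8 encoding of a single '©'. A sniffer that
# sees these two bytes can't know which reading was intended.
data = "Â©".encode("latin-1")    # b'\xc2\xa9'
print(data.decode("latin-1"))    # Â©  (two characters)
print(data.decode("utf-8"))      # ©   (one character)
```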

Human nature is such that website authors will end up relying on this “feature” of FireWombat 3.0 to do autodetection, rather than worry about specifying the encoding properly. This autodetection will have a built-in bug that can’t be fixed: sometimes, when the page content is just right, it will pick the wrong encoding. Once again, users will wonder why it works most of the time but not always, when really it goes all the way back to a problem with the website.

Now maybe this seems preferable: trading a common problem due to ignorance for a rare but unsolvable problem later. However, this is the thinking that got us into this mess. By trying to be as accepting as possible of all input, we entrench that behavior and make it more and more difficult to actually make any improvements in the future.

I’m on Ubuntu Linux/Firefox 3 and see the same.

Same with Opera,
and Epiphany
and Galeon
and Kazehakase

Two empty squares in Konqueror.

You got that turned around.

The typeface is the overall design.

A font is a complete set of the characters in a specific size (and weight, etc.) of a particular typeface.