What the �

Yes, but that’s not what the quote I was responding to said.

Unicode support doesn’t come for free. It has to be implemented across a wide range of products, and it has to deal with third-party systems that use non-standard encodings, that don’t support both byte orderings or all of the UTF encoding forms, and that use a variety of line endings (both standard and not). And that’s not even counting the many systems that don’t handle surrogates and think that’s “OK,” or the “Unicode” fonts that support only odd subsets but are in standard use anyway and need characters substituted from other fonts, again across hundreds of products.
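
To make a couple of those pitfalls concrete, here’s a minimal sketch (Python, standard library only, with a made-up string) of how byte order and surrogate handling trip up code that assumes one code unit equals one character:

```python
# Minimal illustration of two of the pitfalls above: byte order and
# surrogate pairs. (Hypothetical string; Python 3, standard library only.)

text = "héllo 𝄞"   # one accented character, one character outside the BMP

# The same string has different byte sequences in UTF-16 LE and BE;
# a system that only handles one byte order will misread the other.
print(text.encode("utf-16-le") != text.encode("utf-16-be"))   # True

# In UTF-16, the non-BMP character takes two 16-bit code units (a surrogate
# pair). Code that assumes one code unit per character miscounts it.
utf16_units = len(text.encode("utf-16-le")) // 2
print(len(text), utf16_units)   # 7 characters, 8 UTF-16 code units
```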

I’ll admit I have no cite for it, but I’d be stunned if my billion-dollar estimate wasn’t low. This is the foundation on which all localization is built; character encoding issues are at least 2% of my work, and I’m not in a particularly text-oriented part of the company. We’ve got libraries to standardize this stuff, of course, but no company is an island (especially a company that makes a web browser), and we have to work and play well with others, including the vast majority of software developers who don’t have the resources to even understand all the nuances of Unicode, much less the time and dollars to implement them.

The Unicode base specification alone is over six hundred pages long, and that’s not counting the almost four thousand pages of amendments, proposed changes, exceptions, and committee recommendations that have been adopted by various organizations.

TimeWinder’s location says “Redmond”, so it’s not hard to guess which multi-billion-dollar software company he works for. Given their position in the market and the effort they spend on compatibility, a billion a year probably *is* about right. And Google or Apple probably don’t spend that much, because their product lines are more focused and somewhat more isolated.

You don’t need to fix the program, just run it through a charset converter.

But if those characters can’t be represented in the program in question then what do you do with them? Lots of programs don’t support multi-byte characters to begin with.
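
For what it’s worth, here is roughly what “run it through a charset converter” looks like, and what happens to characters the target encoding can’t represent. This is a minimal Python sketch with made-up file names, not a recommendation of any particular tool:

```python
# Convert a UTF-8 text file to Windows-1252, deciding explicitly what to do
# with characters Windows-1252 cannot represent. (Hypothetical file names.)

with open("input_utf8.txt", encoding="utf-8") as f:
    text = f.read()

try:
    # Strict conversion: fails loudly if anything won't fit in the target charset.
    data = text.encode("cp1252")
except UnicodeEncodeError:
    # Lossy fallback: unrepresentable characters are replaced with '?'.
    data = text.encode("cp1252", errors="replace")

with open("output_cp1252.txt", "wb") as f:
    f.write(data)
```

The lossy branch is exactly the problem being raised here: once a character has been replaced with “?”, no downstream converter can bring it back.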

In my experience, it’s not so much a Unicode problem as it is a Microsoft Word problem. MS Word has an autocorrect feature that, by default, will change straight quotation marks to “curly” or “smart” quotes, three periods to an ellipsis, and so on. If a document autocorrected in such a manner is copied and pasted into a plain text format, and in turn displayed on the Web, the result is often strange symbols (to our English-reading eyes, anyhow).

A hypothetical example: the code point of a smart right quote in the Arial font in Word 2003 might be the same as the code point of an inverted Spanish question mark in Unicode. What looks like ❝Hello world❞ in MS Word will display as ¿Hello world¥, or something similar.

Ultimately, it comes down to “garbage in, garbage out.” How would a program know you meant a right curly quotation mark rather than an inverted Spanish question mark, or vice versa?
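
A minimal sketch of how that goes wrong in practice (Python; the point is that the same bytes are being read under a different encoding, not that the font is remapping glyphs):

```python
# Word's "smart" punctuation sits in Windows-1252 at byte values that other
# encodings interpret differently, which is where the strange symbols come from.

original = "\u201cHello world\u201d\u2026"   # curly quotes plus an ellipsis

as_cp1252 = original.encode("cp1252")        # bytes as a Word-era tool might store them

# Reading those same bytes under the wrong encoding produces mojibake:
print(as_cp1252.decode("latin-1"))                   # C1 control characters instead of quotes
print(as_cp1252.decode("utf-8", errors="replace"))   # �Hello world��
```

Which is why the usual fix is declaring or detecting the correct encoding, not swapping fonts.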

Heh. First, one character set’s multi-byte (or nonexistent) character is another character set’s single-byte character. Thinking about text in those terms is broken anyway; characters are characters, not bytes or byte sequences, and the complexities around actually handling them in an intelligent way are sufficient to justify leaning on special-purpose libraries as often as possible.
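
The first point there is easy to demonstrate; a tiny Python sketch, just as an illustration:

```python
# The same character can be one byte, several bytes, or not representable at
# all, depending on which encoding you pick. Bytes are not characters.
ch = "é"
print(len(ch.encode("latin-1")))   # 1 byte in Latin-1
print(len(ch.encode("utf-8")))     # 2 bytes in UTF-8
try:
    ch.encode("ascii")
except UnicodeEncodeError:
    print("no ASCII representation at all")
```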

Your question is a prime example of the complexities here: In some character sets, like the Latin character set used in Western Europe, there are simple rules to take you from one country’s national variant to another’s in a fashion everyone involved agrees is ‘lossless’. For example, the German ‘ö’ is universally agreed to map to ‘oe’ whenever the ‘ö’ is unavailable. Similar techniques apply generally, reducing the problem to a lookup table.
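
A minimal sketch of that kind of lookup-table fallback (Python; the table holds a few illustrative German entries, not a complete or authoritative mapping):

```python
# Tiny, illustrative fallback table for German text when the original
# characters are unavailable. Real tables are per-language: the same 'ö'
# is conventionally transliterated differently in, say, Swedish.
FALLBACK = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def asciify_german(text: str) -> str:
    return "".join(FALLBACK.get(ch, ch) for ch in text)

print(asciify_german("Größe"))   # "Groesse"
```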

Moving Eastwards, Cyrillic and Greek text can almost be mapped to the Latin alphabet using lookup tables as well, but the ‘almost’ will make you look like a fool if you don’t have some extra intelligence to go along with it.

Moving Southwards, you run into the Semitic written languages (Arabic, Hebrew, Syriac, etc.) and the lookup table scheme breaks down entirely if you expect anything intelligible to non-experts to come out the other end. Part of the problem is the simple fact that Semitic scripts are abjads, where vowels are very often not written but can be inferred from context. The bigger problem, though, is that transliteration is a social convention, and the social conventions surrounding transliteration out of Semitic abjads are nothing if not hairy and inconsistent, as the Nine Billion Names of Gaddafi will attest.

Going into the Far East, trying to do anything with writing systems based on Han characters (Traditional Chinese, Simplified Chinese, Kanji, Hanja, etc.) with a lookup table is idiocy and will only result in utter nonsense. It’s likely you’ll need a human to do anything with those writing systems.

So that’s one tiny example of why Unicode is as complex as it is compared to everything that came before: It’s the only text encoding standard that even begins to be complex enough to grapple with the world’s writing systems.

I do hope you are not implying that the Word .doc format encodes text based directly on font glyph mapping tables. I always thought MS was a bit lazy and slipshod in their software designs, but if that were the case, the situation would be much worse than I had imagined.

Sorry to step on toes (or whatever; you sound angry).

But I stand by what I said: this is a solved problem, it was solved in HTML 4 at the absolute latest (released in 1997), and if you are still wrangling with it, it’s due to:

  1. Your company developing software wrong in the first place,
  2. Your company failing to upgrade older software as standards changed,
  3. Your company relying on other companies that have one of the above two problems.

In other words, it’s a people/management problem, not a technology problem. The technology’s solid.

That said, I do completely agree that in an ideal universe, computer technologies would be impossible for people to misuse, and judging by that standard, HTML has been a failure: you can fail to specify a character encoding and it still “mostly” works, when it should probably simply fail and display nothing.

But we don’t live in an ideal universe, we live in this one. So.

Billions of dollars a year? On character encoding issues? Seriously?

Internationalization is much more than just character encoding issues.

Yes.

But all we’re talking about in this thread is encoding issues. So the cost of (the rest of) internationalization is irrelevant.

Unless TimeWinder was trying to change the topic and I didn’t pick up on it.