I’ve taken a few basic computer science courses, and whenever one dealt with text it was always ASCII. Ever since I was a kid, computers were always limited in the sort of text and things they handled, compared to paper for instance. I’ve heard of Unicode for some time, and assumed it was probably just used in word processors and things like that, but lately I’ve seen it crop up all over the place. There was one thread here that was all in upside-down letters, for instance.
Is everything using Unicode now? Is it just major consumer apps like web browsers and such, or is it everything? Would a student use it for some trivial program, or would they still use ASCII? And, how long has it been in wide use?
I’m fascinated that computer text, which was always such a limited thing, is suddenly so wide open. I’m also wondering how programmers deal with it. It used to be you could have every ASCII code enumerated on a piece of paper, but obviously that’s impossible with Unicode.
At a layman’s level, pretty much all modern programming languages & standards are Unicode. When you create a string variable, the things in it are Unicode characters, not bytes. If you really need to worry about bytes, you can convert back and forth. But generally you can ignore that issue.
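For example, here’s a quick sketch in Python (chosen just as one modern language; the same character/byte split shows up in Java, C#, and the rest):

```python
# In a modern language, a string holds Unicode characters (code points),
# not raw bytes. Converting between the two is an explicit step.
s = "naïve"            # a str of 5 characters
b = s.encode("utf-8")  # a bytes object of 6 bytes ("ï" takes 2 bytes)

assert len(s) == 5
assert len(b) == 6
assert b.decode("utf-8") == s  # round-trips losslessly
```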
So yes, trivial student programs are all Unicode now, at least if written in modern languages. And nobody at that level much cares about the difference between Unicode and ASCII, since trivial programs deal with trivial data.
There are issues that arise in real programs when dealing with multiple human languages which have different alphabets. If I have an input box and am expecting the user to type either “A” or “B”, but they are Russian and have a Cyrillic keyboard, what’s my code going to do when the input is neither “A” nor “B”?
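To make the lookalike problem concrete, here’s a small Python sketch (Cyrillic “А” picked as an example; it renders identically to Latin “A” but is a different character):

```python
# Latin "A" (U+0041) and Cyrillic "А" (U+0410) look the same on screen,
# so a naive string comparison against the Latin letter silently fails
# for Cyrillic input.
latin_a = "A"          # U+0041 LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"  # U+0410 CYRILLIC CAPITAL LETTER A

assert latin_a != cyrillic_a       # visually identical, not equal
assert ord(latin_a) == 0x41
assert ord(cyrillic_a) == 0x410
```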
And then there’s the fun of dealing with RTL languages, like Hebrew & Arabic, not to mention the pictographic languages.
It’s not even close to everything. It’s such a tiny fraction of everything that it might as well be zero. There are billions and billions of lines of legacy software out there in the world, running on emulation layers designed to mimic 50-year-old hardware, and that shit isn’t going anywhere any time soon. A lot of it even uses wacky stuff like EBCDIC, which is an even bigger mess than ISO-8859 code pages are.
The majority of web content is not encoded in UTF-8 or any other Unicode encoding. There are huge numbers of web pages whose HTTP content encoding headers disagree with the encoding declared in their meta tags. There are enormous numbers of pages that don’t specify either, and make your browser guess (often incorrectly).
Things are slowly improving, though. With UTF-8 support in most consumer-facing software (and all modern OSes), the problem has been largely relegated to fixing backend software and all that broken data.
Unicode support is uneven: while lots of software reads and displays it, very little actually writes it out to a file. The default text encoding is generally whatever was historically popular wherever you are (English = ASCII, Japanese = Shift-JIS (Windows) / EUC (*nix), etc.). That’s just the default for the system, and the program just saves it like you put it in.
Plus, Unicode, in many ways, doesn’t actually exist. There’s an 8-bit version (which can grow a character from 1 to 4 bytes), a 16-bit version (which can grow a character from 2 to 4 bytes), and a 32-bit version. Europe generally uses 16-bit since that handles all of the European languages without having to expand. But probably there’s a lot of code that assumes everything will be 16 bits and ignores the possibility of expanded characters. The US wallows about between straight ASCII and UTF-16 because of our closer ties to Europe. China and Japan mostly seem to use UTF-8 and are probably the only ones who actually implement it right, since they need the expandability.
So pretty much it’s all fracked up, and in pretty much any program that deals with anything other than ASCII, you have to sit there translating strings from one encoding to another and back, because different packages you use will each require a particular encoding. And then you have to keep track of what encoding each string is in at the moment so you don’t do something bad to it, etc.
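A sketch of that translation dance in Python, using Shift_JIS as an example legacy encoding (the byte counts below are for these particular characters):

```python
# The same text takes different byte forms in different encodings, and
# bytes must be decoded with the codec they were encoded with.
text = "日本語"
sjis_bytes = text.encode("shift_jis")  # 2 bytes per character here
utf8_bytes = text.encode("utf-8")      # 3 bytes per character here

assert len(sjis_bytes) == 6
assert len(utf8_bytes) == 9

# Decoding Shift_JIS bytes with the wrong codec doesn't quietly work;
# here it raises an error (in other cases you get silent mojibake):
try:
    sjis_bytes.decode("utf-8")
    raise AssertionError("should not decode cleanly")
except UnicodeDecodeError:
    pass
```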
ETA: Java is all UTF-8. They did it right and made it the law.
But I think you got UTF-8 and UTF-16 mixed up; my impression is that Japan and China prefer UTF-16 so they can get most of the Han characters in 16 bits, whereas the West generally goes with UTF-8, since we mostly stick to the characters in the ISO-8859 standards, and the first 256 code points of Unicode coincide with ISO-8859-1 (Latin-1).
Not in my experience. All of the Microsoft “Unicode compatible” functions I’ve seen are based on WCHARs, and when I was in Japan, everything I worked on used UTF-8. Kanji are pretty uniformly 3 bytes in UTF-8 but would be 4 bytes in UTF-16. Hiragana and Katakana are 2 bytes either way. So overall, text takes up the least space in UTF-8. I would presume that the same is true for Chinese. I think the Unicode character space up in the 32-bit area is all weird symbols, or currently left open for expansion. There may be some Chinese characters in there, but the characters used 99% of the time are going to be in the 24-bit region.
This is largely still not the case. Beyond legacy C and C++ programs that still use standard libraries assuming 1 byte = 1 character, a lot of modern programming languages are still lagging in this. Ruby, for example, only added full Unicode support in its 1.9 release late last year. PHP will only add native Unicode support in version 6.
Even if we reach that level of global support, there’s still a ton of essential complexity that goes along with Unicode. Among the more obscure issues are the weird Turkish capitalisation rules, in which lower-casing the upper case of a lower-case letter does not give back the original letter, and the security flaws that can come from foreign letters looking similar to ASCII letters.
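The Turkish round-trip failure shows up even with plain locale-independent case mappings; a small Python sketch using the dotless ı:

```python
# Turkish dotless ı (U+0131) uppercases to plain "I", but "I" lowercases
# to the ordinary dotted "i" -- so upper-then-lower does not round-trip.
dotless = "\u0131"  # LATIN SMALL LETTER DOTLESS I

assert dotless.upper() == "I"
assert "I".lower() == "i"                  # dotted i, not dotless ı
assert dotless.upper().lower() != dotless  # the round-trip fails
```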
After I wrote that, I realized there could be some confusion between the Java programming language (what most programmers think of when they say Java) and the Java virtual machine. At the virtual machine level, characters are indeed represented in a variant of UTF-8 (see the Class File Specification). This may be what you were thinking of.
From what I have seen, the divide is more along the lines of Windows vs. Unix rather than Europe vs. Asia (assuming we’re talking about system-level programming). As noted, the Microsoft API has been expanded with “Unicode versions” of its functions that accept 16-bit strings natively, so UTF-16 strings are easier on Windows, whereas this never happened in the Unix world, so UTF-8 strings are easier there.
Actually, the vast majority of East Asian text only needs 2 bytes per character in UTF-16 but requires 3 bytes in UTF-8, so UTF-8 is not really a win space-wise for them. Latin characters with no accents require only 1 byte in UTF-8 (ASCII-equivalent) but 2 bytes in UTF-16; those with accents require 2 bytes in both UTF-8 and UTF-16. So in terms of space, UTF-16 is definitely a loss for Western Europeans.
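Those byte counts are easy to check directly; a Python sketch (using the utf-16-le codec so the 2-byte BOM isn’t counted):

```python
# Per-character storage costs in UTF-8 vs. UTF-16 for a few examples.
kana = "か"      # hiragana: 3 bytes in UTF-8, 2 in UTF-16
han = "中"       # BMP Han character: 3 bytes in UTF-8, 2 in UTF-16
plain = "a"      # unaccented Latin: 1 byte in UTF-8, 2 in UTF-16
accented = "é"   # accented Latin: 2 bytes in both

assert len(kana.encode("utf-8")) == 3 and len(kana.encode("utf-16-le")) == 2
assert len(han.encode("utf-8")) == 3 and len(han.encode("utf-16-le")) == 2
assert len(plain.encode("utf-8")) == 1 and len(plain.encode("utf-16-le")) == 2
assert len(accented.encode("utf-8")) == 2 and len(accented.encode("utf-16-le")) == 2
```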
Wait a minute! Can’t both UTF-8 and UTF-16 encode any Unicode code point, even those outside the Basic Multilingual Plane? Isn’t part of the problem that a lot of people were assuming that a UTF-8 representation would always be from the BMP, and then their code gacked when presented with a code point above U+FFFF?
Yes, both UTF-8 and UTF-16 can encode all Unicode code points, but with different tradeoffs in terms of compatibility with legacy systems and space efficiency.
For characters outside the BMP, I think UTF-16 code is significantly more likely to have problems than UTF-8 code. Back in the early days, the assumption was that 16 bits were enough to represent all of the world’s living languages, so people started to assume that Unicode == 16-bit characters. This assumption pretty much broke down when surrogates were introduced. The people dealing with UTF-8, on the other hand, had been doing most of the work needed to handle multi-byte characters from the very beginning, so for them this really wasn’t a significant change.
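A sketch of what a code point outside the BMP looks like in both encodings (U+1D11E MUSICAL SYMBOL G CLEF, chosen here as an example):

```python
# Outside the BMP, UTF-8 just grows to 4 bytes, while UTF-16 needs a
# surrogate pair -- two 16-bit code units, also 4 bytes total.
clef = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF

assert ord(clef) > 0xFFFF                  # beyond the BMP
assert len(clef.encode("utf-8")) == 4      # one 4-byte UTF-8 sequence
assert len(clef.encode("utf-16-le")) == 4  # a surrogate pair
```

Code that assumed one UTF-16 code unit == one character breaks on exactly these.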
If people had known early on that things would go beyond the BMP, I very much doubt that they would have invested the effort into converting things to 16-bit. They would have either stayed with UTF-8, or would have just moved everything over to 32-bit. UTF-16 is the worst of both worlds right now: it breaks compatibility without giving much in return.
To answer the OP’s question: no, not everything is in Unicode now. It is increasingly being used in many places where it makes sense (an Internet message board with an international audience, as one obvious example); however, existing encodings are perfectly adequate in many other cases without incurring the occasional overhead of Unicode. That said, it is fair to say that most modern operating systems and programming environments are equipped to use Unicode if and when needed.