Looking at a table of Unicode codes, I noticed, for instance, that the letter ‘g’ is defined as “U+0067.” What is the “U”? I’m guessing it’s an “escape-like” character – I’m old enough to remember ASCII 27 as the prefix for escape sequences. Is it a hex string that lets the system in question know, “Aha, a Unicode code follows”? If so, what is that hex string?
The U is U+0055, with the official name “LATIN CAPITAL LETTER U”. That means its value in hexadecimal notation is 0055, with the “U+” telling you that the hex number which follows is to be interpreted as a Unicode codepoint. There’s no offset involved.
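To make the notation concrete, here’s a quick Python sketch (Python’s ord() and chr() work directly on codepoints):

```python
# "U+0055" is just notation for the number 0x55; the "U+" never appears in the data.
assert ord("U") == 0x0055      # a character's codepoint, as a number
assert chr(0x0055) == "U"      # and back again

# Printing a codepoint in the conventional U+XXXX form:
print(f"U+{ord('U'):04X}")     # prints "U+0055"
```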
Furthermore, you may be able to enter these codes directly from your keyboard into a file, editor, or text box, depending on your input method. For example, on many Linux systems you can type Ctrl+Shift+U, then the hex digits, then a space or Enter, and that character is entered into whatever text you are typing.
This works on my Linux system. I don’t know how it plays on any Winders system.
…[the UTF-8 byte order mark] is absolutely worthless, because one of the big advantages of UTF-8 is that it has no endianness. It turns Unicode codepoints into sequences of individual bytes, and everyone agrees on how to read a sequence of bytes, so byte order never enters into it. Since there is no byte order to specify, a byte order mark is pointless.
That’s not the only reason to prefer UTF-8: the abilities to easily recognize malformed and damaged files, to recognize files which aren’t UTF-8, and to resynchronize after a lost byte while losing at most one character are all very important in certain applications, and UTF-16, the other major Unicode encoding, offers none of them. UTF-16 also generally needs a BOM (or some out-of-band label) just to tell you which byte order you’re looking at.
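A short Python sketch of the endianness point, comparing the byte sequences the two encodings produce:

```python
s = "é"  # U+00E9

# UTF-8 yields one unambiguous byte sequence on every machine:
assert s.encode("utf-8") == b"\xc3\xa9"

# UTF-16 comes in two byte orders, which is why it needs a BOM:
assert s.encode("utf-16-le") == b"\xe9\x00"
assert s.encode("utf-16-be") == b"\x00\xe9"

# Plain "utf-16" prepends a BOM (U+FEFF) so a reader can tell which is which:
assert s.encode("utf-16")[:2] in (b"\xff\xfe", b"\xfe\xff")
```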
As for “U+” that’s just the notation for writing the numeric value of Unicode codepoints. It has nothing to do with how unicode data is encoded. Look up the various UTF standards for that. (UTF-8 as mentioned above is the most common.)
When it comes to Unicode, this is the understatement of the century. It’s a big topic and even very experienced programmers can get tripped up easily. Just get me started and I’ll talk your ear off about combining diacriticals, canonical equivalence, and UTF-16 surrogate pairs.
This is probably the safest way, but one of the nice features of UTF-8 is that the byte sequences have enough internal structure to them that anything which validates as UTF-8 is almost certainly actually a UTF-8 file. The odds of it successfully parsing without being UTF-8 are minuscule.
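That internal structure is easy to check in practice; here’s a rough validity test in Python (looks_like_utf8 is a hypothetical helper, just to illustrate the point):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Multi-byte UTF-8 sequences must follow a strict bit pattern,
    so random or legacy-encoded bytes almost never decode cleanly."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf8("naïve".encode("utf-8"))
# Latin-1 encodes "ï" as a lone 0xEF byte, which is an invalid UTF-8 sequence:
assert not looks_like_utf8("naïve".encode("latin-1"))
```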
Ah, surrogate pairs. Another thing UTF-8 doesn’t have or need.
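For anyone curious, this is what a surrogate pair looks like in practice (Python again):

```python
ch = "\U0001F600"  # 😀, codepoint U+1F600, outside the Basic Multilingual Plane

# UTF-16 must split it into a surrogate pair: two 16-bit code units.
assert ch.encode("utf-16-be") == b"\xd8\x3d\xde\x00"  # high D83D, low DE00

# UTF-8 just uses a four-byte sequence; no surrogates involved.
assert ch.encode("utf-8") == b"\xf0\x9f\x98\x80"
```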
Anyway, Unicode is inherently more complex than any encoding system which came before it because it is the first family of standards which are, together, complex enough to attempt to do a good job of representing text in more than one or a small number of human languages at a time. ASCII couldn’t even handle all of English and you’re naïve if you think it could; other character encoding schemes did certain things better, most others worse, and weren’t any more universal than ASCII was. Unicode unifies a lot of experience and both practical and theoretical work to get a single body of standards that finally come to grips with text in a coherent fashion.
ASCII could handle English just fine because there are no required letters other than A–Z. Sure, you can use diacritics and other such markings, but they are not strictly necessary. You might not be able to handle every symbol that books tended to use at the time, but that doesn’t mean you couldn’t handle all of English.
My only problem with Unicode is that it seems like a lie. A system that can represent everything–except that most fonts aren’t going to have everything in them. And there are still multiple ways to encode everything. So now you just have the same problems you had before.
It surprises me that such a messed up system works as well as it does.
But only after English had been hammered down to fit on, first, typewriters and cheap printing presses and then pre-Unicode computers. You’re saying that a reduced version of the written language, which was influenced by technical limitations, could fit on technically limited systems. Well, no duh.
So? Most fonts have the characters which are most-used, and text rendering software will mix multiple fonts in the same document to get the glyphs it needs. (Yes, it might not look the best, but it’s readable.)
Obviously not. This is precisely what Unicode solves, so it isn’t a problem with Unicode.
It isn’t nearly as messed-up as you seem to think.
Heh! Ask the database administrator… Mr. Cheré gets outraged when we list him as Mr. Chere. But if you try to search for his name, you have to spell it properly in the search field. Or try to alphabetize a list that includes Mr. Étranger. He won’t fall between Dangerfield and Fitzhugh! Or, how about people with an invisible hard “space” in their names? Robert Boise Herbert wants to be listed as “Boise Herbert,” and not as “Herbert.” And then there are the “Juniors” and “The Thirds.” Oh, yeah, and the “AKA” gang. Mr. Richards, AKA Mr. Fantastic. Oy vey…
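The accent problems, at least, are exactly what Unicode normalization is for. Here’s a rough Python sketch of accent-folding for search and sorting; real-world collation should use a locale-aware library like ICU, so treat this as an illustration only:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD splits "é" into "e" plus a combining acute accent;
    # dropping the combining marks leaves just the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Now "Cheré" and "Chere" can match in a search field:
assert strip_accents("Cheré") == "Chere"

# And Mr. Étranger alphabetizes between Dangerfield and Fitzhugh:
names = ["Fitzhugh", "Étranger", "Dangerfield"]
assert sorted(names, key=lambda n: strip_accents(n).lower()) == \
       ["Dangerfield", "Étranger", "Fitzhugh"]
```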
What I was wanting was a straightforward hex depiction of a Unicode string, where a prefix would alert whatever interpreter might see it, “Hey, Unicode follows.” I was wondering what the Unicode code might be for “Italics Begin Here” and “Italics End Here.”
Ah, how I admired “Reveal Codes” in old Word Perfect!
Unnecessary complexity seems to be a thing with our species! Just look at what they did with Dungeons and Dragons!
ETA: by the way, thank you all for your cites and links! Very educational!
Just to be clear, there’s no hex depiction of a “Unicode string” because Unicode is just a list of numbers (called codepoints) and what those numbers mean, and a set of semantics for how they work together. Encoding is a separate issue; there are multiple ways to encode Unicode text, with UTF-8 being the de facto standard because it does almost everything very well. (The only real disadvantage is that multiple-length encodings can make string-processing a bit hairy, but UTF-8 is self-synchronizing so it’s not that hard. But you can always convert to UTF-32 if you’re lazy. Memory is cheap.)
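To illustrate the codepoints-versus-encodings distinction in Python:

```python
s = "héllo"

# One string of five codepoints, but different byte lengths per encoding:
assert len(s) == 5
assert len(s.encode("utf-8")) == 6       # "é" takes two bytes in UTF-8
assert len(s.encode("utf-32-be")) == 20  # a fixed four bytes per codepoint

# Self-synchronization: UTF-8 continuation bytes always match 10xxxxxx,
# so a decoder can find the next character boundary after a lost byte.
assert all(0x80 <= b <= 0xBF for b in "é".encode("utf-8")[1:])
```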
Unicode also doesn’t say anything about formatting (italics and such), except where an italic character is semantically distinct from the regular version. For example, the Mathematical Alphanumeric Symbols block contains italic versions of all the Latin letters, but those are math symbols; you wouldn’t use them to write English.
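A quick Python demonstration of those math-italic letters, using the standard unicodedata module:

```python
import unicodedata

italic_a = "\U0001D44E"  # from the Mathematical Alphanumeric Symbols block
assert unicodedata.name(italic_a) == "MATHEMATICAL ITALIC SMALL A"

# It's a distinct codepoint from plain "a", not a styled rendering of it:
assert italic_a != "a"

# NFKC compatibility normalization folds it back to the plain letter:
assert unicodedata.normalize("NFKC", italic_a) == "a"
```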
Is there an “industry standard” set of codes for start/stop italics, or bold, or strikethrough? I know the HTML tags, but is there a recognized binary/hex set of codes? For instance, what do Kindle/Nook use to start/stop italics?
(Last question, I promise! And, yes, I have Googled, and not found answers. Googling was where I found the “U+0067” notation in the first place.)
There’s no commonly-used binary standard for formatting text. HTML and PDF are de facto standards because they’re so common, and both support Unicode. I suppose the MS Word format is also a de facto standard today, since it can be read by a lot of software.
Kindle uses the Amazon AZW format, which is proprietary; the Nook can use EPUB, which is an open standard, and also supports some other formats. Both are essentially HTML-based under the hood, so italics ultimately come down to HTML-style markup rather than special character codes.