The "U" in Unicode

Looking at a table of Unicode codes, I noticed, for instance, that the letter ‘g’ is defined as “U+0067.” What is the “U”? I’m guessing it’s an “escape-like” character – I’m old enough to remember 027 as the prefix for “escape sequences.” Is it a hex string that lets the system in question know, “Aha, a Unicode code follows”? What is that hex string?

(Don’t need the answer fast.) :wink:

(Also, no, not a homework question!)

The U is U+0055, with the name “Latin Capital Letter U”. That means that in hexadecimal notation it is 0055, with the “U+” telling you that the hex number is to be interpreted as a Unicode character. There’s no offset involved.
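If you want to see that for yourself, here’s a quick sketch in Python (just one convenient way to poke at code points; ord() and chr() convert between a character and its code point number):

```python
import unicodedata

# 'U' really is code point U+0055: 0x55 in hex, 85 in decimal.
print(hex(ord("U")))          # 0x55
print(chr(0x0055))            # U
print(unicodedata.name("U"))  # LATIN CAPITAL LETTER U
```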

Is there a prefix? Some hex string that says, “What follows is Unicode?” Or is it up to the programmer to make sure that a given code is known to be Unicode?

(I’m still thinking in terms of magic “escape” sequences… Well, it worked in the past!)

Furthermore, you might be able to enter these codes into a file or editor or text box, directly from your keyboard, depending on your keyboard driver. For example, you can type Control-U (I think) followed by those digits, followed by a blank space and it will enter that character into whatever text you are typing.

This works this way in my Linux system. I don’t know how it plays on any Winders system.

Entering Unicode characters directly via the keyboard:

I was close. It’s Ctrl-Shift-U to start.

This Wikipedia article describes numerous ways to enter Unicode characters in various operating systems, HTML files, various text editors, GUI interfaces, etc.

ETA: To whom it may interest: Here is an excellent beginner’s tutorial on Unicode, what it is, how it works, how it is coded, etc.:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

It varies. For example, in HTML you can indicate a code point with &#n; or &#xn; (decimal and hexadecimal, respectively).
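Just to make that concrete, here’s a small sketch using Python’s standard html module (one handy way to watch the references get resolved; the example string is mine):

```python
import html

# &#85; (decimal) and &#x55; (hexadecimal) are both numeric character
# references for code point U+0055.
print(html.unescape("&#85; and &#x55;"))  # U and U
```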

With plain text, you can use a byte order mark (BOM) to specify a Unicode encoding and endianness. The UTF-8 BOM (0xef 0xbb 0xbf) is the most common, though its use isn’t encouraged by the standard.
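If it helps to see the actual bytes, here’s a Python sketch (the “utf-8-sig” codec is just Python’s name for “UTF-8 with a BOM”):

```python
# The BOM is the code point U+FEFF encoded at the very start of the text.
bom = "\ufeff"
print(bom.encode("utf-8"))      # b'\xef\xbb\xbf'  (the UTF-8 BOM)
print(bom.encode("utf-16-be"))  # b'\xfe\xff'      (big-endian UTF-16)
print(bom.encode("utf-16-le"))  # b'\xff\xfe'      (little-endian UTF-16)

# Python's "utf-8-sig" codec writes (and strips) the UTF-8 BOM for you.
print("hi".encode("utf-8-sig"))  # b'\xef\xbb\xbfhi'
```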

But, no, there isn’t a single standardized way to specify a Unicode code point. The “U+” convention is more for people than machines.

…is absolutely worthless, because one of the big advantages of UTF-8 is that it lacks endianness. It turns Unicode codepoints into sequences of individual bytes, and a single byte is read the same way on every machine, so there is never a byte order to specify, which is the only thing a byte order mark exists to do.

That’s not the only reason to prefer UTF-8: the abilities to easily recognize malformed and damaged files, to recognize files which aren’t UTF-8 at all, and to resynchronize after a lost byte while losing at most one character per byte lost are all very important in certain applications, and UTF-16, the other major Unicode encoding, offers none of them. UTF-16 also needs a BOM (or some out-of-band label) to tell you its byte order.
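The endianness point is easy to see in a couple of lines of Python (the sample character is arbitrary):

```python
s = "é"  # U+00E9

# UTF-8 has exactly one byte sequence for a given code point -- no byte order.
print(s.encode("utf-8"))      # b'\xc3\xa9'

# UTF-16 has two, so you need a BOM (or prior agreement) to know which one.
print(s.encode("utf-16-be"))  # b'\x00\xe9'
print(s.encode("utf-16-le"))  # b'\xe9\x00'
```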

There is no UTF-8 BOM. The BOM is for UTF-16 or UTF-32 and is used to determine endianness.

If you have UTF-8 text, it is your job to know that via some out-of-band means, for example HTTP headers (which are always in ASCII). Or some OS metadata attached to the file.

But people do (incorrectly) put BOMs in UTF-8 anyway.

As for “U+”, that’s just the notation for writing the numeric value of Unicode codepoints. It has nothing to do with how Unicode data is encoded. Look up the various UTF standards for that. (UTF-8, as mentioned above, is the most common.)
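A quick Python sketch of that distinction between the number and the bytes (the sample code point is arbitrary):

```python
c = "\u00e9"        # é, written in "U+" notation as U+00E9

# The code point itself is just a number...
print(hex(ord(c)))            # 0xe9

# ...and the UTF encodings are different ways of turning that number into bytes.
print(c.encode("utf-8"))      # b'\xc3\xa9'
print(c.encode("utf-16-le"))  # b'\xe9\x00'
print(c.encode("utf-32-le"))  # b'\xe9\x00\x00\x00'
```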

Cool, and thanks! I see this is a lot deeper than I’d realized.

(I’m old enough to remember when “RUN” was the entirety of a working job control language…)

When it comes to Unicode, this is the understatement of the century. It’s a big topic and even very experienced programmers can get tripped up easily. Just get me started and I’ll talk your ear off about combining diacriticals, canonical equivalence, and UTF-16 surrogate pairs.
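Since I brought them up, here’s a small Python sketch of what canonical equivalence and surrogate pairs look like in practice (the examples are mine):

```python
import unicodedata

# Two canonically equivalent spellings of "é":
precomposed = "\u00e9"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == combining)                                # False
print(unicodedata.normalize("NFC", combining) == precomposed)  # True

# A character outside the Basic Multilingual Plane needs a surrogate
# pair in UTF-16:
print("\U0001f600".encode("utf-16-be"))  # b'\xd8=\xde\x00' (U+D83D, U+DE00)
```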

This is probably the safest way, but one of the nice features of UTF-8 is that the byte sequences have enough internal structure to them that anything which validates as UTF-8 is almost certainly actually a UTF-8 file. The odds of it successfully parsing without being UTF-8 are minuscule.
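That check is basically “try to decode it and see”; a sketch in Python, with a helper name of my own invention:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("naïve".encode("utf-8")))  # True
print(looks_like_utf8(b"\xe9t\xe9"))             # False ("été" in Latin-1)
```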

Ah, surrogate pairs. Another thing UTF-8 doesn’t have or need. :smiley:

Anyway, Unicode is inherently more complex than any encoding system that came before it, because it is the first family of standards complex enough, taken together, to do a good job of representing text in more than a handful of human languages at a time. ASCII couldn’t even handle all of English, and you’re naïve if you think it could; other character encoding schemes did certain things better and most things worse, and weren’t any more universal than ASCII was. Unicode unifies a lot of experience, practical and theoretical, into a single body of standards that finally comes to grips with text in a coherent fashion.

ASCII could handle English just fine, because there are no required letters other than A-Z. Sure, you can use diacritics and other such markings, but they are not strictly necessary. You might not be able to handle every symbol that books tended to use at the time, but that doesn’t mean you couldn’t handle all of English.

My only problem with Unicode is that it seems like a lie. A system that can represent everything–except that most fonts aren’t going to have everything in them. And there are still multiple ways to encode everything. So now you just have the same problems you had before.

It surprises me that such a messed up system works as well as it does.

But only after English had been hammered down to fit on, first, typewriters and cheap printing presses and then pre-Unicode computers. You’re saying that a reduced version of the written language, which was influenced by technical limitations, could fit on technically limited systems. Well, no duh.

So? Most fonts have the characters which are most-used, and text rendering software will mix multiple fonts in the same document to get the glyphs it needs. (Yes, it might not look the best, but it’s readable.)

Obviously not. This is precisely what Unicode solves, so it isn’t a problem with Unicode.

It isn’t nearly as messed-up as you seem to think.

Heh! Ask the database administrator… Mr. Cheré gets outraged when we list him as Mr. Chere. But if you try to search for his name, you have to spell it properly in the search field. Or try to alphabetize a list that includes Mr. Étranger. He won’t fall between Dangerfield and Fitzhugh! Or, how about people with an invisible hard “space” in their names? Robert Boise Herbert wants to be listed as “Boise Herbert,” and not as “Herbert.” And then there are the “Juniors” and “The Thirds.” Oh, yeah, and the “AKA” gang. Mr. Richards, AKA Mr. Fantastic. Oy vey…
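For what it’s worth, the accent half of that headache is exactly what Unicode normalization is for. Here’s a sketch in Python of an accent-insensitive search/sort key (the helper name and the sample names are mine):

```python
import unicodedata

def search_key(name: str) -> str:
    """Decompose, drop the combining accent marks, then casefold."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()

print(search_key("Cheré") == search_key("Chere"))  # True

names = ["Fitzhugh", "Étranger", "Dangerfield"]
print(sorted(names, key=search_key))
# ['Dangerfield', 'Étranger', 'Fitzhugh']
```

(Real collation is hairier than this; the Unicode Collation Algorithm exists for a reason. But that’s the basic idea.)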

What I was wanting was a straightforward hex depiction of a Unicode string, where a prefix would alert whatever interpreter might see it, “Hey, Unicode follows.” I was wondering what the Unicode code might be for “Italics Begin Here” and “Italics End Here.”

Ah, how I admired “Reveal Codes” in old Word Perfect!

Unnecessary complexity seems to be a thing with our species! Just look at what they did with Dungeons and Dragons!

ETA: by the way, thank you all for your cites and links! Very educational!

Just to be clear, there’s no hex depiction of a “Unicode string” because Unicode is just a list of numbers (called codepoints), what those numbers mean, and a set of semantics for how they work together. Encoding is a separate issue; there are multiple ways to encode Unicode text, with UTF-8 being the de facto standard because it does almost everything very well. (The only real disadvantage is that the variable-length encoding can make string processing a bit hairy, but UTF-8 is self-synchronizing so it’s not that hard. But you can always convert to UTF-32 if you’re lazy. Memory is cheap.)
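To illustrate that trade-off, a quick Python sketch (the sample string is mine):

```python
s = "naïve 😀"

utf8 = s.encode("utf-8")
utf32 = s.encode("utf-32-le")

print(len(s), len(utf8), len(utf32))  # 7 code points, 11 bytes, 28 bytes

# In UTF-32 every code point is exactly 4 bytes, so code point n starts at
# byte 4*n.  In UTF-8 the lengths vary (1 to 4 bytes here), so you walk the
# string -- but the lead/continuation byte structure lets you resynchronize
# from any position.
```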

Unicode also doesn’t say anything about formatting (italics and such), unless it can be argued that an italic character is semantically distinguishable from the regular version. For example, the Mathematical Alphanumeric Symbols block contains italic versions of the Latin letters, but you wouldn’t use those to write English.
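Here’s what one of those looks like under the hood, sketched in Python:

```python
import unicodedata

# The "italic a" from Mathematical Alphanumeric Symbols is its own code point:
italic_a = "\U0001d44e"
print(hex(ord(italic_a)))          # 0x1d44e
print(unicodedata.name(italic_a))  # MATHEMATICAL ITALIC SMALL A

# It is a different character from plain U+0061:
print(italic_a == "a")             # False
```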

Is there an “industry standard” set of codes for start/stop italics, or bold, or strikethrough? I know the HTML tags, but is there a recognized binary/hex set of codes? For instance, what do Kindle/Nook use to start/stop italics?

(Last question, I promise! And, yes, I have Googled, and not found answers. Googling was where I found the “U+0067” notation in the first place.)

(Mein Gott, how I love the information age!)

There’s no commonly-used standard for formatting text. HTML and PDF are de facto standards because they’re so common, and they support Unicode. I suppose the MS Word format is also a de facto standard today, since it can be read by a lot of things.

Kindle uses the Amazon AZW format, which is proprietary; Nook can use EPUB, which is an open standard, and also supports some other formats.

Thank’ee! (I do have the Kindle .mobi converter, which seems to work pretty well. I just have to remember to convert underlining to Italics, as the Kindle doesn’t display underlining very well.)

Cheers! Off to Google-land!

Technically, it’s a UTF-8 representation of the BOM (U+FEFF). It’s permitted by the standard as a way to identify UTF-8 text but, I agree, it’s an aBOMination.
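That permitted-but-discouraged UTF-8 BOM is also why a stray U+FEFF sometimes turns up at the start of text you’ve read from a file. A quick Python illustration:

```python
data = b"\xef\xbb\xbfhello"  # UTF-8 bytes with a leading BOM

print(repr(data.decode("utf-8")))      # '\ufeffhello'  (BOM survives as U+FEFF)
print(repr(data.decode("utf-8-sig")))  # 'hello'        (codec strips the BOM)
```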