Looking at a table of Unicode codes, I noticed, for instance, that the letter ‘g’ is defined as “U+0067.” What is the “U”? I’m guessing it’s an “escape-like” character – I’m old enough to remember ASCII 27 as the prefix for escape sequences. Is it a hex string that lets the system in question know, “Aha, a Unicode code follows”? If so, what is that hex string?
The U is U+0055, with the official name “LATIN CAPITAL LETTER U”. That means its value in hexadecimal notation is 0055, with the “U+” telling you that the hex number which follows is to be interpreted as a Unicode codepoint. There’s no offset involved.
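To make the notation concrete, here’s a quick Python sketch (Python’s ord() and chr() work directly on codepoints):

```python
# "U+0055" is just notation for the number 0x55; the "U+" never appears in the data.
assert ord("U") == 0x0055      # a character's codepoint, as a number
assert chr(0x0055) == "U"      # and back again

# Printing a codepoint in the conventional U+XXXX form:
print(f"U+{ord('U'):04X}")     # prints "U+0055"
```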
Furthermore, you may be able to enter these codes directly from your keyboard into a file, editor, or text box, depending on your input method. For example, on many Linux systems you can type Ctrl+Shift+U, then the hex digits, then a space or Enter, and that character is entered into whatever text you are typing.
This works on my Linux system. I don’t know how it plays on any Winders system.
…[the UTF-8 byte order mark] is absolutely worthless, because one of the big advantages of UTF-8 is that it has no endianness. It turns Unicode codepoints into sequences of individual bytes, and everyone agrees on how to read a sequence of bytes, so byte order never enters into it. Since there is no byte order to specify, a byte order mark is pointless.
That’s not the only reason to prefer UTF-8: the abilities to easily recognize malformed and damaged files, to recognize files which aren’t UTF-8, and to resynchronize after a lost byte while losing at most one character are all very important in certain applications, and UTF-16, the other major Unicode encoding, offers none of them. UTF-16 also generally needs a BOM (or some out-of-band label) just to tell you which byte order you’re looking at.
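A short Python sketch of the endianness point, comparing the byte sequences the two encodings produce:

```python
s = "é"  # U+00E9

# UTF-8 yields one unambiguous byte sequence on every machine:
assert s.encode("utf-8") == b"\xc3\xa9"

# UTF-16 comes in two byte orders, which is why it needs a BOM:
assert s.encode("utf-16-le") == b"\xe9\x00"
assert s.encode("utf-16-be") == b"\x00\xe9"

# Plain "utf-16" prepends a BOM (U+FEFF) so a reader can tell which is which:
assert s.encode("utf-16")[:2] in (b"\xff\xfe", b"\xfe\xff")
```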
As for “U+” that’s just the notation for writing the numeric value of Unicode codepoints. It has nothing to do with how unicode data is encoded. Look up the various UTF standards for that. (UTF-8 as mentioned above is the most common.)
When it comes to Unicode, this is the understatement of the century. It’s a big topic and even very experienced programmers can get tripped up easily. Just get me started and I’ll talk your ear off about combining diacriticals, canonical equivalence, and UTF-16 surrogate pairs.
This is probably the safest way, but one of the nice features of UTF-8 is that the byte sequences have enough internal structure to them that anything which validates as UTF-8 is almost certainly actually a UTF-8 file. The odds of it successfully parsing without being UTF-8 are minuscule.
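That internal structure is easy to check in practice; here’s a rough validity test in Python (looks_like_utf8 is a hypothetical helper, just to illustrate the point):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Multi-byte UTF-8 sequences must follow a strict bit pattern,
    so random or legacy-encoded bytes almost never decode cleanly."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf8("naïve".encode("utf-8"))
# Latin-1 encodes "ï" as a lone 0xEF byte, which is an invalid UTF-8 sequence:
assert not looks_like_utf8("naïve".encode("latin-1"))
```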
Ah, surrogate pairs. Another thing UTF-8 doesn’t have or need.
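For anyone curious, this is what a surrogate pair looks like in practice (Python again):

```python
ch = "\U0001F600"  # 😀, codepoint U+1F600, outside the Basic Multilingual Plane

# UTF-16 must split it into a surrogate pair: two 16-bit code units.
assert ch.encode("utf-16-be") == b"\xd8\x3d\xde\x00"  # high D83D, low DE00

# UTF-8 just uses a four-byte sequence; no surrogates involved.
assert ch.encode("utf-8") == b"\xf0\x9f\x98\x80"
```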
Anyway, Unicode is inherently more complex than any encoding system which came before it because it is the first family of standards which are, together, complex enough to attempt to do a good job of representing text in more than one or a small number of human languages at a time. ASCII couldn’t even handle all of English and you’re naïve if you think it could; other character encoding schemes did certain things better, most others worse, and weren’t any more universal than ASCII was. Unicode unifies a lot of experience and both practical and theoretical work to get a single body of standards that finally come to grips with text in a coherent fashion.
ASCII could handle English just fine because there are no required letters other than A–Z. Sure, you can use diacritics and other such markings, but they are not strictly necessary. You might not be able to handle every symbol that books tended to use at the time, but that doesn’t mean you couldn’t handle all of English.
My only problem with Unicode is that it seems like a lie. A system that can represent everything–except that most fonts aren’t going to have everything in them. And there are still multiple ways to encode everything. So now you just have the same problems you had before.
It surprises me that such a messed up system works as well as it does.
But only after English had been hammered down to fit on, first, typewriters and cheap printing presses and then pre-Unicode computers. You’re saying that a reduced version of the written language, which was influenced by technical limitations, could fit on technically limited systems. Well, no duh.
So? Most fonts have the characters which are most-used, and text rendering software will mix multiple fonts in the same document to get the glyphs it needs. (Yes, it might not look the best, but it’s readable.)
Obviously not. This is precisely what Unicode solves, so it isn’t a problem with Unicode.
It isn’t nearly as messed-up as you seem to think.
Heh! Ask the database administrator… Mr. Cheré gets outraged when we list him as Mr. Chere. But if you try to search for his name, you have to spell it properly in the search field. Or try to alphabetize a list that includes Mr. Étranger. He won’t fall between Dangerfield and Fitzhugh! Or, how about people with an invisible hard “space” in their names? Robert Boise Herbert wants to be listed as “Boise Herbert,” and not as “Herbert.” And then there are the “Juniors” and “The Thirds.” Oh, yeah, and the “AKA” gang. Mr. Richards, AKA Mr. Fantastic. Oy vey…
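The accent problems, at least, are exactly what Unicode normalization is for. Here’s a rough Python sketch of accent-folding for search and sorting; real-world collation should use a locale-aware library like ICU, so treat this as an illustration only:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD splits "é" into "e" plus a combining acute accent;
    # dropping the combining marks leaves just the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Now "Cheré" and "Chere" can match in a search field:
assert strip_accents("Cheré") == "Chere"

# And Mr. Étranger alphabetizes between Dangerfield and Fitzhugh:
names = ["Fitzhugh", "Étranger", "Dangerfield"]
assert sorted(names, key=lambda n: strip_accents(n).lower()) == \
       ["Dangerfield", "Étranger", "Fitzhugh"]
```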
What I was wanting was a straightforward hex depiction of a Unicode string, where a prefix would alert whatever interpreter might see it, “Hey, Unicode follows.” I was wondering what the Unicode code might be for “Italics Begin Here” and “Italics End Here.”
Ah, how I admired “Reveal Codes” in old Word Perfect!
Unnecessary complexity seems to be a thing with our species! Just look at what they did with Dungeons and Dragons!
ETA: by the way, thank you all for your cites and links! Very educational!
Just to be clear, there’s no hex depiction of a “Unicode string” because Unicode is just a list of numbers (called codepoints) and what those numbers mean, and a set of semantics for how they work together. Encoding is a separate issue; there are multiple ways to encode Unicode text, with UTF-8 being the de facto standard because it does almost everything very well. (The only real disadvantage is that multiple-length encodings can make string-processing a bit hairy, but UTF-8 is self-synchronizing so it’s not that hard. But you can always convert to UTF-32 if you’re lazy. Memory is cheap.)
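To illustrate the codepoints-versus-encodings distinction in Python:

```python
s = "héllo"

# One string of five codepoints, but different byte lengths per encoding:
assert len(s) == 5
assert len(s.encode("utf-8")) == 6       # "é" takes two bytes in UTF-8
assert len(s.encode("utf-32-be")) == 20  # a fixed four bytes per codepoint

# Self-synchronization: UTF-8 continuation bytes always match 10xxxxxx,
# so a decoder can find the next character boundary after a lost byte.
assert all(0x80 <= b <= 0xBF for b in "é".encode("utf-8")[1:])
```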
Unicode also doesn’t say anything about formatting (italics and such), except where an italic character is semantically distinct from the regular version. For example, the Mathematical Alphanumeric Symbols block contains italic versions of all the Latin letters, but those are math symbols; you wouldn’t use them to write English.
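A quick Python demonstration of those math-italic letters, using the standard unicodedata module:

```python
import unicodedata

italic_a = "\U0001D44E"  # from the Mathematical Alphanumeric Symbols block
assert unicodedata.name(italic_a) == "MATHEMATICAL ITALIC SMALL A"

# It's a distinct codepoint from plain "a", not a styled rendering of it:
assert italic_a != "a"

# NFKC compatibility normalization folds it back to the plain letter:
assert unicodedata.normalize("NFKC", italic_a) == "a"
```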
Is there an “industry standard” set of codes for start/stop italics, or bold, or strikethrough? I know the HTML tags, but is there a recognized binary/hex set of codes? For instance, what do Kindle/Nook use to start/stop italics?
(Last question, I promise! And, yes, I have Googled, and not found answers. Googling was where I found the “U+0067” notation in the first place.)
There’s no commonly-used binary standard for formatting text. HTML and PDF are de facto standards because they’re so common, and both support Unicode. I suppose the MS Word format is also a de facto standard today, since it can be read by a lot of software.
Kindle uses the Amazon AZW format, which is proprietary; the Nook can use EPUB, which is an open standard, and also supports some other formats. Both are essentially HTML-based under the hood, so italics ultimately come down to HTML-style markup rather than special character codes.