Alphabets, and what constitutes a letter

Napier · March 29, 2009, 12:06am

Musing over letters and other symbols for their use in computers raises some questions:

Different languages have somewhat overlapping alphabets. In Spanish there’s the “n” with a tilde, but quite a lot of overlap with English. In Russian there are many letters that don’t appear in English. In Greek an even larger proportion don’t. Clearly a Greek lambda is a separate letter from any English letter including “l”. Is the “a” in Spanish the same letter as the “a” in English? Is an “n” with a tilde a separate letter from an “n” without?

In German there is the “u” with the double dot over it. They tell me it is perfectly correct to write that as “ue”, for example when the double dot is not printable. So is the double dot “u” a full-fledged letter or not?

How about the German double “s” that looks somewhat like an English uppercase “B”? Is that a letter?

What about ligatures in English, like “oe” only printed as one larger connected symbol - is that a letter, or two? What about “fl” printed together? What about the older form of “s” that looks like an “f”?

What’s the argument for and against calling the apostrophe in English a letter? I have never heard it called a letter but there are real words that can’t be spelled without it. Seems like the hard sign or the soft sign in Russian, that don’t have any sound of their own but only modify the letters around them.

I guess what I’m getting at is what rules might one use to decide how to provide letters for a multilingual but alphabetic crowd. Is there a name for this topic? Where does one even start?

There is a related topic about mixing up fonts and alphabets, so the letter “l” looks wider in Courier than it does in Times New Roman (a font issue) but has an entirely different name and shape in Symbol (because that is changing the alphabet and isn’t really a font at all). But I’m not asking about that unfortunate and cheesy stopgap measure. I’m interested in letters per se.

friedo · March 29, 2009, 12:28am

One name for this topic is typography. It’s worth noting that most of what you write was not really an issue until modern type was developed: when everything was handwritten, it didn’t make a difference whether a ligature was counted as one letter or two, as long as everyone could read it.

One thing to be aware of is that glyphs that look similar to ours in languages like Greek and Russian shouldn’t be confused with the glyphs that we know and love. The Romance and Germanic languages use variations of the Latin alphabet. English is among the simplest since we don’t use any diacriticals (those doohickeys above letters.) Though they’re related to the Latin alphabet, the Greek alphabet and Cyrillic alphabet are quite distinct from our own.

Whether n is considered a different letter from ñ or just a variation is a matter of local custom and no particular scientific classification.

However, there is tremendous complexity in the Unicode standard, which seeks to unify all commonly used (and some rather uncommon) writing systems in the world into a single system of glyphs.

Here are some problems one runs into. “A” and “a” are clearly the same letter, but they are different glyphs because they have different meanings. In Arabic and some other scripts, some letters are rendered differently if they appear at the end or in the middle of a word. Are these the same glyph or different? Unicode allows for backwards compatibility with older type systems by retaining separate codepoints for some letter forms and accented letters, but it also lets accented letters be written as combining sequences (a base character followed by a distinct codepoint that modifies it).
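A minimal sketch of the first two points, assuming Python 3 and its standard unicodedata module. The Arabic letter BEH and its four legacy “presentation form” codepoints are just one convenient sample, picked for illustration.

```python
import unicodedata

# "A" and "a" are one letter to a reader but two separate codepoints.
upper, lower = "A", "a"
print(hex(ord(upper)), unicodedata.name(upper))
print(hex(ord(lower)), unicodedata.name(lower))

# Arabic BEH: one abstract letter, plus compatibility codepoints for its
# positional shapes (isolated, final, initial, medial).
beh = "\u0628"
forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]
for form in forms:
    folded = unicodedata.normalize("NFKC", form)
    print(unicodedata.name(form), "->", unicodedata.name(folded))

# Compatibility normalization folds every positional shape back to the letter.
forms_fold_to_beh = all(unicodedata.normalize("NFKC", f) == beh for f in forms)
print(forms_fold_to_beh)
```

In ordinary Arabic text only the plain letter is stored and the renderer picks the shape; the positional codepoints exist mainly so that older encodings can be converted without loss.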

So to answer your question, there really aren’t any answers to your question.

friedo · March 29, 2009, 12:34am

For example, Norwegian considers “O” and “Ø” to be separate letters. They also use the ligature Æ as a glyph entirely separate from A and E. In English, we may sometimes render the grapheme ae as a ligature for aesthetic (heh) purposes, but we still consider them separate letters.

OTOH, in French, the alphabet is exactly the same as ours, and “e”, “é”, “è”, “ê”, and “ë” are all considered the same letter but modified with different marks.

hibernicus · March 29, 2009, 12:47am

What do you mean by “provide letters for”? Are you talking about enabling people to type in their own language? In that case you would have to provide for all the symbols (e.g. upper and lower-case letters and punctuation marks) required to write that language. The question of what constitutes a letter is irrelevant.

The main context in which it is important to know what is considered a separate letter is in alphabetising, for example when using a dictionary or telephone book. For these purposes, it’s useful to know that in Lithuanian, “i”, “į” and “y” share dictionary real estate. In German “ä” is intermingled with “a”, while in Swedish you’ll find it right at the back of the book. However, such decisions are largely conventional, as shown by the fact that the Spanish letters “ch” and “ll” used to be alphabetised separately, but have now been integrated with “c” and “l” respectively.
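To make the German versus Swedish difference concrete, here is a toy sketch in Python (3.9 or later). The two sort keys are simplified assumptions standing in for real collation tables, and the word list is invented for the demonstration; production software would use proper locale data instead.

```python
# Toy collation rules, not complete locale tables.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

# German dictionary-style folding: umlauted vowels sort with the base vowel.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def swedish_key(word: str) -> list[int]:
    """Rank each letter by its position in the Swedish alphabet (å ä ö last)."""
    return [SWEDISH_RANK[ch] for ch in word.lower()]


def german_key(word: str) -> tuple[str, str]:
    """Sort on the folded spelling; the original spelling breaks ties."""
    lowered = word.lower()
    return (lowered.translate(GERMAN_FOLD), lowered)


words = ["zebra", "äpple", "apa", "ost"]
swedish_order = sorted(words, key=swedish_key)
german_order = sorted(words, key=german_key)
print(swedish_order)  # ['apa', 'ost', 'zebra', 'äpple']
print(german_order)   # ['apa', 'äpple', 'ost', 'zebra']
```

The same four strings come out in two different orders, which is the whole point: the order is a property of the convention being applied, not of the characters themselves.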

Balthisar · March 29, 2009, 2:25am

As is/was rr, and I think (don’t have a modern dictionary) that maybe ñ falls in there, too.

Which makes me wonder… when words like jalapeño (sometimes written jalapeno) are used in English (not Spanish), does this introduce a new letter into the English alphabet, or is it simply an “accented” US character? In Spanish, the letter W is part of the alphabet, but as far as I know, it’s only used for imported loan words.

suranyi · March 29, 2009, 2:46am

As has been said, “what constitutes a letter” depends on the context. In computer processing, the sequence of characters “gy” is two letters in Hungarian just as it is in English. But for looking up words in a Hungarian dictionary, it is considered one letter. That is, there is a section headed “GY” right after the section “G”.

Ed

sailor · March 29, 2009, 2:49am

I grew up protesting that considering Ch and ll as individual letters was utterly stupid and made no sense. Finally the RAE came to their senses and did the right thing.

Jeff_Lichtman · March 29, 2009, 5:19am

This isn’t always true. For example, Java has support for localized string comparisons, in which character sequences such as “gy” in Hungarian are treated as single characters.
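This is not how Java does it internally; it is only a hand-rolled Python sketch (3.9 or later) of the idea of treating a digraph as one collating unit. The letter list is a simplified, unaccented slice of the Hungarian alphabet, and the names `split_letters` and `hungarian_key` are made up for the illustration.

```python
# Simplified slice of the Hungarian alphabet in dictionary order:
# unaccented letters plus a few digraphs. Not a complete collation table.
LETTERS = ["a", "b", "c", "cs", "d", "e", "f", "g", "gy", "h", "i", "j", "k",
           "l", "ly", "m", "n", "ny", "o", "p", "r", "s", "sz", "t", "ty",
           "u", "v", "z", "zs"]
RANK = {letter: i for i, letter in enumerate(LETTERS)}


def split_letters(word: str) -> list[str]:
    """Greedily split a word into collating letters, preferring digraphs."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            out.append(pair)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def hungarian_key(word: str) -> list[int]:
    """Map a word to the alphabet positions of its collating letters."""
    return [RANK[letter] for letter in split_letters(word.lower())]


words = ["cukor", "csak", "cica"]
plain_order = sorted(words)                       # raw codepoint order
digraph_order = sorted(words, key=hungarian_key)  # "cs" follows all of "c"
print(plain_order)    # ['cica', 'csak', 'cukor']
print(digraph_order)  # ['cica', 'cukor', 'csak']
```

Greedy matching is a shortcut: it can mis-split where two single letters merely happen to sit next to each other, for instance across the parts of a compound word, which is one reason real collation libraries are more involved.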

Cat_Jones · March 29, 2009, 10:03am

The number of key-strokes isn’t really indicative, on a French keyboard I have individual keys for é è ç à ù but need to use two for ê or ô. So this is another vote for following dictionary usage - Welsh dictionaries consider the double consonants “dd”, “ff” and “ll” as separate letters; “ch” and “rh” also have their own entries. (Again these all need two key-strokes.)

Almost so it doesn’t feel left out “th” is also separate altho’ all the words in this section are foreign in origin. My 1953 dictionary only has 4 “th” words thema, thermomedr, thus (frankincense) and thuser (incense burner). Fast forward to my 2002 learner’s dictionary and this has expanded to therapiwteg, thermostat, thesis and thrombosis ! A sad indictment of the times - not to mention that thuser has disappeared.

Monty · March 29, 2009, 11:49am

The OP may wish to peruse this site.

LSLGuy · March 29, 2009, 1:33pm

If the question is really: “How do computers deal with all this stuff?”, the answer (at least nowadays & for the near future) is “Unicode”. Wiki (Unicode - Wikipedia) & the Unicode site (http://www.unicode.org/) have some good intros into the complexities involved.

If the question is really “What is the formal definition of the terms alphabet and letter?”, then I don’t think there is a universally accepted, bright line, physics-quality definition. Like any other terminology (or really anything-ology) working with human cultural artifacts, the situation in any given culture is about 90% rules & 10% arbitrary exceptions.

When carried out to the full scale of planet-wide human culture it becomes more like 90% exceptions and 10% rules.

All categories naturally have fuzzy boundaries and for any two apparently disjoint categories there is probably at least one example which overlaps them both. For any category you can find some members which seem less category-like than some other non-members. & vice versa.

Polycarp · March 29, 2009, 2:23pm

Treat ligatures, a typesetting convention, as distinct from the concept of a letter. If I say “the coast of Niue has a lot of cliffland”, I surely am not using a word with -ffl- as a letter in the middle of it, despite what a typefounder might think.
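Unicode happens to carry a codepoint for the ffl ligature (U+FB04), kept for compatibility with older encodings. A short sketch, assuming Python 3 and the standard unicodedata module, shows that compatibility normalization treats it as three ordinary letters:

```python
import unicodedata

typeset = "cli\uFB04and"  # the ligature stored as one codepoint, U+FB04
spelled = unicodedata.normalize("NFKC", typeset)

print(unicodedata.name("\uFB04"))  # LATIN SMALL LIGATURE FFL
print(len(typeset), len(spelled))  # 7 9
print(spelled == "cliffland")      # True
```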

A useful technical term is grapheme – meaning, roughly, “the unit you put on paper that symbolizes a phoneme, including affricates.” The symbol é is a different grapheme from e without accent mark, irrespective of whether they’re alphabetized as distinct letters.

I’m reasonably sure that Spanish still regards ñ as a distinct letter coming between n and o in the alphabet. (Humor from our college Spanish teacher illustrating this point: Hace catorce años means “He/she’s 14 years old” but Hace catorce anos means “He/she does 14 anuses.”)

sailor · March 29, 2009, 2:29pm

It means 14 years ago.

And yes, ñ is a letter distinct from n.

Napier · March 29, 2009, 2:42pm

Good question. I hadn’t thought about the distinction between letters and punctuation. I did guess that sorting order would vary so that the relative order of two letters could change between languages but hadn’t thought much about how to deal with that.

It seems like there is an important concept in having a definite set of symbols that are used in ordered groups to write words, with a definite set of words used in ordered groups to write arbitrarily complex thoughts and statements and questions and so forth. I find this somehow immensely appealing to ponder. Likewise I find it appealing to ponder how computers should work, so of course there is some common ground between these topics.

Unicode is interesting and I have read a couple of short references about it, but for some reason find several aspects of Unicode frustrating. All the weird little quirks of ASCII remain (the first 32 unprintable characters, the multipurpose symbols like the “-” which could be a long or short dash or a minus sign). And, if you have a perfectly clear image of a symbol or glyph or whatever I should be calling it, there may be many different Unicode characters it could be (because Unicode provides separate sets of these for each language and the sets overlap). It is also messy in the sense that there isn’t necessarily one to one correspondence between 16 bit codes and symbols.
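Both complaints can be seen in a few lines of Python 3 with the standard unicodedata module. The sample characters are arbitrary picks for illustration. Strictly speaking, the overlapping sets are organized per script (Latin, Greek, Cyrillic) rather than per language.

```python
import unicodedata

# Three capitals that usually look identical but belong to three scripts.
lookalikes = ["\u0041", "\u0391", "\u0410"]
for ch in lookalikes:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# A character outside the 16-bit range takes two 16-bit units in UTF-16.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2
print(len(clef), utf16_units)  # 1 2
```

So one codepoint does not always mean one 16-bit code, and one visible shape does not always mean one codepoint.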

Well, this is just interest and curiosity, but it is very much that nonetheless. Thanks!

Napier · March 29, 2009, 2:46pm

In the olden days the spelling of words was not so standardized, but as writing and printing grew more common, it became much more so. Perhaps one effect computers will have on communication will be that the question of what constitutes a letter will become more formalized, both in theory and in practice.

LSLGuy · March 29, 2009, 3:21pm

I tend to agree. But it will be a slow process.

Ref your earlier comment about Unicode preserving some ASCII oddities:

One of the challenges of computing is backwards compatibility. We tend to value that and try to accommodate it where practical. In fact, in many cases we sacrifice potentially hundreds of years of future simplicity and clarity on the altar of expediency with respect to preserving the last 2 years’ worth of v1.0 goofs.

When printing presses were invented, they didn’t really have that problem.

So I suggest that backwards compatibility concerns will, if anything, slow the regularization of these concepts you mention. As an example, look how thoroughly XHTML has failed to push out crap non-compliant HTML.

Ultimately, Unicode and other computer standards and systems are normative for the developers, but should (must?) not be normative for the end-user.

An end-user who knows or cares about the difference can use the hyphen, the math-subtract-symbol, the em-dash, the en-dash, and the other ten kinds of non-US dash-like characters I’ve never heard of, all correctly in a single document.
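For reference, a small Python 3 sketch listing a few of those dash-like characters by codepoint and official name. The selection of five is just a sample, written with escape codes so the distinctions survive copy and paste.

```python
import unicodedata

# The keyboard character plus four typographically distinct relatives.
dash_like = ["\u002D", "\u2010", "\u2212", "\u2013", "\u2014"]
names = [unicodedata.name(ch) for ch in dash_like]
categories = [unicodedata.category(ch) for ch in dash_like]

for ch, name, cat in zip(dash_like, names, categories):
    print(f"U+{ord(ch):04X} {name} ({cat})")
```

The key next to the zero produces U+002D, whose official name, HYPHEN-MINUS, openly admits that it is doing double duty. The true minus sign is classed as a math symbol (Sm), while the others are dash punctuation (Pd).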

Or the ignorant goof can hit the keyboard button next to the zero for each of them.

Any computer system has to be able to accommodate both kinds of users.

I think you’re old enough to remember when typewriters ruled the office. Traditional typewriters didn’t have a key for the numeral one; people just used the lowercase letter ell even though they understood (had they thought about it), that one and ell were distinct concepts.

As keypunches and teletypes were invented, the engineers needed to bring that conceptual difference into the machine’s realm, and so a one key was added to the keyboard. Many early operators had a hard time learning to use the one key when entering numbers; they’d used the ell key for years.

If we develop a deep need to use the ten kinds of dash correctly, we’ll expect to see keyboards grow a set of keys for them. Followed by cultural awareness that using an em-dash when you meant an en-dash is a sign of ignorance. (FTR, except for the width of the glyph I have no idea what that grammatical difference might actually be.)

Sadly, I think we’re driving the other way.

Aside: The APL programming language was famous for having unique glyphs to represent unique concepts within the language. One neat idea Iverson had was that the math operator subtraction, math operator negation, and the notation for negative numbers were three distinct ideas needing three distinct symbols.

Apparently that was a bridge too far because by the time it was actually reduced to real code running on real computers math negation and math subtraction had been folded together using the same symbol & keystroke, which looked like a typical ASCII hyphen/dash/minus.

Meanwhile, negative numbers were kept as a separate idea and written with an elegant raised dash-like character which sat about 10% down from the top of the lead numeral’s bounding rectangle.

At the end of the day, it’s all pretty arbitrary.

Napier · March 29, 2009, 5:31pm

LSL, yes, I do remember typewriters. One of my favorites was an IBM Selectric I bought for $140. Before that I had a nonelectric with an extra wide carriage. When Olivetti offered a typewriter that had an electronic buffer maybe a dozen characters long, so you could correct errors without erasing if you noticed them quickly enough, my first wife and I had to have one. And just a few years ago at work somebody requested a copy of an old memo of mine, which turned out to be stored only on 8" Wang word processor disks.

When I learned to type I started having keystrokes running through my head the way an annoying song might, and could not help but notice particular patterns of key rows. For example, “ducking indochina in dickenson” requires going through the rows upward one step at a time for all the letters. Enough of this tedious nonsense can make even the most insipid Top 40 tunes a welcome substitute. The point is, these things still run through my head, and exclamation points are a period, a backspace, and an apostrophy. I still don’t have the “1” character completely integrated into my reflexes.

I also did the whole APL thing. At the computing center there were only two card punch machines in the entire room that had the APL characters, and invariably when you went to do some APL programming, there would be 15 unused machines but non-APL programmers would happen to be using the APL machines.

A great pleasure of learning Mathematica was that they went to the trouble of making their notation for calculus operations much more logical and consistent than the paper notations are, and (poor mathematician that I am) I gained a great deal of clarity just by grokking that improvement.

So, there is a blend of the arbitrary and the potentially helpful, and the goofy as well. Sometimes when I am falling asleep, my thoughts drift from “the US finally adopts SI units” to a much more optimistic “I get to assemble linguists and logicians and various others to reinvent the symbology we all live by”.

Napier · March 29, 2009, 5:33pm

Hmmm. That is pretty funny. Somehow the spelling of “apostrophe” migrated towards that of “atrophy”. I’m sure there is no significance to it, though.

not_alice · March 29, 2009, 6:01pm

I spent about 10 years as a minor associate of Unicode and one of the earliest proponents of software internationalization.

The bit about “lower ASCII” is because it is meant to preserve backwards compatibility. A lot of effort has gone into preserving a semblance of compatibility for a lot of earlier standards, not just ASCII.

But there is more to this question in the OP than could fill 1000 books. Read the archives of the Unicode mailing list (available at unicode.org), all the technical reports there, etc. and you will start to get an idea of the scope of the matter.

Also keep in mind there are no hard and fast answers, and not all writing systems are “alphabets” by any stretch of the imagination, nor are sorting orders defined in most languages, and for most of the rest, they are more complex and conditional than in English.

One of the big goals of Unicode was to unify the existing national character encoding standards that were causing interoperability of software and sharing of data to fail profoundly.

In simple terms, is the letter we use in English and know as “the letter a” the same as a letter which looks the same in Dutch? French? Japanese?

Is “A” the same as “a”? What if it is a different font?

Slightly more complex - Is the Japanese kanji for “sun” the same as the Chinese hanzi for “sun”? Is it the same character with a slightly different way of writing it? How do you reconcile that with your answer about different fonts for “the letter A”?

What about languages where you can write a character more than one way - you might have “e with umlaut” and “e”+“backspace”+ “umlaut” (a “composed character”). Are those the same?
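In Unicode terms the two spellings are a precomposed codepoint and a base letter followed by a combining mark. A minimal Python 3 sketch with the standard unicodedata module shows that they differ as raw strings but become equal after normalization (the variable names are only for illustration):

```python
import unicodedata

precomposed = "\u00EB"  # LATIN SMALL LETTER E WITH DIAERESIS
combining = "e\u0308"   # "e" followed by COMBINING DIAERESIS

# Raw comparison sees different codepoint sequences.
print(precomposed == combining)          # False
print(len(precomposed), len(combining))  # 1 2

# Canonical normalization maps both spellings to a single agreed form.
nfc_equal = unicodedata.normalize("NFC", combining) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == combining
print(nfc_equal, nfd_equal)              # True True
```

So the answer depends on whether the software compares the stored codes or the normalized text.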

What about spaces? Some languages don’t use spaces. In some languages, spacing changes based on the content around a particular character.

What about direction? Left to right, top to bottom, right to left, and so on all happen, and in real-world documents they get mixed. How do you specify all that?
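Part of Unicode’s answer for horizontal text is that every character carries a directional class, which the Unicode Bidirectional Algorithm uses to lay out mixed lines. The sketch below, assuming Python 3, only looks up those classes; the standard library does not run the layout algorithm itself, and the sample characters are arbitrary.

```python
import unicodedata

samples = {
    "Latin a": "a",
    "Hebrew alef": "\u05D0",
    "Arabic beh": "\u0628",
    "digit 1": "1",
}

# L = left-to-right, R = right-to-left, AL = Arabic letter, EN = European number
bidi_classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}
for label, bidi_class in bidi_classes.items():
    print(f"{label}: {bidi_class}")
```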

This doesn’t even begin to scratch the surface of the issues involved, and the OP is filled with many excellent observations indeed.

Most of this is built into all modern operating systems and applications now, essentially for free, because of the open standardization work that was done, and implemented. I like to say, only semi-jokingly, that that effort, by mostly anonymous people around the globe, is at least two thirds responsible for the “World Wide Web”.

sailor · March 29, 2009, 6:50pm

Yes, it is more complicated than it seems at first sight because it is not so simple to define what is a character.

Unicode - Wikipedia

Unicode, in intent, encodes the underlying characters — graphemes and grapheme-like units — rather than the variant glyphs (renderings) for such characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see Han unification).

In text processing, Unicode takes the role of providing a unique code point — a number, not a glyph — for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font or style) to other software, such as a web browser or word processor. This simple aim becomes complicated, however, by concessions made by Unicode’s designers in the hope of encouraging a more rapid adoption of Unicode.

The first 256 code points were made identical to the content of ISO 8859-1 so as to make it trivial to convert existing western text. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore allow conversion from those encodings to Unicode (and back) without losing any information. For example, the “fullwidth forms” section of code points encompasses a full Latin alphabet that is separate from the main Latin alphabet section. In Chinese, Japanese and Korean (CJK) fonts, these characters are rendered at the same width as CJK ideographs rather than at half the width. For other examples, see Duplicate characters in Unicode.

Read the entire article for a discussion of many problems and controversies.
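Two of the claims in the quoted passage are easy to check with a short Python 3 sketch using only the standard library. The variable names are invented for the illustration.

```python
import unicodedata

# ISO 8859-1 (Latin-1): each byte value decodes to the codepoint with the same number.
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")
same_numbers = all(ord(ch) == b for ch, b in zip(decoded, latin1_bytes))
print(same_numbers)  # True

# Fullwidth forms: a second Latin alphabet kept for CJK legacy encodings.
fullwidth_a = "\uFF21"
width_class = unicodedata.east_asian_width(fullwidth_a)
folded = unicodedata.normalize("NFKC", fullwidth_a)
print(unicodedata.name(fullwidth_a), width_class, folded)
```

Compatibility normalization (NFKC) folds the fullwidth letter back to the ordinary “A”, which is how software can treat the duplicates as the same letter while still round-tripping old data.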
