Alphabets, and what constitutes a letter

Musing over letters and other symbols for their use in computers raises some questions:

Different languages have somewhat overlapping alphabets. In Spanish there’s the “n” with a tilde, but quite a lot of overlap with English. In Russian there are many letters that don’t appear in English. In Greek an even larger proportion don’t. Clearly a Greek lambda is a separate letter from any English letter including “l”. Is the “a” in Spanish the same letter as the “a” in English? Is an “n” with a tilde a separate letter from an “n” without?

In German there is the “u” with the double dot over it. They tell me it is perfectly correct to write that as “ue”, for example when the double dot is not printable. So is the double dot “u” a full-fledged letter or not?

How about the German double “s” that looks somewhat like an English uppercase “B”? Is that a letter?

What about ligatures in English, like “oe” only printed as one larger connected symbol - is that a letter, or two? What about “fl” printed together? What about the older form of “s” that looks like an “f”?

What’s the argument for and against calling the apostrophe in English a letter? I have never heard it called a letter but there are real words that can’t be spelled without it. Seems like the hard sign or the soft sign in Russian, that don’t have any sound of their own but only modify the letters around them.

I guess what I’m getting at is what rules might one use to decide how to provide letters for a multilingual but alphabetic crowd. Is there a name for this topic? Where does one even start?

There is a related topic about mixing up fonts and alphabets, so the letter “l” looks wider in Courier than it does in Times New Roman (a font issue) but has an entirely different name and shape in Symbol (because that is changing the alphabet and isn’t really a font at all). But I’m not asking about that unfortunate and cheesy stopgap measure. I’m interested in letters per se.

One name for this topic is typography. It’s worth noting that most of what you write was not really an issue until modern type was developed: when everything was handwritten, it didn’t make a difference whether a ligature was counted as one letter or two, as long as everyone could read it.

One thing to be aware of is that glyphs that look similar to ours in languages like Greek and Russian shouldn’t be confused with the glyphs that we know and love. The Romance and Germanic languages use variations of the Latin alphabet. English is among the simplest since we don’t use any diacriticals (those doohickeys above letters.) Though they’re related to the Latin alphabet, the Greek alphabet and Cyrillic alphabet are quite distinct from our own.

Whether n is considered a different letter from ñ or just a variation is a matter of local custom and no particular scientific classification.

However, there is tremendous complexity in the Unicode standard, which seeks to unify all commonly used (and some rather uncommon) writing systems in the world into a single system of glyphs.

Here are some problems one runs into. “A” and “a” are clearly the same letter, but they are different glyphs because they have different meanings. In Arabic and some other scripts, some letters are rendered differently if they appear on the end or middle of a word. Are these the same glyph or different? Unicode allows for backwards compatibility with older type systems by retaining some codepoints for each letter form, but also has separate codepoints that treat them as combining diacriticals (a base glyph plus a distinct codepoint that modifies it.)
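To make the Arabic case concrete, here is a minimal Python sketch using only the standard unicodedata module. The choice of the letter beh is mine, purely for illustration. As far as I understand it, Unicode keeps a separate compatibility codepoint for the word-initial shape of that letter, and compatibility normalization (NFKC) folds it back into the ordinary letter.

```python
import unicodedata

beh = "\u0628"          # ARABIC LETTER BEH, the ordinary letter
beh_initial = "\ufe91"  # compatibility codepoint for its word-initial shape

# The two are distinct codepoints...
distinct = beh != beh_initial

# ...but compatibility normalization maps the positional form back to the letter.
folded = unicodedata.normalize("NFKC", beh_initial)

print(unicodedata.name(beh_initial))
print(distinct, folded == beh)
```

In ordinary text one would normally store just the plain letter and leave the choice of shape to the rendering software.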

So to answer your question, there really aren’t any answers to your question.

For example, Norwegian considers “O” and “Ø” to be separate letters. They also use the ligature Æ as a glyph entirely separate from A and E. In English, we may sometimes render the grapheme ae as a ligature for aesthetic (heh) purposes, but we still consider them separate letters.

OTOH, in French, the alphabet is exactly the same as ours, and “e”, “é”, “è”, “ê”, and “ë” are all considered the same letter but modified with different marks.
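Unicode's decomposition data happen to mirror this distinction, though that is a property of the character database and not a linguistic ruling. A small sketch, assuming Python's standard unicodedata module (the helper name strip_marks is made up for this example): the accented French vowels decompose into "e" plus a combining mark, while Ø has no decomposition at all.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop combining marks (general category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# e-acute, e-grave, e-circumflex, e-diaeresis
french = [strip_marks(ch) for ch in "\u00e9\u00e8\u00ea\u00eb"]

# Capital O with stroke: it has no decomposition, so it comes back unchanged.
norwegian = strip_marks("\u00d8")

print(french, norwegian)
```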

What do you mean by “provide letters for”? Are you talking about enabling people to type in their own language? In that case you would have to provide for all the symbols (e.g. upper and lower-case letters and punctuation marks) required to write that language. The question of what constitutes a letter is irrelevant.

The main context in which it is important to know what is considered a separate letter is in alphabetising, for example when using a dictionary or telephone book. For these purposes, it’s useful to know that in Lithuanian, “i”, “į” and “y” share dictionary real estate. In German “ä” is intermingled with “a”, while in Swedish you’ll find it right at the back of the book. However, such decisions are largely conventional, as shown by the fact that the Spanish letters “ch” and “ll” used to be alphabetised separately, but have now been integrated with “c” and “l” respectively.
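The Spanish change can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: make_sort_key is a hypothetical helper, accented vowels are ignored, and the greedy matching of digraphs is a shortcut, not how a real collation library works.

```python
def make_sort_key(alphabet):
    """Build a sort key from an ordered list of letters, which may include digraphs."""
    rank = {letter: i for i, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)

    def key(word):
        word = word.lower()
        ranks, i = [], 0
        while i < len(word):
            # Greedy match: try the longest candidate letter first.
            for size in range(longest, 0, -1):
                chunk = word[i:i + size]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    i += len(chunk)
                    break
            else:
                # Unknown symbol: sort it after every known letter.
                ranks.append(len(alphabet))
                i += 1
        return ranks

    return key

# Modern order: n-tilde is its own letter, "ch" and "ll" are not.
MODERN = list("abcdefghijklmn") + ["\u00f1"] + list("opqrstuvwxyz")

# Older order: "ch" follows "c" and "ll" follows "l" as letters in their own right.
TRADITIONAL = []
for letter in MODERN:
    TRADITIONAL.append(letter)
    if letter == "c":
        TRADITIONAL.append("ch")
    elif letter == "l":
        TRADITIONAL.append("ll")

words = ["chico", "cosa", "llama", "luz", "dama"]
modern_order = sorted(words, key=make_sort_key(MODERN))
traditional_order = sorted(words, key=make_sort_key(TRADITIONAL))

print(modern_order)
print(traditional_order)
```

The same words come out in a different order depending only on which list of "letters" is supplied, which is the sense in which these decisions are conventional.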

As is/was rr, and I think (don’t have a modern dictionary) that maybe ñ falls in there, too.

Which makes me wonder… take words like jalapeño, sometimes written jalapeno in English (not Spanish). Does this introduce a new letter into the English alphabet, or is it simply an “accented” US character? In Spanish, the letter W is part of the alphabet, but as far as I know, it’s only used for imported loan words.

As has been said, “what constitutes a letter” depends on the context. In computer processing, the sequence of characters “gy” is two letters in Hungarian just as it is in English. But for looking up words in a Hungarian dictionary, it is considered one letter. That is, there is a section headed “GY” right after the section “G”.

Ed

I grew up protesting that considering ch and ll as individual letters was utterly stupid and made no sense. Finally the RAE came to their senses and did the right thing.

This isn’t always true. For example, Java has support for localized string comparisons, in which character sequences such as “gy” in Hungarian are treated as single characters.

The number of key-strokes isn’t really indicative: on a French keyboard I have individual keys for é è ç à ù but need to use two for ê or ô. So this is another vote for following dictionary usage - Welsh dictionaries consider the double consonants “dd”, “ff” and “ll” as separate letters; “ch” and “rh” also have their own entries. (Again these all need two key-strokes.)

Almost so it doesn’t feel left out, “th” is also separate, altho’ all the words in this section are foreign in origin. My 1953 dictionary only has 4 “th” words: thema, thermomedr, thus (frankincense) and thuser (incense burner). Fast forward to my 2002 learner’s dictionary and this has expanded to therapiwteg, thermostat, thesis and thrombosis! A sad indictment of the times - not to mention that thuser has disappeared.

The OP may wish to peruse this site.

If the question is really: “How do computers deal with all this stuff?”, the answer (at least nowadays & for the near future) is “Unicode”. Wiki (Unicode - Wikipedia) & the Unicode site (http://www.unicode.org/) have some good intros into the complexities involved.

If the question is really “What is the formal definition of the terms alphabet and letter?”, then I don’t think there is a universally accepted, bright line, physics-quality definition. Like any other terminology (or really anything-ology) working with human cultural artifacts, the situation in any given culture is about 90% rules & 10% arbitrary exceptions.

When carried out to the full scale of planet-wide human culture it becomes more like 90% exceptions and 10% rules.

All categories naturally have fuzzy boundaries and for any two apparently disjoint categories there is probably at least one example which overlaps them both. For any category you can find some members which seem less category-like than some other non-members. & vice versa.

Ligatures are a typesetting convention, distinct from the concept of a letter. If I say “the coast of Niue has a lot of cliffland”, I surely am not using a word with -ffl- as a letter in the middle of it, despite what a typefounder might think.
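Unicode seems to take the same view. It does carry a single codepoint for the ffl ligature (and one for the long s), but only for compatibility, and compatibility normalization dissolves them into ordinary letters. A short sketch with Python's standard unicodedata module; the sample words are just for illustration.

```python
import unicodedata

ligature_word = "cli\ufb04and"  # "cli" + U+FB04 (one codepoint for ffl) + "and"
long_s_word = "\u017fong"       # "song" written with U+017F, the long s

# NFKC replaces compatibility characters with their plain equivalents.
plain_word = unicodedata.normalize("NFKC", ligature_word)
plain_song = unicodedata.normalize("NFKC", long_s_word)

print(len(ligature_word), len(plain_word), plain_word)
print(plain_song)
```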

A useful technical term is grapheme – meaning, roughly, “the unit you put on paper that symbolizes a phoneme, including affricates.” The symbol é is a different grapheme from e without accent mark, irrespective of whether they’re alphabetized as distinct letters.

I’m reasonably sure that Spanish still regards ñ as a distinct letter coming between n and o in the alphabet. (Humor from our college Spanish teacher illustrating this point: Hace catorce años means “He/she’s 14 years old” but Hace catorce anos means “He/she does 14 anuses.”)

It means 14 years ago.

And yes, ñ is a letter distinct from n.

Good question. I hadn’t thought about the distinction between letters and punctuation. I did guess that sorting order would vary so that the relative order of two letters could change between languages but hadn’t thought much about how to deal with that.

It seems like there is an important concept in having a definite set of symbols that are used in ordered groups to write words, with a definite set of words used in ordered groups to write arbitrarily complex thoughts and statements and questions and so forth. I find this somehow immensely appealing to ponder. Likewise I find it appealing to ponder how computers should work, so of course there is some common ground between these topics.

Unicode is interesting and I have read a couple of short references about it, but for some reason I find several aspects of Unicode frustrating. All the weird little quirks of ASCII remain (the first 32 unprintable characters, the multipurpose symbols like the “-” which could be a long or short dash or a minus sign). And, if you have a perfectly clear image of a symbol or glyph or whatever I should be calling it, there may be many different Unicode characters it could be (because Unicode provides separate sets of these for each language and the sets overlap). It is also messy in the sense that there isn’t necessarily a one-to-one correspondence between 16-bit codes and symbols.
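Both complaints are easy to reproduce. The sketch below, assuming nothing beyond Python's standard library, takes three capitals that usually look identical on screen and shows that Unicode names them as three different characters; it then counts the 16-bit code units needed for one symbol outside the 16-bit range (the G clef is an arbitrary pick).

```python
import unicodedata

# Latin, Greek and Cyrillic capitals that typically share one shape.
lookalikes = ["A", "\u0391", "\u0410"]
names = [unicodedata.name(ch) for ch in lookalikes]

# MUSICAL SYMBOL G CLEF lies beyond U+FFFF, so UTF-16 needs a surrogate pair.
clef = "\U0001d11e"
utf16_units = len(clef.encode("utf-16-le")) // 2

for name in names:
    print(name)
print(utf16_units)
```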

Well, this is just interest and curiosity, but it is very much that nonetheless. Thanks!

In the olden days the spelling of words was not so standardized, but as writing and printing grew more common, it became much more so. Perhaps one effect computers will have on communication will be that the question of what constitutes a letter will become more formalized, both in theory and in practice.

I tend to agree. But it will be a slow process.

Ref your earlier comment about Unicode preserving some ASCII oddities –

One of the challenges of computing is backwards compatibility. We tend to value that and try to accommodate it where practical. In fact, in many cases we sacrifice potentially hundreds of years of future simplicity and clarity on the altar of expediency with respect to preserving the last 2 years’ worth of v1.0 goofs.

When printing presses were invented, they didn’t really have that problem.

So I suggest that backwards compatibility concerns will, if anything, slow the regularization of these concepts you mention. As an example, look how thoroughly XHTML has failed to push out crap non-compliant HTML.
Ultimately, Unicode and other computer standards and systems are normative for the developers, but should (must?) not be normative for the end-user.

An end-user who knows or cares about the difference can use the hyphen, the math-subtract-symbol, the em-dash, the en-dash, and the other ten kinds of non-US dash-like characters I’ve never heard of, all correctly in a single document.
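For the curious, here is a small sketch listing a handful of these characters with the names and general categories Unicode assigns them, using Python's standard unicodedata module. The selection of five is mine and is nowhere near the full set.

```python
import unicodedata

# The keyboard key, the "true" hyphen, en dash, em dash, and the minus sign.
dashes = ["\u002d", "\u2010", "\u2013", "\u2014", "\u2212"]

table = {
    f"U+{ord(ch):04X}": (unicodedata.name(ch), unicodedata.category(ch))
    for ch in dashes
}

for codepoint, (name, category) in table.items():
    print(codepoint, name, category)
```

The keyboard character is named HYPHEN-MINUS, which rather admits the ambiguity, and only the minus sign is classed as a math symbol (Sm); the others are dash punctuation (Pd).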

Or the ignorant goof can hit the keyboard button next to the zero for each of them.

Any computer system has to be able to accommodate both kinds of users.
I think you’re old enough to remember when typewriters ruled the office. Traditional typewriters didn’t have a key for the numeral one; people just used the lowercase letter ell even though they understood (had they thought about it), that one and ell were distinct concepts.

As keypunches and teletypes were invented, the engineers needed to bring that conceptual difference into the machine’s realm, and so a one key was added to the keyboard. Many early operators had a hard time learning to use the one key when entering numbers; they’d used the ell key for years.

If we develop a deep need to use the ten kinds of dash correctly, we’ll expect to see keyboards grow a set of keys for them. Followed by cultural awareness that using an em-dash when you meant an en-dash is a sign of ignorance. (FTR, except for the width of the glyph I have no idea what that grammatical difference might actually be.)

Sadly, I think we’re driving the other way.
Aside: The APL programming language was famous for having unique glyphs to represent unique concepts within the language. One neat idea Iverson had was that the math operator subtraction, math operator negation, and the notation for negative numbers were three distinct ideas needing three distinct symbols.

Apparently that was a bridge too far because by the time it was actually reduced to real code running on real computers math negation and math subtraction had been folded together using the same symbol & keystroke, which looked like a typical ASCII hyphen/dash/minus.

Meanwhile, negative numbers were kept as a separate idea and written with an elegant raised dash-like character which sat about 10% down from the top of the lead numeral’s bounding rectangle.

At the end of the day, it’s all pretty arbitrary.

LSL, yes, I do remember typewriters. One of my favorites was an IBM Selectric I bought for $140. Before that I had a nonelectric with an extra wide carriage. When Olivetti offered a typewriter that had an electronic buffer maybe a dozen characters long, so you could correct errors without erasing if you noticed them quickly enough, my first wife and I had to have one. And just a few years ago at work somebody requested a copy of an old memo of mine, which turned out to be stored only on 8" Wang word processor disks.

When I learned to type I started having keystrokes running through my head the way an annoying song might, and could not help but notice particular patterns of key rows. For example, “ducking indochina in dickenson” requires going through the rows upward one step at a time for all the letters. Enough of this tedious nonsense can make even the most insipid Top 40 tunes a welcome substitute. The point is, these things still run through my head, and exclamation points are a period, a backspace, and an apostrophy. I still don’t have the “1” character completely integrated into my reflexes.

I also did the whole APL thing. At the computing center there were only two card punch machines in the entire room that had the APL characters, and invariably when you went to do some APL programming, there would be 15 unused machines but non-APL programmers would happen to be using the APL machines.

A great pleasure of learning Mathematica was that they went to the trouble of making their notation for calculus operations much more logical and consistent than the paper notations are, and (poor mathematician that I am) I gained a great deal of clarity just by grokking that improvement.

So, there is a blend of the arbitrary and the potentially helpful, and the goofy as well. Sometimes when I am falling asleep, my thoughts drift from “the US finally adopts SI units” to a much more optimistic “I get to assemble linguists and logicians and various others to reinvent the symbology we all live by”.

Hmmm. That is pretty funny. Somehow the spelling of “apostrophe” migrated towards that of “atrophy”. I’m sure there is no significance to it, though.

I spent about 10 years as a minor associate of Unicode and one of the earliest proponents of software internationalization.

The bit about “lower ascii” is because it is meant to preserve backwards compatibility. A lot of effort has gone into preserving a semblance of compatibility for a lot of earlier standards, not just ascii.

But there is more to this question in the OP than could fill 1000 books. Read the archives of the Unicode mailing list (available at unicode.org), all the technical reports there, etc. and you will start to get an idea of the scope of the matter.

Also keep in mind there are no hard and fast answers, and not all writing systems are “alphabets” by any stretch of the imagination, nor are sorting orders defined in most languages, and for most of the rest, they are more complex and conditional than in English.

One of the big goals of Unicode was to Unify the existing national character encoding standards that were causing interoperability of software and sharing of data to fail profoundly.

In simple terms, is the letter we use in English and know as “the letter a” the same as a letter which looks the same in Dutch? French? Japanese?

Is “A” the same as “a”? What if it is a different font?
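Even the “A” versus “a” question gets slippery in software. A minimal sketch using only Python's built-in string methods, with the German sharp s as the awkward case: case conversion is not always one character in, one character out.

```python
eszett = "\u00df"  # German sharp s

# Uppercasing one character can yield two.
upper = eszett.upper()

# casefold() is the method intended for caseless comparison.
a_equal = "A".casefold() == "a".casefold()
street_equal = "Stra\u00dfe".casefold() == "STRASSE".casefold()

print(upper, len(upper))
print(a_equal, street_equal)
```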

Slightly more complex - Is the Japanese kanji for “sun” the same as the Chinese hanzi for “sun”? Is it the same character with a slightly different way of writing it? How do you reconcile that with your answer about different fonts for “the letter A”?

What about languages where you can write a character more than one way - you might have “e with umlaut” and “e”+“backspace”+ “umlaut” (a “composed character”). Are those the same?
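Unicode's answer to this one is normalization: the two spellings are different sequences of codepoints, but they are defined as canonically equivalent, and software is expected to normalize before comparing. A short sketch with Python's standard unicodedata module:

```python
import unicodedata

precomposed = "\u00eb"  # one codepoint: e with diaeresis
combining = "e\u0308"   # two codepoints: "e" + COMBINING DIAERESIS

# A naive comparison sees two different strings.
raw_equal = precomposed == combining

# NFC composes, NFD decomposes; after either, the two spellings agree.
nfc_equal = unicodedata.normalize("NFC", combining) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == combining

print(len(precomposed), len(combining))
print(raw_equal, nfc_equal, nfd_equal)
```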

What about spaces? Some languages don’t have spaces. In some languages, spacing changes based on the content around a particular character.

What about direction? Left to right, top to bottom, right to left, and so on all happen, and in real-world documents they get mixed. How do you specify all that?
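Part of the answer for horizontal text is that every character carries a directional class, and the rendering software works out the layout of mixed text from those classes. The sketch below only reads that per-character property with Python's standard unicodedata module; it does not implement the layout algorithm itself, and the sample characters are arbitrary.

```python
import unicodedata

samples = {
    "latin a": "a",
    "hebrew alef": "\u05d0",
    "arabic beh": "\u0628",
    "digit 1": "1",
}

# L = left-to-right, R = right-to-left, AL = Arabic letter, EN = European number
classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label, bidi_class in classes.items():
    print(label, bidi_class)
```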

This doesn’t even begin to scrape the issues involved, and the OP is filled with many excellent observations indeed.

Most of this is built into all modern operating systems and applications now, essentially for free, because of the open standardization work that was done, and implemented. I like to say, only semi-jokingly, that that effort, by mostly anonymous people around the globe is at least two thirds responsible for the “World Wide Web”.

Yes, it is more complicated than it seems at first sight because it is not so simple to define what is a character.

Read the entire article for a discussion of many problems and controversies.