I see this term thrown around all the time. It’s an extra special extended bonus character that I see in character sets. I know how to make it. But I don’t know what it’s actually for. So what’s a Zero Width Joiner? And what’s a Zero Width Non-Joiner, for that matter?
A Zero-Width Joiner (ZWJ) is not a Lego.
It is a special character, to be placed between two other characters, which indicates that they are to be joined together in situations where they might otherwise be separated.
A text-display program (such as a publishing program) that encounters a ZWJ and properly interprets it will not display anything. Rather, it will change its regular text-handling rules to connect the two surrounding characters together.
A Zero-Width Non-Joiner (ZWNJ) is a similar formatting character that has the opposite effect: it keeps apart two characters that would otherwise be joined.
These may not seem particularly useful in Latin-based scripts, but in scripts where characters are regularly connected, such as Arabic, I gather they’re quite useful.
The Unicode site http://www.unicode.org/ has a mind-boggling amount of information on stuff like this. The code number for the ZWJ is hex 200D, and for the ZWNJ is hex 200C. A PDF chart of ZWJ, ZWNJ, and their neighbours is at http://www.unicode.org/charts/PDF/U2000.pdf.
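If you want to poke at these two characters yourself, here is a small sketch in Python (my choice of language, nothing from the Unicode site) that looks them up by the code points quoted above using the standard unicodedata module. The helper name describe is made up for illustration.

```python
import unicodedata

ZWJ = "\u200d"   # Zero Width Joiner, hex 200D
ZWNJ = "\u200c"  # Zero Width Non-Joiner, hex 200C


def describe(ch):
    """Return (code point label, official Unicode name, general category)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in (ZWJ, ZWNJ):
    print(describe(ch))

# The joiner is a real code point in the string even though it draws nothing.
joined = "a" + ZWJ + "b"
print(len(joined))
```

Both characters report the general category Cf (format), which matches the idea that they carry instructions for the renderer rather than anything visible.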
Okay thank you very much. That’s what I suspected it was, sort of. Let me ask a specific question, because I’m having trouble finding information on that Unicode site. Suppose I’m writing an HTML document, and I have a really long, hyphenated word, like omnium-gatherum, that has a tendency to get wordwrapped and makes the previous line look really short. Can I put, instead, omnium-gatherum, so that the “gatherum” alone can be wrapped?
That’s a good question, Achernar.
Looking at the descriptions on the Unicode site, my impression is that the ZWJ character isn’t really intended to control hyphenation and breaking as you describe.
Rather, the ZWJ seems to be intended to override situations where the type renderer would actually draw the two characters as separate symbols or glyphs; it would instead force the renderer to represent them as one glyph, such as a ligature.
I’m not sure how even the best Unicode-capable browsers would deal with these sorts of typographical things.
It sounds like what you need is a non-breaking hyphen in the middle of your word to prevent it from breaking there, and then a character to signal an optional breaking point. I’m not sure whether these exist. I know there’s a non-breaking space, which has the descriptive code &nbsp;.
I’m sure an actual expert will be along at some point to clarify this…
…but there are indeed Unicode characters for nonbreaking and optional hyphens, as well as nonbreaking spaces, as Sunspace suggested.
nonbreaking hyphen - Unicode 0x2011 (8209 decimal)
optional hyphen - Unicode 0xAD (173 decimal)
nonbreaking space - Unicode 0xA0 (160 decimal)
(The easiest way to find these characters, and others like them, is to go to the Unicode Character Name Index, and search for the character name or part of it.)
As for browsers: in my testing, IE 5.5, at least, does indeed recognize these three characters and interpret them appropriately. I doubt there are HTML names for these characters, but you can always refer to them using the decimal value - &#8209;, for example.
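To make the decimal-reference trick concrete, here is a rough Python sketch that turns the three code points listed above into HTML decimal references and checks that they decode back to the intended characters. The function name decimal_ref and the constants are hypothetical, picked just for this example; html.unescape is from the standard library.

```python
import html

NB_HYPHEN = 0x2011    # nonbreaking hyphen (8209 decimal)
SOFT_HYPHEN = 0xAD    # optional (soft) hyphen (173 decimal)
NB_SPACE = 0xA0       # nonbreaking space (160 decimal)


def decimal_ref(code_point):
    """Build an HTML decimal character reference such as &#8209;."""
    return f"&#{code_point};"


for cp in (NB_HYPHEN, SOFT_HYPHEN, NB_SPACE):
    print(hex(cp), cp, decimal_ref(cp))

# Markup for the example word with a nonbreaking hyphen in the middle.
word = "omnium" + decimal_ref(NB_HYPHEN) + "gatherum"
print(word)

# Decoding the reference yields the actual U+2011 character.
decoded = html.unescape(word)
```

This only shows how the references are spelled; whether a given browser honors the character when laying out lines is a separate question.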
Thanks for the clarification! I know that in HTML, the optional hyphen, or soft hyphen, is &shy;, and the non-breaking space is &nbsp;. I don’t know about a non-breaking hyphen, though. I see that Netscape treats a regular hyphen as non-breaking, but IE and Opera do not. I think in this case that it’s Netscape that follows the specs correctly, and that a regular hyphen should be treated as non-breaking. (HTML 4.01 specs, section 9.3.3: “The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.”) I see, though, that of those three, only IE interprets the soft hyphen. Still searching for the elusive optional line-break…
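One way to check which of these characters have an HTML name at all is to ask Python's html.entities table, which, as far as I know, mirrors the HTML 4.01 named entities. This is only a sketch: entity_for is a made-up helper, and the table says nothing about what any particular browser actually does with the character.

```python
from html.entities import codepoint2name


def entity_for(code_point):
    """Return the named entity if one is listed, else a decimal reference."""
    name = codepoint2name.get(code_point)
    return f"&{name};" if name else f"&#{code_point};"


candidates = {
    "soft hyphen": 0xAD,
    "nonbreaking space": 0xA0,
    "nonbreaking hyphen": 0x2011,
    "zero width joiner": 0x200D,
    "zero width non-joiner": 0x200C,
}

for label, cp in candidates.items():
    print(f"{label}: {entity_for(cp)}")
```

In that table the soft hyphen, the nonbreaking space, and both zero-width characters have names, while the nonbreaking hyphen falls back to its decimal reference.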