Unicode and Han Unification

Around two weeks ago, I had the privilege of seeing Ken Lunde give a presentation on Ten Mincho, a new Japanese font from Adobe. One part of the presentation — talking about the font’s ability to maintain differences between base characters and CJK Compatibility Ideographs — seemed a little beyond a few people, including myself.

So, over the past week I dove into research to understand it, and I thought it might be nice to share with everyone what I learned about a little thing called Han Unification and one of Ten Mincho’s best features.

[Image: the character 漢 in seal script. Source: Wikipedia]

First, Unicode

At the center of all this is Unicode. Known in the news as the emoji people, the Unicode Consortium sets the standard for character encoding, aiming to be as universal as possible. The goal is that you should be able to have arbitrary combinations of languages in one document without having to switch how you interpret their encoding to display them.

Unicode is huge, and currently includes 139 scripts. A good chunk of its code points are the Chinese characters common to Chinese, Japanese, Korean, and, in some contexts, Vietnamese. Unicode calls these CJK Ideographs, and while I know they’re not really all ideographs, I’m going to be using that language as well in this post.

Han Unification

These characters, the CJK Ideographs, are common to multiple countries and languages, and there are also thousands upon thousands of them. Using as few code points as possible to encode them all makes sense. After all, the Latin alphabet isn’t encoded twice for English and Icelandic just because the visually similar but unrelated p and þ must be separate.

Unfortunately, it’s not quite as simple as that. Letters like “A” are shared between Latin, Greek, and Cyrillic, but are encoded multiple times because each one is truly not quite the same. In the CJK Ideographs, the line between stylistic variation and a meaningful difference can get similarly blurry. There are clear differences, like simplified and traditional characters used in Chinese, or shinjitai and kyūjitai (literally new and old forms) in Japanese. Then there are the more subtle variations that have nothing to do with simplification, like 內 and 内, which are the same character and mean the same thing, but one is constructed with 入 and the other with 人. There’s 說 and 説, both being the same traditional character, but with the first form being more common. There are characters like 令, which often appears in one form in simplified Chinese text, another in traditional Chinese text, and a third in Japanese text (though the Japanese appearance is not unheard of in Chinese). There are characters with the 言 radical, which in Chinese usually begin with a diagonal stroke on top, while in Japanese that first stroke usually appears as a flat line (also not unheard of in Chinese). Then there are characters with the 示 radical, like 社, which can appear that way or with the radical written out in full as 示. In Chinese this variation is equivalent but feels stylized and stuffy, but in Japanese it is considered the traditional (kyūjitai) form.

To continue to use imperfect metaphors from the world of the Latin alphabet, there are variants that are completely equivalent from a general perspective, and are decided by fonts, like variations of a or g; historical variants that you might want to insert without changing every normal instance of that letter, like ſ (the out-of-use, historical long s, as in “Congreſs” or “the purſuit of happineſs”); or variations for different contexts, like capital letters. Then there are things like spelling reforms. Think color/colour, or skillful/skilful. This last part is relevant because it’s not the strokes or the radicals that are encoded and entered, it’s the whole character. When it comes to English, I can think of at least one country that wouldn’t be too happy to depend on specific fonts to spell things “right.”

So how are they all encoded?

In some cases, variants get separate code points entirely. This is the case for almost all Chinese simplified characters and the majority of Japanese shinjitai characters. This is also the case for 內 (U+5167) and 内 (U+5185), as well as 說 (U+8AAA) and 説 (U+8AAC).  In these circumstances, it becomes easy to select variants directly through your own input. Many fonts support both variations, but some will choose only one to support, and sometimes even make that one appear like the other!
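To see that these pairs really are separate code points, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `describe` is just an illustration, not anything defined by Unicode.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# The two variant pairs mentioned above, written as escapes so that
# nothing in the toolchain can silently swap one form for the other.
pairs = [("\u5167", "\u5185"), ("\u8AAA", "\u8AAC")]

for first, second in pairs:
    print(describe(first), "|", describe(second))
    # Normalization does not merge separately encoded variants.
    assert unicodedata.normalize("NFC", first) != unicodedata.normalize("NFC", second)
```

Because each variant has its own code point, whichever one you type is the one that gets stored, regardless of the font used to display it.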

At the opposite extreme, Unicode makes no distinction between variants, as with 言. In these cases, it is entirely up to fonts to decide the appearance of these characters. Sometimes this means choosing a font that displays the character the way you need, and in some contexts this means specifying which appearance you need by tagging text by language (like in HTML). In this post I have used language tags to control how these characters display, but it is up to your browser to select appropriate fonts (so if you saw no difference between the 言’s and the 令’s, that’s why).
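As a rough illustration of language tagging, the sketch below generates HTML spans that carry the same code point with different `lang` attributes. The function name `tag_lang` and the particular set of language tags are assumptions chosen for the example; whether the results actually look different still depends on the fonts your browser picks.

```python
from html import escape

def tag_lang(text: str, lang: str) -> str:
    """Wrap text in a span carrying a BCP 47 language tag."""
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'

# One unified code point (U+8A00), tagged for four languages.
# The stored character is identical each time; only the tag differs.
samples = [tag_lang("\u8A00", lang) for lang in ("zh-Hans", "zh-Hant", "ja", "ko")]
print("\n".join(samples))
```

The information about which appearance you wanted lives in the markup here, not in the text itself, so it is lost as soon as the character is copied out as plain text.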

In between these two extremes is the situation that is most relevant to the Ten Mincho presentation.

Unicode is not the first encoding that’s ever been invented for Chinese characters, and older encodings are still used in some places in East Asia. To ensure that Unicode is compatible with these systems, separate code points exist for the same character or for variants, called “CJK Compatibility Ideographs” by Unicode. This ensures that going from one system to the other and then back again doesn’t lead to the loss of characters. What this also means, however, is that you might eventually paste such a character (a CJK Compatibility Ideograph) and end up with the base unified character, potentially losing the appearance you may have specifically chosen. This is the case for 社 (U+FA4C), which is a CJK Compatibility Ideograph that might get turned into 社 (U+793E), which by itself won’t look the same as U+FA4C unless it’s being displayed by a Korean font. It is interesting to note that 令 (U+4EE4) also has a Compatibility Ideograph defined in the Japanese style (U+F9A8, 令), but since Japanese fonts display U+4EE4 with that appearance anyway, it is unlikely to change appearance even if it gets changed into the base character.
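One common way this substitution happens is Unicode normalization, which many programs apply to text behind the scenes. The sketch below, using Python’s standard `unicodedata` module, shows the two compatibility ideographs mentioned above collapsing into their base characters; the variable names are only for illustration.

```python
import unicodedata

# Compatibility ideograph -> the unified ideograph it is expected to become.
examples = {
    "\uFA4C": "\u793E",
    "\uF9A8": "\u4EE4",
}

for compat, expected_base in examples.items():
    normalized = unicodedata.normalize("NFC", compat)
    print(f"U+{ord(compat):04X} -> U+{ord(normalized):04X}")
    # After normalization only the base character is left.
    assert normalized == expected_base
```

Once this has happened, nothing in the text records that the compatibility form was ever there.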

Unicode’s “Standardized variation sequences” provide a solution to this loss of information. By adding a variation selector (from U+FE00 to U+FE0F) directly after the base character, variants can be specified by the author and preserved. So for 社 (U+793E), the variant defined as matching 社 (U+FA4C) is 793E FE00. A complete list of variants is available from Unicode in its StandardizedVariants.txt data file. Their correct appearance just depends on fonts and input methods/software supporting this function.
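Here is a small Python sketch of the idea. The table `COMPAT_TO_SVS` and the helpers `to_svs` and `survives_normalization` are hypothetical names made up for this example, and the table holds only the single pair quoted in this post; a real tool would presumably load the full list from Unicode’s data files.

```python
import unicodedata

# Illustrative one-entry table: compatibility ideograph -> base + selector.
COMPAT_TO_SVS = {
    "\uFA4C": "\u793E\uFE00",
}

def to_svs(text: str) -> str:
    """Replace known compatibility ideographs with their variation sequence."""
    return "".join(COMPAT_TO_SVS.get(ch, ch) for ch in text)

def survives_normalization(text: str) -> bool:
    """True if NFC normalization leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text

fragile = "\uFA4C"
robust = to_svs(fragile)

print(survives_normalization(fragile))  # False: collapses to U+793E
print(survives_normalization(robust))   # True: base and selector both stay
print([f"U+{ord(ch):04X}" for ch in robust])
```

The variation sequence is two code points long, so software that counts or slices characters needs to treat the base character and its selector as one unit.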

This process is exactly how Ten Mincho deals with the problem of the variant forms of CJK Compatibility Ideographs being lost by reverting to their base characters. It provides glyphs for the base character’s SVSs with the same appearance as the compatibility character someone might otherwise be tempted to use. This makes Ten Mincho a very useful font for Japanese indeed, because multiple versions of one character can be used in one document without needing to switch fonts. It also means the distinction between versions isn’t tied to the document but to the character itself, and will survive being copied and pasted between other documents and back again.

What it all means

The complexity of this all means that any kind of program or function that needs to deal with these characters has to take all these little complications into account. Search functions or machine translation engines need to know which characters are equivalent. Fonts need to decide which code points they need to include in order to be useful, and whether they will support enough variations for one region and language, or many.
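For search, one possible approach is to fold text to a common form before comparing. The sketch below is only that, a sketch: `fold_for_search` is a hypothetical helper that normalizes the text and drops variation selectors, so the compatibility ideograph, the variation sequence, and the plain base character all match each other.

```python
import unicodedata

def is_variation_selector(ch: str) -> bool:
    """True for U+FE00..U+FE0F and the supplement U+E0100..U+E01EF."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def fold_for_search(text: str) -> str:
    """Normalize to NFC, then remove any variation selectors."""
    normalized = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in normalized if not is_variation_selector(ch))

# All three spellings of the same character fold to plain U+793E.
spellings = ["\u793E", "\uFA4C", "\u793E\uFE00"]
assert len({fold_for_search(s) for s in spellings}) == 1

# Separately encoded variants are NOT unified by this folding.
assert fold_for_search("\u5167") != fold_for_search("\u5185")
```

Variants that have their own code points, like 內 and 内, would need an additional equivalence table, which this sketch does not attempt.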

And for the rest of us:

Choosing your font matters, since for a good number of characters, it remains the main way to choose the way they appear. For Chinese, fonts often support traditional and simplified characters alike, but will still be available in “TC” and “SC” varieties where the difference is in the appearance of unified characters like 令 or 言. It also means that you need to have a font that supports all the characters in your document. If your font is missing a glyph you might be able to switch the character to a variant that the font does support, but there’s no guarantee that the author of the document would consider that variant equivalent. (Of course, with Ten Mincho you would worry about this much less.) As is the case with so many things, paying attention to fonts early can avoid complications later.

It also means that we should notice and take advantage of things like variation selectors and the fonts and software that allow for their use! I for one am hoping that this becomes more widely known, used, and available.

If this has gotten your interest like it has mine, I would definitely recommend Ken Lunde’s CJK Type Blog for further reading!