| encoding name | encoding tag |
| --- | --- |
| Latin1 | 0 |
| JISX0208 | 1 |
| GB2312 | 2 |
| KSX1001 | 3 |
| JISX0208 | 4 |
| Japanese (U) | 5 |
| Simp'd Cn (U) | 6 |
| Korean (U) | 7 |
| GB2312 | 8 |
| Trad. Cn (U) | 9 |
| Vietnamese (U) | 10 |
| KSX1001 | 12 |
| LatinExtended (U) | 17 |
| IPA (U) | 18 |
| MusicalSymbols (U) | 89 |
| MathAlnumSymbols (U) | 90 |
| Tags (U) | 91 |
| Generic (U) | 255 |
As described in subsection 2.1, we added a one-word-per-character representation to the original Squeak. This section describes the details of that implementation.
To represent these extended characters, a class named MultiCharacter was added as a subclass of Character. Because the value instance variable of Character already holds a SmallInteger, MultiCharacter does not need any additional instance variables. While it is possible to assign any object to the value instance variable, so far we restrict it to the positive range of SmallInteger to avoid large-integer arithmetic and confusion with negative values. Because one bit is used for the SmallInteger tag and another for the sign, 30 bits of the 32-bit word are actually available for positive integer character codes. Because most of the methods of Character are compatible with the new subclass, we only needed to override about ten methods in MultiCharacter.
How do we use these 30 bits? In the current m17n Squeak implementation, 8 bits are used for the ``encoding tag'', which is often referred to as the ``leading char'', and the remaining 22 bits represent the code point in the language/script.
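The 8 + 22 bit split can be illustrated with a minimal sketch. It is written in Python rather than Squeak Smalltalk purely for illustration, and it assumes that the encoding tag occupies the high-order 8 bits of the 30-bit value; the names `pack_char` and `unpack_char` are hypothetical and do not appear in the implementation.

```python
TAG_BITS = 8     # width of the encoding tag ("leading char")
CODE_BITS = 22   # width of the code point within the language/script
CODE_MASK = (1 << CODE_BITS) - 1
MAX_VALUE = (1 << (TAG_BITS + CODE_BITS)) - 1  # largest 30-bit positive value


def pack_char(tag, code_point):
    """Combine a tag and a code point into one 30-bit value (assumed layout)."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("encoding tag must fit in 8 bits")
    if not 0 <= code_point <= CODE_MASK:
        raise ValueError("code point must fit in 22 bits")
    return (tag << CODE_BITS) | code_point


def unpack_char(value):
    """Split a 30-bit value back into (tag, code_point)."""
    if not 0 <= value <= MAX_VALUE:
        raise ValueError("value must be a 30-bit positive integer")
    return value >> CODE_BITS, value & CODE_MASK
```

Under this assumed layout, `pack_char(0, 0x41)` is simply `0x41`, and the largest representable value, `pack_char(255, CODE_MASK)`, equals `2**30 - 1`.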
Basically, the 22-bit part is identical to the Unicode code point, and the encoding tag is based on the Unicode script definition. Namely, each script defined in Unicode has its own encoding tag value. However, there are exceptions. A character in the unified Han area can have different encoding tags to denote its ``source standard''.
The encoding tag is used for three purposes. One purpose is to switch the various methods that depend on the script. For example, the character scanning rule varies from one script to another, as we describe in section 6.
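As a small illustration of the unified Han exception, the sketch below (Python, for illustration only) attaches the Han-related tags from Table 1 to a single code point. The tag-in-high-bits layout, the helper name `tagged_value`, and the reading of ``Simp'd Cn'' and ``Trad. Cn'' as Simplified and Traditional Chinese are assumptions made for this example.

```python
CODE_BITS = 22
CODE_MASK = (1 << CODE_BITS) - 1

# Unicode-based tags from Table 1 (names expanded; the expansion is an assumption).
HAN_TAGS = {
    "Japanese": 5,
    "Simplified Chinese": 6,
    "Korean": 7,
    "Traditional Chinese": 9,
}


def tagged_value(tag, code_point):
    """Hypothetical helper: place the tag above the 22-bit code point."""
    return (tag << CODE_BITS) | code_point


code_point = 0x76F4  # U+76F4, a character in the CJK Unified Ideographs block
values = {name: tagged_value(tag, code_point) for name, tag in HAN_TAGS.items()}

# Every value carries the same code point but a different source standard.
same_code_point = all(v & CODE_MASK == code_point for v in values.values())
all_distinct = len(set(values.values())) == len(HAN_TAGS)
```

Both `same_code_point` and `all_distinct` hold: the four values share one Unicode code point yet remain distinguishable by tag.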
Another exception is introduced to maintain compatibility with the previous m17n implementation. In the previous implementation, the 30 bits were divided into a 6-bit encoding tag and a 24-bit code point. Also, CJK characters were encoded according to domestic standards such as GB 2312, JIS X 0208, and KS X 1001. To make it possible to use existing old instances in those representations, the encoding tags for those old encodings retain the same bit pattern even though the boundary between the code point and the encoding tag has changed. Table 1 summarizes the current encoding tag allocation. In the table, ``(U)'' after a name denotes a script based on Unicode. Non-Unicode encodings appear twice to make it possible to use existing non-Unicode instances created before the boundary was shifted. Because the boundary moved by two bits, an old tag n reads as 4n under the new layout. Before this boundary change, four encodings were defined, so the new Unicode-based encodings start after 16.
As described in section 2, the definition of Character was changed to a Latin-1 based encoding. While most of the characters in Latin-1 were also in MacRoman and their glyphs were already available, several Latin-1 characters and symbols are missing from MacRoman. The glyphs for those characters and symbols were bit-edited, and the existing StrikeFonts were modified to include them.
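The tag arithmetic behind the duplicated entries in Table 1 can be checked with a short sketch (Python, for illustration only). It assumes that in both layouts the tag sits in the high-order bits, and that old code points fit in the low 22 bits, which holds for the 8836 packed code points of the domestic standards. The function names are hypothetical.

```python
OLD_CODE_BITS = 24  # previous layout: 6-bit tag, 24-bit code point
NEW_CODE_BITS = 22  # current layout: 8-bit tag, 22-bit code point


def old_pack(tag, code_point):
    """Build a value the way the previous layout is assumed to have done."""
    return (tag << OLD_CODE_BITS) | code_point


def new_tag(value):
    """Read the encoding tag of a value under the current layout."""
    return value >> NEW_CODE_BITS


def new_code_point(value):
    """Read the code point of a value under the current layout."""
    return value & ((1 << NEW_CODE_BITS) - 1)


# Old tags 0..3 (Latin1, JISX0208, GB2312, KSX1001) seen through the new layout.
reinterpreted = [new_tag(old_pack(tag, 100)) for tag in range(4)]
```

Here `reinterpreted` is `[0, 4, 8, 12]`, matching the second appearance of JISX0208, GB2312, and KSX1001 in Table 1, while the code point itself is unchanged by the reinterpretation.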
The ISO-2022 based multi-octet character sets such as GB 2312, JIS X 0208, and KS X 1001 have 8836 ($94 \times 94$) code points. The original code assignment is somewhat sparse; for every 94 characters in a row, there are 34 unused values. As we will discuss in section 7, StrikeFont expects consecutive code points. We simply pack the characters to make all of the code points consecutive.
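One way to perform such packing is the usual row-major mapping over the $94 \times 94$ grid, sketched below in Python for illustration. The source does not state the exact formula used in m17n Squeak, so this mapping, the byte range 0x21 to 0x7E, and the names `packed_index` and `unpacked_bytes` are assumptions.

```python
FIRST = 0x21   # first byte value used within a row (assumed 7-bit form)
ROW_SIZE = 94  # usable values per row: 0x21..0x7E


def packed_index(byte1, byte2):
    """Map a two-byte code to a consecutive index in 0..8835."""
    for b in (byte1, byte2):
        if not FIRST <= b < FIRST + ROW_SIZE:
            raise ValueError("byte outside the 94-value range")
    return (byte1 - FIRST) * ROW_SIZE + (byte2 - FIRST)


def unpacked_bytes(index):
    """Inverse of packed_index: recover the two bytes from the index."""
    if not 0 <= index < ROW_SIZE * ROW_SIZE:
        raise ValueError("index outside 0..8835")
    row, cell = divmod(index, ROW_SIZE)
    return row + FIRST, cell + FIRST
```

With this mapping the first code (0x21, 0x21) becomes index 0 and the last code (0x7E, 0x7E) becomes index 8835, so the 34 unused values in each group of 128 no longer leave gaps.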
The encoding tag for a Latin-1 character as a MultiCharacter is zero. This means that comparing two characters, whether they are instances of Character or MultiCharacter, can be done simply by comparing their value instance variables. The Unicode standard defines equality conformance for composite characters. While this can be done at a higher level, we don't provide full composite character equality in basic MultiCharacter.
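The difference between value comparison and composite character equality can be sketched as follows (Python, for illustration only). The helper `tagged_value` and the tag-in-high-bits layout are assumptions; the higher-level comparison uses Python's standard `unicodedata.normalize` and is not part of MultiCharacter.

```python
import unicodedata

CODE_BITS = 22
LATIN1_TAG = 0  # encoding tag for Latin-1, from Table 1


def tagged_value(tag, code_point):
    """Hypothetical helper: place the tag above the 22-bit code point."""
    return (tag << CODE_BITS) | code_point


# With tag 0, the tagged value is just the Latin-1 code, so a plain
# integer comparison works across both character classes.
latin1_matches = tagged_value(LATIN1_TAG, 0xE9) == 0xE9

precomposed = "\u00e9"  # e with acute accent as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT


def canonically_equal(a, b):
    """Higher-level equality: compare after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Comparing `precomposed` and `decomposed` code point by code point reports a difference, whereas `canonically_equal` treats them as the same text; this is the kind of check that is left to a higher level.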