In the original Squeak, a character is represented as an instance of the Character class and holds an 8-bit quantity (an ``octet''). That representation is clearly not sufficient for an extended character set.
What kind of representation do we need for the extended character set? One might think that a fixed 16-bit representation would be enough, but even the ``industry standard'', Unicode version 3.2 [2][3], now defines a character set that needs as many as 21 bits per character.
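The 21-bit figure follows from the size of the Unicode code space, whose highest code point is U+10FFFF. A quick check, written as a small Python sketch (the helper name bits_needed is ours and is used only for illustration):

```python
def bits_needed(code_point: int) -> int:
    """Return the number of bits needed to store code_point as an unsigned integer."""
    # int.bit_length() returns 0 for 0, so reserve at least one bit.
    return max(1, code_point.bit_length())


MAX_16_BIT = 0xFFFF      # largest value a fixed 16-bit representation can hold
MAX_UNICODE = 0x10FFFF   # highest code point in the Unicode code space

print(bits_needed(MAX_16_BIT))   # 16
print(bits_needed(MAX_UNICODE))  # 21
```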
Plain Unicode has another problem: ``han-unification''. The idea behind han-unification is that the standard disregards the glyph differences among certain Kanji characters and lets the implementation choose the actual glyph. This abstraction contradicts the philosophy of Squeak: it becomes impossible to ensure pixel-identical execution and layout across platforms.
On the other hand, Unicode seems to be good enough for scripts other than the unified CJKV characters (in Unicode terminology, CJKV refers to ``Chinese, Japanese, Korean and Vietnamese'', the languages that use characters of Chinese origin). Those other scripts are well defined, and Unicode covers many of them.
Clearly, we need more than 16 bits for a character. What is the upper limit? Actually, we don't have to decide on one. Thanks to the late-bound nature of Squeak, we can always change the internal representation without much effect on the other parts of the system, even when what is changed is as basic as the Character class. This lets us start with a 32-bit word for a character. We have added a kind of ``encoding tag'' to each character to discriminate among the unified characters, and also to switch the underlying font and scanner method implementation.
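One way to picture a 32-bit character word that carries an encoding tag is sketched below in Python. This is only an illustration under stated assumptions: the text does not specify the bit layout, so the 22-bit code field, the tag values, the font names, and all function names here are hypothetical.

```python
WORD_BITS = 32
CODE_BITS = 22                       # assumed width of the character-code field
TAG_BITS = WORD_BITS - CODE_BITS     # remaining high bits hold the encoding tag
CODE_MASK = (1 << CODE_BITS) - 1
TAG_MASK = (1 << TAG_BITS) - 1

# Hypothetical tag values, chosen only for this sketch.
TAG_LATIN1 = 0
TAG_JAPANESE = 1
TAG_SIMPLIFIED_CHINESE = 2

# Hypothetical font table: the tag, not the code point, selects the font.
FONT_FOR_TAG = {
    TAG_LATIN1: "latin-font",
    TAG_JAPANESE: "japanese-font",
    TAG_SIMPLIFIED_CHINESE: "simplified-chinese-font",
}


def make_char(tag: int, code: int) -> int:
    """Pack an encoding tag and a character code into one 32-bit word."""
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag does not fit in the tag field")
    if not 0 <= code <= CODE_MASK:
        raise ValueError("code does not fit in the code field")
    return (tag << CODE_BITS) | code


def tag_of(word: int) -> int:
    """Extract the encoding tag from the high bits."""
    return (word >> CODE_BITS) & TAG_MASK


def code_of(word: int) -> int:
    """Extract the character code from the low bits."""
    return word & CODE_MASK


def font_for(word: int) -> str:
    """Choose a font from the tag alone."""
    return FONT_FOR_TAG[tag_of(word)]


# U+76F4 lies in the CJK Unified Ideographs block; the same code point
# becomes two distinct characters once a tag is attached.
ja = make_char(TAG_JAPANESE, 0x76F4)
zh = make_char(TAG_SIMPLIFIED_CHINESE, 0x76F4)
print(ja != zh)                      # True
print(code_of(ja) == code_of(zh))    # True
print(font_for(ja), font_for(zh))    # japanese-font simplified-chinese-font
```

With such a layout, two characters that share a unified code point still compare as different values, and the renderer can pick a font (and, by the same dispatch, a scanner method) from the tag without consulting any outside context.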