Next: Text Scanning Performance and
Up: Overall Design
Previous: Universal Character Set
To represent text in any form of extended character set, there must
be a character entity that can represent more than an 8-bit quantity
and a type of string that can store these
characters. One way to adopt this new representation is to change
Character and String uniformly so that all instances of
Character or String represent the new wide characters and
strings (uniform approach). One of the most advanced
multilingualized systems, Emacs after version 20, uses this approach
[4]. Another way is to add new
representations and let them co-exist with the existing default ones
(mixed approach).
The uniform wide character representation is cleaner, but takes much
more space. In the original version 3.2 image, the String subinstances
occupy about 1.5MB in total. If we used the unsatisfactory
16-bit uniform representation or a 32-bit representation, the image size
would grow by a few megabytes.
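The estimate can be checked with a little arithmetic. The sketch below
assumes that the 1.5MB consists entirely of 1-byte character data and
ignores object headers; both are simplifying assumptions, and the function
name is hypothetical.

```python
def uniform_growth_mb(byte_string_mb, bytes_per_char):
    """Extra space (in MB) if every 1-byte character widens to bytes_per_char.

    Assumes all existing string data is stored one byte per character
    and ignores per-object header overhead.
    """
    return byte_string_mb * (bytes_per_char - 1)


if __name__ == "__main__":
    current_mb = 1.5  # String subinstances in the version 3.2 image
    for width in (2, 4):  # 16-bit and 32-bit uniform representations
        extra = uniform_growth_mb(current_mb, width)
        print(f"{width * 8}-bit uniform: about {extra} MB of extra space")
```

Under these assumptions the 16-bit representation adds about 1.5MB and the
32-bit representation about 4.5MB, which matches the "few megabytes" above.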
We decided to use the mixed approach. For each string, the most
suitable representation is selected, and it is implicitly converted to
another representation when necessary. In Smalltalk, this kind of implicit
conversion is easy to do. Also, migrating from the original Squeak to
m17n Squeak is easier this way.
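The idea of the mixed approach can be illustrated with a minimal sketch.
It is written in Python rather than Smalltalk and is not the m17n Squeak
implementation; the class name MixedString and the 0xFF threshold are
assumptions made for illustration. A string keeps one byte per character
until a character that does not fit is stored, at which point the storage
is silently widened.

```python
from array import array


class MixedString:
    """Hypothetical string with two internal representations.

    Characters are stored one byte each while every code fits in 8 bits;
    storing a wider character converts the storage to a wide array.
    """

    def __init__(self, text=""):
        codes = [ord(c) for c in text]
        self._wide = any(code > 0xFF for code in codes)
        # Pick the narrowest representation that can hold the text.
        self._data = array("L", codes) if self._wide else bytearray(codes)

    @property
    def is_wide(self):
        return self._wide

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return chr(self._data[index])

    def __setitem__(self, index, char):
        code = ord(char)
        if code > 0xFF and not self._wide:
            # Implicit conversion: copy the bytes into wide storage.
            # list() is needed so the bytes are taken as integer values.
            self._data = array("L", list(self._data))
            self._wide = True
        self._data[index] = code

    def __str__(self):
        return "".join(chr(code) for code in self._data)


if __name__ == "__main__":
    s = MixedString("abc")
    print(s.is_wide)   # still the 1-byte representation
    s[1] = "\u3042"    # HIRAGANA LETTER A does not fit in 8 bits
    print(s.is_wide, len(s))
```

Callers use the same interface before and after the conversion, so code
that only handles 8-bit text never pays for the wide representation.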
We discuss the details of the representation in Section 3.
Owner
2003-02-08