Font Selection and Font Composition for Unicode

Martin J. Dürst and Marc-Antoine Parent1

MultiMedia-Laboratory, Institut für Informatik der Universität Zürich

Winterthurerstrasse 190, CH-8057 Zürich, Switzerland, e-mail: abqckg@r.postjobfree.com

and

Centre de Recherche Informatique de Montréal

1801, avenue McGill College, Bureau 800, Montréal (Québec) Canada, H3A 2N4

e-mail: abqckg@r.postjobfree.com

Note: This is a prepublication version of a paper that will be published in the Proceed-

ings of the 7th International Unicode Conference. Copyright for this and the final ver-

sion is held jointly by the authors and by Unicode, Inc. This version is not intended for

wide dissemination.

The Proceedings with the final version of this paper will be available at the confer-

ence, to be held in San Jose, CA, on Sept. 14/15, 1995 (for further information, contact

Global Meeting Services, 3627 Princess Avenue, North Vancouver, B.C., Canada V7N

2E4, email: abqckg@r.postjobfree.com, voice: +1-604-***-****, fax: +1-604-***-****) or af-

ter the conference from Unicode Inc. (P.O. Box 700519, San Jose, CA 95170-0519,

U.S.A., email: abqckg@r.postjobfree.com, voice: +1-408-***-****, fax: +1-408-***-****).

There are some differences between this and the final version, which are mainly mo-

tivated by the problem of printing Japanese characters on printers without Japanese

fonts. The current version of the paper contains EPSF bitmaps for some Japanese char-

acters. These EPSF are optimized for 24dpmm (600dpi) printers. With other printers,

dropouts or antialiasing effects may lead to suboptimal representation of the Japanese

characters.

1. The second author's work was part of Alis Technologies Inc.'s "Internet en Français" project

to develop Lys, a multilingual mail agent. For information on Lys, please contact Alis: 3410 rue

Griffith, Montréal, Québec, Canada, H4T 1A7; Fax: 514-***-****; e-mail: abqckg@r.postjobfree.com

Abstract The integration of the current scripts of the world into a single character en-

coding standard (Unicode/ISO 10646) poses new challenges to system software and

user interface designers, typographers, and font providers. Constructing or designing

all-encompassing Unicode fonts is in most cases not feasible, for several reasons;

much more flexible solutions are necessary.

The paper analyses the requirements for flexible font selection and composition

mechanisms from the viewpoints of typography, user interface, programmer interface

and resource usage. Based on an object-oriented application framework, an architec-

ture to satisfy these requirements in an extensible way is proposed and implemented.

The tasks of font selection and glyph mapping are reduced to the same basic concepts,

which also lead to interesting solutions for problems such as CJKV glyph disambigua-

tion, and allow the easy integration of proprietary algorithms.

1 Introduction

For most people dealing with text processing systems, from users to programmers, the

relation between character codes and glyphs in a font was simple: For every character

code, there was a glyph, and vice versa, and every font contained glyphs for all characters.

The great majority of non-specialists expect the same for Unicode: the all-encompassing

"Unicode font".

For three reasons, this expectation is not going to be fulfilled: First, the relation be-

tween characters and glyphs is not one-to-one; for some scripts, complex operations

are necessary. Second, the resources needed for a single complete Unicode font, and

even more for a reasonable variety of Unicode fonts, are not available. This is true both

in terms of human work by font designers and, for the near future, in terms of storage

space. Third, from a typographic viewpoint, a single large font is not flexible enough

to address the requirements of multiscript documents.

However, once a user knows about Unicode, he is not very interested in additional

explanations, and expects the system to behave as if there were one or several Unicode

fonts, unless he gives more detailed specifications. To achieve such a behaviour, this

paper proposes a general scheme of font composition and font and glyph selection,

7th International Unicode Conference -1- San Jose, September 1995

Font Selection and Font Composition M.J. Dürst and M.-A. Parent

which can be used in circumstances where only very few fonts are available, as well as

for very high quality typography.

In Section 2, this paper gives an overview of the requirements for a font selection

and composition scheme. Section 3 presents the new concepts of composite font and

base font, and the three main kinds of composite fonts, namely font arrays, font cas-

cades, and font sets. Several interesting applications of composite fonts are explained

in Section 4. Section 5 gives background and details of our implementation, and Sec-

tion 6 compares our concepts with existing font composition schemes.

2 Requirements

Due to the vast differences in available resources, the requirements for font selection

and font composition schemes cover a wide range from low end to high end. At the

low end, we should try to assure basic readability, or at least make the user aware of

the fact that glyphs are missing. At the high end, we should provide mechanisms to

implement optimal typographic solutions for many kinds of multilingual documents.

2.1 Requirements of Multilingual and Multiscript Typography

To understand the requirements for multiscript typography, it is important to realize

that multiscript documents come in a very wide variety. This goes without saying for

the number of possible combinations of scripts, as well as for the basic variety that ex-

ists even for monolingual documents. Less known, but equally important is the ratio

of usage of characters from different scripts, and the length of stretches of each script.

In the past, typesetting multiscript documents was difficult, and there exist only few

examples. Typographic theory and practice for multiscript documents are still in their

infancy, and the increasing possibilities to compose true multiscript documents will

help them mature.

The considerations made by Bigelow and Holmes [BH93] about their Unicode font

already show that an attempt towards a full Unicode font faces many problems that

have to be addressed very carefully. They mention the choice of a sans-serif style, with

fewer cultural associations, the use of more differentiated diacritics, or the adjustment of

Hebrew to a size between Latin capital height and x-height. The general objective is a

harmonized design, regularizing basic weights and alignments, but preserving es-

sential and meaningful differences.

Doing this for the Latin alphabet with all the Unicode extensions, and for alphabets

such as Greek and Cyrillic, which share a large part of their history and typographic

tradition with the Latin alphabet, is already a formidable task. Trying this for less re-

lated scripts becomes even more difficult. Very fundamental design concepts of West-

ern typography, such as the uniform gray value of a line of text, cannot be used e.g. for

Japanese character design.

Not only do the problems get larger if character history and typographic traditions

are less related, but also problems increase dramatically as the number of scripts in-

creases. While it may be relatively easy to find a solution for scripts A and B, and an-

other for scripts B and C, this does not provide a solution for all three scripts together.

As an example, consider Hebrew, Latin, and Arabic. Matching Latin glyphs with He-

brew glyphs, the Latin glyphs should have rather large x-height and small ascenders

and descenders to account for the fact that ascenders and descenders are virtually non-

existent in Hebrew. On the other hand, combining Latin with Arabic, a Latin font with

small x-height and large ascenders and descenders may be preferred to match similar

features in the Arabic script. For a single Unicode font with uniform design, a compro-

mise is necessary, which will be suboptimal for many texts with restricted script usage.

Another problem in high-quality typography is the ratio of usage of characters from

different scripts. It may very well be that a different size ratio is necessary depending

on whether some Hebrew words appear in a text written with Latin letters, or vice ver-

sa. This may even depend on the context. In an English text where Hebrew words are

the subject of the discussion, Hebrew might preferably be somewhat larger or bolder

than in a text where Hebrew equivalents are just given for reference, and may be read

over. The role of the different scripts in the document influences their visual relation.

The boundary between the composition of matching fonts (according to whatever cri-

teria) and the explicit markup of foreign script portions, e.g. as bold, is of course not

very well defined, and has to be chosen depending on the text.

The combination of different scripts can even lead to problems that can only be

solved on the level of microtypography. The most important line of reference in the

Latin script is the baseline, whereas ideographs are aligned to their center. For longer

stretches of ideographs, it is best to try to align their center to that of the Latin text. On

the other hand, for short sequences of ideographs and especially for single ideographs

with a clearly visible baseline (the example ideographs, embedded as bitmaps in the original, are not reproduced here), the relation to

the baseline of the Latin characters becomes more and more important. Even neigh-

bouring characters can have an influence; ideographs that look right between standard

text can look vertically displaced between parentheses.

With all these problems, it should be clear that configurability and flexibility are

very important. This applies both to the composition of Unicode fonts with a wide cov-

erage from fonts for different scripts, as well as to the flexibility with respect to aspects

such as size combinations, glyph selection, and glyph placement, where new schemes

and algorithms should be easy to integrate into a system whenever desired.

The above explanations tried to show that a fixed, all-encompassing Unicode font is

not the final solution for multiscript typography. However, this is in no way meant to

discourage the important design efforts towards Unicode fonts, such as those of

Bigelow/Holmes [BH93], Everson (his shareware fonts are available from

ftp://dkuug.dk/CEN/TC304/EversonMono10646 or

ftp://midir.ucd.ie/mgunn/Everson/EversonMono10646), Haralambous [HP95], and hopefully others.

Few users have a selection of fonts for several scripts wide enough to find good

combinations easily. In these situations, a general design is by far preferable to a bad

match of individual fonts. Also, Unicode for most scripts contains more characters

than a standard font for that script; this applies particularly to the Latin script, even

more so because most of the general symbols belong more to the Latin script than to

any other. Extending existing fonts to cover the whole glyph set necessary in a single

script for Unicode is therefore also very important.

Last but not least, trying to design well-matching type of different scripts, and

adopting design ideas from other scripts, is an interesting artistic undertaking. It can

lead to completely new designs of great excellence.

Whereas currently, uniform designs or the combinations of matching fonts have to

be done by the type designer or the user, it might become possible to some extent in

the future to automatize this process, and to provide higher quality in many specific

situations with technologies such as Multiple Masters and others [MB93]. Similarly,

with the number of characters increasing, more structured approaches to font design

[Dür93], which are currently in the research stage only, may become more viable.

2.2 User Interface Requirements

From a user interface viewpoint, it is very important that interventions by the user

are minimal. For a text with frequent small portions of foreign script, the user does not

want to specify input method, font, and additional attributes such as language, size,

and vertical adjustment over and over again. In general, the input method can be

changed most easily with a keyboard shortcut, and has to be changed anyway. The

font, on the other hand, should not have to be changed explicitly.

What is also important is that the system reacts in a predictable way and does not

mess around with the user's specifications. If he selects Helvetica, he does not want

that to be changed to anything else just because the rendering system has no way to

render Arabic characters with a Helvetica setting.

2.3 Implementation Requirements

Implementation requirements can be split into two categories: General requirements

and the specific requirements for the implementation platform we are using, the appli-

cation framework ET++ [WGM89, WG94]. A more detailed discussion of our general

approach to software globalization is contained in [Dür95]; this stands in contrast to

earlier and more narrow attempts at localizing ET++ [Sat95].

Generally, it is very important to see that an average application programmer has

neither the knowledge nor the motivation to care about multilingual issues, nor is it

easy for her to test correct behaviour. This means that besides programs and compo-

nents that specifically treat language-dependent aspects of text, no additional pro-

gramming effort should be necessary. This of course includes font specifications; in

this respect, there is little difference between a programmer and a final user. On the

other hand, it is important that when really needed, an application programmer can

easily introduce completely new functionality and is not bound to the limitations of

fixed APIs or configuration files.

Another implementation requirement is efficiency, both in space and time. For ap-

plications with graphical user interfaces that contain many short text items, these items

should be as lightweight as possible. Having to specify runs for different fonts on such

texts makes them heavier than if only a single font can be specified that can render

whatever characters are used.

The use of an object-oriented application framework such as ET++ both provides the

base to realize the above requirements, and defines requirements of its own. The most

important aspect is the general portability designed into ET++ [Wei92]2. This means

that the rendering model has to be very general and on a rather low level. In fact, for

text rendering there is a single function that connects the application framework with

system-specific code; this function draws a single glyph. Another aspect is that as a

publicly available framework, ET++ has to rely on openly available fonts, which come

in a very wide variety of glyph sets and encodings. On the other hand, ET++ should

not disallow the use of proprietary fonts by users who have acquired them.

3 Composite Fonts

The requirements of the previous section can be fulfilled only by a system that makes

a basic distinction between two kinds of fonts, namely base fonts and composite fonts.

The use of composition is a very important principle in object-oriented software

[GHJV95], and here again shows its strengths. Base fonts are the fonts that we all know

from conventional systems; they cover a usually rather small subset of Unicode, with

a consistent design. Composite fonts, on the other hand, are responsible to cover Uni-

code as generally as possible.

3.1 Glyph Mapping and Font Selection

Both base fonts and composite fonts can incorporate table-based and algorithmic func-

tionality. Base fonts will care for glyph selection and mapping, whereas composite

fonts have to select appropriate subfonts.

The simplest base font will just do a one-to-one mapping from the Unicode charac-

ters it covers to the glyph encoding it uses; more sophisticated base fonts will take into

account the script-specific and font-specific character-to-glyph mappings. Assigning

this functionality to fonts, and not building it into the core text rendering routines of a

system, allows fonts with different encodings and different ligaturing mechanisms

to be used for the same script in the same system. This makes the system easily extensible

for new scripts and new glyph mapping schemes. Even proprietary glyph mapping

schemes for highest-quality typography can be added.
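
The two kinds of base font described above can be sketched as follows. This is a minimal illustration, not the ET++ or UNET++ interface: the class names `BaseFont` and `LigatingFont`, their methods, and all glyph numbers are assumptions made for the example.

```python
class BaseFont:
    """Toy base font: a one-to-one table from code points to glyph indices."""

    def __init__(self, name, cmap):
        self.name = name
        self.cmap = dict(cmap)  # code point -> glyph index in the font's own encoding

    def can_render(self, ch):
        return ord(ch) in self.cmap

    def glyphs(self, text):
        """Map text to glyph indices; raises KeyError for uncovered characters."""
        return [self.cmap[ord(ch)] for ch in text]


class LigatingFont(BaseFont):
    """Toy base font with font-specific ligatures (hypothetical table format)."""

    def __init__(self, name, cmap, ligatures):
        super().__init__(name, cmap)
        self.ligatures = dict(ligatures)  # character sequence -> glyph index

    def glyphs(self, text):
        out, i = [], 0
        while i < len(text):
            for seq, glyph in self.ligatures.items():
                if text.startswith(seq, i):
                    out.append(glyph)
                    i += len(seq)
                    break
            else:  # no ligature starts here: fall back to the one-to-one table
                out.append(self.cmap[ord(text[i])])
                i += 1
        return out


# Hypothetical Greek font storing alpha..omega (U+03B1..U+03C9) at glyph indices 1..25.
greek = BaseFont("ToyGreek", {cp: cp - 0x03B0 for cp in range(0x03B1, 0x03CA)})
# Hypothetical Latin font with an "fi" ligature at glyph index 500.
latin = LigatingFont("ToyLatin", {ord(c): ord(c) for c in "fin"}, {"fi": 500})
```

Because the mapping lives in the font object, a second font for the same script could use a different encoding or ligature table without any change to the rendering loop.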

Composite fonts also incorporate functionality. They have to decide what portions

of a text are rendered with what base font. Many ways to do this can be imagined; we

will introduce three important ones starting in Section 3.3. Some interesting and not

immediately obvious applications will be presented in Section 4.

3.2 Virtual Fonts and Real Fonts

With the above distinction between composite fonts and base fonts, how can we assure

that when a user specifies Helvetica, which obviously exists as base font, he gets Japanese

drawn with Japanese letters, which are not usually contained in Helvetica, and

which therefore have to be selected by a composite font? To achieve this, a font can

play both the role of a composite font and of a base font. When Helvetica is used for

rendering a text, it is addressed as a composite font. The composition is constructed so

that it will address Helvetica again as a base font for those characters that can indeed

be rendered with the base font. For the other characters, the composition will reference

other fonts as base fonts, maybe with intermediate composite fonts.

2. At present, there exist ports for X11, SunView, OpenGL, the Macintosh, and Windows NT.

We call the fonts that play both roles real fonts. Of course, a user or programmer may

not be satisfied to always have the same fixed combinations. Additional combinations

can be created, e.g. a font called MyNewSansSerif. Such fonts are addressable only as

composite fonts, but not as base fonts, and are therefore called virtual fonts.
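
One way to picture the distinction between real and virtual fonts is with two lookup tables, one for each role. The sketch below is only an illustration under assumed names: the tables `BASE_FONTS` and `COMPOSITIONS`, the font `ToyMincho`, and its hiragana-only coverage are hypothetical and do not describe the actual implementation.

```python
import string

# Role 1: fonts addressable as base fonts (name -> characters they can draw).
BASE_FONTS = {
    "Helvetica": set(string.printable),
    "ToyMincho": {chr(cp) for cp in range(0x3041, 0x3097)},  # hiragana only (assumption)
}

# Role 2: fonts addressable as composite fonts (name -> base fonts in priority order).
COMPOSITIONS = {
    "Helvetica": ["Helvetica", "ToyMincho"],       # same name in both roles
    "MyNewSansSerif": ["Helvetica", "ToyMincho"],  # composite role only
}


def is_real(name):
    """A real font plays both roles."""
    return name in COMPOSITIONS and name in BASE_FONTS


def is_virtual(name):
    """A virtual font is addressable only as a composite font."""
    return name in COMPOSITIONS and name not in BASE_FONTS


def base_font_for(requested, ch):
    """Resolve the font the user asked for to the base font that draws `ch`."""
    for base in COMPOSITIONS[requested]:
        if ch in BASE_FONTS[base]:
            return base
    return None
```

Asking for Helvetica thus yields the Helvetica base font wherever it has a glyph, and another base font elsewhere, without the user changing the font setting.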

Also, it is possible that due to system limitations (e.g. only 256 glyphs per font), font

composition is used below the level accessible to the user. Such fonts are not addressable

as composite fonts, and therefore are called hidden fonts.

3.3 Font Arrays

A first kind of composite font is the font array. A font array is a very deterministic com-

posite font; it knows by itself which of its components can draw what characters. Usu-

ally, components are responsible for a single range of Unicode characters. Font arrays

can be used to simulate large fonts if these are not available due to limitations of the

underlying rendering system. Another use is the splitting of fonts designed or rear-

ranged for Unicode into smaller parts to avoid loading the whole font if only small

parts of it are used. As Section 6 shows, schemes similar to font arrays are available in

several display or printing systems.
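
A font array can be sketched as a sorted range table with a binary search. The class `FontArray`, the component names, and the chosen ranges are assumptions for illustration only.

```python
import bisect


class FontArray:
    """Composite font that knows which component covers which code point range."""

    def __init__(self, ranges):
        # ranges: (first_code_point, last_code_point, component_name), non-overlapping
        self.ranges = sorted(ranges)
        self.starts = [first for first, _, _ in self.ranges]

    def component_for(self, ch):
        cp = ord(ch)
        i = bisect.bisect_right(self.starts, cp) - 1  # last range starting at or before cp
        if i >= 0:
            _, last, name = self.ranges[i]
            if cp <= last:
                return name
        return None  # no component is responsible for this character

    def runs(self, text):
        """Split text into (component_name, substring) runs."""
        out = []
        for ch in text:
            name = self.component_for(ch)
            if out and out[-1][0] == name:
                out[-1] = (name, out[-1][1] + ch)
            else:
                out.append((name, ch))
        return out


arr = FontArray([
    (0x0000, 0x00FF, "latin-part"),
    (0x0370, 0x03FF, "greek-part"),
    (0x0400, 0x04FF, "cyrillic-part"),
])
```

The lookup never consults the component fonts themselves, which is what makes a font array deterministic and lets unused parts stay unloaded.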

3.4 Font Cascades

A font cascade is a composite font with clear priorities, for use in situations that are not

as clearcut as in the case of a font array. A font cascade does not know by itself what

glyphs its components can render. It starts by giving fonts with higher order priority

a chance to render the text. Text portions that cannot be rendered by fonts with higher

priority are passed to fonts with lower priority in turn. This procedure guarantees that

the font specified by the user is used whenever possible. With the correct sequence,

this also makes separate masking unnecessary. Masking otherwise has to take care that

e.g. Latin characters are not taken from a Greek font, or Kana from a Korean font, even

if they are available there.

Long cascades can lead to inefficiencies. But if the fonts of more frequently used

scripts are arranged before the less used scripts in the cascade, this is not a problem.

Finding the position of the next character that can be rendered with a given base font,

so as to pass the intermediate stretch down the cascade, is very fast.
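
The cascade procedure can be sketched recursively: the first font takes every stretch it can render, and each remaining stretch is handed to the rest of the cascade. The `Font` class, the function `cascade_runs`, and the toy coverage sets are assumptions made for this example.

```python
import string


class Font:
    """Toy base font described only by the set of characters it can render."""

    def __init__(self, name, chars):
        self.name = name
        self.chars = set(chars)

    def can_render(self, ch):
        return ch in self.chars


def cascade_runs(text, cascade):
    """Return (font_name, substring) runs; font_name is None if no font can render."""
    if not text:
        return []
    if not cascade:
        return [(None, text)]
    head, rest = cascade[0], cascade[1:]
    runs, i = [], 0
    while i < len(text):
        ok = head.can_render(text[i])
        j = i
        while j < len(text) and head.can_render(text[j]) == ok:
            j += 1  # extend the stretch while renderability stays the same
        if ok:
            runs.append((head.name, text[i:j]))
        else:
            runs.extend(cascade_runs(text[i:j], rest))  # pass the stretch down
        i = j
    return runs


helvetica = Font("Helvetica", string.ascii_letters + " ")
# This toy Greek font also contains Latin letters, as many real fonts do.
greek = Font("ToyGreek", "αβγδ" + string.ascii_letters)
```

With Helvetica placed first, the Latin letters in the Greek font are never used, which is the effect that separate masking would otherwise have to provide.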

3.5 Font Sets

A font set treats its component fonts in a more equal way than a cascade, incorporates

more functionality, and gives a wider selection of possibilities. It can also address

some very specific problems, such as the selection of glyph variants from different East

Asian typographic traditions (see Section 4.2). The basic implementation of a font set

looks for the component font that can render the longest stretch of text. This can pro-

vide well-matching glyphs and can be more efficient on a display system that has a

large overhead for font switching.

Other strategies for deciding between different fonts can be implemented, for exam-

ple any kind of quality rating. However, the more complex the strategy, the more time

will be used for evaluation, even if in many cases, most fonts drop out very early be-

cause there is a character they cannot render. For higher efficiency, it is possible to re-

strict lookahead to a given number of characters.
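
The basic longest-stretch strategy with restricted lookahead might look as follows. This is a sketch under assumptions: fonts are reduced to sets of renderable characters, ties go to the font listed first, and the names `stretch` and `font_set_runs` are invented for the example.

```python
def stretch(chars, text, start, lookahead):
    """Length of the run from `start` that `chars` covers, capped at `lookahead`."""
    n = 0
    while start + n < len(text) and n < lookahead and text[start + n] in chars:
        n += 1
    return n


def font_set_runs(text, fonts, lookahead=8):
    """fonts: dict name -> set of characters; returns (name, substring) runs."""
    runs, i = [], 0
    while i < len(text):
        best_name, best_len = None, 0
        for name, chars in fonts.items():
            n = stretch(chars, text, i, lookahead)
            if n > best_len:  # strictly longer wins, so ties go to the earlier font
                best_name, best_len = name, n
        step = max(best_len, 1)  # skip one character if no font covers it
        piece = text[i:i + step]
        if runs and runs[-1][0] == best_name:
            runs[-1] = (best_name, runs[-1][1] + piece)  # merge with previous run
        else:
            runs.append((best_name, piece))
        i += step
    return runs


fonts = {"A": set("abc"), "B": set("abcxyz")}
```

In this toy setup `font_set_runs("abc", fonts)` selects font A, while appending a single "x" makes font B win for the whole stretch, so an edit can change the font of text that was already on screen.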

As a special case, a font set can decide whether in the rendering of combining char-

acters, priority is given to a separate rendering with a high-quality font, or integrated

rendering, maybe with a font that does not exactly match the font selected by the user.

More crucial than efficiency is stability. A font set treats its component fonts in a

more equal way than a font cascade. It is therefore more sensitive to changes in a text,

and to the way text stretches are used for rendering. A single insertion or deletion may

lead to a longer run for a certain font, or give it a better quality rating. Also, if render-

ing is done on units of lines or paragraphs for efficiency reasons, or even parts of lines,

a different font may be chosen if the font selection process starts at a different charac-

ter. Such changes are highly undesirable for the user. Without care, this can even lead

to nasty loops during reformatting. Thus font sets should be implemented and used

with great care.

4 Applications of Composite Fonts

Besides standard font and glyph selection, composite fonts in their various forms make

it possible to cover a wide range of related problems. In the following, some particularly

interesting examples are given.

4.1 Last Resort Font

A special position is taken by the last font in a font cascade; it is a kind of "lender of

last resort". If it is asked to render a text portion, this means that the character cannot

be rendered as desired, and that there is most probably no appropriate glyph. Still,

there are many things that can be done, one after another or alternatively. The selection

can be made by the system implementor, or in more elaborate systems by the applica-

tion programmer or the user.

A composite font may have been defined to include a font of each of the scripts

used, with the understanding that this covers all the necessary characters. Still, it is

possible that a given font for a script does not contain a character, whereas another font

in a different style, and maybe from a different supplier, contains that character. There-

fore, a first thing the last resort font can do is to try to search all the fonts loaded in the

application, defined in the application, or available on the system. As this uses pro-

gressively more resources, it has to be carefully implemented, if at all.

Another possibility, maybe after the above has failed, is to display some mechanically

generated identification of the character. One variant is the use of four hexadecimal digits,

arranged in a square, as on the Macintosh. Depending on the context, readable

glyph sequences for unavailable combinations of combining characters (e.g. u" for ü)

can be provided.
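
Both fallbacks can be sketched in a few lines. The function names and the table `ASCII_MARKS` are assumptions for illustration; the decomposition uses the standard-library `unicodedata.normalize`, and the sketch handles 16-bit code points only.

```python
import unicodedata


def hex_box(ch):
    """Two text rows holding the four hexadecimal digits of a 16-bit code point."""
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("this sketch covers 16-bit code points only")
    digits = f"{cp:04X}"
    return [digits[:2], digits[2:]]  # upper row, lower row of the square


# Hypothetical choice of spacing ASCII stand-ins for a few combining marks.
ASCII_MARKS = {
    "\u0308": '"',  # combining diaeresis
    "\u0301": "'",  # combining acute accent
    "\u0300": "`",  # combining grave accent
    "\u0302": "^",  # combining circumflex accent
    "\u0303": "~",  # combining tilde
}


def readable_fallback(ch):
    """Decompose a character and replace known combining marks by ASCII stand-ins."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(ASCII_MARKS.get(c, c) for c in decomposed)
```

For example, `hex_box("\u3042")` gives the rows `30` and `42`, and `readable_fallback("\u00fc")` gives `u"`.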

Another variant is to display a glyph indicating the script or category (e.g. symbols,

dingbats, ...) of the character only. It has been argued that such glyphs could be added

to the character set of Unicode, but this should not be done3, for two reasons: First,

these are not characters themselves, they are only used on the meta-level to speak about

characters and glyphs (respectively their absence). Second, they should not be provid-

ed in a font, and much less in a character set, but should be part of the application

framework itself to assure display in any situation.

An even simpler solution is to display the same shape for all characters that cannot

be rendered. In some cases, width information about the corresponding glyph may be

available, and the shape may be adapted appropriately. It may be similar to, but

should be distinguishable from, the glyph for U+FFFD, the replacement character for

characters that cannot be represented (as opposed to displayed) in Unicode. In a config-

uration debugging mode, a popup dialog might also appear, trying to give installation

hints to the user.

Additional resources may be made available through the network. In the context of

the worldwide web, for example, unavailable glyphs might be replaced by inline im-

ages obtained from a special glyph server, or by a single active inline image repre-

senting an unknown character, which on activation would display additional

information or trigger further attempts at higher-quality rendering, going as far as

having the server send the whole document in bitmap form. Having the user trigger

additional work, instead of doing this as a default, is reasonable because the demand

on the system can be heavy, and in many cases, a user will have downloaded a document

accidentally and will understand as much with replacement characters as he

would with a high-quality rendering.

Similar solutions can also be envisioned for cases where transmission is requested

in a form that cannot directly encode all characters of the document. ISO 10646, code-

by-code identical to Unicode, is very likely to become the document character set for

HTML [BC95, work in progress]. The document character set mainly defines the inter-

pretation of numeric character references (&#nnn;, where nnn is the decimal repre-

sentation of the character); it will take some more time until it is accepted as the default

character encoding for documents that go beyond ISO 8859-1 (Latin-1).
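
The use of numeric character references for transmission in ISO 8859-1 can be illustrated with the Python standard library. The two helper function names are assumptions; the `xmlcharrefreplace` error handler and `html.unescape` are existing standard-library features.

```python
import html


def to_latin1_with_references(text):
    """Encode as ISO 8859-1, writing other characters as decimal references."""
    return text.encode("latin-1", errors="xmlcharrefreplace")


def from_references(data):
    """Decode ISO 8859-1 bytes and resolve character references to characters."""
    return html.unescape(data.decode("latin-1"))
```

The string "Zürich α" is sent with "ü" as the single Latin-1 byte 0xFC and "α" as the reference `&#945;`, whose number is interpreted in the document character set rather than in the transmission encoding.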

4.2 CJKV Glyph Disambiguation

A font set can provide a simple and efficient solution to choose between different

glyph variants for the typographic traditions of East Asia. The basic idea for this solu-

tion is due to Glenn Adams [personal communication, 1994]. Chinese (traditional or

3. For ideographs, the Geta mark (U+3013) is available for historic reasons.

simplified), Japanese, Korean, and historical Vietnamese4 all use Han ideographs.

These characters have been unified in Unicode, as this has been done for all other

scripts. Because of the great number of characters and the variety of similarities and

historical developments, an explicit, well-defined model [Uni92, p. 15] was intro-

duced, based on criteria from existing Japanese standards. This model assures basic

readability and round-trip conversion. However, for high quality rendering, different

glyphs are frequently necessary even in the same typeface [Lun94], although the main-

stream typefaces are different for each typographic tradition.

With a small amount of lookahead, disambiguation is easily possible in an appro-

priately configured font set. If fonts for the corresponding local standards are used,

disambiguation will come at no additional programming cost. A font of Japanese origin,

containing the glyphs of the Japanese standard JIS X 0208 [Lun93], will on average

not be able to render more than two or three subsequent characters in a Chinese text,

and vice versa.
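
The lookahead idea can be sketched with two toy coverage sets. These sets are tiny assumptions chosen for the example and are not the repertoires of JIS X 0208 or of any Chinese standard; the names `FONTS`, `covered_prefix`, and `pick_font` are likewise hypothetical.

```python
KANA = {chr(cp) for cp in range(0x3041, 0x3097)}  # hiragana block (toy coverage)

# Toy coverage sets: a few ideographs per font, for illustration only.
FONTS = {
    "japanese": set("日本語文字読書中国漢是") | KANA,
    "chinese": set("日本中国文字这是汉语简体读书"),
}


def covered_prefix(chars, text):
    """Number of leading characters of `text` that `chars` covers."""
    n = 0
    while n < len(text) and text[n] in chars:
        n += 1
    return n


def pick_font(text, fonts, lookahead=6):
    """Choose the font covering the longest prefix of the lookahead window.

    Ties are resolved in favour of the font listed first.
    """
    window = text[:lookahead]
    return max(fonts, key=lambda name: covered_prefix(fonts[name], window))
```

Here "日本語の文字" selects the Japanese font and "这是汉语" the Chinese one, whereas a short name such as "中国" is covered by both and simply falls to the first font, matching the limitation for very short pieces discussed in the text.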

In general, the following features can be used for disambiguation: First, simplifica-

tions above a certain degree have separate codepoints in Unicode; these are used in

distinguishing simplified Chinese and Japanese. Second, additional characters have

been invented locally; this applies for Vietnamese and Japanese. Third, the frequency

distribution of certain characters is widely differing, to the extent that some characters

are well-used somewhere, whereas they are obsolete somewhere else. Fourth, phonet-

ic scripts of different nature complement the ideographs; this applies for Japanese (Ka-

na) and Korean (Hangul).

This scheme works correctly for single-language documents and documents with

different languages in each paragraph or line. On the other hand, very short foreign

language pieces, such as person or place names, may not be detected, but in these cas-

es, native glyphs are used anyway in standard typographic practice, such as newspa-

pers. More problems arise with isolated entries in menus and similar places, where

localization mechanisms [Dür95] can provide appropriate solutions.

The integration of CJKV glyph disambiguation into the general font selection mech-

anism by means of a font set demonstrates the general usefulness of composite fonts,

assures that glyph disambiguation is available for a wide range of cases automatically,

and hopefully removes some of the mostly unsubstantiated concerns against Unicode

in this point.

4.3 Character Replacement

Besides trying to get the best possible result with the available font resources, compos-

ite fonts can also easily be used for more fancy effects. Transliteration, e.g. from

Cyrillic to Latin, can be a last resort solution when a Cyrillic font is not available, but

it can also be a means to display phonetically readable characters to a user who does

not know the Cyrillic alphabet, even if corresponding glyphs are available. Transliteration

is in many cases not very easy, and may interfere with the user's interest of seeing

the base data, so that on-demand transliteration into a separate window, and implementation

of transliteration as a higher-level text-changing process, similar to spell

checking, hyphenation, and so on, may be more desirable. Nevertheless, that font composition

can be used in this context shows the versatility of this concept.

4. Contemporary Vietnamese does not use ideograms any more. Also, Unicode currently does

not cover some ideograms particular to Vietnamese; their addition is planned for a future version.

Another application of composite fonts is the visualization of usually invisible control
characters. Such a function, visualizing tabulators, paragraph breaks, and so on, is
available in many text editors. Both an implementation in the application code and an
implementation in the main rendering code are complicated and can be rather slow. With
composite fonts, the solution is to prepend a special font to the font cascade set by
the user so that the characters are remapped appropriately.
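
The prepending idea can be illustrated with a minimal sketch. All names here (Mapper,
AsciiFont, ControlPictures, Cascade, and the font names) are hypothetical stand-ins chosen
for illustration and are not the ET++ classes; the choice of an arrow for tab and a
pilcrow for newline is likewise an assumption.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; these are not the ET++ classes.
struct Glyph {
    std::string font;  // name of the base font that supplies the glyph
    char16_t code;     // glyph index (here simply a code point in that font)
};

struct Mapper {
    virtual ~Mapper() = default;
    // Returns true and fills `out` if this mapper can render `c`.
    virtual bool Map(char16_t c, Glyph& out) const = 0;
};

// Base font covering printable ASCII one-to-one.
struct AsciiFont : Mapper {
    bool Map(char16_t c, Glyph& out) const override {
        if (c < 0x20 || c > 0x7E) return false;
        out = Glyph{"Helvetica", c};
        return true;
    }
};

// Special font that remaps tab and newline to visible symbols.
struct ControlPictures : Mapper {
    bool Map(char16_t c, Glyph& out) const override {
        if (c == u'\t') { out = Glyph{"Symbols", u'\u2192'}; return true; }  // arrow
        if (c == u'\n') { out = Glyph{"Symbols", u'\u00B6'}; return true; }  // pilcrow
        return false;
    }
};

// A cascade asks its members in order; the first one that can map wins.
struct Cascade : Mapper {
    std::vector<std::shared_ptr<Mapper>> members;

    void Append(std::shared_ptr<Mapper> m) {
        assert(m != nullptr);
        members.push_back(std::move(m));
    }
    void Prepend(std::shared_ptr<Mapper> m) {
        assert(m != nullptr);
        members.insert(members.begin(), std::move(m));
    }
    bool Map(char16_t c, Glyph& out) const override {
        for (const auto& m : members)
            if (m->Map(c, out)) return true;
        return false;
    }
};
```

Because the special font sits at the front of the cascade, neither the application code
nor the rendering loop changes when the visualization is switched on; removing the first
member switches it off again.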

5 Implementation

The font composition and selection scheme described above has been successfully im-

plemented as part of the ongoing effort of globalizing the application framework ET++

[WGM89]. An application framework is a collection of cooperating object-oriented

software components providing most of the functionality for applications with graph-

ical user interfaces. ET++ itself has pioneered many object-oriented concepts

[GHJV95], comes with a wide variety of sample implementations [WG94] and multi-

media extensions [Ack94], and has also been used in commercial applications.

The current effort of globalizing ET++, known under the name of UNET++ (pronounced
"unity++"), is aimed at researching new concepts in the area of multiscript processing
and localization while exploiting the features of application frameworks and
demonstrating the special suitability of an application framework for the
implementation of multiscript support.

5.1 Basic Multiscript Support in UNET++

Here, we give only a short overview of the basics of UNET++; a detailed description is
available in [DW94]. A solution for compiler support for wide string constants is
described in [Dür94]. All characters and character strings in UNET++ are uniformly
encoded in Unicode, but for storage efficiency, both wide (16-bit) and narrow (8-bit)
string implementations are used, which are hidden from the application programmer by a
common String class. A flexible Mapping class is responsible for on-demand loading and
efficient storage of conversion tables. A wide range of Converters has been implemented
for easy conversion from and to external character encodings.

The class KeyboardFrontend handles keyboard input conversions ranging from simple
remapping to aggregation and server-based input conversion for East Asian languages.
KeyboardFrontends are composable in much the same manner as composite fonts, to achieve
flexibility with a small selection of basic components. The various input methods can be
selected from a Keyboard menu to the left of the help menu, in a similar way to the
keyboard menu icon on the Macintosh [App92]. From the same menu, a dialog is available
that allows the language of the user interface to be changed at runtime, either for the
whole application or for certain windows individually [Dür95]. Even in this case,
although this is a major source of programming work in other approaches (see [ODo94],
p. 281), the inherent advantages of the application framework mean that in the general
case the programmer does not have to change a single line of her code to make the
application localizable.
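
How a common String class can hide the wide and narrow storage forms mentioned above may
be sketched as follows. This is not the UNET++ String class; the member names and the
rule of switching to 16-bit storage on the first character above U+00FF are assumptions
made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch: every character is a Unicode code unit, but storage stays
// 8-bit for as long as all characters fit into one byte.
class String {
public:
    void Append(char16_t c) {
        if (!wide_ && c > 0xFF) Widen();
        if (wide_)
            wideData_.push_back(c);
        else
            narrowData_.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    }
    char16_t At(std::size_t i) const {
        assert(i < Length());
        return wide_ ? wideData_[i]
                     : static_cast<char16_t>(static_cast<unsigned char>(narrowData_[i]));
    }
    std::size_t Length() const { return wide_ ? wideData_.size() : narrowData_.size(); }
    bool IsWide() const { return wide_; }

private:
    // Converts the narrow buffer to the wide one; called at most once.
    void Widen() {
        wideData_.reserve(narrowData_.size() + 1);
        for (char ch : narrowData_)
            wideData_.push_back(static_cast<char16_t>(static_cast<unsigned char>(ch)));
        narrowData_.clear();
        wide_ = true;
    }

    std::string narrowData_;   // 8-bit storage
    std::u16string wideData_;  // 16-bit storage
    bool wide_ = false;
};
```

Callers only see Append, At, and Length, so application code is the same whether a
string holds Latin text in the narrow form or mixed-script text in the wide form.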

5.2 Text Rendering Classes in ET++

ET++ uses object-oriented abstractions mainly for high-level concepts, such as Text

(the actual text model, including stable information for styles and so on), TextView

(high-level text display and control functionality), TextFormatter (line breaking and

other formatting tasks), TextPainter (intermediate level display and control func-

tionality), and Port (low-level abstraction of the rendering system capabilities).

In contrast to other systems, the text is not directly represented with objects for
individual glyphs (footnote 5) [GHJV95, especially p. 34] or layout components such as
runs of uniform directionality [How94]. The former approach is very flexible as long as
there is a one-to-one relation between characters and glyphs, but is not easily extended
to more complicated cases. The latter approach uses composition (of text from smaller
text portions) for a minor aspect that is unfamiliar to the general programmer and can
change very dynamically.

ET++ stores intermediate formatting information such as line divisions as runs that

apply to a contiguous sequence of characters, similar to font and style information.

Rendering and other operations that convert from internal representation to display

representation work on stretches of text, usually lines, in the TextPainter. This has

the advantage that the necessary operations, which are one of the major performance

bottlenecks of an interactive system, can be implemented with small and fast loops.

5.3 Fonts, Font Families, and Glyph Mappers

In the area of fonts, ET++ uses the classes Font, FontFamily, and FontManager

[Wei92]. A Font originally was a base font with implicit glyph mapping only. For sim-

ple font combinations, we introduced a single-step cascade in the form of one backup

font for each font [DW94]. Now, as described in Section 3.2, a Font has to play both

the role of a composite font and of a base font. This is achieved by relegating the actual

font and glyph selection functionality to a new object, the GlyphMapper. Every Font

can return two different GlyphMappers that correspond to its roles as composite and

base font.

Some readers may think of glyph mapping (in base fonts) and font selection (in

composite fonts) as two rather unrelated tasks. The reason to have them done by one

and the same object is twofold. First, flexible composition requires that both leaf com-

ponents and composite components share the same superclass. Second, the function-

ality, seen on a more abstract level, is indeed the same, namely to map a sequence of

characters to a sequence of font-glyph pairs.

5. Note that the meaning of the term glyph here is different from its use in the field of character

encoding.


The central method of GlyphMapper is TranslateText. It maps an array of characters, or
part of it, and fills an array of font identifiers, an array of glyph indices, and an
array of glyph-to-character indices, which allows correct cursor placement and similar
functions in the case of complex character-to-glyph relations. Each subclass of
GlyphMapper also has to provide a method that decides whether it can map a character.
Figure 1 shows the class hierarchy of GlyphMapper. Figure 2 shows an example of an
instance hierarchy. Methods to find the next character that cannot be rendered in a
given font, or the next character that can again be rendered, can be implemented for
increased efficiency, but default implementations are provided by the base GlyphMapper
class. In addition, there are some methods used on initialization.

GlyphMapper
    BaseMapper    CascadeMapper    CJKVSetMapper
        One2OneMapper    SimpleArabic    ProprietaryArabic

Figure 1: Part of the class hierarchy of GlyphMapper.

[Figure 2 diagram: an instance tree with nodes labelled Cascade (three times), Helvetica,
CJKV Set, Greek, Chinese, Japanese, and Last Resort.]

Figure 2: A configuration of composite and base fonts for Helvetica with generic fonts
for non-Latin scripts.
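
The interface described in this section can be sketched as follows. The class names
GlyphMapper, One2OneMapper, and CascadeMapper and the method name TranslateText come from
the text; everything else (the signatures, the GlyphRun structure, CanMap,
NextUnmappable, integer font identifiers, and the last-resort convention of font -1 and
glyph 0) is an assumption made for illustration, not the actual UNET++ code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Char = char16_t;

// Parallel output arrays with one entry per glyph produced.
struct GlyphRun {
    std::vector<int> fontIds;              // base font supplying each glyph
    std::vector<unsigned> glyphs;          // glyph index within that font
    std::vector<std::size_t> glyphToChar;  // index of the originating character
};

class GlyphMapper {
public:
    virtual ~GlyphMapper() = default;
    virtual bool CanMap(Char c) const = 0;
    // Maps text[from, to) and appends the result to `out`.
    virtual void TranslateText(const std::vector<Char>& text, std::size_t from,
                               std::size_t to, GlyphRun& out) const = 0;
    // Default: index of the first character in [from, to) that cannot be mapped,
    // or `to` if all of them can.
    virtual std::size_t NextUnmappable(const std::vector<Char>& text, std::size_t from,
                                       std::size_t to) const {
        while (from < to && CanMap(text[from])) ++from;
        return from;
    }
};

// Leaf (base font): one glyph per character over a contiguous code range.
class One2OneMapper : public GlyphMapper {
public:
    One2OneMapper(int fontId, Char first, Char last)
        : fontId_(fontId), first_(first), last_(last) {}
    bool CanMap(Char c) const override { return first_ <= c && c <= last_; }
    void TranslateText(const std::vector<Char>& text, std::size_t from,
                       std::size_t to, GlyphRun& out) const override {
        for (std::size_t i = from; i < to; ++i) {
            assert(CanMap(text[i]));
            out.fontIds.push_back(fontId_);
            out.glyphs.push_back(static_cast<unsigned>(text[i] - first_));
            out.glyphToChar.push_back(i);
        }
    }

private:
    int fontId_;
    Char first_, last_;
};

// Composite: hands each stretch of text to the first member that can map it.
class CascadeMapper : public GlyphMapper {
public:
    void Add(std::shared_ptr<GlyphMapper> m) { members_.push_back(std::move(m)); }
    bool CanMap(Char c) const override {
        for (const auto& m : members_)
            if (m->CanMap(c)) return true;
        return false;
    }
    void TranslateText(const std::vector<Char>& text, std::size_t from,
                       std::size_t to, GlyphRun& out) const override {
        std::size_t i = from;
        while (i < to) {
            bool mapped = false;
            for (const auto& m : members_) {
                std::size_t end = m->NextUnmappable(text, i, to);
                if (end > i) {
                    m->TranslateText(text, i, end, out);
                    i = end;
                    mapped = true;
                    break;
                }
            }
            if (!mapped) {  // last resort: font -1, glyph 0 (both hypothetical)
                out.fontIds.push_back(-1);
                out.glyphs.push_back(0);
                out.glyphToChar.push_back(i);
                ++i;
            }
        }
    }

private:
    std::vector<std::shared_ptr<GlyphMapper>> members_;
};
```

Since a CascadeMapper is itself a GlyphMapper, cascades can be nested inside other
cascades, which is what an instance tree such as the one in Figure 2 relies on.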

Font (together with FontManager) is subclassed for each display system, so relegating
font selection to GlyphMapper has the additional advantage of separating
display-system-dependent subclassing from subclassing for the implementation of new font
and glyph selection schemes (this is the Bridge pattern from [GHJV95]). Although it is
possible that a certain form of font selection is necessary or available only on a
single system, in general similar font selection schemes will be used on all display
systems. An additional advantage of separating GlyphMapper and Font is that different
GlyphMapper strategies can be used with the same Font. Different strategies can be used
either in the same backup cascade, especially in connection with last-resort schemes, or
in different cascades, e.g. masking out different parts or amounts of a font depending
on what other fonts it is combined with.

The actual references to GlyphMappers are not stored in Font, but in FontFamily. Whereas
Font represents font instances with individual styles (Roman, Italic, Bold, ...), and
individual sizes for bitmap fonts, FontFamily represents a single consistent typeface
such as Times or Helvetica in all its appearances. Although there might be situations
where a backup cascade or a glyph mapping algorithm is different for different styles
(e.g. an Italic version containing more ligatures than a Roman version of a font), this
is not the general case. If necessary, additional FontFamilies (subfamilies) can be
introduced.

5.4 Lazy Evaluation

Another important aspect of associating GlyphMappers with FontFamilies instead

of Fonts is that loading of actual fonts is delayed, in a kind of lazy evaluation. In the

average case, with complete backup chains but only very few scripts used, this can

save large amounts of system memory.

Lazy evaluation also reduces errors on document transfer. Future document for-

mats may include definitions of composite fonts. Often, a font will be included in a

backup chain as part of a general definition, but not actually used in the document, be-

cause the originator knows very well that some of his local scripts will not be readable

in other locations. A system that tries to build up a full backup chain of Fonts from the

definition in the document will quickly produce an error, whereas this will not happen

if the fonts are evaluated lazily.
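
The effect of lazy evaluation on a backup chain can be illustrated with a short sketch.
The names LoadedFont, Loader, LazyEntry, BackupChain, and MakeLoader are hypothetical, as
is the assumption that a composite font definition states the code range each entry
covers, so that coverage can be tested without loading the font.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical loaded font; a real one would hold glyph data and metrics.
struct LoadedFont {
    std::string name;
};

// Hypothetical loader: throws std::runtime_error if the font is not installed.
using Loader = std::function<LoadedFont(const std::string&)>;

// Example loader over a fixed list of installed fonts; counts how often it runs.
Loader MakeLoader(std::vector<std::string> installed, int& loads) {
    return [installed, &loads](const std::string& name) -> LoadedFont {
        ++loads;
        for (const auto& n : installed)
            if (n == name) return LoadedFont{name};
        throw std::runtime_error("font not installed: " + name);
    };
}

// Chain entry whose coverage comes from the composite font definition;
// the font itself is loaded only when it is first needed.
class LazyEntry {
public:
    LazyEntry(std::string name, char16_t first, char16_t last)
        : name_(std::move(name)), first_(first), last_(last) {
        assert(first_ <= last_);
    }
    bool Covers(char16_t c) const { return first_ <= c && c <= last_; }
    const LoadedFont& Get(const Loader& load) const {
        if (!font_) font_ = std::make_shared<LoadedFont>(load(name_));
        return *font_;
    }

private:
    std::string name_;
    char16_t first_, last_;
    mutable std::shared_ptr<LoadedFont> font_;  // empty until first use
};

class BackupChain {
public:
    BackupChain(std::vector<LazyEntry> entries, Loader load)
        : entries_(std::move(entries)), load_(std::move(load)) {}
    // Name of the font that renders `c`, or an empty string if no entry covers it.
    std::string FontFor(char16_t c) const {
        for (const auto& e : entries_)
            if (e.Covers(c)) return e.Get(load_).name;
        return std::string();
    }

private:
    std::vector<LazyEntry> entries_;
    Loader load_;
};
```

With a chain that names Helvetica, a Greek font, and an uninstalled Tibetan font, a
document containing only Latin text triggers a single load and never reports the missing
font; the error surfaces only if a Tibetan character actually has to be rendered.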

Another small advantage of lazy evaluation can appear depending on how the height of a
text is calculated. Evaluating the height of each character individually is often too
slow, or impossible because the necessary information is not available from the
rendering system. On the other hand, calculating the maximal vertical extension of a
composite font may lead to unnecessarily large line spacing. Doing height calculations
by base font can be a very reasonable solution, as there is not much height variation in
an average base font.
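
The difference between the two height calculations can be shown with a small sketch. The
FontMetrics structure, the function names, and the use of integer font identifiers that
index a metrics table are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-font metrics, in arbitrary units.
struct FontMetrics {
    int ascent;
    int descent;
};

// Height of a line computed from the base fonts it actually uses.
// `fontIds` holds one base-font id per glyph; ids index into `metrics`.
int LineHeightByBaseFont(const std::vector<int>& fontIds,
                         const std::vector<FontMetrics>& metrics) {
    int ascent = 0, descent = 0;
    for (int id : fontIds) {
        assert(id >= 0 && static_cast<std::size_t>(id) < metrics.size());
        ascent = std::max(ascent, metrics[id].ascent);
        descent = std::max(descent, metrics[id].descent);
    }
    return ascent + descent;
}

// Height computed from the maximal vertical extension of the whole composite font.
int LineHeightOfComposite(const std::vector<FontMetrics>& metrics) {
    int ascent = 0, descent = 0;
    for (const auto& m : metrics) {
        ascent = std::max(ascent, m.ascent);
        descent = std::max(descent, m.descent);
    }
    return ascent + descent;
}
```

The per-base-font variant needs only the font identifiers that glyph mapping has already
produced, so it never has to touch the metrics of base fonts that the line does not use.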

5.5 Advanced Features

The parameters passed to GlyphMapper::TranslateText as explained so far can

handle basic font selection and character-to-glyph mapping including many-to-many

cases. This interface can be extended to include more sophisticated features. One line

of extension would be to pass a reference to the overall context, usually the text to be

rendered. This can be used by a font set to obtain whatever additional information is

necessary.


Specifically, language information could be obtained by a font set for use in fully de-

terministic CJKV disambiguation, instead of the more ad-hoc schemes described in

Section 4.2. Language information can also be used for other decisions in highest qual-

ity rendering (see [Bai94] for some examples). However, care should be taken not to

assume the availability of language information throughout a framework, for several

reasons: First, the average user is unwilling to supply language information for every

single bit of text. Second, languages do not form a single level partition of the idioms

spoken and written; there are hierarchies as well as mixtures. Third, because there is a

constant exchange of new words between languages, it is often difficult to decide to

what language a word belongs. Fourth, language is often not the information needed

for rendering purposes. The distinction between traditional and simplified Chinese,

for example, is only marginally a distinction of language; it is much more a distinction

of writing system and typographic tradition. On the other hand, the difference be-

tween Mandarin and Cantonese is a language difference, but is not relevant for CJKV

glyph disambiguation.

Passing a general context parameter might be difficult because of the variety of
contexts; a full-fledged Text and a simple String do not have common access methods.
Instead, more specific parameters can be passed, and corresponding arrays can be filled.
A first possibility is to pass a size parameter, which can be used in various composite
fonts to choose fonts of different nominal, but matching visual, size. If an array of
glyph positioning offsets is also provided, baseline adjustment between different base
fonts and sophisticated diacritics processing (Arabic) can be integrated, and scripts
that stack characters vertically, such as Tibetan, could use a single glyph for the same
shape at different vertical offsets.

Another problem is the passing of context before and after the characters that actu-

ally should be rendered for ligaturing scripts. The solution we are currently pursuing

is to allow TranslateText to look at characters before and after those it has to trans-

late, and to limit this range with null characters. This is not only a problem of program-

ming technique, but also of desired functionality. Whereas there is no reason to break

context-dependent rendering on a change of attributes such as color, for a font or size

change this is less clear because it may not be possible to link the characters nicely.
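
One way to read the null-character convention is sketched below. The function name
ContextRange, the Range structure, and the requirement that the buffer starts and ends
with a null character are assumptions for illustration; the text only states that null
characters limit how far TranslateText may look.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct Range {
    std::size_t begin;  // first index of the range
    std::size_t end;    // one past the last index
};

// Returns the stretch of `text` that a mapper may inspect while translating
// text[from, to): it grows outward from the translated range until a null
// character is met on each side. Assumes (hypothetically) that the caller places
// a null character at the start and at the end of the buffer as sentinels.
Range ContextRange(const std::u16string& text, std::size_t from, std::size_t to) {
    assert(text.size() >= 2 && text.front() == u'\0' && text.back() == u'\0');
    assert(1 <= from && from <= to && to < text.size());
    std::size_t begin = from;
    while (text[begin - 1] != u'\0') --begin;
    std::size_t end = to;
    while (text[end] != u'\0') ++end;
    return Range{begin, end};
}
```

Under this reading, the caller decides where context-dependent rendering must stop, for
instance by placing a null character at a font change but not at a color change, and the
mapper itself needs no knowledge of text attributes.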

6 Comparisons

Adobe provides a wide variety of functionality for its PostScript language and printers.
Type 0 fonts [Ado90], available in Level 2 implementations, also provide composite and
base fonts. Although the terms coincide, the functionality is different. The main aim is
to interpret a stream of bytes to select characters from base fonts limited to a size of
256 glyphs. Switching schemes or font arrays as defined in Section 3.3 are possible. The
newer CID/CMAP technology [Lun94] in addition allows easy sharing of common glyphs,
e.g. between horizontal and vertical versions of a font.

In both cases, more sophisticated font selection schemes, e.g. using lookahead, are

not possible. This may not be that important because it can be assumed that the appli-

cation decides on exactly what glyphs from what fonts should be used. On the other


hand, as PostScript is a full programming language, it is also possible to send a pro-

gram to the printer that implements the composition techniques described in this pa-

per. In the ET++ PostScript PrinterPort, we will most probably follow the former

alternative, but we are currently waiting for fonts from various scripts to become avail-

able at low cost.

Release 5 of version 11 of the X Window System provides the concept of a FontSet, which
corresponds again to a font array in our terminology, and is linked to the
interpretation of character strings in a locale-dependent manner [Fla91]. The Unicode
environment on AIX, based on XPG4, contains a universal font set, with a fixed
association of characters to fonts and subsequent glyph mapping [Kun94]. Plan 9 [PT93,
PTH94] also implements font arrays, mainly for sharing fonts of rare scripts, and in its
8½ window system contains a very efficient caching mechanism. Unfortunately,
character-to-glyph mappings seem to be limited to one-to-one.

Mule, the multilingual extension of Emacs [NHT93], uses fontsets to assign fonts to
character sets. A character set links, in a fixed way, a character tag to the number of
bytes for internal representation, the character width in display cells, and the
character encoding. Proportional display is not possible; it is simulated in the case of
Arabic by separating the Arabic glyph set into wide and narrow glyphs and using two
fonts and two encodings. Some backup is provided by using a font from the default
fontset if the corresponding font is not available.

The Macintosh uses fonts to determine scripts, which comprise encoding interpre-

tation, display behaviour, and associated input software [App92, App93]. Conforming

applications can easily handle multiscript text, but as the concept of a composite font

is lacking, in mixed text frequent explicit font changes are necessary. QuickDraw GX

[App94] provides a very wide range of typographic functionality with many ways to

control glyph selection and layout, but the function-based interface results in a rather

static approach.

7 Conclusions and Future Work

The concepts of font selection and font composition presented in Sections 3 to 5 of this
paper largely satisfy the requirements put forward in Section 2. The flexible and
expandable architecture we have designed and implemented makes it possible to address
issues on many different levels of typographic quality, leads to an efficient use of
resources wherever necessary, and gives the programmer and the end user the benefits of
a Unicode font while avoiding its problems. The architecture also makes it easy to
integrate solutions for the missing glyph problem or CJKV glyph disambiguation, as well
as high-quality proprietary rendering algorithms.

To fully exploit the advantages of our approach, we plan to implement some new

GlyphMappers, e.g. for Indic scripts or Tibetan, and to further expand the pragmatic

aspects, e.g. GUI support for virtual font construction. Ideally, there should be some

program code with typographic intelligence, advising the user on optimal configu-

rations or creating them automatically, but as long as there is not even a consistent

metric for character height, this idea remains largely a dream.


With respect to multilingualization of ET++ in general, working solutions are available
for most major aspects (input, conversion, display, ...), and existing ET++ applications
as well as new applications successfully use these concepts. The main areas where work
still has to be done are the Unicode bidirectionality algorithm, search with regular
expressions, a Unicode OS interface, character properties, collation, and so on. In
addition, programming for the world means that there are always new scripts, languages,
and cultures that require some adaptations or additions.

The multilingualization effort of ET++, known as UNET++, has for some time now been
carried out separately from the main versions of ET++, but the features described in
this and previous papers are gradually being integrated back into the main version of
ET++. In the meantime, we would be willing to share our code with other researchers and
developers interested in collaboration and experimentation. Interested parties should
contact the first author.

Acknowledgements

The first author thanks André Weinand for his continuous cooperation, Glenn Adams for
his advice on Vietnamese, and Peter Stucki for his continuous support. The second author
wishes to acknowledge that this work was jointly funded by Alis Technologies Inc. and
the Canadian government's CANARIE program.

References

[Ack94] Ph. Ackermann, Direct manipulation of temporal structures in an object-oriented
multimedia application framework, in: ACM Multimedia 94 Conference Proceedings, ACM,
1994.

[Ado90] Adobe Systems Incorporated, PostScript Language Reference Manual, Second
Edition, Addison-Wesley, Reading, MA, 1990.

[App92] Apple Computer, Inc., Guide to Macintosh Software Localization, Addison-Wesley,
Reading, MA, 1992.

[App93] Apple Computer, Inc., Inside Macintosh: Text, Addison-Wesley, Reading, MA, 1993.

[App94] Apple Computer, Inc., QuickDraw GX Typography, Addison-Wesley, Reading, MA,
1994.

[Bai94] B. Bailey, Unicode as a glyph identification system, Proc. Unicode Implementers
Workshop 6, Unicode, Inc., San Jose, CA, 1994.

[BC95] T. Berners-Lee and D. Connolly, Hypertext Markup Language 2.0, Internet-Draft,
June 16, 1995. (available as
ftp://nic.nordu.net/internet-drafts/draft-ietf-html-spec-04.txt)

[BH93] Ch. Bigelow and K. Holmes, The design of a Unicode font, Electronic Publishing:
Origination, Dissemination, and Design, Vol. 6, No. 3, Sept. 1993 (Proc. RIDT 94),
pp. 289-305. (Also contained in Proc. Unicode Implementers Workshop 6, Unicode, Inc.,
San Jose, CA, 1994.)

[Dür93] M.J. Dürst, Coordinate-independent font description using Kanji as an example,
Electronic Publishing: Origination, Dissemination, and Design, Vol. 6, No. 3, Sept. 1993
(Proc. RIDT 94), pp. 133-143.

[Dür94] M.J. Dürst, Uniprep: Preparing a C/C++ Compiler for Unicode, ACM SIGPLAN
Notices, Vol. 29, No. 1, Jan. 1994, p. 53.

[DW94] M.J. Dürst and André Weinand, Introducing Unicode into an Application Framework,
Proc. Unicode Implementers Workshop 6, Unicode, Inc., San Jose, CA, 1994.

[Dür95] M.J. Dürst, Localization Facilities for ET++, Proc. ET++ Workshop on Developing
Building Blocks and Frameworks, Dept. of Computer Science, University of Zurich,
Switzerland, July 1995.

[Fla91] D. Flanagan, Programmer's Supplement for Release 5, O'Reilly & Associates, Inc.,
Sebastopol, CA, 1991.

[GHJV95] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of
Reusable Object-Oriented Software, Addison-Wesley, Reading, MA, 1995.

[HP95] Y. Haralambous and J. Plaice, Ω + virtual METAFONT = Unicode + Typography, to
appear in Cahiers GUTenberg, 1995.

[How94] W.H. Howry, Bidirectional text in an object oriented environment, Proc. Unicode
Implementers Workshop 6, Unicode, Inc., San Jose, CA, 1994.

[Kun94] M. Kung, Unicode and XPG4, Proc. Unicode Implementers Workshop 6, Unicode, Inc.,
San Jose, CA, 1994.

[Lun93] K. Lunde, Understanding Japanese Information Processing, O'Reilly & Associates,
Inc., Sebastopol, CA, 1993.

[Lun94] K. Lunde, Creating Fonts for the Unicode Kanji Set: Problems & Solutions, Proc.
Unicode Implementers Workshop 6, Unicode, Inc., San Jose, CA, 1994.

[MB93] C.D. McQueen III and R.G. Beausoleil, Infinifont: a parametric font generation
system, Electronic Publishing: Origination, Dissemination, and Design, Vol. 6, No. 3,
Sept. 1993 (Proc. RIDT 94), pp. 117-132.

[NHT93] M. Nishikimi, K. Handa, and S. Tomura, Mule: MULtilingual enhancement to GNU
Emacs, Proc. INET 93 (Internet Workshop 93). (available as
ftp://etlport.etl.go.jp/pub/mule/papers/INET93.ps.gz)

[ODo94] S.M. O'Donnell, Programming for the World: A Guide to Internationalization,
Prentice Hall, Englewood Cliffs, NJ, 1994.

[PT93] R. Pike and K. Thompson, Hello World or Καλημέρα κόσμε or こんにちは 世界,
Proceedings of the Winter 1993 USENIX Conference, USENIX Association, Berkeley, CA,
1993, pp. 43-50. (Also contained in Proc. Unicode Implementers Workshop 6, Unicode,
Inc., San Jose, CA, 1994.)

[PTH94] R. Pike, K. Thompson, and H. Trickey, Unicode in Plan 9, Proc. Unicode
Implementers Workshop 6, Unicode, Inc., San Jose, CA, 1994.

[Sat95] K. Sato, Class Libraries Unrestricted: Introduction to Application Frameworks
and Design Patterns, Toppan, Ltd., Tokyo, Japan, 1995 (in Japanese).

[Uni92] The Unicode Consortium, The Unicode Standard: Worldwide Character Encoding,
Version 1.0, Volume 2, Addison-Wesley, Reading, MA, 1992.

[WGM89] A. Weinand, E. Gamma, and R. Marty, Design and Implementation of ET++, a
Seamless Object-Oriented Application Framework, Structured Programming, Vol. 10, No. 2,
1989, pp. 63-87.

[Wei92] A. Weinand, Objektorientierte Architektur für graphische Benutzeroberflächen (in
German), Springer-Verlag, Berlin, 1992.

[WG94] A. Weinand and E. Gamma, ET++: a Portable, Homogenous Class Library and
Application Framework, Computer Science at UBILAB: Strategy and Projects (Proc. UBILAB
Conference 94, Zurich), W.R. Bischofberger and H.-P. Frei, Eds., Universitätsverlag
Konstanz, Konstanz, 1994, pp. 66-92.
