The FASTY Language Component

Overall Specification

FASTY's core prediction functionality is provided by a statistic language model based on word n-grams and part-of-speech tag n-grams in conjunction. Moreover, the possibility to create user specific dictionaries both during a session and on the basis of previous entered texts, serves as a method for further increasing the prediction accuracy. Further, the use of topic specific lexica will be considered.

The variety of languages to be supported and methods to be integrated into the FASTY system demands a modular architecture. The combination and integration of prediction components needs to be handled in a very flexible way, for a variety of reasons:

Different languages may put different emphasis on different modules, so it must be possible to arrange these modules in a different way.
The effects of the different prediction methods are not yet known precisely, experimental adjustments as well as parameter tuning have to be possible.
Different application scenarios (varying from text writing up to spontaneous dialogs) may require different combination and weighting of components.
Adaptation to users with different degrees and types of disabilities will also be required.

Thus the backbone of the linguistic prediction component of the FASTY system will be a controller that is flexibly driving the different prediction modules and combining their results. Thus it will be easy to optimise the overall prediction behaviour and also adaptation of FASTY to another language without modifying the whole system will be made possible.

Methods used in existing prediction systems

When one considers methods for saving keystrokes in text typing, one has to differentiate between keystroke-saving methods in the UI and methods involving linguistic or statistical prediction of words, word sequences and phrases. Here we will put our emphasis on the latter. However, for the sake of completeness a short listing of methods belonging to the User Interface side will be given.

automatically inserting a space after every predicted word accepted by the user. This method compensates for the extra keystroke needed for selecting the prediction.
automatically removing preceding whitespace immediately before punctuation characters (and inserting the appropriate amount of spaces afterwards). This method complements the previous one, as the need for the user to backspace the automatically inserted whitespace is alleviated.
auto-capitalisation. This method in fact also needs at least some linguistic knowledge, it is listed here just because of the requirement to be able to change characters the user already has typed. Auto-capitalisation may occur after sentence ending periods, on words recognised as proper names or (in some languages, e.g. German) on nouns in general.

String-based statistical methods

All systems on the market that we are aware of use some kind of frequency statistics on words and (sometimes) word combinations. Given a prefix of a word, with a frequency annotated lexicon the most probable continuation(s) of that word can be retrieved easily. Sometimes, not only word-based frequency counts are maintained (unigrams), but also bigrams and even trigrams are used for enhancing prediction accuracy. N-gram language models are widely used in speech recognition systems, and their benefits are also exploited in some predictive typing systems. The key observation behind this kind of models is, that the probability of a word occurring in the text depends on the context.

Syntactically motivated statistics

The superiority of n-gram based predictions over simple frequency lexicons stems from the fact, that n-grams are able to capture some of the syntactic and semantic regularities intrinsic to language. However, a severe drawback of word- based n-grams is, that, even with very large training texts, the data still is rather sparse, and thus in many actual cases during prediction no information is available. The usual technique to cope with syntactic regularities uses class- based n-grams (usually n=3), the classes being defined by the part-of-speech information of a tagged corpus. Copestake reports on an improvement in KSR of 2.7 percent points by just taking PoS-bigrams into account.

Capturing semantics with statistics

For a human language user it is obvious that in a given context some words are more probable than others just because of their semantic content. Factors influencing word probability due to semantics are (among others):

the user and the type and topic of the text s/he writes (global factors)
constraints due to the lexical semantics of words (e.g. subcategorisation requirements); these are local factors that mostly operate on sentence level.

Collocation analysis (in a broader sense, not reduced to idioms only) can reveal some of these dependencies. However, very large corpora are needed. Rosenfeld uses the concept of "trigger pairs" to capture these relationships statistically (basically these are bi-grams of words occurring together in a window of a certain size in a corpus). If a word that has been recently entered occurs in such a trigger pair the probability of the other word of the pair should be increased. Recency, as implemented as a heuristic in some prediction systems, can be seen as a self trigger and is a (rather crude) measure to exploit semantic or topical appropriateness of a word.

Rule-based approaches

Several methods of integrating grammar rules into statistics based prediction have been tried, but none of them had made it into a commercially successful product. Such integration, however, is seen as a major challenge in the FASTY system.

Linguistic components and resources for text prediction

Basic to our approach is the modular architecture of our system. In addition to the flexibility such an approach provides for the adaptation to different languages, application scenarios and users - as described in the introduction - it also ensures robustness and graceful degradation in the case one module should be missing or fail. Furthermore, this type of architecture allows for the possibility of exploring various more advanced - and albeit more risky - methods without endangering the successful implementation of the language component in case some of these methods should not prove successful.

The core of the system will be a module based on the prediction of word forms due to their absolute frequency and the probability of their associated PoS. Such a module is state-of-the-art and guarantees a basic performance. A number of other methods to improve prediction quality will be investigated. All methods will be evaluated with respect to their performance for different target languages and language specific phenomena (e.g., compounding). Those that prove to be successful for one or more of the target languages will be integrated with the core component - either alone or in combination with others.

General word n-gram-based Prediction

As stated above, predictions based on frequencies of word sequences are usually more reliable than predictions solely based on simple word frequencies. Frequency tables of word bigrams are thus used as a base in the FASTY language model. As is customary, the bigram model is however supplemented by simple word frequencies. This is due to that no matter how much data is used for extracting bigram frequencies, there will always be a problem of sparse data - most bigrams will have low frequencies and many possible word sequences will not be attested in the training material. A common solution, also implemented in the FASTY language model, is to interpolate the probabilities obtained from using larger n-grams (here: bigrams) with the probabilities acquired from smaller n-grams (here: unigrams, i.e. simple word frequencies).

During the second project year, software has been developed for easy extraction and compilation of word form uni- and bigrams. By means of this software, frequency lists have been extracted for all FASTY languages from large collections of text (corpora comprised of several millions of words).

Part-of-Speech n-gram-based Prediction

The FASTY language model is further based on part-of-speech frequencies. Since a part-of-speech tag captures a lot of different word forms in one single formula it is possible to represent contextual dependencies in a smaller set of n-grams. One major advantage of making use of part-of-speech frequencies is thus that the problem of sparse data is reduced and a larger context may be taken into consideration. The FASTY language model uses frequencies of part-of-speech tag trigrams, which are supplemented with frequency data of smaller part-of- speech n-grams (uni- and bigrams).

As for the word n-grams, part-of-speech n-grams have been collected for all FASTY languages during the second project year. Special software was designed to extract the relevant data from large corpora and to compile it in a uniform format suitable for integration with the rest of the language components. The corpora used have been tagged, i.e. all word forms have been annotated with part-of-speech information.

User- and Topic-specific n-gram-based Prediction

The FASTY language model is adjustable to the language of specific users in two respects. It applies short-term learning whereby the words from the current text are dynamically added to user-specific uni- and bigram frequency lists. During prediction the user-specific frequency lists and the general frequency lists are combined using relative weights. In this way, new words that are repeated (e.g. proper names) can be predicted on their second occurrence. Further, the system will provide for long-term learning, by making it possible to permanently save changed user dictionaries from time to time.

Topic-specific words and expressions will be possible to generate from previously written, and electronically readable, texts.

The use of several user- and topic-specific lexicons at the same time will be allowed and all activated lexicons will be searched. Words and expressions with highest probabilities (naturally occurring as in the case of word bigrams, or adjusted as in the case of longer expressions and words coming from user- and topic-specific lexicons) will be offered in a prediction list.

Morphological processing and Backup Lexicon:

As stated above, the part-of-speech n-grams provide means to account for larger contexts by representing the distribution of word forms at a generalized level. For this to work though, the language model requires information about the part-of-speech of each word form. Further, the grammar module, described below, bases its analysis on a morpho-syntactic description of the input word forms. Put together, the FASTY language model needs some kind of lexicon, providing all relevant information. Retrieving information from a huge lexicon may be very time-consuming, even though it is done automatically. Special care has therefore been taken to provide a storage format that is easy to search and further, that makes it possible to compress the enormous amount of data to a manageable size. Such an implementation was provided as a prototype at the initiation of the project. During the second project year it has been upgraded and adjusted to suit the FASTY language component. Further, the morphological data required has been gathered for all FASTY languages. This was done with very different tools, whatever was available for each language.

Abbreviation Expansion

Abbreviation expansion is a technique in which a combination of characters, an abbreviation, is used to represent a word, a phrase or a command sequence. Whenever the user types a predefined abbreviation, it is expanded to the assigned word, phrase or command sequence. During the second project year the abbreviation module has been integrated in the prototype language component. The integrated version has the following basic functionality:

given a prefix, all abbreviation codes starting with that prefix are returned
given an abbreviation, the expansion string is returned
given an abbreviation and its expansion the system stores that abbreviation in an abbreviation table

Lists of useful abbreviations are to be entered by the user or its care person.

Grammar checking as a filter of suggestions

The FASTY language component is further based on a grammar-checking module. While the n-gram based prediction modules never consider contexts exceeding a limited number of words, the grammar-based module may take an arbitrarily large sentence fragment into consideration. The grammar module does not by itself generate any prediction suggestions, rather it filters the suggestions produced by the n-gram model so that the grammatically correct word forms are presented to the user prior to any ungrammatical ones.

Input to the grammar module is a ranked list of the most probable word forms according to the other language components. The grammar module will assign a value to each word form based on whether the word form is confirmed (grammatical), turned down (ungrammatical) or outside the scope of the grammar description. Based on these three values the word forms are then reranked whereby the grammatical suggestions are ranked the highest and the ungrammatical are ranked the lowest. Since only a subset of the reranked suggestions will be presented to the user, the lowest ranked word forms will not be displayed. This way grammatically impossible suggestions will hopefully not be presented at all, leaving room for possibly intended continuations.

The grammar descriptions are however not complete but cover selected constructions that are identified as crucial from a prediction point of view for the individual languages of the project. Typically, they contain constructions with fairly fixed word order and feature constraints. Examples of such constructions are nominal phrases, prepositional phrases and verbal clusters. Sentence initial position and the placement of the finite verb are further focal points. Grammar rules have been written for all FASTY languages, albeit grammar rules will be further developed during the third project year.

For a system to analyze input texts in relation to a grammar description, special software is required; a parser. This was supplied by the Swedish partner at the initiation of the project. In most applications though, a parser takes whole language structures as input (usually sentences). In the context of word prediction, the parser must allow for language structures that are about to be produced and thereby only are fragmentary. In other words the parsing process must be step-wise and there must be means to dynamically output the analysis made so far. During the second project year, the parser has been adjusted to these conditions and made compatible with the rest of the FASTY system.

Compound prediction

In three of the FASTY languages: German, Dutch and Swedish, compounds constitute a group of words that is particularly hard to predict within a word prediction system. In these languages compounds can be productively formed to fill a contextual need. It is of course impossible to predict such a word formation by means of traditional n-gram frequency counts. On the other hand, compounds tend to be long words, which means that successful prediction would save a great deal of keystrokes. Within the FASTY language model, compounds have hence been given a special treatment. Since compound prediction is a true innovation in word prediction systems, the way how to infer new compounds from the input evidence, has been subject to research. The solution has been not to predict a productively formed compound as a whole, but to predict its parts separately. More specifically, the current implementation supports the prediction of right-headed nominal compounds, since these, according to a corpus study of German corpus data, are by far most common.

The split compound model provides two quite different mechanisms for predicting the respective parts of a compound, i.e. modifier (the left hand side of a compound) prediction and head prediction (the right hand side of a compound). Below we will give a simplified description of how the model functions. Since the system has no means of knowing when a user wants to initiate a compound, the prediction of modifiers is integrated with the prediction of ordinary words. If the user selects a noun that has higher probability of being a compound modifier, the system assumes this use was intended and starts the prediction of the head part instead of inserting the default white space after the selected noun. The head of a compound determines the syntactic behaviour, and the basic meaning of the compound as a whole. Hence, we may expect a productively formed compound to appear in the same type of contexts as the head does when it functions as an independent word. When predicting the head, the system therefore makes use of the word preceding the modifier, as if the modifier wasn't there. Assume the user has written en god äppel (a tasty apple), and intends to write en god äppelpaj (a tasty apple pie). When searching for possible compound continuations, the system will then search for bigrams with the first position held by god, and if the training corpora contained instances enough of the sequence god paj, paj is suggested as a possible head of the compound. Further, the head prediction model gives precedence to words that, in the training material, functioned as heads in many compounds. According to studies of German and Swedish compounds, some words occur much more often in compounds as heads, than other words. A secondary feature that has been explored, is the semantic relation between the modifier and the head. If the semantic class of the modifier is known (for instance apple above may be assigned to a class containing fruits and berries), this information may be used to search for probable heads (following the given example these may be words belonging to classes of baked and cooked things).

During the second project year this approach has been settled and specified, software has been implemented to extract the statistical material and the statistical data has been gathered for German, Dutch and Swedish. The semantic classes have been approximated by automatically extracting information on how words co-occur in corpora, and the probabilities of semantic classes to join in compounds have been estimated on the bases of how they co-occur in attested compounds. The semantic classes have, so far, only been obtained for German and Swedish, but a tool for automatically deriving these statistics has been implemented.

Interaction of components and control structure

The operation of the Language Component (LC) is driven by the User Interface (UI). Depending on the request type, the LC returns one or more values as a response to the UI (e.g. the n most likely predictions).

A central part of the interaction between LC and UI is the Context Box. The role of the context box is to provide a repository for the context data needed by the prediction component. The context box contains textual data that may be manipulated by the UI via interface functions and LC-internal data structures that are not visible to the outside. The size of the context box is limited, it is only big enough to store the context needed by the predictor. It is not intended to be a cache of the whole file the user is editing. (If such caches are needed for some reason they should be handled separately by the user interface).

To manipulate the content of the context box, interface functions are provided to:

extend the context box by a character (or a string)
remove the last n characters from the context box. Note that for syntactic constraints on prediction the current sentence is sufficient as context.
clear the context box
replace the whole context by a string
insert an accepted prediction into the context box. This amounts to removing the prefix and inserting the selected string.

At any time the content of the context box changes, the relevant portion of it's text buffer is (re)tokenized.

The Controller receives requests from the User Interface and is responsible for:

Extracting the input data required for the different prediction components from the Context Box
Selecting which prediction components to use (depending on the current parameter settings)
Feeding the different prediction components with the appropriate input
Possibly enriching the Context Box with data returned from some components (e.g. part-of-speech information)

The Prediction Generator receives the predictions made by the different components, together with their probabilities, and combines them to a prediction list which is delivered to the User Interface. How the Prediction Generator comes up with the combined prediction list depends on:

Parameter settings, which may be user and language specific. E.g. compound prediction may not be useful in French and every user may want to use his or her own abbreviation table.
Interpolation weights that have to be determined empirically.

Each of the components relies on language specific resources, where some are shared between different components. Also the possibility exists, that a component uses the results of other components, e.g. grammar-based prediction uses compound analysis and morphological processing.

Speech Synthesis

For users with dyslexia or other language impairments, it may be hard to recognize an intended word form among the prediction suggestions, since there may be problems distinguishing similar words from each other. The same may hold for users with bad eyesight. Therefore most of the current word prediction systems on the market make use of a speech synthesizer, providing an audible presentation of the suggested word forms.

The speech synthesizer used in the FASTY system is a concatenation of a grapheme to phoneme converter (a program translating letters to a phonemic representation) and a phonetiser that converts phonemes into sound.

The conversion from letters to phonemes, is based on a so-called decision tree; a machine-learning algorithm stemming from the field of Information Theory. By means of this technique, rules stating how letters should be mapped to phonemes may be inferred automatically from a training dictionary in which word forms are listed along with their phonemic descriptions.

The award winning MBROLA phonetiser, made available through the Multitel partner, performs the second conversion, from phonemes to actual sound. MBROLA bases its speech synthesis on diphones, which means that it takes into account how the pronunciation of a phoneme is influenced by the preceding and succeeding phonemes. More information on MBROLA synthesizer may be found at http://tcts.fpms.ac.be/synthesis/mbrola.