Foreword

Stephen Johnson

Columbia University

 

The contributions to this volume reflect the impact of Zellig Harris’s work on the study of language as it pertains to formal systems, computability, and computer applications.

1.         Development of the field

The effect of Harris’s work in these fields stretches back almost 45 years, to 1957, when the first computer program to perform syntactic analysis of English sentences was developed on a Univac computer. This parser, based on Harris’s string analysis, is described in this volume by Sager and Nhan, and in greater detail by Joshi. Sager and Nhan also provide a detailed description of the Linguistic String Project, which began in 1965 and continues today. The LSP system inspired several others, including PROTEUS (Grishman & Hirschman 1986), PUNDIT (Hirschman et al. 1989), KERNEL (Palmer et al. 1994), and MedLEE (Friedman et al. 1994), which continue to be active research projects.

As Harris’s theories evolved over time, so did the range of computational devices inspired by them. These were naturally influenced by concurrent developments in mathematics and computer science. String grammar emerged around the same time as finite state automata. The Univac parsing program was based on a formalism that would today be called cascaded finite state transducers (see the paper by Joshi). A different type of finite state automaton was able to determine sentence well-formedness using a cycling process of cancellations (Harris 1963).

With the expansion of work on formal languages in the 1960s, context-free grammars became central in natural language research, in large part because of their simplicity and tractability. However, these formalisms obscure some properties of natural language that are directly available in string grammar. This led Sager to augment context-free rules with attributes and constraints written in a special-purpose restriction language. The resulting system also had the power to implement transformational grammar, since strings define the domains and ranges of transformations. Similar approaches are used in PROTEUS, PUNDIT, and KERNEL.

During this period, it was well established that individual word choices affect the acceptability of strings, and also restrict the ability of transformations to operate. The need to lexicalize context-free grammar, as well as a desire to combine the complementary strengths of context-free grammars and string grammars, led Joshi to develop tree-adjoining grammar, as described in this volume. Maurice Gross investigated the lexicalization of string grammar itself, resulting in Lexicon Grammar (Gross 1984). In this approach, for each string a table is constructed in which the columns are the word types of the string and the rows are words of the language.
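
The tabular idea can be made concrete with a minimal sketch, in which rows are words and columns are the string types in which a word may occur; the verbs and string types below are invented for illustration and are not taken from Gross’s published tables:

```python
# Minimal illustrative sketch of a lexicon-grammar style table.
# Rows are words of the language (here, a few verbs); columns are
# string types in which a word may or may not occur.

STRING_TYPES = ["N0 V", "N0 V N1", "N0 V that S"]

TABLE = {
    # verb:     N0 V    N0 V N1  N0 V that S
    "sleep":   (True,   False,   False),
    "read":    (True,   True,    False),
    "know":    (False,  True,    True),
}

def accepts(verb, string_type):
    """True if the table marks `verb` as acceptable in `string_type`."""
    return TABLE[verb][STRING_TYPES.index(string_type)]

print(accepts("know", "N0 V that S"))   # True
print(accepts("sleep", "N0 V N1"))      # False
```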

The phenomenon of sublanguage is one of Harris’s major discoveries. A direct representation of sublanguage grammar is difficult, because of the large variation in surface forms. Early approaches pioneered by Sager dealt with the problem in stages: surface forms parsed by string grammar were transformed to elementary sentences, on which sublanguage constraints were applied. This led to systems in which syntax, semantics, and pragmatics were handled by separate ‘modules’, an architecture emulated by most other NLP systems. More recently, Friedman et al. (1994) have pursued a different method, using a single grammar that parses sublanguage sentences directly, without the use of a general syntactic component.

The various formalisms for surface forms, transformations, sublanguage, and the lexicon can be related, but the resulting system is extremely complex. Harris devoted the last decades of his life to the creation of a unified theory of language, which culminated in operator grammar (1982, 1991). The theory is intrinsically lexical, driven by the argument requirements of each operator word. Surface forms are obtained through reductions, which are simpler than transformations and which have lexicalized domains. Sublanguage is recast using the dependency structures and reductions of the operator grammar framework. Operator grammar is discussed in several places in this volume: Pereira contrasts Harris’s approach to ‘mainstream’ linguistics, Johnson presents the beginnings of a computational system based on this theory, Habert and Zweigenbaum explore automatic classification of words, and Smaby extends some of the methods to computer user interfaces.

2.         Mathematical roots

The volume is introduced by André Lentin’s essay, “Reflections on references to mathematics in the work of Zellig Harris”. This brief piece articulates the mathematical philosophy that permeates Harris’s work and informs subsequent work by others on formal systems, computability and computer applications. Lentin describes the principal mathematical ingredients in the linguistic theory of Harris, and connects these to the giants of mathematics who must have influenced Harris’s thinking. Constructivism is seen in Harris’s use of finite elements and explicitly stated operations. Type theory, linked to the work of Russell, appears in the typing of operators by the arguments to which they apply. Algebra was clearly the primary tool of choice for Harris, enabling definition and manipulation of symbolic structures. Finally, Lentin suggests that just as Gödel used metalanguage to find cracks in the foundations of mathematics, Harris showed the impossibility of an external foundation for language.

Lentin indicates that Harris’s approach was to select from mathematics the appropriate tools for examining language. This philosophy leads to creation of new mathematical systems, rather than forcing language into the confines of an existing system. Moreover, there is no ‘best’ or ‘true’ system, but instead multiple ways of viewing the same mathematical object.

3.         Mathematics and formal systems

The section on mathematics and formal systems provides a clear illustration of Lentin’s observations. The three articles explore three very different formal systems. Pereira reviews Harris’s theory of operator grammar and relates it to learning theory. Oehrle focuses on formalization of the morphology of Semitic words. Langendoen looks at the mathematical properties of sequences.

Fernando Pereira’s “Formal grammar and information theory: together again?” examines the great divide between information theory (in the Shannon tradition) and linguistics (in the Chomsky tradition). He asks whether Harris’s theory can provide the bridge between these disciplines, and notes that Harris’s principle of least grammar is not only a methodological constraint for the linguist, but also a crucial component of language acquisition. Harris discussed this point (1991:404–405) but, apart from hints about the use of classifiers for analogic extension of selection domains, left open exactly how language learners generalize from a limited sample. Recent work on learning theory shows that models can generalize from a finite sample, e.g. to assign probabilities to unseen events. This phenomenon is essential for the creative aspect of language use. Such models must represent the internal state of the learner, e.g. as ‘hidden variables’ that capture uncertainty about events observed so far.
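
As a concrete, if deliberately simplistic, illustration of assigning probability to unseen events, the following sketch uses add-one (Laplace) smoothing, one of the oldest such techniques; the miniature corpus and the vocabulary size are invented:

```python
from collections import Counter

def laplace_unigram(tokens, vocab_size):
    """Add-one smoothed unigram model: each of `vocab_size` word types
    receives nonzero probability, even if it never occurs in `tokens`."""
    counts = Counter(tokens)
    total = len(tokens)
    return lambda word: (counts[word] + 1) / (total + vocab_size)

# Invented miniature corpus; 'fever' never occurs in it.
p = laplace_unigram("the patient reports the pain".split(), vocab_size=10000)
print(p("patient"))   # seen word: larger probability
print(p("fever"))     # unseen word: small but nonzero probability
```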

These techniques provide a way of quantifying Harris’s principle of grammatical simplicity. Pereira also points out that other aspects of Harris’s theory are helpful in developing models of learning. For example, Harris’s focus on lexicalized elements makes it easier to factor interactions among the hidden variables.

Pereira reminds us that language learning is grounded in sensory experience, which greatly constrains the learning process. He notes as well that the linguistic environment (e.g. neighboring sentences) provides an additional set of constraints on interpretation. This discourse restriction goes beyond the likelihood constraint in Harris’s theory. Pereira suggests that ‘language understanding’ can be recast in this framework, by relating linguistic events to one another, e.g. to judge whether a sentence follows from a discourse, or to answer a question about the discourse.

In “Logics for intercalation”, Richard Oehrle develops a mathematical approach to discontinuities. He examines the problem of intercalative morphology of Arabic, in which a pattern of vowels is inserted into a sequence of consonants (the lexical root). Oehrle presents a multi-modal categorial grammar that allows one to escape the simple concatenation operations of ‘classical’ categorial grammar. Categorial grammar is mentioned in Harris’s work (1991:60), and also in this volume in the articles by Pereira and by Johnson.

The basic idea is to assign each vowel pattern to a pattern of modal operators, which controls how the vowels get ‘shuffled’ into the sequence of consonants. Oehrle creates discontinuous structures through a variety of distributivity postulates, which enable an element to distribute through another expression, just as multiplication distributes over addition in arithmetic: x*(y+z) = x*y + x*z. The distributivity postulates are combined with a pattern of modal operators, resulting in a ‘controlled shuffle’ of one list of elements into another.
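
The effect of such a shuffle can be illustrated, far more crudely than with Oehrle’s modal machinery, by a small sketch that simply interleaves a vowel pattern into a consonantal root (the familiar Arabic root k-t-b is used as the example):

```python
def intercalate(root, vowels):
    """Interleave a vowel pattern into a consonantal root: after each
    consonant, insert the corresponding vowel from the pattern, if any."""
    out = []
    for i, consonant in enumerate(root):
        out.append(consonant)
        if i < len(vowels) and vowels[i]:
            out.append(vowels[i])
    return "".join(out)

# The triliteral root k-t-b with the vowel pattern a-a yields 'katab'.
print(intercalate("ktb", ["a", "a"]))   # -> katab
```

The point of Oehrle’s postulates is precisely to control which of the many possible interleavings are licensed; the plain function above has no such control.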

While the focus of the article is on morphology, Oehrle supplies many examples from English syntax. He shows how controlled discontinuity can be used to analyze structures such as the relative clause, establishing a relation between the relative pronoun and the missing argument.

D. Terence Langendoen in “Sequence Structure” defines a language as a set of morpheme sequences. There is a similarity to Harris’s methods in that the formalism deals only with relations between morphemes. For a given sequence of morphemes, a ‘subsequence’ consists of some of the morphemes in the same order. Because a subsequence does not have to be continuous, there are some rough parallels to the discontinuous structures examined in the preceding article by Oehrle.

Langendoen defines the ‘sequence structure’ of a language as a partial ordering induced by the subsequence relation. The ‘conjunction’ of two sequences is their greatest lower bound in the partial order. He uses this operation to obtain structural analyses of sequences. For example, a sequence is ambiguous if there is more than one way to obtain it by conjunction. He shows several examples of sequence structures applied to morphology and to syntax. In particular, he shows how his approach can account for both local and unbounded dependencies.
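
A minimal sketch of the underlying subsequence relation (not Langendoen’s formal apparatus itself) may help fix ideas: it tests whether one sequence occurs, possibly discontinuously, within another, and whether a candidate sequence lies below two others in the resulting partial order:

```python
def is_subsequence(sub, seq):
    """True if the elements of `sub` occur in `seq` in the same order,
    though not necessarily contiguously."""
    it = iter(seq)
    return all(any(m == x for x in it) for m in sub)

def is_lower_bound(candidate, s1, s2):
    """A common subsequence of s1 and s2, i.e. below both of them in the
    subsequence partial order."""
    return is_subsequence(candidate, s1) and is_subsequence(candidate, s2)

s1 = "the old man left".split()
s2 = "the man left early".split()
print(is_subsequence(["the", "left"], s1))             # True (discontinuous)
print(is_lower_bound(["the", "man", "left"], s1, s2))  # True
```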

4.         Computability of language

The central section, “Computability of Language”, considers how to implement mathematical theories as computer programs. Sager and Joshi each describe formalisms based on Harris’s earlier theories (strings, transformations, and sublanguage), while Johnson considers the later operator grammar.

In “The computability of strings, transformations and sublanguage” Naomi Sager and Ngo Thanh Nhan describe research of the Linguistic String Project (LSP). The original motivation for this research was to assess the computability of language structure. The success of the approach led to practical applications of natural language processing for patient care and related activities.

The project began with the computerization of String Grammar. The LSP system was implemented using context-free grammar, represented in Backus-Naur Form. Sager notes that CFG resembles immediate constituent analysis, and obscures the underlying character of linguistic strings. This limitation was addressed by assigning rules of the grammar to string types. The key issue for computability lies in the fact that grammatical restrictions can be encapsulated within defined strings. (Joshi also addresses the property of local constraints in the following article.) In terms of implementation, each restriction can be associated with a grammar rule, and the restrictions can be expressed using string types and operations on strings.
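
The flavor of this design can be suggested by a small sketch in which a hypothetical context-free rule carries an associated restriction procedure; the rule, the attributes, and the restriction are invented for illustration and do not reproduce the LSP restriction language:

```python
# Invented sketch: a context-free rule paired with a restriction procedure,
# in the spirit of (but much simpler than) the LSP design, where restrictions
# are stated in terms of string types and operations on strings.

RULES = {
    "ASSERTION": (["SUBJECT", "VERB", "OBJECT"], "agree_in_number"),
}

def agree_in_number(nodes):
    """Restriction: subject and verb must carry the same number attribute."""
    return nodes[0]["number"] == nodes[1]["number"]

RESTRICTIONS = {"agree_in_number": agree_in_number}

def satisfies(rule_name, nodes):
    """Check the restriction attached to a rule against candidate nodes."""
    _expansion, restriction = RULES[rule_name]
    return RESTRICTIONS[restriction](nodes)

nodes = [
    {"cat": "SUBJECT", "number": "singular"},
    {"cat": "VERB",    "number": "singular"},
    {"cat": "OBJECT",  "number": "plural"},
]
print(satisfies("ASSERTION", nodes))   # True: subject and verb agree
```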

Harris represented the transformational decomposition of sentences as a semilattice of operations. Because the sheer detail of the transformations and the lexicon imposes, as Sager observes, ‘formidable’ requirements for implementation, the LSP system performs transformations using the same operations that were employed to express restrictions.

Sublanguage was a major focus of LSP research. A chief use of sublanguage was to reduce syntactic ambiguity; the medical domain required several thousand patterns for use in selection restrictions. Sublanguage classes were determined by manual analysis of patterns, and words were assigned to classes using “diagnostic frames”. The discovery process was not automated, because it relied on obtaining a syntactic parse and transformations, which in turn rely on sublanguage constraints. The research identified interesting phenomena, such as unique syntactic usages (“sublanguage idioms”) and sentence fragments, which could be described using familiar components of grammatical sentences. This approach yields a “grammar of the ungrammatical”.

Aravind Joshi addresses similar issues in “Hierarchical structure and sentence description”.  He describes three formalisms based on Harris’s theories: cascaded finite state automata (Uniparse), string grammar, and tree adjoining grammar. Joshi identifies the common methodology as the separation of a finite set of elementary units from the process of derivation that relates these units. The elementary units encapsulate the dependencies among word categories (and words), making all dependencies ‘local’. The derivation process is recursive, and creates what Joshi terms the ‘hierarchical structure’ of sentences.

In the Uniparse system and in string grammar, this hierarchy is implicit. Context-free grammar can describe the hierarchical structure, but cannot encapsulate dependencies, because each rule describes only one level of the hierarchy. Moreover, context-free grammars cannot be lexicalized while preserving the strong generative capacity of the grammar. To address these shortcomings, tree adjoining grammar is presented as a formal description of sentences that can define multi-level dependency structures (elementary trees). Substitution and adjunction operations are applied recursively to elementary trees to generate derived trees.
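
A highly simplified sketch of adjunction (leaving aside substitution and the constraints that govern where adjunction may apply in a full TAG) shows the central operation: an auxiliary tree is inserted at a node of another tree, and that node’s original subtree is re-attached at the foot node. The trees below are invented examples:

```python
# A tree is a (label, children) pair; the foot node of an auxiliary tree
# is marked with a trailing '*'.

def adjoin(tree, aux, target):
    """Adjoin auxiliary tree `aux` at the topmost node of `tree` labeled
    `target`; the original subtree at that node is re-attached at the foot."""
    label, children = tree
    if label == target:
        return _plug_foot(aux, tree)
    return (label, [adjoin(child, aux, target) for child in children])

def _plug_foot(aux, subtree):
    label, children = aux
    if label.endswith("*"):
        return subtree
    return (label, [_plug_foot(child, subtree) for child in children])

initial   = ("S",  [("NP", [("John", [])]),
                    ("VP", [("V", [("sleeps", [])])])])
auxiliary = ("VP", [("ADV", [("often", [])]), ("VP*", [])])

print(adjoin(initial, auxiliary, "VP"))
# -> ('S', [('NP', [('John', [])]),
#           ('VP', [('ADV', [('often', [])]),
#                   ('VP', [('V', [('sleeps', [])])])])])
```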

Stephen Johnson considers the linguistic theory that followed the work on strings and transformations in “The computability of operator grammar”. In place of strings, the theory has simple dependency structures in which operator words predicate on argument words. Instead of transformations, reductions map the predication structures into compact forms. Johnson observes that the complexity of the reduction system is a significant barrier to computerization, a point echoed by the statement of Sager and Nhan that operator grammar analysis is “deeper than what we are in a position to compute today.”
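
A toy illustration of the contrast may be useful: the sketch below represents a predication as an operator word with its arguments and applies one familiar reduction, the zeroing of an indefinite argument (as in ‘John eats something’ reduced to ‘John eats’); the data structures are purely illustrative and are not the formalism proposed in the article:

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Predication:
    operator: str                 # the operator word
    arguments: Tuple[str, ...]    # its argument words, in order

def zero_indefinite(pred):
    """Reduction: drop low-information indefinite arguments, yielding a
    more compact surface form."""
    kept = tuple(a for a in pred.arguments if a not in {"something", "someone"})
    return replace(pred, arguments=kept)

p = Predication("eats", ("John", "something"))
print(zero_indefinite(p))   # Predication(operator='eats', arguments=('John',))
```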

Johnson notes (along with Pereira) that operator grammar makes a radical departure from other theories by establishing semantics on a statistical model rather than through interpretation in a logical model. His paper proposes a synthesis of several different grammatical formalisms into a new type of grammar: categorial grammar provides a system of types; rewrite grammars formalize the process of sentence generation; lexicon grammar manages the vast body of details relating lexical items to syntactic forms; information theory enables description of predication as a mathematical distribution; and fuzzy logic provides a foundation for semantics based on these distributions. Johnson suggests that the resulting grammar can be implemented in a variety of ways, with the possibility of extremely efficient algorithms using finite state transducers.

5.         Computer applications

The final section, “Computer applications”, examines several different applications of the method and formalisms described in the previous sections: building dictionaries for corpus linguistics, automatic discovery of sublanguage categories, generation of sublanguage texts, and analysis of user interfaces.


“Distributional syntactic analysis and valency: Basic notions, procedures, and applications of the Pronominal Approach” by Karel van den Eynde, Piet Mertens, Sabine Kirchmeier-Andersen, and Lene Schøsler presents work on the development of dictionaries (readable by machines as well as by humans) and their application in corpus linguistics. Their methodology, called the Pronominal Approach, is related to Harris's substitution procedures, his concept of pro-forms, his string analysis, and his proposals regarding the use of classifier vocabulary to relate novel sentences to other sentences whose likelihood or acceptability (in a given subject-matter domain) is known. They discuss their methodology and its roots in the linguistics of Harris and of Togeby, and in the constructivism of Goodman (1951). Then they describe two dictionaries being developed along these lines, the PROTON dictionary of French verbs (Leuven, Belgium) and the Odense Valency Dictionary (OVD) of Danish verbs. They conclude with proposals for further refinements and applications for the methodology.

In “Contextual acquisition of information categories: what has been done and what can be done automatically?” Benoit Habert and Pierre Zweigenbaum examine the problem of determining a semantic classification of lexical items. Semantic classes are important for the scientific study of sublanguage, for developing natural language processing systems, and for a wide variety of practical applications in information retrieval. Habert and Zweigenbaum firmly ground the problem in Harris’s theory of operator grammar: in general language, the selection of an operator for its argument is fuzzy, while in a sublanguage it is crisp. This theory offers the possibility of automatic classification in sublanguage using distributional methods.

Two major projects (Sager et al. 1987, Harris et al. 1989) developed classifications for several biomedical domains, but this work was largely manual. While classes could in theory be obtained solely on a distributional basis, in practice external resources (e.g. thesauri and expert opinion) were required to expedite the process. As discussed by Sager in this volume, automatic classification is extremely difficult due to the need for high-quality parsing and transformation of sentences to obtain the underlying operator-argument pairs for analysis.

Habert and Zweigenbaum review the literature on machine learning and statistical approaches for semantic classification, and conclude that robust parsers that produce partial analyses are preferable to more complete parsers that frequently fail. In particular, they find that useful results can be obtained by examining a local ‘window’ of words around a given word. However, none of the current approaches succeeds in fully automating the method described by Hirschman et al. (1975). They suggest that a new paradigm is required to implement and evaluate distributional semantics.
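
As a rough indication of what such window-based methods compute, the following sketch builds co-occurrence vectors from a small window and compares words by cosine similarity; the corpus is invented, and the systems reviewed by Habert and Zweigenbaum are considerably more sophisticated:

```python
from collections import Counter
import math

def window_vectors(tokens, window=2):
    """Co-occurrence vector for each word: counts of the words appearing
    within `window` positions to either side of it."""
    vectors = {}
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(v1, v2):
    dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

tokens = ("the culture grew rapidly the cells grew slowly "
          "the culture died quickly the cells died slowly").split()
v = window_vectors(tokens)
# Words with similar contexts ('culture', 'cells') score higher than
# dissimilar ones, suggesting a common sublanguage class.
print(cosine(v["culture"], v["cells"]), cosine(v["culture"], v["grew"]))
```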

Since most of the contributions in this volume are concerned with analysis of text, Richard Kittredge chose to focus on its synthesis in “Text generation within sublanguages”. Kittredge observes that despite a great deal of research on sublanguages and their restricted grammars, few have been found amenable to generation. Success has been achieved with stereotypical technical reports, such as weather reports, summaries of financial market activity, and descriptions of sporting events. For each of these, there is an alternative representation of the information, such as numerical measurements. However, for scientific articles and even for instructional materials, generation is difficult because there is no independent representation of the source knowledge. Information formats (described by Sager and Nhan) can be used as an interlingua to represent the information to be generated. This work raises intriguing issues about the relation between information and language.


In contrast with text analysis, text generation must focus on whole texts instead of single sentences. Kittredge describes distributional analysis carried out at the sentence level, which assigns sentences to classes. Members of a class are informationally equivalent; each can be an answer to the same question. This approach is reminiscent of the discourse dependencies invoked by Pereira in the first section of this book. Recent work by Kittredge employs meaning-text theory (Mel’chuk & Pertsov 1987). Although Kittredge does not go into detail, this theory has some interesting similarities to Harris’s operator grammar, such as the use of dependency representations and paraphrastic operations.


In the final piece, Richard Smaby goes beyond traditional natural-language applications in “A distributional semantics applied to computer user interfaces”. He examines the interaction between humans and computers using multiple modalities, such as graphical displays, pointing devices (e.g., the mouse), keyboard, speech input, and speech output. He shows that interaction with a graphical user interface of a software application such as a word processor can be represented as sequences of discrete ‘events’. Collections of sequences form a corpus suitable for distributional analysis. The corpus is similar to that employed in studies of dialog, in that it mixes user events with application events (computer responses).

In his distributional analysis of the interaction data, Smaby identifies the dependence of the occurrence of a sequence X on the occurrence of another sequence Y. For example, X may be a sequence of user operations that select a portion of text in a document, and Y a sequence of computer operations that applies some command to the selected text, such as making the font bold. Using distributional techniques, the structure X Y can be shown to have a linguistic interpretation, in which Y is an operator that predicates on the argument X.
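
The kind of corpus involved can be suggested by a small sketch in which invented user and application events are recorded per session, and simple counts show which application responses co-occur with which user sequences:

```python
from collections import Counter

# Invented event names: each session interleaves user events ('u:') and
# application events ('a:'), roughly as in a corpus of recorded interactions.
sessions = [
    ["u:select_text", "u:click_bold",   "a:apply_bold"],
    ["u:select_text", "u:click_italic", "a:apply_italic"],
    ["u:select_text", "u:click_bold",   "a:apply_bold"],
]

# Count which application event follows each sequence of user events: the
# application event behaves like an operator whose argument is the user
# sequence preceding it.
pairs = Counter()
for session in sessions:
    user_seq = tuple(e for e in session if e.startswith("u:"))
    for event in session:
        if event.startswith("a:"):
            pairs[(user_seq, event)] += 1

for (user_seq, app_event), count in pairs.items():
    print(user_seq, "->", app_event, count)
```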

Smaby finds that user interfaces have many linguistic properties, whether the modality is mouse clicks or speech and whether the lexical elements are icons or words. Interfaces have features reminiscent of pronouns, paraphrase (e.g., several ways to select a piece of text), and conjunction (e.g., multiple operations performed on one piece of text).

Smaby describes a user interface as a form of sublanguage that pares language down to the minimum needed to interact effectively. This approach is intriguing because the sublanguage must be learnable by the user through feedback from the system. Smaby concludes that a distributional approach can help with the design of effective user interfaces.

References

 

Friedman, Carol, Phil Alderson, John Austin, James Cimino, & Stephen Johnson. 1994. “A general natural language text processor for clinical radiology”. Journal of the American Medical Informatics Association 1(2):161–174.

Grishman, Ralph & Lynette Hirschman. 1986. “PROTEUS and PUNDIT: Research in Text Understanding”. Computational Linguistics 12(2):141–145.

Gross, Maurice. 1984. “Lexicon-Grammar and the Syntactic Analysis of French”. Proceedings of the 10th International Conference on Computational Linguistics (COLING'84), Stanford, California.

Harris, Zellig S. 1963. “A Cycling Cancellation-Automaton for Sentence Well-Formedness”. Transformations and Discourse Analysis Papers; 51. (Repr. as National Bureau of Standards #6320427; in International Computation Centre Bulletin (1966) 5.69–94; and in Zellig S. Harris, Papers in Structural and Transformational Linguistics. Dordrecht/ Holland: D. Reidel 1970, pp. 286–309.)

Harris, Zellig S., Michael Gottfried, Thomas Ryckman, Paul Mattick Jr., Ann Daladier, T. N. Harris & S. Harris. 1989. The Form of Information in Science: Analysis of Immunology Sublanguage. (Boston Studies in the Philosophy of Science, 104.) Dordrecht: Kluwer Academic.

Harris, Zellig S. 1991. A Theory of Language and Information: A Mathematical Approach. Oxford: Clarendon Press.

Hirschman, Lynette, Ralph Grishman & Naomi Sager. 1975. “Grammatically-based automatic word class formation”. Information Processing and Management 11:39–57.

Hirschman, Lynette, Martha Palmer, John Dowding, Deborah Dahl, Marcia Linebarger, Rebecca Passonneau, Francois-Michel Lang, Catherine Ball, & Carl Weir. 1989. “The PUNDIT NLP System”. In AI Systems in Government Conference, IEEE Computer Society, 27–31 March 1989.

Mel’chuk, Igor & Nikolai Pertsov. 1987. Surface Syntax of English: A Formal Model within the Meaning-Text Framework. Philadelphia: John Benjamins.

Palmer, Martha, Rebecca Passonneau, Carl Weir, & Tim Finin. 1994. “The KERNEL text understanding system”. In Fernando C. N. Pereira & Barbara J. Grosz, eds., Natural Language Processing. Cambridge, Mass.: MIT Press.

Sager, Naomi, Carol Friedman & Margaret Lyman, eds. 1987. Medical Language Processing: Computer Management of Narrative Data. Reading, Mass.: Addison-Wesley.

Shapiro, P. A. 1967. “ACORN—an automated coder of report narrative”. Methods of Information in Medicine 6(4):153–162.