Language and information

Lecture 1. A formal theory of syntax

1.2. Procedures yielding primitive elements

[Audio recording]

People have known for most of this century that the sounds that people use in language are special to each language; that is to say, only certain sound differences are recognized in the usage of the language. These are called phonemic distinctions. They are not really things; they are not phonemes, they are phonemic distinctions, differences between things. Every language has certain differences between sounds with which it works. These differences, that is to say the phonemic distinctions, can be established in a very precise way. They can also be judged in an imprecise way, where one sees what works and what does not work, and so forth; but it is important to know that there are also testable procedures which make it possible to say, for a language, which are the phonemic distinctions that that language works with. These testable procedures, it is important to know, require no prior knowledge of that language and no prior knowledge of the meanings of the words. In fact there are words with utterly different meanings, like hart and heart, which have the same phonemes: they are not distinguishable, and in fact are not distinguished, as sound elements.
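The homophone point can be made concrete. Here is a minimal sketch, assuming a small hand-made pronunciation table; the transcriptions are illustrative inventions, not a real phonemic analysis:

```python
# Toy pronunciation table: each word mapped to its phoneme sequence.
# These transcriptions are illustrative assumptions only.
PRONUNCIATION = {
    "hart":  ("h", "a", "r", "t"),
    "heart": ("h", "a", "r", "t"),   # same phonemes as "hart": a homophone
    "heard": ("h", "er", "d"),
}

def phonemically_distinct(w1, w2):
    """Two words are phonemically distinct iff their phoneme sequences
    differ, regardless of spelling or meaning."""
    return PRONUNCIATION[w1] != PRONUNCIATION[w2]

print(phonemically_distinct("hart", "heart"))   # False: different meanings, same sounds
print(phonemically_distinct("heart", "heard"))  # True
```

The test requires no knowledge of what the words mean, only of how they sound, which is the point of the paragraph above.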

Now, the letters of the alphabet are a different thing that does the same work. By chance, we have available to us a system that is an approximation to the phonemes of the language. Given either the phonemes or the letters (it does not make much difference which we use), if we take sequences of them, sequences which are utterances of the language and not sequences which are not the language, we are able to apply a stochastic method, really a method which predicts the (n+1)th item on the basis of the preceding n items, and with it to establish where the word boundaries are. This is done entirely on combinatorial grounds, sequentially, and although we do not know what the words mean, we know what the words are. We know which are the words.
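The boundary procedure just described can be sketched in a few lines. This is only a toy illustration of the successor-count idea, run over an invented four-utterance corpus, not the full procedure: at each position we count how many distinct letters can follow the prefix seen so far, and we cut where that count drops back after a peak, since almost anything can begin a new word.

```python
# Toy corpus of unsegmented utterances (an assumption for illustration).
corpus = ["thedogran", "thedogsat", "thecatran", "thecatsat"]

def successor_counts(utterance, corpus):
    """counts[k] = number of distinct letters that follow the first k+1
    letters of `utterance` anywhere in the corpus."""
    counts = []
    for i in range(1, len(utterance)):
        prefix = utterance[:i]
        successors = {u[i] for u in corpus if u.startswith(prefix) and len(u) > i}
        counts.append(len(successors))
    return counts

def segment(utterance, corpus):
    """Cut after a prefix whose successor count peaks and then falls."""
    counts = successor_counts(utterance, corpus)
    cuts = [k + 1 for k in range(len(counts) - 1)
            if counts[k] > 1 and counts[k] > counts[k + 1]]
    pieces, start = [], 0
    for c in cuts:
        pieces.append(utterance[start:c])
        start = c
    pieces.append(utterance[start:])
    return pieces

print(segment("thedogran", corpus))   # ['the', 'dog', 'ran']
```

Nothing here uses the meanings of the words; the boundaries fall out of the combinatorics of the sequences alone, which is what the lecture claims.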

Of course, one might say that we know what the words are anyhow, given our languages; but we know them only imprecisely. There is the question whether a in a book is a word or some kind of prefix, and so forth, and there are bigger problems in other languages. But even if we know roughly what the words are, it is very important to know that a precise procedure is available if one has to argue a point, if one has to know for sure what an element is.

Now, if we have the data of the language in the form of words, then every utterance of the language is a sequence of words. The question now is that not all combinations exist. It is very easy to show that there are many combinations of words which are not English even though the words are English: they are not English sentences, and we could not know what they mean. We do not know what because from because means, or something like that, and one can get similar examples. It would seem, therefore, that all one has to do is to find the combinations of words. However, there are difficulties.

First of all, the data is very vast. There are very, very many combinations of words. Secondly, grammar is fuzzy. Not all sentences are well defined; the set of sentences is not well defined. There are many marginal sentences, things about which people will disagree whether they are sentences or not, or will not be certain, and so forth. Furthermore, words come in and out of the language, not very fast, but at a reasonable speed nevertheless. It is impossible to give a list of all the words; it is impossible to make sure that some combination or another does not exist.

In a situation of this sort, there is something else that can be done. One can look for the constraints on combinations, meaning what it is that precludes certain combinations from occurring, what prevents words from combining at random. If it turns out that we can do this, that we can find constraints that are describable precisely, then the fuzziness of grammar is located in a particular part of the description: it is not in the definition of the constraints, it is in the domains of the constraints, so that at least part of the structure of language can be completely precise.

It further turns out that one can describe language with four constraints. These four constraints give both the form and the meaning, together with the word choices, of all sentences. I want to present the four constraints today. I will give first, just to fix the ideas, a list of what they are.

  • One constraint, the first, is a partial order that creates sentences, a partial order on word-occurrences that creates sentences.
  • The second is an inequality of frequency, a likelihood inequality, that allows for meaning differences in words.
  • The third is a reduction of the phonemic shape, just the physical shape of words in sentences, and a reduction of the information in sentences.
  • And the last is a linearization of the partial order.

That is all, and that will be found to be sufficient.