History of natural language processing

The history of natural language processing describes the advances of natural language processing (see also the outline of natural language processing). There is some overlap with the history of machine translation, the history of speech recognition, and the history of artificial intelligence.


Research and development

The history of machine translation dates back to the seventeenth century, when philosophers such as Leibniz and Descartes put forward proposals for codes which would relate words between languages. All of these proposals remained theoretical, and none resulted in the development of an actual machine.

The first patents for "translating machines" were applied for in the mid-1930s. One proposal, by Georges Artsrouni, was simply an automatic bilingual dictionary using paper tape. The other proposal, by Peter Troyanskii, a Russian, was more detailed. It included both the bilingual dictionary and a method for dealing with grammatical roles between languages, based on Esperanto.
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human.
In 1957, Noam Chomsky's ''Syntactic Structures'' revolutionized linguistics with 'universal grammar', a rule-based system of syntactic structures.

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three to five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

A notably successful NLP system developed in the 1960s was
SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies.

In 1969 Roger Schank introduced the conceptual dependency theory for natural language understanding. This model, partially influenced by the work of Sydney Lamb, was extensively used by Schank's students at Yale University, such as Robert Wilensky, Wendy Lehnert, and Janet Kolodner.
In 1970, William A. Woods introduced the augmented transition network (ATN) to represent natural language input. Instead of ''phrase structure rules'', ATNs used an equivalent set of finite state automata that were called recursively. ATNs and their more general format, called "generalized ATNs", continued to be used for a number of years.
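The mechanism is easy to illustrate. Below is a minimal sketch, in Python, of a recursive transition network in the spirit of ATNs: each network is a small finite-state automaton whose arcs either consume a word of a given lexical category or recursively invoke another network. The grammar, lexicon, and function names here are invented for illustration; real ATNs also carried registers and arbitrary tests on arcs.

    # Toy recursive-transition-network recognizer. Each network is a
    # finite-state automaton; an arc label naming another network is a
    # recursive call, any other label must match a word's lexical category.
    LEXICON = {"the": "DET", "dog": "N", "cat": "N", "saw": "V"}

    NETWORKS = {
        "S":  {"start": [("NP", "s1")], "s1": [("VP", "end")]},
        "NP": {"start": [("DET", "np1")], "np1": [("N", "end")]},
        "VP": {"start": [("V", "vp1")], "vp1": [("NP", "end"), (None, "end")]},
    }

    def reachable_ends(net, state, words, pos):
        """All input positions at which `net` can reach its accepting state."""
        if state == "end":
            return {pos}
        results = set()
        for label, nxt in NETWORKS[net].get(state, []):
            if label is None:                        # jump (epsilon) arc
                results |= reachable_ends(net, nxt, words, pos)
            elif label in NETWORKS:                  # recursive sub-network call
                for mid in reachable_ends(label, "start", words, pos):
                    results |= reachable_ends(net, nxt, words, mid)
            elif pos < len(words) and LEXICON.get(words[pos]) == label:
                results |= reachable_ends(net, nxt, words, pos + 1)
        return results

    words = "the dog saw the cat".split()
    print(len(words) in reachable_ends("S", "start", words, 0))  # True

The recursion on sub-networks is what distinguishes this from a single flat automaton: the "NP" network is defined once and reused wherever a noun phrase may occur.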
During the 1970s many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.

Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of
machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's law and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. Chomskyan linguistics encourages the investigation of "corner cases" that stress the limits of its theoretical models (comparable to pathological phenomena in mathematics), typically created using thought experiments, rather than the systematic investigation of typical phenomena that occur in real-world data, as is the case in corpus linguistics. The creation and use of such corpora of real-world data is a fundamental part of machine-learning algorithms for NLP. In addition, theoretical underpinnings of Chomskyan linguistics such as the so-called "poverty of the stimulus" argument entail that general learning algorithms, as are typically used in machine learning, cannot be successful in language processing. As a result, the Chomskyan paradigm discouraged the application of such models to language processing.
Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.
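To make the idea concrete, the sketch below shows one simple form of cache language model in Python: a unigram distribution over recently seen words (the cache) interpolated with a static background distribution, so that words a speaker has just used receive a boosted probability. The interpolation weight, cache size, and counts are invented for illustration.

    from collections import Counter

    def cache_lm_prob(word, history, background_counts, lam=0.3, cache_size=100):
        """P(word) = lam * P_cache(word) + (1 - lam) * P_background(word),
        where the cache is a unigram model over the most recent words."""
        cache = Counter(history[-cache_size:])
        p_cache = cache[word] / max(1, sum(cache.values()))
        p_background = background_counts.get(word, 0) / sum(background_counts.values())
        return lam * p_cache + (1 - lam) * p_background

    background = {"the": 700, "of": 300, "speech": 5, "recognition": 5}
    history = "speech recognition systems process speech data".split()
    # "speech" occurred twice recently, so its probability is far above
    # its background rate:
    print(cache_lm_prob("speech", history, background))

A recognizer choosing between acoustically similar hypotheses can therefore prefer a word the speaker has already used, a soft decision that a set of hard if-then rules does not express naturally.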
Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed.
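The flavor of that work can be conveyed by a compact sketch of word-translation-probability training in the style of IBM Model 1, estimated with the expectation-maximization algorithm over a toy parallel corpus. This is a simplified illustration rather than IBM's actual system; the corpus and iteration count are invented.

    from collections import defaultdict

    # Toy French-English parallel corpus (invented for illustration).
    corpus = [
        ("la maison".split(),  "the house".split()),
        ("la fleur".split(),   "the flower".split()),
        ("une maison".split(), "a house".split()),
    ]

    # Initialize translation probabilities t(f|e) uniformly.
    french_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(french_vocab))

    for _ in range(20):                      # EM iterations
        count = defaultdict(float)           # expected counts c(f, e)
        total = defaultdict(float)           # expected counts c(e)
        for fs, es in corpus:
            for f in fs:                     # E-step: fractional alignment counts
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    count[(f, e)] += t[(f, e)] / z
                    total[e] += t[(f, e)] / z
        for (f, e), c in count.items():      # M-step: re-estimate t(f|e)
            t[(f, e)] = c / total[e]

    print(round(t[("maison", "house")], 2))  # approaches 1.0
    print(round(t[("la", "the")], 2))        # approaches 1.0

Crucially, nothing in the loop is specific to either language: the translation probabilities are learned entirely from the aligned sentence pairs.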
These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in their success. As a result, a great deal of research has gone into methods of learning more effectively from limited amounts of data.
Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or from a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.
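One common semi-supervised technique of this kind is self-training: fit a model on the small annotated set, use it to label the unannotated pool, and retrain on whatever it labels confidently. The sketch below assumes scikit-learn is available and uses invented toy data; it illustrates the generic loop, not any particular published system.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    # A tiny labeled set plus a larger unlabeled pool (invented data).
    X_lab = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    y_lab = np.array([0, 0, 1, 1])
    X_unl = rng.normal(0.5, 0.6, size=(200, 2))

    model = LogisticRegression().fit(X_lab, y_lab)
    for _ in range(5):                              # self-training rounds
        proba = model.predict_proba(X_unl)
        confident = proba.max(axis=1) > 0.9         # keep only confident guesses
        if not confident.any():
            break
        X_aug = np.vstack([X_lab, X_unl[confident]])
        y_aug = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        model = LogisticRegression().fit(X_aug, y_aug)

    print(model.predict([[0.1, 0.0], [1.0, 0.9]]))  # expected: [0 1]

The pseudo-labels are of course noisier than human annotation, which is why such methods typically trail supervised learning for the same amount of data, as noted above.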


