In logic, mathematics, computer science, and linguistics, a formal language is a set of strings whose symbols are taken from a set called the "alphabet". The alphabet of a formal language consists of symbols that concatenate into strings (also called "words"). Words that belong to a particular formal language are sometimes called ''well-formed words''. A formal language is often defined by means of a formal grammar such as a regular grammar or context-free grammar.

In computer science, formal languages are used, among other things, as the basis for defining the grammar of programming languages and formalized versions of subsets of natural languages, in which the words of the language represent concepts that are associated with meanings or semantics. In computational complexity theory, decision problems are typically defined as formal languages, and complexity classes are defined as the sets of the formal languages that can be parsed by machines with limited computational power. In logic and the foundations of mathematics, formal languages are used to represent the syntax of axiomatic systems, and mathematical formalism is the philosophy that all of mathematics can be reduced to the syntactic manipulation of formal languages in this way.

The field of formal language theory studies primarily the purely syntactic aspects of such languages, that is, their internal structural patterns. Formal language theory sprang out of linguistics, as a way of understanding the syntactic regularities of natural languages.

History

In the 17th century, Gottfried Leibniz imagined and described the characteristica universalis, a universal and formal language which utilised pictographs. Later, Carl Friedrich Gauss investigated the problem of Gauss codes.

Gottlob Frege attempted to realize Leibniz's ideas through a notational system first outlined in ''Begriffsschrift'' (1879) and more fully developed in his two-volume ''Grundgesetze der Arithmetik'' (1893/1903). This described a "formal language of pure language."

In the first half of the 20th century, several developments were made with relevance to formal languages. Axel Thue published four papers relating to words and language between 1906 and 1914. The last of these introduced what Emil Post later termed 'Thue Systems', and gave an early example of an undecidable problem. Post would later use this paper as the basis for a 1947 proof "that the word problem for semigroups was recursively insoluble", and later devised the canonical system for the creation of formal languages.

In 1907, Leonardo Torres Quevedo introduced a formal language for the description of mechanical drawings (mechanical devices), in Vienna. He published "Sobre un sistema de notaciones y símbolos destinados a facilitar la descripción de las máquinas" ("On a system of notations and symbols intended to facilitate the description of machines"). Heinz Zemanek rated it as an equivalent to a programming language for the numerical control of machine tools.

Noam Chomsky devised an abstract representation of formal and natural languages, known as the Chomsky hierarchy. In 1959 John Backus developed the Backus–Naur form to describe the syntax of a high-level programming language, following his work in the creation of FORTRAN. Peter Naur was the secretary/editor for the ALGOL 60 Report, in which he used Backus–Naur form to describe the formal part of ALGOL 60.

Words over an alphabet

An ''alphabet'', in the context of formal languages, can be any set; its elements are called ''letters''. An alphabet may contain an infinite number of elements; however, most definitions in formal language theory specify alphabets with a finite number of elements, and many results apply only to them. It often makes sense to use an alphabet in the usual sense of the word, or more generally any finite character encoding such as ASCII or Unicode.

A word over an alphabet can be any finite sequence (i.e., string) of letters. The set of all words over an alphabet Σ is usually denoted by Σ^* (using the Kleene star). The length of a word is the number of letters it is composed of. For any alphabet, there is only one word of length 0, the ''empty word'', which is often denoted by e, ε, λ or even Λ. By concatenation one can combine two words to form a new word, whose length is the sum of the lengths of the original words. The result of concatenating a word with the empty word is the original word.

In some applications, especially in logic, the alphabet is also known as the ''vocabulary'' and words are known as ''formulas'' or ''sentences''; this breaks the letter/word metaphor and replaces it by a word/sentence metaphor.
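The notions of word, length, empty word, and concatenation can be illustrated with a minimal Python sketch. It assumes letters are single characters and words are ordinary Python strings; the helper name `words_up_to` is hypothetical and chosen only for this illustration.

```python
from itertools import product

EMPTY = ""  # the empty word, the only word of length 0


def words_up_to(alphabet, max_len):
    """Yield every word over `alphabet` of length 0..max_len, shortest first."""
    letters = sorted(alphabet)
    for n in range(max_len + 1):
        for combo in product(letters, repeat=n):
            yield "".join(combo)


sigma = {"a", "b"}
sample = list(words_up_to(sigma, 2))  # a finite slice of the infinite set Σ^*

u, v = "ab", "ba"
lengths_add = len(u + v) == len(u) + len(v)      # length of a concatenation
empty_is_identity = (u + EMPTY == u == EMPTY + u)  # concatenating the empty word
```

Because Σ^* is infinite for any non-empty alphabet, the sketch only enumerates words up to a chosen length bound.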

Definition

Given a non-empty set $\Sigma$, a formal language $L$ over $\Sigma$ is a subset of $\Sigma^*$, which is the set of all possible finite-length words over $\Sigma$. We call the set $\Sigma$ the alphabet of $L$. On the other hand, given a formal language $L$ over $\Sigma$, a word $w \in \Sigma^*$ is ''well-formed'' if $w \in L$. Similarly, an expression $E \subseteq \Sigma^*$ is ''well-formed'' if $E \subseteq L$. Sometimes, a formal language $L$ over $\Sigma$ has a set of clear rules and constraints for the creation of all possible well-formed words from $\Sigma^*$.

In computer science and mathematics, which do not usually deal with natural languages, the adjective "formal" is often omitted as redundant. On the other hand, we can just say "a formal language $L$" when its alphabet $\Sigma$ is clear in the context.

While formal language theory usually concerns itself with formal languages that are described by some syntactic rules, the actual definition of the concept "formal language" is only as above: a (possibly infinite) set of finite-length strings composed from a given alphabet, no more and no less. In practice, there are many languages that can be described by rules, such as regular languages or context-free languages. The notion of a formal grammar may be closer to the intuitive concept of a "language", one described by syntactic rules. By an abuse of the definition, a particular formal language is often thought of as being accompanied by a formal grammar that describes it.

Examples

The following rules describe a formal language ''L'' over the alphabet Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, =}:
* Every nonempty string that does not contain "+" or "=" and does not start with "0" is in ''L''.
* The string "0" is in ''L''.
* A string containing "=" is in ''L'' if and only if there is exactly one "=", and it separates two valid strings of ''L''.
* A string containing "+" but not "=" is in ''L'' if and only if every "+" in the string separates two valid strings of ''L''.
* No string is in ''L'' other than those implied by the previous rules.

Under these rules, the string "23+4=555" is in ''L'', but the string "=234=+" is not. This formal language expresses natural numbers, well-formed additions, and well-formed addition equalities, but it expresses only what they look like (their syntax), not what they mean (semantics). For instance, nowhere in these rules is there any indication that "0" means the number zero, "+" means addition, "23+4=555" is false, etc.
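The rules above can be turned into a membership test. The following Python sketch is one possible reading, assuming the alphabet is the ten decimal digits together with "+" and "="; the function names are hypothetical. It checks form only and never evaluates the arithmetic, which mirrors the point that the language captures syntax and not meaning.

```python
DIGITS = set("0123456789")
ALPHABET = DIGITS | {"+", "="}  # assumed alphabet for this example


def is_number(s):
    """'0', or a nonempty digit string that does not start with '0'."""
    return s == "0" or (s != "" and set(s) <= DIGITS and s[0] != "0")


def is_sum_or_number(s):
    """Every '+' must separate valid strings, so each piece must be a number."""
    return all(is_number(part) for part in s.split("+"))


def in_language(s):
    """Membership test for the example language (syntax only)."""
    if not set(s) <= ALPHABET:
        return False
    if s.count("=") > 1:  # at most one '=' is allowed
        return False
    return all(is_sum_or_number(side) for side in s.split("="))


accepted = in_language("23+4=555")  # well-formed, although arithmetically false
rejected = in_language("=234=+")    # two '=' signs and empty operands
```

Reading "every + separates two valid strings" as "every piece between plus signs is a number" is an interpretation; it accepts the same strings as the recursive wording for inputs without "=".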

Constructions

For finite languages, one can explicitly enumerate all well-formed words. For example, we can describe a language ''L'' as just ''L'' = {a, b, ab, cba}. The degenerate case of this construction is the empty language, which contains no words at all (''L'' = ∅).

However, even over a finite (non-empty) alphabet such as Σ = {a, b} there are an infinite number of finite-length words that can potentially be expressed: "a", "abb", "ababba", "aaababbbbaab", .... Therefore, formal languages are typically infinite, and describing an infinite formal language is not as simple as writing ''L'' = {a, b, ab, cba}. Here are some examples of formal languages:
* ''L'' = Σ^*, the set of ''all'' words over Σ;
* ''L'' = {a}^* = {a^''n''}, where ''n'' ranges over the natural numbers and "a^''n''" means "a" repeated ''n'' times (this is the set of words consisting only of the symbol "a");
* the set of syntactically correct programs in a given programming language (the syntax of which is usually defined by a context-free grammar);
* the set of inputs upon which a certain Turing machine halts; or
* the set of maximal strings of alphanumeric characters on this line, i.e., the set {the, set, of, maximal, strings, alphanumeric, characters, on, this, line, i, e}.
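Two of the examples in this list are easy to check mechanically. The sketch below is illustrative only: `in_a_star` tests membership in the language of words made solely of the symbol "a", and `maximal_alnum_strings` extracts the maximal alphanumeric runs from a line of text, assuming "alphanumeric" means ASCII letters and digits. Both function names are hypothetical.

```python
import re


def in_a_star(word):
    """Membership in {a}^*: zero or more copies of the symbol 'a'."""
    return all(ch == "a" for ch in word)


def maximal_alnum_strings(line):
    """Set of maximal runs of ASCII letters and digits found in `line`."""
    return set(re.findall(r"[A-Za-z0-9]+", line))


line = "the set of maximal strings of alphanumeric characters on this line, i.e.,"
words_on_line = maximal_alnum_strings(line)
```

The first language is infinite, so it is given by a test rather than a listing; the second is finite and can be written out in full.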

Language-specification formalisms

Formal languages are used as tools in multiple disciplines. However, formal language theory rarely concerns itself with particular languages (except as examples), but is mainly concerned with the study of various types of formalisms to describe languages. For instance, a language can be given as
* those strings generated by some formal grammar;
* those strings described or matched by a particular regular expression;
* those strings accepted by some automaton, such as a Turing machine or finite-state automaton;
* those strings for which some decision procedure (an algorithm that asks a sequence of related YES/NO questions) produces the answer YES.

Typical questions asked about such formalisms include:
* What is their expressive power? (Can formalism ''X'' describe every language that formalism ''Y'' can describe? Can it describe other languages?)
* What is their recognizability? (How difficult is it to decide whether a given word belongs to a language described by formalism ''X''?)
* What is their comparability? (How difficult is it to decide whether two languages, one described in formalism ''X'' and one in formalism ''Y'', or in ''X'' again, are actually the same language?)

Surprisingly often, the answer to these decision problems is "it cannot be done at all", or "it is extremely expensive" (with a characterization of how expensive). Therefore, formal language theory is a major application area of computability theory and complexity theory. Formal languages may be classified in the Chomsky hierarchy based on the expressive power of their generative grammar as well as the complexity of their recognizing automaton. Context-free grammars and regular grammars provide a good compromise between expressivity and ease of parsing, and are widely used in practical applications.
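To make the idea of several formalisms describing one language concrete, the sketch below specifies the same language twice: once with a regular expression and once with a finite-state automaton. The language chosen (words over {a, b} with an even number of "a"s) and all names are assumptions made for illustration. The final helper compares the two descriptions on all short words; this is a bounded sanity check, not a proof of equivalence.

```python
import re
from itertools import product

# Regular-expression description: blocks containing exactly two 'a's, then b's.
EVEN_AS = re.compile(r"(b*ab*a)*b*")


def regex_accepts(word):
    return EVEN_AS.fullmatch(word) is not None


# Automaton description: two states tracking the parity of 'a's seen so far.
DELTA = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}


def dfa_accepts(word):
    state = "even"  # start state, also the only accepting state
    for ch in word:
        state = DELTA.get((state, ch))
        if state is None:  # symbol outside the alphabet
            return False
    return state == "even"


def agree_up_to(max_len):
    """Compare both descriptions on every word over {a, b} up to max_len."""
    return all(
        regex_accepts("".join(w)) == dfa_accepts("".join(w))
        for n in range(max_len + 1)
        for w in product("ab", repeat=n)
    )
```

Deciding whether two descriptions denote the same language is the comparability question listed above; for regular languages it is decidable, whereas for more expressive formalisms it may not be.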

Operations on languages

Certain operations on languages are common. This includes the standard set operations, such as union, intersection, and complement. Another class of operation is the element-wise application of string operations.

Examples: suppose $L_1$ and $L_2$ are languages over some common alphabet $\Sigma$.
* The ''concatenation'' $L_1 \cdot L_2$ consists of all strings of the form $vw$ where $v$ is a string from $L_1$ and $w$ is a string from $L_2$.
* The ''intersection'' $L_1 \cap L_2$ of $L_1$ and $L_2$ consists of all strings that are contained in both languages.
* The ''complement'' $\neg L_1$ of $L_1$ with respect to $\Sigma$ consists of all strings over $\Sigma$ that are not in $L_1$.
* The ''Kleene star'': the language consisting of all words that are concatenations of zero or more words in the original language.
* ''Reversal'':
** Let ''ε'' be the empty word, then $\varepsilon^R = \varepsilon$, and
** for each non-empty word $w = \sigma_1 \cdots \sigma_n$ (where $\sigma_1, \ldots, \sigma_n$ are elements of some alphabet), let $w^R = \sigma_n \cdots \sigma_1$,
** then for a formal language $L$, $L^R = \{w^R \mid w \in L\}$.
* String homomorphism.

Such string operations are used to investigate closure properties of classes of languages. A class of languages is closed under a particular operation when the operation, applied to languages in the class, always produces a language in the same class again. For instance, the context-free languages are known to be closed under union, concatenation, and intersection with regular languages, but not closed under intersection or complement. The theory of trios and abstract families of languages studies the most common closure properties of language families in their own right (Chapter 11: Closure properties of families of languages).
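For finite languages these operations can be tried out directly on Python sets of strings. The sketch below is illustrative, with hypothetical function names. Since the Kleene star and the complement are infinite in general, the corresponding helpers return only the words up to a chosen length bound.

```python
def concat(l1, l2):
    """All words vw with v in l1 and w in l2."""
    return {v + w for v in l1 for w in l2}


def reverse(lang):
    """Reverse every word of the language."""
    return {w[::-1] for w in lang}


def star_up_to(lang, max_len):
    """Words of the Kleene star of `lang` that have length <= max_len."""
    result = {""}          # zero repetitions give the empty word
    frontier = {""}
    while frontier:
        longer = {w for w in concat(frontier, lang) if len(w) <= max_len}
        frontier = longer - result
        result |= frontier
    return result


def complement_up_to(lang, alphabet, max_len):
    """Words over `alphabet` of length <= max_len that are not in `lang`."""
    return star_up_to(set(alphabet), max_len) - lang


l1 = {"a", "ab"}
l2 = {"b", ""}
product_language = concat(l1, l2)
common_words = l1 & l2  # intersection is the ordinary set operation
```

Union and intersection need no helper because Python's `|` and `&` on sets already implement them.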

Applications

Programming languages

A compiler usually has two distinct components. A lexical analyzer, sometimes generated by a tool like lex, identifies the tokens of the programming language grammar, e.g. identifiers or keywords, numeric and string literals, punctuation and operator symbols, which are themselves specified by a simpler formal language, usually by means of regular expressions. At the most basic conceptual level, a parser, sometimes generated by a parser generator like yacc, attempts to decide if the source program is syntactically valid, that is, if it is well formed with respect to the programming language grammar for which the compiler was built.

Of course, compilers do more than just parse the source code; they usually translate it into some executable format. Because of this, a parser usually outputs more than a yes/no answer, typically an abstract syntax tree. This is used by subsequent stages of the compiler to eventually generate an executable containing machine code that runs directly on the hardware, or some intermediate code that requires a virtual machine to execute.
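The two-stage structure can be sketched for a toy expression language. Everything here is an assumption made for illustration: the token classes, the grammar (`expr := term ('+' term)*`, `term := factor ('*' factor)*`, `factor := NUMBER | IDENT | '(' expr ')'`), and the tuple-based tree shape. It is a hand-written recursive-descent parser, not the output of lex or yacc.

```python
import re

# Lexical level: tokens are described by regular expressions.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<NUMBER>\d+)|(?P<IDENT>[A-Za-z_]\w*)|(?P<OP>[+*()]))"
)


def tokenize(source):
    """Split source text into (kind, text) pairs or raise SyntaxError."""
    source = source.rstrip()
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character near position {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


class Parser:
    """Syntactic level: recursive descent over the assumed toy grammar."""

    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return (None, None)  # end of input

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        node = self.term()
        while self.peek() == ("OP", "+"):
            self.take()
            node = ("+", node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() == ("OP", "*"):
            self.take()
            node = ("*", node, self.factor())
        return node

    def factor(self):
        kind, text = self.take()
        if kind == "NUMBER":
            return ("num", int(text))
        if kind == "IDENT":
            return ("var", text)
        if (kind, text) == ("OP", "("):
            node = self.expr()
            if self.take() != ("OP", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {text!r}")


def parse(source):
    """Return an abstract syntax tree, or raise SyntaxError if ill-formed."""
    parser = Parser(tokenize(source))
    tree = parser.expr()
    if parser.peek() != (None, None):
        raise SyntaxError("trailing input")
    return tree
```

Calling `parse("1 + x * (2 + 3)")` yields a nested tuple in which multiplication binds more tightly than addition, whereas `parse("1 +")` raises `SyntaxError`; the tree, rather than a bare yes/no answer, is what later compiler stages would consume.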

Formal theories, systems, and proofs

In mathematical logic, a ''formal theory'' is a set of sentences expressed in a formal language. A ''formal system'' (also called a ''logical calculus'', or a ''logical system'') consists of a formal language together with a deductive apparatus (also called a ''deductive system''). The deductive apparatus may consist of a set of transformation rules, which may be interpreted as valid rules of inference, or a set of axioms, or both. A formal system is used to derive one expression from one or more other expressions. Although a formal language can be identified with its formulas, a formal system cannot be likewise identified by its theorems. Two formal systems $\mathcal{FS}$ and $\mathcal{FS'}$ may have all the same theorems and yet differ in some significant proof-theoretic way (a formula A may be a syntactic consequence of a formula B in one but not in the other, for instance).

A ''formal proof'' or ''derivation'' is a finite sequence of well-formed formulas (which may be interpreted as sentences, or propositions) each of which is an axiom or follows from the preceding formulas in the sequence by a rule of inference. The last sentence in the sequence is a theorem of a formal system. Formal proofs are useful because their theorems can be interpreted as true propositions.
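The definition of a formal proof is purely syntactic, so it can be checked mechanically. The sketch below assumes a deliberately tiny formal system: atomic formulas are strings, an implication is the tuple `("->", A, B)`, the axioms are whatever set the caller supplies, and modus ponens is the only rule of inference. All names are hypothetical.

```python
def implies(a, b):
    """Build the formula 'a -> b' as a tuple."""
    return ("->", a, b)


def follows_by_modus_ponens(formula, earlier):
    """True if some A and (A -> formula) both occur among the earlier lines."""
    return any(implies(a, formula) in earlier for a in earlier)


def is_proof(lines, axioms):
    """Check that every line is an axiom or follows from preceding lines."""
    earlier = []
    for formula in lines:
        if formula not in axioms and not follows_by_modus_ponens(formula, earlier):
            return False
        earlier.append(formula)
    return bool(lines)  # an empty sequence proves nothing


axioms = {"p", implies("p", "q"), implies("q", "r")}
derivation = ["p", implies("p", "q"), "q", implies("q", "r"), "r"]
derivation_ok = is_proof(derivation, axioms)  # last line "r" is then a theorem
```

The checker never asks what "p" or "q" mean; it only compares shapes of formulas, which is the sense in which derivation in a formal system is syntactic.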

Interpretations and models

Formal languages are entirely syntactic in nature, but may be given semantics that give meaning to the elements of the language. For instance, in mathematical logic, the set of possible formulas of a particular logic is a formal language, and an interpretation assigns a meaning to each of the formulas, usually a truth value.

The study of interpretations of formal languages is called formal semantics. In mathematical logic, this is often done in terms of model theory. In model theory, the terms that occur in a formula are interpreted as objects within mathematical structures, and fixed compositional interpretation rules determine how the truth value of the formula can be derived from the interpretation of its terms; a ''model'' for a formula is an interpretation of terms such that the formula becomes true.
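A simplified propositional analogue shows how compositional rules turn an interpretation into a truth value. In this sketch, which is an assumption-laden illustration and not full model theory, an interpretation is a dictionary assigning `True` or `False` to atomic symbols, and formulas are nested tuples such as `("and", "p", ("not", "q"))`. The function names are hypothetical.

```python
def evaluate(formula, interpretation):
    """Truth value of a propositional formula under an interpretation."""
    if isinstance(formula, str):       # atomic symbol: look up its meaning
        return interpretation[formula]
    op, *args = formula
    values = [evaluate(arg, interpretation) for arg in args]
    if op == "not":
        return not values[0]
    if op == "and":
        return values[0] and values[1]
    if op == "or":
        return values[0] or values[1]
    if op == "->":
        return (not values[0]) or values[1]
    raise ValueError(f"unknown connective {op!r}")


def is_model(formula, interpretation):
    """An interpretation is a model of a formula if it makes the formula true."""
    return bool(evaluate(formula, interpretation))


formula = ("and", "p", ("not", "q"))
model_found = is_model(formula, {"p": True, "q": False})
not_a_model = is_model(formula, {"p": True, "q": True})
```

The same string of symbols receives different truth values under different interpretations, which is why the formal language itself carries no meaning until an interpretation is fixed.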

Notes

References

Citations

Sources

; General references
* A. G. Hamilton, ''Logic for Mathematicians'', Cambridge University Press, 1978.
* Seymour Ginsburg, ''Algebraic and automata theoretic properties of formal languages'', North-Holland, 1975.
* Michael A. Harrison, ''Introduction to Formal Language Theory'', Addison-Wesley, 1978.
* Grzegorz Rozenberg, Arto Salomaa, ''Handbook of Formal Languages: Volume I-III'', Springer, 1997.
* Patrick Suppes, ''Introduction to Logic'', D. Van Nostrand, 1957.

External links

* University of Maryland, Formal Language Definitions
* James Power, "Notes on Formal Language Theory and Parsing", 29 November 2002.
* Drafts of some chapters in the "Handbook of Formal Language Theory", Vol. 1–3, G. Rozenberg and A. Salomaa (eds.), Springer Verlag, (1997):
** Alexandru Mateescu and Arto Salomaa, "Preface" in Vol. 1, pp. v–viii, and "Formal Languages: An Introduction and a Synopsis", Chapter 1 in Vol. 1, pp. 1–39
** Sheng Yu, "Regular Languages", Chapter 2 in Vol. 1
** Jean-Michel Autebert, Jean Berstel, Luc Boasson, "Context-Free Languages and Push-Down Automata", Chapter 3 in Vol. 1
** Christian Choffrut and Juhani Karhumäki, "Combinatorics of Words", Chapter 6 in Vol. 1
** Tero Harju and Juhani Karhumäki, "Morphisms", Chapter 7 in Vol. 1, pp. 439–510
** Jean-Eric Pin, "Syntactic semigroups", Chapter 10 in Vol. 1, pp. 679–746
** M. Crochemore and C. Hancart, "Automata for matching patterns", Chapter 9 in Vol. 2
** Dora Giammarresi, Antonio Restivo, "Two-dimensional Languages", Chapter 4 in Vol. 3, pp. 215–267