In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense used in theoretical computer science (as opposed to many modern regular expression engines, which are augmented with features that allow recognition of non-regular languages).

Alternatively, a regular language can be defined as a language recognized by a finite automaton. The equivalence of regular expressions and finite automata is known as Kleene's theorem (after American mathematician Stephen Cole Kleene). In the Chomsky hierarchy, regular languages are the languages generated by Type-3 grammars.

Formal definition

The collection of regular languages over an alphabet Σ is defined recursively as follows:
* The empty language Ø is a regular language.
* For each ''a'' ∈ Σ (''a'' belongs to Σ), the singleton language {''a''} is a regular language.
* If ''A'' is a regular language, ''A''* (Kleene star) is a regular language. Due to this, the empty string language {ε} is also regular.
* If ''A'' and ''B'' are regular languages, then ''A'' ∪ ''B'' (union) and ''A'' • ''B'' (concatenation) are regular languages.
* No other languages over Σ are regular.

See regular expression for syntax and semantics of regular expressions.
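
The recursive definition can be tried out on small cases by applying each operation to finite sets of strings, truncated at a maximum length. The following Python sketch is only an illustration: the function names `concat` and `star` and the length bound `n` are assumptions introduced here, and the truncation is what keeps the otherwise infinite language ''A''* finite.

```python
def concat(A, B, n):
    """Concatenation A • B, keeping only words of length at most n."""
    return {u + v for u in A for v in B if len(u) + len(v) <= n}


def star(A, n):
    """Kleene star A*, keeping only words of length at most n."""
    result = {""}          # the empty word always belongs to A*
    frontier = {""}
    while frontier:
        # Extend the newest words by one more factor from A.
        frontier = concat(frontier, A, n) - result
        result |= frontier
    return result


# The star of the empty language contains exactly the empty word.
empty_star = star(set(), 2)

# Words of length at most 2 in a*b*, built from the singletons {a} and {b}.
a_star_b_star = concat(star({"a"}, 2), star({"b"}, 2), 2)
print(sorted(a_star_b_star, key=lambda w: (len(w), w)))
# prints ['', 'a', 'b', 'aa', 'ab', 'bb']
```

Union needs no helper here, since it is the built-in set union `A | B`.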

Examples

All finite languages are regular; in particular the empty string language {ε} = Ø* is regular. Other typical examples include the language consisting of all strings over the alphabet {''a'', ''b''} which contain an even number of ''a''s, or the language consisting of all strings of the form: several ''a''s followed by several ''b''s.

A simple example of a language that is not regular is the set of strings { ''a''^''n'' ''b''^''n'' | ''n'' ≥ 0 }. Intuitively, it cannot be recognized with a finite automaton, since a finite automaton has finite memory and it cannot remember the exact number of ''a''s. Techniques to prove this fact rigorously are given below.
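
The "even number of ''a''s" example can be made concrete with a two-state deterministic finite automaton. The sketch below is a minimal illustration in Python; the state names `even` and `odd`, the table `EVEN_A`, and the helper `accepts` are assumptions chosen here for readability.

```python
def accepts(delta, start, accepting, word):
    """Run a DFA given as a transition table on a word."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting


# Reading an 'a' flips the parity; reading a 'b' leaves it unchanged.
EVEN_A = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}

print(accepts(EVEN_A, "even", {"even"}, "abab"))  # prints True
print(accepts(EVEN_A, "even", {"even"}, "ab"))    # prints False
```

The automaton only remembers one bit, the parity of the ''a''s seen so far, which is why finite memory suffices for this language.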

Equivalent formalisms

A regular language satisfies the following equivalent properties:
# it is the language of a regular expression (by the above definition)
# it is the language accepted by a nondeterministic finite automaton (NFA) (1. ⇒ 2. by Thompson's construction algorithm; 2. ⇒ 1. by Kleene's algorithm or using Arden's lemma)
# it is the language accepted by a deterministic finite automaton (DFA) (2. ⇒ 3. by the powerset construction; 3. ⇒ 2. since the former definition is stronger than the latter)
# it can be generated by a regular grammar (2. ⇒ 4. see Hopcroft, Ullman (1979), Theorem 9.2, p.219; 4. ⇒ 2. see Hopcroft, Ullman (1979), Theorem 9.1, p.218)
# it is the language accepted by an alternating finite automaton
# it is the language accepted by a two-way finite automaton
# it can be generated by a prefix grammar
# it can be accepted by a read-only Turing machine
# it can be defined in monadic second-order logic (Büchi–Elgot–Trakhtenbrot theorem)
# it is recognized by some finite syntactic monoid ''M'', meaning it is the preimage of a subset ''S'' of a finite monoid ''M'' under a monoid homomorphism ''f'': Σ^* → ''M'' from the free monoid on its alphabet (3. ⇔ 10. by the Myhill–Nerode theorem)
# the number of equivalence classes of its syntactic congruence is finite, where ''u''~''v'' is defined as: ''uw''∈''L'' if and only if ''vw''∈''L'' for all ''w''∈Σ^* (3. ⇔ 11. see the proof in the ''Syntactic monoid'' article, and see p.160 in ). This number equals the number of states of the minimal deterministic finite automaton accepting ''L''.

Properties 10. and 11. are purely algebraic approaches to define regular languages; a similar set of statements can be formulated for a monoid ''M'' ⊆ Σ^*. In this case, equivalence over ''M'' leads to the concept of a recognizable language.

Some authors use one of the above properties different from "1." as an alternative definition of regular languages.

Some of the equivalences above, particularly those among the first four formalisms, are called ''Kleene's theorem'' in textbooks. Precisely which one (or which subset) is called such varies between authors. One textbook calls the equivalence of regular expressions and NFAs ("1." and "2." above) "Kleene's theorem". Another textbook calls the equivalence of regular expressions and DFAs ("1." and "3." above) "Kleene's theorem". Two other textbooks first prove the expressive equivalence of NFAs and DFAs ("2." and "3.") and then state "Kleene's theorem" as the equivalence between regular expressions and finite automata (the latter said to describe "recognizable languages"). A linguistically oriented text first equates regular grammars ("4." above) with DFAs and NFAs, calls the languages generated by (any of) these "regular", after which it introduces regular expressions, which it says describe "rational languages", and finally states "Kleene's theorem" as the coincidence of regular and rational languages. Other authors simply ''define'' "rational expression" and "regular expressions" as synonymous and do the same with "rational languages" and "regular languages".

Apparently, the term ''"regular"'' originates from a 1951 technical report where Kleene introduced ''"regular events"'' and explicitly welcomed ''"any suggestions as to a more descriptive term"''.

Noam Chomsky, in his 1959 seminal article, used the term ''"regular"'' in a different meaning at first (referring to what is called ''"Chomsky normal form"'' today; see Definition 8, p.149 there), but noticed that his ''"finite state languages"'' were equivalent to Kleene's ''"regular events"''.

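The step from an NFA to a DFA (2. ⇒ 3. in the list of equivalent formalisms) relies on the powerset construction, in which each DFA state is a set of NFA states. The following Python sketch is a minimal illustration under stated assumptions: the NFA has no ε-transitions, its transition table maps `(state, symbol)` to a set of states, and the names `powerset_construction`, `run_dfa` and `NFA_DELTA` are hypothetical helpers introduced here.

```python
from collections import deque


def powerset_construction(nfa_delta, start, accepting, alphabet):
    """Convert an ε-free NFA into an equivalent DFA over sets of NFA states."""
    start_set = frozenset([start])
    dfa_delta = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # All NFA states reachable from any state in `current` on `symbol`.
            target = frozenset(
                t for s in current for t in nfa_delta.get((s, symbol), ())
            )
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A set of NFA states accepts if it contains at least one accepting state.
    dfa_accepting = {S for S in seen if S & accepting}
    return dfa_delta, start_set, dfa_accepting


def run_dfa(dfa_delta, start_set, dfa_accepting, word):
    state = start_set
    for symbol in word:
        state = dfa_delta[(state, symbol)]
    return state in dfa_accepting


# NFA for "words over {a, b} ending in ab": state 0 guesses where the suffix starts.
NFA_DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa = powerset_construction(NFA_DELTA, 0, {2}, "ab")

print(run_dfa(*dfa, "aab"))  # prints True
print(run_dfa(*dfa, "aba"))  # prints False
```

Only the subsets reachable from the start set are generated, so the resulting DFA here has three states rather than all eight subsets of the NFA's state set.
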
Closure properties

The regular languages are closed under various operations, that is, if the languages ''K'' and ''L'' are regular, so is the result of the following operations:
* the set-theoretic Boolean operations: union ''K'' ∪ ''L'', intersection ''K'' ∩ ''L'', and complement Σ^* − ''L'', hence also relative complement ''K'' − ''L'' (Salomaa (1981) p.28).
* the regular operations: union ''K'' ∪ ''L'', concatenation ''K'' • ''L'', and Kleene star ''L''* (Salomaa (1981) p.27).
* the trio operations: string homomorphism, inverse string homomorphism, and intersection with regular languages. As a consequence they are closed under arbitrary finite state transductions, like quotient ''K'' / ''L'' with a regular language. Even more, regular languages are closed under quotients with ''arbitrary'' languages: If ''L'' is regular then ''L'' / ''K'' is regular for any ''K''.
* the reverse (or mirror image) ''L''^R. Given a nondeterministic finite automaton to recognize ''L'', an automaton for ''L''^R can be obtained by reversing all transitions and interchanging starting and finishing states. This may result in multiple starting states; ε-transitions can be used to join them.
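
Closure under complement and intersection can be seen directly on deterministic automata: complement swaps accepting and non-accepting states, and intersection runs both automata in parallel (the product construction). The Python sketch below assumes complete DFAs given as tuples `(delta, start, accepting, states)`; this representation and the names `complement`, `intersection`, `EVEN_A` and `ENDS_B` are assumptions made for the illustration.

```python
def accepts(dfa, word):
    delta, state, accepting, _ = dfa
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting


def complement(dfa):
    """Same automaton with accepting and non-accepting states swapped."""
    delta, start, accepting, states = dfa
    return delta, start, states - accepting, states


def intersection(dfa1, dfa2, alphabet):
    """Product automaton: a pair accepts when both components accept."""
    d1, s1, f1, q1 = dfa1
    d2, s2, f2, q2 = dfa2
    states = {(p, q) for p in q1 for q in q2}
    delta = {
        ((p, q), a): (d1[(p, a)], d2[(q, a)])
        for (p, q) in states
        for a in alphabet
    }
    accepting = {(p, q) for (p, q) in states if p in f1 and q in f2}
    return delta, (s1, s2), accepting, states


# Even number of a's, and words ending in b, both over the alphabet {a, b}.
EVEN_A = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0}, {0, 1})
ENDS_B = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1}, {0, 1})

both = intersection(EVEN_A, ENDS_B, "ab")
odd_a = complement(EVEN_A)
print(accepts(both, "aab"), accepts(both, "ab"), accepts(odd_a, "ab"))
# prints True False True
```

Union follows from these two operations by De Morgan's law, or by changing the acceptance condition of the product to "at least one component accepts".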

Decidability properties

Given two deterministic finite automata ''A'' and ''B'', it is decidable whether they accept the same language. As a consequence, using the above closure properties, the following problems are also decidable for arbitrarily given deterministic finite automata ''A'' and ''B'', with accepted languages ''L''_''A'' and ''L''_''B'', respectively:
* Containment: is ''L''_''A'' ⊆ ''L''_''B'' ? (Check if ''L''_''A'' ∩ ''L''_''B'' = ''L''_''A''. Deciding this property is NP-hard in general; see File:RegSubsetNP.pdf for an illustration of the proof idea.)
* Disjointness: is ''L''_''A'' ∩ ''L''_''B'' = Ø ?
* Emptiness: is ''L''_''A'' = Ø ?
* Universality: is ''L''_''A'' = Σ^* ?
* Membership: given ''a'' ∈ Σ^*, is ''a'' ∈ ''L''_''B'' ?

For regular expressions, the universality problem is NP-complete already for a singleton alphabet. For larger alphabets, that problem is PSPACE-complete. If regular expressions are extended to allow also a ''squaring operator'', with "''A''²" denoting the same as "''AA''", still just regular languages can be described, but the universality problem has an exponential space lower bound, and is in fact complete for exponential space with respect to polynomial-time reduction.

For a fixed finite alphabet, the theory of the set of all languages (together with strings, membership of a string in a language, and for each character, a function to append the character to a string, and no other operations) is decidable, and its minimal elementary substructure consists precisely of regular languages. For a binary alphabet, the theory is called S2S.
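
For deterministic automata, containment, equivalence and emptiness can all be decided by exploring the reachable part of a product automaton. The Python sketch below is one possible way to do this, not a reference implementation: it assumes complete DFAs given as tuples `(delta, start, accepting)`, and the names `reachable_pairs`, `contained`, `equivalent`, `is_empty`, `A_MOD4` and `B_EVEN` are hypothetical.

```python
from collections import deque


def reachable_pairs(dfa_a, dfa_b, alphabet):
    """All pairs of states reachable when both DFAs read the same word."""
    (da, sa, _), (db, sb, _) = dfa_a, dfa_b
    seen = {(sa, sb)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        for a in alphabet:
            nxt = (da[(p, a)], db[(q, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def contained(dfa_a, dfa_b, alphabet):
    """L_A ⊆ L_B iff no reachable pair accepts in A but rejects in B."""
    fa, fb = dfa_a[2], dfa_b[2]
    return all(
        not (p in fa and q not in fb)
        for p, q in reachable_pairs(dfa_a, dfa_b, alphabet)
    )


def equivalent(dfa_a, dfa_b, alphabet):
    return contained(dfa_a, dfa_b, alphabet) and contained(dfa_b, dfa_a, alphabet)


def is_empty(dfa, alphabet):
    """L is empty iff no accepting state is reachable (pairs are (p, p) here)."""
    return all(p not in dfa[2] for p, _ in reachable_pairs(dfa, dfa, alphabet))


# Number of a's divisible by 4, and number of a's even, over the alphabet {a, b}.
A_MOD4 = ({(i, s): ((i + 1) % 4 if s == "a" else i) for i in range(4) for s in "ab"}, 0, {0})
B_EVEN = ({(i, s): ((i + 1) % 2 if s == "a" else i) for i in range(2) for s in "ab"}, 0, {0})

print(contained(A_MOD4, B_EVEN, "ab"))   # prints True
print(contained(B_EVEN, A_MOD4, "ab"))   # prints False
```

Each check visits at most |''Q''_''A''| · |''Q''_''B''| pairs of states, so for DFAs given explicitly these procedures run in polynomial time.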

Complexity results

In computational complexity theory, the complexity class of all regular languages is sometimes referred to as REGULAR or REG and equals DSPACE(O(1)), the decision problems that can be solved in constant space (the space used is independent of the input size). REGULAR ≠ AC⁰, since it (trivially) contains the parity problem of determining whether the number of 1 bits in the input is even or odd and this problem is not in AC⁰. On the other hand, REGULAR does not contain AC⁰, because the nonregular language of palindromes, or the nonregular language { 0^''n'' 1^''n'' : ''n'' ≥ 0 }, can both be recognized in AC⁰.

If a language is ''not'' regular, it requires a machine with at least Ω(log log ''n'') space to recognize (where ''n'' is the input size). In other words, DSPACE(o(log log ''n'')) equals the class of regular languages. In practice, most nonregular problems are solved by machines taking at least logarithmic space.

Location in the Chomsky hierarchy

To locate the regular languages in the Chomsky hierarchy, one notices that every regular language is context-free. The converse is not true: for example the language consisting of all strings having the same number of ''a'''s as ''b'''s is context-free but not regular. To prove that a language is not regular, one often uses the Myhill–Nerode theorem and the pumping lemma. Other approaches include using the closure properties of regular languages or quantifying Kolmogorov complexity.

Important subclasses of regular languages include
* Finite languages, those containing only a finite number of words. These are regular languages, as one can create a regular expression that is the union of every word in the language.
* Star-free languages, those that can be described by a regular expression constructed from the empty symbol, letters, concatenation and all boolean operators (see algebra of sets) including complementation but not the Kleene star: this class includes all finite languages.
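
The argument that every finite language is regular can be illustrated by building the union of its words as a single expression. The Python sketch below uses the standard `re` module, whose engine supports more than strictly regular expressions; only literal words and alternation are used here. The helper name `finite_language_regex` is an assumption, and the sketch assumes the language is non-empty.

```python
import re


def finite_language_regex(words):
    """Regular expression denoting exactly the given finite, non-empty set of words."""
    if not words:
        raise ValueError("this sketch assumes a non-empty language")
    # Escape each word so that characters such as '+' are matched literally,
    # then join the words with the union operator '|'.
    return "|".join(re.escape(w) for w in sorted(words))


pattern = re.compile(finite_language_regex({"cat", "dog", "a+b"}))

print(pattern.fullmatch("a+b") is not None)  # prints True
print(pattern.fullmatch("aab") is not None)  # prints False
```

`fullmatch` is used so that a string is accepted only when the whole string, not merely a prefix, is one of the listed words.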

The number of words in a regular language

Let $s_L(n)$ denote the number of words of length $n$ in $L$. The ordinary generating function for $L$ is the formal power series

$$S_L(z) = \sum_{n \geq 0} s_L(n) z^n .$$

The generating function of a language $L$ is a rational function if $L$ is regular. Hence for every regular language $L$ the sequence $s_L(n)_{n \geq 0}$ is constant-recursive; that is, there exist an integer constant $n_0$, complex constants $\lambda_1,\,\ldots,\,\lambda_k$ and complex polynomials $p_1(x),\,\ldots,\,p_k(x)$ such that for every $n \geq n_0$ the number $s_L(n)$ of words of length $n$ in $L$ is

$$s_L(n) = p_1(n)\lambda_1^n + \dotsb + p_k(n)\lambda_k^n .$$

Thus, non-regularity of certain languages $L'$ can be proved by counting the words of a given length in $L'$. Consider, for example, the Dyck language of strings of balanced parentheses. The number of words of length $2n$ in the Dyck language is equal to the Catalan number $C_n \sim \frac{4^n}{n^{3/2}\sqrt{\pi}}$, which is not of the form $p(n)\lambda^n$, witnessing the non-regularity of the Dyck language.

Care must be taken since some of the eigenvalues $\lambda_i$ could have the same magnitude. For example, the number of words of length $n$ in the language of all binary words of even length is not of the form $p(n)\lambda^n$, but the number of words of even or odd length are of this form; the corresponding eigenvalues are $2, -2$. In general, for every regular language there exists a constant $d$ such that for all $a$, the number of words of length $dm+a$ is asymptotically $C_a m^{p_a} \lambda_a^m$.

The ''zeta function'' of a language $L$ is

$$\zeta_L(z) = \exp\left( \sum_{n \geq 1} s_L(n) \frac{z^n}{n} \right) .$$

The zeta function of a regular language is not in general rational, but that of an arbitrary cyclic language is.
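
The counting sequence of a regular language can be computed from a DFA by propagating, length by length, the number of words that lead to each state. The Python sketch below applies this to the binary words of even length and compares the result with the closed form (2^''n'' + (−2)^''n'')/2, which matches the eigenvalues 2 and −2 mentioned above. The function name `count_words` and the automaton `EVEN_LENGTH` are assumptions introduced for this illustration.

```python
def count_words(delta, start, accepting, alphabet, max_len):
    """Return [s_L(0), ..., s_L(max_len)] for the language of a complete DFA."""
    paths = {start: 1}   # number of words of the current length reaching each state
    counts = []
    for _ in range(max_len + 1):
        counts.append(sum(c for q, c in paths.items() if q in accepting))
        nxt = {}
        for q, c in paths.items():
            for a in alphabet:
                t = delta[(q, a)]
                nxt[t] = nxt.get(t, 0) + c
        paths = nxt
    return counts


# Two states tracking the parity of the length; every symbol flips the parity.
EVEN_LENGTH = {(0, "0"): 1, (0, "1"): 1, (1, "0"): 0, (1, "1"): 0}

counts = count_words(EVEN_LENGTH, 0, {0}, "01", 5)
closed_form = [(2**n + (-2) ** n) // 2 for n in range(6)]
print(counts)                  # prints [1, 0, 4, 0, 16, 0]
print(counts == closed_form)   # prints True
```

The update step is a multiplication by the transition-count matrix of the automaton, whose eigenvalues are the constants λ_''i'' in the formula for the number of words of length ''n''.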

Generalizations

The notion of a regular language has been generalized to infinite words (see ω-automata) and to trees (see tree automaton).

Rational set generalizes the notion (of regular/rational language) to monoids that are not necessarily free. Likewise, the notion of a recognizable language (by a finite automaton) has a namesake in the recognizable set over a monoid that is not necessarily free. Howard Straubing notes in relation to these facts that “The term "regular language" is a bit unfortunate. Papers influenced by Eilenberg's monograph [in two volumes, "A" (1974) and "B" (1976), the latter with two chapters by Bret Tilson] often use either the term "recognizable language", which refers to the behavior of automata, or "rational language", which refers to important analogies between regular expressions and rational power series. (In fact, Eilenberg defines rational and recognizable subsets of arbitrary monoids; the two notions do not, in general, coincide.) This terminology, while better motivated, never really caught on, and "regular language" is used almost universally.”

Rational series is another generalization, this time in the context of a formal power series over a semiring. This approach gives rise to weighted rational expressions and weighted automata. In this algebraic context, the regular languages (corresponding to Boolean-weighted rational expre