The star height problem in formal language theory is the question whether all regular languages can be expressed using regular expressions of limited star height, i.e. with a limited nesting depth of Kleene stars. Specifically, is a nesting depth of one always sufficient? If not, is there an algorithm to determine the nesting depth that is required? The problem was first introduced by Eggan in 1963.

Families of regular languages with unbounded star height

The first question was answered in the negative in 1963, when Eggan gave examples of regular languages of star height ''n'' for every ''n''. Here, the star height ''h''(''L'') of a regular language ''L'' is defined as the minimum star height among all regular expressions representing ''L''. The first few languages found by Eggan are described below by giving a regular expression for each language:

\begin{align}
e_1 &= a_1^* \\
e_2 &= \left(a_1^*a_2^*a_3\right)^*\\
e_3 &= \left(\left(a_1^*a_2^*a_3\right)^*\left(a_4^*a_5^*a_6\right)^*a_7\right)^*\\
e_4 &= \left(
\left(\left(a_1^*a_2^*a_3\right)^*\left(a_4^*a_5^*a_6\right)^*a_7\right)^*
\left(\left(a_8^*a_9^*a_{10}\right)^*\left(a_{11}^*a_{12}^*a_{13}\right)^*a_{14}\right)^*
a_{15}\right)^*
\end{align}

The construction principle for these expressions is that expression $e_{n+1}$ is obtained by concatenating two copies of $e_n$, appropriately renaming the letters of the second copy using fresh alphabet symbols, concatenating the result with another fresh alphabet symbol, and then surrounding the resulting expression with a Kleene star. The remaining, more difficult part is to prove that for $e_n$ there is no equivalent regular expression of star height less than ''n''.

However, Eggan's examples use a large alphabet, of size $2^n-1$ for the language with star height ''n''. He thus asked whether examples can also be found over binary alphabets. This was proved to be true shortly afterwards by Dejean and Schützenberger in 1966. Their examples can be described by an inductively defined family of regular expressions over the binary alphabet $\{a,b\}$ as follows:

\begin{align}
e_1 & = (ab)^* \\
e_2 & = \left(aa(ab)^*bb(ab)^*\right)^* \\
e_3 & = \left(aaaa \left(aa(ab)^*bb(ab)^*\right)^* bbbb \left(aa(ab)^*bb(ab)^*\right)^*\right)^* \\
\, & \cdots \\
e_{n+1} & = (\,\underbrace{a\cdots a}_{2^n}\, \cdot \, e_n\, \cdot\, \underbrace{b\cdots b}_{2^n}\, \cdot\, e_n \,)^*
\end{align}

Again, a rigorous proof is needed for the fact that $e_n$ does not admit an equivalent regular expression of lower star height.
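The two inductive constructions in this section can be sketched in a few lines of Python. This is only an illustration: the class names `Sym`, `Cat`, `Star` and the helpers `eggan`, `binary`, `star_height`, `alphabet` and `show` are hypothetical names chosen for this sketch. Note that `star_height` measures the nesting depth of one given expression, which is only an upper bound on the star height of the language; it does not establish that no expression of lower star height exists.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Sym:
    name: str  # a single alphabet symbol


@dataclass(frozen=True)
class Cat:
    parts: Tuple["Regex", ...]  # concatenation of sub-expressions


@dataclass(frozen=True)
class Star:
    inner: "Regex"  # Kleene star applied to a sub-expression


Regex = Union[Sym, Cat, Star]


def star_height(e: Regex) -> int:
    """Nesting depth of Kleene stars in the expression e."""
    if isinstance(e, Sym):
        return 0
    if isinstance(e, Cat):
        return max((star_height(p) for p in e.parts), default=0)
    return 1 + star_height(e.inner)


def alphabet(e: Regex) -> frozenset:
    """Set of symbol names occurring in e."""
    if isinstance(e, Sym):
        return frozenset({e.name})
    if isinstance(e, Cat):
        return frozenset().union(*(alphabet(p) for p in e.parts))
    return alphabet(e.inner)


def eggan(n: int, first: int = 1) -> Regex:
    """Eggan's e_n, using the letters a_first, a_(first+1), ..."""
    if n == 1:
        return Star(Sym(f"a{first}"))
    size = 2 ** (n - 1) - 1  # number of letters used by one copy of e_(n-1)
    left = eggan(n - 1, first)
    right = eggan(n - 1, first + size)       # second copy, fresh letters
    extra = Sym(f"a{first + 2 * size}")      # one more fresh letter
    return Star(Cat((left, right, extra)))


def binary(n: int) -> Regex:
    """The family over {a, b}: e_(n+1) = (a^(2^n) e_n b^(2^n) e_n)*."""
    if n == 1:
        return Star(Cat((Sym("a"), Sym("b"))))
    prev = binary(n - 1)
    k = 2 ** (n - 1)
    return Star(Cat((Sym("a"),) * k + (prev,) + (Sym("b"),) * k + (prev,)))


def show(e: Regex) -> str:
    """Render an expression as a string, parenthesising starred groups."""
    if isinstance(e, Sym):
        return e.name
    if isinstance(e, Cat):
        return "".join(show(p) for p in e.parts)
    inner = show(e.inner)
    return inner + "*" if isinstance(e.inner, Sym) else f"({inner})*"


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, star_height(eggan(n)), len(alphabet(eggan(n))), show(binary(n)))
```

For small ''n'' the sketch reproduces the expressions listed above, and the alphabet size of `eggan(n)` comes out as $2^n-1$, while `binary(n)` uses only the two letters `a` and `b`.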

Computing the star height of regular languages

In contrast, the second question turned out to be much more difficult, and it became a famous open problem in formal language theory for over two decades. For years, there was little progress. The pure-group languages were the first interesting family of regular languages for which the star height problem was proved to be decidable. But the general problem remained open for more than 25 years until it was settled by Hashiguchi, who in 1988 published an algorithm to determine the star height of any regular language. The algorithm was not at all practical, being of non-elementary complexity. To illustrate the immense resource consumption of that algorithm, note that the number $10^{10^{10}}$ alone has 10 billion zeros when written down in decimal notation, and is already ''by far'' larger than the number of atoms in the observable universe.

A much more efficient algorithm than Hashiguchi's procedure was devised by Kirsten in 2005. This algorithm runs, for a given nondeterministic finite automaton as input, within double-exponential space. Yet the resource requirements of this algorithm still greatly exceed the margins of what is considered practically feasible. This algorithm was optimized and generalized to trees by Colcombet and Löding in 2008, as part of the theory of regular cost functions. It was implemented in 2017 in the tool suite Stamina.
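The size comparison made for Hashiguchi's algorithm can be spelled out with a short calculation. It assumes the commonly cited rough estimate of $10^{80}$ atoms in the observable universe; that figure is an assumption introduced here for illustration and is not stated in the text.

```latex
\begin{align}
10^{10} &= 10\,000\,000\,000 \\
10^{10^{10}} &= 1\underbrace{00\cdots 0}_{10^{10}\ \text{zeros}} \\
\frac{10^{10^{10}}}{10^{80}} &= 10^{10^{10}-80} = 10^{9\,999\,999\,920}
\end{align}
```

Under that assumption, dividing by the number of atoms barely changes the exponent, which is why the number is described as larger ''by far''.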
