Lyndon Word
   HOME

TheInfoList



OR:

In
mathematics Mathematics is an area of knowledge that includes the topics of numbers, formulas and related structures, shapes and the spaces in which they are contained, and quantities and their changes. These topics are represented in modern mathematics ...
, in the areas of
combinatorics Combinatorics is an area of mathematics primarily concerned with counting, both as a means and an end in obtaining results, and certain properties of finite structures. It is closely related to many other areas of mathematics and has many appl ...
and
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, a Lyndon word is a nonempty string that is strictly smaller in
lexicographic order In mathematics, the lexicographic or lexicographical order (also known as lexical order, or dictionary order) is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a ...
than all of its rotations. Lyndon words are named after mathematician
Roger Lyndon Roger Conant Lyndon (December 18, 1917 – June 8, 1988) was an American mathematician, for many years a professor at the University of Michigan.. He is known for Lyndon words, the Curtis–Hedlund–Lyndon theorem, Craig–Lyndon interpolation a ...
, who investigated them in 1954, calling them standard lexicographic sequences. Anatoly Shirshov introduced Lyndon words in 1953 calling them regular words. Lyndon words are a special case of
Hall word In mathematics, in the areas of group theory and combinatorics, Hall words provide a unique monoid factorisation of the free monoid. They are also totally ordered, and thus provide a total order on the monoid. This is analogous to the better-known ...
s; almost all properties of Lyndon words are shared by Hall words.


Definitions

Several equivalent definitions exist. A k-ary Lyndon word of length n > 0 is an n-character string over an
alphabet An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syll ...
of size k, and which is the unique minimum element in the
lexicographical order In mathematics, the lexicographic or lexicographical order (also known as lexical order, or dictionary order) is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a ...
ing in the
multiset In mathematics, a multiset (or bag, or mset) is a modification of the concept of a set that, unlike a set, allows for multiple instances for each of its elements. The number of instances given for each element is called the multiplicity of that e ...
of all its rotations. Being the singularly smallest rotation implies that a Lyndon word differs from any of its non-trivial rotations, and is therefore aperiodic.; . Alternately, a word w is a Lyndon word if and only if it is nonempty and lexicographically strictly smaller than any of its proper suffixes, that is w < v for all nonempty words v such that w=uv and u is nonempty. Another characterisation is the following: A Lyndon word has the property that it is nonempty and, whenever it is split into two nonempty substrings, the left substring is always lexicographically less than the right substring. That is, if w is a Lyndon word, and w=uv is any factorization into two substrings, with u and v understood to be non-empty, then u < v. This definition implies that a string w of length \ge 2 is a Lyndon word if and only if there exist Lyndon words u and v such that u < v and w=uv. Although there may be more than one choice of u and v with this property, there is a particular choice, called the ''standard factorization'', in which v is as long as possible..


Enumeration

The Lyndon words over the two-symbol binary alphabet , sorted by length and then lexicographically within each length class, form an infinite sequence that begins :0, 1, 01, 001, 011, 0001, 0011, 0111, 00001, 00011, 00101, 00111, 01011, 01111, ... The first string that does not belong to this sequence, "00", is omitted because it is periodic (it consists of two repetitions of the substring "0"); the second omitted string, "10", is aperiodic but is not minimal in its permutation class as it can be cyclically permuted to the smaller string "01". The empty string also meets the definition of a Lyndon word of length zero. The numbers of binary Lyndon words of each length, starting with length zero, form the
integer sequence In mathematics, an integer sequence is a sequence (i.e., an ordered list) of integers. An integer sequence may be specified ''explicitly'' by giving a formula for its ''n''th term, or ''implicitly'' by giving a relationship between its terms. For ...
:1, 2, 1, 2, 3, 6, 9, 18, 30, 56, 99, 186, 335, ... Lyndon words correspond to aperiodic necklace class representatives and can thus be counted with
Moreau's necklace-counting function In combinatorial mathematics, the necklace polynomial, or Moreau's necklace-counting function, introduced by , counts the number of distinct necklaces of ''n'' colored beads chosen out of α available colors. The necklaces are assumed to be aperio ...
.


Generation

provides an efficient
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...
for listing the Lyndon words of length at most n with a given alphabet size s in
lexicographic order In mathematics, the lexicographic or lexicographical order (also known as lexical order, or dictionary order) is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a ...
. If w is one of the words in the sequence, then the next word after w can be found by the following steps: #Repeat the symbols from w to form a new word x of length exactly n, where the ith symbol of x is the same as the symbol at position i mod length(w) of w. #As long as the final symbol of x is the last symbol in the sorted ordering of the alphabet, remove it, producing a shorter word. #Replace the final remaining symbol of x by its successor in the sorted ordering of the alphabet. The worst-case time to generate the successor of a word w by this procedure is O(n). However, if the words being generated are stored in an
array An array is a systematic arrangement of similar objects, usually in rows and columns. Things called an array include: {{TOC right Music * In twelve-tone and serial composition, the presentation of simultaneous twelve-tone sets such that the ...
of length n, and the construction of x from w is performed by adding symbols onto the end of w rather than by making a new copy of w, then the average time to generate the successor of w (assuming each word is equally likely) is constant. Therefore, the sequence of all Lyndon words of length at most n can be generated in time proportional to the length of the sequence. A constant fraction of the words in this sequence have length exactly n, so the same procedure can be used to efficiently generate words of length exactly n or words whose length divides n, by filtering out the generated words that do not fit these criteria.


Standard factorization

According to the
Chen–Fox–Lyndon theorem In mathematics, a factorisation of a free monoid is a sequence of subsets of words with the property that every word in the free monoid can be written as a concatenation of elements drawn from the subsets. The Chen–Fox–Lyndon theorem states tha ...
, every string may be formed in a unique way by concatenating a sequence of Lyndon words, in such a way that the words in the sequence are nonincreasing lexicographically. The final Lyndon word in this sequence is the lexicographically smallest suffix of the given string.. A factorization into a nonincreasing sequence of Lyndon words (the so-called Lyndon factorization) can be constructed in
linear time In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...
. Lyndon factorizations may be used as part of a bijective variant of the
Burrows–Wheeler transform The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated c ...
for
data compression In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression ...
, and in algorithms for
digital geometry Digital geometry deals with discrete sets (usually discrete point sets) considered to be digitized models or images of objects of the 2D or 3D Euclidean space. Simply put, digitizing is replacing an object by a discrete set of its points. Th ...
. Such factorizations can be written (uniquely) as finite binary trees, with the leaves labelled by the alphabet, with each rightward branch given by the final Lyndon word in the sequence. Such trees are sometimes called standard bracketings and can be taken as factorization of the elements of a
free group In mathematics, the free group ''F'S'' over a given set ''S'' consists of all words that can be built from members of ''S'', considering two words to be different unless their equality follows from the group axioms (e.g. ''st'' = ''suu''−1' ...
or as the basis elements for a
free Lie algebra In mathematics, a free Lie algebra over a field ''K'' is a Lie algebra generated by a set ''X'', without any imposed relations other than the defining relations of alternating ''K''-bilinearity and the Jacobi identity. Definition The definition ...
. These trees are a special case of Hall trees (as Lyndon words are a special case of Hall words), and so likewise, the Hall words provide a standard order, called the
commutator collecting process In group theory, a branch of mathematics, the commutator collecting process is a method for writing an element of a group as a product of generators and their higher commutators arranged in a certain order. The commutator collecting process was in ...
for groups, and basis for Lie algebras. Indeed, they provide an explicit construction for the commutators appearing in the
Poincaré–Birkhoff–Witt theorem In mathematics, more specifically in the theory of Lie algebras, the Poincaré–Birkhoff–Witt theorem (or PBW theorem) is a result giving an explicit description of the universal enveloping algebra of a Lie algebra. It is named after Henri Poi ...
needed for the construction of
universal enveloping algebra In mathematics, the universal enveloping algebra of a Lie algebra is the unital associative algebra whose representations correspond precisely to the representations of that Lie algebra. Universal enveloping algebras are used in the represent ...
s. Every Lyndon word can be understood as a permutation, the suffix standard permutation.


Duval algorithm

developed an algorithm for finding the standard factorization that runs in linear time and constant space. It iterates over a string trying to find as long a Lyndon word as possible. When it finds one, it adds it to the result list and proceeds to search the remaining part of the string. The resulting list of strings is the standard factorization of the given string. More formal description of the algorithm follows. Given a string ''S'' of length ''N'', one should proceed with the following steps: #Let ''m'' be the index of the symbol-candidate to be appended to the already collected symbols. Initially, ''m = 1'' (indices of symbols in a string start from zero). #Let ''k'' be the index of the symbol we would compare others to. Initially, ''k = 0''. #While ''k'' and ''m'' are less than ''N'', compare ''S ' (the ''k''-th symbol of the string ''S'') to ''S '. There are three possible outcomes: ##''S ' is equal to ''S ': append ''S ' to the current collected symbols. Increment ''k'' and ''m''. ##''S ' is less than ''S ': if we append ''S ' to the current collected symbols, we'll get a Lyndon word. But we can't add it to the result list yet because it may be just a part of a larger Lyndon word. Thus, just increment ''m'' and set ''k'' to 0 so the next symbol would be compared to the first one in the string. ##''S ' is greater than ''S ': if we append ''S ' to the current collected symbols, it will be neither a Lyndon word nor a possible beginning of one. Thus, add the first ''m-k'' collected symbols to the result list, remove them from the string, set ''m'' to 1 and ''k'' to 0 so that they point to the second and the first symbol of the string respectively. #When ''m > N'', it is essentially the same as encountering minus infinity, thus, add the first ''m-k'' collected symbols to the result list after removing them from the string, set ''m'' to 1 and ''k'' to 0, and return to the previous step. #Add ''S'' to the result list.


Connection to de Bruijn sequences

If one concatenates together, in lexicographic order, all the Lyndon words that have length dividing a given number ''n'', the result is a
de Bruijn sequence In combinatorial mathematics, a de Bruijn sequence of order ''n'' on a size-''k'' alphabet ''A'' is a cyclic sequence in which every possible length-''n'' string on ''A'' occurs exactly once as a substring (i.e., as a ''contiguous'' subseq ...
, a circular sequence of symbols such that each possible length-''n'' sequence appears exactly once as one of its contiguous subsequences. For example, the concatenation of the binary Lyndon words whose length divides four is : 0 0001 0011 01 0111 1 This construction, together with the efficient generation of Lyndon words, provides an efficient method for constructing a particular de Bruijn sequence in
linear time In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...
and
logarithmic space In computational complexity theory, L (also known as LSPACE or DLOGSPACE) is the complexity class containing decision problems that can be solved by a deterministic Turing machine using a logarithmic amount of writable memory space., Definition& ...
.According to , the sequence generated in this way was first described (with a different generation method) by , and the connection between it and Lyndon words was observed by .


Additional properties and applications

Lyndon words have an application to the description of
free Lie algebra In mathematics, a free Lie algebra over a field ''K'' is a Lie algebra generated by a set ''X'', without any imposed relations other than the defining relations of alternating ''K''-bilinearity and the Jacobi identity. Definition The definition ...
s, in constructing a basis for the homogeneous part of a given degree; this was Lyndon's original motivation for introducing these words.. Lyndon words may be understood as a special case of
Hall set In mathematics, in the areas of group theory and combinatorics, Hall words provide a unique monoid factorisation of the free monoid. They are also totally ordered, and thus provide a total order on the monoid. This is analogous to the better-known ...
s. For prime ''p'', the number of irreducible
monic polynomial In algebra, a monic polynomial is a single-variable polynomial (that is, a univariate polynomial) in which the leading coefficient (the nonzero coefficient of highest degree) is equal to 1. Therefore, a monic polynomial has the form: :x^n+c_x^+\c ...
s of degree ''d'' over the field \mathbb_p is the same as the number of Lyndon words of length ''d'' in an alphabet of ''p'' symbols; they can be placed into explicit correspondence. A theorem of Radford states that a
shuffle algebra In mathematics, a shuffle algebra is a Hopf algebra with a basis corresponding to words on some set, whose product is given by the shuffle product ''X'' ⧢ ''Y'' of two words ''X'', ''Y'': the sum of all ways of interlacing them. The interlacing i ...
over a field of characteristic 0 can be viewed as a polynomial algebra over the Lyndon words. More precisely, let ''A'' be an alphabet, let ''k'' be a field of characteristic 0 (or, more general, a commutative \mathbb-algebra), and let ''R'' be the free noncommutative ''k''-algebra . The words over ''A'' can then be identified with the "noncommutative monomials" (i.e., products of the ''xa'') in ''R''; namely, we identify a word (''a''1,''a''2,...,''an'') with the monomial ''x''''a''1''x''''a''2...''xan''. Thus, the words over ''A'' form a ''k''-vector space basis of ''R''. Then, a ''shuffle product'' is defined on ''R''; this is a ''k''-bilinear, associative and commutative product, which is denoted by ш, and which on the words can be recursively defined by :1 ш ''v'' = ''v'' for any word ''v''; :''u'' ш 1 = ''u'' for any word ''u''; :''ua'' ш ''vb'' = (''u'' ш ''vb'')''a'' + (''ua'' ш ''v'')''b'' for any ''a'',''b'' ∈ A and any words ''u'' and ''v''. The ''shuffle algebra'' on the alphabet ''A'' is defined to be the additive group ''R'' endowed with ш as multiplication. Radford's theorem now states that the Lyndon words are algebraically independent elements of this shuffle algebra, and generate it; thus, the shuffle algebra is isomorphic to a polynomial ring over ''k'', with the indeterminates corresponding to the Lyndon words.


See also

*
Lexicographically minimal string rotation In computer science, the lexicographically minimal string rotation or lexicographically least circular substring is the problem of finding the rotation of a string possessing the lowest lexicographical order of all such rotations. For example, the ...
*
Morphic word In mathematics and computer science, a morphic word or substitutive word is an infinite sequence of symbols which is constructed from a particular class of endomorphism of a free monoid. Every automatic sequence is morphic. Definition Let ''f'' ...
* Sturmian word *
Necklace (combinatorics) In combinatorics, a ''k''-ary necklace of length ''n'' is an equivalence class of ''n''-character strings over an alphabet of size ''k'', taking all rotations as equivalent. It represents a structure with ''n'' circularly connected beads which ...


Notes


References

*. *. *. *. *. *. *. *. * * *
*. * *. *. * * * *. *. *. * * *. {{refend Combinatorics on words