combinatorics Combinatorics is an area of mathematics primarily concerned with counting, both as a means and an end in obtaining results, and certain properties of finite structures. It is closely related to many other areas of mathematics and has many appl ...

, a branch of

mathematics Mathematics is an area of knowledge that includes the topics of numbers, formulas and related structures, shapes and the spaces in which they are contained, and quantities and their changes. These topics are represented in modern mathematics ...

, the autocorrelation of a

word A word is a basic element of language that carries an semantics, objective or pragmatics, practical semantics, meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of w ...

is the set of periods of this word. More precisely, it is a sequence of values which indicate how much the end of a word looks likes the beginning of a word. This value can be used to compute, for example, the average value of the first occurrence of this word in a random string.

Definition

In this article, ''A'' is an

alphabet An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syll ...

, and

w=w_1\dots w_n

on ''A'' of length ''n''. The autocorrelation of

w

can be defined as the

correlation In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics ...

w

with itself. However, we redefine this notion below.

Autocorrelation vector

The autocorrelation vector of

w

c=(c_0,\dots,c_)

, with

c_i

being 1 if the

prefix A prefix is an affix which is placed before the Word stem, stem of a word. Adding it to the beginning of one word changes it into another word. For example, when the prefix ''un-'' is added to the word ''happy'', it creates the word ''unhappy'' ...

of length

n-i

equals the

suffix In linguistics, a suffix is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns, adjectives, and verb endings, which form the conjugation of verbs. Suffixes can carry ...

of length

n-i

, and with

c_i

being 0 otherwise. That is

c_i

indicates whether

w_\dots w_n=w_1\dots w_

. For example, the autocorrelation vector of

aaa

(1,1,1)

since, clearly, for

i

being 0, 1 or 2, the prefix of length

n-i

is equal to the suffix of length

n-i

. The autocorrelation vector of

abb

(1,0,0)

since no strict prefix is equal to a strict suffix. Finally, the autocorrelation vector of

aabbaa

is 100011, as shown in the following table: Note that

c_0

is always equal to 1, since the prefix and the suffix of length

n

are both equal to the word

w

. Similarly,

c_

is 1 if and only if the first and the last letters are the same.

Autocorrelation polynomial

The autocorrelation polynomial of

w

is defined as

c(z)=c_0z^0+\dots+c_z^

. It is a polynomial of degree at most

n-1

. For example, the autocorrelation polynomial of

aaa

1+z+z^2

and the autocorrelation polynomial of

abb

1

. Finally, the autocorrelation polynomial of

aabbaa

1+z^4+z^5

Property

We now indicate some properties which can be computed using the autocorrelation polynomial.

First occurrence of a word in a random string

Suppose that you choose an infinite sequence

s

of letters of

A

, randomly, each letter with probability

\frac

, where

, A,

is the number of letters of

A

. Let us call

E

the expectation of the first occurrence of ?

m

s

? . Then

E

equals

, A, ^nc\left(\frac\right)

. That is, each subword

v

w

which is both a prefix and a suffix causes the average value of the first occurrence of

w

to occur

, A, ^

letters later. Here

, v,

is the length of v. For example, over the binary alphabet

A=\

, the first occurrence of

aa

is at position

2^2(1+\frac 12)=6

while the average first occurrence of

ab

is at position

2^2(1)=4

. Intuitively, the fact that the first occurrence of

aa

is later than the first occurrence of

ab

can be explained in two ways: *We can consider, for each position

p

, what are the requirement for

w

's first occurrence to be at

p

. **The first occurrence of

ab

can be at position 1 in only one way in both case. If

s

starts with

w

. This has probability

\frac14

for both considered values of

w

. **The first occurrence of

ab

is at position 2 if the prefix of

s

of length 3 is

aab

or is

bab

. However, the first occurrence of

aa

is at position 2 if and only if the prefix of

s

of length 3 is

baa

. (Note that the first occurrence of

aa

aaa

is at position 1.). **In general, the number of prefixes of length

n+1

such that the first occurrence of

aa

is at position

n

is smaller for

aa

than for

ba

. This explain why, on average, the first

aa

arrive later than the first

ab

. *We can also consider the fact that the average number of occurrences of

w

in a random string of length

l

, A, ^

. This number is independent of the autocorrelation polynomial. An occurrence of

w

may overlap another occurrence in different ways. More precisely, each 1 in its autocorrelation vector correspond to a way for occurrence to overlap. Since many occurrences of

w

can be packed together, using overlapping, but the average number of occurrences does not change, it follows that the distance between two non-overlapping occurrences is greater when the autocorrelation vector contains many 1's.

Ordinary generating functions

Autocorrelation polynomials allows to give simple equations for the

ordinary generating functions In mathematics, a generating function is a way of encoding an infinite sequence of numbers () by treating them as the coefficients of a formal power series. This series is called the generating function of the sequence. Unlike an ordinary ser ...

(OGF) of many natural questions. *The OGF of the languages of words not containing

w

\frac

. *The OGF of the languages of words containing

w

\frac

. *The OGF of the languages of words containing a single occurrence of

w

, at the end of the word is

\frac

References

* * *{{cite journal, last1=Odlyzko, first1=A. M., last2=Guibas, first2=L. J., title=String overlaps, pattern matching, and nontransitive games, journal=Journal of Combinatorial Theory, date=1981, volume=Series A 30, issue=2, pages=183–208, doi=10.1016/0097-3165(81)90005-4, doi-access=free Formal languages Combinatorics on words