computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...

, in the area of

formal language theory In logic, mathematics, computer science, and linguistics, a formal language is a set of string (computer science), strings whose symbols are taken from a set called "#Definition, alphabet". The alphabet of a formal language consists of symbol ...

, frequent use is made of a variety of string functions; however, the notation used is different from that used for

computer programming Computer programming or coding is the composition of sequences of instructions, called computer program, programs, that computers can follow to perform tasks. It involves designing and implementing algorithms, step-by-step specifications of proc ...

, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.

Strings and languages

A string is a finite sequence of characters. The

empty string In formal language theory, the empty string, or empty word, is the unique String (computer science), string of length zero. Formal theory Formally, a string is a finite, ordered sequence of character (symbol), characters such as letters, digits ...

is denoted by

\varepsilon

. The concatenation of two string

s

and

t

is denoted by

s \cdot t

, or shorter by

s t

. Concatenating with the empty string makes no difference:

s \cdot \varepsilon = s = \varepsilon \cdot s

. Concatenation of strings is

associative In mathematics, the associative property is a property of some binary operations that rearranging the parentheses in an expression will not change the result. In propositional logic, associativity is a valid rule of replacement for express ...

s \cdot (t \cdot u) = (s \cdot t) \cdot u

. For example,

(\langle b \rangle \cdot \langle l \rangle) \cdot (\varepsilon \cdot \langle ah \rangle) = \langle bl \rangle \cdot \langle ah \rangle = \langle blah \rangle

. A

language Language is a structured system of communication that consists of grammar and vocabulary. It is the primary means by which humans convey meaning, both in spoken and signed language, signed forms, and may also be conveyed through writing syste ...

is a finite or infinite set of strings. Besides the usual set operations like union, intersection etc., concatenation can be applied to languages: if both

S

and

T

are languages, their concatenation

S \cdot T

is defined as the set of concatenations of any string from

S

and any string from

T

, formally

S \cdot T = \

. Again, the concatenation dot

\cdot

is often omitted for brevity. The language

\

consisting of just the empty string is to be distinguished from the empty language

\

. Concatenating any language with the former doesn't make any change:

S \cdot \ = S = \ \cdot S

, while concatenating with the latter always yields the empty language:

S \cdot \ = \ = \ \cdot S

. Concatenation of languages is associative:

S \cdot (T \cdot U) = (S \cdot T) \cdot U

. For example, abbreviating

D = \

, the set of all three-digit decimal numbers is obtained as

D \cdot D \cdot D

. The set of all decimal numbers of arbitrary length is an example for an infinite language.

Alphabet of a string

The alphabet of a string is the set of all of the characters that occur in a particular string. If ''s'' is a string, its

alphabet An alphabet is a standard set of letter (alphabet), letters written to represent particular sounds in a spoken language. Specifically, letters largely correspond to phonemes as the smallest sound segments that can distinguish one word from a ...

is denoted by :

\operatorname(s)

The alphabet of a language

S

is the set of all characters that occur in any string of

S

, formally:

\operatorname(S) = \bigcup_ \operatorname(s)

. For example, the set

\

is the alphabet of the string

\langle cacao \rangle

, and the above

D

is the alphabet of the above language

D \cdot D \cdot D

as well as of the language of all decimal numbers.

String substitution

Let ''L'' be a

, and let Σ be its alphabet. A string substitution or simply a substitution is a mapping ''f'' that maps characters in Σ to languages (possibly in a different alphabet). Thus, for example, given a character ''a'' ∈ Σ, one has ''f''(''a'')=''L''_''a'' where ''L''_''a'' ⊆ Δ ^* is some language whose alphabet is Δ. This mapping may be extended to strings as :''f''(ε)=ε for the

ε, and :''f''(''sa'')=''f''(''s'')''f''(''a'') for string ''s'' ∈ ''L'' and character ''a'' ∈ Σ. String substitutions may be extended to entire languages as :

f(L)=\bigcup_ f(s)

Regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...

s are closed under string substitution. That is, if each character in the alphabet of a regular language is substituted by another regular language, the result is still a regular language. Similarly, context-free languages are closed under string substitution.Although every regular language is also context-free, the previous theorem is not implied by the current one, since the former yields a shaper result for regular languages. A simple example is the conversion ''f''_uc(.) to uppercase, which may be defined e.g. as follows: For the extension of ''f''_uc to strings, we have e.g. * ''f''_uc(‹Straße›) = ⋅ ⋅ ⋅ ⋅ ⋅ = , * ''f''_uc(‹u2›) = ⋅ = , and * ''f''_uc(‹Go!›) = ⋅ ⋅ = . For the extension of ''f''_uc to languages, we have e.g. * ''f''_uc() = ∪ ∪ = .

String homomorphism

A string homomorphism (often referred to simply as a

homomorphism In algebra, a homomorphism is a morphism, structure-preserving map (mathematics), map between two algebraic structures of the same type (such as two group (mathematics), groups, two ring (mathematics), rings, or two vector spaces). The word ''homo ...

) is a string substitution such that each character is replaced by a single string. That is,

f(a)=s

, where

s

is a string, for each character

a

.Strictly formally, a homomorphism yields a language consisting of just one string, i.e.

f(a) = \

. String homomorphisms are

monoid morphism In abstract algebra, a monoid is a set equipped with an associative binary operation and an identity element. For example, the nonnegative integers with addition form a monoid, the identity element being . Monoids are semigroups with identi ...

s on the

free monoid In abstract algebra, the free monoid on a set is the monoid whose elements are all the finite sequences (or strings) of zero or more elements from that set, with string concatenation as the monoid operation and with the unique sequence of zero ...

, preserving the empty string and the

binary operation In mathematics, a binary operation or dyadic operation is a rule for combining two elements (called operands) to produce another element. More formally, a binary operation is an operation of arity two. More specifically, a binary operation ...

of string concatenation. Given a language

L

, the set

f(L)

is called the homomorphic image of

L

. The inverse homomorphic image of a string

s

is defined as

f^(s) = \

while the inverse homomorphic image of a language

L

is defined as

f^(L) = \

In general,

f(f^(L)) \neq L

, while one does have

f(f^(L)) \subseteq L

and

L \subseteq f^(f(L))

for any language

L

. The class of regular languages is closed under homomorphisms and inverse homomorphisms. Similarly, the context-free languages are closed under homomorphismsThis follows from the above-mentioned closure under arbitrary substitutions. and inverse homomorphisms. A string homomorphism is said to be ε-free (or e-free) if

f(a) \neq \varepsilon

for all ''a'' in the alphabet

\Sigma

. Simple single-letter

substitution cipher In cryptography, a substitution cipher is a method of encrypting in which units of plaintext are replaced with the ciphertext, in a defined manner, with the help of a key; the "units" may be single letters (the most common), pairs of letters, t ...

s are examples of (ε-free) string homomorphisms. An example string homomorphism ''g''_uc can also be obtained by defining similar to the above substitution: ''g''_uc(‹a›) = ‹A›, ..., ''g''_uc(‹0›) = ε, but letting ''g''_uc be undefined on punctuation chars. Examples for inverse homomorphic images are * ''g''_uc⁻¹() = , since ''g''_uc(‹sss›) = ''g''_uc(‹sß›) = ''g''_uc(‹ßs›) = ‹SSS›, and *''g''_uc⁻¹() = , since ''g''_uc(‹a›) = ‹A›, while ‹bb› cannot be reached by ''g''_uc. For the latter language, ''g''_uc(''g''_uc⁻¹()) = ''g''_uc() = ≠ . The homomorphism ''g''_uc is not ε-free, since it maps e.g. ‹0› to ε. A very simple string homomorphism example that maps each character to just a character is the conversion of an

EBCDIC Extended Binary Coded Decimal Interchange Code (EBCDIC; ) is an eight- bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. It descended from the code used with punched cards and the corresponding si ...

-encoded string to

ASCII ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...

String projection

If ''s'' is a string, and

\Sigma

is an alphabet, the string projection of ''s'' is the string that results by removing all characters that are not in

\Sigma

. It is written as

\pi_\Sigma(s)\,

. It is formally defined by removal of characters from the right hand side: :

\pi_\Sigma(s) = \begin 
\varepsilon & \mbox s=\varepsilon \mbox \\
\pi_\Sigma(t) & \mbox s=ta \mbox a \notin \Sigma \\ 
\pi_\Sigma(t)a & \mbox s=ta \mbox a \in \Sigma   
\end

Here

\varepsilon

denotes the

. The projection of a string is essentially the same as a projection in relational algebra. String projection may be promoted to the projection of a language. Given a

formal language In logic, mathematics, computer science, and linguistics, a formal language is a set of strings whose symbols are taken from a set called "alphabet". The alphabet of a formal language consists of symbols that concatenate into strings (also c ...

''L'', its projection is given by :

\pi_\Sigma (L)=\

Right and left quotient

The right quotient of a character ''a'' from a string ''s'' is the truncation of the character ''a'' in the string ''s'', from the right hand side. It is denoted as

s/a

. If the string does not have ''a'' on the right hand side, the result is the empty string. Thus: :

(sa)/ b = \begin 
s & \mbox a=b \\
\varepsilon & \mbox a \ne b
\end

The quotient of the empty string may be taken: :

\varepsilon / a = \varepsilon

Similarly, given a subset

S\subset M

of a monoid

M

, one may define the quotient subset as :

S/a=\

Left quotients may be defined similarly, with operations taking place on the left of a string. Hopcroft and Ullman (1979) define the quotient ''L''₁/''L''₂ of the languages ''L''₁ and ''L''₂ over the same alphabet as . This is not a generalization of the above definition, since, for a string ''s'' and distinct characters ''a'', ''b'', Hopcroft's and Ullman's definition implies yielding , rather than . The left quotient (when defined similar to Hopcroft and Ullman 1979) of a singleton language ''L''₁ and an arbitrary language ''L''₂ is known as Brzozowski derivative; if ''L''₂ is represented by a

regular expression A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...

, so can be the left quotient.

Syntactic relation

The right quotient of a subset

S\subset M

of a monoid

M

defines an

equivalence relation In mathematics, an equivalence relation is a binary relation that is reflexive, symmetric, and transitive. The equipollence relation between line segments in geometry is a common example of an equivalence relation. A simpler example is equ ...

, called the right syntactic relation of ''S''. It is given by :

\sim_S \;\,=\, \

The relation is clearly of finite index (has a finite number of equivalence classes) if and only if the family right quotients is finite; that is, if :

\

is finite. In the case that ''M'' is the monoid of words over some alphabet, ''S'' is then a

regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...

, that is, a language that can be recognized by a finite-state automaton. This is discussed in greater detail in the article on

syntactic monoid In mathematics and computer science, the syntactic monoid M(L) of a formal language L is the minimal monoid that recognizes the language L. By the Myhill–Nerode theorem, the syntactic monoid is unique up to unique isomorphism. Syntactic quot ...

Right cancellation

The right cancellation of a character ''a'' from a string ''s'' is the removal of the first occurrence of the character ''a'' in the string ''s'', starting from the right hand side. It is denoted as

s\div a

and is recursively defined as :

(sa)\div b = \begin 
s & \mbox a=b \\
(s\div b)a & \mbox a \ne b
\end

The empty string is always cancellable: :

\varepsilon \div a = \varepsilon

Clearly, right cancellation and projection commute: :

\pi_\Sigma(s)\div a = \pi_\Sigma(s \div a )

Prefixes

The prefixes of a string is the set of all prefixes to a string, with respect to a given language: :

\operatorname_L(s) = \

where

s\in L

. The prefix closure of a language is :

\operatorname (L) = \bigcup_ \operatorname_L(s) = \left\

Example:

L=\left\\mbox \operatorname(L)=\left\

A language is called prefix closed if

\operatorname (L) = L

. The prefix closure operator is

idempotent Idempotence (, ) is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application. The concept of idempotence arises in a number of pl ...

: :

\operatorname (\operatorname (L)) =\operatorname (L)

The prefix relation is a

binary relation In mathematics, a binary relation associates some elements of one Set (mathematics), set called the ''domain'' with some elements of another set called the ''codomain''. Precisely, a binary relation over sets X and Y is a set of ordered pairs ...

\sqsubseteq

such that

s\sqsubseteq t

if and only if

s \in \operatorname_L(t)

. This relation is a particular example of a prefix order.

Notes

References

* ''(See chapter 3.)'' {{Strings Formal languages Relational algebra Operations