In
computer science
Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...
, in the area of
formal language theory
In logic, mathematics, computer science, and linguistics, a formal language is a set of string (computer science), strings whose symbols are taken from a set called "#Definition, alphabet".
The alphabet of a formal language consists of symbol ...
, frequent use is made of a variety of
string functions; however, the notation used is different from that used for
computer programming
Computer programming or coding is the composition of sequences of instructions, called computer program, programs, that computers can follow to perform tasks. It involves designing and implementing algorithms, step-by-step specifications of proc ...
, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.
Strings and languages
A string is a finite sequence of characters.
The
empty string
In formal language theory, the empty string, or empty word, is the unique String (computer science), string of length zero.
Formal theory
Formally, a string is a finite, ordered sequence of character (symbol), characters such as letters, digits ...
is denoted by
.
The concatenation of two string
and
is denoted by
, or shorter by
.
Concatenating with the empty string makes no difference:
.
Concatenation of strings is
associative
In mathematics, the associative property is a property of some binary operations that rearranging the parentheses in an expression will not change the result. In propositional logic, associativity is a valid rule of replacement for express ...
:
.
For example,
.
A
language
Language is a structured system of communication that consists of grammar and vocabulary. It is the primary means by which humans convey meaning, both in spoken and signed language, signed forms, and may also be conveyed through writing syste ...
is a finite or infinite set of strings.
Besides the usual set operations like union, intersection etc., concatenation can be applied to languages:
if both
and
are languages, their concatenation
is defined as the set of concatenations of any string from
and any string from
, formally
.
Again, the concatenation dot
is often omitted for brevity.
The language
consisting of just the empty string is to be distinguished from the empty language
.
Concatenating any language with the former doesn't make any change:
,
while concatenating with the latter always yields the empty language:
.
Concatenation of languages is associative:
.
For example, abbreviating
, the set of all three-digit decimal numbers is obtained as
. The set of all decimal numbers of arbitrary length is an example for an infinite language.
Alphabet of a string
The alphabet of a string is the set of all of the characters that occur in a particular string. If ''s'' is a string, its
alphabet
An alphabet is a standard set of letter (alphabet), letters written to represent particular sounds in a spoken language. Specifically, letters largely correspond to phonemes as the smallest sound segments that can distinguish one word from a ...
is denoted by
:
The alphabet of a language
is the set of all characters that occur in any string of
, formally:
.
For example, the set
is the alphabet of the string
,
and the
above is the alphabet of the
above language
as well as of the language of all decimal numbers.
String substitution
Let ''L'' be a
language
Language is a structured system of communication that consists of grammar and vocabulary. It is the primary means by which humans convey meaning, both in spoken and signed language, signed forms, and may also be conveyed through writing syste ...
, and let Σ be its alphabet. A string substitution or simply a substitution is a mapping ''f'' that maps characters in Σ to languages (possibly in a different alphabet). Thus, for example, given a character ''a'' ∈ Σ, one has ''f''(''a'')=''L''
''a'' where ''L''
''a'' ⊆ Δ
* is some language whose alphabet is Δ. This mapping may be extended to strings as
:''f''(ε)=ε
for the
empty string
In formal language theory, the empty string, or empty word, is the unique String (computer science), string of length zero.
Formal theory
Formally, a string is a finite, ordered sequence of character (symbol), characters such as letters, digits ...
ε, and
:''f''(''sa'')=''f''(''s'')''f''(''a'')
for string ''s'' ∈ ''L'' and character ''a'' ∈ Σ. String substitutions may be extended to entire languages as
:
Regular language
In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
s are closed under string substitution. That is, if each character in the alphabet of a regular language is substituted by another regular language, the result is still a regular language.
Similarly,
context-free languages are closed under string substitution.
[Although every regular language is also context-free, the previous theorem is not implied by the current one, since the former yields a shaper result for regular languages.]
A simple example is the conversion ''f''
uc(.) to uppercase, which may be defined e.g. as follows:
For the extension of ''f''
uc to strings, we have e.g.
* ''f''
uc(‹Straße›) = ⋅ ⋅ ⋅ ⋅ ⋅ = ,
* ''f''
uc(‹u2›) = ⋅ = , and
* ''f''
uc(‹Go!›) = ⋅ ⋅ = .
For the extension of ''f''
uc to languages, we have e.g.
* ''f''
uc() = ∪ ∪ = .
String homomorphism
A string homomorphism (often referred to simply as a
homomorphism
In algebra, a homomorphism is a morphism, structure-preserving map (mathematics), map between two algebraic structures of the same type (such as two group (mathematics), groups, two ring (mathematics), rings, or two vector spaces). The word ''homo ...
in
formal language theory
In logic, mathematics, computer science, and linguistics, a formal language is a set of string (computer science), strings whose symbols are taken from a set called "#Definition, alphabet".
The alphabet of a formal language consists of symbol ...
) is a string substitution such that each character is replaced by a single string. That is,
, where
is a string, for each character
.
[Strictly formally, a homomorphism yields a language consisting of just one string, i.e. .]
String homomorphisms are
monoid morphism
In abstract algebra, a monoid is a set equipped with an associative binary operation and an identity element. For example, the nonnegative integers with addition form a monoid, the identity element being .
Monoids are semigroups with identi ...
s on the
free monoid
In abstract algebra, the free monoid on a set is the monoid whose elements are all the finite sequences (or strings) of zero or more elements from that set, with string concatenation as the monoid operation and with the unique sequence of zero ...
, preserving the empty string and the
binary operation
In mathematics, a binary operation or dyadic operation is a rule for combining two elements (called operands) to produce another element. More formally, a binary operation is an operation of arity two.
More specifically, a binary operation ...
of
string concatenation. Given a language
, the set
is called the homomorphic image of
. The inverse homomorphic image of a string
is defined as
while the inverse homomorphic image of a language
is defined as
In general,
, while one does have
and
for any language
.
The class of regular languages is closed under homomorphisms and inverse homomorphisms.
Similarly, the context-free languages are closed under homomorphisms
[This follows from the above-mentioned closure under arbitrary substitutions.] and inverse homomorphisms.
A string homomorphism is said to be ε-free (or e-free) if
for all ''a'' in the alphabet
. Simple single-letter
substitution cipher
In cryptography, a substitution cipher is a method of encrypting in which units of plaintext are replaced with the ciphertext, in a defined manner, with the help of a key; the "units" may be single letters (the most common), pairs of letters, t ...
s are examples of (ε-free) string homomorphisms.
An example string homomorphism ''g''
uc can also be obtained by defining similar to the
above substitution: ''g''
uc(‹a›) = ‹A›, ..., ''g''
uc(‹0›) = ε, but letting ''g''
uc be undefined on punctuation chars.
Examples for inverse homomorphic images are
* ''g''
uc−1() = , since ''g''
uc(‹sss›) = ''g''
uc(‹sß›) = ''g''
uc(‹ßs›) = ‹SSS›, and
*''g''
uc−1() = , since ''g''
uc(‹a›) = ‹A›, while ‹bb› cannot be reached by ''g''
uc.
For the latter language, ''g''
uc(''g''
uc−1()) = ''g''
uc() = ≠ .
The homomorphism ''g''
uc is not ε-free, since it maps e.g. ‹0› to ε.
A very simple string homomorphism example that maps each character to just a character is the conversion of an
EBCDIC
Extended Binary Coded Decimal Interchange Code (EBCDIC; ) is an eight- bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. It descended from the code used with punched cards and the corresponding si ...
-encoded string to
ASCII
ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...
.
String projection
If ''s'' is a string, and
is an alphabet, the string projection of ''s'' is the string that results by removing all characters that are not in
. It is written as
. It is formally defined by removal of characters from the right hand side:
:
Here
denotes the
empty string
In formal language theory, the empty string, or empty word, is the unique String (computer science), string of length zero.
Formal theory
Formally, a string is a finite, ordered sequence of character (symbol), characters such as letters, digits ...
. The projection of a string is essentially the same as a
projection in relational algebra.
String projection may be promoted to the projection of a language. Given a
formal language
In logic, mathematics, computer science, and linguistics, a formal language is a set of strings whose symbols are taken from a set called "alphabet".
The alphabet of a formal language consists of symbols that concatenate into strings (also c ...
''L'', its projection is given by
:
Right and left quotient
The right quotient of a character ''a'' from a string ''s'' is the truncation of the character ''a'' in the string ''s'', from the right hand side. It is denoted as
. If the string does not have ''a'' on the right hand side, the result is the empty string. Thus:
:
The quotient of the empty string may be taken:
:
Similarly, given a subset
of a monoid
, one may define the quotient subset as
:
Left quotients may be defined similarly, with operations taking place on the left of a string.
Hopcroft and Ullman (1979) define the quotient ''L''
1/''L''
2 of the languages ''L''
1 and ''L''
2 over the same alphabet as .
This is not a generalization of the above definition, since, for a string ''s'' and distinct characters ''a'', ''b'', Hopcroft's and Ullman's definition implies yielding , rather than .
The left quotient (when defined similar to Hopcroft and Ullman 1979) of a singleton language ''L''
1 and an arbitrary language ''L''
2 is known as
Brzozowski derivative; if ''L''
2 is represented by a
regular expression
A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
, so can be the left quotient.
Syntactic relation
The right quotient of a subset
of a monoid
defines an
equivalence relation
In mathematics, an equivalence relation is a binary relation that is reflexive, symmetric, and transitive. The equipollence relation between line segments in geometry is a common example of an equivalence relation. A simpler example is equ ...
, called the right
syntactic relation of ''S''. It is given by
:
The relation is clearly of finite index (has a finite number of equivalence classes) if and only if the family right quotients is finite; that is, if
:
is finite. In the case that ''M'' is the monoid of words over some alphabet, ''S'' is then a
regular language
In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
, that is, a language that can be recognized by a
finite-state automaton. This is discussed in greater detail in the article on
syntactic monoid
In mathematics and computer science, the syntactic monoid M(L) of a formal language L is the minimal monoid that recognizes the language L. By the Myhill–Nerode theorem, the syntactic monoid is unique up to unique isomorphism.
Syntactic quot ...
s.
Right cancellation
The right cancellation of a character ''a'' from a string ''s'' is the removal of the first occurrence of the character ''a'' in the string ''s'', starting from the right hand side. It is denoted as
and is recursively defined as
:
The empty string is always cancellable:
:
Clearly, right cancellation and projection
commute:
:
Prefixes
The prefixes of a string is the set of all
prefixes to a string, with respect to a given language:
:
where
.
The prefix closure of a language is
:
Example:
A language is called prefix closed if
.
The prefix closure operator is
idempotent
Idempotence (, ) is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application. The concept of idempotence arises in a number of pl ...
:
:
The prefix relation is a
binary relation
In mathematics, a binary relation associates some elements of one Set (mathematics), set called the ''domain'' with some elements of another set called the ''codomain''. Precisely, a binary relation over sets X and Y is a set of ordered pairs ...
such that
if and only if
. This relation is a particular example of a
prefix order.
See also
*
Comparison of programming languages (string functions)
*
Levi's lemma
*
String (computer science)
In computer programming, a string is traditionally a sequence of character (computing), characters, either as a literal (computer programming), literal constant or as some kind of Variable (computer science), variable. The latter may allow its el ...
— definition and implementation of more basic operations on strings
Notes
References
* ''(See chapter 3.)''
{{Strings
Formal languages
Relational algebra
Operations