HOME

TheInfoList



OR:

In
theoretical computer science Theoretical computer science (TCS) is a subset of general computer science and mathematics that focuses on mathematical aspects of computer science such as the theory of computation, lambda calculus, and type theory. It is difficult to circumsc ...
, a pattern language is a
formal language In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of symb ...
that can be defined as the set of all particular instances of a string of constants and variables. Pattern Languages were introduced by
Dana Angluin Dana Angluin is a professor emeritus of computer science at Yale University. She is known for foundational work in computational learning theory and distributed computing. Education Angluin received her B.A. (1969) and Ph.D. (1976) at University ...
in the context of
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
.


Definition

Given a finite set Σ of constant symbols and a countable set ''X'' of
variable Variable may refer to: * Variable (computer science), a symbolic name associated with a value and whose associated value may be changed * Variable (mathematics), a symbol that represents a quantity in a mathematical expression, as used in many ...
symbols disjoint from Σ, a pattern is a finite
non-empty In mathematics, the empty set is the unique set having no elements; its size or cardinality (count of elements in a set) is zero. Some axiomatic set theories ensure that the empty set exists by including an axiom of empty set, while in other t ...
string of symbols from Σ∪''X''. The length of a pattern ''p'', denoted by , ''p'', , is just the number of its symbols. The set of all patterns containing exactly ''n'' distinct variables (each of which may occur several times) is denoted by ''P''''n'', the set of all patterns at all by ''P''*. A substitution is a mapping ''f'': ''P''* → ''P''* such thatAngluin's notion of substitution differs from the usual notion of
string substitution In computer science, in the area of formal language theory, frequent use is made of a variety of string functions; however, the notation used is different from that used for computer programming, and some commonly used functions in the theoretical ...
.
* ''f'' is a
homomorphism In algebra, a homomorphism is a structure-preserving map between two algebraic structures of the same type (such as two groups, two rings, or two vector spaces). The word ''homomorphism'' comes from the Ancient Greek language: () meaning "same" ...
with respect to
string concatenation In formal language theory and computer programming, string concatenation is the operation of joining character strings end-to-end. For example, the concatenation of "snow" and "ball" is "snowball". In certain formalisations of concatenat ...
(⋅), formally: ∀''p'',''q''∈''P''*. ''f''(''p''⋅''q'') = ''f''(''p'')⋅''f''(''q''); * ''f'' is non-erasing, formally: ∀''p''∈''P''*. ''f''(''p'') ≠ ε, where ε denotes the
empty string In formal language theory, the empty string, or empty word, is the unique string of length zero. Formal theory Formally, a string is a finite, ordered sequence of characters such as letters, digits or spaces. The empty string is the special cas ...
; and * ''f'' respects constants, formally: ∀''s''∈Σ. ''f''(''s'') = ''s''. If ''p'' = ''f''(''q'') for some patterns ''p'', ''q'' ∈ ''P''* and some substitution ''f'', then ''p'' is said to be less general than ''q'', written ''p''≤''q''; in that case, necessarily , ''p'', ≥ , ''q'', holds. For a pattern ''p'', its language is defined as the set of all less general patterns that are built from constants only, formally: ''L''(''p'') = , where Σ+ denotes the set of all finite non-empty strings of symbols from Σ. For example, using the constants Σ = and the variables ''X'' = , the pattern 0''x''10''xx''1 ∈''P''1 and ''xxy'' ∈''P''2 has length 7 and 3, respectively. An instance of the former pattern is 00''z''100''z''0''z''1 and 01''z''101''z''1''z''1, it is obtained by the substitution that maps ''x'' to 0''z'' and to 1''z'', respectively, and each other symbol to itself. Both 00''z''100''z''0''z''1 and 01''z''101''z''1''z''1 are also instances of ''xxy''. In fact, ''L''(0''x''10''xx''1) is a subset of ''L''(''xxy''). The language of the pattern ''x''0 and ''x''1 is the set of all bit strings which denote an even and odd
binary number A binary number is a number expressed in the base-2 numeral system or binary numeral system, a method of mathematical expression which uses only two symbols: typically "0" (zero) and "1" ( one). The base-2 numeral system is a positional notatio ...
, respectively. The language of ''xx'' is the set of all strings obtainable by concatenating a bit string with itself, e.g. 00, 11, 0101, 1010, 11101110 ∈ ''L''(''xx'').


Properties

The problem of deciding whether ''s'' ∈ ''L''(''p'') for an arbitrary string ''s'' ∈ Σ+ and pattern ''p'' is
NP-complete In computational complexity theory, a problem is NP-complete when: # it is a problem for which the correctness of each solution can be verified quickly (namely, in polynomial time) and a brute-force search algorithm can find a solution by tryi ...
(see picture), and so is hence the problem of deciding ''p'' ≤ ''q'' for arbitrary patterns ''p'', ''q''. The class of pattern languages is not closed under ... * union: e.g. for Σ = as above, ''L''(01)∪''L''(10) is not a pattern language; * complement: Σ+ \ ''L''(0) is not a pattern language; * intersection: ''L''(''x''0''y'')∩''L''(''x''1''y'') is not a pattern language; *
Kleene plus In mathematical logic and computer science, the Kleene star (or Kleene operator or Kleene closure) is a unary operation, either on sets of strings or on sets of symbols or characters. In mathematics, it is more commonly known as the free monoid c ...
: ''L''(0)+ is not a pattern language; * homomorphism: ''f''(''L''(''x'')) = ''L''(0)+ is not a pattern language, assuming ''f''(0) = 0 = ''f''(1); * inverse homomorphism: ''f''−1(111) = is not a pattern language, assuming ''f''(0) = 1 and ''f''(1) = 11. The class of pattern languages is closed under ... * concatenation: ''L''(''p'')⋅''L''(''q'') = ''L''(''p''⋅''q''); * reversal: ''L''(''p'')rev = ''L''(''p''rev). If ''p'', ''q'' ∈ ''P''1 are patterns containing exactly one variable, then ''p'' ≤ ''q'' if and only if ''L''(''p'') ⊆ ''L''(''q''); the same equivalence holds for patterns of equal length. For patterns of different length, the above example ''p'' = 0''x''10''xx''1 and ''q'' = ''xxy'' shows that ''L''(''p'') ⊆ ''L''(''q'') may hold without implying ''p'' ≤ ''q''. However, any two patterns ''p'' and ''q'', of arbitrary lengths, generate the same language if and only if they are equal up to consistent variable renaming. Each pattern ''p'' is a common generalization of all strings in its generated language ''L''(''p''), modulo associativity of (⋅).


Location in the Chomsky hierarchy

In a refined
Chomsky hierarchy In formal language theory, computer science and linguistics, the Chomsky hierarchy (also referred to as the Chomsky–Schützenberger hierarchy) is a containment hierarchy of classes of formal grammars. This hierarchy of grammars was described by ...
, the class of pattern languages is a proper superclass and subclass of the singletoni.e. languages consisting of a single string; they correspond to
straight-line grammar A straight-line grammar (sometimes abbreviated as SLG) is a formal grammar that generates exactly one string.Florian Benz and Timo Kötzing, “An effective heuristic for the smallest grammar problem,” Proceedings of the fifteenth annual conferen ...
s
and the
indexed language Indexed languages are a class of formal languages discovered by Alfred Aho; they are described by indexed grammars and can be recognized by nested stack automata. Indexed languages are a proper subset of context-sensitive languages. They qualify as ...
s, respectively, but incomparable to the language classes in between; due to the latter, the pattern language class is not explicitly shown in the table
below Below may refer to: *Earth *Ground (disambiguation) *Soil *Floor *Bottom (disambiguation) Bottom may refer to: Anatomy and sex * Bottom (BDSM), the partner in a BDSM who takes the passive, receiving, or obedient role, to that of the top or ...
. The class of pattern languages is incomparable with the class of
finite language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
s, with the class of
regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
s, and with the class of
context-free language In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG). Context-free languages have many applications in programming languages, in particular, most arithmetic expressions are generated by ...
s: * the pattern language ''L''(''xx'') is not context-free (hence neither regular nor
finite Finite is the opposite of infinite. It may refer to: * Finite number (disambiguation) * Finite set, a set whose cardinality (number of elements) is some natural number * Finite verb, a verb form that has a subject, usually being inflected or marked ...
) due to the
pumping lemma In the theory of formal languages, the pumping lemma may refer to: *Pumping lemma for regular languages, the fact that all sufficiently long strings in such a language have a substring that can be repeated arbitrarily many times, usually used to pro ...
; * the finite (hence also regular and context-free) language is not a pattern language. Each singleton language is trivially a pattern language, generated by a pattern without variables. Each pattern language can be produced by an
indexed grammar Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of ''flags'', or ''index symbols''. The language produced by an indexed grammar is called an indexed language. Definition Modern definition ...
: For example, using Σ = and ''X'' = , the pattern ''a'' ''x'' ''b'' ''y'' ''c'' ''x'' ''a'' ''y'' ''b'' is generated by a grammar with nonterminal symbols ''N'' = ∪ ''X'', terminal symbols ''T'' = Σ, index symbols ''F'' = , start symbol ''S''''x'', and the following production rules: An example derivation is:   ⇒     ⇒     ⇒     ⇒     ⇒     ⇒     ⇒     ⇒     ⇒     ⇒     ⇒ ... ⇒     ⇒     ⇒ ... ⇒     ⇒     ⇒ ... ⇒     ⇒   In a similar way, an index grammar can be constructed from any pattern.


Learning patterns

Given a sample set ''S'' of strings, a pattern ''p'' is called descriptive of ''S'' if ''S'' ⊆ ''L''(''p''), but not ''S'' ⊆ ''L''(''q'') ⊂ ''L''(''p'') for any other pattern ''q''. Given any sample set ''S'', a descriptive pattern for ''S'' can be computed by * enumerating all patterns (up to variable renaming) not longer than the shortest string in ''S'', * selecting from them the patterns that generate a superset of ''S'', * selecting from them the patterns of maximal length, and * selecting from them a pattern that is minimal with respect to ≤. Based on this algorithm, the class of pattern languages can be identified in the limit from positive examples.; here: Example 1, p.125


Notes


References

{{formal languages and grammars Formal languages Theoretical computer science Machine learning