Brzozowski derivative
   HOME

TheInfoList



OR:

In
theoretical computer science Theoretical computer science (TCS) is a subset of general computer science and mathematics that focuses on mathematical aspects of computer science such as the theory of computation, lambda calculus, and type theory. It is difficult to circumsc ...
, in particular in
formal language theory In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of symb ...
, the Brzozowski derivative u^S of a
set Set, The Set, SET or SETS may refer to: Science, technology, and mathematics Mathematics *Set (mathematics), a collection of elements *Category of sets, the category whose objects and morphisms are sets and total functions, respectively Electro ...
S of strings and a string u is the set of all strings obtainable from a string in S by cutting off the
prefix A prefix is an affix which is placed before the Word stem, stem of a word. Adding it to the beginning of one word changes it into another word. For example, when the prefix ''un-'' is added to the word ''happy'', it creates the word ''unhappy'' ...
u, as illustrated in the figure. Formally: u^S = \. It was introduced under various different names since the late 1950s. Today it is named after the computer scientist Janusz Brzozowski who investigated its properties and gave an
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...
to compute the derivative of a generalized
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
.


Definition

Even though originally studied for regular expressions, the definition applies to arbitrary formal languages. Given any
formal language In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of symb ...
S over an alphabet \Sigma and any string u \in \Sigma^*, the derivative of S with respect to u is defined as: :u^S = \ The Brzozowski derivative is a special case of left quotient by a singleton set containing only u: \ u^S = \ \;\backslash\; S. Equivalently, for all u,v \in \Sigma^*: :v \in u^S \;\Leftrightarrow\; uv \in S. From the definition, for all u, v \in \Sigma^*: :(uv)^S = v^(u^S) since for all w \in \Sigma^*, we have The derivative with respect to an arbitrary string reduces to successive derivatives over the symbols of that string, since for all a \in \Sigma, u \in \Sigma^*: \begin (ua)^S &= a^(u^S) \\ \varepsilon^S &= S \end A language S \subseteq \Sigma^* is called ''nullable'' if and only if it contains the empty string \varepsilon. Each language S is uniquely determined by nullability of its derivatives: :w \in S \ \Leftrightarrow\ \varepsilon \in w^S A language can be viewed as a (potentially infinite) boolean-labelled
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
(see also
tree (set theory) In set theory, a tree is a partially ordered set (''T'', <) such that for each ''t'' ∈ ''T'', the set is ...
and
infinite-tree automaton In computer science and mathematical logic, an infinite-tree automaton is a state machine that deals with infinite tree structures. It can be seen as an extension of top-down finite-tree automata to infinite trees or as an extension of infinite ...
). Each possible string w \in \Sigma^* denotes a node in the tree, with label ''true'' when w \in S and ''false'' otherwise. In this interpretation, the derivative with respect to a symbol a corresponds to the subtree obtained by following the edge a from the root. Decomposing a tree into the root and the subtrees a^S corresponds to the following equality, which holds for every language S \subseteq \Sigma^*: :S = (\ \cap S) \cup \bigcup_ a(a^S).


Derivatives of generalized regular expressions

When a language is given by a regular expression, the concept of derivatives leads to an algorithm for deciding whether a given word belongs to the regular expression. Given a finite
alphabet An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syll ...
''A'' of symbols, a generalized regular expression ''R'' denotes a possibly infinite set of finite-length strings over the alphabet ''A'', called the language of ''R'', denoted ''L''(''R''). A generalized regular expression can be one of the following (where ''a'' is a symbol of the alphabet ''A'', and ''R'' and ''S'' are generalized regular expressions): * "∅" denotes the empty set: ''L''(∅) = , * "ε" denotes the singleton set containing the empty string: ''L''(ε) = , * "''a''" denotes the singleton set containing the single-symbol string ''a'': ''L''(''a'') = , * "''R''∨''S''" denotes the union of ''R'' and ''S'': ''L''(''R''∨''S'') = ''L''(''R'') ∪ ''L''(''S''), * "''R''∧''S''" denotes the intersection of ''R'' and ''S'': ''L''(''R''∧''S'') = ''L''(''R'') ∩ ''L''(''S''), * "¬''R''" denotes the complement of ''R'' (with respect to ''A''*, the set of all strings over ''A''): ''L''(¬''R'') = ''A''* \ ''L''(''R''), * "''RS''" denotes the concatenation of ''R'' and ''S'': ''L''(''RS'') = ''L''(''R'') · ''L''(''S''), * "''R''*" denotes the
Kleene closure In mathematical logic and computer science, the Kleene star (or Kleene operator or Kleene closure) is a unary operation, either on sets of strings or on sets of symbols or characters. In mathematics, it is more commonly known as the free monoid ...
of ''R'': ''L''(''R''*) = ''L''(''R'')*. In an ordinary regular expression, neither ∧ nor ¬ is allowed.


Computation

For any given generalized regular expression ''R'' and any string ''u'', the derivative ''u''−1''R'' is again a generalized regular expression (denoting the language ''u''−1''L''(''R'')). It may be computed recursively as follows. : Using the previous two rules, the derivative with respect to an arbitrary string is explained by the derivative with respect to a single-symbol string ''a''. The latter can be computed as follows: : Here, is an auxiliary function yielding a generalized regular expression that evaluates to the empty string ''ε'' if ''R'' 's language contains ''ε'', and otherwise evaluates to ∅. This function can be computed by the following rules: :


Properties

A string ''u'' is a member of the string set denoted by a generalized regular expression ''R'' if and only if ε is a member of the string set denoted by the derivative ''u''−1''R''. Considering all the derivatives of a fixed generalized regular expression ''R'' results in only finitely many different languages. If their number is denoted by ''d''''R'', all these languages can be obtained as derivatives of ''R'' with respect to strings of length less than ''d''''R''. Furthermore, there is a complete
deterministic finite automaton In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state autom ...
with ''d''''R'' states that recognises the regular language given by ''R'', as stated by the
Myhill–Nerode theorem In the theory of formal languages, the Myhill–Nerode theorem provides a necessary and sufficient condition for a language to be regular. The theorem is named for John Myhill and Anil Nerode, who proved it at the University of Chicago in 1958 . ...
.


Derivatives of context-free languages

Derivatives are also effectively computable for recursively defined equations with regular expression operators, which are equivalent to
context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form :A\ \to\ \alpha with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be empt ...
s. This insight was used to derive
parsing Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lati ...
algorithms for
context-free language In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG). Context-free languages have many applications in programming languages, in particular, most arithmetic expressions are generated by ...
s. Implementation of such algorithms have shown to have cubic
time complexity In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...
, corresponding to the complexity of the
Earley parser In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though (depending on the variant) it may suffer problems with certain nullable grammars. The algorithm, named after its inven ...
on general context-free grammars.


See also

*
Quotient of a formal language In mathematics and computer science, the right quotient (or simply quotient) of a language L_1 with respect to language L_2 is the language consisting of strings ''w'' such that ''wx'' is in L_1 for some string ''x'' in Formally: L_1 / L_2 = \ In ...


References

{{reflist Formal languages