Datalog
   HOME

TheInfoList



OR:

Datalog is a declarative
logic programming Logic programming is a programming paradigm which is largely based on formal logic. Any program written in a logic programming language is a set of sentences in logical form, expressing facts and rules about some problem domain. Major logic pro ...
language. While it is syntactically a subset of
Prolog Prolog is a logic programming language associated with artificial intelligence and computational linguistics. Prolog has its roots in first-order logic, a formal logic, and unlike many other programming languages, Prolog is intended primarily ...
, Datalog generally uses a bottom-up rather than top-down evaluation model. This difference yields significantly different behavior and properties from
Prolog Prolog is a logic programming language associated with artificial intelligence and computational linguistics. Prolog has its roots in first-order logic, a formal logic, and unlike many other programming languages, Prolog is intended primarily ...
. It is often used as a
query language Query languages, data query languages or database query languages (DQL) are computer languages used to make queries in databases and information systems. A well known example is the Structured Query Language (SQL). Types Broadly, query language ...
for deductive databases. In recent years, Datalog has found new application in
data integration Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, which include both commercial (such as when two similar companies ...
,
information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
,
networking Network, networking and networked may refer to: Science and technology * Network theory, the study of graphs as a representation of relations between discrete objects * Network science, an academic field that studies complex networks Mathematic ...
, program analysis,
security" \n\n\nsecurity.txt is a proposed standard for websites' security information that is meant to allow security researchers to easily report security vulnerabilities. The standard prescribes a text file called \"security.txt\" in the well known locat ...
,
cloud computing Cloud computing is the on-demand availability of computer system resources, especially data storage ( cloud storage) and computing power, without direct active management by the user. Large clouds often have functions distributed over mu ...
and
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
. Its origins date back to the beginning of
logic programming Logic programming is a programming paradigm which is largely based on formal logic. Any program written in a logic programming language is a set of sentences in logical form, expressing facts and rules about some problem domain. Major logic pro ...
, but it became prominent as a separate area around 1977 when Hervé Gallaire and Jack Minker organized a workshop on
logic Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the science of deductively valid inferences or of logical truths. It is a formal science investigating how conclusions follow from prem ...
and
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases ...
s.
David Maier David Maier (born 2 June 1953) is the Maseeh Professor of Emerging Technologies in the Department of Computer Science at Portland State University. Born in Eugene, OR, he has also been a computer science faculty member at the State University of ...
is credited with coining the term Datalog.


Features, limitations and extensions

Unlike in Prolog, statements of a Datalog program can be stated in any order. Furthermore, Datalog queries on finite sets are guaranteed to terminate, so Datalog does not have Prolog's cut operator. This makes Datalog a fully
declarative language In computer science, declarative programming is a programming paradigm—a style of building the structure and elements of computer programs—that expresses the logic of a computation without describing its control flow. Many languages that a ...
. In contrast to Prolog, Datalog # disallows complex terms as arguments of
predicate Predicate or predication may refer to: * Predicate (grammar), in linguistics * Predication (philosophy) * several closely related uses in mathematics and formal logic: **Predicate (mathematical logic) **Propositional function **Finitary relation, o ...
s, e.g., p(1, 2) is admissible but not p(f(1), 2), # imposes certain stratification restrictions on the use of negation and
recursion Recursion (adjective: ''recursive'') occurs when a thing is defined in terms of itself or of its type. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathematic ...
, # requires that every variable that appears in the head of a
clause In language, a clause is a constituent that comprises a semantic predicand (expressed or not) and a semantic predicate. A typical clause consists of a subject and a syntactic predicate, the latter typically a verb phrase composed of a verb wit ...
also appear in a nonarithmetic positive (i.e. not negated)
literal Literal may refer to: * Interpretation of legal concepts: ** Strict constructionism ** The plain meaning rule The plain meaning rule, also known as the literal rule, is one of three rules of statutory construction traditionally applied by ...
in the body of the clause, # requires that every variable appearing in a negative literal in the body of a clause also appear in some positive literal in the body of the clause Query evaluation with Datalog is based on
first-order logic First-order logic—also known as predicate logic, quantificational logic, and first-order predicate calculus—is a collection of formal systems used in mathematics, philosophy, linguistics, and computer science. First-order logic uses quantifie ...
, and is thus
sound In physics, sound is a vibration that propagates as an acoustic wave, through a transmission medium such as a gas, liquid or solid. In human physiology and psychology, sound is the ''reception'' of such waves and their ''perception'' by ...
and
complete Complete may refer to: Logic * Completeness (logic) * Completeness of a theory, the property of a theory that every formula in the theory's language or its negation is provable Mathematics * The completeness of the real numbers, which implies t ...
. However, Datalog is not
Turing complete Alan Mathison Turing (; 23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist. Turing was highly influential in the development of theoretical ...
, and is thus used as a
domain-specific language A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging ...
that can take advantage of efficient algorithms developed for query resolution. Indeed, various methods have been proposed to efficiently perform queries, e.g., the Magic Sets algorithm, tabled logic programming or SLG resolution. Some widely used database systems include ideas and algorithms developed for Datalog. For example, the SQL:1999 standard includes recursive queries, and the Magic Sets algorithm (initially developed for the faster evaluation of Datalog queries) is implemented in IBM's DB2. Moreover, Datalog engines are behind specialised
database system In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spa ...
s such as Intellidimension's database for the semantic web. Several extensions have been made to Datalog, e.g., to support aggregate functions, to allow
object-oriented programming Object-oriented programming (OOP) is a programming paradigm based on the concept of "objects", which can contain data and code. The data is in the form of fields (often known as attributes or ''properties''), and the code is in the form of ...
, or to allow
disjunction In logic, disjunction is a logical connective typically notated as \lor and read aloud as "or". For instance, the English language sentence "it is raining or it is snowing" can be represented in logic using the disjunctive formula R \lor ...
s as heads of
clause In language, a clause is a constituent that comprises a semantic predicand (expressed or not) and a semantic predicate. A typical clause consists of a subject and a syntactic predicate, the latter typically a verb phrase composed of a verb wit ...
s. These extensions have significant impacts on the definition of Datalog's semantics and on the implementation of a corresponding Datalog interpreter.


Fragments

Datalog programs may or may not use
negation In logic, negation, also called the logical complement, is an operation that takes a proposition P to another proposition "not P", written \neg P, \mathord P or \overline. It is interpreted intuitively as being true when P is false, and false ...
in rule bodies: Datalog programs with negation are often required to use it as stratified negation to ensure that the semantics are well-defined. Datalog programs may or may not use
inequalities Inequality may refer to: Economics * Attention inequality, unequal distribution of attention across users, groups of people, issues in etc. in attention economy * Economic inequality, difference in economic well-being between population groups * ...
between variables in rule bodies. Some syntactic fragments of Datalog have been defined, such as the following (from most restricted to least restricted): * linear Datalog, where rule bodies must consist of a single atom * guarded Datalog, where for every rule, all the variables that occur in the rule bodies must occur together in at least one atom, called a ''guard atom'' * frontier-guarded Datalog, where for every rule, all the variables that are shared between the rule body and the rule head (called the ''frontier variables'') must all occur together in a guard atom Another restriction concerns the use of recursion: nonrecursive Datalog is defined by disallowing recursion in the definition of Datalog programs. This fragment of Datalog is as expressive as
unions of conjunctive queries In database theory, a conjunctive query is a restricted form of first-order queries using the logical conjunction operator. Many first-order queries can be written as conjunctive queries. In particular, a large part of queries issued on relationa ...
, but writing queries as nonrecursive Datalog can be more concise.


Expressiveness

The boundedness problem for Datalog asks, given a Datalog program, whether it is bounded, i.e., the maximal recursion depth reached when evaluating the program on an input database can be bounded by some constant. In other words, this question asks whether the Datalog program could be rewritten as a nonrecursive Datalog program. Solving the boundedness problem on arbitrary Datalog programs is undecidable, but it can be made decidable by restricting to some fragments of Datalog.


Example

These two lines define two ''facts'', i.e. things that always hold: parent(xerces, brooke). parent(brooke, damocles). This is what they mean: ''xerces is a parent of brooke'' and ''brooke is a parent of damocles''. The names are written in lowercase because strings beginning with an uppercase letter stand for variables. These two lines define ''rules'', which define how new facts can be inferred from known facts. ancestor(X, Y) :- parent(X, Y). ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y). meaning: * ''X is an ancestor of Y if X is a parent of Y.'' * ''X is an ancestor of Y if X is a parent of some Z, and Z is an ancestor of Y.'' This line is a query: ?- ancestor(xerces, X). It asks the following: ''Who are all the X that xerces is an ancestor of?'' It would return ''brooke'' and ''damocles'' when posed against a Datalog system containing the facts and rules described above. More about rules: a rule has a '':-'' symbol in the middle: the part to the left of this symbol is the ''head'' of the rule, the part to the right is the ''body''. A rule reads like this: '' is known to be true if it is known that is true''. Uppercase letters in rules stand for variables: in the example, we don't know who ''X'' or ''Y'' are, but some ''X'' is the ancestor of some ''Y'' if that ''X'' is the parent of that ''Y''. The ordering of the clauses is irrelevant in Datalog, in contrast to Prolog, which depends on the ordering of clauses for computing the result of the query call. Datalog distinguishes between ''extensional predicate symbols'' (defined by facts) and ''intensional predicate symbols'' (defined by rules). In the example above ancestor is an intensional predicate symbol, and parent is extensional. Predicates may also be defined by facts and rules and therefore neither be purely extensional nor intensional, but any Datalog program can be rewritten into an equivalent program without such predicate symbols with duplicate roles.


Syntax

A Datalog program consists of a list of facts and rules ( Horn clauses). If ''constant'' and ''variable'' are two
countable In mathematics, a set is countable if either it is finite or it can be made in one to one correspondence with the set of natural numbers. Equivalently, a set is ''countable'' if there exists an injective function from it into the natural numbers ...
sets of constants and variables respectively and ''relation'' is a countable set of predicate symbols, then the following grammar expresses the structure of a Datalog program: ::= , , ɛ ::= "(" ")." ::= ":-" "." ::= "(" ")" ::= , "," ::= , ::= , "," ::= , "," Atoms are also referred to as ''literals'' in the literature.


Semantics

There are three widely-used approaches to the semantics of Datalog programs: model-theoretic, fixed-point, and proof-theoretic, developed and shown to be equivalent for Horn clause logic programs more generally by Maarten van Emden and
Robert Kowalski Robert Anthony Kowalski (born 15 May 1941) is an American-British logician and computer scientist, whose research is concerned with developing both human-oriented models of computing and computational models of human thinking. He has spent mo ...


Model theory

A fact or rule is called ''ground'' if all of its subterms are constants. A ground rule ''R''1 is a ''ground instance'' of another rule ''R''2 if ''R''1 is the result of a
substitution Substitution may refer to: Arts and media *Chord substitution, in music, swapping one chord for a related one within a chord progression *Substitution (poetry), a variation in poetic scansion * "Substitution" (song), a 2009 song by Silversun Pic ...
of constants for all the variables in ''R''2. The ''
Herbrand base In first-order logic, a Herbrand structure ''S'' is a structure over a vocabulary σ that is defined solely by the syntactical properties of σ. The idea is to take the symbols of terms as their values, e.g. the denotation of a constant symbol ' ...
'' of a Datalog program is the set of all ground atoms that can be made with the constants appearing in the program. An ''interpretation'' (also known as a ''database instance'') is a subset of the Herbrand base. A ground atom is true in an interpretation if it is an element of . A rule is ''true in an interpretation '' if for each ground instance of that rule, if all the clauses in the body are true in , then the head of the rule is also true in . A ''model'' of a Datalog program ''P'' is an interpretation of ''P'' which contains all the ground facts of ''P'', and makes all of the rules of ''P'' true in . Model-theoretic semantics state that the meaning of a Datalog program is its minimal model (equivalently, the intersection of all its models).


Fixed-point semantics

Let be the set of interpretations of a Datalog program ''P'' (i.e. , where ''H'' is the Herbrand base of ''P'' and P is the
powerset In mathematics, the power set (or powerset) of a set is the set of all subsets of , including the empty set and itself. In axiomatic set theory (as developed, for example, in the ZFC axioms), the existence of the power set of any set is post ...
operator). Define a map from to as follows: For each ground instance of each rule in ''P'', if every clause in the body is in the input interpretation, then add the head of the ground instance to the output interpretation. Then the fixed point of this map is the minimal model of the program. The fixpoint semantics suggest an algorithm for computing the minimal model: Start with the set of ground facts in the program, then repeatedly add consequences of the rules until a fixpoint is reached.


Evaluation

There are many different ways to evaluate a Datalog program, with different performance characteristics.


Bottom-up evaluation strategies

Bottom-up evaluation strategies start with the facts in the program and repeatedly apply the rules until the either some goal or query is established, or until the complete minimal model of the program is produced.


Naïve evaluation

''Naïve evaluation'' mirrors the fixpoint semantics for Datalog programs. Naïve evaluation uses a set of "known facts", which is initialized to the facts in the program. It proceeds by repeatedly enumerating all ground instances of each rule in the program. If each atom in the body of the ground instance is in the set of known facts, then the head atom is added to the set of known facts. This process is repeated until a fixed point is reached, and no more facts may be deduced. Naïve evaluation produces the entire minimal model of the program.


Systems implementing Datalog

Here is a short list of systems that are either based on Datalog or provide a Datalog interpreter:


Free software/open source


Non-free software

* Datomic is a distributed database designed to enable scalable, flexible and intelligent applications, running on new cloud architectures. It uses Datalog as the query language. * FoundationDB provides a free-of-charge database binding for pyDatalog, with a tutorial on its use. * Leapsight Semantic Dataspace (LSD) is a distributed deductive database that offers high availability, fault tolerance, operational simplicity, and scalability. LSD uses Leaplog (a Datalog implementation) for querying and reasoning and was create by Leapsight. * LogicBlox, a commercial implementation of Datalog used for web-based retail planning and insurance applications. * Profium Sense is a native RDF compliant
graph database A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the '' graph'' (or ''edge'' or ''relationship''). The graph rel ...
written in Java. It provides Datalog evaluation support of user defined rules. *
.QL .QL (pronounced "dot-cue-el") is an object-oriented query language used to retrieve data from relational database management systems. It is reminiscent of the standard query language SQL and the object-oriented programming language Java. .QL is ...
, a commercial object-oriented variant of Datalog created by Semmle for analyzing source code to detect security vulnerabilities. * SecPAL a security policy language developed by
Microsoft Research Microsoft Research (MSR) is the research subsidiary of Microsoft. It was created in 1991 by Richard Rashid, Bill Gates and Nathan Myhrvold with the intent to advance state-of-the-art computing and solve difficult world problems through technolog ...
. * Stardog is a
graph database A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the '' graph'' (or ''edge'' or ''relationship''). The graph rel ...
, implemented in
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...
. It provides support for RDF and all OWL 2 profiles providing extensive reasoning capabilities, including datalog evaluation. * StrixDB: a commercial RDF graph store,
SPARQL SPARQL (pronounced " sparkle" , a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language—that is, a semantic query language for databases—able to retrieve and manipulate data stored in Resource Description ...
compliant with Lua API and Datalog inference capabilities. Could be used as httpd (
Apache HTTP Server The Apache HTTP Server ( ) is a free and open-source cross-platform web server software, released under the terms of Apache License 2.0. Apache is developed and maintained by an open community of developers under the auspices of the Apache S ...
) module or standalone (although beta versions are under the Perl Artistic License 2.0).


See also

*
Answer set programming Answer set programming (ASP) is a form of declarative programming oriented towards difficult (primarily NP-hard) search problems. It is based on the stable model (answer set) semantics of logic programming. In ASP, search problems are reduced ...
* Conjunctive query *
DLV The DLV (DataLog with Disjunction, where the logical disjunction symbol V is used) system is a disjunctive logic programming system, implementing the stable model semantics under the answer set programming paradigm. It extends the Datalog l ...
*
Flix Flix () is a town in the '' comarca'' of Ribera d'Ebre, Catalonia, Spain. Situated on a promontory by the Ebro river, the town occupied an important strategic position. Situated on the Madrid– Barcelona railway line, it expanded in the ea ...
* SWRL *
Tuple-generating dependency In relational database theory, a tuple-generating dependency (TGD) is a certain kind of constraint on a relational database. It is a subclass of the class of embedded dependencies (EDs). An algorithm known as the chase takes as input an instanc ...
(TGD), a language for integrity constraints on
relational database A relational database is a (most commonly digital) database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relati ...
s with a similar syntax to Datalog


References


Bibliography

*


Further reading

* {{Authority control Query languages Logic programming languages Declarative programming languages