Datalog
   HOME

TheInfoList



OR:

Datalog is a declarative
logic programming Logic programming is a programming paradigm which is largely based on formal logic. Any program written in a logic programming language is a set of sentences in logical form, expressing facts and rules about some problem domain. Major logic prog ...
language. While it is syntactically a subset of
Prolog Prolog is a logic programming language associated with artificial intelligence and computational linguistics. Prolog has its roots in first-order logic, a formal logic, and unlike many other programming languages, Prolog is intended primarily ...
, Datalog generally uses a bottom-up rather than top-down evaluation model. This difference yields significantly different behavior and properties from
Prolog Prolog is a logic programming language associated with artificial intelligence and computational linguistics. Prolog has its roots in first-order logic, a formal logic, and unlike many other programming languages, Prolog is intended primarily ...
. It is often used as a
query language Query languages, data query languages or database query languages (DQL) are computer languages used to make queries in databases and information systems. A well known example is the Structured Query Language (SQL). Types Broadly, query language ...
for
deductive database A deductive database is a database system that can make deductions (i.e. conclude additional facts) based on rules and facts stored in the (deductive) database. Datalog is the language typically used to specify facts, rules and queries in deductiv ...
s. In recent years, Datalog has found new application in
data integration Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, which include both commercial (such as when two similar companies ...
,
information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
, networking,
program analysis In computer science, program analysis is the process of automatically analyzing the behavior of computer programs regarding a property such as correctness, robustness, safety and liveness. Program analysis focuses on two major areas: program op ...
,
security Security is protection from, or resilience against, potential harm (or other unwanted coercive change) caused by others, by restraining the freedom of others to act. Beneficiaries (technically referents) of security may be of persons and social ...
,
cloud computing Cloud computing is the on-demand availability of computer system resources, especially data storage ( cloud storage) and computing power, without direct active management by the user. Large clouds often have functions distributed over mul ...
and
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
. Its origins date back to the beginning of
logic programming Logic programming is a programming paradigm which is largely based on formal logic. Any program written in a logic programming language is a set of sentences in logical form, expressing facts and rules about some problem domain. Major logic prog ...
, but it became prominent as a separate area around 1977 when Hervé Gallaire and
Jack Minker Jack Minker (4 July 1927 – 9 April 2021) was a leading authority in artificial intelligence, deductive databases, logic programming and non-monotonic reasoning. He was also an internationally recognized leader in the field of human rights of c ...
organized a workshop on
logic Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the science of deductively valid inferences or of logical truths. It is a formal science investigating how conclusions follow from premises ...
and
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases sp ...
s.
David Maier David Maier (born 2 June 1953) is the Maseeh Professor of Emerging Technologies in the Department of Computer Science at Portland State University. Born in Eugene, OR, he has also been a computer science faculty member at the State University of ...
is credited with coining the term Datalog.


Features, limitations and extensions

Unlike in Prolog, statements of a Datalog program can be stated in any order. Furthermore, Datalog queries on
finite sets In mathematics, particularly set theory, a finite set is a set that has a finite number of elements. Informally, a finite set is a set which one could in principle count and finish counting. For example, :\ is a finite set with five elements. Th ...
are guaranteed to terminate, so Datalog does not have Prolog's
cut Cut may refer to: Common uses * The act of cutting, the separation of an object into two through acutely-directed force ** A type of wound ** Cut (archaeology), a hole dug in the past ** Cut (clothing), the style or shape of a garment ** Cut (ea ...
operator. This makes Datalog a fully
declarative language In computer science, declarative programming is a programming paradigm—a style of building the structure and elements of computer programs—that expresses the logic of a computation without describing its control flow. Many languages that a ...
. In contrast to Prolog, Datalog # disallows complex terms as arguments of
predicate Predicate or predication may refer to: * Predicate (grammar), in linguistics * Predication (philosophy) * several closely related uses in mathematics and formal logic: **Predicate (mathematical logic) **Propositional function **Finitary relation, o ...
s, e.g., p(1, 2) is admissible but not p(f(1), 2), # imposes certain
stratification Stratification may refer to: Mathematics * Stratification (mathematics), any consistent assignment of numbers to predicate symbols * Data stratification in statistics Earth sciences * Stable and unstable stratification * Stratification, or st ...
restrictions on the use of negation and
recursion Recursion (adjective: ''recursive'') occurs when a thing is defined in terms of itself or of its type. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathematics ...
, # requires that every variable that appears in the head of a
clause In language, a clause is a constituent that comprises a semantic predicand (expressed or not) and a semantic predicate. A typical clause consists of a subject and a syntactic predicate, the latter typically a verb phrase composed of a verb with ...
also appear in a nonarithmetic positive (i.e. not negated) literal in the body of the clause, # requires that every variable appearing in a negative literal in the body of a clause also appear in some positive literal in the body of the clause Query evaluation with Datalog is based on
first-order logic First-order logic—also known as predicate logic, quantificational logic, and first-order predicate calculus—is a collection of formal systems used in mathematics, philosophy, linguistics, and computer science. First-order logic uses quantifie ...
, and is thus
sound In physics, sound is a vibration that propagates as an acoustic wave, through a transmission medium such as a gas, liquid or solid. In human physiology and psychology, sound is the ''reception'' of such waves and their ''perception'' by the ...
and
complete Complete may refer to: Logic * Completeness (logic) * Completeness of a theory, the property of a theory that every formula in the theory's language or its negation is provable Mathematics * The completeness of the real numbers, which implies t ...
. However, Datalog is not
Turing complete Alan Mathison Turing (; 23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist. Turing was highly influential in the development of theoretical com ...
, and is thus used as a
domain-specific language A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging f ...
that can take advantage of efficient algorithms developed for query resolution. Indeed, various methods have been proposed to efficiently perform queries, e.g., the Magic Sets algorithm, tabled logic programming or SLG resolution. Some widely used database systems include ideas and algorithms developed for Datalog. For example, the SQL:1999 standard includes recursive queries, and the Magic Sets algorithm (initially developed for the faster evaluation of Datalog queries) is implemented in IBM's DB2. Moreover, Datalog engines are behind specialised
database system In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases span ...
s such as Intellidimension's database for the semantic web. Several extensions have been made to Datalog, e.g., to support
aggregate function In database management, an aggregate function or aggregation function is a function where the values of multiple rows are grouped together to form a single summary value. Common aggregate functions include: * Average (i.e., arithmetic mean) * C ...
s, to allow
object-oriented programming Object-oriented programming (OOP) is a programming paradigm based on the concept of "objects", which can contain data and code. The data is in the form of fields (often known as attributes or ''properties''), and the code is in the form of pr ...
, or to allow
disjunction In logic, disjunction is a logical connective typically notated as \lor and read aloud as "or". For instance, the English language sentence "it is raining or it is snowing" can be represented in logic using the disjunctive formula R \lor S ...
s as heads of
clause In language, a clause is a constituent that comprises a semantic predicand (expressed or not) and a semantic predicate. A typical clause consists of a subject and a syntactic predicate, the latter typically a verb phrase composed of a verb with ...
s. These extensions have significant impacts on the definition of Datalog's semantics and on the implementation of a corresponding Datalog interpreter.


Fragments

Datalog programs may or may not use
negation In logic, negation, also called the logical complement, is an operation that takes a proposition P to another proposition "not P", written \neg P, \mathord P or \overline. It is interpreted intuitively as being true when P is false, and false ...
in rule bodies: Datalog programs with negation are often required to use it as stratified negation to ensure that the semantics are well-defined. Datalog programs may or may not use
inequalities Inequality may refer to: Economics * Attention inequality, unequal distribution of attention across users, groups of people, issues in etc. in attention economy * Economic inequality, difference in economic well-being between population groups * ...
between variables in rule bodies. Some syntactic fragments of Datalog have been defined, such as the following (from most restricted to least restricted): * linear Datalog, where rule bodies must consist of a single atom * guarded Datalog, where for every rule, all the variables that occur in the rule bodies must occur together in at least one atom, called a ''guard atom'' * frontier-guarded Datalog, where for every rule, all the variables that are shared between the rule body and the rule head (called the ''frontier variables'') must all occur together in a guard atom Another restriction concerns the use of recursion: nonrecursive Datalog is defined by disallowing recursion in the definition of Datalog programs. This fragment of Datalog is as expressive as unions of conjunctive queries, but writing queries as nonrecursive Datalog can be more concise.


Expressiveness

The boundedness problem for Datalog asks, given a Datalog program, whether it is bounded, i.e., the maximal recursion depth reached when evaluating the program on an input database can be bounded by some constant. In other words, this question asks whether the Datalog program could be rewritten as a nonrecursive Datalog program. Solving the boundedness problem on arbitrary Datalog programs is undecidable, but it can be made decidable by restricting to some fragments of Datalog.


Example

These two lines define two ''facts'', i.e. things that always hold: parent(xerces, brooke). parent(brooke, damocles). This is what they mean: ''xerces is a parent of brooke'' and ''brooke is a parent of damocles''. The names are written in lowercase because strings beginning with an uppercase letter stand for variables. These two lines define ''rules'', which define how new facts can be inferred from known facts. ancestor(X, Y) :- parent(X, Y). ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y). meaning: * ''X is an ancestor of Y if X is a parent of Y.'' * ''X is an ancestor of Y if X is a parent of some Z, and Z is an ancestor of Y.'' This line is a query: ?- ancestor(xerces, X). It asks the following: ''Who are all the X that xerces is an ancestor of?'' It would return ''brooke'' and ''damocles'' when posed against a Datalog system containing the facts and rules described above. More about rules: a rule has a '':-'' symbol in the middle: the part to the left of this symbol is the ''head'' of the rule, the part to the right is the ''body''. A rule reads like this: '' is known to be true if it is known that is true''. Uppercase letters in rules stand for variables: in the example, we don't know who ''X'' or ''Y'' are, but some ''X'' is the ancestor of some ''Y'' if that ''X'' is the parent of that ''Y''. The ordering of the clauses is irrelevant in Datalog, in contrast to Prolog, which depends on the ordering of clauses for computing the result of the query call. Datalog distinguishes between ''extensional predicate symbols'' (defined by facts) and ''intensional predicate symbols'' (defined by rules). In the example above ancestor is an intensional predicate symbol, and parent is extensional. Predicates may also be defined by facts and rules and therefore neither be purely extensional nor intensional, but any Datalog program can be rewritten into an equivalent program without such predicate symbols with duplicate roles.


Syntax

A Datalog program consists of a list of facts and rules (
Horn clause In mathematical logic and logic programming, a Horn clause is a logical formula of a particular rule-like form which gives it useful properties for use in logic programming, formal specification, and model theory. Horn clauses are named for the log ...
s). If ''constant'' and ''variable'' are two
countable In mathematics, a set is countable if either it is finite or it can be made in one to one correspondence with the set of natural numbers. Equivalently, a set is ''countable'' if there exists an injective function from it into the natural numbers; ...
sets of constants and variables respectively and ''relation'' is a countable set of predicate symbols, then the following grammar expresses the structure of a Datalog program: ::= , , ɛ ::= "(" ")." ::= ":-" "." ::= "(" ")" ::= , "," ::= , ::= , "," ::= , "," Atoms are also referred to as ''literals'' in the literature.


Semantics

There are three widely-used approaches to the semantics of Datalog programs:
model-theoretic In mathematical logic, model theory is the study of the relationship between formal theories (a collection of sentences in a formal language expressing statements about a mathematical structure), and their models (those structures in which the st ...
, fixed-point, and
proof-theoretic Proof theory is a major branchAccording to Wang (1981), pp. 3–4, proof theory is one of four domains mathematical logic, together with model theory, axiomatic set theory, and recursion theory. Barwise (1978) consists of four corresponding parts, ...
, developed and shown to be equivalent for Horn clause logic programs more generally by Maarten van Emden and
Robert Kowalski Robert Anthony Kowalski (born 15 May 1941) is an American-British logician and computer scientist, whose research is concerned with developing both human-oriented models of computing and computational models of human thinking. He has spent mo ...


Model theory

A fact or rule is called ''ground'' if all of its subterms are constants. A ground rule ''R''1 is a ''ground instance'' of another rule ''R''2 if ''R''1 is the result of a substitution of constants for all the variables in ''R''2. The ''
Herbrand base In first-order logic, a Herbrand structure ''S'' is a structure over a vocabulary σ that is defined solely by the syntactical properties of σ. The idea is to take the symbols of terms as their values, e.g. the denotation of a constant symbol ' ...
'' of a Datalog program is the set of all ground atoms that can be made with the constants appearing in the program. An ''interpretation'' (also known as a ''database instance'') is a subset of the Herbrand base. A ground atom is true in an interpretation if it is an element of . A rule is ''true in an interpretation '' if for each ground instance of that rule, if all the clauses in the body are true in , then the head of the rule is also true in . A ''model'' of a Datalog program ''P'' is an interpretation of ''P'' which contains all the ground facts of ''P'', and makes all of the rules of ''P'' true in .
Model-theoretic In mathematical logic, model theory is the study of the relationship between formal theories (a collection of sentences in a formal language expressing statements about a mathematical structure), and their models (those structures in which the st ...
semantics state that the meaning of a Datalog program is its minimal model (equivalently, the intersection of all its models).


Fixed-point semantics

Let be the set of interpretations of a Datalog program ''P'' (i.e. , where ''H'' is the Herbrand base of ''P'' and P is the
powerset In mathematics, the power set (or powerset) of a set is the set of all subsets of , including the empty set and itself. In axiomatic set theory (as developed, for example, in the ZFC axioms), the existence of the power set of any set is postu ...
operator). Define a map from to as follows: For each ground instance of each rule in ''P'', if every clause in the body is in the input interpretation, then add the head of the ground instance to the output interpretation. Then the fixed point of this map is the minimal model of the program. The
fixpoint A fixed point (sometimes shortened to fixpoint, also known as an invariant point) is a value that does not change under a given transformation. Specifically, in mathematics, a fixed point of a function is an element that is mapped to itself by the ...
semantics suggest an algorithm for computing the minimal model: Start with the set of ground facts in the program, then repeatedly add consequences of the rules until a fixpoint is reached.


Evaluation

There are many different ways to evaluate a Datalog program, with different performance characteristics.


Bottom-up evaluation strategies

Bottom-up evaluation strategies start with the facts in the program and repeatedly apply the rules until the either some goal or query is established, or until the complete minimal model of the program is produced.


Naïve evaluation

''Naïve evaluation'' mirrors the fixpoint semantics for Datalog programs. Naïve evaluation uses a set of "known facts", which is initialized to the facts in the program. It proceeds by repeatedly enumerating all ground instances of each rule in the program. If each atom in the body of the ground instance is in the set of known facts, then the head atom is added to the set of known facts. This process is repeated until a fixed point is reached, and no more facts may be deduced. Naïve evaluation produces the entire minimal model of the program.


Systems implementing Datalog

Here is a short list of systems that are either based on Datalog or provide a Datalog interpreter:


Free software/open source


Non-free software

*
Datomic Datomic is a distributed database and implementation of Datalog. It has ACID transactions, joins, and a logical query language, Datalog. A distinguishing feature of Datomic is that time is a basic feature of data entities. Architecture It has bee ...
is a distributed database designed to enable scalable, flexible and intelligent applications, running on new cloud architectures. It uses Datalog as the query language. *
FoundationDB FoundationDB is a free and open-source multi-model distributed NoSQL database developed by Apple Inc. with a shared-nothing architecture. The product was designed around a "core" database, with additional features supplied in "layers." The cor ...
provides a free-of-charge database binding for pyDatalog, with a tutorial on its use. * Leapsight Semantic Dataspace (LSD) is a distributed deductive database that offers high availability, fault tolerance, operational simplicity, and scalability. LSD uses Leaplog (a Datalog implementation) for querying and reasoning and was create by Leapsight. *
LogicBlox The LogicBlox system is a Commercial software, commercial, Declarative programming, declarative, Incremental computing, incremental logic programming language and deductive database inspired by Datalog. The LogiQL programming language extends Da ...
, a commercial implementation of Datalog used for web-based retail planning and insurance applications. * Profium Sense is a native RDF compliant
graph database A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the ''graph'' (or ''edge'' or ''relationship''). The graph relat ...
written in Java. It provides Datalog evaluation support of user defined rules. *
.QL .QL (pronounced "dot-cue-el") is an object-oriented query language used to retrieve data from relational database management systems. It is reminiscent of the standard query language SQL and the object-oriented programming language Java. .QL is ...
, a commercial object-oriented variant of Datalog created by Semmle for analyzing source code to detect security vulnerabilities. *
SecPAL SecPAL is a declarative, logic-based, security policy language that has been developed to support the complex access control requirements of large scale distributed computing environments. Common access control requirements Here is a partial-list ...
a security policy language developed by
Microsoft Research Microsoft Research (MSR) is the research subsidiary of Microsoft. It was created in 1991 by Richard Rashid, Bill Gates and Nathan Myhrvold with the intent to advance state-of-the-art computing and solve difficult world problems through technologi ...
. * Stardog is a
graph database A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the ''graph'' (or ''edge'' or ''relationship''). The graph relat ...
, implemented in
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
. It provides support for RDF and all
OWL 2 The Web Ontology Language (OWL) is a family of Knowledge representation and reasoning, knowledge representation languages for authoring Ontology (information science), ontologies. Ontologies are a formal way to describe taxonomies and classificat ...
profiles providing extensive reasoning capabilities, including datalog evaluation. * StrixDB: a commercial RDF graph store,
SPARQL SPARQL (pronounced "sparkle" , a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language—that is, a semantic query language for databases—able to retrieve and manipulate data stored in Resource Description F ...
compliant with
Lua Lua or LUA may refer to: Science and technology * Lua (programming language) * Latvia University of Agriculture * Last universal ancestor, in evolution Ethnicity and language * Lua people, of Laos * Lawa people, of Thailand sometimes referred t ...
API and Datalog inference capabilities. Could be used as httpd (
Apache HTTP Server The Apache HTTP Server ( ) is a free and open-source cross-platform web server software, released under the terms of Apache License 2.0. Apache is developed and maintained by an open community of developers under the auspices of the Apache So ...
) module or standalone (although beta versions are under the Perl Artistic License 2.0).


See also

* Answer set programming *
Conjunctive query In database theory, a conjunctive query is a restricted form of first-order queries using the logical conjunction operator. Many first-order queries can be written as conjunctive queries. In particular, a large part of queries issued on relational ...
*
DLV The DLV (DataLog with Disjunction, where the logical disjunction symbol V is used) system is a disjunctive logic programming system, implementing the stable model semantics under the answer set programming paradigm. It extends the Datalog ...
*
Flix Flix () is a town in the ''comarca'' of Ribera d'Ebre, Catalonia, Spain. Situated on a promontory by the Ebro river, the town occupied an important strategic position. Situated on the Madrid–Barcelona railway line, it expanded in the earl ...
* SWRL *
Tuple-generating dependency In relational database theory, a tuple-generating dependency (TGD) is a certain kind of constraint on a relational database. It is a subclass of the class of embedded dependencies (EDs). An algorithm known as the chase takes as input an instance ...
(TGD), a language for
integrity constraint Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire life-cycle and is a critical aspect to the design, implementation, and usage of any system that stores, processes, or retrieves data. The ter ...
s on
relational database A relational database is a (most commonly digital) database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relatio ...
s with a similar syntax to Datalog


References


Bibliography

*


Further reading

* {{Authority control Query languages Logic programming languages Declarative programming languages