The Vadalog system is a ''Knowledge Graph Management System'' (KGMS) that offers a language for performing complex logic reasoning tasks over knowledge graphs. At the same time, Vadalog delivers a platform to support the entire spectrum of data science tasks: data integration, pre-processing, statistical analysis,

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

, algorithmic modeling, probabilistic reasoning and temporal reasoning. Its language is based on an extension of the rule-based language Datalog, ''Warded Datalog^±'', a high-performance language using an aggressive termination control strategy. Vadalog can support the entire spectrum of data science activities and tools. The system can read from and connect to multiple sources, from relational databases, such as

PostgreSQL PostgreSQL (, ), also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. It was originally named POSTGRES, referring to its origins as a successor to the In ...

and

MySQL MySQL () is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A relational database o ...

, to graph databases, such as

Neo4j Neo4j is a graph database management system developed by Neo4j, Inc. Described by its developers as an ACID-compliant transactional database with native graph storage and processing, Neo4j is available in a non-open-source "community edition" ...

, as well as make use of machine learning tools (e.g.,

Weka The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. It is the only extant member of the genus ''Gallirallus''. Four subspecies are recognize ...

and scikit-learn), and a web data extraction tool, OXPath. Additional Python libraries and extensions can also be easily integrated into the system. Vadalog is the result of a joint effort between

University of Oxford , mottoeng = The Lord is my light , established = , endowment = £6.1 billion (including colleges) (2019) , budget = £2.145 billion (2019–20) , chancellor ...

, the Knowledge Graph Lab of

Technische Universität Wien TU Wien (TUW; german: Technische Universität Wien; still known in English as the Vienna University of Technology from 1975–2014) is one of the major universities in Vienna, Austria. The university finds high international and domestic recogn ...

and Bank of Italy.

Knowledge graph management systems (KGMS)

A knowledge graph management system (KGMS) has to manage knowledge graphs, which incorporate large amounts of data in the form of ''facts'' and ''relationships.'' In general, it can be seen as the union of three components: # KBMS, that is, a knowledge base management system, # Big Data, which is the need of handling large amounts of data, especially when considering that knowledge graphs have been thought as a solution for integrating multiple data sources, both corporate and public knowledge, which can be integrated into large knowledge graphs, # (Data) Analytics is the need to provide access to existing software packages for machine learning, text mining, data analytics, and data visualization and to combine them together in the same platform. From a more technical standpoint, some additional requirements can be identified for defining a proper KGMS: * definition of a language and a formalism with high expressive power, * cost-effective

data wrangling Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one " raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes ...

, in all its steps, from data cleaning to web data extraction and big data access from many different sources, * efficient logical,

probabilistic Probability is the branch of mathematics concerning numerical descriptions of how likely an Event (probability theory), event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and ...

and ontological reasoning, * low complexity, both in terms of space complexity (to handle big data) and syntax, * interfaces ( APIs) to access many heterogeneous data sources, such as corporate

RDBMS A relational database is a (most commonly digital) database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relation ...

NoSQL A NoSQL (originally referring to "non- SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed ...

or RDF stores, the web, machine-learning and analytics packages. Other requirements may include more typical

DBMS In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spa ...

functions and services, as the ones proposed by Codd.

Vadalog system

Vadalog offers a platform that fulfills all the requirements of a KGMS listed above. It is able to perform ''rule-based'' reasoning tasks on top of knowledge graphs and it also supports the data science workflow, such as data visualization and machine learning.

Reasoning task and recursion

A rule is an expression of the form :− where: * are the atoms of the ''body,'' * is the atom of the ''head''. A ''rule'' allows to infer new knowledge starting from the variables that are in the body: when all the variables in the body of a rule are successfully assigned, the rule is activated and it results in the derivation of the head predicate: given a database and a set of rules , ''a reasoning task'' aims at inferring new knowledge, applying the rules of the set to the database (the ''extensional knowledge).'' The most widespread form of knowledge that has been adopted over the last decades has been in the form of rules, be it in rule-based systems, ontology-based systems or other forms and it can be typically captured in knowledge graphs. The nature of knowledge graphs also makes the presence of

recursion Recursion (adjective: ''recursive'') occurs when a thing is defined in terms of itself or of its type. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathematics ...

in these rules a particularly important aspect. Recursion means that the same rules might be called multiple times before obtaining the final answer of the reasoning task and it is particularly powerful as it allows an inference based on previously inferred results. This implies that the system must provide a strategy that guarantees termination. More technically, a program is recursive if the

dependency graph In mathematics, computer science and digital electronics, a dependency graph is a directed graph representing dependencies of several objects towards each other. It is possible to derive an evaluation order or the absence of an evaluation order th ...

built with the application of the rules is cyclical. The simplest form of recursion is that in which the head of a rule also appears in the body (''self-recursive rules'').

The query language

The Vadalog language allows to answer reasoning queries that also include recursion. It is based on Warded Datalog^±, which belongs to the Datalog^± family of languages that extends Datalog with existential quantifiers in rule heads and at the same time restricts its syntax in order to achieve decidability and tractability. Existential rules are also known as tuple-generating dependencies (''tgds''). An ''existential rule'' has the following form: :

\varphi(x) \Rightarrow \exist z \Psi(x,z)

or, alternatively, in Datalog syntax, it can be written as follows: p(X,Z) :- r(X). Variables in Vadalog are like variables in first-order logic and a variable is local to the rule in which it occurs. This means that occurrences of the same variable name in different rules refer to different variables.

Warded Datalog^±

In case of a set of rules

\Sigma

, consisting of the following: r(X,Y) :- p(X). p(Z) :- r(X,Z). the variable Z in the second rule is said to be ''dangerous,'' since the first rule will generate a ''null'' in the second term of the atom ''r'' and this will be injected to the second rule to get the atom ''p,'' leading to a propagation of ''nulls'' when trying to find an answer to the program. If arbitrary propagation is allowed, reasoning is undecidable and the program will be infinite. Warded Datalog^± overcomes this issue asking that for every rule defined in a set

\Sigma

, all the variables in the rule bodies must coexist in at least one atom in the head, called a ''ward.'' The concept of ''wardness'' restricts the way dangerous variable can be used inside a program. Although this is a limit in terms of expressive power, with this requirement and thanks to its architecture and termination algorithms'','' Warded Datalog^± is able to find answers to a program in a finite number of steps. It also exhibits a good trade-off between computational complexity and expressive power, capturing PTIME data complexity while allowing ontological reasoning and the possibility of running programs with recursion.

Vadalog extension

Vadalog replicates in its entirety Warded Datalog^± and extends it with the inclusion in the language of: * monotonic aggregations (''min, max, sum, prod, count'' operators), * stratified negation, * support for different data types (''strings, integer, double, date, boolean, set, lists, marked nulls),'' * rich annotation mechanism to define how to interact with data sources and external libraries. In addition, the system provides a highly engineered architecture to allow efficient computation. This is done in the following two ways. # Adopting an

in-memory processing In computer science, in-memory processing is an emerging technology for processing of data stored in an in-memory database. In-memory processing is one method of addressing the performance and power bottlenecks caused by the movement of data be ...

architecture and a pull stream-based approach, which limits the memory consumption or make it predictable. # Using an aggressive ''termination control strategy'', which detects patterns and redundancy while building of the chase-graph (used to generate the answers) as early as possible and cuts off computation when they occur. This is connected with the concept of

isomorphism In mathematics, an isomorphism is a structure-preserving mapping between two structures of the same type that can be reversed by an inverse mapping. Two mathematical structures are isomorphic if an isomorphism exists between them. The word is ...

of facts, which reduces the number of steps needed to get an answer: if a fact ''h'' is isomorphic to ''h, the system will only explore the chase graph of fact ''h''. The Vadalog system is therefore able to perform ontological reasoning tasks, as it belongs to the Datalog family. Reasoning with the logical core of Vadalog captures OWL 2 QL and SPARQL (through the use of existential quantifiers), and graph analytics (through support for recursion and aggregation). The declarative nature of the language makes the code easy-to-read and manageable.

Example of ontological reasoning task

Consider the following set of Vadalog rules: ancestor(Y,X) :- person(X). ancestor(Y,Z) :- ancestor(Y,X), parent(X,Z). The first rule states that for each person

X

there exists an ancestor

Y

. The second rule states that, if

X

is a parent of

Z

, then

Y

is an ancestor of

Z

too. Note the existential quantification in the first position of the ancestor predicate in the first rule, which will generate a null ν_i in the chase procedure. Such null is then propagated to the head of the second rule. Consider a database D = with the extensional facts and the query of finding all the entailed ancestor facts as reasoning task. By performing the chase procedure, the fact ''ancestor''(ν₁,Alice) is generated by triggering the first rule on ''person''(Alice). Then, ''ancestor''(ν₁,Bob) is created by activating the second rule on ''ancestor''(ν₁,Alice) and ''parent''(Alice,Bob). Finally, the first rule could be triggered on ''person''(Bob), but the resulting fact ''ancestor''(ν₂,Bob) is isomorphic with ''ancestor''(ν₁,Bob), thus this fact is not generated and the corresponding portion of the chase graph is not explored. In conclusion, the answer to the query is the set of facts .

Additional Features

The integration of Vadalog with data science tools is achieved by means of data bindings primitives and functions''.'' # ''Data binding primitives'': bindings allow to connect to external data sources or systems like a database, a framework or a library and tell the system how to deal with them. This includes connections with relational database systems as

, and graph databases, such as

. It is also possible to integrate machine learning libraries (

, scikit-learn, Tensorflow) and a web data extraction tool, OXPath, which allows to retrieve data directly from the web. This is possible through the so-called ''annotations:'' they are special ''facts'' that augment the sets of existential rules with specific behaviours. The @input annotation tells the system that the facts for an atom of the program are imported from an external data source, e.g., a relational database. The @output annotation specifies that the facts for an atom of the program will be exported to an external target, e.g., the standard output or a relational database. Other annotations like @bind, @mapping, @qbind allow to customize the data sources for the @input annotation or the targets for the @output annotation. # ''Functions'': custom expressions which make use of arithmetic operators, string manipulation or comparison can also be supported. Libraries written in Python are also enabled and they can be integrated with the dedicated annotation @library. The system also provides an integration with the JupyterLab platform, where Vadalog programs can be written and run and the output can be read, exploiting the functionalities of the platform. It gives also the possibility to evaluate the correctness of the program, run it and analyse the derivation process of output facts by means of tools as ''syntax highlighting'', ''code analysis'' (checking whether the code is correct or there are errors) and ''explanations of results'' (how the result has been obtained): all these functionalities are embedded in the notebook and help in writing and analyzing Vadalog code.

Use Cases

The Vadalog system can be employed to address many real-world use cases from distinct research and industry fields. Among the latter, this section presents two relevant and accessible cases belonging to the financial domain.

Company Control

A company ownership graph shows entities as nodes and shares as edges. When an entity has a certain amount of shares on another one (commonly identified in the absolute majority), it is able to exert a decision power on that entity and this configures a company control and, more generally, a group structure. Searching for all ''control'' relationships requires to investigate different scenarios and very complex group structures, namely direct and indirect control. This query be translated into the following rules: * a company

X

directly controls a company

Y

X

directly owns more than 50% of the total equity of

Y

, * a company

X

indirectly controls a company

Y

X

controls a set of intermediary companies that jointly (i.e., summing their shares) own more than 50% of

Y

. These rules can be written in a Vadalog program that will derive all ''control'' edges like the following: control(X,X) :- company(X). control(X,Y) :- control(X,Y), own(Y,Z,W), V = sum(W,), V > 0.5. The first rule states that each company controls itself. The second rule defines control of

X

over

Z

by summing the shares of

Z

held by companies

Y

, over all companies

Y

controlled by

X

Close Link

This scenario consists in determining whether there exists a link between two entities in a company ownerships graph. Determining the existence of such links is relevant, for instance, in banking supervision and credit worthiness evaluation, as a company cannot act as guarantor for loans to another company if the two share such a relationship. Formally, two companies

X

and

Y

are involved in a ''close link'' if: *

x

(resp.,

Y

) owns directly or indirectly 20% or more of the equity of

Y

(resp.,

X

), or * a common third-party

Z

owns directly or indirectly 20% or more of the equity of both

X

and

Y

. These rules can be written in a Vadalog program that will derive all ''close link'' edges like the following: mcl(X,Y,S) :- own(X,Y,S). mcl(X,Z,S1 * S2) :- mc1(X,Y,S1), own(Y,Z,S2). cl1(X,Y) :- mcl(X,Y,S), TS = sum(S), TS > 0.2. cl2(X,Y) :- cl1(Z,X), cl1(Z,Y), X != Y. closelink(X,Y) :- cl1(X,Y). closelink(X,Y) :- cl2(X,Y). The first rule states that two companies

X

and

Y

connected by an ownership edge are possible close links. The second rule states that, if

X

and

Y

are possible close links with a share

S1

and there exists an ownership edge from

Y

to a company

Z

with a share

S2

, then also

X

and

Z

are possible close links with a share

S1

*

S2

. The third rule states that, if the sum of all the partial shares

S

Y

owned directly or indirectly by

X

is greater than or equal to 20% of the equity of

Y

, then they are close links according to the first definition. The fourth rule models the second definition of close links, i.e., the third-party case.

References

{{Reflist Graph databases Declarative programming languages