computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...

, an e-graph is a

data structure In computer science, a data structure is a data organization and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the relationships amo ...

that stores an

equivalence relation In mathematics, an equivalence relation is a binary relation that is reflexive, symmetric, and transitive. The equipollence relation between line segments in geometry is a common example of an equivalence relation. A simpler example is equ ...

over terms of some language.

Definition and operations

Let

\Sigma

be a set of uninterpreted functions, where

\Sigma_n

is the subset of

\Sigma

consisting of functions of arity

n

. Let

\mathbb

be a countable set of opaque identifiers that may be compared for equality, called e-class IDs. The application of

f\in\Sigma_n

to e-class IDs

i_1, i_2, \ldots, i_n\in\mathbb

is denoted

f(i_1, i_2, \ldots, i_n)

and called an e-node. The e-graph then represents equivalence classes of e-nodes, using the following data structures: * A union-find structure

U

representing equivalence classes of e-class IDs, with the usual operations

\mathrm

\mathrm

and

\mathrm

. An e-class ID

e

is canonical if

\mathrm(U, e) = e

; an e-node

f(i_1,\ldots,i_n)

is canonical if each

i_j

is canonical (

j

1,\ldots,n

). * An association of e-class IDs with sets of e-nodes, called e-classes. This consists of ** a hashcons

H

(i.e. a mapping) from canonical e-nodes to e-class IDs, and ** an e-class map

M

that maps e-class IDs to e-classes, such that

M

maps equivalent IDs to the same set of e-nodes:

Leftrightarrow \mathrm(U,i)=\mathrm(U,j)

Invariants

In addition to the above structure, a valid e-graph conforms to several data structure invariants. Two e-nodes are equivalent if they are in the same e-class. The congruence invariant states that an e-graph must ensure that equivalence is closed under congruence, where two e-nodes

f(i_1,\ldots,i_n),f(j_1,\ldots,j_n)

are congruent when

\mathrm(U, i_k)=\mathrm(U, j_k),k\in \

. The hashcons invariant states that the hashcons maps canonical e-nodes to their e-class ID.

Operations

E-graphs expose wrappers around the

\mathrm

\mathrm

, and

\mathrm

operations from the union-find that preserve the e-graph invariants. The last operation, e-matching, is described below.

Equivalent formulations

An e-graph can also be formulated as a

bipartite graph In the mathematics, mathematical field of graph theory, a bipartite graph (or bigraph) is a Graph (discrete mathematics), graph whose vertex (graph theory), vertices can be divided into two disjoint sets, disjoint and Independent set (graph theo ...

G=(N\uplus\mathrm,E)

where *

\mathrm

is the set of e-class IDs (as above), *

N

is the set of e-nodes, and *

E\subseteq (\mathrm\times N)\cup (N\times \mathrm)

is a set of directed edges. There is a directed edge from each e-class to each of its members, and from each e-node to each of its children.

E-matching

Let

V

be a set of variables and let

\mathrm(\Sigma, V)

be the smallest set that includes the 0-arity function symbols (also called constants), includes the variables, and is closed under application of the function symbols. In other words,

\mathrm(\Sigma, V)

is the smallest set such that

V\subset\mathrm( \Sigma, V)

\Sigma_0\subset\mathrm(\Sigma, V)

, and when

x_1, \ldots, x_n\in \mathrm(\Sigma, V)

and

f\in\Sigma_n

, then

f(x_1,\ldots,x_n)\in\mathrm(\Sigma, V)

. A term containing variables is called a pattern, a term without variables is called ground. An e-graph

E

represents a ground term

t\in\mathrm(\Sigma, \emptyset)

if one of its e-classes represents

t

. An e-class

C

represents

t

if some e-node

f(i_1,\ldots,i_n)\in C

does. An e-node

f(i_1,\ldots,i_n)\in C

represents a term

g(j_1,\ldots,j_n)

f=g

and each e-class

M_k /math> represents the term j_k (k in 1,\ldots,n).

e-matching is an operation that takes a pattern p\in\mathrm(\Sigma, V) and an e-graph E, and yields all pairs (\sigma, C) where \sigma\subset V\times\mathbb is a substitution mapping the variables in p to e-class IDs and C\in\mathbb is an e-class ID such that the term \sigma(p) is represented by C . There are several known algorithms for e-matching, the ''relational e-matching'' algorithm is based on worst-case optimal joins and is worst-case optimal.

Extraction

Given an e-class and a cost function that maps each function symbol in

\Sigma

to a natural number, the extraction problem is to find a ground term with minimal total cost that is represented by the given e-class. This problem is

NP-hard In computational complexity theory, a computational problem ''H'' is called NP-hard if, for every problem ''L'' which can be solved in non-deterministic polynomial-time, there is a polynomial-time reduction from ''L'' to ''H''. That is, assumi ...

. There is also no constant-factor approximation algorithm for this problem, which can be shown by reduction from the

set cover problem The set cover problem is a classical question in combinatorics, computer science, operations research, and complexity theory. Given a set of elements (henceforth referred to as the universe, specifying all possible elements under considerati ...

. However, for graphs with bounded

treewidth In graph theory, the treewidth of an undirected graph is an integer number which specifies, informally, how far the graph is from being a tree. The smallest treewidth is 1; the graphs with treewidth 1 are exactly the trees and the forests ...

, there is a linear-time,

fixed-parameter tractable In computer science, parameterized complexity is a branch of computational complexity theory that focuses on classifying computational problems according to their inherent difficulty with respect to ''multiple'' parameters of the input or output. ...

algorithm.

Complexity

* An e-graph with ''n'' equalities can be constructed in O(''n'' log ''n'') time.

Equality saturation

Equality saturation is a technique for building optimizing compilers using e-graphs. It operates by applying a set of rewrites using e-matching until the e-graph is saturated, a timeout is reached, an e-graph size limit is reached, a fixed number of iterations is exceeded, or some other halting condition is reached. After rewriting, an optimal term is extracted from the e-graph according to some cost function, usually related to AST size or performance considerations.

Applications

E-graphs are used in

automated theorem proving Automated theorem proving (also known as ATP or automated deduction) is a subfield of automated reasoning and mathematical logic dealing with proving mathematical theorems by computer programs. Automated reasoning over mathematical proof was a majo ...

. They are a crucial part of modern SMT solvers such as Z3 and

CVC4 In computer science and mathematical logic, Cooperating Validity Checker (CVC) is a family of satisfiability modulo theories (SMT) solvers. The latest major versions of CVC are CVC4 and CVC5 (stylized cvc5); earlier versions include CVC, CVC Li ...

, where they are used to decide the

empty theory In mathematical logic, an uninterpreted function or function symbol is one that has no other property than its name and '' n-ary'' form. Function symbols are used, together with constants and variables, to form terms. The theory of uninterpreted ...

by computing the congruence closure of a set of equalities, and e-matching is used to instantiate quantifiers. In DPLL(T)-based solvers that use

conflict-driven clause learning In computer science, conflict-driven clause learning (CDCL) is an algorithm for solving the Boolean satisfiability problem (SAT). Given a Boolean formula, the SAT problem asks for an assignment of variables so that the entire formula evaluates to ...

(also known as non-chronological backtracking), e-graphs are extended to produce proof certificates. E-graphs are also used in the Simplify theorem prover of ESC/Java. Equality saturation is used in specialized

optimizing compiler An optimizing compiler is a compiler designed to generate code that is optimized in aspects such as minimizing program execution time, memory usage, storage size, and power consumption. Optimization is generally implemented as a sequence of op ...

s, e.g. for

deep learning Deep learning is a subset of machine learning that focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience a ...

and

linear algebra Linear algebra is the branch of mathematics concerning linear equations such as :a_1x_1+\cdots +a_nx_n=b, linear maps such as :(x_1, \ldots, x_n) \mapsto a_1x_1+\cdots +a_nx_n, and their representations in vector spaces and through matrix (mathemat ...

. Equality saturation has also been used for translation validation applied to the

LLVM LLVM, also called LLVM Core, is a target-independent optimizer and code generator. It can be used to develop a Compiler#Front end, frontend for any programming language and a Compiler#Back end, backend for any instruction set architecture. LLVM i ...

toolchain. E-graphs have been applied to several problems in

program analysis In computer science, program analysis is the process of analyzing the behavior of computer programs regarding a property such as correctness, robustness, safety and liveness. Program analysis focuses on two major areas: program optimization an ...

, including fuzzing, abstract interpretation, and library learning.

References

* * * * *

External links

The Egg Project

A Colab notebook explaining e-graphs
{{Program analysis Graph data structures