The sort-merge join (also known as merge join) is a
join algorithm and is used in the implementation of a
relational database management system
In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and an ...
.
The basic problem of a join algorithm is to find, for each distinct value of the join attribute, the set of
tuple
In mathematics, a tuple is a finite sequence or ''ordered list'' of numbers or, more generally, mathematical objects, which are called the ''elements'' of the tuple. An -tuple is a tuple of elements, where is a non-negative integer. There is o ...
s in each relation which display that value. The key idea of the sort-merge algorithm is to first sort the relations by the join attribute, so that interleaved linear scans will encounter these sets at the same time.
In practice, the most expensive part of performing a sort-merge join is arranging for both inputs to the algorithm to be presented in sorted order. This can be achieved via an explicit sort operation (often an
external sort), or by taking advantage of a pre-existing ordering in one or both of the join relations.
The latter condition, called interesting order, can occur because an input to the join might be produced by an index scan of a tree-based index, another merge join, or some other plan operator that happens to produce output sorted on an appropriate key. Interesting orders need not be serendipitous: the optimizer may seek out this possibility and choose a plan that is suboptimal for a specific preceding operation if it yields an interesting order that one or more downstream nodes can exploit.
Complexity
Let
and
be relations where
.
fits in
pages memory and
fits in
pages memory. In the worst case, a sort-merge join will run in
I/O operations. In the case that
and
are not ordered the worst case time cost will contain additional terms of sorting time:
, which equals
(as
linearithmic
In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations p ...
terms outweigh the linear terms, see
Big O notation – Orders of common functions).
Pseudocode
For simplicity, the algorithm is described in the case of an
inner join
A join clause in the Structured Query Language ( SQL) combines columns from one or more tables into a new table. The operation corresponds to a join operation in relational algebra. Informally, a join stitches two tables and puts on the same ro ...
of two relations ''left'' and ''right''. Generalization to other join types is straightforward. The output of the algorithm will contain only rows contained in the ''left'' and ''right'' relation and duplicates form a
Cartesian product
In mathematics, specifically set theory, the Cartesian product of two sets and , denoted , is the set of all ordered pairs where is an element of and is an element of . In terms of set-builder notation, that is
A\times B = \.
A table c ...
.
function Sort-Merge Join(left: Relation, right: Relation, comparator: Comparator)
Since the comparison logic is not the central aspect of this algorithm, it is hidden behind a generic comparator and can also consist of several comparison criteria (e.g. multiple columns). The compare function should return if a row is ''less(-1)'', ''equal(0)'' or ''bigger(1)'' than another row:
function compare(leftRow: RelationRow, rightRow: RelationRow): number
Note that a relation in terms of this pseudocode supports some basic operations:
interface Relation
Simple C# implementation
Note that this implementation assumes the join attributes are unique, i.e., there is no need to output multiple tuples for a given value of the key.
public class MergeJoin
public class Relation
See also
*
Hash join
*
Nested loop join
References
External links
C# Implementations of Various Join Algorithms
{{DEFAULTSORT:Sort-Merge Join
Join algorithms
Articles with example pseudocode
Articles with example C Sharp code