HOME

TheInfoList



OR:

R-trees are
tree data structure In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be conn ...
s used for spatial access methods, i.e., for indexing multi-dimensional information such as
geographical coordinates The geographic coordinate system (GCS) is a spherical or ellipsoidal coordinate system for measuring and communicating positions directly on the Earth as latitude and longitude. It is the simplest, oldest and most widely used of the vario ...
,
rectangle In Euclidean plane geometry, a rectangle is a quadrilateral with four right angles. It can also be defined as: an equiangular quadrilateral, since equiangular means that all of its angles are equal (360°/4 = 90°); or a parallelogram containi ...
s or
polygon In geometry, a polygon () is a plane figure that is described by a finite number of straight line segments connected to form a closed ''polygonal chain'' (or ''polygonal circuit''). The bounded plane region, the bounding circuit, or the two toge ...
s. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both theoretical and applied contexts. A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc. and then find answers quickly to queries such as "Find all museums within 2 km of my current location", "retrieve all road segments within 2 km of my location" (to display them in a
navigation system A navigation system is a computing system that aids in navigation. Navigation systems may be entirely on board the vehicle or vessel that the system is controlling (for example, on the ship's bridge) or located elsewhere, making use of radio or othe ...
) or "find the nearest gas station" (although not taking roads into account). The R-tree can also accelerate
nearest neighbor search Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function ...
for various distance metrics, including
great-circle distance The great-circle distance, orthodromic distance, or spherical distance is the distance along a great circle. It is the shortest distance between two points on the surface of a sphere, measured along the surface of the sphere (as opposed to a ...
.


R-tree idea

The key idea of the data structure is to group nearby objects and represent them with their
minimum bounding rectangle In computational geometry, the minimum bounding rectangle (MBR), also known as bounding box (BBOX) or envelope, is an expression of the maximum extents of a two-dimensional object (e.g. point, line, polygon) or set of objects within its coordin ...
in the next higher level of the tree; the "R" in R-tree is for rectangle. Since all objects lie within this bounding rectangle, a query that does not intersect the bounding rectangle also cannot intersect any of the contained objects. At the leaf level, each rectangle describes a single object; at higher levels the aggregation includes an increasing number of objects. This can also be seen as an increasingly coarse approximation of the data set. Similar to the
B-tree In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for ...
, the R-tree is also a balanced search tree (so all leaf nodes are at the same depth), organizes the data in pages, and is designed for storage on disk (as used in
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases sp ...
s). Each page can contain a maximum number of entries, often denoted as M. It also guarantees a minimum fill (except for the root node), however best performance has been experienced with a minimum fill of 30%–40% of the maximum number of entries (B-trees guarantee 50% page fill, and
B*-tree In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for ...
s even 66%). The reason for this is the more complex balancing required for spatial data as opposed to linear data stored in B-trees. As with most trees, the searching algorithms (e.g., intersection, containment,
nearest neighbor search Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function ...
) are rather simple. The key idea is to use the bounding boxes to decide whether or not to search inside a subtree. In this way, most of the nodes in the tree are never read during a search. Like B-trees, R-trees are suitable for large data sets and
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases sp ...
s, where nodes can be paged to memory when needed, and the whole tree cannot be kept in main memory. Even if data can be fit in memory (or cached), the R-trees in most practical applications will usually provide performance advantages over naive check of all objects when the number of objects is more than few hundred or so. However, for in-memory applications, there are similar alternatives that can provide slightly better performance or be simpler to implement in practice. To maintain in-memory computing for R-tree in a computer cluster where computing nodes are connected by a network, researchers have used RDMA ( Remote Direct Memory Access) to implement data-intensive applications under R-tree in a distributed environment. This approach is scalable for increasingly large applications and achieves high throughput and low latency performance for R-tree. The key difficulty of R-tree is to build an efficient tree that on one hand is balanced (so the leaf nodes are at the same height) on the other hand the rectangles do not cover too much empty space and do not overlap too much (so that during search, fewer subtrees need to be processed). For example, the original idea for inserting elements to obtain an efficient tree is to always insert into the subtree that requires least enlargement of its bounding box. Once that page is full, the data is split into two sets that should cover the minimal area each. Most of the research and improvements for R-trees aims at improving the way the tree is built and can be grouped into two objectives: building an efficient tree from scratch (known as bulk-loading) and performing changes on an existing tree (insertion and deletion). R-trees do not guarantee good worst-case performance, but generally perform well with real-world data. While more of theoretical interest, the (bulk-loaded)
Priority R-tree The Priority R-tree is a worst-case asymptotically optimal alternative to the spatial tree R-tree. It was first proposed by Arge, De Berg, Haverkort and Yi, K. in an article from 2004. The prioritized R-tree is essentially a hybrid between a k-d ...
variant of the R-tree is worst-case optimal, but due to the increased complexity, has not received much attention in practical applications so far. When data is organized in an R-tree, the neighbors within a given distance r and the
k nearest neighbors In statistics, the ''k''-nearest neighbors algorithm (''k''-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regre ...
(for any Lp-Norm) of all points can efficiently be computed using a spatial join. This is beneficial for many algorithms based on such queries, for example the
Local Outlier Factor In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data poin ...
. DeLi-Clu, Density-Link-Clustering is a
cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...
algorithm that uses the R-tree structure for a similar kind of spatial join to efficiently compute an
OPTICS Optics is the branch of physics that studies the behaviour and properties of light, including its interactions with matter and the construction of instruments that use or detect it. Optics usually describes the behaviour of visible, ultraviole ...
clustering.


Variants

*
Priority R-tree The Priority R-tree is a worst-case asymptotically optimal alternative to the spatial tree R-tree. It was first proposed by Arge, De Berg, Haverkort and Yi, K. in an article from 2004. The prioritized R-tree is essentially a hybrid between a k-d ...
*
R*-tree In data processing R*-trees are a variant of R-trees used for indexing spatial information. R*-trees have slightly higher construction cost than standard R-trees, as the data may need to be reinserted; but the resulting tree will usually have a ...
*
R+ tree R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found sig ...
* RR* tree *
Hilbert R-tree Hilbert R-tree, an R-tree variant, is an index for multidimensional objects such as lines, regions, 3-D objects, or high-dimensional feature-based parametric objects. It can be thought of as an extension to B+-tree for multidimensional objects. Th ...
*
X-tree In computer science tree data structures, an X-tree (for ''eXtended node tree'') is an index tree structure based on the R-tree used for storing data in many dimensions. It appeared in 1996, and differs from R-trees (1984), R+-trees (1987) an ...


Algorithm


Data layout

Data in R-trees is organized in pages that can have a variable number of entries (up to some pre-defined maximum, and usually above a minimum fill). Each entry within a non-
leaf node In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be con ...
stores two pieces of data: a way of identifying a
child node In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be c ...
, and the
bounding box In geometry, the minimum or smallest bounding or enclosing box for a point set in dimensions is the box with the smallest measure (area, volume, or hypervolume in higher dimensions) within which all the points lie. When other kinds of measure ...
of all entries within this child node. Leaf nodes store the data required for each child, often a point or bounding box representing the child and an external identifier for the child. For point data, the leaf entries can be just the points themselves. For polygon data (that often requires the storage of large polygons) the common setup is to store only the MBR (minimum bounding rectangle) of the polygon along with a unique identifier in the tree.


Search

In
range searching In computer science, the range searching problem consists of processing a set ''S'' of objects, in order to determine which objects from ''S'' intersect with a query object, called the ''range''. For example, if ''S'' is a set of points correspond ...
, the input is a search rectangle (Query box). Searching is quite similar to searching in a B+ tree. The search starts from the root node of the tree. Every internal node contains a set of rectangles and pointers to the corresponding child node and every leaf node contains the rectangles of spatial objects (the pointer to some spatial object can be there). For every rectangle in a node, it has to be decided if it overlaps the search rectangle or not. If yes, the corresponding child node has to be searched also. Searching is done like this in a recursive manner until all overlapping nodes have been traversed. When a leaf node is reached, the contained bounding boxes (rectangles) are tested against the search rectangle and their objects (if there are any) are put into the result set if they lie within the search rectangle. For priority search such as
nearest neighbor search Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function ...
, the query consists of a point or rectangle. The root node is inserted into the priority queue. Until the queue is empty or the desired number of results have been returned the search continues by processing the nearest entry in the queue. Tree nodes are expanded and their children reinserted. Leaf entries are returned when encountered in the queue. This approach can be used with various distance metrics, including
great-circle distance The great-circle distance, orthodromic distance, or spherical distance is the distance along a great circle. It is the shortest distance between two points on the surface of a sphere, measured along the surface of the sphere (as opposed to a ...
for geographic data.


Insertion

To insert an object, the tree is traversed recursively from the root node. At each step, all rectangles in the current directory node are examined, and a candidate is chosen using a heuristic such as choosing the rectangle which requires least enlargement. The search then descends into this page, until reaching a leaf node. If the leaf node is full, it must be split before the insertion is made. Again, since an exhaustive search is too expensive, a heuristic is employed to split the node into two. Adding the newly created node to the previous level, this level can again overflow, and these overflows can propagate up to the root node; when this node also overflows, a new root node is created and the tree has increased in height.


Choosing the insertion subtree

The algorithm needs to decide in which subtree to insert. When a data object is fully contained in a single rectangle, the choice is clear. When there are multiple options or rectangles in need of enlargement, the choice can have a significant impact on the performance of the tree. The objects are inserted into the subtree that needs the least enlargement. A Mixture heuristic is employed throughout. What happens next is it tries to minimize the overlap (in case of ties, prefer least enlargement and then least area); at the higher levels, it behaves similar to the R-tree, but on ties again preferring the subtree with smaller area. The decreased overlap of rectangles in the
R*-tree In data processing R*-trees are a variant of R-trees used for indexing spatial information. R*-trees have slightly higher construction cost than standard R-trees, as the data may need to be reinserted; but the resulting tree will usually have a ...
is one of the key benefits over the traditional R-tree.


Splitting an overflowing node

Finally, the
X-tree In computer science tree data structures, an X-tree (for ''eXtended node tree'') is an index tree structure based on the R-tree used for storing data in many dimensions. It appeared in 1996, and differs from R-trees (1984), R+-trees (1987) an ...
can be seen as a R*-tree variant that can also decide to not split a node, but construct a so-called super-node containing all the extra entries, when it doesn't find a good split (in particular for high-dimensional data).


Deletion

Deleting an entry from a page may require updating the bounding rectangles of parent pages. However, when a page is underfull, it will not be balanced with its neighbors. Instead, the page will be dissolved and all its children (which may be subtrees, not only leaf objects) will be reinserted. If during this process the root node has a single element, the tree height can decrease.


Bulk-loading

* Nearest-X: Objects are sorted by their first coordinate ("X") and then split into pages of the desired size. * Packed
Hilbert R-tree Hilbert R-tree, an R-tree variant, is an index for multidimensional objects such as lines, regions, 3-D objects, or high-dimensional feature-based parametric objects. It can be thought of as an extension to B+-tree for multidimensional objects. Th ...
: variation of Nearest-X, but sorting using the Hilbert value of the center of a rectangle instead of using the X coordinate. There is no guarantee the pages will not overlap. * Sort-Tile-Recursive (STR): Another variation of Nearest-X, that estimates the total number of leaves required as l=\lceil \text / \text\rceil, the required split factor in each dimension to achieve this as s=\lceil l^\rceil, then repeatedly splits each dimensions successively into s equal sized partitions using 1-dimensional sorting. The resulting pages, if they occupy more than one page, are again bulk-loaded using the same algorithm. For point data, the leaf nodes will not overlap, and "tile" the data space into approximately equal sized pages. * Overlap Minimizing Top-down (OMT): Improvement over STR using a top-down approach which minimizes overlaps between slices and improves query performance. *
Priority R-tree The Priority R-tree is a worst-case asymptotically optimal alternative to the spatial tree R-tree. It was first proposed by Arge, De Berg, Haverkort and Yi, K. in an article from 2004. The prioritized R-tree is essentially a hybrid between a k-d ...


See also

*
Segment tree In computer science, a segment tree, also known as a statistic tree, is a tree data structure used for storing information about intervals, or segments. It allows querying which of the stored segments contain a given point. It is, in principle ...
*
Interval tree In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to f ...
– A degenerate R-tree for one dimension (usually time). * K-d tree *
Bounding volume hierarchy A bounding volume hierarchy (BVH) is a tree structure on a set of geometric objects. All geometric objects, that form the leaf nodes of the tree, are wrapped in bounding volumes. These nodes are then grouped as small sets and enclosed within la ...
*
Spatial index A spatial database is a general-purpose database (usually a relational database) that has been enhanced to include spatial data that represents objects defined in a geometric space, along with tools for querying and analyzing such data. Most sp ...
*
GiST In computing, GiST or Generalized Search Tree, is a data structure and API that can be used to build a variety of disk-based search trees. GiST is a generalization of the B+ tree, providing a concurrent and recoverable height-balanced search tr ...


References


External links

* {{DEFAULTSORT:R-Tree