In computer science, a B-TREE is a self-balancing tree data
structure that keeps data sorted and allows searches, sequential
access, insertions, and deletions in logarithmic time. The
B-tree is a generalization of a binary search tree in which a node can have more than two children.
OVERVIEW
The number of branches (or child nodes) from a node is one more
than the number of keys stored in the node. In a 2-3 B-tree, the
internal nodes store either one key (with two child nodes) or two
keys (with three child nodes).
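To make the key/child relationship concrete, here is a minimal sketch of such a node in Python (the class and field names are our own, not taken from any particular implementation):

```python
class Node:
    """A B-tree node: an internal node with k keys has k + 1 children."""
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted keys stored in this node
        self.children = children or []   # empty for a leaf node

    def is_leaf(self):
        return not self.children

# A 2-3 tree node with two keys has three children:
root = Node([2, 4], [Node([1]), Node([3]), Node([5, 6])])
assert len(root.children) == len(root.keys) + 1
```

The assertion checks the branching invariant stated above.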
B-trees have substantial advantages over alternative implementations when the time to access the data of a node greatly exceeds the time spent processing that data, because then the cost of accessing the node may be amortized over multiple operations within the node. This usually occurs when the node data are in secondary storage such as disk drives. By maximizing the number of keys within each internal node, the height of the tree decreases and the number of expensive node accesses is reduced. In addition, rebalancing of the tree occurs less often. The maximum number of child nodes depends on the information that must be stored for each child node and the size of a full disk block or an analogous size in secondary storage. While 2-3 B-trees are easier to explain, practical B-trees using secondary storage need a large number of child nodes to improve performance.

VARIANTS

The term B-TREE may refer to a specific design or it may refer to a
general class of designs. In the narrow sense, a B-tree stores keys in its internal nodes but need not store those keys in the records at the leaves. The general class includes variations such as the B+ tree and the B* tree.

* In the B+ tree, copies of the keys are stored in the internal nodes; the keys and records are stored in the leaves; in addition, a leaf node may include a pointer to the next leaf node to speed sequential access.
ETYMOLOGY

Rudolf Bayer and Ed McCreight invented the B-tree while working at Boeing Research Labs, but they did not explain what, if anything, the B stands for.
B-TREE USAGE IN DATABASES

TIME TO SEARCH A SORTED FILE

Large databases have historically been kept on disk drives. The time to read a record on a disk drive far exceeds the time needed to compare keys once the record is available. The time to read a record from a disk drive involves a seek time and a rotational delay. The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages about half the rotation period. For a 7200 RPM drive, the rotation period is 8.33 milliseconds. For a drive such as the Seagate ST3500320NS, the track-to-track seek time is 0.8 milliseconds and the average reading seek time is 8.5 milliseconds. For simplicity, assume reading from disk takes about 10 milliseconds.

Naively, then, the time to locate one record out of a million by binary search would take 20 disk reads times 10 milliseconds per disk read, which is 0.2 seconds. The time won't be that bad because individual records are grouped together in a disk BLOCK. A disk block might be 16 kilobytes. If each record is 160 bytes, then 100 records could be stored in each block. The disk read time above was actually for an entire block. Once the disk head is in position, one or more disk blocks can be read with little delay. With 100 records per block, the last 6 or so comparisons don't need to do any disk reads; the comparisons are all within the last disk block read. To speed the search further, the first 13 to 14 comparisons (which each required a disk access) must be sped up.

AN INDEX SPEEDS THE SEARCH

A significant improvement can be made with an index. In the example above, initial disk reads narrowed the search range by a factor of two. That can be improved substantially by creating an auxiliary index that contains the first record in each disk block (sometimes called a sparse index). This auxiliary index would be 1% of the size of the original database, but it can be searched more quickly.
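The arithmetic above can be checked with a short script (the numbers are the illustrative ones from this section):

```python
import math

records = 1_000_000
per_block = 100   # 160-byte records in a 16-kilobyte block
read_ms = 10      # assumed cost of one disk read

total_cmp = math.ceil(math.log2(records))    # ~20 comparisons for binary search
in_block = math.ceil(math.log2(per_block))   # the last ~7 stay inside one block
disk_reads = total_cmp - in_block            # ~13 comparisons each need a read
print(disk_reads, "reads ->", disk_reads * read_ms, "ms")  # 13 reads -> 130 ms
```

This is where the "first 13 to 14 comparisons" figure in the text comes from.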
Finding an entry in the auxiliary index would tell us which block to search in the main database; after searching the auxiliary index, we would have to search only that one block of the main database, at a cost of one more disk read. The index would hold 10,000 entries, so it would take at most 14 comparisons. Like the main database, the last 6 or so comparisons in the auxiliary index would be on the same disk block. The index could be searched in about 8 disk reads, and the desired record could be accessed in 9 disk reads.

The trick of creating an auxiliary index can be repeated to make an auxiliary index to the auxiliary index. That would make an aux-aux index that would need only 100 entries and would fit in one disk block. Instead of reading 14 disk blocks to find the desired record, we only need to read 3 blocks. Reading and searching the first (and only) block of the aux-aux index identifies the relevant block in the aux index. Reading and searching that aux-index block identifies the relevant block in the main database. Instead of 150 milliseconds, we need only 30 milliseconds to get the record.

The auxiliary indices have turned the search problem from a binary search requiring roughly log2 N disk reads to one requiring only logb N disk reads, where b is the blocking factor (the number of entries per block: b = 100 entries per block in our example; log100 1,000,000 = 3 reads). In practice, if the main database is being frequently searched, the aux-aux index and much of the aux index may reside in a disk cache, so they would not incur a disk read.

INSERTIONS AND DELETIONS

If the database does not change, then compiling the index is simple to do, and the index need never be changed. If there are changes, then managing the database and its index becomes more complicated.

Deleting records from a database is relatively easy. The index can stay the same, and the record can just be marked as deleted. The database remains in sorted order.
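The multi-level index arithmetic above can be sketched numerically (illustrative numbers; one sparse-index entry per block of the level below):

```python
entries = 1_000_000
b = 100   # blocking factor: entries per block

# Build sparse-index levels until the top level fits in one block.
level_sizes = []
n = entries
while n > b:
    n = n // b              # one index entry per block of the level below
    level_sizes.append(n)

print(level_sizes)           # [10000, 100]: the aux and aux-aux indices
print(len(level_sizes) + 1)  # 3 disk reads: one per index level + data block
```

With b = 100, two index levels suffice, giving the three disk reads described above.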
If there are a large number of deletions, then searching and storage become less efficient.

Insertions can be very slow in a sorted sequential file because room for the inserted record must be made. Inserting a record before the first record requires shifting all of the records down one. Such an operation is just too expensive to be practical. One solution is to leave some spaces: instead of densely packing all the records in a block, the block can have some free space to allow for subsequent insertions. Those spaces would be marked as if they were "deleted" records.

Both insertions and deletions are fast as long as space is available on a block. If an insertion won't fit on the block, then some free space on some nearby block must be found and the auxiliary indices adjusted. The hope is that enough space is available nearby, such that a lot of blocks do not need to be reorganized. Alternatively, some out-of-sequence disk blocks may be used.

ADVANTAGES OF B-TREE USAGE FOR DATABASES

The B-tree uses all of the ideas described above. In particular, a B-tree:

* keeps keys in sorted order for sequential traversing
* uses a hierarchical index to minimize the number of disk reads
* uses partially full blocks to speed insertions and deletions
* keeps the index balanced with a recursive algorithm

In addition, a B-tree minimizes waste by making sure the interior nodes are at least half full. A B-tree can handle an arbitrary number of insertions and deletions.
TECHNICAL DESCRIPTION

TERMINOLOGY

The literature on B-trees is not uniform in its terminology (Folk & Zoellick 1992, p. 362). Bayer & McCreight (1972), Comer (1979), and others define the ORDER of a B-tree as the minimum number of keys in a non-root node, while Knuth defines the order to be the maximum number of children (which is one more than the maximum number of keys).
The term LEAF is also inconsistent. Bayer & McCreight (1972) considered the leaf level to be the lowest level of keys, but Knuth considered the leaf level to be one level below the lowest keys (Folk & Zoellick 1992, p. 363). There are many possible implementation choices: in some designs, the leaves may hold the entire data record, while in other designs the leaves may hold only pointers to the data record. Those choices are not fundamental to the idea of a B-tree. There are also unfortunate choices, like using the variable k to represent the number of children when k could be confused with the number of keys. For simplicity, most authors assume there are a fixed number of keys
that fit in a node. The basic assumption is that the key size is fixed and
the node size is fixed. In practice, variable-length keys may be
employed (Folk & Zoellick 1992, p. 379).

DEFINITION

Following a common convention, fix a lower bound L and an upper bound U on the number of children of an internal node, where U is equal to either 2L or 2L−1; therefore each internal node is at least half full. The relationship between U and L implies that two half-full nodes can be joined to make a legal node, and one full node can be split into two legal nodes (if there's room to push one element up into the parent). These properties make it possible to delete and insert new values into a B-tree and adjust the tree to preserve the B-tree properties.
A B-tree is kept balanced by requiring that all leaf nodes be at the same depth. This depth will increase slowly as elements are added to the tree, but an increase in the overall depth is infrequent.
Some balanced trees store values only at leaf nodes, and use different kinds of nodes for leaf nodes and internal nodes. B-trees keep values in every node in the tree, and may use the same structure for all nodes. However, since leaf nodes never have children, B-trees can benefit from improved performance by using a specialized structure for them.

BEST CASE AND WORST CASE HEIGHTS

Let h be the height of the classic B-tree. Let n > 0 be the number of entries in the tree. Let m be the maximum number of children a node can have. Each node can have at most m−1 keys. It can be shown (by induction, for example) that a B-tree of height h with all its nodes completely filled has n = m^(h+1) − 1 entries. Hence, the best case height (the minimum height) of a B-tree is

  h_min = ⌈log_m(n+1)⌉ − 1
Let d be the minimum number of children an internal (non-root) node can have. For an ordinary B-tree, d = ⌈m/2⌉. Comer (1979, p. 127) and Cormen et al. (2001, pp. 383–384) give
the worst case height of a B-tree as

  h ≤ ⌊log_d((n+1)/2)⌋
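Both bounds are easy to evaluate directly; a small sketch (the values of n and m are illustrative):

```python
import math

def btree_heights(n, m):
    """Best- and worst-case heights of a classic B-tree holding n entries,
    with maximum fanout m and minimum fanout d = ceil(m/2)."""
    d = math.ceil(m / 2)
    best = math.ceil(math.log(n + 1, m)) - 1
    worst = math.floor(math.log((n + 1) / 2, d))
    return best, worst

print(btree_heights(1_000_000, 100))   # -> (3, 3)
```

For a million entries and a fanout of about 100, the tree is only three levels deep, which is the whole point of wide nodes.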
ALGORITHMS

The discussion below uses "element", "value", "key", "separator", and "separation value" to mean essentially the same thing; there are also some subtle issues at the root and leaves.

SEARCH

Searching is similar to searching a binary search tree. Starting at the root, the tree is recursively traversed from top to bottom. At each level, the search reduces its field of view to the child pointer (subtree) whose range includes the search value. A subtree's range is defined by the values, or keys, contained in its parent node. These limiting values are also known as separation values. Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of interest.

INSERTION

A B-tree insertion example with each iteration. The nodes of this B-tree have at most 3 children (Knuth order 3).

All insertions start at a leaf node. To insert a new element, search the tree to find the leaf node where the new element should be added. Insert the new element into that node with the following steps:

* If the node contains fewer than the maximum legal number of elements, then there is room for the new element. Insert the new element in the node, keeping the node's elements ordered.
* Otherwise the node is full; evenly split it into two nodes:
  * A single median is chosen from among the leaf's elements and the new element.
  * Values less than the median are put in the new left node and values greater than the median are put in the new right node, with the median acting as a separation value.
  * The separation value is inserted in the node's parent, which may cause it to be split, and so on.
If the node has no parent (i.e., the node was the root), create a new root above this node (increasing the height of the tree). If the splitting goes all the way up to the root, it creates a new root with a single separator value and two children, which is why the lower bound on the size of internal nodes does not apply to the root.

The maximum number of elements per node is U−1. When a node is split, one element moves to the parent, but one element is added. So it must be possible to divide the maximum number U−1 of elements into two legal nodes. If this number is odd, then U=2L and one of the new nodes contains (U−2)/2 = L−1 elements, and hence is a legal node, and the other contains one more element, and hence it is legal too. If U−1 is even, then U=2L−1, so there are 2L−2 elements in the node. Half of this number is L−1, which is the minimum number of elements allowed per node.

An improved algorithm supports a single pass down the tree from the root to the node where the insertion will take place, splitting any full nodes encountered on the way. This prevents the need to recall the parent nodes into memory, which may be expensive if the nodes are on secondary storage. However, to use this improved algorithm, we must be able to send one element to the parent and split the remaining U−2 elements into two legal nodes, without adding a new element. This requires U = 2L rather than U = 2L−1, which accounts for why some textbooks impose this requirement in defining B-trees.

DELETION

There are two popular strategies for deletion from a B-tree:

* Locate and delete the item, then restructure the tree to retain its invariants, or
* Do a single pass down the tree, but before entering (visiting) a node, restructure the tree so that once the key to be deleted is encountered, it can be deleted without triggering the need for any further restructuring.

The algorithm below uses the former strategy.
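Returning briefly to insertion, the splitting arithmetic above can be illustrated with a short sketch (keys only; children and the particular value of U are illustrative):

```python
def split_overfull(keys):
    """keys: the U-1 keys of a full node plus the newly inserted key,
    kept sorted. Returns (left, median, right); the median moves up
    to the parent, and both halves are legal nodes (>= L-1 keys)."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Example with U = 6 (so L = 3): a full node holds 5 keys; insert a 6th.
left, median, right = split_overfull([1, 2, 3, 4, 5, 6])
print(left, median, right)   # [1, 2, 3] 4 [5, 6]
```

Both resulting nodes hold at least L−1 = 2 keys, as the counting argument above guarantees.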
There are two special cases to consider when deleting an element:

* The element in an internal node is a separator for its child nodes
* Deleting an element may put its node under the minimum number of elements and children

The procedures for these cases are in order below.

Deletion From A Leaf Node

* Search for the value to delete.
* If the value is in a leaf node, simply delete it from the node.
* If underflow happens, rebalance the tree as described in section "Rebalancing after deletion" below.

Deletion From An Internal Node

Each element in an internal node acts as a separation value for two subtrees, so we need to find a replacement separator. Note that the largest element in the left subtree is still less than the separator. Likewise, the smallest element in the right subtree is still greater than the separator. Both of those elements are in leaf nodes, and either one can be the new separator for the two subtrees. Algorithmically:

* Choose a new separator (either the largest element in the left subtree or the smallest element in the right subtree), remove it from the leaf node it is in, and replace the element to be deleted with the new separator.
* The previous step deleted an element (the new separator) from a leaf node. If that leaf node is now deficient (has fewer than the required number of elements), then rebalance the tree starting from the leaf node.

Rebalancing After Deletion

Rebalancing starts from a leaf and proceeds toward the root until the tree is balanced. If deleting an element from a node has brought it under the minimum size, then some elements must be redistributed to bring all nodes up to the minimum. Usually, the redistribution involves moving an element from a sibling node that has more than the minimum number of elements. That redistribution operation is called a ROTATION. If no sibling can spare an element, then the deficient node must be MERGED with a sibling.
The merge causes the parent to lose a separator element, so the parent may become deficient and need rebalancing. The merging and rebalancing may continue all the way to the root. Since the minimum element count doesn't apply to the root, making the root the only deficient node is not a problem. The algorithm to rebalance the tree is as follows:

* If the deficient node's right sibling exists and has more than the minimum number of elements, then rotate left:
  * Copy the separator from the parent to the end of the deficient node (the separator moves down; the deficient node now has the minimum number of elements)
  * Replace the separator in the parent with the first element of the right sibling (the right sibling loses one element but still has at least the minimum number of elements)
  * The tree is now balanced
* Otherwise, if the deficient node's left sibling exists and has more than the minimum number of elements, then rotate right:
  * Copy the separator from the parent to the start of the deficient node (the separator moves down; the deficient node now has the minimum number of elements)
  * Replace the separator in the parent with the last element of the left sibling (the left sibling loses one element but still has at least the minimum number of elements)
  * The tree is now balanced
* Otherwise, if both immediate siblings have only the minimum number of elements, then merge with a sibling, sandwiching their separator taken from their parent:
  * Copy the separator to the end of the left node (the left node may be the deficient node or it may be the sibling with the minimum number of elements)
  * Move all elements from the right node to the left node (the left node now has the maximum number of elements, and the right node is empty)
  * Remove the separator from the parent along with its empty right child (the parent loses an element)
    * If the parent is the root and now has no elements, then free it and make the merged node the new root (the tree becomes shallower)
    * Otherwise, if the parent has fewer than the required number of elements, then rebalance the parent

NOTE: The rebalancing operations are different for B+ trees (e.g., rotation is different because the parent has a copy of the key) and B*-trees (e.g., three siblings are merged into two siblings).

SEQUENTIAL ACCESS

While freshly loaded databases tend to have good sequential behavior, this behavior becomes increasingly difficult to maintain as a database grows, resulting in more random I/O and performance challenges.

INITIAL CONSTRUCTION

In applications, it is frequently useful to build a B-tree to represent a large existing collection of data and then update it incrementally using standard B-tree operations. In this case, the most efficient way to construct the initial B-tree is not to insert every element successively, but instead to construct the initial set of leaf nodes directly from the input, then build the internal nodes from these.
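One common way to bulk-load a B-tree from sorted input is to build the leaf level directly, giving each leaf except the last one extra value, and then promote those extra values to the parent level. A sketch (illustrative code; the function name is our own):

```python
def bulk_load_leaves(values, leaf_max):
    """Build the leaf level for a sorted collection: each leaf except the
    last gets leaf_max + 1 values; the extra last value of each such leaf
    will be promoted to the parent level."""
    step = leaf_max + 1
    return [values[i:i + step] for i in range(0, len(values), step)]

leaves = bulk_load_leaves(list(range(1, 25)), 4)
print(leaves[0], leaves[-1])                 # [1, 2, 3, 4, 5] [21, 22, 23, 24]
print([leaf[-1] for leaf in leaves[:-1]])    # [5, 10, 15, 20] move up a level
```

Repeating the same promotion step level by level yields the tree in the worked example below.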
For example, if the leaf nodes have maximum size 4 and the initial collection is the integers 1 through 24, we would initially construct 4 leaf nodes containing 5 values each and 1 which contains 4 values:

  1 2 3 4 5 | 6 7 8 9 10 | 11 12 13 14 15 | 16 17 18 19 20 | 21 22 23 24

We build the next level up from the leaves by taking the last element from each leaf node except the last one. Again, each node except the last will contain one extra value. In the example, suppose the internal nodes contain at most 2 values (3 child pointers). Then the next level up of internal nodes would be:

  5 10 15 | 20
  1 2 3 4 | 6 7 8 9 | 11 12 13 14 | 16 17 18 19 | 21 22 23 24

This process is continued until we reach a level with only one node and it is not overfilled. In the example only the root level remains:

  15
  5 10 | 20
  1 2 3 4 | 6 7 8 9 | 11 12 13 14 | 16 17 18 19 | 21 22 23 24

IN FILESYSTEMS

In addition to its use in databases, the B-tree is also used in filesystems to allow quick random access to an arbitrary block in a particular file. The basic problem is turning the file block address into a disk block address.
Some operating systems require the user to allocate the maximum size of the file when the file is created. The file can then be allocated as contiguous disk blocks. In that case, to convert the file block address i into a disk block address, the operating system simply adds the file block address i to the address of the first disk block constituting the file. The scheme is simple, but the file cannot exceed its created size. Other operating systems allow a file to grow. The resulting disk blocks may not be contiguous, so mapping logical blocks to physical blocks is more involved.
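A minimal sketch of the two mapping schemes just described (function names and the example addresses are illustrative):

```python
def contiguous_map(first_block, i):
    # Contiguous allocation: logical block i sits at a fixed offset.
    return first_block + i

def table_map(block_table, i):
    # Non-contiguous allocation: a per-file table (or a tree, as in the
    # filesystems below) maps logical block i to an arbitrary disk block.
    return block_table[i]

print(contiguous_map(5000, 3))            # -> 5003
print(table_map([5000, 7042, 610], 2))    # -> 610
```

A flat table works for small files; a B-tree keeps the lookup shallow when the table itself grows too large to scan or to hold in one block.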
MS-DOS, for example, used a simple File Allocation Table (FAT). TOPS-20 (and possibly TENEX) used a 0 to 2 level tree with similarities to a B-tree. Apple's filesystem HFS+ and Microsoft's NTFS use B-trees. B*-trees are used in the HFS and Reiser4 file systems.

VARIATIONS

ACCESS CONCURRENCY

Lehman and Yao showed that all the read locks could be avoided (and
thus concurrent access greatly improved) by linking the tree blocks at
each level together with a "next" pointer. This results in a tree
structure where both insertion and search operations descend from the
root to the leaf. Write locks are only required as a tree block is
modified. This maximizes access concurrency by multiple users, an
important consideration for databases and other B-tree-based ISAM storage methods.
United States Patent 5283894, granted in 1994, appears to show a way
to use a 'Meta Access Method' to allow concurrent B+ tree access and modification without locks.
SEE ALSO

* B+ tree
NOTES

* ^ For FAT, what is called a "disk block" here is what the FAT documentation calls a "cluster", which is a fixed-size group of one or more contiguous whole physical disk sectors. For the purposes of this discussion, a cluster has no significant difference from a physical sector.
* ^ Two of these were reserved for special purposes, so only 4078 could actually represent disk blocks (clusters).

REFERENCES

* ^ Counted B-Trees, retrieved 2010-01-25
* ^ Knuth's video lectures from Stanford
* ^ Video of the talk at CPM 2013 (24th Annual Symposium on Combinatorial Pattern Matching, Bad Herrenalb, Germany, June 17–19, 2013), retrieved 2014-01-17; see question asked by Martin Farach-Colton
* ^ Seagate Technology LLC, Product Manual: Barracuda ES.2 Serial ATA, Rev. F., publication 100468393, 2008, page 6
* ^ Bayer & McCreight (1972) avoided the issue by saying an index element is a (physically adjacent) pair of (x, a) where x is the key, and a is some associated information. The associated information might be a pointer to a record or records in a random access file, but what it was didn't really matter. Bayer & McCreight (1972) state, "For this paper the associated information is of no further interest."
* ^ If n is zero, then no root node is needed, so the height of an empty tree is not well defined.
* ^ "Cache Oblivious B-trees". State University of New York (SUNY) at Stony Brook. Retrieved 2011-01-17.
* ^ Mark Russinovich. "Inside Win2K NTFS, Part 1". Microsoft Developer Network. Archived from the original on 13 April 2008. Retrieved 2008-04-18.
* ^ "Efficient locking for concurrent operations on B-trees". Portal.acm.org. doi:10.1145/319628.319663. Retrieved 2012-06-28.
* ^ http://www.dtic.mil/cgibin/GetTRDoc?AD=ADA232287&Location=U2&doc=GetTRDoc.pdf
* ^ "Downloads - high-concurrency-btree - High Concurrency B-Tree code in C - GitHub Project Hosting". Retrieved 2014-01-27.
* ^ Lockless Concurrent B+Tree

General

* Bayer, R.; McCreight, E. (1972), "Organization and Maintenance of Large Ordered Indexes" (PDF), Acta Informatica, 1 (3): 173–189, doi:10.1007/bf00288683
* Comer, Douglas (June 1979), "The Ubiquitous B-Tree", Computing Surveys, 11 (2): 123–137, ISSN 0360-0300, doi:10.1145/356770.356776
* Cormen, Thomas; Leiserson, Charles; Rivest, Ronald; Stein, Clifford (2001), Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill
ORIGINAL PAPERS

* Bayer, Rudolf; McCreight, E. (July 1970), Organization and Maintenance of Large Ordered Indices, Mathematical and Information Sciences Report No. 20, Boeing Scientific Research Laboratories