In computer science, a B-TREE is a self-balancing tree data
structure that keeps data sorted and allows searches, sequential
access, insertions, and deletions in logarithmic time. The
B-tree is a generalization of a binary search tree in which a node can have more than two children.
OVERVIEW
The number of branches (or child nodes) from a node is one more
than the number of keys stored in the node. In a 2-3 B-tree, the
internal nodes store either one key (with two child nodes) or two
keys (with three child nodes).
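To make the key/child relationship concrete, here is a minimal sketch of such a node in Python (the class and field names are our own, not taken from any particular implementation):

```python
class Node:
    """A B-tree node: an internal node with k keys has k + 1 children."""
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted keys stored in this node
        self.children = children or []   # empty for a leaf node

    def is_leaf(self):
        return not self.children

# A 2-3 tree node with two keys has three children:
root = Node([2, 4], [Node([1]), Node([3]), Node([5, 6])])
assert len(root.children) == len(root.keys) + 1
```

The assertion checks the branching invariant stated above.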
B-trees have substantial advantages over alternative implementations when the time to access the data of a node greatly exceeds the time spent processing that data, because then the cost of accessing the node may be amortized over multiple operations within the node. This usually occurs when the node data are in secondary storage such as disk drives. By maximizing the number of keys within each internal node, the height of the tree decreases and the number of expensive node accesses is reduced. In addition, rebalancing of the tree occurs less often. The maximum number of child nodes depends on the information that must be stored for each child node and the size of a full disk block or an analogous size in secondary storage. While 2-3 B-trees are easier to explain, practical B-trees using secondary storage need a large number of child nodes to improve performance.

VARIANTS

The term B-TREE may refer to a specific design or it may refer to a
general class of designs. In the narrow sense, a B-tree stores keys in its internal nodes but need not store those keys in the records at the leaves. The general class includes variations such as the B+ tree and the B* tree.

* In the B+ tree, copies of the keys are stored in the internal nodes; the keys and records are stored in the leaves; in addition, a leaf node may include a pointer to the next leaf node to speed sequential access.
ETYMOLOGY

Rudolf Bayer and Ed McCreight invented the B-tree while working at Boeing Research Labs, but they did not explain what, if anything, the B stands for.
B-TREE USAGE IN DATABASES

TIME TO SEARCH A SORTED FILE

Large databases have historically been kept on disk drives. The time to read a record on a disk drive far exceeds the time needed to compare keys once the record is available. The time to read a record from a disk drive involves a seek time and a rotational delay. The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages about half the rotation period. For a 7200 RPM drive, the rotation period is 8.33 milliseconds. For a drive such as the Seagate ST3500320NS, the track-to-track seek time is 0.8 milliseconds and the average reading seek time is 8.5 milliseconds. For simplicity, assume reading from disk takes about 10 milliseconds.

Naively, then, the time to locate one record out of a million by binary search would take 20 disk reads times 10 milliseconds per disk read, which is 0.2 seconds. The time won't be that bad because individual records are grouped together in a disk BLOCK. A disk block might be 16 kilobytes. If each record is 160 bytes, then 100 records could be stored in each block. The disk read time above was actually for an entire block. Once the disk head is in position, one or more disk blocks can be read with little delay. With 100 records per block, the last 6 or so comparisons don't need to do any disk reads; the comparisons are all within the last disk block read. To speed the search further, the first 13 to 14 comparisons (which each required a disk access) must be sped up.

AN INDEX SPEEDS THE SEARCH

A significant improvement can be made with an index. In the example above, initial disk reads narrowed the search range by a factor of two. That can be improved substantially by creating an auxiliary index that contains the first record in each disk block (sometimes called a sparse index). This auxiliary index would be 1% of the size of the original database, but it can be searched more quickly.
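The arithmetic above can be checked with a short script (the numbers are the illustrative ones from this section):

```python
import math

records = 1_000_000
per_block = 100   # 160-byte records in a 16-kilobyte block
read_ms = 10      # assumed cost of one disk read

total_cmp = math.ceil(math.log2(records))    # ~20 comparisons for binary search
in_block = math.ceil(math.log2(per_block))   # the last ~7 stay inside one block
disk_reads = total_cmp - in_block            # ~13 comparisons each need a read
print(disk_reads, "reads ->", disk_reads * read_ms, "ms")  # 13 reads -> 130 ms
```

This is where the "first 13 to 14 comparisons" figure in the text comes from.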
Finding an entry in the auxiliary index would tell us which block to search in the main database; after searching the auxiliary index, we would have to search only that one block of the main database, at a cost of one more disk read. The index would hold 10,000 entries, so it would take at most 14 comparisons. Like the main database, the last 6 or so comparisons in the auxiliary index would be on the same disk block. The index could be searched in about 8 disk reads, and the desired record could be accessed in 9 disk reads.

The trick of creating an auxiliary index can be repeated to make an auxiliary index to the auxiliary index. That would make an aux-aux index that would need only 100 entries and would fit in one disk block. Instead of reading 14 disk blocks to find the desired record, we only need to read 3 blocks. Reading and searching the first (and only) block of the aux-aux index identifies the relevant block in the aux index. Reading and searching that aux-index block identifies the relevant block in the main database. Instead of 150 milliseconds, we need only 30 milliseconds to get the record.

The auxiliary indices have turned the search problem from a binary search requiring roughly log2 N disk reads to one requiring only logb N disk reads, where b is the blocking factor (the number of entries per block: b = 100 entries per block in our example; log100 1,000,000 = 3 reads). In practice, if the main database is being frequently searched, the aux-aux index and much of the aux index may reside in a disk cache, so they would not incur a disk read.

INSERTIONS AND DELETIONS

If the database does not change, then compiling the index is simple to do, and the index need never be changed. If there are changes, then managing the database and its index becomes more complicated.

Deleting records from a database is relatively easy. The index can stay the same, and the record can just be marked as deleted. The database remains in sorted order.
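The multi-level index arithmetic above can be sketched numerically (illustrative numbers; one sparse-index entry per block of the level below):

```python
entries = 1_000_000
b = 100   # blocking factor: entries per block

# Build sparse-index levels until the top level fits in one block.
level_sizes = []
n = entries
while n > b:
    n = n // b              # one index entry per block of the level below
    level_sizes.append(n)

print(level_sizes)           # [10000, 100]: the aux and aux-aux indices
print(len(level_sizes) + 1)  # 3 disk reads: one per index level + data block
```

With b = 100, two index levels suffice, giving the three disk reads described above.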
If there are a large number of deletions, then searching and storage become less efficient.

Insertions can be very slow in a sorted sequential file because room for the inserted record must be made. Inserting a record before the first record requires shifting all of the records down one. Such an operation is just too expensive to be practical. One solution is to leave some spaces: instead of densely packing all the records in a block, the block can have some free space to allow for subsequent insertions. Those spaces would be marked as if they were "deleted" records.

Both insertions and deletions are fast as long as space is available on a block. If an insertion won't fit on the block, then some free space on some nearby block must be found and the auxiliary indices adjusted. The hope is that enough space is available nearby, such that a lot of blocks do not need to be reorganized. Alternatively, some out-of-sequence disk blocks may be used.

ADVANTAGES OF B-TREE USAGE FOR DATABASES

The B-tree uses all of the ideas described above. In particular, a B-tree:

* keeps keys in sorted order for sequential traversing
* uses a hierarchical index to minimize the number of disk reads
* uses partially full blocks to speed insertions and deletions
* keeps the index balanced with a recursive algorithm

In addition, a B-tree minimizes waste by making sure the interior nodes are at least half full. A B-tree can handle an arbitrary number of insertions and deletions.
TECHNICAL DESCRIPTION

TERMINOLOGY

The literature on B-trees is not uniform in its terminology (Folk & Zoellick 1992, p. 362). Bayer & McCreight (1972), Comer (1979), and others define the ORDER of a B-tree as the minimum number of keys in a non-root node, while Knuth defines the order to be the maximum number of children (which is one more than the maximum number of keys).
The term LEAF is also inconsistent. Bayer & McCreight (1972) considered the leaf level to be the lowest level of keys, but Knuth considered the leaf level to be one level below the lowest keys (Folk & Zoellick 1992, p. 363). There are many possible implementation choices: in some designs, the leaves may hold the entire data record, while in other designs the leaves may hold only pointers to the data record. Those choices are not fundamental to the idea of a B-tree. There are also unfortunate choices, like using the variable k to represent the number of children when k could be confused with the number of keys. For simplicity, most authors assume there are a fixed number of keys
that fit in a node. The basic assumption is that the key size is fixed and
the node size is fixed. In practice, variable-length keys may be
employed (Folk & Zoellick 1992, p. 379).

DEFINITION

Following a common convention, fix a lower bound L and an upper bound U on the number of children of an internal node, where U is equal to either 2L or 2L−1; therefore each internal node is at least half full. The relationship between U and L implies that two half-full nodes can be joined to make a legal node, and one full node can be split into two legal nodes (if there's room to push one element up into the parent). These properties make it possible to delete and insert new values into a B-tree and adjust the tree to preserve the B-tree properties.
A B-tree is kept balanced by requiring that all leaf nodes be at the same depth. This depth will increase slowly as elements are added to the tree, but an increase in the overall depth is infrequent.
Some balanced trees store values only at leaf nodes, and use different kinds of nodes for leaf nodes and internal nodes. B-trees keep values in every node in the tree, and may use the same structure for all nodes. However, since leaf nodes never have children, B-trees can benefit from improved performance by using a specialized structure for them.

BEST CASE AND WORST CASE HEIGHTS

Let h be the height of the classic B-tree. Let n > 0 be the number of entries in the tree. Let m be the maximum number of children a node can have. Each node can have at most m−1 keys. It can be shown (by induction, for example) that a B-tree of height h with all its nodes completely filled has n = m^(h+1) − 1 entries. Hence, the best case height (the minimum height) of a B-tree is

  h_min = ⌈log_m(n+1)⌉ − 1
Let d be the minimum number of children an internal (non-root) node can have. For an ordinary B-tree, d = ⌈m/2⌉. Comer (1979, p. 127) and Cormen et al. (2001, pp. 383–384) give
the worst case height of a B-tree as

  h ≤ ⌊log_d((n+1)/2)⌋
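Both bounds are easy to evaluate directly; a small sketch (the values of n and m are illustrative):

```python
import math

def btree_heights(n, m):
    """Best- and worst-case heights of a classic B-tree holding n entries,
    with maximum fanout m and minimum fanout d = ceil(m/2)."""
    d = math.ceil(m / 2)
    best = math.ceil(math.log(n + 1, m)) - 1
    worst = math.floor(math.log((n + 1) / 2, d))
    return best, worst

print(btree_heights(1_000_000, 100))   # -> (3, 3)
```

For a million entries and a fanout of about 100, the tree is only three levels deep, which is the whole point of wide nodes.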
ALGORITHMS

The discussion below uses "element", "value", "key", "separator", and "separation value" to mean essentially the same thing; there are also some subtle issues at the root and leaves.

SEARCH

Searching is similar to searching a binary search tree. Starting at the root, the tree is recursively traversed from top to bottom. At each level, the search reduces its field of view to the child pointer (subtree) whose range includes the search value. A subtree's range is defined by the values, or keys, contained in its parent node. These limiting values are also known as separation values. Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of interest.

INSERTION

A B-tree insertion example with each iteration. The nodes of this B-tree have at most 3 children (Knuth order 3).

All insertions start at a leaf node. To insert a new element, search the tree to find the leaf node where the new element should be added. Insert the new element into that node with the following steps:

* If the node contains fewer than the maximum legal number of elements, then there is room for the new element. Insert the new element in the node, keeping the node's elements ordered.
* Otherwise the node is full; evenly split it into two nodes:
  * A single median is chosen from among the leaf's elements and the new element.
  * Values less than the median are put in the new left node and values greater than the median are put in the new right node, with the median acting as a separation value.
  * The separation value is inserted in the node's parent, which may cause it to be split, and so on.
If the node has no parent (i.e., the node was the root), create a new root above this node (increasing the height of the tree). If the splitting goes all the way up to the root, it creates a new root with a single separator value and two children, which is why the lower bound on the size of internal nodes does not apply to the root.

The maximum number of elements per node is U−1. When a node is split, one element moves to the parent, but one element is added. So it must be possible to divide the maximum number U−1 of elements into two legal nodes. If this number is odd, then U=2L and one of the new nodes contains (U−2)/2 = L−1 elements, and hence is a legal node, and the other contains one more element, and hence it is legal too. If U−1 is even, then U=2L−1, so there are 2L−2 elements in the node. Half of this number is L−1, which is the minimum number of elements allowed per node.

An improved algorithm supports a single pass down the tree from the root to the node where the insertion will take place, splitting any full nodes encountered on the way. This prevents the need to recall the parent nodes into memory, which may be expensive if the nodes are on secondary storage. However, to use this improved algorithm, we must be able to send one element to the parent and split the remaining U−2 elements into two legal nodes, without adding a new element. This requires U = 2L rather than U = 2L−1, which accounts for why some textbooks impose this requirement in defining B-trees.

DELETION

There are two popular strategies for deletion from a B-tree:

* Locate and delete the item, then restructure the tree to retain its invariants, or
* Do a single pass down the tree, but before entering (visiting) a node, restructure the tree so that once the key to be deleted is encountered, it can be deleted without triggering the need for any further restructuring.

The algorithm below uses the former strategy.
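Returning briefly to insertion, the splitting arithmetic above can be illustrated with a short sketch (keys only; children and the particular value of U are illustrative):

```python
def split_overfull(keys):
    """keys: the U-1 keys of a full node plus the newly inserted key,
    kept sorted. Returns (left, median, right); the median moves up
    to the parent, and both halves are legal nodes (>= L-1 keys)."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Example with U = 6 (so L = 3): a full node holds 5 keys; insert a 6th.
left, median, right = split_overfull([1, 2, 3, 4, 5, 6])
print(left, median, right)   # [1, 2, 3] 4 [5, 6]
```

Both resulting nodes hold at least L−1 = 2 keys, as the counting argument above guarantees.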
There are two special cases to consider when deleting an element:

* The element in an internal node is a separator for its child nodes
* Deleting an element may put its node under the minimum number of elements and children

The procedures for these cases are in order below.

Deletion From A Leaf Node

* Search for the value to delete.
* If the value is in a leaf node, simply delete it from the node.
* If underflow happens, rebalance the tree as described in section "Rebalancing after deletion" below.

Deletion From An Internal Node

Each element in an internal node acts as a separation value for two subtrees, so we need to find a replacement separator. Note that the largest element in the left subtree is still less than the separator. Likewise, the smallest element in the right subtree is still greater than the separator. Both of those elements are in leaf nodes, and either one can be the new separator for the two subtrees. Algorithmically:

* Choose a new separator (either the largest element in the left subtree or the smallest element in the right subtree), remove it from the leaf node it is in, and replace the element to be deleted with the new separator.
* The previous step deleted an element (the new separator) from a leaf node. If that leaf node is now deficient (has fewer than the required number of elements), then rebalance the tree starting from the leaf node.

Rebalancing After Deletion

Rebalancing starts from a leaf and proceeds toward the root until the tree is balanced. If deleting an element from a node has brought it under the minimum size, then some elements must be redistributed to bring all nodes up to the minimum. Usually, the redistribution involves moving an element from a sibling node that has more than the minimum number of elements. That redistribution operation is called a ROTATION. If no sibling can spare an element, then the deficient node must be MERGED with a sibling.
The merge causes the parent to lose a separator element, so the parent may become deficient and need rebalancing. The merging and rebalancing may continue all the way to the root. Since the minimum element count doesn't apply to the root, making the root the only deficient node is not a problem. The algorithm to rebalance the tree is as follows:

* If the deficient node's right sibling exists and has more than the minimum number of elements, then rotate left:
  * Copy the separator from the parent to the end of the deficient node (the separator moves down; the deficient node now has the minimum number of elements)
  * Replace the separator in the parent with the first element of the right sibling (the right sibling loses one element but still has at least the minimum number of elements)
  * The tree is now balanced
* Otherwise, if the deficient node's left sibling exists and has more than the minimum number of elements, then rotate right:
  * Copy the separator from the parent to the start of the deficient node (the separator moves down; the deficient node now has the minimum number of elements)
  * Replace the separator in the parent with the last element of the left sibling (the left sibling loses one element but still has at least the minimum number of elements)
  * The tree is now balanced
* Otherwise, if both immediate siblings have only the minimum number of elements, then merge with a sibling, sandwiching their separator taken from their parent:
  * Copy the separator to the end of the left node (the left node may be the deficient node or it may be the sibling with the minimum number of elements)
  * Move all elements from the right node to the left node (the left node now has the maximum number of elements, and the right node is empty)
  * Remove the separator from the parent along with its empty right child (the parent loses an element)
    * If the parent is the root and now has no elements, then free it and make the merged node the new root (the tree becomes shallower)
    * Otherwise, if the parent has fewer than the required number of elements, then rebalance the parent

NOTE: The rebalancing operations are different for B+ trees (e.g., rotation is different because the parent has a copy of the key) and B*-trees (e.g., three siblings are merged into two siblings).

SEQUENTIAL ACCESS

While freshly loaded databases tend to have good sequential behavior, this behavior becomes increasingly difficult to maintain as a database grows, resulting in more random I/O and performance challenges.

INITIAL CONSTRUCTION

In applications, it is frequently useful to build a B-tree to represent a large existing collection of data and then update it incrementally using standard B-tree operations. In this case, the most efficient way to construct the initial B-tree is not to insert every element successively, but instead to construct the initial set of leaf nodes directly from the input, then build the internal nodes from these.
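One common way to bulk-load a B-tree from sorted input is to build the leaf level directly, giving each leaf except the last one extra value, and then promote those extra values to the parent level. A sketch (illustrative code; the function name is our own):

```python
def bulk_load_leaves(values, leaf_max):
    """Build the leaf level for a sorted collection: each leaf except the
    last gets leaf_max + 1 values; the extra last value of each such leaf
    will be promoted to the parent level."""
    step = leaf_max + 1
    return [values[i:i + step] for i in range(0, len(values), step)]

leaves = bulk_load_leaves(list(range(1, 25)), 4)
print(leaves[0], leaves[-1])                 # [1, 2, 3, 4, 5] [21, 22, 23, 24]
print([leaf[-1] for leaf in leaves[:-1]])    # [5, 10, 15, 20] move up a level
```

Repeating the same promotion step level by level yields the tree in the worked example below.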
For example, if the leaf nodes have maximum size 4 and the initial collection is the integers 1 through 24, we would initially construct 4 leaf nodes containing 5 values each and 1 which contains 4 values:

  1 2 3 4 5 | 6 7 8 9 10 | 11 12 13 14 15 | 16 17 18 19 20 | 21 22 23 24

We build the next level up from the leaves by taking the last element from each leaf node except the last one. Again, each node except the last will contain one extra value. In the example, suppose the internal nodes contain at most 2 values (3 child pointers). Then the next level up of internal nodes would be:

  5 10 15 | 20
  1 2 3 4 | 6 7 8 9 | 11 12 13 14 | 16 17 18 19 | 21 22 23 24

This process is continued until we reach a level with only one node and it is not overfilled. In the example only the root level remains:

  15
  5 10 | 20
  1 2 3 4 | 6 7 8 9 | 11 12 13 14 | 16 17 18 19 | 21 22 23 24

IN FILESYSTEMS

In addition to its use in databases, the B-tree is also used in filesystems to allow quick random access to an arbitrary block in a particular file. The basic problem is turning the file block address into a disk block address.
Some operating systems require the user to allocate the maximum size of the file when the file is created. The file can then be allocated as contiguous disk blocks. In that case, to convert the file block address i into a disk block address, the operating system simply adds the file block address i to the address of the first disk block constituting the file. The scheme is simple, but the file cannot exceed its created size. Other operating systems allow a file to grow. The resulting disk blocks may not be contiguous, so mapping logical blocks to physical blocks is more involved.
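A minimal sketch of the two mapping schemes just described (function names and the example addresses are illustrative):

```python
def contiguous_map(first_block, i):
    # Contiguous allocation: logical block i sits at a fixed offset.
    return first_block + i

def table_map(block_table, i):
    # Non-contiguous allocation: a per-file table (or a tree, as in the
    # filesystems below) maps logical block i to an arbitrary disk block.
    return block_table[i]

print(contiguous_map(5000, 3))            # -> 5003
print(table_map([5000, 7042, 610], 2))    # -> 610
```

A flat table works for small files; a B-tree keeps the lookup shallow when the table itself grows too large to scan or to hold in one block.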
MS-DOS, for example, used a simple File Allocation Table (FAT). TOPS-20 (and possibly TENEX) used a 0 to 2 level tree with similarities to a B-tree. Apple's filesystem HFS+ and Microsoft's NTFS use B-trees. B*-trees are used in the HFS and Reiser4 file systems.

VARIATIONS

ACCESS CONCURRENCY

Lehman and Yao showed that all the read locks could be avoided (and
thus concurrent access greatly improved) by linking the tree blocks at
each level together with a "next" pointer. This results in a tree
structure where both insertion and search operations descend from the
root to the leaf. Write locks are only required as a tree block is
modified. This maximizes access concurrency by multiple users, an
important consideration for databases and other B-tree-based ISAM storage methods.
United States Patent 5283894, granted in 1994, appears to show a way
to use a 'Meta Access Method' to allow concurrent B+ tree access and modification without locks.
SEE ALSO

* B+ tree
NOTES

* ^ For FAT, what is called a "disk block" here is what the FAT documentation calls a "cluster", which is a fixed-size group of one or more contiguous whole physical disk sectors. For the purposes of this discussion, a cluster has no significant difference from a physical sector.
* ^ Two of these were reserved for special purposes, so only 4078 could actually represent disk blocks (clusters).

REFERENCES

* ^ Counted B-Trees, retrieved 2010-01-25
* ^ Knuth's video lectures from Stanford
* ^ Video of the talk at CPM 2013 (24th Annual Symposium on Combinatorial Pattern Matching, Bad Herrenalb, Germany, June 17–19, 2013), retrieved 2014-01-17; see question asked by Martin Farach-Colton
* ^ Seagate Technology LLC, Product Manual: Barracuda ES.2 Serial ATA, Rev. F., publication 100468393, 2008, page 6
* ^ Bayer & McCreight (1972) avoided the issue by saying an index element is a (physically adjacent) pair of (x, a) where x is the key, and a is some associated information. The associated information might be a pointer to a record or records in a random access file, but what it was didn't really matter. Bayer & McCreight (1972) state, "For this paper the associated information is of no further interest."
* ^ If n is zero, then no root node is needed, so the height of an empty tree is not well defined.
* ^ "Cache Oblivious B-trees". State University of New York (SUNY) at Stony Brook. Retrieved 2011-01-17.
* ^ Mark Russinovich. "Inside Win2K NTFS, Part 1". Microsoft Developer Network. Archived from the original on 13 April 2008. Retrieved 2008-04-18.
* ^ "Efficient locking for concurrent operations on B-trees". Portal.acm.org. doi:10.1145/319628.319663. Retrieved 2012-06-28.
* ^ http://www.dtic.mil/cgibin/GetTRDoc?AD=ADA232287&Location=U2&doc=GetTRDoc.pdf
* ^ "Downloads - high-concurrency-btree - High Concurrency B-Tree code in C - GitHub Project Hosting". Retrieved 2014-01-27.
* ^ Lockless Concurrent B+Tree

General

* Bayer, R.; McCreight, E. (1972), "Organization and Maintenance of Large Ordered Indexes" (PDF), Acta Informatica, 1 (3): 173–189, doi:10.1007/bf00288683
* Comer, Douglas (June 1979), "The Ubiquitous B-Tree", Computing Surveys, 11 (2): 123–137, ISSN 0360-0300, doi:10.1145/356770.356776
* Cormen, Thomas; Leiserson, Charles; Rivest, Ronald; Stein, Clifford (2001), Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill
ORIGINAL PAPERS

* Bayer, Rudolf; McCreight, E. (July 1970), Organization and Maintenance of Large Ordered Indices, Mathematical and Information Sciences Report No. 20, Boeing Scientific Research Laboratories