In data structures, a range query consists of preprocessing some input data into a data structure to efficiently answer any number of queries on any subset of the input. In particular, there is an extensively studied group of problems where the input is an array of unsorted numbers and a query consists of computing some function, such as the minimum, on a specific range of the array.

Definition

A range query q_f(A,i,j) on an array A=[a_1,a_2,\ldots,a_n] of ''n'' elements of some set S, denoted A[1,n], takes two indices 1\leq i\leq j\leq n and a function f defined over arrays of elements of S, and outputs f(A[i,j]) = f(a_i,\ldots,a_j).

For example, for f = \sum and A[1,n] an array of numbers, the range query \sum_{i,j} A computes \sum A[i,j] = (a_i+\ldots + a_j), for any 1 \leq i \leq j \leq n. These queries can be answered in constant time using O(n) extra space by computing the sums of the first i elements of A and storing them in an auxiliary array B, such that B[i] contains the sum of the first i elements of A for every 0\leq i\leq n. Any query can then be answered as \sum A[i,j] = B[j] - B[i-1].

This strategy may be extended to every group operator f where the notion of f^{-1} is well defined and easily computable. Finally, this solution can be extended to two-dimensional arrays with a similar preprocessing.
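The prefix-sum strategy described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names `build_prefix_sums` and `range_sum` are hypothetical, and indices are 1-based and inclusive to match the definition.

```python
from itertools import accumulate


def build_prefix_sums(A):
    """Return B with B[i] = sum of the first i elements of A, for 0 <= i <= n."""
    return [0] + list(accumulate(A))


def range_sum(B, i, j):
    """Sum of a_i..a_j (1-based, inclusive), answered in O(1) as B[j] - B[i-1]."""
    return B[j] - B[i - 1]


A = [3, 1, 4, 1, 5]
B = build_prefix_sums(A)      # O(n) preprocessing, O(n) extra space
total = range_sum(B, 2, 4)    # a_2 + a_3 + a_4 = 1 + 4 + 1
```

The subtraction is where the inverse of the group operator is needed: replacing addition with another group operation would replace the subtraction with the corresponding inverse.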

Examples

Semigroup operators

When the function of interest in a range query is a semigroup operator, the notion of f^{-1} is not always defined, so the strategy in the previous section does not work. Andrew Yao showed that there exists an efficient solution for range queries that involve semigroup operators. He proved that for any constant c, a preprocessing of time and space \Theta(c\cdot n) allows range queries on lists, where f is a semigroup operator, to be answered in \Theta(\alpha_c(n)) time, where \alpha_c is a certain functional inverse of the Ackermann function.

Some semigroup operators admit slightly better solutions, for instance when f\in \{\max,\min\}. Assume f = \min; then \min(A[1..n]) returns the index of the minimum element of A[1..n], and \min_{i,j}(A) denotes the corresponding range minimum query. Several data structures answer a range minimum query in O(1) time using a preprocessing of time and space O(n). One such solution is based on the equivalence between this problem and the lowest common ancestor problem.

The Cartesian tree T_A of an array A[1,n] has as root a_i = \min\{a_1,a_2,\ldots,a_n\} and as left and right subtrees the Cartesian tree of A[1,i-1] and the Cartesian tree of A[i+1,n] respectively. A range minimum query \min_{i,j}(A) is the lowest common ancestor in T_A of a_i and a_j. Because the lowest common ancestor problem can be solved in constant time using a preprocessing of time and space O(n), the range minimum query problem can as well. The solution when f = \max is analogous. Cartesian trees can be constructed in linear time.
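As a rough illustration of the Cartesian tree reduction, the sketch below builds the tree in linear time with a stack and answers a range minimum query by walking down from the root to the first node whose index lies inside the query range, which is the lowest common ancestor of the two endpoints. The walk costs time proportional to the tree height, so this is not the constant-time solution mentioned above; the names `build_cartesian_tree` and `range_min_index` are hypothetical, indices are 0-based, and the array is assumed to be non-empty.

```python
def build_cartesian_tree(A):
    """Return (root, left, right) for the min-Cartesian tree of a non-empty A.

    left[i] and right[i] hold the child indices of node i, or -1 if absent.
    """
    n = len(A)
    left = [-1] * n
    right = [-1] * n
    stack = []  # indices on the rightmost path, values non-decreasing upward
    for i in range(n):
        last = -1
        while stack and A[stack[-1]] > A[i]:
            last = stack.pop()
        left[i] = last              # the popped chain becomes i's left subtree
        if stack:
            right[stack[-1]] = i    # i becomes the right child of the new top
        stack.append(i)
    return stack[0], left, right


def range_min_index(root, left, right, i, j):
    """Index of a minimum of A[i..j] (0-based, inclusive): the LCA of i and j."""
    node = root
    while not (i <= node <= j):
        node = right[node] if node < i else left[node]
    return node


A = [3, 1, 4, 1, 5, 9, 2, 6]
root, left, right = build_cartesian_tree(A)
idx = range_min_index(root, left, right, 4, 7)  # minimum of 5, 9, 2, 6
```

Because an in-order traversal of the Cartesian tree visits indices in increasing order, the first node on the root path whose index falls in [i, j] separates i from j and is therefore their lowest common ancestor.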

Mode

The ''mode'' of an array ''A'' is the element that appears the most in ''A''. For instance, the mode of A=[4,5,6,7,4] is 4. In case of ties, any of the most frequent elements may be picked as the mode. A range mode query consists of preprocessing A[1,n] such that the mode of any range of A[1,n] can be found. Several data structures have been devised to solve this problem; some of the results are summarized in the following table. Jørgensen et al. proved a lower bound in the cell-probe model of \Omega\left(\tfrac{\log n}{\log (S w/n)}\right) for any data structure that uses S cells.
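To make the query itself concrete, here is a naive baseline that recounts the range on every query. It does no preprocessing and takes time linear in the range length, so it is not one of the data structures referred to above; the function name `range_mode` is hypothetical and indices are 0-based and inclusive.

```python
from collections import Counter


def range_mode(A, i, j):
    """Return a most frequent element of A[i..j] by counting the whole range."""
    counts = Counter(A[i:j + 1])
    return counts.most_common(1)[0][0]


A = [4, 5, 6, 7, 4]
mode_all = range_mode(A, 0, 4)   # 4 appears twice, every other value once
```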

Median

This particular case is of special interest since finding the median has several applications. The median problem, a special case of the selection problem, is solvable in ''O''(''n'') time using the median of medians algorithm. However, its generalization to range median queries is more recent. A range median query \operatorname{median}(A,i,j), where ''A'', ''i'' and ''j'' have the usual meanings, returns the median element of A[i,j]. Equivalently, \operatorname{median}(A,i,j) should return the element of A[i,j] of rank \frac{j-i}{2}. Range median queries cannot be solved by any of the methods discussed above, including Yao's approach for semigroup operators. Two variants of this problem have been studied: the offline version, where all ''k'' queries of interest are given in a batch, and a version where all the preprocessing is done up front. The offline version can be solved with O(n\log k + k \log n) time and O(n\log k) space.

A quickselect-style procedure rangeMedian(A, i, j, r) finds the element of rank r in A[i,j], where A is an unsorted array of distinct elements; to find range medians we set r=\frac{j-i}{2}.

Procedure rangeMedian partitions A, using A's median, into two arrays A.low and A.high, where the former contains the elements of A that are less than or equal to the median m and the latter the rest of the elements of A. If the number of elements of A[i,j] that end up in A.low is t and this number is bigger than r, then we should keep looking for the element of rank r in A.low; otherwise we should look for the element of rank (r-t) in A.high. To find t, it is enough to find the maximum index m\leq i-1 such that a_m is in A.low and the maximum index l\leq j such that a_l is in A.low. Then t=l-m, with both indices taken as positions within A.low. The total cost for any query, without considering the partitioning part, is O(\log n), since at most \log n recursive calls are made and only a constant number of operations are performed in each of them (to get the value of t, fractional cascading should be used). If a linear algorithm to find the medians is used, the total cost of preprocessing for k range median queries is O(n\log k). The algorithm can also be modified to solve the online version of the problem.
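The recursive partitioning can be sketched as follows. This is a simplified illustration under several assumptions: the whole partition tree is built up front rather than lazily, medians are found by sorting rather than by a linear-time selection algorithm, and t is computed with binary search instead of fractional cascading (so a query costs O(\log^2 n) here). The names `Node`, `range_select` and `range_median` are hypothetical, indices and ranks are 0-based, and elements are assumed distinct.

```python
from bisect import bisect_left, bisect_right


class Node:
    """One level of the partition: items are (original_index, value) pairs."""

    def __init__(self, items):
        self.items = items          # kept in increasing original index order
        self.low = self.high = None
        if len(items) > 1:
            values = sorted(v for _, v in items)
            m = values[(len(values) - 1) // 2]       # lower median
            low = [(p, v) for p, v in items if v <= m]
            high = [(p, v) for p, v in items if v > m]
            self.low_pos = [p for p, _ in low]       # original indices in low
            self.low, self.high = Node(low), Node(high)


def range_select(node, i, j, r):
    """Element of 0-based rank r among the items whose index lies in [i, j]."""
    if len(node.items) == 1:
        return node.items[0][1]
    # t = number of elements of the range that fell into the low part
    t = bisect_right(node.low_pos, j) - bisect_left(node.low_pos, i)
    if t > r:
        return range_select(node.low, i, j, r)
    return range_select(node.high, i, j, r - t)


def range_median(root, i, j):
    return range_select(root, i, j, (j - i) // 2)


A = [5, 1, 4, 2, 3]
root = Node(list(enumerate(A)))
med = range_median(root, 1, 3)   # median of 1, 4, 2
```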

Majority

Finding frequent elements in a given set of items is one of the most important tasks in data mining. It can be difficult when most items have similar frequencies, so it is often more useful to detect only the items whose frequency exceeds some threshold of significance. One of the most famous algorithms for finding the majority of an array was proposed by Boyer and Moore and is known as the Boyer–Moore majority vote algorithm. It finds the majority element of a string (if it has one) in O(n) time using O(1) space. In the context of Boyer and Moore's work, and generally speaking, a majority element in a set of items (for example a string or an array) is one whose number of instances is more than half of the size of that set. A few years later, Misra and Gries proposed a more general version of Boyer and Moore's algorithm that uses O \left ( n \log \left ( \frac{1}{\tau} \right ) \right ) comparisons to find all items in an array whose relative frequencies are greater than some threshold 0<\tau<1.

A range \tau-majority query is one that, given a subrange of a data structure (for example an array) of size |R|, returns the set of all distinct items that appear more than (or in some publications, at least) \tau|R| times in that range. In the different structures that support range \tau-majority queries, \tau can be either static (specified during preprocessing) or dynamic (specified at query time). Many such approaches are based on the fact that, regardless of the size of the range, for a given \tau there can be at most O(1/\tau) distinct ''candidates'' with relative frequencies at least \tau. By verifying each of these candidates in constant time, O(1/\tau) query time is achieved.

A range \tau-majority query is decomposable in the sense that a \tau-majority in a range R with partitions R_1 and R_2 must be a \tau-majority in either R_1 or R_2. Due to this decomposability, some data structures answer \tau-majority queries on one-dimensional arrays by finding the lowest common ancestor (LCA) of the endpoints of the query range in a range tree and validating two sets of candidates (of size O(1/\tau)) from each endpoint to the lowest common ancestor in constant time, resulting in O(1/\tau) query time.
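The bound of O(1/\tau) candidates can be seen in a counter-based sketch in the spirit of the Misra and Gries generalization: at most \lceil 1/\tau \rceil - 1 counters are kept, every item with relative frequency above \tau survives as a candidate, and a second pass removes false positives. This is an illustrative sketch rather than the published algorithm as stated; the function name `tau_majorities` is hypothetical, and the verification pass uses plain counting.

```python
from math import ceil


def tau_majorities(items, tau):
    """Return the set of items occurring more than tau * len(items) times."""
    k = ceil(1 / tau)
    counters = {}                      # at most k - 1 candidates at any time
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    threshold = tau * len(items)
    # Second pass: surviving candidates may be false positives, so verify.
    return {c for c in counters if items.count(c) > threshold}


data = [1, 2, 1, 3, 1, 2, 1]
half = tau_majorities(data, 0.5)      # only 1 occurs more than 3.5 times
quarter = tau_majorities(data, 0.25)  # 1 and 2 occur more than 1.75 times
```

With \tau = 1/2 the sketch keeps a single counter, which corresponds to the Boyer–Moore majority vote.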

Two-dimensional arrays

Gagie et al. proposed a data structure that supports range \tau-majority queries on an m\times n array A. For each query \operatorname{Q}=(\operatorname{R}, \tau) in this data structure, a threshold 0<\tau<1 and a rectangular range \operatorname{R} are specified, and the set of all elements that have relative frequencies (inside that rectangular range) greater than or equal to \tau is returned as the output. This data structure supports dynamic thresholds (specified at query time) and is constructed based on a preprocessing threshold \alpha.

During the preprocessing, a set of ''vertical'' and ''horizontal'' intervals is built on the m \times n array. Together, a vertical and a horizontal interval form a ''block''. Each block is part of a ''superblock'' nine times bigger than itself (three times the size of the block's horizontal interval and three times the size of its vertical one). For each block, a set of candidates (with at most \frac{9}{\alpha} elements) is stored, consisting of the elements that have relative frequencies at least \frac{\alpha}{9} (the preprocessing threshold mentioned above) in its respective superblock. These elements are stored in non-increasing order of frequency, and it is easy to see that any element that has a relative frequency of at least \alpha in a block must appear in its set of candidates.

Each \tau-majority query is first answered by finding the ''query block'', the biggest block that is contained in the provided query rectangle, in O(1) time. For the obtained query block, the first \frac{9}{\tau} candidates are returned (without being verified) in O(1/\tau) time, so this process might return some false positives. Many other data structures (as discussed below) verify each candidate in constant time and thus maintain the O(1/\tau) query time while returning no false positives. The cases in which the query block is smaller than 1/\alpha are handled by storing \log \left ( \frac{1}{\alpha} \right ) different instances of this data structure of the following form:

\beta=2^{-i}, \;\; i\in \left \{ 1,\dots,\log \left ( \frac{1}{\alpha} \right ) \right \}

where \beta is the preprocessing threshold of the i-th instance. Thus, for query blocks smaller than 1/\alpha, the \lceil\log (1 / \tau)\rceil-th instance is queried. As mentioned above, this data structure has query time O(1/\tau) and requires O \left ( m n(H+1) \log^2 \left ( \frac{1}{\alpha} \right ) \right ) bits of space by storing a Huffman-encoded copy of it (note the \log(\frac{1}{\alpha}) factor and also see Huffman coding).

One-dimensional arrays

Chan et al. proposed a data structure that, given a one-dimensional array A, a subrange R of A (specified at query time) and a threshold \tau (specified at query time), is able to return the list of all \tau-majorities in O(1/\tau) time, requiring O(n \log n) words of space. To answer such queries, Chan et al. begin by noting that there exists a data structure capable of returning the ''top-k'' most frequent items in a range in O(k) time, requiring O(n) words of space.

For a one-dimensional array A[0,..,n-1], let a one-sided top-k range query be of the form A[0..i] for 0 \leq i \leq n-1. For a maximal run of ranges A[0..i] through A[0..j] in which the frequency of a distinct element e in A remains unchanged (and equal to f), a horizontal line segment is constructed. The x-interval of this line segment corresponds to [i,j] and it has a y-value equal to f. Since adding each element to A changes the frequency of exactly one distinct element, this process creates O(n) line segments. Moreover, for a vertical line x=i, all horizontal line segments intersecting it are sorted according to their frequencies. Note that each horizontal line segment with x-interval [\ell,r] corresponds to exactly one distinct element e in A, such that A[\ell]=e. A top-k query can then be answered by shooting a vertical ray x=i and reporting the first k horizontal line segments that intersect it (these line segments are already sorted according to their frequencies) in O(k) time.

Chan et al. first construct a range tree in which each branching node stores one copy of the data structure described above for one-sided range top-k queries and each leaf represents an element from A. The top-k data structure at each node is constructed based on the values existing in the subtrees of that node and is meant to answer one-sided range top-k queries. Note that for a one-dimensional array A, a range tree can be constructed by dividing A into two halves and recursing on both halves; therefore, each node of the resulting range tree represents a range. There are O(\log n) levels, and each level \ell has 2^{\ell} nodes. Since at each level \ell the nodes have a total of n elements of A in their subtrees, the space complexity of this range tree is O(n \log n) words.

Using this structure, a range \tau-majority query A[i..j] on A[0..n-1] with 0\leq i\leq j \leq n-1 is answered as follows. First, the lowest common ancestor (LCA) of leaf nodes i and j is found in constant time. Note that there exists a data structure requiring O(n) bits of space that is capable of answering LCA queries in O(1) time. Let z denote the LCA of i and j. Using z and the decomposability of range \tau-majority queries (as described above), the two-sided range query A[i..j] can be converted into two one-sided range top-k queries (from z to i and j). These two one-sided range top-k queries return the top-(1/\tau) most frequent elements in each of their respective ranges in O(1/\tau) time. These frequent elements make up the set of ''candidates'' for \tau-majorities in A[i..j]; there are O(1/\tau) candidates, some of which might be false positives. Each candidate is then assessed in constant time using a linear-space data structure (described as Lemma 3 in the cited work) that is able to determine in O(1) time whether or not a given subrange of an array A contains at least q instances of a particular element e.
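The final verification step can be illustrated with a simpler, slower substitute for the constant-time structure: a sorted list of positions per distinct element, queried with binary search. Each check then costs O(\log n) rather than O(1), so this is a swapped-in technique and not the structure of the lemma referred to above. The names `build_positions`, `count_in_range` and `verify_candidates` are hypothetical; indices are 0-based and inclusive.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_positions(A):
    """Map each distinct element to the increasing list of its positions in A."""
    pos = defaultdict(list)
    for idx, x in enumerate(A):
        pos[x].append(idx)
    return pos


def count_in_range(pos, e, i, j):
    """Number of occurrences of e in A[i..j], via two binary searches."""
    p = pos.get(e, [])
    return bisect_right(p, j) - bisect_left(p, i)


def verify_candidates(pos, candidates, i, j, tau):
    """Keep only the candidates occurring more than tau * (j - i + 1) times."""
    threshold = tau * (j - i + 1)
    return {e for e in candidates if count_in_range(pos, e, i, j) > threshold}


A = [1, 2, 1, 1, 3, 2, 1]
pos = build_positions(A)
# Range A[1..4] holds 2, 1, 1, 3; only 1 occurs more than 0.4 * 4 = 1.6 times.
result = verify_candidates(pos, {1, 2, 3}, 1, 4, 0.4)
```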

Tree paths

Gagie et al. proposed a data structure which supports queries that, given two nodes u and v in a tree, report the list of elements that have a relative frequency greater than \tau on the path from u to v. More formally, let T be a labelled tree in which each node has a label from an alphabet of size \sigma. Let label(u)\in [1,\dots,\sigma] denote the label of node u in T. Let P_{uv} denote the unique path from u to v in T, in which middle nodes are listed in the order they are visited. Given T and a fixed (specified during preprocessing) threshold 0<\tau<1, a query Q(u,v) must return the set of all labels that appear more than \tau|P_{uv}| times in P_{uv}.

To construct this data structure, first O(\tau n) nodes are ''marked''. This can be done by marking any node that has distance at least \lceil 1 / \tau\rceil from the bottom of the tree (height) and whose depth is divisible by \lceil 1 / \tau\rceil. After doing this, it can be observed that the distance between each node and its nearest marked ancestor is less than 2\lceil 1 / \tau\rceil. For a marked node x, \log(depth(x)) different sequences (paths towards the root) P_i(x) are stored, P_{i}(x)=\left\langle \operatorname{label}(x), \operatorname{par}(x), \operatorname{par}^{2}(x), \ldots, \operatorname{par}^{2^{i}}(x)\right\rangle for 0\leq i \leq \log(depth(x)), where \operatorname{par}(x) returns the label of the direct parent of node x. Put another way, for each marked node, the set of all paths towards the root whose length is a power of two (plus one for the node itself) is stored. Moreover, for each P_i(x), the set of all majority ''candidates'' C_i(x) is stored. More specifically, C_i(x) contains the set of all (\tau/2)-majorities in P_i(x), that is, the labels that appear more than (\tau/2)\cdot(2^i+1) times in P_i(x). It is easy to see that the set of candidates C_i(x) can have at most 2/\tau distinct labels for each i.

Gagie et al. then note that the set of all \tau-majorities in the path from any marked node x to one of its ancestors z is included in some C_i(x) (Lemma 2 in the cited work): since the length of P_i(x) is equal to (2^i+1), there exists a P_i(x) for 0\leq i \leq \log(depth(x)) whose length is between d_{xz} and 2 d_{xz}, where d_{xz} is the distance between x and z. The existence of such a P_i(x) implies that a \tau-majority in the path from x to z must be a (\tau/2)-majority in P_i(x), and thus must appear in C_i(x).

This data structure requires O(n \log n) words of space: in the construction phase O(\tau n) nodes are marked, for each marked node O(\log n) candidate sets are stored, and each set contains O(1/\tau) candidates, giving O(\log n \times (1/\tau) \times \tau n)=O(n \log n) words in total. Each node x also stores count(x), the number of instances of label(x) on the path from x to the root of T; this does not increase the space complexity since it only adds a constant number of words per node.
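The marking rule from the construction can be sketched as follows, under the assumption that the tree is given as a parent array in which node 0 is the root and every parent has a smaller index than its children. The function name `mark_nodes` and this input format are hypothetical choices made for illustration.

```python
from math import ceil


def mark_nodes(parent, tau):
    """Return nodes whose height is at least ceil(1/tau) and whose depth is a
    multiple of ceil(1/tau).

    Assumes parent[0] == -1 (root) and parent[v] < v for every other node v.
    """
    n = len(parent)
    step = ceil(1 / tau)
    depth = [0] * n
    for v in range(1, n):                 # parents are processed before children
        depth[v] = depth[parent[v]] + 1
    height = [0] * n
    for v in range(n - 1, 0, -1):         # children are processed before parents
        height[parent[v]] = max(height[parent[v]], height[v] + 1)
    return [v for v in range(n) if height[v] >= step and depth[v] % step == 0]


# A path of seven nodes 0 - 1 - ... - 6 with tau = 1/2, so ceil(1/tau) = 2.
path_parent = [-1, 0, 1, 2, 3, 4, 5]
marked = mark_nodes(path_parent, 0.5)
```

On this path, the nodes at even depth are 0, 2, 4 and 6, and node 6 is excluded because it is a leaf with height 0.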

Each query between two nodes u and v can be answered by using the decomposability property (as explained above) of range \tau-majority queries and by breaking the query path between u and v into four subpaths. Let z be the lowest common ancestor of u and v, with x and y being the nearest marked ancestors of u and v respectively. The path from u to v is decomposed into the paths from u to x and from v to y (these paths are shorter than 2\lceil 1 / \tau\rceil by definition, and all of their labels are considered as candidates), and the paths from x and y to z (handled by finding the suitable C_i(x) as explained above and considering all of its labels as candidates). Boundary nodes have to be handled so that all of these subpaths are disjoint and a set of O(1/\tau) candidates is derived from them. Each of these candidates is then verified using a combination of the labelanc(x, \ell) query, which returns the lowest ancestor of node x that has label \ell, and the count(x) fields of each node. On a w-bit RAM and an alphabet of size \sigma, the labelanc(x, \ell) query can be answered in O\left(\log \log_{w} \sigma\right) time with linear space requirements. Therefore, verifying each of the O(1/\tau) candidates in O\left(\log \log_{w} \sigma\right) time results in O\left((1/\tau)\log \log_{w} \sigma\right) total query time for returning the set of all \tau-majorities on the path from u to v.

External links

Open Data Structure - Chapter 13 - Data Structures for Integers
Data Structures for Range Median Queries - Gerth Stolting Brodal and Allan Gronlund Jorgensen