Array Database Management System

Array database management systems (array DBMSs) provide database services specifically for arrays (also called raster data), that is: homogeneous collections of data items (often called pixels, voxels, etc.), sitting on a regular grid of one, two, or more dimensions. Arrays are often used to represent sensor, simulation, image, or statistics data. Such arrays tend to be Big Data, with single objects frequently ranging into terabyte and soon petabyte sizes; for example, today's Earth and space observation archives typically grow by terabytes a day. Array databases aim to offer flexible, scalable storage and retrieval for this category of information.


Overview

In the same way that standard database systems do on sets, Array DBMSs offer scalable, flexible storage and flexible retrieval/manipulation on arrays of (conceptually) unlimited size. As arrays never appear standalone in practice, such an array model is normally embedded into some overall data model, such as the relational model. Some systems implement arrays as an analogy to tables; others introduce arrays as an additional attribute type.

Management of arrays requires novel techniques, particularly because traditional database tuples and objects tend to fit well into a single database page (a unit of disk access on the server, typically 4 KB), while array objects can easily span several media. The prime task of the array storage manager is to give fast access to large arrays and sub-arrays. To this end, arrays are partitioned during insertion into so-called ''tiles'' or ''chunks'' of convenient size, which then act as units of access during query evaluation.

Array DBMSs offer query languages giving declarative access to such arrays, allowing users to create, manipulate, search, and delete them. As with, e.g., SQL, expressions of arbitrary complexity can be built on top of a set of core array operations. Due to the extensions made in the data and query model, Array DBMSs are sometimes subsumed under the NoSQL category, in the sense of "not only SQL". Query optimization and parallelization are important for achieving scalability; in fact, many array operators lend themselves well to parallel evaluation, by processing each tile on a separate node or core.

Important application domains of Array DBMSs include Earth, space, life, and social sciences, as well as the related commercial applications (such as hydrocarbon exploration in industry and OLAP in business). The variety can be observed, e.g., in geo data, where 1-D environmental sensor time series, 2-D satellite images, 3-D x/y/t image time series and x/y/z geophysics data, as well as 4-D x/y/z/t climate and ocean data can be found.


History and status

The relational data model, which is prevailing today, does not directly support the array paradigm to the same extent as sets and tuples. ISO SQL lists an array-valued attribute type, but this is only one-dimensional, with almost no operational support, and not usable for the application domains of Array DBMSs. Another option is to resort to BLOBs ("binary large objects"), which are the equivalent of files: byte strings of (conceptually) unlimited length, but again without any query language functionality, such as multi-dimensional subsetting.

The first significant work going beyond BLOBs was PICDMS. This system offers the precursor of a 2-D array query language, albeit still procedural and without suitable storage support. A first declarative query language suitable for multiple dimensions and with an algebra-based semantics was published by Baumann, together with a scalable architecture. Another array database language, constrained to 2-D, was presented by Marathe and Salem. Seminal theoretical work was accomplished by Libkin et al.; in their model, called NCRA, they extend a nested relational calculus with multidimensional arrays; among the results are important contributions on array query complexity analysis. A map algebra suitable for 2-D and 3-D spatial raster data was published by Mennis et al.

In terms of Array DBMS implementations, the rasdaman system has the longest implementation track record of n-D arrays with full query support. Oracle GeoRaster offers chunked storage of 2-D raster maps, albeit without SQL integration. TerraLib is open-source GIS software that extends object-relational DBMS technology to handle spatio-temporal data types; while its main focus is on vector data, there is also some support for rasters. Starting with version 2.0, PostGIS embeds support for 2-D rasters; a special function offers declarative raster query functionality. SciQL is an array query language being added to the MonetDB DBMS. SciDB is a more recent initiative to establish array database support. As in SciQL, arrays are seen as an equivalent to tables, rather than as a new attribute type as in rasdaman and PostGIS.

For the special case of sparse data, OLAP data cubes are well established; they store cell values together with their location (an adequate compression technique given that only a few locations carry valid information at all) and operate on them with SQL. As this technique does not scale in density, standard databases are not used today for dense data, like satellite images, where most cells carry meaningful information; rather, proprietary ad hoc implementations prevail in scientific data management and similar situations. Hence, this is where Array DBMSs can make a particular contribution.

Generally, Array DBMSs are an emerging technology. While operationally deployed systems exist, like Oracle GeoRaster, PostGIS 2.0 and rasdaman, there are still many open research questions, including query language design and formalization, query optimization, parallelization and distributed processing, and scalability issues in general. Besides, scientific communities still appear reluctant to take up array database technology and tend to favor specialized, proprietary technology.


Concepts

When adding arrays to databases, all facets of database design need to be reconsidered, ranging from conceptual modeling (such as suitable operators) through storage management (such as management of arrays spanning multiple media) to query processing (such as efficient processing strategies).


Conceptual modeling

Formally, an array ''A'' is given by a (total or partial) function ''A'': ''X'' → ''V'' where ''X'', the ''domain'', is a ''d''-dimensional integer interval for some ''d''>0 and ''V'', called the ''range'', is some (non-empty) value set; in set notation, this can be rewritten as { (''p'',''v'') | ''p'' in ''X'', ''v'' in ''V'' }. Each (''p'',''v'') in ''A'' denotes an array element or ''cell'', and following common notation we write ''A''[''p''] = ''v''. Examples for ''X'' include {0..767} × {0..1023} (for XGA sized images); examples for ''V'' include {0..255} for 8-bit greyscale images and {0..255} × {0..255} × {0..255} for standard RGB imagery.

Following established database practice, an array query language should be declarative and safe in evaluation. As iteration over an array is at the heart of array processing, declarativeness very much centers on this aspect. The requirement, then, is that conceptually all cells should be inspected simultaneously; in other words, the query does not enforce any explicit iteration sequence over the array cells during evaluation. Evaluation safety is achieved when every query terminates after a finite number of (finite-time) steps; again, avoiding general loops and recursion is a way of achieving this. At the same time, avoiding explicit loop sequences opens up manifold optimization opportunities.
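
The formal definition can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the helper name `make_array` and the dictionary representation are hypothetical and not taken from any array DBMS, and real systems would not store cells this way.

```python
from itertools import product

def make_array(domain, cell_value):
    """Build an array as a mapping from integer points p in X to values v in V.

    domain: one inclusive (lo, hi) pair per dimension, e.g. [(0, 2), (0, 3)].
    cell_value: function returning the value stored at a given point.
    (Hypothetical helper, for illustration only.)
    """
    axes = [range(lo, hi + 1) for lo, hi in domain]
    return {p: cell_value(p) for p in product(*axes)}

# A tiny 3 x 4 "greyscale image": X = {0..2} x {0..3}, V = {0..255}.
A = make_array([(0, 2), (0, 3)], lambda p: (10 * p[0] + p[1]) % 256)

cell = A[(1, 2)]      # the notation A[p] = v, here for p = (1, 2)
cell_count = len(A)   # one cell per point of the domain
```

Note that nothing in the definition of `A` prescribes an order in which the cells are visited, which mirrors the declarativeness requirement described above.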


Array querying

The rasdaman algebra and query language can serve as an example of array query operators; they establish an expression language over a minimal set of array primitives. We begin with the generic core operators and then present common special cases and shorthands.

The ''marray'' operator creates an array over some given domain extent and initializes its cells:

    marray index-range-specification values cell-value-expression

where ''index-range-specification'' defines the result domain and binds an iteration variable to it, without specifying an iteration sequence. The ''cell-value-expression'' is evaluated at each location of the domain.

Example: "A cutout of array A given by the corner points (10,20) and (40,50)."

    marray p in [10:40,20:50] values A[p]

This special case, pure subsetting, can be abbreviated as

    A[10:40,20:50]

This subsetting keeps the dimension of the array; to reduce the dimension by extracting slices, a single slicepoint value is indicated in the slicing dimension.

Example: "A slice through an x/y/t timeseries at position t=100, retrieving all available data in x and y."

    A[*:*,*:*,100]

The wildcard operator ''*'' indicates that the current boundary of the array is to be used; note that arrays where dimension boundaries are left open at definition time may change size in those dimensions over the array's lifetime.

The above examples have simply copied the original values; instead, these values may be manipulated.

Example: "Array A, with a log() applied to each cell value."

    marray p in sdom(A) values log( A[p] )

This can be abbreviated as:

    log( A )

Through a principle called ''induced operations'', the query language offers all operations the cell type offers on array level, too. Hence, on numeric values all the usual unary and binary arithmetic, exponential, and trigonometric operations are available in a straightforward manner, plus the standard set of Boolean operators.

The ''condense'' operator aggregates cell values into one scalar result, similar to SQL aggregates. Its application has the general form:

    condense condense-op over index-range-specification using cell-value-expression

As with ''marray'' before, the ''index-range-specification'' specifies the domain to be iterated over and binds an iteration variable to it, again without specifying an iteration sequence. Likewise, ''cell-value-expression'' is evaluated at each domain location. The ''condense-op'' clause specifies the aggregating operation used to combine the cell value expressions into one single value.

Example: "The sum over all values in A."

    condense + over p in sdom(A) using A[p]

A shorthand for this operation is:

    add_cells( A )

In the same manner and in analogy to SQL aggregates, a number of further shorthands are provided, including counting, average, minimum, maximum, and Boolean quantifiers.

The next example demonstrates the combination of the ''marray'' and ''condense'' operators by deriving a histogram.

Example: "A histogram over 8-bit greyscale image A."

    marray bucket in [0:255] values count_cells( A = bucket )

The induced comparison, ''A=bucket'', establishes a Boolean array of the same extent as ''A''. The aggregation operator counts the occurrences of ''true'' for each value of ''bucket'', which subsequently is put into the proper array cell of the 1-D histogram array.

Such languages allow formulating statistical and imaging operations which can be expressed analytically without using loops. It has been proven that the expressive power of such array languages in principle is equivalent to relational query languages with ranking.
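
To make the semantics of the two core operators more tangible, the following Python sketch emulates them on dictionary-based arrays. It is a rough illustration, not rasdaman code: the functions `points`, `marray`, `condense`, and `sdom` are written from scratch here and only borrow the operator names, and the explicit iteration inside them is an artifact of the emulation, not part of the query model.

```python
import math
import operator
from functools import reduce
from itertools import product

def points(domain):
    """Enumerate all points of a domain given as inclusive (lo, hi) pairs."""
    return product(*(range(lo, hi + 1) for lo, hi in domain))

def marray(domain, cell_value):
    """Array constructor: evaluate cell_value at every point of the domain."""
    return {p: cell_value(p) for p in points(domain)}

def condense(op, domain, cell_value):
    """Aggregate cell_value over a non-empty domain with a commutative, associative op."""
    return reduce(op, (cell_value(p) for p in points(domain)))

def sdom(array):
    """Domain of an array as inclusive (lo, hi) pairs, one per dimension."""
    dims = len(next(iter(array)))
    return [(min(p[d] for p in array), max(p[d] for p in array)) for d in range(dims)]

# A 4 x 4 test array with cell values between 1 and 7.
A = marray([(0, 3), (0, 3)], lambda p: p[0] + p[1] + 1)

cutout = marray([(1, 2), (0, 1)], lambda p: A[p])        # pure subsetting
logged = marray(sdom(A), lambda p: math.log(A[p]))       # induced log()
total = condense(operator.add, sdom(A), lambda p: A[p])  # sum over all cells

# Histogram: for each bucket, count the cells whose value equals the bucket.
histogram = marray([(0, 7)], lambda b: condense(
    operator.add, sdom(A), lambda p: int(A[p] == b[0])))
```

In this emulation the histogram is itself the result of a `marray` whose cell expression is a `condense`, which is the same nesting as in the query above; the bucket range is shortened to 0..7 only because the toy array holds small values.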


Array storage

Array storage has to accommodate arrays of different dimensions and typically large sizes. A core task is to maintain spatial proximity on disk so as to reduce the number of disk accesses during subsetting. Note that an emulation of multi-dimensional arrays as nested lists (or 1-D arrays) will not per se accomplish this and, therefore, in general will not lead to scalable architectures.

Commonly, arrays are partitioned into sub-arrays which form the unit of access. Regular partitioning, where all partitions have the same size (except possibly at the boundaries), is referred to as ''chunking''. A generalization which removes the restriction to equally sized partitions by supporting any kind of partitioning is ''tiling''. Array partitioning can improve access to array subsets significantly: by adjusting the tiling to the access pattern, the server ideally can fetch all required data with only one disk access.

Compression of tiles can sometimes substantially reduce the amount of storage needed. Compression is also useful for the transmission of results, as network bandwidth often constitutes a limiting factor for the large amounts of data under consideration.
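
The effect of partitioning on subset access can be sketched in a few lines of Python. The function name `tiles_for_subset` and the tile sizes are assumptions chosen for illustration; the sketch covers only regular chunking with an origin at zero, not general tiling.

```python
from itertools import product

def tiles_for_subset(subset, tile_shape):
    """Return the indices of the regular tiles (chunks) that a subset touches.

    subset: inclusive (lo, hi) pairs per dimension, with lo >= 0.
    tile_shape: tile edge length per dimension.
    (Hypothetical helper; assumes tiles are aligned at coordinate 0.)
    """
    per_axis = [range(lo // t, hi // t + 1)
                for (lo, hi), t in zip(subset, tile_shape)]
    return list(product(*per_axis))

query = [(10, 40), (20, 50)]                      # a 2-D cutout request
small_tiles = tiles_for_subset(query, (32, 32))   # query straddles tile borders
large_tiles = tiles_for_subset(query, (64, 64))   # query fits into one tile
```

With 32 × 32 chunks the cutout touches four tiles, while with 64 × 64 chunks a single tile suffices, at the price of reading more cells that are not needed. Choosing the partitioning to match the expected access pattern is exactly this trade-off.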


Query processing

A tile-based storage structure suggests a tile-by-tile processing strategy (in rasdaman called ''tile streaming''). A large class of practically relevant queries can be evaluated by loading tile after tile, thereby allowing servers to process arrays orders of magnitude larger than their main memory.

Due to the massive sizes of arrays in scientific/technical applications, in combination with often complex queries, optimization plays a central role in making array queries efficient. Both hardware and software parallelization can be applied. An example of heuristic optimization is the rule "averaging over an array resulting from the cell-wise addition of two input images is equivalent to adding the averages of each input array". By replacing the left-hand variant with the right-hand expression, costs shrink from three (costly) array traversals to two array traversals plus one (cheap) scalar operation (see Figure, which uses the SQL/MDA query standard).
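
Both ideas, tile-by-tile evaluation and algebraic rewriting, can be illustrated with a small Python sketch. It uses the average as the aggregate, for which the rewrite holds exactly because averaging is linear. Tiles are modelled as plain lists, and `stream_sum_count` and `avg_cells` are helper names written for this example, not calls into any real system.

```python
def stream_sum_count(tiles):
    """Fold over tiles one at a time, keeping only a running sum and count."""
    total, count = 0, 0
    for tile in tiles:        # each tile is a flat list of cell values
        total += sum(tile)
        count += len(tile)
    return total, count

def avg_cells(tiles):
    """Average over all cells, computed without holding all tiles at once."""
    total, count = stream_sum_count(tiles)
    return total / count

# Two input arrays, each split into two tiles of four cells.
a_tiles = [[1, 2, 3, 4], [5, 6, 7, 8]]
b_tiles = [[8, 6, 4, 2], [1, 3, 5, 7]]

# Unoptimized plan: build the cell-wise sum (one traversal of each input
# plus one traversal of the intermediate result), then average it.
sum_tiles = [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a_tiles, b_tiles)]
lhs = avg_cells(sum_tiles)

# Rewritten plan: average each input separately, then add two scalars.
rhs = avg_cells(a_tiles) + avg_cells(b_tiles)
```

Here `lhs` and `rhs` agree, while the rewritten plan never materializes the intermediate array. Because `tiles` may be any iterable, the same functions would also work with a generator that loads one tile at a time.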


Application domains

In many, if not most, cases where some phenomenon is sampled or simulated, the result is a rasterized data set which can conveniently be stored, retrieved, and forwarded as an array. Typically, the array data are ornamented with metadata describing them further; for example, geographically referenced imagery will carry its geographic position and the coordinate reference system in which it is expressed. The following are representative domains in which large-scale multi-dimensional array data are handled:

* Earth sciences: geodesy / mapping, remote sensing, geology, oceanography, hydrology, atmospheric sciences, cryospheric sciences
* Space sciences: planetary sciences, astrophysics (optical and radio telescope observations, cosmological simulations)
* Life sciences: gene data, confocal microscopy, CAT scans
* Social sciences: statistical data cubes
* Business: OLAP, data warehousing

These are but examples; generally, arrays frequently represent sensor, simulation, image, and statistics data. More and more spatial and time dimensions are combined with ''abstract'' axes, such as sales and products; one example where such abstract axes are explicitly foreseen is the Open Geospatial Consortium (OGC) coverage model.


Standardization

Many communities have established data exchange formats, such as HDF, NetCDF, and TIFF. A de facto standard in the Earth science communities is OPeNDAP, a data transport architecture and protocol. While this is not a database specification, it offers important components that characterize a database system, such as a conceptual model and client/server implementations. A declarative geo raster query language, Web Coverage Processing Service (WCPS), has been standardized by the Open Geospatial Consortium (OGC). In June 2014, ISO/IEC JTC1 SC32 WG3, which maintains the SQL database standard, decided to add multi-dimensional array support to SQL as a new column type (Chirgwin, R., "SQL fights back against NoSQL's big data cred with SQL/MDA spec", The Register, 26 Jun 2014), based on the initial array support available since the 2003 version of SQL. The new standard, adopted in Fall 2018, is named ''ISO 9075 SQL Part 15: MDA (Multi-Dimensional Arrays)''.


List of array DBMSs

* Oracle GeoRaster
* MonetDB/SciQL
* PostGIS
* rasdaman
* SciDB


See also

* Data Intensive Computing

