
Data orientation is the representation of tabular data in a linear memory model, such as on disk or in memory. The two most common representations are column-oriented (columnar format) and row-oriented (row format). The choice of data orientation is a trade-off and an architectural decision in databases, query engines, and numerical simulations. As a result of these tradeoffs, row-oriented formats are more commonly used in online transaction processing (OLTP) and column-oriented formats are more commonly used in online analytical processing (OLAP). Examples of column-oriented formats include Apache ORC, Apache Parquet, Apache Arrow, and the formats used by BigQuery, Amazon Redshift, and Snowflake. Predominant examples of row-oriented formats include CSV, the formats used in most relational databases, the in-memory format of Apache Spark, and Apache Avro.


Description

Tabular data is two-dimensional: it is modeled as rows and columns. However, computer systems represent data in a linear memory model, both on disk and in memory. Therefore, representing a table in a linear memory model requires mapping its two-dimensional scheme into a one-dimensional space. Data orientation refers to the decision taken in this mapping. There are two prominent mappings: row-oriented and column-oriented.
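The two mappings can be illustrated with a minimal sketch. The table contents and the variable names (`table`, `row_major`, `col_major`) are made up for illustration, and a Python list stands in for linear memory; real formats add headers, typing, and encoding on top of this idea.

```python
# A small 3x2 table: each inner list is one row with columns (id, score).
table = [
    [1, 10],
    [2, 20],
    [3, 30],
]
n_rows, n_cols = len(table), len(table[0])

# Row-oriented mapping: rows are laid out one after the other.
row_major = [table[r][c] for r in range(n_rows) for c in range(n_cols)]

# Column-oriented mapping: columns are laid out one after the other.
col_major = [table[r][c] for c in range(n_cols) for r in range(n_rows)]

print(row_major)  # [1, 10, 2, 20, 3, 30]
print(col_major)  # [1, 2, 3, 10, 20, 30]
```

Both lists hold exactly the same values; only their order in the one-dimensional space differs.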


Row-oriented

In a row-oriented database, also known as a rowstore, the elements of the table are stored linearly row by row, i.e. each row of the table is located one after the other. In this orientation, values in the same row are close in space (e.g. at similar addresses in an addressable space).


Examples

* CSV
* Postgres on-disk and in-memory formats
* Apache Spark in-memory format
* Apache Avro


Column-oriented

In a column-oriented database, also known as a columnstore, the elements of the table are stored linearly column by column, i.e. each column of the table is located one after the other. In this orientation, values in the same column are close in space (e.g. at similar addresses in an addressable space).


Examples

* BigQuery's in-memory and storage formats
* Apache Parquet
* Apache ORC
* Apache Arrow
* DuckDB in-memory format
* Pandas in-memory format
* R dataframes

See list of column-oriented DBMSes for more examples.


Tradeoff

The data orientation is an important architectural decision of systems handling data because it results in important tradeoffs in performance and storage. Below are selected dimensions of this tradeoff.


Random access

Row-oriented formats benefit from fast random access to rows. Column-oriented formats benefit from fast random access to columns. In both cases, this is the result of fewer page or cache misses when accessing the data.
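The locality argument can be made concrete by listing which linear offsets are touched when reading one row or one column under each layout. This is only a sketch: the table shape and the helper names `offsets_for_row` and `offsets_for_column` are hypothetical, and offsets count elements rather than bytes.

```python
n_rows, n_cols = 4, 3  # assumed table shape for illustration

def offsets_for_row(r, layout):
    """Linear offsets of all elements of row r under the given layout."""
    if layout == "row":
        return [r * n_cols + c for c in range(n_cols)]
    return [c * n_rows + r for c in range(n_cols)]

def offsets_for_column(c, layout):
    """Linear offsets of all elements of column c under the given layout."""
    if layout == "row":
        return [r * n_cols + c for r in range(n_rows)]
    return [c * n_rows + r for r in range(n_rows)]

print(offsets_for_row(1, "row"))        # [3, 4, 5]      contiguous
print(offsets_for_row(1, "column"))     # [1, 5, 9]      strided
print(offsets_for_column(1, "row"))     # [1, 4, 7, 10]  strided
print(offsets_for_column(1, "column"))  # [4, 5, 6, 7]   contiguous
```

Contiguous offsets tend to fall in the same page or cache line, whereas strided offsets spread across many of them as the table grows.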


Insert

Row-oriented formats benefit from fast insertion of a new row. Column-oriented formats benefit from fast insertion of a new column. This dimension is an important reason why row-oriented formats are more commonly used in online transaction processing (OLTP), as it results in faster transactions in comparison to column-oriented formats.
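A minimal sketch of the insertion asymmetry, using plain Python containers as stand-ins for the two layouts. The table (columns `id`, `name`, `age`, later `active`) and the helper `insert_row` are hypothetical.

```python
# Row-oriented stand-in: a list of row tuples (id, name, age).
rows = [(1, "alice", 30), (2, "bob", 25)]

# Column-oriented stand-in: one list per column.
columns = {"id": [1, 2], "name": ["alice", "bob"], "age": [30, 25]}

# Inserting a row: a single append in the row layout...
rows.append((3, "carol", 41))

# ...but one append per column in the column layout.
def insert_row(columns, record):
    for name, value in record.items():
        columns[name].append(value)

insert_row(columns, {"id": 3, "name": "carol", "age": 41})

# Inserting a column: a single new list in the column layout...
columns["active"] = [True, False, True]

# ...but every existing row must be rewritten in the row layout.
rows = [row + (flag,) for row, flag in zip(rows, columns["active"])]

print(rows[2])            # (3, 'carol', 41, True)
print(columns["active"])  # [True, False, True]
```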


Conditional access

Row-oriented formats benefit from fast access under a filter (selecting a subset of rows). Column-oriented formats benefit from fast access under a projection (selecting a subset of columns).
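The following sketch contrasts the two operations on both layouts. The table and the `age > 28` predicate are invented for illustration, and the filter is assumed to return whole records.

```python
# Hypothetical table with columns (id, name, age) in both layouts.
rows = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]
columns = {"id": [1, 2, 3], "name": ["alice", "bob", "carol"], "age": [30, 25, 41]}

# Filter: keep whole records where age > 28.
# Row layout: each matching record is already stored together.
filtered_from_rows = [row for row in rows if row[2] > 28]
# Column layout: find matching positions, then gather from every column.
matches = [i for i, age in enumerate(columns["age"]) if age > 28]
filtered_from_columns = [tuple(columns[name][i] for name in columns) for i in matches]

# Projection: keep only the "name" column.
# Column layout: the column is already stored together.
names_from_columns = columns["name"]
# Row layout: every row must be visited to pick out one field.
names_from_rows = [row[1] for row in rows]

print(filtered_from_rows)  # [(1, 'alice', 30), (3, 'carol', 41)]
print(names_from_columns)  # ['alice', 'bob', 'carol']
```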


Compute performance

Column-oriented formats benefit from fast analytics operations. This is the result of being able to leverage SIMD instructions, since values of the same column are contiguous in memory.


Uncompressed size

Column-oriented formats benefit from smaller uncompressed size. This is because this orientation makes it possible to represent certain data types with dedicated encodings. For example, a table of 128 rows with a Boolean column requires 128 bytes in a row-oriented format (one byte per Boolean) but 128 bits (16 bytes) in a column-oriented format (via a bitmap). Another example is the use of run-length encoding to encode a column.
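Both encodings can be sketched in a few lines. The helpers `pack_bits`, `get_bit`, and `run_length_encode` are illustrative names, not the API of any particular format, and the bit order within each byte (least significant bit first) is an arbitrary choice made here.

```python
# A Boolean column of 128 values (arbitrary pattern for illustration).
flags = [i % 3 == 0 for i in range(128)]

# One byte per Boolean, as when each value is stored inside its row.
one_byte_each = bytes(int(f) for f in flags)

def pack_bits(values):
    """Pack Booleans into a bitmap, least significant bit first."""
    packed = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)

def get_bit(packed, i):
    """Read back the i-th Boolean from the bitmap."""
    return bool((packed[i // 8] >> (i % 8)) & 1)

bitmap = pack_bits(flags)
print(len(one_byte_each), len(bitmap))  # 128 16

def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

print(run_length_encode(["US", "US", "US", "FR", "FR", "US"]))
# [('US', 3), ('FR', 2), ('US', 1)]
```

Run-length encoding only pays off when a column contains long runs of equal values, for instance after sorting.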


Compressed size

Column-oriented formats benefit from smaller compressed size. This is the result of higher homogeneity within a column than across multiple rows.
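One way to explore this effect is to serialize the same table in both orders and compress each with a general-purpose compressor. The sketch below uses Python's standard `zlib` module on a synthetic table. The text serialization is an assumption made for illustration, and the resulting sizes depend on the data and the compressor, so no particular outcome is asserted here.

```python
import zlib

# Synthetic table with columns (id, country, flag).
rows = [(i, "US" if i % 2 else "FR", i % 7 == 0) for i in range(1000)]
columns = list(zip(*rows))

# Row order: the values of one row are adjacent in the byte stream.
row_bytes = "\n".join(",".join(map(str, r)) for r in rows).encode()

# Column order: the values of one column are adjacent in the byte stream.
col_bytes = "\n".join(",".join(map(str, c)) for c in columns).encode()

row_compressed = zlib.compress(row_bytes)
col_compressed = zlib.compress(col_bytes)

print("uncompressed:", len(row_bytes), len(col_bytes))
print("compressed:  ", len(row_compressed), len(col_compressed))
```

Both byte strings contain the same values and the same number of separator characters, so any difference in compressed size comes from the ordering alone.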


Conversion and interchange

Because both orientations represent the same data, it is possible to convert a row-oriented dataset to a column-oriented dataset and vice versa at the expense of compute. In particular, advanced query engines often leverage each orientation's advantages, and convert from one orientation to the other as part of their execution. As an example, an Apache Spark query may:
# read data from Apache Parquet (column-oriented)
# load it into Spark's internal in-memory format (row-oriented)
# convert it to Apache Arrow for a specific computation (column-oriented)
# write it to Apache Avro for streaming (row-oriented)
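The conversion itself is a transposition. The sketch below shows it on plain Python containers; the helpers `rows_to_columns` and `columns_to_rows` are hypothetical and do not correspond to the APIs of Spark, Parquet, Arrow, or Avro.

```python
def rows_to_columns(rows, names):
    """Transpose a list of row tuples into a dict of column lists."""
    return {name: list(col) for name, col in zip(names, zip(*rows))}

def columns_to_rows(columns):
    """Transpose a dict of column lists back into a list of row tuples."""
    return list(zip(*columns.values()))

rows = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]
columns = rows_to_columns(rows, ["id", "name", "age"])

print(columns["age"])            # [30, 25, 41]
print(columns_to_rows(columns))  # [(1, 'alice', 30), (2, 'bob', 25), (3, 'carol', 41)]
```

Each conversion visits every value once, which is the compute cost mentioned above.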

