
Data orientation refers to how tabular data is represented in a linear memory model, such as on disk or in memory. The two most common representations are column-oriented (columnar format) and row-oriented (row format). The choice of data orientation is a tradeoff and an architectural decision in databases, query engines, and numerical simulations. As a result of these tradeoffs, row-oriented formats are more commonly used in online transaction processing (OLTP) and column-oriented formats are more commonly used in online analytical processing (OLAP). Examples of column-oriented formats include Apache ORC, Apache Parquet, Apache Arrow, and the formats used by BigQuery, Amazon Redshift and Snowflake. Predominant examples of row-oriented formats include CSV, the formats used in most relational databases, the in-memory format of Apache Spark, and Apache Avro.


Description

Tabular data is two-dimensional in nature: data is represented in rows and columns. However, modern operating systems logically represent data in a linear memory model, both on disk and in memory. Therefore, storing a table in a linear memory model requires projecting its two-dimensional items onto a one-dimensional space. Data orientation refers to the decision taken in this projection. There are two prominent choices of orientation: row-oriented and column-oriented.


Row-oriented

In row-oriented, the elements of the table are stored linearly row by row, i.e. each row of the table is located one after the other. In this orientation, values in the same row are close in space (e.g. at similar addresses in an addressable space).
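
As a minimal sketch of this layout, a small table can be flattened row by row in plain Python. The table contents and the helper name `row_major_offset` are illustrative assumptions and do not come from any particular system.

```python
# Hypothetical 3x2 table: each inner list is one row (id, score).
table = [
    [1, 90],
    [2, 85],
    [3, 70],
]
n_rows, n_cols = len(table), len(table[0])

# Row-oriented layout: rows are written one after the other.
row_major = [value for row in table for value in row]


def row_major_offset(r: int, c: int) -> int:
    """Offset of element (r, c) in the flattened row-oriented buffer."""
    return r * n_cols + c


print(row_major)  # [1, 90, 2, 85, 3, 70]
```

The two values of any given row end up next to each other in `row_major`, which is the locality property described above.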


Examples

* CSV
* Postgres on-disk and in-memory formats
* Apache Spark in-memory format
* Apache Avro


Column-oriented

In column-oriented, the elements of the table are stored linearly column by column, i.e. each column of the table is located one after the other. In this orientation, values in the same column are close in space (e.g. at similar addresses in an addressable space).
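
As a minimal sketch of this layout, a small table can be flattened column by column in plain Python. The table contents and the helper name `col_major_offset` are illustrative assumptions and do not come from any particular system.

```python
# Hypothetical 3x2 table: each inner list is one row (id, score).
table = [
    [1, 90],
    [2, 85],
    [3, 70],
]
n_rows, n_cols = len(table), len(table[0])

# Column-oriented layout: columns are written one after the other.
col_major = [table[r][c] for c in range(n_cols) for r in range(n_rows)]


def col_major_offset(r: int, c: int) -> int:
    """Offset of element (r, c) in the flattened column-oriented buffer."""
    return c * n_rows + r


print(col_major)  # [1, 2, 3, 90, 85, 70]
```

All values of the `id` column come first, followed by all values of the `score` column, so values of the same column are adjacent in `col_major`.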


Examples

* BigQuery's in-memory and storage formats
* Apache Parquet
* Apache ORC
* Apache Arrow
* DuckDB in-memory format
* Pandas in-memory format

See list of column-oriented DBMSes for more examples.


Tradeoff

Data orientation is an important architectural decision for systems handling data because it results in important tradeoffs in performance and storage. Below are selected dimensions of this tradeoff.


Random access

Row-oriented benefits from fast random access to rows. Column-oriented benefits from fast random access to columns. In both cases, this is the result of fewer page or cache misses when accessing the data.
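
The effect can be illustrated by listing which buffer offsets are touched when reading one row or one column under each layout. This is a sketch under assumptions: the table shape and the helper names are hypothetical, and real systems also depend on element sizes, page sizes and cache-line sizes.

```python
n_rows, n_cols = 4, 3  # hypothetical table shape


def row_major_offset(r: int, c: int) -> int:
    """Offset of element (r, c) when rows are stored one after the other."""
    return r * n_cols + c


def col_major_offset(r: int, c: int) -> int:
    """Offset of element (r, c) when columns are stored one after the other."""
    return c * n_rows + r


# Offsets touched when reading all of row 1.
row_read_in_row_major = [row_major_offset(1, c) for c in range(n_cols)]
row_read_in_col_major = [col_major_offset(1, c) for c in range(n_cols)]

# Offsets touched when reading all of column 2.
col_read_in_row_major = [row_major_offset(r, 2) for r in range(n_rows)]
col_read_in_col_major = [col_major_offset(r, 2) for r in range(n_rows)]

print(row_read_in_row_major)  # [3, 4, 5]      contiguous
print(row_read_in_col_major)  # [1, 5, 9]      strided by n_rows
print(col_read_in_row_major)  # [2, 5, 8, 11]  strided by n_cols
print(col_read_in_col_major)  # [8, 9, 10, 11] contiguous
```

Contiguous offsets tend to fall in the same page or cache line, while strided offsets are spread across the buffer.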


Insert

Row-oriented benefits from fast insertion of a new row. Column-oriented benefits from fast insertion of a new column. This dimension is an important reason why row-oriented formats are more commonly used in online transaction processing (OLTP), as it results in faster transactions than with column-oriented formats.
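
A rough model of this asymmetry, assuming a row-oriented table is held as a list of row tuples and a column-oriented table as one list per column (both structures and all names are illustrative only):

```python
# Row-oriented: a list of row tuples. Column-oriented: one list per column.
rows = [(1, "a"), (2, "b")]
columns = {"id": [1, 2], "tag": ["a", "b"]}

# Inserting a row: a single append, versus one append per column.
new_row = (3, "c")
rows.append(new_row)
for name, value in zip(columns, new_row):
    columns[name].append(value)

# Inserting a column: every row is rewritten, versus one new list.
new_flags = [True, False, True]
rows = [row + (flag,) for row, flag in zip(rows, new_flags)]
columns["flag"] = new_flags

print(rows)     # [(1, 'a', True), (2, 'b', False), (3, 'c', True)]
print(columns)  # {'id': [1, 2, 3], 'tag': ['a', 'b', 'c'], 'flag': [True, False, True]}
```

In the row-oriented structure a new row is one write at the end, whereas the column-oriented structure touches as many separate buffers as there are columns. The situation is reversed when a column is added.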


Conditional access

Row-oriented benefits from fast access under a filter (selecting a subset of rows). Column-oriented benefits from fast access under a projection (selecting a subset of columns).


Compute performance

Column-oriented benefits from fast analytics operations. This is the result of being able to leverage SIMD instructions, since values of the same column are contiguous in memory and share the same type.


Uncompressed size

Column-oriented benefits from smaller uncompressed size. This is because the orientation makes it possible to represent certain data types with dedicated encodings. For example, a table of 128 rows with a boolean column requires 128 bytes in a row-oriented format (one byte per boolean) but 128 bits (16 bytes) in a column-oriented format (via a bitmap). Another example is the use of run-length encoding to encode a column.
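
Both encodings can be sketched with the Python standard library. The column contents, the bit order (least significant bit first) and the helper names `get_bit` and `run_length_encode` are assumptions made for illustration; actual formats define their own conventions.

```python
from itertools import groupby

# Hypothetical boolean column of 128 values.
flags = [i % 3 == 0 for i in range(128)]

# One byte per boolean, as in the row-oriented case described above.
one_byte_each = bytes(flags)

# Bitmap: pack 8 booleans into each byte (least significant bit first).
bitmap = bytearray((len(flags) + 7) // 8)
for i, flag in enumerate(flags):
    if flag:
        bitmap[i // 8] |= 1 << (i % 8)


def get_bit(bits: bytearray, i: int) -> bool:
    """Read back the i-th boolean from the bitmap."""
    return bool((bits[i // 8] >> (i % 8)) & 1)


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical column with long runs of repeated values.
status = ["open"] * 5 + ["closed"] * 3 + ["open"] * 2

print(len(one_byte_each), len(bitmap))  # 128 16
print(run_length_encode(status))  # [('open', 5), ('closed', 3), ('open', 2)]
```

Both encodings rely on the values of one column being stored together, which is what a column-oriented layout provides.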


Compressed size

Column-oriented benefits from smaller compressed size. This is the result of a higher homogeneity within a column than within multiple rows.


Conversion and interchange

Because both orientations represent the same data, it is possible to convert a row-oriented dataset to a column-oriented dataset and vice versa at the expense of compute. In particular, advanced query engines often leverage each orientation's advantages, and convert from one orientation to the other as part of their execution. As an example, an Apache Spark query may:
# read data from Apache Parquet (column-oriented)
# load it into the Spark internal in-memory format (row-oriented)
# convert it to Apache Arrow for a specific computation (column-oriented)
# write it to Apache Avro for streaming (row-oriented)
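
The conversion itself amounts to a transposition. The following sketch uses plain Python structures rather than the APIs of Spark, Parquet, Arrow or Avro; the function names `rows_to_columns` and `columns_to_rows` and the sample data are hypothetical.

```python
# Hypothetical row-oriented dataset: one tuple per row.
rows = [
    (1, "a", True),
    (2, "b", False),
    (3, "c", True),
]
names = ("id", "tag", "flag")


def rows_to_columns(rows, names):
    """Transpose a list of row tuples into one list per named column."""
    return {name: list(values) for name, values in zip(names, zip(*rows))}


def columns_to_rows(columns):
    """Transpose a dict of equal-length column lists back into row tuples."""
    return list(zip(*columns.values()))


columns = rows_to_columns(rows, names)
print(columns)  # {'id': [1, 2, 3], 'tag': ['a', 'b', 'c'], 'flag': [True, False, True]}
print(columns_to_rows(columns) == rows)  # True
```

Every value is read and written once in each direction, which is the compute cost mentioned above. Note that in this sketch an empty list of rows yields an empty dictionary, so column names are not preserved in that case.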

