RCFile
   HOME

TheInfoList



OR:

Within
computing Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, e ...
database management systems In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases span ...
, the RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on
computer clusters A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The comp ...
. It is designed for systems using the
MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filtering ...
framework. The RCFile structure includes a data storage format, data compression approach, and optimization techniques for data reading. It is able to meet all the four requirements of data placement: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) a strong adaptivity to dynamic data access patterns. RCFile is the result of research and collaborative efforts from
Facebook Facebook is an online social media and social networking service owned by American company Meta Platforms. Founded in 2004 by Mark Zuckerberg with fellow Harvard College students and roommates Eduardo Saverin, Andrew McCollum, Dustin M ...
,
The Ohio State University The Ohio State University, commonly called Ohio State or OSU, is a public land-grant research university in Columbus, Ohio. A member of the University System of Ohio, it has been ranked by major institutional rankings among the best public ...
, and the Institute of Computing Technology at the
Chinese Academy of Sciences The Chinese Academy of Sciences (CAS); ), known by Academia Sinica in English until the 1980s, is the national academy of the People's Republic of China for natural sciences. It has historical origins in the Academia Sinica during the Republ ...
.


Summary


Data storage format

For example, a table in a database consists of 4 columns (c1 to c4): To serialize the table, RCFile partitions this table first horizontally and then vertically, instead of only partitioning the table horizontally like the row-oriented DBMS (row-store). The horizontal partitioning will first partition the table into multiple row groups based on the
row-group In tables and matrices, a column group or row group usually refers to a subset of columns or rows, respectively. Short names or notational names include col group or colgroup, and row group or rowgroup. They can have varying uses depending on con ...
size, which is a user-specified value determining the size of each row group. For example, the table mentioned above can be partitioned to two row groups if the user specifies three rows as the size of each row group. Then, in every row group, RCFile partitions the data vertically like column-store. Thus, the table will be serialized as: Row Group 1 Row Group 2 11, 21, 31; 41, 51; 12, 22, 32; 42, 52; 13, 23, 33; 43, 53; 14, 24, 34; 44, 54;


Column data compression

Within each row group, columns are compressed to reduce storage space usage. Since data of a column are stored adjacently, the pattern of a column can be detected and thus the suitable compression algorithm can be selected for a high compression ratio.


Performance Benefits

Column-store is more efficient when a query only requires a subset of columns, because column-store only read necessary columns from disks but row-store will read an entire row. RCFile combines merits of row-store and column-store via horizontal-vertical partitioning. With horizontal partitioning, RCFile places all columns of a row in a single machine and thus can eliminate the extra network costs when constructing a row. With vertical partitioning, for a query, RCFile will only read necessary columns from disks and thus can eliminate the unnecessary local I/O costs. Moreover, in every row group, data compression can be done by using compression algorithms used in column-store. For example, a database might have this table: This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname) and a salary (Salary). This two-dimensional format exists only in theory, in practice, storage hardware requires the data to be serialized into one form or another. In MapReduce-based systems, data is normally stored on a distributed system, such as Hadoop Distributed File System (HDFS), and different data blocks might be stored in different machines. Thus, for column-store on MapReduce, different groups of columns might be stored on different machines, which introduces extra network costs when a query projects columns placed on different machines. For MapReduce-based systems, the merit of row-store is that there is no extra network costs to construct a row in query processing, and the merit of column-store is that there is no unnecessary local I/O costs when read data from disks.


Row-oriented systems

The common solution to the storage problem is to serialize each row of data, like this; 001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000; Row-based systems are designed to efficiently return data for an entire row, or an entire record, in as few operations as possible. This matches use-cases where the system is attempting to retrieve all the information about a particular object, say the full information about one contact in a
rolodex A Rolodex is a rotating card file device used to store business contact information. Its name, a portmanteau of the words ''rolling'' and ''index'', has become somewhat genericized (usually as ''rolodex'') for any personal organizer performing thi ...
system, or the complete information about one product in an online shopping system. Row-based systems are not efficient at performing operations that apply to the entire data set, as opposed to a specific record. For instance, in order to find all the records in the example table that have salaries between 40,000 and 50,000, the row-based system would have to seek through the entire data set looking for matching records. While the example table shown above may fit in a single disk block, a table with even a few hundred rows would not, therefore multiple disk operations would be needed to retrieve the data.


Column-oriented systems

A column-oriented system serializes all of the values of a column together, then the values of the next column. For our example table, the data would be stored in this fashion; 10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;Joe:001,Mary:002,Cathy:003,Bob:004;40000:001,50000:002,44000:003,55000:004; The difference can be more clearly seen in this common modification: ...;Smith:001,Jones:002,004,Johnson:003;... Two of the records store the same value, "Jones", therefore it is now possible to store this in the column-oriented system only once instead of twice. For many common searches, like "find all the people with the last name Jones", the answer can now be retrieved in a single operation. Whether or not a column-oriented system will be more efficient in operation depends heavily on the operations being automated. Operations that retrieve data for objects would be slower, requiring numerous disk operations to assemble data from different columns to build up a whole-row record. However, such whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. In a rolodex application, for instance, operations collecting the first names and last names from many rows in order to build a list of contacts is far more common than operations reading the data for home address.


Adoption

RCFile has been adopted in real-world systems for big data analytics. # RCFile became the default data placement structure in Facebook's production Hadoop cluster. By 2010 it was the world's largest Hadoop cluster, where 40 terabytes compressed data sets are added every day. In addition, all the data sets stored in HDFS before RCFile have also been transformed to use RCFile . # RCFile has been adopted in
Apache Hive Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditi ...
(since v0.4), which is an open source data store system running on top of Hadoop and is being widely used in various companies around the world, including several Internet services, such as
Facebook Facebook is an online social media and social networking service owned by American company Meta Platforms. Founded in 2004 by Mark Zuckerberg with fellow Harvard College students and roommates Eduardo Saverin, Andrew McCollum, Dustin M ...
,
Taobao Taobao () is a Chinese online shopping platform. It is headquartered in Hangzhou and is owned by Alibaba. According to Alexa rank, it is the eighth most-visited website globally in 2021. Taobao.com was registered on April 21, 2003 by Alibaba Cl ...
, and
Netflix Netflix, Inc. is an American subscription video on-demand over-the-top streaming service and production company based in Los Gatos, California. Founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California, it offers a fil ...
. # RCFile has been adopted in
Apache Pig Apache Pig is a high-level platform for creating programs that run on Hadoop, Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the ...
(since v0.7), which is another open source data processing system being widely used in many organizations, including several major Web service providers, such as
Twitter Twitter is an online social media and social networking service owned and operated by American company Twitter, Inc., on which users post and interact with 280-character-long messages known as "tweets". Registered users can post, like, and ...
,
Yahoo Yahoo! (, styled yahoo''!'' in its logo) is an American web services provider. It is headquartered in Sunnyvale, California and operated by the namesake company Yahoo! Inc. (2017–present), Yahoo Inc., which is 90% owned by investment funds ma ...
,
LinkedIn LinkedIn () is an American business and employment-oriented online service that operates via websites and mobile apps. Launched on May 5, 2003, the platform is primarily used for professional networking and career development, and allows job se ...
,
AOL AOL (stylized as Aol., formerly a company known as AOL Inc. and originally known as America Online) is an American web portal and online service provider based in New York City. It is a brand marketed by the current incarnation of Yahoo (2017â ...
, and
Salesforce.com Salesforce, Inc. is an American cloud-based software company headquartered in San Francisco, California. It provides customer relationship management (CRM) software and applications focused on sales, customer service, marketing automation, an ...
. # RCFile became the ''de facto'' standard data storage structure in Hadoop software environment supported by the
Apache HCatalog The Apache () are a group of culturally related Native American tribes in the Southwestern United States, which include the Chiricahua, Jicarilla, Lipan, Mescalero, Mimbreño, Ndendahe (Bedonkohe or Mogollon and Nednhi or Carrizaleño and ...
project (formerly known as Howl) that is the table and storage management service for Hadoop. RCFile is supported by the open source Elephant Bird library used in Twitter for daily data analytics. Over the following years, other Hadoop data formats also became popular. In February 2013, an Optimized Row Columnar (ORC) file format was announced by
Hortonworks Hortonworks was a data software company based in Santa Clara, California that developed and supported open-source software (primarily around Apache Hadoop) designed to manage big data and associated processing. Hortonworks software was used to b ...
. A month later, the
Apache Parquet Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing f ...
format was announced, developed by
Cloudera Cloudera, Inc. is an American software company providing enterprise data management systems that make significant use of Apache Hadoop. As of January 31, 2021, the company had approximately 1,800 customers. History Cloudera, Inc. was formed on J ...
and
Twitter Twitter is an online social media and social networking service owned and operated by American company Twitter, Inc., on which users post and interact with 280-character-long messages known as "tweets". Registered users can post, like, and ...
.


See also

*
Column (data store) A column of a distributed data store is a NoSQL object of the lowest level in a keyspace. It is a tuple (a key–value pair) consisting of three elements: * Unique name: Used to reference the column * Value: The content of the column. It can ha ...
*
Column-oriented DBMS A column-oriented DBMS or columnar DBMS is a database management system (DBMS) that stores data tables by column rather than by row. Benefits include more efficient access to data when only querying a subset of columns (by eliminating the need to r ...
*
MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filtering ...
*
Apache Hadoop Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage ...
*
Apache Hive Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditi ...
*
Big data Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe Big data is the one associated with large body of information that we could not comprehend when used only in smaller am ...


References

{{Reflist


External links


RCFile on the Apache Software Foundation website

Source Code

Hive website

Hive page on Hadoop Wiki
Data analysis software