Comma Delimited

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.

The CSV file format is not fully standardized. Separating fields with commas is the foundation, but commas in the data or embedded line breaks have to be handled specially. Some implementations disallow such content, while others surround the field with quotation marks, which in turn creates the need for escaping if quotation marks are present in the data.

The term "CSV" also denotes several closely related delimiter-separated formats that use other field delimiters such as semicolons. These include tab-separated values and space-separated values. A delimiter guaranteed not to be part of the data greatly simplifies parsing. Alternative delimiter-separated files are often given a ".csv" extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character. Semicolons are often used instead of commas in many European locales so that the comma can serve as the decimal separator and, possibly, the period as a digit grouping character.
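
The need for special handling of commas inside the data can be illustrated with a short sketch. It assumes Python's standard csv module as the parser and reuses one record from this article's example; the variable names are illustrative only.

```python
import csv
import io

# One record whose fourth field contains commas and is therefore quoted.
line = '1997,Ford,E350,"ac, abs, moon",3000.00'

# Naive splitting treats the commas inside the quoted field as separators.
naive_fields = line.split(",")

# csv.reader follows the quoting convention and keeps the field whole.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields))   # 7
print(len(parsed_fields))  # 5
print(parsed_fields[3])    # ac, abs, moon
```

The naive split yields seven pieces because it cannot tell a separator from a comma that belongs to the data, which is the problem a quoting convention is meant to solve.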


Data exchange

CSV is a common data exchange format that is widely supported by consumer, business, and scientific applications. Among its most common uses is moving tabular data between programs that natively operate on incompatible (often proprietary) formats.

The 2005 technical standard RFC 4180 formalizes the CSV file format and defines the MIME type "text/csv" for the handling of text-based fields. However, the interpretation of the text of each field is still application-specific. Files that follow the RFC 4180 standard can simplify CSV exchange and should be widely portable. Among its requirements:

* MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).
* An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
* Each record should contain the same number of comma-separated fields.
* Any field may be quoted (with double quotes).
* Fields containing a line break, double quote, or comma should be quoted. (If they are not, the file will likely be impossible to process correctly.)
* If double quotes are used to enclose fields, then a double quote in a field must be represented by two double-quote characters.

The format can be processed by most programs that claim to read CSV files. The exceptions are that (a) programs may not support line breaks within quoted fields, (b) programs may confuse the optional header with data or interpret the first data line as an optional header, and (c) double quotes in a field may not be parsed correctly automatically.

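The RFC 4180 requirements summarized above (CR/LF line endings, quoting of fields that contain commas, line breaks, or double quotes, and doubling of embedded double quotes) can be sketched with Python's standard csv module. This is a minimal illustration under those assumptions, not a conformance check; the sample rows are made up for the example.

```python
import csv
import io

# Sample data (hypothetical): one field has embedded double quotes,
# another has an embedded line break and commas.
rows = [
    ["Year", "Make", "Model", "Description"],
    ["1999", "Chevy", 'Venture "Extended Edition"', ""],
    ["1996", "Jeep", "Grand Cherokee", "MUST SELL!\nair, moon roof, loaded"],
]

buffer = io.StringIO(newline="")  # newline="" keeps line endings untouched
writer = csv.writer(
    buffer,
    delimiter=",",
    quotechar='"',
    doublequote=True,            # a quote inside a field is written as ""
    quoting=csv.QUOTE_MINIMAL,   # quote only the fields that need it
    lineterminator="\r\n",       # CR/LF record terminator
)
writer.writerows(rows)
rfc_text = buffer.getvalue()

# Reading the text back should reproduce the original rows.
round_trip = list(csv.reader(io.StringIO(rfc_text, newline="")))
print(round_trip == rows)
```

Reading the output back with the same quoting rules is a simple way to confirm that quoted line breaks and doubled quotes survive the round trip.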

OKF frictionless tabular data package

In 2011, the Open Knowledge Foundation (OKF) and various partners created a data protocols working group, which later evolved into the Frictionless Data initiative. One of the main formats they released was the Tabular Data Package. The Tabular Data Package was heavily based on CSV, using it as the main data transport format and adding basic type and schema metadata (CSV lacks any type information to distinguish the string "1" from the number 1). The Frictionless Data initiative has also provided a standard CSV Dialect Description Format for describing different dialects of CSV, for example specifying the field separator or quoting rules.
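
The lack of type information in CSV can be seen in a small sketch. It assumes Python's standard csv module, made-up sample data, and a hypothetical in-code schema (a plain mapping from column name to converter); it does not reproduce the Tabular Data Package metadata format.

```python
import csv
import io

# Made-up sample data for illustration.
csv_text = "id,name,price\n1,Widget,3000.00\n2,Gadget,4900.00\n"

# Hypothetical schema: column name -> converter function.
schema = {"id": int, "name": str, "price": float}

# A CSV reader returns every field as a string, including "1".
raw_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Types only appear once external schema information is applied.
typed_rows = [
    {column: schema[column](value) for column, value in row.items()}
    for row in raw_rows
]

print(type(raw_rows[0]["id"]).__name__)    # str
print(type(typed_rows[0]["id"]).__name__)  # int
```

Keeping the schema outside the data file is the general idea behind adding metadata alongside CSV, although the real formats describe it in their own way.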


W3C tabular data standard

In 2013, the W3C "CSV on the Web" working group began to specify technologies providing higher interoperability for web applications using CSV or similar formats. The working group completed its work in February 2016 and was officially closed in March 2016 with the release of a set of documents and W3C recommendations for modeling "Tabular Data" and enhancing CSV with metadata and semantics.


Basic rules

Many informal documents exist that describe "CSV" formats. IETF RFC 4180 (summarized above) defines the format for the "text/csv" MIME type registered with the IANA. Rules typical of these and other "CSV" specifications and implementations are as follows:


Example

A table of data may be represented in CSV format as follows:

    Year,Make,Model,Description,Price
    1997,Ford,E350,"ac, abs, moon",3000.00
    1999,Chevy,"Venture ""Extended Edition""","",4900.00
    1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
    1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00

Example of a USA/UK CSV file (where the decimal separator is a period/full stop and the value separator is a comma):

    Year,Make,Model,Length
    1997,Ford,E350,2.35
    2000,Mercury,Cougar,2.38

Example of an analogous European CSV/DSV file (where the decimal separator is a comma and the value separator is a semicolon):

    Year;Make;Model;Length
    1997;Ford;E350;2,35
    2000;Mercury;Cougar;2,38

The latter format is not RFC 4180 compliant. Compliance could be achieved by using a comma instead of a semicolon as the separator, together with either the international notation for the decimal mark or the practice of quoting all numbers that have a decimal mark.
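
One way to carry out the conversion described above can be sketched as follows. It assumes Python's standard csv module, that "Length" is the only numeric column with a decimal mark, and that replacing the decimal comma with a period is acceptable for the data at hand.

```python
import csv
import io

# The semicolon-separated example from this section.
european = (
    "Year;Make;Model;Length\n"
    "1997;Ford;E350;2,35\n"
    "2000;Mercury;Cougar;2,38\n"
)

reader = csv.reader(io.StringIO(european), delimiter=";")
header = next(reader)
length_index = header.index("Length")

out = io.StringIO(newline="")
writer = csv.writer(out, delimiter=",", lineterminator="\r\n")
writer.writerow(header)
for record in reader:
    # Switch the decimal comma to a period so the comma can act as separator.
    record[length_index] = record[length_index].replace(",", ".")
    writer.writerow(record)

converted = out.getvalue()
print(converted)
```

The alternative mentioned in the text, quoting every number that has a decimal mark, would keep the decimal comma and rely on the quoting rules instead.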


Application support

Some applications use CSV as a data interchange format to enhance their interoperability, exporting and importing CSV. Others use CSV as an internal format.

As a data interchange format, the CSV file format is supported by almost all spreadsheets and database management systems:

* Spreadsheets including Apple Numbers, LibreOffice Calc, and Apache OpenOffice Calc. Microsoft Excel also supports a dialect of CSV with restrictions in comparison to other spreadsheet software (e.g., Excel still cannot export CSV files in the commonly used UTF-8 character encoding, and the separator is not enforced to be the comma). The LibreOffice Calc CSV importer is actually a more generic delimited text importer, supporting multiple separators at the same time as well as field trimming.
* Some relational databases can export and import CSV with a COPY command. For example, in PostgreSQL, COPY t TO 'file.csv' CSV and COPY t FROM 'file.csv' CSV are valid.
* Many utility programs on Unix-style systems (such as cut, paste, join, sort, uniq, awk) can split files on a comma delimiter, and can therefore process simple CSV files. However, this method does not correctly handle commas or new lines within quoted strings.
* Some code and text editors, such as Visual Studio Code, IntelliJ, Notepad++, CudaText, and others, support syntax highlighting for CSV files, making them easier to read and edit.

As a (main or optional) internal representation, CSV can be native or foreign. This differs from an interchange format ("export/import only") because it is not necessary to create a copy in another format:

* Some spreadsheets, including LibreOffice Calc, offer this option without forcing the user to adopt another format.
* Some relational databases, when using standard SQL, offer a foreign-data wrapper (FDW). For example, PostgreSQL offers the "CREATE FOREIGN TABLE" and "CREATE EXTENSION file_fdw" commands to configure any variant of CSV.
* Databases like Apache Hive offer the option to express CSV or .csv.gz as an internal table format.
* The Emacs editor can operate on CSV files using csv-nav mode.

CSV format is supported by libraries available for many programming languages. Most provide some way to specify the field delimiter, decimal separator, character encoding, quoting conventions, date format, etc.
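
As one illustration of such library options, the sketch below uses Python's standard csv module to declare a delimiter, a quotation character, and a character encoding. The dialect name "pipes" and the sample bytes are hypothetical and chosen only for the example.

```python
import csv
import io

# Hypothetical input: UTF-8 with a byte-order mark, pipe-delimited,
# with single quotes as the quotation character.
raw_bytes = "\ufeffcity|note\nZürich|'cold | dry'\n".encode("utf-8")

# Register a named dialect describing the delimiter and quoting convention.
csv.register_dialect("pipes", delimiter="|", quotechar="'")

# The encoding is chosen when the byte stream is decoded, not by the csv
# module; "utf-8-sig" strips the byte-order mark if one is present.
text_stream = io.TextIOWrapper(
    io.BytesIO(raw_bytes), encoding="utf-8-sig", newline=""
)
records = list(csv.DictReader(text_stream, dialect="pipes"))

print(records)
```

Other libraries expose similar settings under different names, so the parameter names here should not be read as a general convention.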


Software and row limits

Programs that work with CSV may limit the maximum number of rows a CSV file can have. Below is a list of common software and its limitations:

* Microsoft Excel: 1,048,576 row limit;
* Apple Numbers: 1,000,000 row limit;
* Google Sheets: 5,000,000 cell limit (the product of columns and rows);
* OpenOffice and LibreOffice: 1,048,576 row limit;
* Text editors (such as WordPad, TextEdit, Vim, etc.): no row or cell limit;
* Databases (COPY command and FDW): no row or cell limit.
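
Before opening a file in a tool with a row limit, the number of records can be checked with a short script. The sketch below assumes Python's standard csv module and uses the Microsoft Excel figure from the list above; the helper name count_records is hypothetical.

```python
import csv
import io

EXCEL_ROW_LIMIT = 1_048_576  # row limit quoted in the list above

def count_records(text_stream):
    """Count CSV records lazily, without loading the whole file.

    A line break inside a quoted field does not start a new record,
    so this can differ from the number of physical lines.
    """
    return sum(1 for _ in csv.reader(text_stream))

# Sample text: a header plus two records, one spanning two physical lines.
sample = 'id,comment\r\n1,"first line\r\nsecond line"\r\n2,plain\r\n'

record_count = count_records(io.StringIO(sample, newline=""))
physical_lines = len(sample.splitlines())
fits_in_sheet = record_count <= EXCEL_ROW_LIMIT

print(record_count, physical_lines, fits_in_sheet)  # 3 4 True
```

Counting records rather than lines matters when fields contain embedded line breaks, because a plain line count would overestimate the number of rows.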


See also

* Tab-separated values
* Comparison of data-serialization formats
* Delimiter-separated values
* Delimiter collision
* Flat-file database
* Simple Data Format
* Substitute character, Null character, invisible comma U+2063


Further reading

* (Has file descriptions of delimited ASCII (.DEL) (including comma- and semicolon-separated) and non-delimited ASCII (.ASC) files for data transfer.)