Formats that use delimiter-separated values (DSV) store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. Most database and spreadsheet programs are able to read or save data in a delimited format. Due to their wide support, DSV files can be used in data exchange among many applications.
A delimited text file is a text file used to store data, in which each line represents a single book, company, or other thing, and each line has fields separated by the delimiter. Compared to the kind of flat file that uses spaces to force every field to the same width, a delimited file has the advantage of allowing field values of any length.
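As a rough illustration of that difference, the sketch below stores the same record in a fixed-width layout and in a delimited layout. The field widths (10 and 8), the tab delimiter, and the sample values are assumptions chosen for the example; they do not come from any particular file format.

```python
# Hypothetical record: a company name and a city.
record = ["International Widgets", "Oslo"]

# Fixed-width layout: every field is padded with spaces to a fixed width
# and truncated if it is too long (the widths are assumed for illustration).
WIDTHS = [10, 8]
fixed_line = "".join(value[:w].ljust(w) for value, w in zip(record, WIDTHS))

# Delimited layout: fields keep their full length, separated by a tab.
delimited_line = "\t".join(record)

# Read each layout back into a list of fields.
pos, fixed_fields = 0, []
for w in WIDTHS:
    fixed_fields.append(fixed_line[pos:pos + w].rstrip())
    pos += w
delimited_fields = delimited_line.split("\t")

print(fixed_fields)      # the long name has been cut to 10 characters
print(delimited_fields)  # both values survive unchanged
```

The fixed-width reader has to know the widths in advance, while the delimited reader only needs to know the delimiter.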
Delimited formats
Any character may be used to separate the values, but the most common delimiters are the comma, tab, and colon. The vertical bar (also referred to as a pipe) and space are also sometimes used.
Column headers are sometimes included as the first line, and each subsequent line is a row of data. The lines are separated by newlines.
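A header line and a non-comma delimiter can both be handled with Python's standard `csv` module. The following is a minimal sketch, assuming pipe-delimited input; the sample data is made up for illustration.

```python
import csv
import io

# Hypothetical pipe-delimited data; the first line holds the column headers.
text = "Date|Pupil|Grade\n25 May|Fred Bloggs|C\n15 July|Jane Doe|B\n"

# DictReader takes the first line as the field names and maps each later
# line to a dictionary keyed by those names.
reader = csv.DictReader(io.StringIO(text), delimiter="|")
rows = list(reader)

for row in rows:
    print(row["Pupil"], row["Grade"])
```

Passing a different `delimiter` argument (for example `"\t"` or `":"`) is enough to read the other common variants.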
For example, the following fields in each record are delimited by commas, and each record by newlines:
"Date","Pupil","Grade"
"25 May","Bloggs, Fred","C"
"25 May","Doe, Jane","B"
"15 July","Bloggs, Fred","A"
"15 April","Muniz, Alvin ""Hank""","A"
Note the use of the double quote to enclose each field. This prevents the comma in the actual field value (Bloggs, Fred; Doe, Jane; etc.) from being interpreted as a field separator. This necessitates a way to "escape" the field wrapper itself, in this case the double quote; it is customary to double the double quotes actually contained in a field, as with those surrounding "Hank". In this way, any ASCII text including newlines can be contained in a field.
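The quoting and doubling convention described above matches the default dialect of Python's `csv` module, so part of the example can be parsed as a quick check. This is a sketch rather than a reference parser; the second snippet, with a newline inside a field, uses made-up data.

```python
import csv
import io

sample = '''"Date","Pupil","Grade"
"25 May","Bloggs, Fred","C"
"15 April","Muniz, Alvin ""Hank""","A"
'''

# The default dialect treats " as the field wrapper and "" as an escaped quote.
rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    print(row)

# A quoted field may also span lines (hypothetical data).
multiline = '"note","line one\nline two"\n'
multiline_rows = list(csv.reader(io.StringIO(multiline)))
print(multiline_rows)
```

The comma in `Bloggs, Fred` stays inside one field, and the doubled quotes around `Hank` come back as single quote characters.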
ASCII includes several control characters that are intended to be used as delimiters. They are: 28 for File Separator, 29 for Group Separator, 30 for Record Separator, and 31 for Unit Separator. Use of these characters has not achieved widespread adoption; some systems have replaced their control properties with more accepted controls such as CR/LF and TAB.
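A minimal sketch of how two of these control characters could be used, with the Unit Separator between fields and the Record Separator between records. It assumes the data itself never contains either control character, and the sample records are invented for the example.

```python
US = "\x1f"  # code 31, Unit Separator: placed between fields
RS = "\x1e"  # code 30, Record Separator: placed between records

records = [
    ["25 May", "Bloggs, Fred", "C"],
    ["15 April", 'Muniz, Alvin "Hank"', "A"],
]

# Commas, quotes, and even newlines in the data need no escaping here,
# because none of them is used as a delimiter.
encoded = RS.join(US.join(fields) for fields in records)
decoded = [record.split(US) for record in encoded.split(RS)]

print(decoded == records)
```

The trade-off is that these characters are invisible in most editors, which makes such files hard to inspect or edit by hand.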
Uses and applications
Due to their widespread use, comma- and tab-delimited text files can be opened by several kinds of applications, including most spreadsheet programs and statistical packages, sometimes even without the user designating which delimiter has been used. Although each of those applications has its own database design and its own file format (for example, accdb or xlsx), they can all map the fields in a DSV file to their own data model and format.
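One way an application might guess the delimiter without being told is heuristic detection, as offered by `csv.Sniffer` in Python's standard library. The sketch below assumes a small tab-delimited sample and restricts the candidates to a few common delimiters. The detection is a heuristic and can fail or guess wrong on irregular input.

```python
import csv
import io

# Hypothetical tab-delimited sample.
sample = "Date\tPupil\tGrade\n25 May\tFred Bloggs\tC\n15 July\tJane Doe\tB\n"

# Ask the sniffer to choose among a few common delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=",\t;|")
rows = list(csv.reader(io.StringIO(sample), dialect))

print(repr(dialect.delimiter))
print(rows[1])
```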
Typically a delimited file format is defined by a specification. Some specifications provide conventions for avoiding delimiter collision; others do not. Delimiter collision is a problem that occurs when a character that is intended as part of the data gets interpreted as a delimiter instead. Comma- and space-separated formats often suffer from this problem, since in many contexts those characters are legitimate parts of a data field. Most such files avoid delimiter collision either by surrounding all data fields in double quotes, or by quoting only those data fields that contain the delimiter character.
One problem with tab-delimited text files is that tabs are difficult to distinguish from spaces; as a result, files are sometimes corrupted when people edit them by hand. Another set of problems arises from errors in the file structure, which usually surface when the file is imported into a database (in the example above, such an error may be a pupil's first name missing).
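The two quoting strategies mentioned above, and the collision they prevent, can be sketched with Python's `csv` module. The helper name `write_row` is an assumption introduced for this example.

```python
import csv
import io

row = ["25 May", "Bloggs, Fred", "C"]

# Without quoting, the comma inside the name is read as a delimiter.
naive_fields = ",".join(row).split(",")

def write_row(fields, quoting):
    """Serialize one row with the given csv quoting mode (hypothetical helper)."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=quoting).writerow(fields)
    return buffer.getvalue()

quote_all = write_row(row, csv.QUOTE_ALL)          # every field wrapped
quote_minimal = write_row(row, csv.QUOTE_MINIMAL)  # only fields that need it

print(naive_fields)
print(quote_all, end="")
print(quote_minimal, end="")
```

The naive split yields four fields instead of three, while both quoted forms read back as the original three fields.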
Depending on the data itself, it may be beneficial to use non-standard characters such as the tilde (~) as delimiters. With the rising prevalence of websites and other applications that store snippets of code in databases, simply relying on the double quote ("), which occurs in every hyperlink and image source tag, is not sufficient to avoid this type of collision. Since colons (:), semicolons (;), pipes (|), and many other characters are also used, it can be quite challenging to find a character that is not being used elsewhere.
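One simple way to look for such a character is to scan the data for each candidate before writing the file. The following is a sketch under stated assumptions: the function `pick_delimiter`, its candidate list, and the sample rows are hypothetical.

```python
def pick_delimiter(rows, candidates=("\t", "|", "~", ";", ":")):
    """Return the first candidate that appears in no field, or None."""
    for candidate in candidates:
        if not any(candidate in field for row in rows for field in row):
            return candidate
    return None

# Hypothetical rows containing markup and several common delimiters.
rows = [
    ['<a href="https://example.com/a:b">link</a>', "x|y"],
    ["plain; text", "tab\there"],
]

print(repr(pick_delimiter(rows)))
```

If every candidate already occurs in the data, the function returns `None`, and quoting or escaping is needed instead.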
See also
* Comma-separated values
* Delimiter
* Tab-separated values
Further reading
* IBM DB2 Administration Guide - LOAD, IMPORT, and EXPORT File Formats. IBM. https://www.columbia.edu/sec/acis/db2/db2d0/db2d053.htm (archived 2016-12-13 at https://web.archive.org/web/20161213014111/https://www.columbia.edu/sec/acis/db2/db2d0/db2d053.htm; accessed 2016-12-12). Has file descriptions of delimited ASCII (.DEL) and non-delimited ASCII (.ASC) files for data transfer.