A delimiter is a sequence of one or more characters for specifying the boundary between separate, independent regions in plain text, mathematical expressions or other data streams. An example of a delimiter is the comma character, which acts as a ''field delimiter'' in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code. In mathematics, delimiters are often used to specify the scope of an operation, and can occur both as isolated symbols (e.g., the colon in "1 : 4") and as a pair of opposing-looking symbols (e.g., the angle brackets in ⟨a, b⟩). Delimiters represent one of various means of specifying boundaries in a data stream. Declarative notation, for example, is an alternate method that uses a length field at the start of a data stream to specify the number of characters that the data stream contains; Hollerith notation in the Fortran programming language works this way.
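
The difference between delimited and length-prefixed (declarative) framing can be sketched in Python. This is only an illustration of the idea of a leading length field, not an implementation of Fortran's Hollerith constants: the function names `encode_hollerith` and `decode_hollerith` and the `<length>H<text>` layout are assumptions made for this example.

```python
def encode_hollerith(text: str) -> str:
    """Prefix the text with its character count and an 'H' marker."""
    return f"{len(text)}H{text}"


def decode_hollerith(stream: str) -> list:
    """Split a stream of concatenated length-prefixed items.

    The payload is never searched for a delimiter, so it may contain
    any character, including 'H' or a comma.
    """
    items = []
    pos = 0
    while pos < len(stream):
        marker = stream.index("H", pos)    # end of the length field
        length = int(stream[pos:marker])   # declared payload size
        start = marker + 1
        items.append(stream[start:start + length])
        pos = start + length
    return items


stream = encode_hollerith("HELLO") + encode_hollerith("a,b")
print(stream)                    # 5HHELLO3Ha,b
print(decode_hollerith(stream))  # ['HELLO', 'a,b']
```

Because the reader is told how many characters to consume, no character in the payload can be mistaken for a boundary.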


Overview

Delimiters may be characterized as field and record delimiters, or as bracket delimiters.


Field and record delimiters

Field delimiters separate data fields. Record delimiters separate groups of fields. For example, the CSV file format uses a comma as the delimiter between fields, and an end-of-line indicator as the delimiter between records:
fname,lname,age,salary
nancy,davolio,33,$30000
erin,borakova,28,$25250
tony,raphael,35,$28700
This specifies a simple flat file database table using the CSV file format.
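
As a rough sketch of how the two kinds of delimiter are consumed, the table above can be split in Python first on the record delimiter and then on the field delimiter. The variable names are illustrative, and plain `split` is used only because this data contains no embedded commas or line breaks.

```python
table = (
    "fname,lname,age,salary\n"
    "nancy,davolio,33,$30000\n"
    "erin,borakova,28,$25250\n"
    "tony,raphael,35,$28700\n"
)

# Record delimiter: end of line. Field delimiter: comma.
records = [line.split(",") for line in table.splitlines()]
header, rows = records[0], records[1:]

print(header)   # ['fname', 'lname', 'age', 'salary']
print(rows[0])  # ['nancy', 'davolio', '33', '$30000']
```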


Bracket delimiters

Bracket delimiters, also called block delimiters, region delimiters, or balanced delimiters, mark both the start and end of a region of text. Common examples of bracket delimiters include parentheses, square brackets, braces and angle brackets.


Conventions

Historically, computing platforms have used certain delimiters by convention. For the delimiters used by programming languages, see also Comparison of programming languages (syntax); for field and record delimiters, see also ASCII and Control character.


Delimiter collision

Delimiter collision is a problem that occurs when an author or programmer introduces delimiters into text without actually intending them to be interpreted as boundaries between separate regions. In the case of XML, for example, this can occur whenever an author attempts to specify an angle bracket character. In most file types there is both a field delimiter and a record delimiter, both of which are subject to collision. In the case of comma-separated values files, for example, field delimiter collision can occur whenever an author attempts to include a comma as part of a field value (e.g., salary = "$30,000"), and record delimiter collision would occur whenever a field contained multiple lines. Both record and field delimiter collision occur frequently in text files.

In some contexts, a malicious user or attacker may seek to exploit this problem intentionally. Consequently, delimiter collision can be the source of security vulnerabilities and exploits. Malicious users can take advantage of delimiter collision in languages such as SQL and HTML to deploy such well-known attacks as SQL injection and cross-site scripting, respectively.
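
The field collision described above can be reproduced in a few lines of Python. The sample row is an assumption made for illustration; `csv.reader` is the standard-library CSV parser and is shown only to contrast a delimiter-aware reading with naive splitting.

```python
import csv
import io

line = 'nancy,davolio,33,"$30,000"'

# Naive splitting treats the comma inside the salary as a field boundary.
naive = line.split(",")
print(len(naive))  # 5

# A CSV-aware parser honors the quotes around the field.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)      # ['nancy', 'davolio', '33', '$30,000']
```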


Solutions

Because delimiter collision is a very common problem, various methods for avoiding it have been invented. Some authors may attempt to avoid the problem by choosing a delimiter character (or sequence of characters) that is not likely to appear in the data stream itself. This ''ad hoc'' approach may be suitable, but it necessarily depends on a correct guess of what will appear in the data stream, and offers no security against malicious collisions. Other, more formal conventions are therefore applied as well.


ASCII delimited text

The ASCII and Unicode character sets were designed to solve this problem by the provision of non-printing characters that can be used as delimiters. These occupy the range from ASCII 28 to 31 (the file, group, record and unit separators). Using ASCII 31 (unit separator) as the field delimiter and ASCII 30 (record separator) as the record delimiter avoids collisions with the field and record delimiters that commonly appear in a text data stream, provided that the data itself does not contain these control characters.
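
A minimal sketch of ASCII delimited text in Python follows. The helper names `dump` and `load` are hypothetical, and rejecting fields that already contain a separator is a design choice of this example rather than part of any standard.

```python
US = "\x1f"  # ASCII 31, unit separator: placed between fields
RS = "\x1e"  # ASCII 30, record separator: placed between records


def dump(rows):
    """Join fields with US and records with RS."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)


def load(text):
    """Inverse of dump() for a non-empty list of rows."""
    return [record.split(US) for record in text.split(RS)]


# Commas and line breaks inside fields no longer collide with anything.
rows = [["nancy", "$30,000"], ["erin", "line one\nline two"]]
assert load(dump(rows)) == rows
```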


Escape character

One method for avoiding delimiter collision is to use escape characters. From a language design standpoint, these are adequate, but they have drawbacks:
* text can be rendered unreadable when littered with numerous escape characters, a problem referred to as leaning toothpick syndrome (due to the use of \ to escape / in Perl regular expressions, leading to sequences such as "\/\/");
* text becomes difficult to parse with regular expressions;
* they require a mechanism to "escape the escapes" when not intended as escape characters;
* although easy to type, they can be cryptic to someone unfamiliar with the language; and
* they do not protect against injection attacks.
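
The mechanics of escaping, including the need to "escape the escapes", can be sketched as follows. The functions `escape` and `split_escaped` are hypothetical helpers written for this illustration; the backslash and comma are assumed as escape character and delimiter.

```python
def escape(field, delimiter=",", esc="\\"):
    """Prefix every escape character and delimiter with the escape character."""
    # Escape the escape character first, otherwise it would be doubled twice.
    return field.replace(esc, esc + esc).replace(delimiter, esc + delimiter)


def split_escaped(line, delimiter=",", esc="\\"):
    """Split on unescaped delimiters and remove the escapes."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == esc:
            # Take the next character literally, whatever it is.
            current.append(next(chars, ""))
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


row = ["nancy", "$30,000", "C:\\temp"]
line = ",".join(escape(f) for f in row)
print(line)                 # nancy,$30\,000,C:\\temp
print(split_escaped(line))  # ['nancy', '$30,000', 'C:\\temp']
```

Note that the reader can no longer use a plain split on the delimiter; it has to walk the text character by character, which is one of the parsing costs listed above.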


Escape sequence

Escape sequences are similar to escape characters, except they usually consist of some kind of mnemonic instead of just a single character. One use is in string literals that include a double quote (") character. For example, in Perl, the code:
print "Nancy said \x22Hello World!\x22 to the crowd.";    ### use \x22
produces the same output as:
print "Nancy said \"Hello World!\" to the crowd.";    ### use escape char
One drawback of escape sequences, when used by people, is the need to memorize the codes that represent individual characters (see also: character entity reference, numeric character reference).


Dual quoting delimiters

In contrast to escape sequences and escape characters, dual delimiters provide yet another way to avoid delimiter collision. Some languages, for example, allow the use of either a single quote (') or a double quote (") to specify a string literal. For example, in Perl:
print 'Nancy said "Hello World!" to the crowd.';
produces the desired output without requiring escapes. This approach, however, only works when the string does not contain ''both'' types of quotation marks.
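
The choice between the two delimiters, and the limitation when both occur, can be expressed as a small Python sketch. The function `quote` is a hypothetical helper, not part of any library.

```python
def quote(text):
    """Wrap text in whichever quote character it does not contain."""
    if '"' not in text:
        return '"' + text + '"'
    if "'" not in text:
        return "'" + text + "'"
    # Dual delimiters offer no way out when both kinds are present.
    raise ValueError("text contains both quote characters")


print(quote('Nancy said "Hello World!" to the crowd.'))
# 'Nancy said "Hello World!" to the crowd.'
```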


Padding quoting delimiters

In contrast to escape sequences and escape characters, padding delimiters provide yet another way to avoid delimiter collision. Visual Basic, for example, uses double quotes as string delimiters and represents a double quote inside a string by writing it twice. This is similar to escaping the delimiter:
print "Nancy said ""Hello World!"" to the crowd."
produces the desired output without requiring escapes. Like regular escaping it can, however, become confusing when many quotes are used. The code to print the above source code would look more confusing:
print "print ""Nancy said """"Hello World!"""" to the crowd."""
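
The doubling rule is simple to state in code. The following Python sketch uses hypothetical helpers `quote_doubled` and `unquote_doubled`; it mimics the quoting convention only and is not Visual Basic.

```python
def quote_doubled(text):
    """Delimit with double quotes, doubling any embedded double quote."""
    return '"' + text.replace('"', '""') + '"'


def unquote_doubled(literal):
    """Inverse of quote_doubled()."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("not a double-quoted literal")
    return literal[1:-1].replace('""', '"')


once = quote_doubled('Nancy said "Hello World!" to the crowd.')
print(once)   # "Nancy said ""Hello World!"" to the crowd."

# Quoting a statement that already contains a quoted literal
# doubles every quote again.
twice = quote_doubled("print " + once)
print(twice)  # "print ""Nancy said """"Hello World!"""" to the crowd."""
```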


Configurable alternative quoting delimiters

In contrast to dual delimiters, multiple delimiters are even more flexible for avoiding delimiter collision. For example, in Perl:
print qq^Nancy doesn't want to say "Hello World!" anymore.^;
print qq@Nancy doesn't want to say "Hello World!" anymore.@;
print qq(Nancy doesn't want to say "Hello World!" anymore.);
all produce the desired output through the use of quote operators, which allow any convenient character to act as a delimiter. Although this method is more flexible, few languages support it. Perl and Ruby are two that do. In Ruby, these are indicated as ''general delimited strings''.


Content boundary

A content boundary is a special type of delimiter that is specifically designed to resist delimiter collision. It works by allowing the author to specify a sequence of characters that is guaranteed to always indicate a boundary between parts in a multi-part message, with no other possible interpretation. The delimiter is frequently generated from a random sequence of characters that is statistically improbable to occur in the content. This may be followed by an identifying mark such as a UUID, a timestamp, or some other distinguishing mark. Alternatively, the content may be scanned to guarantee that a delimiter does not appear in the text. This may allow the delimiter to be shorter or simpler, and increase the human readability of the document. (''See e.g.'', MIME, Here documents).
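
Both strategies described above, a random boundary and a scan of the content, can be combined in a short Python sketch. The layout is loosely inspired by multipart messages but is not the MIME format; `make_boundary`, `pack`, `unpack` and the "BOUNDARY-" prefix are assumptions made for this example.

```python
import uuid


def make_boundary(parts):
    """Return a random boundary string that occurs in none of the parts."""
    while True:
        boundary = "BOUNDARY-" + uuid.uuid4().hex
        # Scan the content so the absence of a collision is guaranteed,
        # not merely improbable.
        if not any(boundary in part for part in parts):
            return boundary


def pack(parts):
    """Join the parts with a boundary line; return (boundary, message)."""
    boundary = make_boundary(parts)
    return boundary, f"\n--{boundary}\n".join(parts)


def unpack(boundary, message):
    """Split a packed message back into its parts."""
    return message.split(f"\n--{boundary}\n")


parts = ['salary = "$30,000"', "two\nlines", "<tag attr='x'>"]
boundary, message = pack(parts)
assert unpack(boundary, message) == parts
```

The receiver must be told the boundary out of band or in a header, which is why `pack` returns it alongside the message.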


Whitespace or indentation

Some programming and computer languages allow the use of whitespace delimiters or indentation as a means of specifying boundaries between independent regions in text.


Regular expression syntax

In specifying a regular expression, alternate delimiters may also be used to simplify the syntax for match and substitution operations in Perl. For example, a simple match operation may be specified in Perl with the following syntax:
$string1 = 'Nancy said "Hello World!" to the crowd.';    # specify a target string
print $string1 =~ m/[aeiou]+/;    # match one or more vowels
The syntax is flexible enough to specify match operations with alternate delimiters, making it easy to avoid delimiter collision:
$string1 = 'Nancy said "http://Hello/World.htm" is not a valid address.';    # target string
print $string1 =~ m@http://@;    # match using alternate regular expression delimiter
print $string1 =~ m{http://};    # same as previous, but different delimiter
print $string1 =~ m!http://!;    # same as previous, but different delimiter


Here document

A here document allows the inclusion of arbitrary content by describing a special end sequence. Many languages support this, including PHP, bash scripts, Ruby and Perl. A here document starts by describing what the end sequence will be and continues until that sequence is seen at the start of a new line. Here is an example in Perl, in which the end sequence ENDOFHEREDOC is an arbitrary label:
print <<ENDOFHEREDOC;
It's very hard to encode a string with "certain characters".

Newlines, commas, and other characters can cause delimiter collisions.
ENDOFHEREDOC
This code would print:
It's very hard to encode a string with "certain characters".

Newlines, commas, and other characters can cause delimiter collisions.
By using a special end sequence, all manner of characters are allowed in the string.


ASCII armor

Although principally used as a mechanism for text encoding of binary data, ASCII armoring is a programming and systems administration technique that also helps to avoid delimiter collision in some circumstances. This technique differs from the other approaches described above in that it is more complicated, and therefore not suitable for small applications and simple data storage formats. The technique employs a special encoding scheme, such as base64, to ensure that delimiters or other significant characters do not appear in transmitted data. The purpose is to prevent multilayered escaping, e.g. of double quotes. This technique is used, for example, in Microsoft's ASP.NET web development technology, and is closely associated with the "VIEWSTATE" component of that system.


Example

The following simplified example demonstrates how this technique works in practice. Consider a simple HTML tag in which the VIEWSTATE value contains characters that are incompatible with the delimiters of the HTML tag itself, such as an unescaped double quote inside a double-quoted attribute value. Such a fragment is not well-formed, and would therefore not work properly in a "real world" deployed system. To store arbitrary text in an HTML attribute, HTML entities can be used; in this case "&quot;" stands in for the double quote. Alternatively, any encoding could be used that doesn't include characters that have special meaning in the context, such as base64 or percent-encoding. This prevents delimiter collision and ensures that incompatible characters will not appear inside the HTML code, regardless of what characters appear in the original (decoded) text.
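
The three encodings mentioned in this example can be compared with the Python standard library. The sample text and the shape of the `input` tag are assumptions made for illustration; this is a sketch of the encoding step only and does not reflect how ASP.NET actually builds its VIEWSTATE value.

```python
import base64
import html
from urllib.parse import quote

# Hypothetical value containing both kinds of quotation mark.
text = 'BookTitle:Nancy doesn\'t say "Hello World!" anymore.'

# Character entities: only the significant characters are replaced.
as_entities = html.escape(text, quote=True)

# Base64: the output alphabet contains no quotes or angle brackets.
as_base64 = base64.b64encode(text.encode("utf-8")).decode("ascii")

# Percent-encoding: every reserved or unsafe character becomes %XX.
as_percent = quote(text, safe="")

for value in (as_entities, as_base64, as_percent):
    # None of the encoded forms can terminate the attribute early.
    assert '"' not in value
    print(f'<input type="hidden" name="VIEWSTATE" value="{value}" />')
```

The entity form stays readable, while base64 hides the content entirely; which trade-off is preferable depends on whether people are expected to read the encoded value.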


See also

* CDATA
* Decimal separator
* Delimiter-separated values
* Escape sequence
* String literal
* Tab-separated values



External links


* Data File Metaformats, from The Art of Unix Programming by Eric Steven Raymond
* What is delimiter, by Margaret Rouse