HOME

TheInfoList



OR:

A Canonical S-expression (or csexp) is a binary encoding form of a subset of general
S-expression In computer programming, an S-expression (or symbolic expression, abbreviated as sexpr or sexp) is an expression in a like-named notation for nested list (tree-structured) data. S-expressions were invented for and popularized by the programming la ...
(or sexp). It was designed for use in
SPKI Simple public key infrastructure (SPKI, pronounced ''spoo-key'') was an attempt to overcome the complexity of traditional X.509 public key infrastructure. It was specified in two Internet Engineering Task Force (IETF) Request for Comments (RFC) s ...
to retain the power of S-expressions and ensure
canonical form In mathematics and computer science, a canonical, normal, or standard form of a mathematical object is a standard way of presenting that object as a mathematical expression. Often, it is one which provides the simplest representation of an obje ...
for applications such as
digital signatures A digital signature is a mathematical scheme for verifying the authenticity of digital messages or documents. A valid digital signature, where the prerequisites are satisfied, gives a recipient very high confidence that the message was created b ...
while achieving the compactness of a binary form and maximizing the speed of parsing. The particular subset of general S-expressions applicable here is composed of atoms, which are byte strings, and parentheses used to delimit lists or sub-lists. These S-expressions are fully recursive. While S-expressions are typically encoded as text, with spaces delimiting atoms and quotation marks used to surround atoms that contain spaces, when using the canonical encoding each atom is encoded as a length-prefixed byte string. No whitespace separating adjacent elements in a list is permitted. The length of an atom is expressed as an
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of ...
decimal number followed by a ":".


Example

The sexp (this "Canonical S-expression" has 5 atoms) becomes the csexp (4:this22:Canonical S-expression3:has1:55:atoms) No quotation marks are required to escape the space character internal to the atom "Canonical S-expression", because the length prefix clearly points to the end of the atom. There is no whitespace separating an atom from the next element in the list.


Properties

* Uniqueness of canonical encoding: Forbidding whitespace between list elements and providing just one way of encoding atoms ensures that every S-expression has exactly one encoded form. Thus, we can decide whether two S-expressions are equivalent by comparing their encodings. * Support for binary data: Atoms can be any binary string. So, a cryptographic hash value or a public key modulus that would otherwise have to be encoded in
base64 In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) in sequences of 24 bits that can be represented by four 6-bit Base64 digits. Common to all bina ...
or some other printable encoding can be expressed in csexp as its binary bytes. * Support for type-tagging encoded information: A csexp includes a non-S-expression construct for indicating the encoding of a string, when that encoding is not obvious. Any atom in csexp can be prefixed by a single atom in square brackets such as " :JPEG or " 4:text/plain;charset=utf-8.


Interpretation and restrictions

While csexps generally permit empty lists, empty atoms, and so forth, certain uses of csexps impose additional restrictions. For example, csexps as used in
SPKI Simple public key infrastructure (SPKI, pronounced ''spoo-key'') was an attempt to overcome the complexity of traditional X.509 public key infrastructure. It was specified in two Internet Engineering Task Force (IETF) Request for Comments (RFC) s ...
have one limitation compared to csexps in general: every list must start with an atom, and therefore there can be no empty lists. Typically, a list's first atom is treated as one treats an element name in
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable ...
.


Comparisons to other encodings

There are other encodings in common use: #
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable ...
#
ASN.1 Abstract Syntax Notation One (ASN.1) is a standard interface description language for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and ...
#
JSON JSON (JavaScript Object Notation, pronounced ; also ) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other ser ...
(and
YAML YAML ( and ) (''see '') is a human-readable data-serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Exte ...
that includes "JSON as an official subset", with the superset, meant to be more
human-readable A human-readable medium or human-readable format is any encoding of data or information that can be naturally read by humans. In computing, ''human-readable'' data is often encoded as ASCII or Unicode text, rather than as binary data. In most c ...
.) Generally, csexp has a parser one or two decimal orders of magnitude smaller than that of either XML or ASN.1. This small size and corresponding speed give csexp its main advantage. In addition to the parsing advantage, there are other differences.


csexp vs. XML

csexp and XML differ in that csexp is a data-representation format, while XML includes a data-representation format and also a schema mechanism. Thus, XML can be "configured" for particular kinds of data, which conform to some grammar (say,
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...
,
ATOM Every atom is composed of a nucleus and one or more electrons bound to the nucleus. The nucleus is made of one or more protons and a number of neutrons. Only the most common variety of hydrogen has no neutrons. Every solid, liquid, gas, and ...
, SVG,
MathML Mathematical Markup Language (MathML) is a mathematical markup language, an application of XML for describing mathematical notations and capturing both its structure and content. It aims at integrating mathematical formulae into World Wide Web ...
, or new ones as needed). It has languages for defining document grammars: DTD is defined by the XML standard itself, while
XSD XSD (XML Schema Definition), a recommendation of the World Wide Web Consortium ( W3C), specifies how to formally describe the elements in an Extensible Markup Language (XML) document. It can be used by programmers to verify each piece of item con ...
,
RelaxNG In computing, RELAX NG (REgular LAnguage for XML Next Generation) is a XML schema, schema language for XML—a RELAX NG schema specifies a pattern for the structure and content of an XML document. A RELAX NG schema is itself an XML document but R ...
, and
Schematron Schematron is a rule-based validation language for making assertions about the presence or absence of patterns in XML trees. It is a structural schema language expressed in XML using a small number of elements and XPath. In many implementations ...
are commonly used with XML for additional features, and XML can also work with no schema. csexp data can of course be operated on by schemas implemented at a higher level, but provides no such mechanism itself. In terms of characters and bytes, a csexp "string" may have any byte sequence whatsoever (because of the length prefix on each atom), while XML (like regular Scheme S-expressions, JSON, and literals in programming languages), requires alternate representations for a few characters (such as "<" and most control characters). This, however, has no effect on the range of structures and semantics that can be represented. XML also provides mechanisms to specify how a given byte sequence is intended to be interpreted: Say, as a
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology Technical standard, standard for the consistent character encoding, encoding, representation, and handling of Character (computing), text expre ...
UTF-8 UTF-8 is a variable-width encoding, variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit'' ...
string, a
JPEG JPEG ( ) is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography. The degree of compression can be adjusted, allowing a selectable tradeoff between storage size and imag ...
file, or an integer; csexp leaves such distinctions to external mechanisms. At the most basic level, both csexp and XML represent trees (as do most other external representations). This is not surprising, since XML can be described as a differently-punctuated form for LISP-like S-expressions, or vice versa. However, XML includes additional semantics, which are commonly achieved in csexp by various conventions rather than as part of the language. First, every XML element has a name (csexp applications commonly use the first child of each expression for this). Second, XML provide datatyping, firstly via the schema grammar. A schema can also, however, distinguish integers, strings, data objects with types (e.g. JPEG) and (especially with
XSD XSD (XML Schema Definition), a recommendation of the World Wide Web Consortium ( W3C), specifies how to formally describe the elements in an Extensible Markup Language (XML) document. It can be used by programmers to verify each piece of item con ...
) other types). An XML element may also have ''attributes'', a construct that csexp does not share. To represent XML data in csexp, one must choose a representation for such attributes; an obvious one is to reserve the second item in each S-expression for a list of (name value) pairs, analogous to the
LISP A lisp is a speech impairment in which a person misarticulates sibilants (, , , , , , , ). These misarticulations often result in unclear speech. Types * A frontal lisp occurs when the tongue is placed anterior to the target. Interdental lisping ...
association list In computer programming and particularly in Lisp, an association list, often referred to as an alist, is a linked list in which each list element (or node) comprises a key and a value. The association list is said to ''associate'' the value with ...
. The XML ''ID'' and ''IDREF'' attributes have no equivalent in csexp, but can be easily implemented by a csexp application program. Finally, an XML element may contain comments and/or processing instructions. csexp has no specific equivalents, but they are trivial to represent, merely by reserving a name for each. For example, naming them "*COM" and "*PI" (the "*" prevents ever colliding with XML element type names): (4:*COM15:Text of comment) (3:*PI6:target11:font="helv") Both csexp and XML are fully recursive. The first atom in a csexp list, by convention roughly corresponds to an XML element type name in identifying the "type" of the list. However, in csexp this can be any atom in any encoding (e.g., a JPEG, a Unicode string, a
WAV Waveform Audio File Format (WAVE, or WAV due to its filename extension; pronounced "wave") is an audio file format standard, developed by IBM and Microsoft, for storing an audio bitstream on PCs. It is the main format used on Microsoft Wind ...
file, …), while XML element names are identifiers, constrained to certain characters, like programming-language identifiers. csexp's method is obviously more general; on the other hand, Identifying what encoding such an item is in, and thus how to interpret it, is determined only by a particular user's conventions, meaning that a csexp application must build such conventions for itself, in code, documentation, and so forth. Similarly, csexp atoms are binary (consisting of a length prefix followed by totally arbitrary ''bytes''), while XML is designed to be human-readable (while arguably less so than
JSON JSON (JavaScript Object Notation, pronounced ; also ) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other ser ...
or
YAML YAML ( and ) (''see '') is a human-readable data-serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Exte ...
) so arbitrary bytes in XML must be encoded somehow (for example, a bitmapped image can be included using
base64 In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) in sequences of 24 bits that can be represented by four 6-bit Base64 digits. Common to all bina ...
). This means that storing large amounts of non-readable information in uncompressed XML takes more space; on the other hand, it will survive translation between alternate
character sets Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that ...
(including transmission through network hosts that may apply differing character sets, line-end conventions, etc.). It has been suggested that XML "merges" a sequence of strings within one element into a single string, while csexp allows a sequence of atoms within a list and those atoms remain separate from one another; but this is incorrect.The SAX interface to XML ''allows'' an XML ''parser'' to break up (single) text strings any way it likes. Some implementations (incorrectly) return multiple ''lines'' as single text nodes, which may have led to this common misunderstanding. Exactly like S-expressions and csexp, XML has a notion of a "sequence of strings" only if the "strings" are separated somehow: <s>String A</s><s>String B</s> versus <s>String AString B</s> ("String A" "String B") versus ("String AString B") (8:String A8:String B) versus (16:String AString B)


csexp vs. ASN.1

ASN.1 is a popular binary encoding form. However, it expresses only syntax (data types), not semantics. Two different structures each a SEQUENCE of two INTEGERS have identical representations on the wire (barring special tag choices to distinguish them). To parse an ASN.1 structure, one must tell the parser what set of structures one is expecting and the parser must match the data type being parsed against the structure options. This adds to the complexity of an ASN.1 parser. A csexp structure carries some indication of its own semantics (encoded in element names), and the parser for a csexp structure does not care what structure is being parsed. Once a wire-format expression has been parsed into an internal tree form (similar to XML's DOM), the consumer of that structure can examine it for conformance to what was expected. An XML document without a schema works just like csexp in this respect, while an XML document with them can work more like ASN.1.


External links


Ron Rivest's sexp page

Ron Rivest's IETF-draft describing S-expressions



Notes and references

{{Reflist Lisp (programming language) Data serialization formats