Data exchange is the process of taking data structured under a ''source'' schema and transforming it into data structured under a ''target'' schema, so that the target data is an accurate representation of the source data.
(A. Doan, A. Halevy, and Z. Ives, ''Principles of Data Integration'', Morgan Kaufmann, 2012, p. 276.) Data exchange allows data to be shared between different computer programs.
It is similar to the related concept of data integration, except that in data exchange the data is actually restructured (with possible loss of content). There may be no way to transform an instance given all of the constraints. Conversely, there may be numerous ways to transform the instance (possibly infinitely many), in which case a "best" choice of solutions has to be identified and justified.
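As a rough illustration of restructuring with loss of content, the following minimal sketch maps rows from a source schema to a target schema. Both schemas, the field names and the `exchange` function are hypothetical and invented for this example. The target field `country` is not determined by the source, so many target instances would be acceptable; the sketch picks an unknown value (`None`) as one defensible choice.

```python
# Hypothetical source rows under a (first, last, fax) schema.
SOURCE_ROWS = [
    {"first": "Ada", "last": "Lovelace", "fax": "555-0100"},
    {"first": "Alan", "last": "Turing", "fax": None},
]


def exchange(row):
    """Map one source row to the hypothetical target schema (full_name, country)."""
    return {
        # Restructuring: two source fields become one target field.
        "full_name": f"{row['first']} {row['last']}",
        # The source says nothing about country, so any value would satisfy
        # the target schema; None is chosen as the least committal option.
        "country": None,
        # The source "fax" field has no target counterpart and is dropped.
    }


target_rows = [exchange(r) for r in SOURCE_ROWS]
print(target_rows)
```

The dropped `fax` field shows the possible loss of content, and the free choice for `country` shows why a preferred solution has to be justified when several exist.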
Single-domain data exchange
In some domains, a few dozen different source and target schemas (proprietary data formats) may exist. An "exchange" or "interchange format" is often developed for a single domain, and the necessary routines (mappings) are then written to translate each source schema into the interchange format and the interchange format into each target schema, so that every source-to-target translation passes through the interchange format as an intermediate step.
That requires far less work than writing and debugging the hundreds of routines that would be needed to translate each source schema directly to each target schema.
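The saving can be sketched as follows. This is a minimal illustration, not a real interchange standard: a list of dictionaries plays the role of the interchange format, and the reader and writer names are assumptions made for the example. With n source formats and m target formats, direct translation needs n × m routines, whereas going through the interchange format needs only n + m.

```python
import csv
import io
import json


# Readers turn a source format into the interchange form (a list of dicts).
def read_csv(text):
    return list(csv.DictReader(io.StringIO(text)))


def read_json(text):
    return json.loads(text)


# Writers turn the interchange form into a target format.
def write_json(records):
    return json.dumps(records)


def write_tsv(records):
    header = list(records[0])
    lines = ["\t".join(header)]
    lines += ["\t".join(str(r[k]) for k in header) for r in records]
    return "\n".join(lines)


READERS = {"csv": read_csv, "json": read_json}
WRITERS = {"json": write_json, "tsv": write_tsv}


def translate(text, source, target):
    """Translate via the interchange form instead of a dedicated direct routine."""
    return WRITERS[target](READERS[source](text))


def routines_needed(n_sources, m_targets):
    """Return (direct, via_interchange) routine counts."""
    return n_sources * m_targets, n_sources + m_targets


print(translate("id,name\n1,Ada\n", "csv", "tsv"))
print(routines_needed(24, 24))
```

Adding a new format to this sketch means writing one reader or one writer, after which it can be combined with every format already registered.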
Examples of these transformative interchange formats include:
* Standard Interchange Format for geospatial data;
* Data Interchange Format for spreadsheet data;
* Open Document Format for spreadsheets, charts, presentations and word processing documents;
* GPS eXchange Format or Keyhole Markup Language for describing GPS data; and
* GDSII for integrated circuit layout.
Data exchange languages
A data interchange (or exchange) language/format is a language that is domain-independent and can be used for data from any kind of discipline.
They have "evolved from being markup and display-oriented to further support the encoding of metadata that describes the structural attributes of the information."
Practice has shown that certain types of formal languages are better suited for this task than others, since their specification is driven by a formal process instead of particular software implementation needs. For example, XML is a markup language that was designed to enable the creation of dialects (the definition of domain-specific sublanguages).
However, it does not contain domain-specific dictionaries or fact types. Reliable data exchange benefits from the availability of standard dictionaries-taxonomies and tool libraries such as parsers, schema validators, and transformation tools.
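The three kinds of tooling named above (parser, validator, transformation) can be sketched with the Python standard library. The `<order>` dialect below is hypothetical, and `validate` is a hand-written stand-in for a real schema validator, which would normally check the document against a machine-readable schema.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical XML dialect for orders, invented for this example.
DOCUMENT = """
<order id="42">
  <customer>Ada</customer>
  <item sku="A1" qty="2"/>
  <item sku="B7" qty="1"/>
</order>
"""


def validate(root):
    """Hand-written stand-in for a schema validator; returns a list of errors."""
    errors = []
    if root.tag != "order":
        errors.append("root element must be <order>")
    if root.find("customer") is None:
        errors.append("missing <customer>")
    if not root.findall("item"):
        errors.append("at least one <item> is required")
    return errors


def transform(root):
    """Transform the parsed tree into a JSON string."""
    return json.dumps({
        "id": int(root.get("id")),
        "customer": root.findtext("customer"),
        "items": [{"sku": i.get("sku"), "qty": int(i.get("qty"))}
                  for i in root.findall("item")],
    })


root = ET.fromstring(DOCUMENT)                       # parse
problems = validate(root)                            # validate
result = transform(root) if not problems else None   # transform
print(problems, result)
```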
Popular languages used for data exchange
The following is a partial list of popular generic languages used for data exchange in multiple domains.
Nomenclature
* Schemas – Whether the language definition is available in a computer interpretable form
* Flexible – Whether the language enables extension of the semantic expression capabilities without modifying the schema
* Semantic verification – Whether the language definition enables semantic verification of the correctness of expressions in the language
* Dictionary-Taxonomy – Whether the language includes a dictionary and a taxonomy (subtype-supertype hierarchy) of concepts with inheritance
* Synonyms and homonyms – Whether the language includes and supports the use of synonyms and homonyms in the expressions
* Dialecting – Whether the language definition is available in multiple natural languages or dialects
* Web or ISO standard – Organization that endorsed the language as a standard
* Transformations – Whether the language includes a translation to other standards
* Lightweight – Whether a lightweight version is available, in addition to a full version
* Human-readable – Whether expressions in the language are human-readable (readable by humans without training)
* Compatibility – Which other tools can be used, or are required, when using the language
Notes:
# RDF is a schema-flexible language.
# The schema of XML contains a very limited grammar and vocabulary.
# Available as an extension.
# In the default format, not the compact syntax.
# The syntax is fairly simple (the language was designed to be human-readable); the dialects may require domain knowledge.
# The standardized fact types are denoted by standardized English phrases, whose interpretation and use require some training.
# The Parse dialect is used to specify, validate, and transform dialects.
# The English version includes a Gellish English Dictionary-Taxonomy that also includes standardized fact types (= kinds of relations).
XML for data exchange
There are several reasons for the popularity of XML for data exchange on the World Wide Web. First of all, it is closely related to the preexisting standards Standard Generalized Markup Language (SGML) and Hypertext Markup Language (HTML), and as such a parser written to support these two languages can be easily extended to support XML as well. For example, XHTML has been defined as a format that is formal XML, but is understood correctly by most (if not all) HTML parsers.
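The XHTML point can be checked with a small experiment. The sketch below feeds one XHTML string (made up for this example) to both an XML parser and an HTML parser from the Python standard library and compares the element names each one reports. It shows the idea on a single small document only and is not a general compatibility test.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A small XHTML document, invented for this example.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Demo</title></head>"
    "<body><p>Hello<br/>world</p></body>"
    "</html>"
)


class TagCollector(HTMLParser):
    """Records the name of every start tag seen by the HTML parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_via_xml(text):
    """Element names in document order, with the XML namespace stripped."""
    root = ET.fromstring(text)
    return [el.tag.split("}", 1)[-1] for el in root.iter()]


def tags_via_html(text):
    """Element names in document order as seen by the HTML parser."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


xml_tags = tags_via_xml(XHTML)
html_tags = tags_via_html(XHTML)
print(xml_tags == html_tags, xml_tags)
```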
YAML for data exchange
YAML is a language that was designed to be human-readable (and as such to be easy to edit with any standard text editor). Its notation is often similar to reStructuredText or a wiki syntax, which also try to be readable by both humans and computers. YAML 1.2 also includes a shorthand notation that is compatible with JSON, and as such any JSON document is also valid YAML; the converse, however, does not hold.
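The one-way relationship can be illustrated without a YAML library. In the sketch below, both strings are made up for the example: the first is JSON (and therefore, per the statement above, also YAML 1.2), while the second expresses the same data in YAML's indentation-based notation. Only the standard `json` module is used, so the sketch shows that the YAML text is rejected as JSON; it does not itself parse YAML.

```python
import json

# JSON text; per the statement above it is also acceptable as YAML 1.2.
JSON_TEXT = '{"name": "Ada", "languages": ["en", "fr"]}'

# The same data in YAML's indentation-based notation.
YAML_BLOCK_TEXT = """\
name: Ada
languages:
  - en
  - fr
"""


def is_json(text):
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_json(JSON_TEXT))
print(is_json(YAML_BLOCK_TEXT))
```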
REBOL for data exchange
REBOL is a language that was designed to be human-readable and easy to edit using any standard text editor. To achieve that it uses a simple free-form syntax with minimal punctuation and a rich set of datatypes. REBOL datatypes like URLs, emails, date and time values, tuples, strings, tags, etc. respect the common standards. REBOL is designed to not need any additional meta-language, being designed in a metacircular fashion. The metacircularity of the language is the reason why, e.g., the Parse dialect used (not exclusively) for definitions and transformations of REBOL dialects is also itself a dialect of REBOL.
REBOL was used as a source of inspiration for JSON.
Gellish for data exchange
Gellish English is a formalized subset of natural English, which includes a simple grammar and a large extensible English Dictionary-Taxonomy that defines the general and domain-specific terminology (terms for concepts). The concepts are arranged in a subtype-supertype hierarchy (a taxonomy), which supports inheritance of knowledge and requirements. The Dictionary-Taxonomy also includes standardized fact types (also called relation types). The terms and relation types together can be used to create and interpret expressions of facts, knowledge, requirements and other information. Gellish can be used in combination with SQL, RDF/XML, OWL and various other meta-languages. The Gellish standard is a combination of ISO 10303-221 (AP221) and ISO 15926.
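The idea of a taxonomy that supports inheritance of requirements can be sketched with facts stored as (left term, relation type, right term) triples. The terms and relation phrases below are illustrative assumptions made for this example; they are not taken from the Gellish Dictionary-Taxonomy and do not reproduce actual Gellish syntax.

```python
# Hypothetical facts, written as (left term, relation type, right term).
FACTS = [
    ("pump", "is a kind of", "equipment"),
    ("centrifugal pump", "is a kind of", "pump"),
    ("equipment", "shall have", "serial number"),
    ("pump", "shall have", "flow rate"),
]


def supertypes(term):
    """Yield the term followed by all of its supertypes, nearest first."""
    while term is not None:
        yield term
        term = next((right for left, rel, right in FACTS
                     if left == term and rel == "is a kind of"), None)


def requirements(term):
    """Collect requirements stated for the term or inherited from its supertypes."""
    return [right for t in supertypes(term)
            for left, rel, right in FACTS
            if left == t and rel == "shall have"]


print(requirements("centrifugal pump"))
```

Nothing is stated directly about "centrifugal pump" beyond its supertype, yet the lookup returns the requirements of both "pump" and "equipment", which is the inheritance behaviour described above.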
See also
* Atom (file format)
* Lightweight markup language
* RSS
* Comma-separated values (CSV)