In markup languages and the digital humanities, overlap occurs when a document has two or more structures that interact in a non-hierarchical manner.
A document with overlapping markup cannot be represented as a tree.
This is also known as concurrent markup.
Overlap happens, for instance, in poetry, where there may be a metrical structure of feet and lines; a linguistic structure of sentences and quotations; and a physical structure of volumes and pages and editorial annotations.
History
The problem of non-hierarchical structures in documents has been recognised since 1988; resolving it against the dominant paradigm of text as a single hierarchy (an ''ordered hierarchy of content objects'' or ''OHCO'') was initially thought to be merely a technical issue, but has, in fact, proven much more difficult.
In 2008, Jeni Tennison identified markup overlap as "the main remaining problem area for markup technologists".
Markup overlap continues to be a primary issue in the digital study of theological texts in 2019, and is a major reason for the field retaining specialised markup formats (the Open Scripture Information Standard and the Theological Markup Language) rather than the inter-operable Text Encoding Initiative-based formats common to the rest of the digital humanities.
Properties and types
A distinction exists between schemes that allow non-contiguous overlap, and those that allow only contiguous overlap. Often, 'markup overlap' strictly means the latter.
Contiguous overlap can always be represented as a linear document with milestones (typically co-indexed start- and end-markers), without the need for fragmenting a (logical) component into multiple physical ones. Non-contiguous overlap may require document fragmentation. Another distinction in overlapping markup schemes is whether elements can overlap with other elements of the same kind (''self-overlap'').
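The difference between nesting, overlap and self-overlap can be made concrete by treating each component as a character range. The following is a minimal sketch in Python and is not taken from any of the schemes discussed here; the `Span` type, its half-open offsets and the function names are assumptions made purely for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A contiguous component of the text (hypothetical helper type)."""
    kind: str    # e.g. "line" or "sentence"
    start: int   # offset of the first character, inclusive
    end: int     # offset just past the last character, exclusive


def properly_overlaps(a: Span, b: Span) -> bool:
    """True when a and b share text but neither contains the other."""
    shares_text = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return shares_text and not (a_in_b or b_in_a)


def is_self_overlap(a: Span, b: Span) -> bool:
    """True when two components of the same kind overlap each other."""
    return a.kind == b.kind and properly_overlaps(a, b)


line = Span("line", 0, 40)
sentence = Span("sentence", 25, 60)   # starts inside the line, ends after it
quote = Span("quote", 5, 20)          # lies wholly inside the line
```

Under these assumptions `properly_overlaps(line, sentence)` holds, while `line` and `quote` merely nest and could be represented in a single tree.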
A scheme may have a ''privileged'' hierarchy.
Some XML-based schemes, for example, represent one hierarchy directly in the XML document tree, and represent other, overlapping, structures by another means; these are said to be ''non-privileged''.
One tripartite classification of instances of overlap distinguishes: 1. "Variation of content and structure", 2. "Overlay of multiple perspectives or markup sets", and 3. "Overlap of individual start and end tags within a single markup perspective"; additionally, some apparent instances of overlap are in fact schema definition problems, which can be resolved hierarchically.
The same classification contends that type 1 is best resolved by a system of multiple documents external to the markup, but that types 2 and 3 must be dealt with internally.
Approaches and implementations
Several criteria have been identified for judging solutions to the overlap problem:
* readability and maintainability,
* tool support and compatibility with XML,
* possible validation schemes, and
* ease of processing.
Tag soup is, strictly speaking, not overlapping markup: it is malformed HTML, which is a non-overlapping language, and may be ill-defined.
Some web browsers attempted to represent overlapping start and end tags with non-hierarchical Document Object Models (DOM), but this was not standardised across all browsers and was incompatible with the innately hierarchical nature of the DOM.
HTML5 defines how processors should deal with such mis-nested markup in the HTML syntax and turn it into a single hierarchy.
With XHTML and SGML-based HTML, however, mis-nested markup is a strict error and makes processing by standards-compliant systems impossible.
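The strictness of XML-based syntaxes can be observed with any conforming XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module as a stand-in for such a processor; the helper name `is_well_formed` and the sample markup are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup as a single tree."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Properly nested: the italic range lies wholly inside the bold range.
nested = "<p><b>bold <i>both</i></b> plain</p>"

# Mis-nested: the bold range ends while the italic range is still open.
mis_nested = "<p><b>bold <i>both</b> italic</i></p>"
```

Here `is_well_formed(nested)` succeeds, whereas the mis-nested string is rejected outright rather than repaired.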
The HTML standard defines a paragraph concept which can cause overlap with other elements and can be non-contiguous.
SGML, which early versions of HTML were based on, has a feature called CONCUR that allows multiple independent hierarchies to co-exist without privileging any.
DTD validation is only defined for each individual hierarchy with CONCUR. Validation across hierarchies is not defined by the standard. CONCUR cannot support self-overlap, and it interacts poorly with some of SGML's abbreviatory features.
This feature has been poorly supported by tools and has seen very little actual use;
using CONCUR to represent document overlap was not a recommended use case, according to a commentary by the standard's editor.
Within hierarchical languages
There are several approaches to representing overlap in a non-overlapping language.
The Text Encoding Initiative, as an XML-based markup scheme, cannot directly represent overlapping markup.
All four of the approaches below are suggested in its guidelines.
The Open Scripture Information Standard is another XML-based scheme, designed to mark up the Bible.
It uses empty milestone elements to encode non-privileged components.
To illustrate these approaches, marking up the sentences and lines of a fragment of ''Richard III'' by William Shakespeare will be used as a running example. Where there is a privileged hierarchy, the lines will be used.
Multiple documents
''Multiple documents'' can each provide different internally consistent hierarchies. The advantage of this approach is that each document is simple and can be processed with existing tools, but it requires maintenance of redundant content, and it can be difficult to cross-reference between different views. With multiple documents, the overlap can be analysed with data comparison and delta encoding techniques, and, in an XML context, specific XML tree differencing algorithms are available.
This approach has been recommended for encoding multiple variants of a single text, accepting the duplication of the parts which do not vary rather than attempting to create a structure that represents all of the variation present;
the same recommendation suggests that the alignment be performed automatically, and that misalignment is rare in practice.
Example, with lines marked up:
I, by attorney, bless thee from thy mother,
Who prays continually for Richmond's good.
So much for that.—The silent hours steal on,
And flaky darkness breaks within the east.
With sentences marked up:
I, by attorney, bless thee from thy mother, Who prays continually for Richmond's good.
So much for that.
—The silent hours steal on, And flaky darkness breaks within the east.
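As a rough illustration of the multiple-documents approach, the two views can be held as separate XML documents and compared by their plain text. The sketch below is in Python; the element names `text`, `line` and `sentence` and the helper `plain_text` are assumptions made for this example rather than part of any particular scheme.

```python
import xml.etree.ElementTree as ET

# View 1: the verse lines form the hierarchy (element names are hypothetical).
LINES_DOC = """<text>
<line>I, by attorney, bless thee from thy mother,</line>
<line>Who prays continually for Richmond's good.</line>
<line>So much for that.—The silent hours steal on,</line>
<line>And flaky darkness breaks within the east.</line>
</text>"""

# View 2: the sentences form the hierarchy; the content is duplicated.
SENTENCES_DOC = """<text>
<sentence>I, by attorney, bless thee from thy mother,
Who prays continually for Richmond's good.</sentence>
<sentence>So much for that.</sentence><sentence>—The silent hours steal on,
And flaky darkness breaks within the east.</sentence>
</text>"""


def plain_text(doc: str) -> str:
    """Strip all markup and collapse runs of whitespace to single spaces."""
    root = ET.fromstring(doc)
    return " ".join("".join(root.itertext()).split())


# The redundant content must be kept in step by hand or by tooling.
views_agree = plain_text(LINES_DOC) == plain_text(SENTENCES_DOC)
```

Each document parses as an ordinary tree, and a check such as `views_agree` is one simple way to detect when the duplicated content has drifted apart.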
Milestones
''Milestones'' are empty elements that mark the beginning and end of a component, typically using the XML ID mechanism to indicate which "begin" element goes with which "end" element. Milestones can be used to embed a non-privileged structure within a hierarchical language. In their basic form they can only represent contiguous overlap. Generic XML tools can of course parse the milestone elements, but do not understand their special meaning and so cannot easily process or validate the non-privileged structure.
Milestones have the advantage that the markup for overlapping elements is located right at the relevant boundaries, like other markup. This is an advantage for maintainability and readability. CLIX is an example of such an approach.
Example:
I, by attorney, bless thee from thy mother,
Who prays continually for Richmond's good.
So much for that.—The silent hours steal on,
And flaky darkness breaks within the east.
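A milestone encoding of the running example might look like the following sketch, written in Python so that the non-privileged sentences can be rebuilt from the markers. The element names `sentence-start` and `sentence-end`, the `id` and `ref` attributes, and the function `sentences` are all assumptions for illustration; real schemes use their own vocabularies.

```python
import xml.etree.ElementTree as ET

# Lines are the privileged hierarchy; sentences are marked by empty elements.
MILESTONE_DOC = """<text>
<line><sentence-start id="s1"/>I, by attorney, bless thee from thy mother,</line>
<line>Who prays continually for Richmond's good.<sentence-end ref="s1"/></line>
<line><sentence-start id="s2"/>So much for that.<sentence-end ref="s2"/><sentence-start id="s3"/>—The silent hours steal on,</line>
<line>And flaky darkness breaks within the east.<sentence-end ref="s3"/></line>
</text>"""


def sentences(doc: str) -> dict:
    """Rebuild each sentence by collecting text between paired milestones."""
    open_chunks = {}   # milestone id -> text seen since its start marker
    finished = {}      # milestone id -> reconstructed sentence

    def feed(text):
        if text:
            for chunks in open_chunks.values():
                chunks.append(text)

    def walk(elem):
        if elem.tag == "sentence-start":
            open_chunks[elem.get("id")] = []
        elif elem.tag == "sentence-end":
            chunks = open_chunks.pop(elem.get("ref"))
            finished[elem.get("ref")] = " ".join("".join(chunks).split())
        feed(elem.text)
        for child in elem:
            walk(child)
            feed(child.tail)   # text that follows the child in document order

    walk(ET.fromstring(doc))
    return finished


result = sentences(MILESTONE_DOC)
```

The extra walking code is the point of the example: a generic XML parser yields only the tree of lines, and the sentence structure has to be recovered by a tool that knows what the milestones mean.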
Punctuation and spaces have been identified as a type of milestone-style 'crypto-overlap' or 'pseudo-markup', as the boundaries of words, clauses, sentences and the like do not necessarily align with the formal markup boundaries hierarchically.
It is also possible to use more complex milestones to represent non-contiguous structures. For example, TAGML's "suspend" and "resume" semantics can be expressed using milestones by adding an attribute to indicate whether each milestone represents a start, suspend, resume, or end point. Re-ordering and even self-overlap can be achieved similarly, by annotating each milestone with a "next chunk" reference.
Joins
''Joins'' are pointers within a privileged hierarchy to other components of the privileged hierarchy, which may be used to reconstruct a non-privileged component akin to following a linked list. A single non-privileged element is ''segmented'' into several ''partial'' elements within the privileged hierarchy; the partial elements themselves do not represent a single unit in the non-privileged hierarchy, which can be misleading and make processing difficult. While this approach can support some discontiguous structures, it is not able to re-order elements. A slightly different approach can, however, express re-ordering by expressing the join away from the content, at the cost of directness and maintainability.
Join-based representations can introduce the possibility of cycles between elements; detecting and rejecting these adds complexity to implementations.
Example:
I, by attorney, bless thee from thy mother,
Who prays continually for Richmond's good.
So much for that.—The silent hours steal on,
And flaky darkness breaks within the east.
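The linked-list character of joins, and the need to guard against cycles, can be sketched as follows in Python. The partial `sentence` elements, the `id` and `next` attributes and the function `join_sentences` are assumptions for illustration only; actual schemes name and direct their pointers differently.

```python
import xml.etree.ElementTree as ET

# Each sentence is segmented into partial elements inside the lines;
# a hypothetical "next" attribute points at the following part.
JOIN_DOC = """<text>
<line><sentence id="a" next="b">I, by attorney, bless thee from thy mother,</sentence></line>
<line><sentence id="b">Who prays continually for Richmond's good.</sentence></line>
<line><sentence id="c">So much for that.</sentence><sentence id="d" next="e">—The silent hours steal on,</sentence></line>
<line><sentence id="e">And flaky darkness breaks within the east.</sentence></line>
</text>"""


def join_sentences(doc: str) -> list:
    """Follow the pointers from each chain head to rebuild whole sentences."""
    parts = {s.get("id"): s for s in ET.fromstring(doc).iter("sentence")}
    continuations = {s.get("next") for s in parts.values() if s.get("next")}
    visited = set()
    joined = []
    for part_id, part in parts.items():
        if part_id in continuations:
            continue   # reached through another part, so not a chain head
        chunks, current = [], part
        while current is not None:
            current_id = current.get("id")
            if current_id in visited:
                raise ValueError(f"cycle in join chain at {current_id!r}")
            visited.add(current_id)
            chunks.append(current.text or "")
            target = current.get("next")
            # A dangling pointer raises KeyError here.
            current = parts[target] if target else None
        joined.append(" ".join(chunks))
    if len(visited) != len(parts):
        # Parts never reached from any head can only belong to a closed loop.
        raise ValueError("cycle in join chain: some parts have no head")
    return joined


result = join_sentences(JOIN_DOC)
```

Note that none of the partial elements is itself a sentence; the unit only exists after the chain has been followed, which is the processing cost described above.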
Stand-off markup
''Stand-off markup'' is similar to using joins, except that there may be no privileged hierarchy: each part of the document is given a label (or might be referred to by an offset), and the document structure is expressed by pointing to the content from markup that 'stands off' from the content (possibly in an entirely different file), and might contain no content itself. The TEI guidelines identify the unity of the elements as a primary advantage of stand-off markup over joins, in addition to the ability to produce and distribute annotations separately from the text, possibly even by different authors applying markup to a read-only document, allowing collaborative approaches to markup by a divide and conquer strategy.
Example:
I, by attorney, bless thee from thy mother,
Who prays continually for Richmond's good.
So much for that.—The silent hours steal on,
And flaky darkness breaks within the east.
...
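Stand-off markup that refers to the content by offset can be sketched in a few lines of Python. This is an illustrative assumption rather than the format of any particular standard: the tuple layout `(kind, start, end)` and the helpers `find_span` and `crosses` are invented for the example, and offsets are computed from phrases so that no position has to be counted by hand.

```python
from itertools import combinations

# The content is stored once, free of markup.
TEXT = (
    "I, by attorney, bless thee from thy mother,\n"
    "Who prays continually for Richmond's good.\n"
    "So much for that.—The silent hours steal on,\n"
    "And flaky darkness breaks within the east."
)


def find_span(kind, first_words, last_words):
    """Build a (kind, start, end) annotation with half-open offsets."""
    start = TEXT.index(first_words)
    end = TEXT.index(last_words, start) + len(last_words)
    return (kind, start, end)


# The annotations stand apart from the text and could live in another file.
ANNOTATIONS = [
    find_span("line", "I, by", "mother,"),
    find_span("line", "Who prays", "good."),
    find_span("line", "So much", "steal on,"),
    find_span("line", "And flaky", "east."),
    find_span("sentence", "I, by", "good."),
    find_span("sentence", "So much", "that."),
    find_span("sentence", "—The", "east."),
]


def crosses(a, b):
    """True when two ranges overlap without one containing the other."""
    (_, s1, e1), (_, s2, e2) = a, b
    return (s1 < s2 < e1 < e2) or (s2 < s1 < e2 < e1)


crossing = [(a, b) for a, b in combinations(ANNOTATIONS, 2) if crosses(a, b)]
```

In this fragment only the third line and the third sentence cross; every other pair nests. Because each annotation is a single unit, no component has to be split into partial elements as with joins.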
It has been claimed that separating markup and text can result in overall simplification and increased maintainability. By 2017, it was reported that "[t]he current state of the art to [r]epresent (...) linguistically annotated data is to use a graph-based representation serialized as standoff XML as a pivot format", i.e., that standoff was the most widely accepted approach to addressing the overlapping markup challenge.
Standoff formalisms have been the basis for an ISO standard for linguistic annotation, have been successfully applied in developing corpus management systems, and (as of April 2020) are actively being developed in the TEI. One published example of a successful stand-off annotation scheme was developed as part of a bitext natural language documentation project focused on the preservation of low-resource or endangered languages (Xia, F., Lewis, W.D., Goodman, M.W. et al. Enriching a massively multilingual database of interlinear glossed text. Lang Resources & Evaluation 50, 321–349 (2016). https://doi.org/10.1007/s10579-015-9325-4).
Challenges
Representing overlapping markup within hierarchical languages is challenging, for reasons of redundancy and/or complexity. In the 2000s to 2010s, standoff formalisms were generally accepted as the most promising approach here, but a disadvantage of standoff is that validation is very challenging.
Standoff formalisms are not natively supported by database management systems, so that (by 2017) it was suggested to "use ... standoff XML as a pivot format (...) and relational data bases for querying." In practical applications, this requires complicated architectures and/or labor-intensive transformation between the pivot format and the internal representation. As a result, maintenance is problematic. This has been a motivation to develop corpus management systems on the basis of graph databases and to use established graph-based formalisms as pivot formats.
Special-purpose languages
For implementing the above-mentioned strategies, either existing markup languages (such as the TEI) can be extended or special-purpose languages can be designed.
Historical formalisms
* LMNL is a non-hierarchical markup language first described in 2002 by Jeni Tennison and Wendell Piez, annotating ranges of a document with properties and allowing self-overlap. CLIX, which originally stood for 'Canonical LMNL In XML', provides a method for representing any LMNL document in a milestone-style XML document. It also has another XML serialisation, xLMNL.
* MECS was developed by the University of Bergen's Wittgenstein Archive. However, it had several problems: it allowed some nonsensical documents of overlapping elements, it could not support self-overlap, and it did not have the capacity to define a DTD-like grammar. The theory of General Ordered-Descendant Directed Acyclic Graphs (GODDAGs), while not strictly a markup language itself, is a general data model for non-hierarchical markup. ''Restricted GODDAGs'' were designed specifically to match the semantics of MECS; general GODDAGs may be non-contiguous and need a more powerful language. TexMECS is a successor to MECS, which has a formal grammar and is designed to represent every GODDAG and nothing that is not a GODDAG.
* XCONCUR (previously MuLaX) is a melding-together of XML and SGML's CONCUR, and also contains a validation language, XCONCUR-CL, and a SAX-like API.
* Marinelli, Vitali and Zacchiroli provide algorithms to convert between restricted GODDAGs, ECLIX, LMNL, parallel documents in XML, contiguous stand-off markup and TexMECS.
None of these formalisms seems to be maintained any more. The community consensus seems to be to employ standoff XML or graph-based formalisms.
Actively maintained standoff XML languages
* GrAF-XML, standoff-XML serialization of the Linguistic Annotation Framework (LAF), used, e.g., for the American National Corpus
* PAULA-XML, standoff-XML serialization of the data model underlying the corpus management system ANNIS and the converter suite SALT
* NAF (NLP Annotation Format / Newsreader Annotation Format), standoff XML format originally developed in the NewsReader project (FP7, 2013-2015), currently used by NLP tools such as FreeLing (with support for English, Spanish, Portuguese, Italian, French, German, Russian, Catalan, Galician, Croatian, Slovene, etc.), and EusTagger (with support for Basque, English, Spanish).
* The Charles Harpur Critical Archive is encoded using 'multi-version documents' (MVD) to represent the variant versions of documents and as a means of indicating additions, deletions and revisions using a tactical combination of multiple documents and stand-off ranges within an underlying graph-based model. MVD is presented as an application file format, requiring specialised tools to view or edit.
* A standoff XML scheme was developed by the ''Odin'', ''Intent'', and ''XigtEdit'' collaboration, which is focused on a large dataset of Interlinear Glossed Text (IGT) for supporting natural language resource and documentation projects.
Standoff approaches have two parts, commonly called the "content" and the "annotations". These can be expressed in unrelated representations. Simple standoff annotations per se involve no more than a list of (location, type) pairs. Thus, in a few applications standoff annotations are expressed in CSV, JSON (for example, in Web Annotation) or graph formalisms grounded in string URIs (see below). However, representing and validating content in such representations is much more difficult and much less common.
Graph-based formalisms
Standoff markup employs a data model based on directed graphs, thus complicating its representation when grounding markup information in a tree. Representing overlapping hierarchies in a graph eliminates this challenge. Standoff annotations can thus be more adequately represented as generalised directed multigraphs and use formalisms and technologies developed for this purpose, most notably those based on the Resource Description Framework (RDF).
EARMARK is an early RDF/OWL representation that encompasses General Ordered-Descendant Directed Acyclic Graphs (GODDAGs). The theory of GODDAGs, while not strictly a markup language itself, is a general data model for non-hierarchical markup.
RDF is a semantic data model that is linearisation-independent, and it provides different linearisations, including an XML format (RDF/XML), an attribute-level format for embedding in HTML and XML documents (RDFa), a JSON format (JSON-LD), and binary formats designed to facilitate querying or processing (RDF-HDT, RDF-Thrift). RDF is semantically equivalent to graph-based data models underlying standoff markup; it does not require special-purpose technology for storing, parsing and querying. Multiple interlinked RDF files representing a document or a corpus constitute an example of Linguistic Linked Open Data.
An established technique to link arbitrary graphs with an annotated document is to use URI fragment identifiers to refer to parts of a text and/or document (see the overview under Web annotation). The Web Annotation standard provides format-specific 'selectors' as an additional means, e.g., offset-, string-match- or XPath-based selectors.
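Offset-based and string-match selectors can be illustrated with a small Python sketch. The selector type names `TextPositionSelector` (with `start` and `end`) and `TextQuoteSelector` (with `exact`, `prefix` and `suffix`) follow the W3C Web Annotation Data Model; the example URLs, the annotation body and the `resolve` helper are assumptions made for this illustration and are not part of the standard.

```python
import json

TEXT = "So much for that.—The silent hours steal on,"

# A hypothetical annotation marking the first sentence by character offsets.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "id": "http://example.org/anno/1",            # made-up identifier
    "type": "Annotation",
    "body": {"type": "TextualBody", "value": "sentence"},
    "target": {
        "source": "http://example.org/text.txt",  # made-up source document
        "selector": {
            "type": "TextPositionSelector",
            "start": 0,
            "end": len("So much for that."),      # end offset is exclusive
        },
    },
}

# The same kind of target expressed by string matching instead of offsets.
quote_selector = {
    "type": "TextQuoteSelector",
    "exact": "The silent hours",
    "prefix": "—",
}


def resolve(selector: dict, text: str):
    """Return the text a selector picks out, or None if it does not match."""
    if selector["type"] == "TextPositionSelector":
        return text[selector["start"]:selector["end"]]
    if selector["type"] == "TextQuoteSelector":
        prefix = selector.get("prefix", "")
        needle = prefix + selector["exact"] + selector.get("suffix", "")
        at = text.find(needle)
        if at < 0:
            return None
        start = at + len(prefix)
        return text[start:start + len(selector["exact"])]
    raise ValueError(f"unsupported selector type: {selector['type']}")


serialised = json.dumps(annotation, indent=2)
```

Position selectors are compact but break if the source text is edited, whereas quote selectors survive small edits at the cost of possible ambiguity; which trade-off is preferable depends on the application.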
Native RDF vocabularies capable of representing linguistic annotations include:
* Web Annotation
* NLP Interchange Format (NIF)
* LAPPS Interchange Format (LIF)
Related vocabularies include:
* POWLA, an OWL2/DL serialization of PAULA-XML
* RDF-NAF, an RDF serialization of the NLP Annotation Format
In early 2020, the W3C Community Group LD4LT launched an initiative to harmonize these vocabularies and to develop a consolidated RDF vocabulary for linguistic annotations on the web.