Structured document

A structured document is an electronic document where some method of markup is used to identify the whole and parts of the document as having various meanings beyond their formatting. For example, a structured document might identify a certain portion as a "chapter title" (or "code sample" or "quatrain") rather than as "Helvetica bold 24" or "indented Courier". Such portions are commonly called "components" or "elements" of a document.


Overview

Structured documents generally focus on labeling things in ways that can serve a variety of processing purposes, not merely formatting. For example, explicit labeling of "chapter title" or "emphasis" is far more useful to systems for the visually impaired than merely "Helvetica bold 24" or "italic". In the same way, meaningful labeling of the many items on a technical information sheet enables far better integration with databases, search systems, online catalogs, and so on.

Structured documents generally support at least hierarchical structures: for example, lists, not merely list items; sections, not merely section headings; and so on. This is in stark contrast to formatting-oriented systems. High-end systems also support multiple independent and/or overlapping sets of components.

Structured document systems commonly permit creating explicit rules defining component types and how they may be combined. Such a set of rules is called a "schema" by analogy with database schemas. Several formal languages exist for specifying schemas, such as XSD, RELAX NG, and Schematron. A structured document that obeys the rules of a schema is commonly called "valid according to that schema". Some systems also support documents with components of arbitrary types and combinations, but still with syntactic rules for how those components are identified.

Lie and Saarela noted that the "Standard Generalized Markup Language (SGML) has pioneered the concept of structured documents", although earlier systems such as Scribe, Augment, and FRESS provided many structured-document features and capabilities, and SGML's offspring XML is now favored.

One very widely used representation for structured documents is HTML, a schema defined and described by the W3C. However, HTML has tags not only for meaning-oriented components such as paragraph, title, and code, but also for format-oriented ones such as italic, bold, and most table markup. In practice, HTML is sometimes used as a structured document system, but often used as a formatting language.

Many domains use structured documents via domain-specific schemas they have co-operatively developed, such as JATS for journal publishing, TEI for literary documents, UBL and EDI for business interchange, XTCE for spacecraft telemetry, REST for Web interfaces, and countless more. Most of these use specific schemas based on XML.
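The idea of a schema and of validity can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the `SCHEMA` dictionary and the `validate` function are hypothetical, and the rule format (each component type lists the child types it may contain) is a toy stand-in, not XSD, RELAX NG, or Schematron.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema: each component type maps to the set of
# component types it may directly contain. Not a real schema language.
SCHEMA = {
    "book": {"title", "chapter"},
    "chapter": {"title", "para", "list"},
    "list": {"item"},
    "title": set(),
    "para": {"emphasis"},
    "item": {"emphasis"},
    "emphasis": set(),
}


def validate(element, schema=SCHEMA, path=""):
    """Return a list of error strings; an empty list means the tree is valid."""
    here = f"{path}/{element.tag}"
    if element.tag not in schema:
        return [f"{here}: unknown component type"]
    errors = []
    for child in element:
        if child.tag not in schema[element.tag]:
            errors.append(f"{here}: <{child.tag}> not allowed here")
        else:
            # Only descend into children that are permitted at this point.
            errors.extend(validate(child, schema, here))
    return errors


good = ET.fromstring(
    "<book><title>T</title><chapter><title>One</title>"
    "<list><item>a</item><item>b</item></list></chapter></book>"
)
bad = ET.fromstring("<book><para>loose text</para></book>")

print(validate(good))  # no errors: valid according to the toy schema
print(validate(bad))   # one error: <para> is not allowed directly in <book>
```

Note that the toy schema is hierarchical: it describes lists as well as list items, and chapters as well as chapter titles, which is the property the text above contrasts with formatting-oriented systems.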


Structural semantics

In writing structured documents the focus is on encoding the logical structure of a document, with less or even no explicit work devoted to its presentation to humans on printed pages or screens (in some cases, no such use is even expected). Structured documents can easily be processed by computer systems to extract and present derivative forms of the document. In most Wikipedia articles, for example, a table of contents is automatically generated from the heading tags in the body of the document. Because the SGML conversion of the Oxford English Dictionary explicitly distinguished the many different meanings attached to the print version's use of italics, search tools can retrieve entries based on etymology, quotations, and many other features of interest. When HTML provides structural rather than merely formatting information, visually impaired users can easily be given a more useful reading interface. When travel companies provide itineraries as structured documents rather than just displays, user tools can easily extract the necessary facts and pass them on to calendar or other applications.

In HTML, part of the logical structure of a document may be the document body, <body>, containing a first-level heading, <h1>, and a paragraph, <p>.

    <body>
      <h1>Structured document</h1>
      <p>A <strong>structured document</strong> is an electronic document where some method of markup is used to identify the whole and parts of the document as having various meanings beyond their formatting.</p>
    </body>

One of the most attractive features of structured documents is that they can be reused in many contexts and presented in various ways on mobile phones, TV screens, speech synthesisers, and any other device which can be programmed to process them.
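As one illustration of deriving another form of a document from its structure, the sketch below builds an indented table of contents from the heading tags of an HTML page, using Python's standard `html.parser` module. The class name `HeadingCollector`, the function `table_of_contents`, and the sample page are assumptions made for this example; it is not how any particular wiki generates its contents list.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def table_of_contents(html):
    """Return an indented outline, two spaces per heading level below h1."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return "\n".join("  " * (level - 1) + text for level, text in collector.headings)


page = """
<body>
  <h1>Structured document</h1>
  <p>Intro text.</p>
  <h2>Overview</h2>
  <p>More text.</p>
  <h2>Structural semantics</h2>
</body>
"""
print(table_of_contents(page))
```

The same extraction would not be possible if the headings were marked only as "Helvetica bold 24": the program relies on the markup saying what each component is, not how it looks.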


Other semantics

Other meaning can be ascribed to text which isn't "structural" in quite the same sense as larger objects, but is still considered "document structure" because it expresses claims about the scope and nature, or ontology, of portions of a document, rather than instructions about its presentation. In the HTML fragment above, the <strong> element means that the enclosed text is emphatic. In visual terms this is commonly rendered in bold, just like <b>; but a speech interface would instead likely use voice inflection. The term semantic markup excludes markup like <b>, which directly expresses no meaning other than an instruction to a visual display (although an intelligent agent may be able to discern a structural meaning lurking behind the tag).

The "strong" tag is "descriptive" or "structural" in that it is intended to label an abstract, quasi-linguistic property of its content, rather than to describe the appropriate presentation in some particular medium. Some other structural tags in HTML include <abbr>, <acronym>, <address>, <cite>, <del>, <dfn>, <ins>, <kbd>, and <q>. Other schemas such as DocBook and TEI have far larger selections.

The anchor tag, <a>, is used for a slightly different kind of structure, namely the interconnection or cross-reference structure, rather than the division into sections. This is most definitely structure, and in fact it is possible to create alternate markup for documents that expresses the same particular structures in either way (for example, using transclusion to represent section contents, rather than navigational hyperlink presentations).

HTML from early on has also had tags which express presentational semantics, such as bold (<b>) or italic (<i>), or which alter font sizes or have other effects on the presentation. Modern versions of markup languages discourage such markup in favor of descriptive markup which is mapped to particular presentations via style sheets, a method pioneered by systems such as Scribe and FRESS. Different style sheets can be attached to any markup, semantic or presentational, to produce different presentations, although mapping a tag named "italic" to a boldface presentation is not entirely intuitive.
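
The mapping from descriptive markup to a presentation can be sketched as follows. This is a minimal illustration, not a model of CSS or of any real style sheet language: the `VISUAL` and `SPEECH` dictionaries, the bracketed speech cues, and the `render` function are all hypothetical names chosen for the example. The same marked-up paragraph is rendered once for a visual display and once for a speech interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical "style sheets": each maps an element name to a function
# that decorates the already-rendered content of that element.
VISUAL = {
    "strong": lambda s: f"**{s}**",
    "em": lambda s: f"*{s}*",
}
SPEECH = {
    "strong": lambda s: f"[stress]{s}[/stress]",
    "em": lambda s: f"[raise pitch]{s}[/raise pitch]",
}


def render(element, stylesheet):
    """Render an element tree to a string using the given style sheet."""
    parts = [element.text or ""]
    for child in element:
        parts.append(render(child, stylesheet))
        parts.append(child.tail or "")  # text following the child element
    content = "".join(parts)
    # Elements without a rule are passed through unchanged.
    style = stylesheet.get(element.tag, lambda s: s)
    return style(content)


doc = ET.fromstring(
    "<p>A <strong>structured document</strong> is <em>marked up</em>.</p>"
)
print(render(doc, VISUAL))
print(render(doc, SPEECH))
```

Because the markup says only that a span is emphatic, each style sheet is free to choose its own realization; a document marked up with a tag meaning "bold" would give the speech style sheet nothing meaningful to map.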


Context and intent

In principle, just what constitutes "structure" vs. non-structure can vary. In a book specifically about typography, tagging something as "italic" or "bold" may well be the whole point. For example, a discussion of when to use particular styles will likely want to give examples and counter-examples, which would no longer make sense if the rendering is not in sync with the prose. Similarly, a particular edition of a document may be of interest not only for its content but for its typographic practice as well, in which case describing that practice is not only desirable but necessary. This problem is not unique to document structure, however; it also arises in grammar when discussing grammar, and in many other cases.


See also

* Document processor
* Machine-readable document
* Overlapping markup
* Structured writing

