Standard versions
SGML is an ...
History
SGML descended from ...
Document validity
SGML (ENR+WWW) defines two kinds of validity. According to the revised Terms and Definitions of ISO 8879 (from the public draft):

"A conforming SGML document must be either a type-valid SGML document, a tag-valid SGML document, or both. Note: A user may wish to enforce additional constraints on a document, such as whether a document instance is integrally-stored or free of entity references."

A type-valid SGML document is defined by the standard as

"An SGML document in which, for each document instance, there is an associated document type declaration (DTD) to whose DTD that instance conforms."

A tag-valid SGML document is defined by the standard as

"An SGML document, all of whose document instances are fully tagged. There need not be a document type declaration associated with any of the instances. Note: If there is a document type declaration, the instance can be parsed with or without reference to it."
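The difference between the two definitions can be sketched in Python. This is only a rough illustration under loud assumptions: "fully tagged" is approximated as "every start tag has a matching, properly nested end tag", and the DTD is replaced by a hypothetical table mapping each element name to the child names it may contain. The function names and the toy table are invented for this example and ignore attributes, entities, and real content models.

```python
import re

# Assumed toy pattern for reference-syntax start and end tags.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*>")


def element_events(instance):
    """Yield ("start" | "end", NAME) pairs; names are upper-cased."""
    for match in TAG.finditer(instance):
        yield ("end" if match.group(1) else "start"), match.group(2).upper()


def is_fully_tagged(instance):
    """Approximate tag-validity: tags are explicit and properly nested."""
    stack = []
    for kind, name in element_events(instance):
        if kind == "start":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


def conforms_to_toy_dtd(instance, root, allowed_children):
    """Approximate type-validity for fully tagged input only.

    allowed_children is a hypothetical stand-in for a DTD: it maps an
    upper-case element name to the set of child names it may contain.
    """
    stack = []
    for kind, name in element_events(instance):
        if kind == "start":
            if stack:
                if name not in allowed_children.get(stack[-1], set()):
                    return False
            elif name != root:
                return False
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


doc = "<QUOTE>typically something like <ITALICS>this</ITALICS></QUOTE>"
print(is_fully_tagged(doc))
print(conforms_to_toy_dtd(doc, "QUOTE", {"QUOTE": {"ITALICS"}}))
```

The first check needs no declarations at all, which mirrors the note that a tag-valid document can be parsed with or without reference to a DTD; the second check cannot run without one.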
Terminology
''Tag-validity'' was introduced in SGML (ENR+WWW) to support ...
Syntax
An SGML document may have three parts:
# the SGML Declaration,
# the Prologue, containing a DOCTYPE declaration with the various ''markup declarations'' that together make a Document Type Definition (DTD), and
# the instance itself, containing one top-most element and its contents.
An SGML document may be composed from many entities (discrete pieces of text). In SGML, the entities and element types used in the document may be specified with a DTD; the different character sets, features, delimiter sets, and keywords are specified in the SGML Declaration to create the ''concrete syntax'' of the document.
Although full SGML allows implicit markup and some other kinds of tags, the ...
Optional features
SGML generalizes and supports a wide range of markup languages as found in the mid-1980s. These ranged from terse Wiki-like syntaxes to RTF-like bracketed languages to ...
Concrete and abstract syntaxes
The usual (default) SGML ''concrete syntax'' resembles this example:

    <QUOTE TYPE="example">
      typically something like <ITALICS>this</ITALICS>
    </QUOTE>

Other concrete syntaxes are possible. In a GML-style syntax, for example, an :e prefix denotes an end tag: :xmp.Hello, world:exmp.

According to the reference syntax, letter case (upper- or lower-case) is not distinguished in tag names, so the three tags <quote>, <QUOTE>, and <quOtE> are equivalent. (A concrete syntax might change this rule via the NAMECASE NAMING declarations.)
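As a rough illustration of these two points (an alternative delimiter set, and case-insensitive tag names), the following Python sketch rewrites tags that use the :e end-tag prefix into angle-bracket tags and compares tag names without regard to case. The function names, the `known_names` parameter, and the regular expression are assumptions made for this example; a real SGML system would take its delimiters and its NAMECASE setting from the SGML Declaration.

```python
import re

# Assumed toy pattern for a colon-delimited tag: a colon, a name, a full stop.
GML_TAG = re.compile(r":([A-Za-z][A-Za-z0-9]*)\.")


def gml_to_reference(text, known_names):
    """Rewrite :name. and :ename. tags as <name> and </name>.

    known_names lists the element names of a hypothetical document type;
    it is needed to tell an end tag such as :exmp. from a start tag.
    """
    def rewrite(match):
        name = match.group(1)
        if name in known_names:
            return f"<{name}>"
        if name.startswith("e") and name[1:] in known_names:
            return f"</{name[1:]}>"
        return match.group(0)  # not a known tag, leave the text alone

    return GML_TAG.sub(rewrite, text)


def same_tag_name(a, b):
    """Compare tag names as the reference syntax does: case is ignored."""
    return a.upper() == b.upper()


print(gml_to_reference(":xmp.Hello, world:exmp.", {"xmp"}))  # <xmp>Hello, world</xmp>
print(same_tag_name("quote", "quOtE"))  # True
```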
Markup minimization
SGML has features for reducing the number of characters required to mark up a document, which must be enabled in the SGML Declaration. SGML processors need not support every available feature, thus allowing applications to tolerate many types of inadvertent markup omissions; however, SGML systems usually are intolerant of invalid structures. XML is intolerant of syntax omissions, and does not require a DTD for checking well-formedness.
OMITTAG
Both start tags and end tags may be omitted from a document instance, provided:
# the OMITTAG feature is enabled in the SGML Declaration,
# the DTD indicates that the tags are permitted to be omitted,
# (for start tags) the element has no associated required (#REQUIRED) attributes, and
# the tag can be unambiguously inferred by context.
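The last two conditions can be sketched for the simpler case of end tags. The Python below is a toy under stated assumptions, not how an SGML parser is built: the DTD is reduced to a hypothetical table of allowed child names plus a set of elements whose end tag may be omitted, the element names `list` and `item` are invented, and start-tag omission and real content models are not handled.

```python
def imply_end_tags(tokens, allowed_children, end_omissible):
    """Insert the end tags that the toy declarations allow to be left out.

    tokens is a list of ("start", name), ("end", name) or ("text", data).
    """
    out, stack = [], []

    def close_top():
        # An element may be closed implicitly only if its end tag is omissible.
        if stack[-1] not in end_omissible:
            raise ValueError(f"end tag for {stack[-1]} may not be omitted")
        out.append(("end", stack.pop()))

    for kind, value in tokens:
        if kind == "start":
            # Close open elements that cannot contain the new element.
            while stack and value not in allowed_children.get(stack[-1], set()):
                close_top()
            stack.append(value)
        elif kind == "end":
            # Close everything opened inside the element being ended.
            while stack and stack[-1] != value:
                close_top()
            if not stack:
                raise ValueError(f"unexpected end tag for {value}")
            stack.pop()
        out.append((kind, value))

    while stack:
        close_top()
    return out


def render(tokens):
    """Serialize tokens back to angle-bracket markup."""
    forms = {"start": "<{}>", "end": "</{}>", "text": "{}"}
    return "".join(forms[kind].format(value) for kind, value in tokens)


# Hypothetical declarations: a list holds items; an item's end tag is omissible.
allowed = {"list": {"item"}, "item": set()}
omissible = {"item"}
sparse = [("start", "list"), ("start", "item"), ("text", "one"),
          ("start", "item"), ("text", "two"), ("end", "list")]
print(render(imply_end_tags(sparse, allowed, omissible)))
# <list><item>one</item><item>two</item></list>
```

Each implied end tag is forced by context: a second `item` cannot appear inside the first, so the first must already have ended.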
For example, if OMITTAG YES is specified in the SGML Declaration (enabling the OMITTAG feature), and the DTD includes the following declarations: ...
... EMPTY as defined in the DTD: ...
SHORTREF
Tags can be replaced with delimiter strings, for a terser markup, via the SHORTREF feature. This markup style is now associated with ...
SHORTTAG
SGML markup languages whose concrete syntax enables the SHORTTAG VALUE feature do not require attribute values containing only alphanumeric characters to be enclosed within quotation marks, either double " " (LIT) or single ' ' (LITA), so that the previous markup example could be written:

    <QUOTE TYPE=example>
      typically something like <ITALICS>this</>
    </QUOTE>

The empty end tag </> in <ITALICS>this</> "inherits" its value from the nearest previous full start tag, which, in this example, is <ITALICS> (in other words, it closes the most recently opened item). The expression is thus equivalent to <ITALICS>this</ITALICS>.
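The rule that an empty end tag closes the most recently opened element maps naturally onto a stack. The following Python sketch expands `</>` into a full end tag; the function name and the regular expression are assumptions for illustration, and the sketch ignores comments, marked sections, and elements whose tags are omitted.

```python
import re

# Either an end tag (the name is optional, so </> matches) or a start tag.
TAG = re.compile(r"</([A-Za-z][\w.-]*)?>|<([A-Za-z][\w.-]*)[^<>]*>")


def expand_empty_end_tags(markup):
    """Replace each </> with the end tag of the most recently opened element."""
    open_elements = []

    def rewrite(match):
        end_name, start_name = match.group(1), match.group(2)
        if start_name:
            open_elements.append(start_name)
            return match.group(0)
        if not open_elements:
            raise ValueError("end tag with no open element")
        current = open_elements.pop()
        if end_name is None:
            return f"</{current}>"  # the empty end tag inherits the name
        if end_name.upper() != current.upper():
            raise ValueError(f"</{end_name}> does not close <{current}>")
        return match.group(0)

    return TAG.sub(rewrite, markup)


print(expand_empty_end_tags("<ITALICS>this</>"))  # <ITALICS>this</ITALICS>
```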
NET
Another feature is the ''NET'' (Null End Tag) construction: <ITALICS/this/, which is structurally equivalent to <ITALICS>this</ITALICS>.
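The equivalence can be shown with a one-line rewrite. This Python sketch handles only the simplest case, a NET-enabling start tag without attributes whose content contains no further markup and no slash; the pattern and function name are assumptions for the example.

```python
import re

# Assumed toy pattern: <NAME/ content /, where content has no "/" or "<".
NET_FORM = re.compile(r"<([A-Za-z][\w.-]*)/([^/<]*)/")


def expand_net(markup):
    """Rewrite <NAME/content/ as <NAME>content</NAME>."""
    return NET_FORM.sub(r"<\1>\2</\1>", markup)


print(expand_net("<ITALICS/this/"))  # <ITALICS>this</ITALICS>
```

Because the first slash closes the start tag and the next slash ends the element, content that itself contains a slash cannot be written this way in the sketch.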
Other features
Additionally, the SHORTTAG NETENABL IMMEDNET feature allows shortening tags surrounding an empty text value, but forbids shortening full tags: <QUOTE></QUOTE> can be written as <QUOTE//, wherein the first slash ( / ) stands for the NET-enabling "start-tag close" (NESTC), and the second slash stands for the NET. NOTE: XML defines NESTC with a /, and NET with a > (angle bracket); hence the corresponding construct in XML appears as <QUOTE/>.

The third feature is 'text on the same line', allowing a markup item to be ended with a line-end; this is especially useful for headings and such, and requires using either SHORTREF or DATATAG minimization. For example, if the DTD includes the following declarations:

    ...

(and "&#RE;&#RS;" is a short-reference delimiter in the concrete syntax), then:

    first line
    second line

is equivalent to the same two lines with each one explicitly tagged as an element.
Formal characterization
SGML has many features that defied convenient description with the popular formal automata theory and the contemporary parser technology of the 1980s and the 1990s. The standard includes a warning to this effect in Annex H. A report on an early implementation of a parser for basic SGML, the Amsterdam SGML Parser, notes and specifies various differences.

There appears to be no definitive classification of full SGML against a known class of formal grammar. Plausible classes may include tree-adjoining grammars and adaptive grammars. XML is described as being generally parsable like a two-level grammar for non-validated XML and a Conway-style pipeline of coroutines (lexer, parser, validator) for valid XML. The SGML productions in the ISO standard are reported to be LL(3) or LL(4). XML-class subsets are reported to be expressible using a W-grammar. According to one paper, probably considered at an ''information set'' or parse tree level rather than a character or delimiter level: ...

The SGML standard does not define SGML with formal data structures, such as parse trees; however, an SGML document is constructed of a rooted directed acyclic graph (RDAG) of physical storage units known as "entities", which is parsed into an RDAG of structural units known as "elements". The physical graph is loosely characterized as an ''entity tree'', but entities might appear multiple times. Moreover, the structure graph is also loosely characterized as an ''element tree'', but the ID/IDREF markup allows arbitrary arcs. The results of parsing can also be understood as a data tree in different notations, where the document is the root node, and entities in other notations (text, graphics) are child nodes. SGML provides apparatus for linking to and annotating external non-SGML entities.

The SGML standard describes parsing in terms of ''maps'' and ''recognition modes'' (s9.6.1). Each entity, and each element, can have an associated ''notation'' or ''declared content type'', which determines the kinds of references and tags which will be recognized in that entity and element. Also, each element can have an associated ''delimiter map'' (and ''short reference map''), which determines which characters are treated as delimiters in context. The SGML standard characterizes parsing as a state machine switching between recognition modes. During parsing, there is a stack of maps that configure the scanner, while the tokenizer relates to the recognition modes.

Parsing involves traversing the dynamically-retrieved entity graph, finding/implying tags and the element structure, and validating those tags against the grammar. An unusual aspect of SGML is that the grammar (DTD) is used both passively, to ''recognize'' lexical structures, and actively, to ''generate'' missing structures and tags that the DTD has declared optional. End- and start-tags can be omitted, because they can be inferred. Loosely, a series of tags can be omitted only if there is a single possible path in the grammar to imply them. It was this active use of grammars that made concrete SGML parsing difficult to formally characterize.

SGML uses the term ''validation'' for both recognition and generation. XML does not use the grammar (DTD) to change delimiter maps or to inform the parse modes, and does not allow tag omission; consequently, XML validation of elements is not active in the sense that SGML validation is active. SGML ''without'' a DTD (e.g. simple XML) is a grammar or a language; SGML ''with'' a DTD is a metalanguage. SGML with an SGML declaration is, perhaps, a meta-metalanguage, since it is a metalanguage whose declaration mechanism ''is'' a metalanguage.

SGML has an abstract syntax implemented by many possible concrete syntaxes; however, this is not the same usage as in an abstract syntax tree and as in a concrete syntax tree. In the SGML usage, a concrete syntax is a set of specific delimiters, while the abstract syntax is the set of names for the delimiters. The XML Infoset corresponds more to the programming language notion of abstract syntax introduced by John McCarthy.
Derivatives
XML
The W3C XML (Extensible Markup Language) is a profile (subset) of SGML designed to ease the implementation of the parser compared to a full SGML parser, primarily for use on the World Wide Web. In addition to disabling many SGML options present in the reference syntax (such as omitting tags and nested subdocuments), XML adds a number of additional restrictions on the kinds of SGML syntax. For example, despite enabling SGML shortened tag forms, XML does not allow unclosed start or end tags. It also relied on many of the additions made by the WebSGML Annex. XML currently is more widely used than full SGML. XML has lightweight internationalization based on Unicode. Applications of XML include XHTML, XQuery, XSLT, XForms, XPointer, JSP, SVG, RSS, Atom, XML-RPC, RDF/XML, and SOAP.
HTML
While HTML (Hyper Text Markup Language) was developed partially independently and in parallel with SGML, its creator, Tim Berners-Lee, intended it to be an application of SGML. The design of HTML was therefore inspired by SGML tagging, but, since no clear expansion and parsing guidelines were established, most actual HTML documents are not valid SGML documents. Later, HTML was reformulated (version 2.0) to be more of an SGML application; however, the HTML markup language has many legacy- and exception-handling features that differ from SGML's requirements. HTML 4 is an SGML application that fully conforms to ISO 8879 – SGML.

The charter for the 2006 revival of the World Wide Web Consortium HTML Working Group says, "the Group will not assume that an SGML parser is used for 'classic HTML'". Although HTML syntax closely resembles SGML syntax with the default ''reference concrete syntax'', HTML5 abandons any attempt to define HTML as an SGML application, explicitly defining its own parsing rules, which more closely match existing implementations and documents. It does, however, define an alternative XHTML serialization, which conforms to XML and therefore to SGML as well.
OED
The second edition of the ''Oxford English Dictionary'' (OED) is entirely marked up with an SGML-based markup language using the LEXX text editor. The third edition is marked up as XML.
Others
Other document markup languages are partly related to SGML and XML, but, because they cannot be parsed or validated or otherwise processed using standard SGML and XML tools, they are not considered either SGML or XML languages; the Z Format markup language for typesetting and documentation is an example. Several modern programming languages support tags as primitive token types, or now support Unicode and regular expression pattern-matching. An example is the Scala programming language.
Applications
Document markup languages defined using SGML are called "applications" by the standard; many pre-XML SGML applications were proprietary to the organizations which developed them, and thus unavailable on the World Wide Web. The following list is of pre-XML SGML applications.
* Text Encoding Initiative (TEI) is an academic consortium that designs, maintains, and develops technical standards for digital-format textual representation applications.
* DocBook is a markup language originally created as an SGML application, designed for authoring technical documentation; DocBook currently is an XML application.
* CALS (Continuous Acquisition and Life-cycle Support) is a US Department of Defense (DoD) initiative for electronically capturing military documents and for linking related data and information.
* HyTime defines a set of hypertext-oriented element types that allow SGML document authors to build hypertext and multimedia presentations.
* The EDGAR (Electronic Data-Gathering, Analysis, and Retrieval) system performs automated collection, validation, indexing, acceptance, and forwarding of submissions by companies and others who are legally required to file data and information forms with the US Securities and Exchange Commission (SEC).
* LinuxDoc. Documentation for Linux packages has used the LinuxDoc SGML DTD and DocBook XML DTD.
* AAP DTD is a document type definition for scientific documents, defined by the Association of American Publishers.
* ISO 12083, a successor to AAP DTD, is an international SGML standard for document interchange between authors and publishers.
* SGMLguid was an early SGML document type definition created, developed and used at CERN.
Open-source implementations
Significant open-source implementations of SGML have included:
* ASP-SGML
* ARC-SGML, by Standard Generalized Markup Language Users' Group, 1991, C language
* SGMLS, by James Clark, 1993, C language
* Project YAO, by Yuan-ze Institute of Technology, Taiwan, with Charles Goldfarb, 1994
* SP and Jade, by James Clark, C++ language

SP and Jade, the associated DSSSL processors, are maintained by the OpenJade project, and are common parts of Linux distributions. A general archive of SGML software and materials resides at SUNET. The original HTML parser class, in Sun System's implementation of Java, is a limited-features SGML parser, using SGML terminology and concepts.
See also
* Organization for the Advancement of Structured Information Standards (OASIS)
* S-expression
* DSSSL – a Scheme-based processing language similar to XSL
* LaTeX
* List of general purpose markup languages
* SGML entity
* Tag omission
External links
* Overview of SGML Resources at W3C's website.
* SC34 Committee Records, Charles Babbage Institute – Collection on the development of SGML and other standards influential in the development of current XML tools; documents include early drafts of SGML administrative materials, documentation, working group papers, and standards for computer languages.
* SGML Syntax Summary by Charles Goldfarb in SGML and HTML Explained, Martin Bryan (1997) (the original URL is broken at http://www.is-thought.co.uk/book/sgml-4.htm#Fig4-2)
* Wayne Wohler, IBM Corporation, 1994.
* [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=16645 ISO/IEC 9070:1991 – Information technology – SGML support facilities – Registration procedures for public text owner identifiers]