The Text Encoding Initiative (TEI) is a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s. The community currently runs a mailing list, meetings and conference series, and maintains the TEI technical standard, a journal, a wiki, a GitHub repository and a toolchain.
TEI guidelines
The ''TEI Guidelines'' collectively define a type of XML format, and are the defining output of the community of practice. The format differs from other well-known open formats for text (such as HTML and OpenDocument) in that it is primarily semantic rather than presentational; the semantics and interpretation of every tag and attribute are specified.
There are some 500 different textual components and concepts; each is grounded in one or more academic disciplines and is illustrated with examples.
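To make the contrast between semantic and presentational markup concrete, here is a minimal sketch in Python using only the standard library. The sample sentence is invented for illustration; `placeName`, `date` (with its `when` attribute) and `persName` are TEI elements, while the helper name `semantic_index` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample sentence (an assumption, not taken from the Guidelines).
SAMPLE = """
<p xmlns="http://www.tei-c.org/ns/1.0">The plague returned to
<placeName>Holland</placeName> in <date when="1664-09">September, 1664</date>,
as reported by <persName>Daniel Defoe</persName>.</p>
"""


def semantic_index(xml_text):
    """Collect places, normalised dates and persons named in a TEI fragment."""
    root = ET.fromstring(xml_text)
    return {
        "places": [e.text for e in root.findall(".//tei:placeName", TEI_NS)],
        "dates": [e.get("when") for e in root.findall(".//tei:date", TEI_NS)],
        "persons": [e.text for e in root.findall(".//tei:persName", TEI_NS)],
    }


print(semantic_index(SAMPLE))
```

Because the tags say what each span of text is rather than how it should look, the same file can feed an index of places, a timeline, or any number of renderings.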
Technical details
The standard is split into two parts: a discursive textual description with extended examples and discussion, and a set of tag-by-tag definitions. Schemata in most of the modern formats (DTD, RELAX NG and W3C Schema) are generated automatically from the tag-by-tag definitions. A number of tools support the production of the guidelines and the application of the guidelines to specific projects.
A number of special tags are used to circumvent restrictions imposed by the underlying Unicode standard: to allow representation of characters that do not qualify for Unicode inclusion, and to overcome the strict linearity that Unicode requires.
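One way this can look in practice is sketched below, assuming the `g`, `charDecl`, `glyph` and `mapping` elements of the Guidelines' module for non-standard characters. The document structure is heavily simplified (in a real file the declarations belong in the header), and the glyph identifier `var-r`, the sample word and the helper name are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, hypothetical fragment: a glyph is declared once and
# referenced from the text with <g ref="#...">.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <charDecl>
    <glyph xml:id="var-r">
      <mapping type="standard">r</mapping>
    </glyph>
  </charDecl>
  <p>fo<g ref="#var-r"/>me</p>
</TEI>
"""


def flatten_with_fallbacks(xml_text):
    """Replace each <g> in the first <p> with its declared standard mapping."""
    root = ET.fromstring(xml_text)
    fallbacks = {}
    for glyph in root.iter(f"{{{TEI}}}glyph"):
        mapping = glyph.find(f"{{{TEI}}}mapping[@type='standard']")
        if mapping is not None:
            fallbacks["#" + glyph.get(XML_ID)] = mapping.text
    para = root.find(f"{{{TEI}}}p")
    parts = [para.text or ""]
    for child in para:
        if child.tag == f"{{{TEI}}}g":
            # Unknown references fall back to a visible placeholder.
            parts.append(fallbacks.get(child.get("ref"), "?"))
        parts.append(child.tail or "")
    return "".join(parts)


print(flatten_with_fallbacks(SAMPLE))
```

The encoded text keeps the distinction between the special glyph and an ordinary letter, while software that cannot display the glyph can still fall back to a plain Unicode character.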
Most users of the format do not use the complete range of tags, but produce a customization using a project-specific subset of the tags and attributes defined by the Guidelines. The TEI defines a sophisticated customization mechanism known as ODD for this purpose. In addition to documenting and describing each TEI tag, an ODD specification specifies its content model and other usage constraints, which may be expressed using Schematron.
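As a rough illustration of what such a customization can look like, the sketch below embeds a small ODD fragment and summarises it with Python's standard library. `schemaSpec`, `moduleRef` and `elementSpec` are ODD elements; the schema identifier `myProject`, the particular selection of elements and the helper name `summarize_odd` are assumptions made for the example, not a recommended profile.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical project customization: four modules, two of them
# restricted to a handful of elements, and one changed element.
ODD = """
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="myProject" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core" include="p hi name date"/>
  <moduleRef key="textstructure" include="TEI text body div"/>
  <elementSpec ident="hi" module="core" mode="change">
    <attList>
      <attDef ident="rend" mode="change" usage="req"/>
    </attList>
  </elementSpec>
</schemaSpec>
"""


def summarize_odd(xml_text):
    """Return ({module: included elements or None}, [changed element names])."""
    root = ET.fromstring(xml_text)
    modules = {}
    for ref in root.findall("tei:moduleRef", NS):
        include = ref.get("include")
        # None means the whole module is selected.
        modules[ref.get("key")] = include.split() if include else None
    changed = [
        spec.get("ident")
        for spec in root.findall("tei:elementSpec", NS)
        if spec.get("mode") == "change"
    ]
    return modules, changed


modules, changed = summarize_odd(ODD)
print(modules)
print(changed)
```

A real ODD file would wrap this specification in a full TEI document with prose documentation around it, and would be processed by the TEI toolchain rather than by an ad hoc script.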
''TEI Lite'' is an example of such a customization. It defines an XML-based file format for exchanging texts. It is a manageable selection from the extensive set of elements available in the full TEI Guidelines.
As an XML-based format, TEI cannot directly deal with overlapping markup and non-hierarchical structures. The Guidelines suggest a variety of options for representing this sort of data.
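One commonly used option is to mark one of the competing hierarchies with empty milestone elements. The sketch below assumes the TEI `pb` (page beginning) element with its `n` attribute; the sample text and the helper name `text_by_page` are invented for illustration. Paragraphs stay as proper XML elements, while pages, which cut across them, are recovered by walking the document in order.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
PB = f"{{{TEI}}}pb"

# Invented sample: the first paragraph starts on page 1 and ends on page 2,
# so pages and paragraphs cannot both be nested elements.
SAMPLE = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <pb n="1"/>
  <p>First paragraph starts on page one <pb n="2"/>and ends on page two.</p>
  <p>Second paragraph.</p>
</body>
"""


def text_by_page(xml_text):
    """Group running text by the most recent <pb/> milestone."""
    root = ET.fromstring(xml_text)
    pages = {}
    current = None

    def add(text):
        if text and current is not None:
            pages[current] = pages.get(current, "") + text

    def walk(elem):
        nonlocal current
        if elem.tag == PB:
            current = elem.get("n")
        add(elem.text)
        for child in elem:
            walk(child)
            add(child.tail)

    walk(root)
    # Collapse whitespace introduced by indentation.
    return {n: " ".join(t.split()) for n, t in pages.items()}


print(text_by_page(SAMPLE))
```

The trade-off is that the milestone hierarchy is no longer visible to generic XML tools as a tree, so software has to reconstruct it, as the walk above does.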
Examples
The text of the TEI guidelines is rich in examples. There is also a samples page on the TEI wiki, which gives examples of real-world projects that expose their underlying TEI.
Prose tags
TEI allows texts to be marked up syntactically at any level of granularity, or mixture of granularities. For example, this paragraph (p) has been marked up into sentences (s) and clauses (cl).
It was about the beginning of September, 1664,
that I, among the rest of my neighbours,
heard in ordinary discourse
that the plague was returned again to Holland;
for it had been very violent there, and particularly at
Amsterdam and Rotterdam, in the year 1663,
whither, they say, it was brought,
some said from Italy, others from the Levant, among some goods
which were brought home by their Turkey fleet;
others said it was brought from Candia;
others from Cyprus.
It mattered not from whence it came;
but all agreed it was come into Holland again.
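The nesting of sentences and clauses can be sketched as follows for the last sentence of the passage. The placement of the `cl` boundaries here is an illustrative guess, not a copy of the encoding in the Guidelines, and the helper name `describe` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Illustrative encoding of one sentence; clause boundaries are assumed.
SAMPLE = """
<p xmlns="http://www.tei-c.org/ns/1.0">
  <s>
    <cl>It mattered not <cl>from whence it came;</cl></cl>
    <cl>but all agreed <cl>it was come into Holland again.</cl></cl>
  </s>
</p>
"""


def describe(xml_text):
    """Count sentences and clauses and recover the plain text."""
    root = ET.fromstring(xml_text)
    sentences = root.findall(".//tei:s", NS)
    clauses = root.findall(".//tei:cl", NS)
    plain = " ".join("".join(root.itertext()).split())
    return len(sentences), len(clauses), plain


print(describe(SAMPLE))
```

Clauses nest inside clauses, so an encoder can stop at whatever depth the project needs; stripping the tags always gives back the running text.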
Verse
TEI has tags for marking up verse. This example (taken from the French translation of the TEI Guidelines) shows a sonnet.
Les amoureux fervents et les savants austères
Aiment également, dans leur mûre saison,
Les chats puissants et doux, orgueil de la maison,
Qui comme eux sont frileux et comme eux sédentaires.
Amis de la science et de la volupté
Ils cherchent le silence et l'horreur des ténèbres ;
L'Érèbe les eût pris pour ses coursiers funèbres,
S'ils pouvaient au servage incliner leur fierté.
Ils prennent en songeant les nobles attitudes
Des grands sphinx allongés au fond des solitudes,
Qui semblent s'endormir dans un rêve sans fin ;
Leurs reins féconds sont pleins d'étincelles magiques,
Et des parcelles d'or, ainsi qu'un sable fin,
Étoilent vaguement leurs prunelles mystiques.
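The structure of such a poem can be sketched with the TEI `lg` (line group) and `l` (verse line) elements. The helper below is hypothetical: it builds the markup for any fourteen lines, the `type` values `sonnet`, `quatrain` and `tercet` are assumed labels rather than values prescribed by the Guidelines, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def encode_sonnet(lines, scheme=(4, 4, 3, 3)):
    """Wrap verse lines in nested <lg> stanzas containing <l> elements."""
    if len(lines) != sum(scheme):
        raise ValueError("number of lines does not match the stanza scheme")
    sonnet = ET.Element("lg", {"type": "sonnet"})
    pos = 0
    for size in scheme:
        kind = "quatrain" if size == 4 else "tercet"
        stanza = ET.SubElement(sonnet, "lg", {"type": kind})
        for text in lines[pos:pos + size]:
            ET.SubElement(stanza, "l").text = text
        pos += size
    return sonnet


# Placeholder lines stand in for the fourteen lines of the poem.
poem = encode_sonnet([f"line {i + 1}" for i in range(14)])
print(ET.tostring(poem, encoding="unicode")[:80])
```

Encoding the stanza boundaries explicitly is what lets software distinguish the two quatrains from the two tercets instead of inferring them from blank lines.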
Choice tag
The choice tag is used to represent sections of text that might be encoded or tagged in more than one possible way. In the following example, based on one in the standard, choice is used twice: once to indicate an original and a corrected number, and once to indicate an original and a regularised spelling.
Lastly, That, upon his solemn oath to observe all the above
articles, the said man-mountain shall have a daily allowance of
meat and drink sufficient for the support of
1724
1728
of our subjects,
with free access to our royal person, and other marks of our
favour
favor
.
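The two alternatives in the passage can be sketched with `choice` wrapping `sic`/`corr` for the number and `orig`/`reg` for the spelling. The Python helper `render` is hypothetical; it shows how one encoded text can yield either a diplomatic or a normalised reading, depending on which alternatives are kept.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
CHOICE = f"{{{TEI}}}choice"

# Tail of the passage, with both alternatives encoded side by side.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    "meat and drink sufficient for the support of "
    "<choice><sic>1724</sic><corr>1728</corr></choice> of our subjects, "
    "with free access to our royal person, and other marks of our "
    "<choice><orig>favour</orig><reg>favor</reg></choice>.</p>"
)


def render(xml_text, keep):
    """Flatten the text, keeping only the named alternatives inside <choice>."""
    root = ET.fromstring(xml_text)

    def walk(elem):
        parts = [elem.text or ""]
        for child in elem:
            local = child.tag.split("}")[-1]
            # Inside <choice>, skip alternatives that were not requested.
            if not (elem.tag == CHOICE and local not in keep):
                parts.append(walk(child))
            parts.append(child.tail or "")
        return "".join(parts)

    return " ".join(walk(root).split())


print(render(SAMPLE, {"sic", "orig"}))   # reading as printed in the source
print(render(SAMPLE, {"corr", "reg"}))   # corrected and regularised reading
```

Neither reading is privileged in the file itself; the choice of which to display is deferred to the software that processes it.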
ODD
One Document Does it all ("ODD") is a literate programming language for XML schemas.
In literate-programming style, ODD documents combine human-readable documentation and machine-readable models using the Documentation Elements module of the Text Encoding Initiative. Tools generate human-readable output (localised and internationalised HTML, ePub, or PDF) and machine-readable output (DTDs, W3C XML Schema, RELAX NG Compact Syntax, or RELAX NG XML Syntax).
The Roma web application is built around the ODD format and can use it to generate schemas in DTD, W3C XML Schema, RELAX NG Compact Syntax, or RELAX NG XML Syntax formats, as used by many XML validation tools and services.
ODD is the format used internally by the Text Encoding Initiative for the TEI technical standard. Although ODD files generally describe the difference between a customized XML format and the full TEI model, ODD can also be used to describe XML formats that are entirely separate from the TEI. One example of this is the W3C's Internationalization Tag Set, which uses the ODD format to generate schemas and document its vocabulary.
TEI customizations
TEI customizations are specializations of the TEI XML specification for use in particular fields or by specific communities.
* EpiDoc (Epigraphic Documents)
* Charters Encoding Initiative
* Medieval Nordic Text Archive (Menota)
Customization in the TEI is done through the ODD mechanism mentioned above. In fact, since the P5 version, all so-called 'TEI Conformant' uses of the TEI Guidelines are based on a TEI customization documented in a TEI ODD file. Even when users choose one of the off-the-shelf pre-generated schemas to validate against, these have been created from freely available customization files.
Projects
The format is used by many projects worldwide. Practically all projects are associated with one or more universities.
History
Prior to the creation of TEI, humanities scholars had no common standards for encoding electronic texts in a manner that would serve their academic goals (Hockey 1993, p. 41). In 1987, a group of scholars representing fields in humanities, linguistics, and computing convened at Vassar College to put forth a set of guidelines known as the “Poughkeepsie Principles”. These guidelines directed the development of the first TEI standard, "P1".
* 1987 – Work started by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing on what would become the TEI. This culminated in the ''Closing statement of the Vassar Planning Conference''.
* 1994 – TEI P3 released, co-edited by Lou Burnard (at Oxford University) and Michael Sperberg-McQueen (then at the University of Illinois at Chicago, later at the W3C).
* 1999 – TEI P3 updated.
* 2002 – TEI P4 released, moving from SGML to XML; adoption of Unicode, which XML parsers are required to support.
* 2007 – TEI P5 released, including integration with the xml:lang and xml:id attributes from the W3C (these had previously been attributes in the TEI namespace), regularization of local pointing attributes to use the hash (as used in HTML) and unification of the ptr and xptr tags. Together with many other additions, these changes make P5 more regular and bring it closer to current XML practice as promoted by the W3C and as used by other XML variants. Maintenance and feature update versions of TEI P5 have been released at least twice a year since 2007.
* 2011 – TEI P5 v2.0.1 released with support for genetic editing (among many other additions, the genetic editing features allow encoding of texts without interpretation as to their specific semantics).
* 2017 – TEI was awarded the Antonio Zampolli Prize from the Alliance of Digital Humanities Organizations.
References
External links
* TEI Consortium Web site with a list of TEI projects and wiki
* Journal of the TEI
* TEI Lite: An Introduction to Text Encoding for Interchange
* TEI @ Oxford (hosted at Oxford University) with development and backup versions of much of the core content
* TEI GitHub site (hosted at GitHub) with repository and issue tracker
* Larger list of TEI Projects
* What is the TEI? (Introductory overview by Lou Burnard)